Abstract
[Objective] This paper aims to automate the extraction of key technical information from complex patent texts and to reduce the heavy dependence of traditional natural language processing extraction models on domain-knowledge annotation. [Methods] We propose an annotation-free key information extraction method based on knowledge self-distillation of a large language model. Using a multiple-role strategy, the method performs structured analysis of patent abstracts rewritten by Derwent, and the knowledge self-distillation strategy strengthens the model's ability to extract key content and analyze it structurally. [Results] In tests on the entity extraction and relation extraction tasks, the method reached recall rates of 95.40% and 51.49%, respectively, and the format accuracy of the structured analysis reached 100%. On RE-DocRED, a public dataset for relation triplet extraction, the method achieved an F1-score of 5.01% under unsupervised, zero-shot settings. [Conclusions] The proposed method can effectively extract key information from patent texts without any data annotation.
Authors
Zhao Jianfei; Chen Ting; Wang Xiaomei; Feng Chong (School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China; Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China; Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China; Southeast Academy of Information Technology, Beijing Institute of Technology, Putian 351100, China)
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
Indexed in: EI, CSSCI, CSCD, Peking University Core Journals
2024, Issue 8, pp. 133-143 (11 pages)
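Funding
One of the research outcomes of the Special Project for Literature and Information Capacity Building of the Chinese Academy of Sciences (Grant No. GHJ-QBZX-2021-04)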
Keywords
Large Language Model
Information Extraction
Patent Analysis
About the authors
Corresponding author: Wang Xiaomei, ORCID: 0000-0002-9895-1511, E-mail: wangxm@casisd.cn.