Abstract
[Purpose/Significance] This study addresses two problems in potential Standard Essential Patent (SEP) identification: recognizing key entities in sequence information and modeling long-distance dependencies in sequences. The aim is to improve identification accuracy and make the results more interpretable. To this end, it proposes a potential SEP identification model, XLNet-BiLSTM-CRF-CNN (XLBLCC), which combines the pre-trained model XLNet with the entity recognition model BiLSTM-CRF. [Method/Process] XLNet draws on context to produce word-vector representations of patent text and to express semantic relations. A BiLSTM-CRF model then generates Named Entity Recognition (NER) tags that mark the boundaries of named entities in the text, and a CNN model learns the features of SEP text in order to identify and predict potential SEPs. The model's performance is validated on a dataset built from SEPs retrieved from the ETSI database and non-SEPs retrieved from the Incopat database. [Result/Conclusion] The XLBLCC model outperforms the other baseline models in accuracy (86%), F1 score (89%), and AUC (84%). XLNet demonstrates better global semantic understanding than models such as BERT. In experiments comparing high-value patents with SEPs, the proposed model shows strong generalization capability, making it an effective and robust tool for SEP identification in patent analysis.
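The abstract states that the BiLSTM-CRF stage emits NER tags marking entity boundaries, but it does not name the tagging scheme. As a minimal sketch, assuming the common BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside), the hypothetical helper below turns a tag sequence into entity spans. The function name `bio_to_spans` and the labels `TECH` and `STD` are illustrative assumptions and do not come from the paper.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start, end_exclusive, label) spans.

    Assumption: tags follow the BIO scheme; the source does not say
    which scheme the authors actually used.
    """
    spans: List[Tuple[int, int, str]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # Open a span on "B-", or leniently on a stray "I-" with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical tag sequence for a five-token patent phrase.
example_tags = ["O", "B-TECH", "I-TECH", "O", "B-STD"]
example_spans = bio_to_spans(example_tags)
```

Spans of this kind could be passed alongside the XLNet vectors to the downstream CNN classifier; how the paper actually combines them is not described in the abstract.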
Authors
窦路遥
周志刚
冯宇
Dou Luyao; Zhou Zhigang; Feng Yu (National Science Library (Wuhan), Chinese Academy of Sciences, Wuhan 430071, China; Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China; School of Information, Shanxi University of Finance and Economics, Taiyuan 030006, China; School of Economics and Business Administration, Chongqing University, Chongqing 400000, China)
Source
《现代情报》
Peking University Core Journal (北大核心)
2025, Issue 10, pp. 16-25 (10 pages)
Journal of Modern Information
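Funding
National Natural Science Foundation of China project "Research on Adversarial Privacy-Insight Targeted Protection Technology in Multi-Source Data Fusion Scenarios" (Grant No. 61902226).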
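About the Authors
Dou Luyao (窦路遥, b. 1999), male, PhD candidate; research interests: patent analysis and machine learning. Zhou Zhigang (周志刚, b. 1986), male, associate professor, PhD, master's supervisor; research interests: patent analysis and data fusion. Corresponding author: Feng Yu (冯宇, b. 1998), female, PhD candidate; research interests: patent analysis and natural language processing.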