Abstract
As disciplines are continually subdivided and develop at uneven rates, some disciplines have very little data available for training classifiers, which makes scientific literature classification difficult. To address the severe long-tail problem in scientific literature, where traditional text classification methods can no longer deliver further gains, this paper proposes a few-shot scientific literature classification method based on a BERT-Prototypical model. The model builds on the prototypical network from transfer learning. It first uses the pre-trained BERT model to mine the relationships among scientific literature texts and obtain better feature representations; it then feeds the encoded text features into the prototypical network and improves classification performance by optimizing the network's encoding method and parameter settings. Experiments show that the BERT-Prototypical model reaches a classification accuracy of 95.6% on the 5-way 20-shot task and 78.4% on the sample-limited 5-way 5-shot task, an improvement over the baseline models.
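The classification step described in the abstract can be illustrated with a minimal sketch of nearest-prototype classification. This is not the authors' implementation: the function names are hypothetical, the hand-made 2-D vectors stand in for BERT-encoded text features, and squared Euclidean distance is an assumption, since the abstract does not state which distance is used.

```python
from collections import defaultdict


def class_prototypes(support_embeddings, support_labels):
    """Average the support embeddings of each class into one prototype."""
    grouped = defaultdict(list)
    for vector, label in zip(support_embeddings, support_labels):
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query_embedding, prototypes):
    """Return the label whose prototype lies nearest to the query."""
    return min(
        prototypes,
        key=lambda label: squared_distance(query_embedding, prototypes[label]),
    )


# Toy 2-way 2-shot episode; the vectors are placeholders for BERT outputs.
support = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
labels = ["physics", "physics", "biology", "biology"]

prototypes = class_prototypes(support, labels)
prediction = classify([0.7, 0.3], prototypes)
print(prediction)
```

In an N-way K-shot episode, `support` would hold N × K encoded documents, so the 5-way 5-shot task mentioned above builds each prototype from only five examples per class.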
Authors
BAI Wenqing (白文清)
CUI Caixia (崔彩霞)
College of Computer Science and Technology, Taiyuan Normal University, Jinzhong 030619, China
Source
Software Guide (《软件导刊》)
2025, No. 4, pp. 42-47 (6 pages)
Funding
Shanxi Province Basic Research Program (Free Exploration) project (20210302123334).
Keywords
scientific and technological literature classification
few-shot learning
prototypical networks
BERT model
imbalanced data
About the authors
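BAI Wenqing (born 1999), female, CCF member, master's student at the College of Computer Science and Technology, Taiyuan Normal University; research interests: machine learning and data mining. Corresponding author: CUI Caixia (born 1974), female, Ph.D., associate professor and master's supervisor at the College of Computer Science and Technology, Taiyuan Normal University; research interests: machine learning and data mining.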