
Short-Text Classification Method with Text Features from Pre-trained Models (融合预训练模型文本特征的短文本分类方法) (Cited: 10)
Abstract: [Objective] This paper combines word vectors from different pre-trained models to enhance text semantics, addressing the lack of prior knowledge in the word vectors produced by Word2Vec, BERT and similar models, and improving classification performance on news datasets. [Methods] Using the public Toutiao (Today's Headlines) news dataset and the THUCNews dataset, we applied domain-adaptive pretraining to BERT and ERNIE to extract contextual semantic information and the prior knowledge of entities and phrases, respectively. Combined with a TextCNN model, the method generates high-order text feature vectors and fuses these features to achieve semantic enhancement and better short-text classification. [Results] Compared with traditional Word2Vec word-vector representations, the classification accuracy of the models using pre-trained word vectors improved by 6.37 and 3.50 percentage points; compared with the BERT and ERNIE representations alone, the model fusing BERT and ERNIE word vectors improved accuracy by 1.98 and 1.51 percentage points, respectively. [Limitations] The news-domain corpus used for domain pretraining needs further enrichment. [Conclusions] The proposed method classifies massive short-text data quickly and accurately, which is of practical value for follow-up text-mining work.
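The fusion step described in [Methods] can be sketched roughly as follows: token-level representations from two encoders are concatenated per token and passed through TextCNN-style convolutions with several kernel sizes, followed by max-over-time pooling. This is only an illustrative numpy sketch; the random matrices stand in for real BERT and ERNIE outputs, and the dimensions, kernel sizes, and filter counts are assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders for encoder outputs: one 768-dim vector per token.
# In the real pipeline these would come from domain-pretrained BERT and ERNIE.
seq_len, d_bert, d_ernie = 32, 768, 768
bert_emb = rng.standard_normal((seq_len, d_bert))
ernie_emb = rng.standard_normal((seq_len, d_ernie))

# Feature fusion: concatenate the two representations token-wise.
fused = np.concatenate([bert_emb, ernie_emb], axis=1)  # shape (32, 1536)

def textcnn_features(x, kernel_sizes=(2, 3, 4), n_filters=4, seed=1):
    """TextCNN-style 1D convolutions over the token axis, max-over-time pooled.

    Weights are random here; a trained model would learn them.
    """
    rng = np.random.default_rng(seed)
    pooled = []
    for k in kernel_sizes:
        # One weight tensor per kernel size: (n_filters, k, feature_dim).
        w = rng.standard_normal((n_filters, k, x.shape[1])) * 0.01
        # Valid convolution: each window of k consecutive tokens -> n_filters values.
        conv = np.stack([
            np.einsum('kd,fkd->f', x[i:i + k], w)
            for i in range(x.shape[0] - k + 1)
        ])                               # (seq_len - k + 1, n_filters)
        pooled.append(conv.max(axis=0))  # max over time -> (n_filters,)
    return np.concatenate(pooled)        # (len(kernel_sizes) * n_filters,)

features = textcnn_features(fused)
print(fused.shape, features.shape)  # (32, 1536) (12,)
```

The pooled vector would then feed a softmax classifier over the news categories; concatenation is the simplest fusion choice, and the paper's exact fusion operator may differ.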
Authors: 陈杰 (Chen Jie); 马静 (Ma Jing); 李晓峰 (Li Xiaofeng). College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China.
Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》; CSSCI, CSCD, Peking University Core Journal), 2021, No. 9, pp. 21-30 (10 pages).
Funding: Major Program of the National Social Science Fund of China (No. 20ZDA092); Fundamental Research Funds for the Central Universities, Prospective Development Strategy Research Project (No. NW2020001); Postgraduate Innovation Base (Laboratory) Open Fund (No. kfjj20200905).
Keywords: BERT; ERNIE; Short Text Classification; Text Feature Fusion; Domain-Adaptive Pretraining.
Corresponding author: Ma Jing, ORCID: 0000-0001-8472-2581, E-mail: majing5525@126.com.
