Abstract
[Objective] This paper combines word vectors from different pre-trained models for text semantic enhancement, addressing the lack of prior knowledge in word vectors produced by Word2Vec, BERT, and similar models, and thereby improving classification performance on news datasets. [Methods] Using the public Toutiao (Today's Headlines) news dataset and the THUCNews dataset, we applied domain-adaptive pretraining to BERT and ERNIE to extract contextual semantic information and prior knowledge of entities and phrases, respectively. Combined with a TextCNN model, the method generates high-order text feature vectors and fuses them to achieve semantic enhancement and better short-text classification. [Results] Compared with the traditional Word2Vec word vector representation, the pre-trained word vector representations raised classification accuracy by 6.37 and 3.50 percentage points; compared with the BERT and ERNIE representations, the fused BERT-ERNIE representation raised accuracy by a further 1.98 and 1.51 percentage points, respectively. [Limitations] The news-domain corpus used for domain-adaptive pretraining needs further enrichment. [Conclusions] The proposed method can classify massive short-text data quickly and accurately, which is of great significance for follow-up text mining.
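The pipeline described in the abstract (domain-adapted BERT and ERNIE token embeddings, fused and passed through a TextCNN) can be sketched in PyTorch as below. This is a minimal illustration under stated assumptions, not the authors' released implementation: fusion by concatenation, the hidden size of 768, the kernel sizes and filter counts, and the 15-class output (the Toutiao dataset's category count) are all illustrative choices.

```python
# Minimal sketch of a fused BERT+ERNIE -> TextCNN classifier.
# All layer sizes and the concatenation-based fusion are assumptions,
# not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedTextCNN(nn.Module):
    def __init__(self, hidden_size=768, num_classes=15,
                 kernel_sizes=(2, 3, 4), num_filters=128):
        super().__init__()
        # Each convolution slides over the token axis of the fused
        # (BERT || ERNIE) embedding sequence, whose channel width
        # is 2 * hidden_size after concatenation.
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * hidden_size, num_filters, k) for k in kernel_sizes
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, bert_emb, ernie_emb):
        # bert_emb, ernie_emb: (batch, seq_len, hidden_size) token
        # embeddings produced by the two domain-adapted encoders.
        x = torch.cat([bert_emb, ernie_emb], dim=-1)  # semantic fusion
        x = x.transpose(1, 2)                         # (batch, channels, seq_len)
        # Convolve, max-pool over time, concatenate the pooled features.
        pooled = [F.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=-1))

# Toy usage with random tensors standing in for encoder outputs:
model = FusedTextCNN()
logits = model(torch.randn(4, 32, 768), torch.randn(4, 32, 768))  # (4, 15)
```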
Authors
Chen Jie, Ma Jing, Li Xiaofeng (College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
Indexed in: CSSCI, CSCD, Peking University Core Journals
2021, No. 9, pp. 21-30 (10 pages)
Funding
This work is one of the outcomes of the Major Program of the National Social Science Fund of China (Grant No. 20ZDA092), the Fundamental Research Funds for the Central Universities, Prospective Development Strategy Research Project (Grant No. NW2020001), and the Open Fund of the Graduate Innovation Base (Laboratory) (Grant No. kfjj20200905).
Author Information
Corresponding author: Ma Jing, ORCID: 0000-0001-8472-2581, E-mail: majing5525@126.com.