Abstract
This paper combines word2vec with the LDA algorithm to extract topics from text. An existing word segmentation tool is used to segment the text and extract its vocabulary. The corpus is then modeled with the LDA topic model, and the topic-related words are taken as the initial topic word set. Using the word2vec model, words semantically similar to the initial topic word set are extracted, and the similarities and vector adjacency relations among the initial topic words are reassigned according to different weights. This modifies Gibbs sampling and thereby improves LDA, raising the accuracy and stability of topic mining. Experimental results show that, when the training corpus is reasonably distributed, combining LDA with word2vec improves topic word extraction, which verifies the feasibility of the method.
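As a rough illustration of the idea of combining LDA topic-word weights with word2vec similarity, the following is a minimal sketch and not the authors' method: it does not modify Gibbs sampling, but only re-ranks candidate words after the fact. The function names, the mixing parameter `alpha`, and the toy vectors and weights are all assumptions introduced for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_topic_words(seed_weights, vectors, alpha=0.5, top_n=5):
    """Re-rank vocabulary words for one topic (hypothetical helper).

    seed_weights: word -> LDA topic-word probability for the initial topic words.
    vectors:      word -> embedding, e.g. taken from a trained word2vec model.
    alpha:        assumed mixing weight between the LDA weight and the similarity.
    """
    # Only seeds that have an embedding can contribute to the similarity term.
    seeds = {w: p for w, p in seed_weights.items() if w in vectors}
    total = sum(seeds.values())
    scores = {}
    for word, vec in vectors.items():
        # Weighted mean similarity of this word to the initial topic words.
        sim = 0.0
        if total > 0:
            sim = sum(p * cosine(vec, vectors[s]) for s, p in seeds.items()) / total
        scores[word] = alpha * seed_weights.get(word, 0.0) + (1 - alpha) * sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy data (made up for illustration only).
toy_vectors = {
    "football": [1.0, 0.0],
    "soccer": [0.9, 0.1],
    "match": [0.8, 0.3],
    "bank": [0.0, 1.0],
}
toy_seeds = {"football": 0.6, "match": 0.4}
ranked = expand_topic_words(toy_seeds, toy_vectors, top_n=3)
print(ranked)
```

In this toy setup, a word such as "soccer" that is absent from the initial topic word set can still enter the ranked list because its vector lies close to the seed words, while an unrelated word such as "bank" is pushed down.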
Authors
XU Shou-kun (徐守坤)
ZHOU Jia (周佳)
LI Ning (李宁)
SHI Lin (石林)
School of Information Science and Engineering, Changzhou University, Changzhou 213164, China; Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang College), Fuzhou 350108, China
Source
Computer Engineering and Design (《计算机工程与设计》)
Peking University Core Journal (北大核心)
2018, Issue 9, pp. 2764-2769 (6 pages)
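Funding
Open Project Fund of the Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, Minjiang College (MJUKF201740)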
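Author biographies
XU Shou-kun (b. 1972), male, from Jiaohe, Jilin; Ph.D., professor, CCF member; research interests include artificial intelligence and pervasive computing.
ZHOU Jia (b. 1991), female, from Changzhou, Jiangsu; master's student; research interests are natural language processing and image processing; E-mail: zjjuly@163.com.
LI Ning (b. 1974), male, from Qingyang, Gansu; Ph.D., associate professor; research interest is data and information processing.
SHI Lin (b. 1979), male, from Changzhou, Jiangsu; master's degree, associate professor; research interests are data processing and image recognition.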