Word2Vec Optimization Strategy Based on an Improved Statistical Language Model
Cited by: 14
Abstract: Starting from the language models used to train word vectors, this paper analyzes the strengths and weaknesses of word vectors produced by the classic skip-gram and CBOW models, introduces the TFIDF text-keyword weighting method, and proposes a keyword-enhanced language model. The classic skip-gram and CBOW models capture only the relationship between a word and its local context, whereas the improved model uses text keywords to link each word to the document as a whole; in the precision and similarity of the trained vectors, the improved model yields a modest gain over skip-gram and CBOW. In comparative training experiments on a 1.5 GB Chinese Wikipedia corpus, the vectors trained with the CBOW-TFIDF model performed best on the similar-word test. Applied to a sentiment-orientation analysis task, the improved vectors raised the precision and F1 score for positive evaluations by 4.79% and 4.92%, respectively. Word vectors improved through the statistical language model therefore have practical significance for embedding-based applications such as sentiment analysis.
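The abstract does not give the paper's exact keyword-weighting formula, but the standard TF-IDF score it builds on can be sketched as follows. This is a minimal illustration on a toy corpus; the function name `tfidf_keywords` and the sample documents are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=2):
    """Score each word in each tokenized document by TF-IDF and
    return the top_k highest-scoring keywords per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        # TF-IDF = (term frequency in doc) * log(N / document frequency)
        scores = {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}
        ranked = sorted(scores, key=scores.get, reverse=True)
        keywords.append(ranked[:top_k])
    return keywords

# Toy corpus of pre-tokenized "documents".
corpus = [
    ["movie", "great", "plot", "great", "acting"],
    ["movie", "boring", "plot", "weak"],
    ["weather", "sunny", "warm"],
]
print(tfidf_keywords(corpus, top_k=1))  # [['great'], ['boring'], ['weather']]
```

In a keyword-enhanced CBOW along the lines the paper describes, keywords selected this way would be added to (or reweighted within) the context window, so that each target word is also connected to document-level keywords rather than only to its local neighbors.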
Authors: ZHANG Kejun, SHI Taimeng, LI Weinan, QIAN Rong (Beijing Electronic Science & Technology Institute, Beijing 100071, China; School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, China)
Source: Journal of Chinese Information Processing (《中文信息学报》, CSCD, Peking University Core), 2019, No. 7, pp. 11-19 (9 pages)
Funding: National Key R&D Program of China (2018YFB1004101); National Natural Science Foundation of China (61170037)
Keywords: word vector; statistical language model; TFIDF; text keywords; CBOW-TFIDF
About the authors: ZHANG Kejun (b. 1972), Ph.D., associate professor; research interests: data mining, knowledge discovery. E-mail: zkj@besti.edu.cn. LI Weinan (b. 1994), master's student; research interests: machine learning, data mining. E-mail: 568793056@qq.com. Corresponding author: SHI Taimeng (b. 1995), master's student; research interests: machine learning, natural language processing. E-mail: shitaimeng@163.com.
