
A Classification Based Sentiment Words Extracting Method from Microblogs and Its Feature Engineering

Cited by: 20
Abstract: Sentiment and emotion analysis is widely used in public opinion analysis, product review analysis, product recommendation, and related fields, and such analysis of text usually relies on a sentiment dictionary. Manually built sentiment dictionaries are accurate but costly to construct and hard to update in time, so they adapt poorly to data such as microblogs, where new sentiment words turn over quickly. Microblog platforms offer a convenient channel for publishing and spreading new sentiment words and are an important source of them. Given the existing large manually built sentiment dictionaries and the abundance of microblog data containing new sentiment words, and building on a statistical analysis and comparison of how sentiment words are distributed in Chinese and English microblogs, this paper proposes cNSEm, a language-independent, classification-based method for extracting new sentiment words from microblogs. cNSEm automatically constructs training data from a microblog dataset and a sentiment dictionary, trains a classifier, classifies the sentiment polarity of candidate words, and finally determines each candidate's polarity by a voting mechanism. Through extensive and detailed experiments, the paper analyzes the performance of cNSEm on Chinese and English microblog data, the roles and usage of six categories of features, and how much the extracted new sentiment words help microblog sentiment classification. The results show that cNSEm outperforms the classical methods based on co-occurrence and polarity propagation, especially when noun sentiment words in the Chinese microblog dataset are considered. The extracted words were evaluated both directly, with manual sentiment dictionaries as the reference, and indirectly, by their contribution to sentiment classification. On these evaluation metrics, the new sentiment words extracted by cNSEm are comparable in quality to manual sentiment dictionaries, and cNSEm adapts to Chinese and English, two languages that differ considerably.

Abstract (as published in English): Text sentiment analysis tries to determine the orientation (attitude, point of view, or emotion) of information publishers. It is widely used in public opinion supervision, product review analysis, and similar fields, and has become one of the hottest topics in natural language processing, social media processing, data mining, etc. Sentiment or emotion analysis on text is usually based on a sentiment dictionary. A manually built sentiment dictionary may produce high accuracy, but its coverage is limited and it is difficult to update, so it copes poorly with Web 2.0, where new sentiment words are created more frequently and spread more quickly. Microblog platforms such as Twitter and Sina Weibo allow users to publish and transmit information freely and have become important sources of new sentiment words. Using large manually built sentiment dictionaries and microblog data containing a mass of sentiment words, this paper analyzes the difference in distribution between Chinese and English sentiment words and proposes cNSEm, a method based on the classification principle, to extract new sentiment words from microblogs. cNSEm automatically generates candidate samples, which are classified by a trained classifier and then sorted and extracted according to a voting strategy.

Classification-based methods have been used to extract new sentiment words in some related works. However, most of them extracted sentiment words from web pages, WordNet, or product reviews, and candidate words were usually restricted to adjectives. cNSEm has to deal not only with the informal expression of microblogs but also with an expanded set of candidate parts of speech (POS), especially when nouns are included. Based on carefully designed experiments, we analyze the performance of cNSEm on both Chinese and English microblogs. We also analyze and compare the impacts of the six categories of features used in cNSEm: context, POS, language mode, modification relationship, sentence features, and co-occurrence with other sentiment words. Experimental results show that these six categories of features play different roles in sentiment word extraction and polarity setting in different languages.

Experimental results on Chinese microblogs also show that the classical co-occurrence based methods are effective when candidates are adjectives, but their performance degrades when nouns are included. cNSEm performs better than the co-occurrence based methods, especially when nouns are considered as candidate sentiment words on Chinese microblogs. To evaluate cNSEm further, we also test the impact of the extracted sentiment words on sentiment classification tasks. Results on Chinese microblogs show that the performance of microblog subjectivity classification and polarity classification improves significantly after the sentiment dictionary is expanded by cNSEm, and that cNSEm performs better than the benchmark method. For subjectivity classification on English microblogs, the benchmark method and cNSEm perform similarly, while cNSEm performs better than the benchmark method on the polarity classification task. Surprisingly, the sentiment words extracted by cNSEm are more helpful for sentiment classification tasks than manual sentiment dictionaries. In conclusion, both the direct evaluation against ideal sentiment dictionaries and the indirect evaluation through sentiment classification tasks show that the new sentiment words extracted by cNSEm are competitive with manually compiled sentiment words. Moreover, cNSEm adapts to both Chinese and English microblogs, two languages that differ greatly.
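The pipeline summarized in the abstract (build training data from a dictionary and a microblog corpus, train a classifier, classify candidate occurrences, then vote) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: the abstract does not specify the classifier, the exact feature encoding, or the voting rule, so the sketch uses only neighbouring tokens as context features, a small Naive Bayes classifier as a stand-in, and a simple majority vote. The function and class names are hypothetical.

```python
from collections import Counter, defaultdict
import math


def context_features(tokens, i, window=2):
    """Neighbouring tokens around position i (the target word itself is excluded)."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def build_training_data(posts, lexicon):
    """Turn every occurrence of a known dictionary word into one labelled sample."""
    samples = []
    for tokens in posts:
        for i, tok in enumerate(tokens):
            if tok in lexicon:
                samples.append((context_features(tokens, i), lexicon[tok]))
    return samples


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, samples):
        self.label_counts = Counter(label for _, label in samples)
        self.feat_counts = defaultdict(Counter)
        for feats, label in samples:
            self.feat_counts[label].update(feats)
        self.vocab = {f for counts in self.feat_counts.values() for f in counts}
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for f in feats:
                if f in self.vocab:  # features never seen in training are ignored
                    score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


def extract_by_voting(posts, lexicon, candidates, clf):
    """Classify each occurrence of a candidate word, then take a majority vote per word."""
    votes = defaultdict(Counter)
    for tokens in posts:
        for i, tok in enumerate(tokens):
            if tok in candidates and tok not in lexicon:
                votes[tok][clf.predict(context_features(tokens, i))] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}
```

Classifying occurrences rather than words lets one candidate collect evidence from many posts, and the vote then smooths over individual misclassifications. A fuller version would also need a non-sentiment class, so that candidates carrying no sentiment can be rejected, and the remaining feature categories listed in the abstract.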
Authors: LIU De-Xi (刘德喜); NIE Jian-Yun (聂建云); WAN Chang-Xuan (万常选); LIU Xi-Ping (刘喜平); LIAO Shu-Mei (廖述梅); LIAO Guo-Qiong (廖国琼); ZHONG Min-Juan (钟敏娟); JIANG Teng-Jiao (江腾蛟) (School of Information Technology, Jiangxi University of Finance and Economics, Nanchang 330013; Department of Computer Science and Operations Research, University of Montreal, Montreal H3C 3J7, Canada)
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2018, No. 7, pp. 1574-1597 (24 pages)
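Funding: Supported by the National Natural Science Foundation of China (61762042, 61363039, 61562032), the Jiangxi Province Luodi Program (落地计划) project (KJLD14035), and the Natural Science Foundation of Jiangxi Province (20171BAB202021, 20152ACB20003).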
Keywords: microblogs; new sentiment words extraction; cNSEm method; feature engineering
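About the authors:
LIU De-Xi, male, born 1975, Ph.D., professor, doctoral supervisor, member of the China Computer Federation (CCF). Research interests: social media processing, information retrieval, and natural language processing. E-mail: dexi.liu@163.com.
NIE Jian-Yun, male, born 1963, Ph.D., professor, doctoral supervisor. Research interest: information retrieval.
WAN Chang-Xuan, male, born 1962, Ph.D., professor, doctoral supervisor, member of the China Computer Federation (CCF). Research interests: Web data management and data mining.
LIU Xi-Ping, male, born 1981, Ph.D., associate professor. Research interests: Web data management and data mining.
LIAO Shu-Mei, female, born 1976, Ph.D., associate professor. Research interests: information management and information systems.
LIAO Guo-Qiong, male, born 1969, Ph.D., professor, doctoral supervisor. Research interest: social computing.
ZHONG Min-Juan, female, born 1976, Ph.D., associate professor. Research interests: Web data management and data mining.
JIANG Teng-Jiao, female, born 1976, Ph.D., lecturer. Research interest: sentiment analysis.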