With the rising and spreading of micro-blog, the sentiment classification of short texts has become a research hotspot. Some methods have been developed in the past decade. However, since the Chinese and English are d...With the rising and spreading of micro-blog, the sentiment classification of short texts has become a research hotspot. Some methods have been developed in the past decade. However, since the Chinese and English are different in language syntax, semantics and pragmatics, sentiment classification methods that are effective for English twitter may fail on Chinese micro-blog. In addition, the colloquialism and conciseness of short Chinese texts introduces additional challenges to sentiment classification. In this work, a novel hybrid learning model was proposed for sentiment classification of Chinese micro-blogs, which included two stages. In the first stage, emotional scores were calculated over the whole dataset by utilizing an improved Chinese-oriented sentiment dictionary classification method. Data with extremely high or low scores were directly labeled. In the second stage, the remaining data were labeled by using an integrated classification method based on sentiment dictionary, support vector machine(SVM) and k-nearest neighbor(KNN). An improved feature selection method was adopted to enhance the discriminative power of the selected features. The two-stage hybrid framework made the proposed method effective for sentiment classification of Chinese micro-blogs. Experiments on the COAE2014(Chinese Opinion Analysis Evaluation 2014) dataset show that the proposed method outperforms other schemes.展开更多
针对传统VSM(vector space model)在短文本分类中维数高、语义特征不明显的问题,提出基于LDA(latent Dirichlet allocation)模型主题分布相似度分类方法;针对短文本内容少、长度短、特征稀疏的问题,提出基于LDA模型主题-词分布矩阵的主...针对传统VSM(vector space model)在短文本分类中维数高、语义特征不明显的问题,提出基于LDA(latent Dirichlet allocation)模型主题分布相似度分类方法;针对短文本内容少、长度短、特征稀疏的问题,提出基于LDA模型主题-词分布矩阵的主题分布向量改进方法。与传统VSM分类方法相比,该方法降低了相似度计算维度,融合了一定语义特征。实验结果表明,与传统VSM分类方法相比,基于主题分布相似度方法的平均F1值提高了4.5%,基于LDA模型主题-词分布矩阵主题分布向量改进方法的平均F1值提高了5.2%,验证了以上方法的有效性。展开更多
基金Projects(61573380,61303185)supported by the National Natural Science Foundation of ChinaProject(13BTQ052)supported by the National Social Science Foundation of China+1 种基金Project(2016M592450)supported by the China Postdoctoral Science FoundationProject(2016JJ4119)supported by the Hunan Provincial Natural Science Foundation of China
文摘With the rising and spreading of micro-blog, the sentiment classification of short texts has become a research hotspot. Some methods have been developed in the past decade. However, since the Chinese and English are different in language syntax, semantics and pragmatics, sentiment classification methods that are effective for English twitter may fail on Chinese micro-blog. In addition, the colloquialism and conciseness of short Chinese texts introduces additional challenges to sentiment classification. In this work, a novel hybrid learning model was proposed for sentiment classification of Chinese micro-blogs, which included two stages. In the first stage, emotional scores were calculated over the whole dataset by utilizing an improved Chinese-oriented sentiment dictionary classification method. Data with extremely high or low scores were directly labeled. In the second stage, the remaining data were labeled by using an integrated classification method based on sentiment dictionary, support vector machine(SVM) and k-nearest neighbor(KNN). An improved feature selection method was adopted to enhance the discriminative power of the selected features. The two-stage hybrid framework made the proposed method effective for sentiment classification of Chinese micro-blogs. Experiments on the COAE2014(Chinese Opinion Analysis Evaluation 2014) dataset show that the proposed method outperforms other schemes.
文摘针对传统VSM(vector space model)在短文本分类中维数高、语义特征不明显的问题,提出基于LDA(latent Dirichlet allocation)模型主题分布相似度分类方法;针对短文本内容少、长度短、特征稀疏的问题,提出基于LDA模型主题-词分布矩阵的主题分布向量改进方法。与传统VSM分类方法相比,该方法降低了相似度计算维度,融合了一定语义特征。实验结果表明,与传统VSM分类方法相比,基于主题分布相似度方法的平均F1值提高了4.5%,基于LDA模型主题-词分布矩阵主题分布向量改进方法的平均F1值提高了5.2%,验证了以上方法的有效性。