Abstract
As microblogging has grown in popularity and drawn increasing attention, hot-topic detection in microblogs has become an active research area. For short texts, the Vector Space Model (VSM) representation is high-dimensional and sparse and suffers from synonymy and polysemy, which makes text similarity difficult to measure accurately. To address this, the paper proposes a two-stage clustering method for topic detection based on Latent Semantic Analysis (LSA). First, a notion of topic heat is introduced to select microblog posts that have attracted a certain level of attention, and the resulting dataset is modeled with LSA. Next, the hierarchical clustering algorithm CURE determines the initial cluster centers. Finally, K-means clustering produces the hot-topic clusters. Experimental results on a real microblog dataset verify the effectiveness of the method.
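The pipeline described in the abstract (heat filtering, LSA, hierarchical initialization, K-means) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not define topic heat, so the `heat_filter` score (reposts plus comments) is hypothetical. The `agglomerative_centers` function is a simplified centroid-linkage stand-in for CURE, which in its full form keeps several shrunken representative points per cluster. The toy posts, the threshold, and all function names are invented for illustration.

```python
import numpy as np

def heat_filter(posts, min_heat):
    # Hypothetical heat score (reposts + comments); the abstract gives no formula.
    return [p["tokens"] for p in posts if p["reposts"] + p["comments"] >= min_heat]

def term_doc_matrix(docs):
    # Raw term counts: one row per post, one column per vocabulary term.
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for t in d:
            X[r, index[t]] += 1.0
    return X

def lsa(X, rank):
    # Truncated SVD: keep the top `rank` singular directions, then
    # length-normalize each document vector.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :rank] * s[:rank]
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.where(norms == 0, 1.0, norms)

def agglomerative_centers(Z, k):
    # Simplified stand-in for CURE: repeatedly merge the two clusters with
    # the closest centroids until k clusters remain, then return centroids.
    clusters = [[i] for i in range(len(Z))]
    while len(clusters) > k:
        cents = np.array([Z[c].mean(axis=0) for c in clusters])
        d = np.linalg.norm(cents[:, None] - cents[None, :], axis=2)
        np.fill_diagonal(d, np.inf)
        a, b = np.unravel_index(np.argmin(d), d.shape)
        a, b = int(min(a, b)), int(max(a, b))
        clusters[a] += clusters.pop(b)
    return np.array([Z[c].mean(axis=0) for c in clusters])

def kmeans(Z, centers, iters=50):
    # Standard Lloyd iterations, seeded with the hierarchical centers.
    for _ in range(iters):
        dist = np.linalg.norm(Z[:, None] - centers[None, :], axis=2)
        labels = np.argmin(dist, axis=1)
        new = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Toy data, invented for illustration only.
posts = [
    {"tokens": ["earthquake", "rescue", "sichuan"], "reposts": 120, "comments": 40},
    {"tokens": ["earthquake", "sichuan", "donation"], "reposts": 90, "comments": 30},
    {"tokens": ["rescue", "earthquake", "team"], "reposts": 60, "comments": 25},
    {"tokens": ["football", "match", "goal"], "reposts": 80, "comments": 50},
    {"tokens": ["football", "goal", "league"], "reposts": 70, "comments": 20},
    {"tokens": ["match", "league", "football"], "reposts": 55, "comments": 15},
    {"tokens": ["lunch", "noodles"], "reposts": 0, "comments": 1},  # below threshold
]

docs = heat_filter(posts, min_heat=10)
Z = lsa(term_doc_matrix(docs), rank=2)
labels, centers = kmeans(Z, agglomerative_centers(Z, k=2))
print(labels)
```

Seeding K-means with centers from a hierarchical pass avoids the sensitivity of K-means to random initialization, which is the motivation for the two-stage design named in the abstract.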
Source
《计算机工程与应用》
CSCD
2014, Issue 1, pp. 96-100 (5 pages)
Computer Engineering and Applications
Funding
Natural Science Foundation of Chongqing (No. cstc2011jjA40023)
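About the authors
Ma Wenwen (马雯雯, b. 1986), female, master's degree; main research interests: computer networks and information security
Wei Wenhan (魏文晗, b. 1986), male, master's degree; main research interest: information security
Deng Yigui (邓一贵, b. 1971), male, Ph.D., senior engineer; main research interests: computer networks and information security, mobile agents. E-mail: rlla-wenl024@163.com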