Abstract
Efficiently detecting emergencies in massive volumes of microblog data has become a research focus for scholars at home and abroad in recent years. This paper analyzes the characteristics of emergencies and extracts burst features using word frequency increments together with a TF-PDF formula based on named entities and microblog propagation characteristics. It then introduces inter-item association rules and combines the inter-item distance between burst words with an improved Single-pass clustering algorithm to generate burst clusters, from which emergencies are identified. Experiments on a real Sina Weibo dataset show that the method effectively detects emergencies in massive microblog data.
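To make the clustering step concrete, the following is a minimal sketch of plain Single-pass clustering applied to burst words. It is not the improved variant used in the paper, whose details are not given here. It assumes that the distance between two words can be approximated by the Jaccard similarity of the sets of posts containing them; the function names, the input format, the threshold, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two post-ID sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass(word_posts: Dict[str, Set[int]], threshold: float) -> List[List[str]]:
    """Plain Single-pass clustering of words by co-occurrence.

    word_posts maps each burst word to the IDs of the posts containing it
    (an assumed input format). Words are visited once, in insertion order.
    A word joins the most similar existing cluster if its average similarity
    to that cluster's members reaches `threshold`; otherwise it starts a
    new cluster.
    """
    clusters: List[List[str]] = []
    for word, posts in word_posts.items():
        best_idx, best_sim = -1, 0.0
        for idx, members in enumerate(clusters):
            sim = sum(jaccard(posts, word_posts[m]) for m in members) / len(members)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            clusters[best_idx].append(word)
        else:
            clusters.append([word])
    return clusters


# Toy data, invented for illustration only.
toy = {
    "earthquake": {1, 2, 3, 4},
    "rescue": {2, 3, 4, 5},
    "concert": {10, 11, 12},
    "tickets": {10, 11, 13},
    "magnitude": {1, 2, 3},
}
clusters = single_pass(toy, threshold=0.4)
```

Because each word is compared only with the clusters that already exist, the result depends on the order in which words arrive and on the chosen threshold; this order sensitivity is a known limitation of the basic algorithm.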
Keywords
Event Detection
Feature
Emergencies
Clustering
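About the Authors
YANG Zi (b. 1993), female, from Xuzhou, Jiangsu; master's student; research interest: data mining.
LUAN Cuiju (b. 1974), female, from Meihekou, Jilin; Ph.D., associate professor; research interests: intelligent decision-making, data mining, and related areas.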