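Abstract
Accurately and efficiently mining bursty events from microblogs has been a research focus in recent years. A set of burst terms is extracted through term frequency statistics, term growth rate calculation and the TF-PDF algorithm. Texts are then represented with the burst terms and filtered according to the descriptive characteristics of bursty events on microblogs. An "absolute clustering" algorithm is proposed to cluster the texts that describe bursty events; heat is computed as a weighted combination of each microblog's reply and retweet counts, and the hottest item in each event class is detected as the bursty event. Detection precision is 92.60%, recall is 85.51%, and the F-measure is 0.89. The experimental results show that, compared with traditional bursty event detection methods, this method detects bursty events in microblogs fairly accurately and has some practical value.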
Mining bursty topics accurately and efficiently from micro-blogs has attracted much attention in recent years. In this paper, a set of burst terms is extracted by counting term frequency, calculating the growth rate of the terms, and weighting them with the Term Frequency - Proportional Document Frequency (TF-PDF) algorithm. Micro-blog texts are then represented with the burst terms. Based on an analysis of how bursty topics propagate on the micro-blog platform, the authors filter out texts that do not contribute to detecting bursty topics. The paper proposes a novel clustering strategy, "Absolute Clustering", to cluster the micro-blog texts. The heat of each text is computed as a weighted combination of its reply and retweet counts, and the top 5 texts are extracted as the detected bursty topics. The experiments show that the precision is 92.60%, the recall is 85.51% and the F-measure is 0.89. A comparison with the traditional method demonstrates the validity of the proposed method.
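The scoring steps named in the abstract (term growth rate, TF-PDF weighting, and reply/retweet heat) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `smoothing` constant, the equal heat weights, and the treatment of a "channel" as any group of tokenized texts (for example, one time window) are all assumptions made for illustration. The TF-PDF weight follows the commonly cited definition, in which a term's normalized frequency in each channel is multiplied by the exponential of the share of that channel's documents containing the term, and the products are summed over channels.

```python
import math
from collections import Counter


def growth_rate(prev_counts, curr_counts, smoothing=1.0):
    """Relative growth of each term's count from the previous window to the
    current one. `smoothing` (an assumed constant) avoids division by zero
    for terms that were absent from the previous window."""
    rates = {}
    for term, curr in curr_counts.items():
        prev = prev_counts.get(term, 0)
        rates[term] = (curr - prev) / (prev + smoothing)
    return rates


def tf_pdf(channels):
    """TF-PDF weight per term. `channels` is a list of channels, and each
    channel is a list of tokenized documents (lists of terms)."""
    weights = Counter()
    for docs in channels:
        if not docs:
            continue
        # Term frequency over the whole channel.
        tf = Counter(term for doc in docs for term in doc)
        # Number of documents in the channel that contain each term.
        df = Counter(term for doc in docs for term in set(doc))
        # L2 norm used to normalize term frequencies within the channel.
        norm = math.sqrt(sum(f * f for f in tf.values()))
        for term, f in tf.items():
            weights[term] += (f / norm) * math.exp(df[term] / len(docs))
    return dict(weights)


def heat(replies, retweets, w_reply=0.5, w_retweet=0.5):
    """Weighted heat of one text. The weights are hypothetical placeholders;
    the abstract does not state the values used in the paper."""
    return w_reply * replies + w_retweet * retweets
```

In a pipeline of this shape, terms with both a high growth rate and a high TF-PDF weight would be candidates for the burst term set, and `heat` would be used to rank clustered texts; the thresholds for either step are not given in the abstract.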
Source
《现代图书情报技术》
CSSCI
PKU Core (Peking University core journal list)
2013, No. 2, pp. 57-62 (6 pages)
New Technology of Library and Information Service
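Funding
National Natural Science Foundation of China project "Research on Ontology-Based Automatic Patent Indexing" (No. 61271304)
National Natural Science Foundation of China project "Research on Evaluating the Authenticity of Web Page Content" (No. 61171159)
Key Project of the Science and Technology Development Program of the Beijing Municipal Education Commission and Class B Key Project of the Beijing Natural Science Foundation "Research on Domain-Oriented Precise Search Methods for Multimodal Internet Information" (No. KZ201311232037)
National Key Technology R&D Program project "Research and Demonstration of Key Technologies for Enhanced Search Engines" (No. 2011BAH11B03); this work is one of the research outcomes of the projects listed above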
Keywords
Bursty events
Burst terms
Text filtering
Absolute clustering
Bursty topics; Burst terms; Filter; Absolute clustering
About the author
E-mail: wy514674793@126.com