Abstract
To detect and track target topics in massive volumes of news reports, this work studies the incremental Single-Pass clustering algorithm and proposes an improved version of it. The improvements are threefold: weight coefficients are added during feature-word selection to encode the position of each feature word in the news text; temporal features are used alongside them when computing the similarity between news texts; and a sub-topic threshold check is added to the steps of the Single-Pass clustering algorithm. Experiments show that the improved Single-Pass clustering algorithm yields topic clusters at different granularities and also improves clustering efficiency. Under the same conditions, the improved algorithm clearly improves on both the missed-detection rate and the false-detection rate.
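The abstract names three changes but gives no formulas, so the following is only a minimal sketch of how they could fit together, not the authors' implementation. Everything specific in it is an assumption made for illustration: the title-versus-body weighting in `weighted_vector`, the exponential time decay in `similarity`, the summed-vector centroids, and the threshold values `topic_threshold` and `subtopic_threshold`.

```python
import math
from collections import Counter


def weighted_vector(title_tokens, body_tokens, title_weight=2.0):
    """Term-frequency vector; title terms get extra weight (assumed position scheme)."""
    vec = Counter()
    for term in body_tokens:
        vec[term] += 1.0
    for term in title_tokens:
        vec[term] += title_weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(vec, day, cluster, decay=0.1):
    """Cosine similarity damped by the gap in days (assumed exponential decay)."""
    gap = abs(day - cluster["last_day"])
    return cosine(vec, cluster["centroid"]) * math.exp(-decay * gap)


def new_cluster(vec, day):
    return {"centroid": Counter(vec), "last_day": day}


def best_match(vec, day, clusters):
    """Return (cluster, similarity) for the most similar cluster, or (None, 0.0)."""
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = similarity(vec, day, cluster)
        if sim > best_sim:
            best, best_sim = cluster, sim
    return best, best_sim


def absorb(cluster, vec, day):
    """Add a document to a cluster; the centroid is kept as a running sum."""
    cluster["centroid"].update(vec)  # Counter.update adds the counts
    cluster["last_day"] = max(cluster["last_day"], day)


def single_pass(docs, topic_threshold=0.3, subtopic_threshold=0.8):
    """docs: iterable of (title_tokens, body_tokens, day), processed once in order."""
    topics = []
    for index, (title, body, day) in enumerate(docs):
        vec = weighted_vector(title, body)
        topic, topic_sim = best_match(vec, day, topics)
        if topic is None or topic_sim < topic_threshold:
            # Below the topic threshold: the document seeds a new topic.
            topic = new_cluster(vec, day)
            sub = new_cluster(vec, day)
            sub["members"] = [index]
            topic["subtopics"] = [sub]
            topics.append(topic)
            continue
        # Second, stricter check decides which sub-topic inside the topic it joins.
        sub, sub_sim = best_match(vec, day, topic["subtopics"])
        if sub is not None and sub_sim >= subtopic_threshold:
            absorb(sub, vec, day)
            sub["members"].append(index)
        else:
            sub = new_cluster(vec, day)
            sub["members"] = [index]
            topic["subtopics"].append(sub)
        absorb(topic, vec, day)
    return topics
```

Calling `single_pass` on a list of tokenized reports returns a list of topics, each holding its own `subtopics` list with the indices of the member documents. Using two thresholds is one way to obtain clusters at two granularities in a single pass over the stream; the values shown would need tuning on real data.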
Authors
张帆
潘亚雄
胡勇
Zhang Fan; Pan Yaxiong; Hu Yong (College of Cybersecurity, Sichuan University, Chengdu 610065; Chengdu Science and Technology Development Center of China Academy of Engineering Physics, Chengdu 610200)
Source
《信息安全研究》
2020, No. 5, pp. 396-403 (8 pages)
Journal of Information Security Research
About the authors
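Zhang Fan, master's student; main research interests: network data analysis and secure data processing. E-mail: 1105988653@qq.com. Pan Yaxiong, master's degree, senior engineer; main research interest: network information security. E-mail: panyaxiong@163.com. Hu Yong, Ph.D., research fellow; main research interest: network information security. E-mail: huyong@scu.edu.cn.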