Abstract
Clustering the large volume of short texts generated by social media has significant practical value. However, short texts are typically noisy, fast-growing, and massive in volume, which makes them difficult for existing algorithms to process effectively. To address this problem, we propose STOCIRNMF, a short text online clustering algorithm based on incremental robust nonnegative matrix factorization. STOCIRNMF builds its short text clustering model on nonnegative matrix factorization (NMF), uses the l_{2,1} norm to design the objective function and thereby improve robustness, and applies incremental iterative update rules to cluster short texts online. Experiments on Sohu news title and Weibo short text datasets show that STOCIRNMF not only achieves better clustering performance than existing representative algorithms but also detects microblog topics online effectively.
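The idea of fitting an NMF model under an l_{2,1} loss can be illustrated with a short sketch. The code below is not the paper's STOCIRNMF: it is a minimal batch example that assumes one commonly used formulation of l_{2,1}-norm NMF, in which each document is a column of X, the loss is the sum of per-document residual norms, and the factors are refined with reweighted multiplicative updates. The function names (`l21_norm`, `robust_nmf`), the parameters, and the toy matrix are all hypothetical. The incremental update rules are not given in the abstract and are not reproduced here.

```python
import numpy as np


def l21_norm(E):
    """l_{2,1} norm as used here: sum of the Euclidean norms of E's columns."""
    return float(np.sum(np.linalg.norm(E, axis=0)))


def _weights(X, F, G, eps):
    # One weight per document: the inverse of its current residual norm.
    # Documents that fit badly (likely noise) receive small weights.
    return 1.0 / np.maximum(np.linalg.norm(X - F @ G, axis=0), eps)


def robust_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate nonnegative X (terms x documents) as F @ G with k topics.

    Assumed formulation: minimize the l_{2,1} norm of X - F @ G with
    F >= 0 and G >= 0, using reweighted multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps  # term-topic factor
    G = rng.random((k, n)) + eps  # topic-document factor
    for _ in range(n_iter):
        d = _weights(X, F, G, eps)
        # Multiplying by d scales column i (document i) by d[i].
        F *= ((X * d) @ G.T) / (F @ ((G * d) @ G.T) + eps)
        d = _weights(X, F, G, eps)  # refresh weights after F changed
        G *= (F.T @ (X * d)) / (F.T @ F @ (G * d) + eps)
    return F, G


# Toy term-document matrix: 4 terms, 6 short documents.
X = np.array([
    [3, 2, 3, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 3],
    [0, 0, 0, 2, 3, 2],
], dtype=float)

F, G = robust_nmf(X, k=2)
labels = G.argmax(axis=0)  # assign each document to its strongest topic
print("l2,1 reconstruction error:", l21_norm(X - F @ G))
print("cluster labels:", labels)
```

Because each document contributes its residual norm rather than its squared norm, a single very noisy document influences the factors less than it would under the usual squared Frobenius loss, which is the robustness property the abstract refers to.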
Authors
HE Chao-bo (贺超波)
TANG Yong (汤庸)
ZHANG Qiong (张琼)
LIU Shuang-yin (刘双印)
LIU Hai (刘海)
HE Chao-bo; TANG Yong; ZHANG Qiong; LIU Shuang-yin; LIU Hai (School of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, Guangdong 510225, China; School of Computer, South China Normal University, Guangzhou, Guangdong 510631, China; School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510006, China)
Source
《电子学报》
EI
CAS
CSCD
PKU Core (北大核心)
2019, No. 5, pp. 1086-1093 (8 pages)
Acta Electronica Sinica
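Funding
National Natural Science Foundation of China (No. 61772211)
Science and Technology Planning Project of Guangdong Province (No. 2017A040405057, No. 2017A030303074, No. 2016A030303058)
Science and Technology Planning Project of Guangzhou (No. 201807010043)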
Keywords
short text clustering (短文本聚类)
robust nonnegative matrix factorization (鲁棒非负矩阵分解)
online clustering (在线聚类)
l_{2,1} norm (l_{2,1}范数)
incremental iterative update rules (增量式迭代更新规则)
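Author biographies
HE Chao-bo, male, was born in Heyuan, Guangdong, in 1981. He is an associate professor at Zhongkai University of Agriculture and Engineering. His research interests include data mining, machine learning, and big data technology. E-mail: hechaobo@foxmail.com. Corresponding author: TANG Yong, male, was born in Zhangjiajie, Hunan, in 1964. He is a professor at the School of Computer, South China Normal University. His research interests include data intelligence and cloud services, academic social networks, and educational big data. E-mail: ytang@m.scnu.edu.cn.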