Abstract
To address the difficulty of applying the Nearest Neighbor Absorption First (NNAF) clustering algorithm to massive data sets, an improved algorithm based on MapReduce is proposed. The improved algorithm introduces the MapReduce parallel programming framework, uses Canopy coarse clustering to optimize the computation, and refines how the intersecting parts of clusters are handled. Experiments on three data sets of different sizes compare the improved algorithm with the K-means algorithm and the original NNAF clustering algorithm. The results show that the improved algorithm runs faster while preserving clustering quality, and that it is suitable for clustering analysis of massive data.
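The abstract names Canopy coarse clustering as the pre-processing step but gives no details. As a rough illustration only, the following is a minimal single-machine sketch of the standard Canopy procedure, not the paper's improved NNAF algorithm and not its MapReduce implementation. The function name `canopy`, the thresholds `t1` (loose) and `t2` (tight), and the use of Euclidean distance are all assumptions made for this example.

```python
import math


def canopy(points, t1, t2):
    """Standard Canopy coarse clustering (illustrative sketch).

    points: sequence of equal-length numeric tuples.
    t1: loose threshold; points closer than t1 to a center join its canopy.
    t2: tight threshold; points closer than t2 to a center can no longer
        become centers themselves. Requires t1 > t2 > 0.
    Returns a list of (center_index, member_indices) pairs.
    """
    if not (t1 > t2 > 0):
        raise ValueError("thresholds must satisfy t1 > t2 > 0")

    remaining = list(range(len(points)))  # indices still eligible as centers
    canopies = []
    while remaining:
        center = remaining[0]
        # Membership is checked against all points, so canopies may overlap.
        members = [i for i in range(len(points))
                   if math.dist(points[center], points[i]) < t1]
        canopies.append((center, members))
        # Drop every candidate inside the tight threshold (the center itself
        # is at distance 0, so the loop always makes progress).
        remaining = [i for i in remaining
                     if math.dist(points[center], points[i]) >= t2]
    return canopies


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
    for center, members in canopy(data, t1=3.0, t2=1.5):
        print(center, members)
```

Because membership uses the loose threshold while removal uses the tight one, a point can fall into more than one canopy. Such overlaps are presumably the kind of cluster intersection the abstract refers to, although the abstract does not say how the improved algorithm resolves them.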
Authors
NING Ke (宁可)
SUN Tongjing (孙同晶)
XU Jiejie (徐洁洁)
Affiliations: School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China; Zhejiang Province Electronic Information Products Testing Institute, Hangzhou 310007, China
Source
《计算机工程》
CAS
CSCD
PKU Core Journal (北大核心)
2018, No. 4, pp. 35-40 (6 pages)
Computer Engineering
Funding
Fund of the Zhejiang Provincial Key Laboratory of Information Security (KYZ066816004)
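About the authors
NING Ke (born 1992), male, master's student, research interest: massive data mining, E-mail: 961289941@qq.com; SUN Tongjing, associate professor, Ph.D.; XU Jiejie, engineer.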