Abstract
This paper introduces the MapReduce programming model on the Hadoop platform, analyzes the strengths and weaknesses of the traditional Kmeans and Canopy clustering algorithms, and proposes an improved Kmeans algorithm based on Canopy. To address the randomness of Canopy selection in the Canopy-Kmeans algorithm, the "minimum-maximum principle" is applied, which avoids blind Canopy selection. The algorithm is implemented with MapReduce parallel programming, with the clustering of massive volumes of news data as the application scenario. Experimental results show that the method achieves higher accuracy and stability than the traditional Kmeans and Canopy algorithms.
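The abstract does not spell out how the minimum-maximum principle picks seeds, so the following is only a minimal single-machine sketch under one common reading: each new seed is the point whose minimum distance to the seeds already chosen is largest (farthest-first traversal), and those seeds then initialise a plain Kmeans loop. The function names `max_min_centers` and `kmeans`, the choice of the first point as the starting seed, and the Euclidean distance are all assumptions made for illustration. This is not the authors' MapReduce implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, k, dist=euclidean):
    """Pick k seeds so that each new seed maximises its minimum
    distance to the seeds chosen so far (hypothetical helper)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    # Assumption: start from the first point; the source does not say
    # how the first seed is chosen.
    centers = [points[0]]
    min_dist = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Index of the point farthest from its nearest existing seed.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        min_dist = [min(d, dist(p, points[idx]))
                    for d, p in zip(min_dist, points)]
    return centers


def kmeans(points, centers, iterations=10):
    """Plain Kmeans from the given seeds. Returns the final centers and
    the clusters produced by the last assignment step."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members; keep the old
        # center if a cluster ended up empty.
        centers = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else old
            for members, old in zip(clusters, centers)
        ]
    return centers, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    seeds = max_min_centers(data, k=2)
    final_centers, groups = kmeans(data, seeds)
    print(seeds)
    print([len(g) for g in groups])
```

Because the seeds are spread out deterministically rather than drawn at random, repeated runs on the same data start from the same centers, which is the kind of stability the abstract attributes to the improved algorithm. In a MapReduce setting the assignment step would typically map to the map phase and the center update to the reduce phase, but that split is not shown here.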
Source
《电子科技》
2014, No. 2, pp. 29-31 (3 pages)
Electronic Science and Technology
About the author
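Zhao Qing (赵庆, b. 1988), male, master's student. Research interests: cloud computing, big data and large-scale data mining on the Hadoop platform. E-mail: 522698733@qq.com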