An Adaptive Parallel Density Clustering Algorithm Using Pseudo-Random Partitioning
Abstract: Parallel density clustering algorithms in big-data environments suffer from inefficient data partitioning, load imbalance, inaccurate merging of local clusters, and low parallelization efficiency. To address these problems, this paper proposes a parallel density clustering algorithm that constructs cell subgraphs on top of a pseudo-random partitioning strategy. The algorithm uses pseudo-random partitioning to partition the data quickly, then uses Spark to construct a cell subgraph within each partition to perform local clustering. The paper also proposes a new local-cluster merging strategy that improves merging accuracy. In addition, because the traditional DBSCAN algorithm requires its parameters to be set manually, the paper uses an improved adaptive parameter method that determines eps and minpts with a Gaussian kernel function and by minimizing the mean integrated squared error (MISE). Experiments on synthetic datasets and large-scale real-world datasets show that the algorithm achieves strong parallel performance and high accuracy.
Authors: ZENG Hongbin; QIAN Xuezhong; SONG Wei (School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), Peking University Core Journal, 2025, No. 6, pp. 1349-1357 (9 pages)
Funding: Supported by the National Natural Science Foundation of China (62076110) and the Natural Science Foundation of Jiangsu Province (BK20181341).
Keywords: DBSCAN; pseudo-random partitioning; Spark; adaptive parameter; cluster merging
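About the authors: ZENG Hongbin, male, born 1998, master's student, CCF student member; research interests: data mining and machine learning; E-mail: 6213113137@stu.jiangnan.edu.cn. QIAN Xuezhong, male, born 1967, master's degree, associate professor, CCF member; research interests: data mining, machine learning, and artificial intelligence. SONG Wei, male, born 1981, Ph.D., professor, doctoral supervisor; research interests: data mining, machine learning, and pattern recognition.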
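The abstract names pseudo-random partitioning as the step that keeps partition loads balanced, but this record does not give the authors' procedure. The following is only a minimal sketch of the general idea, under the assumption that points are assigned to partitions by a seeded shuffle followed by round-robin dealing. The function name `pseudo_random_partition` and its parameters are hypothetical and are not taken from the paper, and the sketch uses plain Python rather than Spark.

```python
import random


def pseudo_random_partition(points, num_partitions, seed=0):
    """Split points into near-equal partitions in a seeded pseudo-random order.

    Assumption for illustration only: this is a generic balanced random
    split, not the partitioning rule defined in the paper.
    """
    rng = random.Random(seed)          # seeded, so the split is reproducible
    order = list(range(len(points)))
    rng.shuffle(order)                 # pseudo-random visiting order
    partitions = [[] for _ in range(num_partitions)]
    for rank, idx in enumerate(order):
        # Round-robin dealing keeps partition sizes within one of each other.
        partitions[rank % num_partitions].append(points[idx])
    return partitions


if __name__ == "__main__":
    data_rng = random.Random(42)
    points = [(data_rng.random(), data_rng.random()) for _ in range(1000)]
    parts = pseudo_random_partition(points, num_partitions=8, seed=7)
    print([len(p) for p in parts])
```

Because the assignment ignores spatial position, partition sizes stay balanced even for skewed data. The cost is that points belonging to one cluster are spread across partitions, which is presumably why the abstract pairs the partitioning step with a dedicated local-cluster merging strategy.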