Adaptive Parallel Density Clustering Algorithm Using Pseudo-Random Partitioning


Abstract: Parallel density clustering algorithms in big-data environments suffer from inefficient data partitioning, load imbalance, inaccurate merging of local clusters, and low parallelization efficiency. To address these problems, this paper proposes a parallel density clustering algorithm that constructs cell subgraphs on top of a pseudo-random partitioning strategy. The algorithm uses pseudo-random partitioning to partition the data quickly, and uses Spark to construct a cell subgraph in each partition to perform local clustering. The paper also proposes a new local cluster merging strategy that improves merging accuracy. In addition, because the traditional DBSCAN algorithm requires its parameters to be set manually, the paper uses an improved adaptive parameter method that determines the values of eps and minpts using a Gaussian kernel function and by minimizing the mean integrated squared error (MISE). Experiments show that the algorithm achieves excellent parallel performance and high accuracy on both synthetic datasets and large-scale real-world datasets.
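The abstract does not spell out how the pseudo-random partitioning works, so the following is only a minimal sketch of the general idea, not the authors' scheme. It assumes each point is assigned to a partition index drawn from a seeded pseudo-random generator, which tends to give near-equal partition sizes even when the data are spatially skewed. The function names `pseudo_random_partition` and `load_imbalance`, the seed, and the synthetic data are all hypothetical, and the sketch uses plain Python rather than Spark.

```python
import random


def pseudo_random_partition(points, num_partitions, seed=42):
    """Assign each point to a partition index drawn from a seeded PRNG.

    Hypothetical illustration only; the paper's actual strategy may differ.
    """
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    partitions = [[] for _ in range(num_partitions)]
    for p in points:
        partitions[rng.randrange(num_partitions)].append(p)
    return partitions


def load_imbalance(partitions):
    """Largest partition size divided by the mean size (1.0 = perfectly balanced)."""
    sizes = [len(part) for part in partitions]
    return max(sizes) / (sum(sizes) / len(sizes))


# Heavily skewed synthetic data: one dense blob plus a sparse background.
data_rng = random.Random(0)
points = [(data_rng.gauss(0, 0.1), data_rng.gauss(0, 0.1)) for _ in range(9000)]
points += [(data_rng.uniform(-10, 10), data_rng.uniform(-10, 10)) for _ in range(1000)]

parts = pseudo_random_partition(points, num_partitions=8)
print([len(part) for part in parts])
print(round(load_imbalance(parts), 3))
```

Because the assignment ignores point coordinates, partition sizes depend only on the generator and not on the spatial skew of the data. The trade-off is that neighboring points can land in different partitions, which is presumably why a local cluster merging step is needed afterwards.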

Authors: ZENG Hongbin (曾鸿斌); QIAN Xuezhong (钱雪忠); SONG Wei (宋威) (School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China)

Affiliation: School of Artificial Intelligence and Computer Science, Jiangnan University

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), Peking University Core Journal, 2025, No. 6, pp. 1349-1357 (9 pages)

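Funding: Supported by the National Natural Science Foundation of China (62076110) and the Natural Science Foundation of Jiangsu Province (BK20181341).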

Keywords: DBSCAN; pseudo-random partitioning; Spark; adaptive parameters; cluster merging

Classification: TP301 [Automation and Computer Technology: Computer System Architecture]

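About the authors: ZENG Hongbin, male, born in 1998, master's student, CCF student member; research interests: data mining and machine learning; E-mail: 6213113137@stu.jiangnan.edu.cn. QIAN Xuezhong, male, born in 1967, master's degree, associate professor, CCF member; research interests: data mining, machine learning, and artificial intelligence. SONG Wei, male, born in 1981, Ph.D., professor, doctoral supervisor; research interests: data mining, machine learning, and pattern recognition.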



