Abstract
Parallel density-based clustering algorithms in big-data environments suffer from inefficient data partitioning, load imbalance, inaccurate merging of local clusters, and low parallel efficiency. To address these problems, this paper proposes a parallel density-based clustering algorithm that constructs cell subgraphs on top of a pseudo-random partitioning strategy. The algorithm uses pseudo-random partitioning to split the data quickly, and uses Spark to construct a cell subgraph within each partition to perform local clustering. The paper also proposes a new strategy for merging local clusters that improves merging accuracy. In addition, because the traditional DBSCAN algorithm requires its parameters to be set manually, the paper adopts an improved adaptive parameter method that determines eps and minpts using a Gaussian kernel function and minimization of the mean integrated squared error (MISE). Experiments show that the algorithm achieves strong parallel performance and high accuracy on both synthetic datasets and large-scale real-world datasets.
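The abstract does not spell out the partitioning rule, so the following is only a minimal sketch of the general idea behind pseudo-random partitioning, not the paper's actual strategy. It assumes each point is assigned to a partition by a seeded random draw, independent of its coordinates; the function name `pseudo_random_partition` and its parameters are hypothetical. The point of the sketch is that such an assignment is cheap and gives roughly equal partition sizes even when the data is spatially skewed.

```python
import random
from collections import Counter


def pseudo_random_partition(n_points, n_partitions, seed=0):
    """Assign each point index to a partition by a seeded random draw.

    Hypothetical helper for illustration only: the assignment ignores the
    point coordinates, so dense regions cannot overload a single partition.
    """
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    return [rng.randrange(n_partitions) for _ in range(n_points)]


# Assign 10,000 points to 4 partitions and count the load of each partition.
assignment = pseudo_random_partition(10_000, 4, seed=42)
loads = Counter(assignment)
```

Because the assignment is driven by a fixed seed, rerunning the job reproduces the same partitions, which is convenient when local clusters must later be merged across partitions.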
Authors
ZENG Hongbin (曾鸿斌); QIAN Xuezhong (钱雪忠); SONG Wei (宋威)
School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
Indexed in the Peking University Core Journals list (北大核心)
2025, No. 6, pp. 1349-1357 (9 pages)
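Funding
Supported by the National Natural Science Foundation of China (62076110)
Supported by the Natural Science Foundation of Jiangsu Province (BK20181341).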
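Author biographies
ZENG Hongbin, male, born in 1998, master's student, CCF student member. His research interests include data mining and machine learning. E-mail: 6213113137@stu.jiangnan.edu.cn. QIAN Xuezhong, male, born in 1967, master's degree, associate professor, CCF member. His research interests include data mining, machine learning, and artificial intelligence. SONG Wei, male, born in 1981, Ph.D., professor, doctoral supervisor. His research interests include data mining, machine learning, and pattern recognition.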