Abstract
DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As datasets are extremely large nowadays, parallel processing of complex data analysis such as DBSCAN has become indispensable. However, there are three major drawbacks in existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized, so there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation, with the objective of achieving desirable load balancing even in the context of heavily skewed data. In addition, we conduct our evaluation using real large datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.
About the Authors
Yaobin He (E-mail: yb.he@siat.ac.cn, frankho117@gmail.com) is a PhD candidate at the University of Chinese Academy of Sciences (CAS), China. He is also working as an engineer at Shenzhen Institutes of Advanced Technology, CAS. His research interests include parallel computing, high performance computing, and data mining.

Haoyu Tan is a research associate at Guangzhou HKUST Fok Ying Tung Research Institute, China. He received the PhD degree in computer science and engineering from HKUST in 2013. His research interests include big data processing, large scale data mining, and distributed systems.

Wuman Luo is a research associate at Guangzhou HKUST Fok Ying Tung Research Institute, China. She received the PhD degree in computer science and engineering from HKUST in 2013. Her research interests include big data processing, distributed databases, and spatio-temporal databases.

Shengzhong Feng is a professor at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. His research focuses on parallel algorithms, grid computing, and bioinformatics. In particular, his current interests are in developing novel methods for digital city modeling and applications.

Jianping Fan is the president of Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. He took part in designing and building the Dawning series of supercomputers beginning in the 1990s. He has accomplished 11 projects of 863 programs, holds 5 patents, and has published a book and over 60 papers.