Abstract
DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As datasets are extremely large nowadays, parallel processing of complex data analysis such as DBSCAN has become indispensable. However, there are three major drawbacks in existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized, so there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation, with the objective of achieving desirable load balancing even in the context of heavily skewed data. In addition, we conduct our evaluation using real large datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.
About the Authors
Yaobin He (E-mail: yb.he@siat.ac.cn, frankho117@gmail.com) is a PhD candidate at the University of Chinese Academy of Sciences (CAS), China. He is also working as an engineer at Shenzhen Institutes of Advanced Technology, CAS. His research interests include parallel computing, high performance computing, and data mining.

Haoyu Tan is a research associate at Guangzhou HKUST Fok Ying Tung Research Institute, China. He received the PhD degree in computer science and engineering from HKUST in 2013. His research interests include big data processing, large scale data mining, and distributed systems.

Wuman Luo is a research associate at Guangzhou HKUST Fok Ying Tung Research Institute, China. She received the PhD degree in computer science and engineering from HKUST in 2013. Her research interests include big data processing, distributed databases, and spatio-temporal databases.

Shengzhong Feng is a professor at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. His research focuses on parallel algorithms, grid computing, and bioinformatics. In particular, his current interests are in developing novel methods for digital city modeling and applications.

Jianping Fan is the president of Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. He took part in designing and building the Dawning series of supercomputers beginning in the 1990s. He has accomplished 11 projects of 863 programs, holds 5 patents, and has published a book and over 60 papers.