Abstract
Computer clusters with a shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies for speeding up computation on big data and increasing scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed, including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed, with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both computational and statistical respects.
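The three classical horizontal partitioning schemes and reservoir sampling named in the abstract can be illustrated with a minimal, single-machine Python sketch. It is not the implementation used by Hadoop or by the RSP model; the function names, the `boundaries` and `n_blocks` parameters, and the fixed seeds are assumptions made for illustration only.

```python
import random
from bisect import bisect_right


def range_partition(records, boundaries, key=lambda r: r):
    """Place each record in the block whose key range contains it.

    `boundaries` is a sorted list of split points; k boundaries give k + 1 blocks.
    """
    blocks = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        blocks[bisect_right(boundaries, key(r))].append(r)
    return blocks


def hash_partition(records, n_blocks, key=lambda r: r):
    """Place each record in block hash(key) mod n_blocks.

    Note: Python randomizes str hashes per process; integer keys are stable.
    """
    blocks = [[] for _ in range(n_blocks)]
    for r in records:
        blocks[hash(key(r)) % n_blocks].append(r)
    return blocks


def random_partition(records, n_blocks, seed=0):
    """Shuffle the records, then deal them round-robin into n_blocks blocks."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]


def reservoir_sample(stream, k, seed=0):
    """Draw k records uniformly from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # uniform over positions 0..i inclusive
            if j < k:
                reservoir[j] = item     # replace with probability k / (i + 1)
    return reservoir


if __name__ == "__main__":
    data = list(range(100))
    # Block-level sampling: take one whole block as the sample.
    by_range = range_partition(data, [25, 50, 75])[0]
    by_random = random_partition(data, 4)[0]
    print("mean of all data:          ", sum(data) / len(data))
    print("mean of one range block:   ", sum(by_range) / len(by_range))
    print("mean of one random block:  ", sum(by_random) / len(by_random))
    print("reservoir sample of size 5:", reservoir_sample(iter(data), 5))
```

In this toy setting, a single block produced by range partitioning covers only one slice of the key range, whereas a block produced by random partitioning is itself a random subset of the data. This is the intuition behind the abstract's remark that block-level sampling on classically partitioned blocks does not necessarily yield representative samples.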
Funding
Supported in part by the National Natural Science Foundation of China (No. 61972261) and the National Key R&D Program of China (No. 2017YFC0822604-2).
About the authors
Mohammad Sultan Mahmud (corresponding author) is currently a PhD candidate at Shenzhen University, China. He received the master's degree from King Mongkut’s University of Technology North Bangkok, Thailand, in 2014, and the bachelor's degree from BGC Trust University Bangladesh, Bangladesh, in 2008. Mr. Mahmud was awarded the Outstanding Doctoral Student of Shenzhen University in 2017 and the Shenzhen Universiade International Scholarship in 2018. He also received the Information Technology-King Mongkut’s University of Technology North Bangkok scholarship for two years in 2012. His current research focuses on big data mining and distributed and parallel computing. E-mail: ssalloum@szu.edu.cn
Joshua Z. Huang received the PhD degree from the Royal Institute of Technology, Sweden, in 1993. He is a distinguished professor of the College of Computer Science & Software Engineering at Shenzhen University. He is also the director of the Big Data Institute and the deputy director of the National Engineering Laboratory for Big Data System Computing Technology. His main research interests include big data technology and applications. Prof. Huang has published over 200 research papers in conferences and journals. In 2006, he received the most influential paper award in the First Pacific-Asia Conference on Knowledge Discovery and Data Mining. Prof. Huang is known for his contributions to the development of a series of k-means type clustering algorithms in data mining, such as k-modes, fuzzy k-modes, k-prototypes, and w-k-means, which are widely cited and used, and some of which have been included in commercial software. He has extensive industry expertise in business intelligence and data mining, and has been involved in numerous consulting projects in Australia and China. E-mail: zx.huang@szu.edu.cn
Salman Salloum received the PhD degree from Shenzhen University, Shenzhen, China, in 2019, and the MS degree from Damascus University, Damascus, Syria, in 2013. He is currently an associate researcher with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China. From 2007 to 2014, he worked as an instructional designer and a project manager at ePedia-SY, a digital content company in Syria. He was also a tutor at Syrian Virtual University from 2012 to 2014. His current research focuses on cluster computing and approximate computing for big data analysis. E-mail: sultan@szu.edu.cn
Tamer Z. Emara is currently a PhD candidate at the Big Data Institute, Shenzhen University, China. He received the MS degree from Mansoura University, Egypt, in 2015, and the BS degree from Tanta University, Egypt, in 2005. He is now a lecturer at the Higher Institute of Engineering and Technology, Kafrelsheikh, Egypt. His main research interest is big data management. He is a member of IEEE and ACM. E-mail: tamer@szu.edu.cn
Sadatdiynov Kuanishbay is currently a PhD candidate at Shenzhen University, China. He received the BS and MS degrees from Tashkent University of Information Technologies, Uzbekistan, in 2012 and 2014, respectively. His research interests include edge computing, network architecture, and big data analysis. E-mail: kuanishbay@szu.edu.cn