A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis (cited by: 18)


Abstract: Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies for speeding up computation on big data and increasing scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed, including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed, with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
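
The abstract's point about block-level sampling can be illustrated with a small sketch. It assumes synthetic Gaussian records and two hypothetical helper functions, `range_partition` and `random_partition`, neither of which comes from the paper. The random partitioner is a plain shuffle-and-cut, not the RSP model itself. The sketch treats each block as a block-level sample and measures how far the block mean can drift from the mean of the full data set.

```python
import random
import statistics


def _cut(records, num_blocks):
    """Cut a list into num_blocks contiguous blocks of near-equal size."""
    n = len(records)
    bounds = [i * n // num_blocks for i in range(num_blocks + 1)]
    return [records[bounds[i]:bounds[i + 1]] for i in range(num_blocks)]


def range_partition(records, num_blocks):
    """Range partitioning (illustrative): sort by value, then cut into blocks."""
    return _cut(sorted(records), num_blocks)


def random_partition(records, num_blocks, rng):
    """Random partitioning (illustrative): shuffle a copy, then cut into blocks."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    return _cut(shuffled, num_blocks)


def worst_block_error(records, blocks):
    """Largest gap between any single block's mean and the overall mean."""
    overall = statistics.fmean(records)
    return max(abs(statistics.fmean(block) - overall) for block in blocks)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data (an assumption of this sketch): 10,000 Gaussian values.
    records = [rng.gauss(50, 10) for _ in range(10_000)]

    range_blocks = range_partition(records, 10)
    random_blocks = random_partition(records, 10, rng)

    print("worst error, range blocks: ", worst_block_error(records, range_blocks))
    print("worst error, random blocks:", worst_block_error(records, random_blocks))
```

Under these assumptions, a block taken from the range-partitioned data covers only one slice of the value range, so its mean is a poor estimate of the overall mean, while a block from the randomly partitioned data behaves like a simple random sample.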

Authors: Mohammad Sultan Mahmud, Joshua Zhexue Huang, Salman Salloum, Tamer Z. Emara, Kuanishbay Sadatdiynov

Affiliations: National Engineering Laboratory for Big Data System Computing Technology; Big Data Institute

Source: Big Data Mining and Analytics (大数据挖掘与分析, English edition), 2020, No. 2, pp. 85-101 (17 pages)

Funding: Supported in part by the National Natural Science Foundation of China (No. 61972261) and the National Key R&D Program of China (No. 2017YFC0822604-2).

Keywords: big data analysis; data partitioning; data sampling; distributed and parallel computing; approximate computing

Classification number: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

Author biographies

Corresponding author: Mohammad Sultan Mahmud is currently a PhD candidate at Shenzhen University, China. He received a master's degree from King Mongkut’s University of Technology North Bangkok, Thailand, in 2014, and a bachelor's degree from BGC Trust University Bangladesh, Bangladesh, in 2008. Mr. Mahmud was named Outstanding Doctoral Student of Shenzhen University in 2017 and was awarded the Shenzhen Universiade International Scholarship in 2018. He also received a two-year Information Technology-King Mongkut’s University of Technology North Bangkok scholarship in 2012. His current research focuses on big data mining and distributed and parallel computing. E-mail: ssalloum@szu.edu.cn

Joshua Z. Huang received the PhD degree from the Royal Institute of Technology, Sweden, in 1993. He is a distinguished professor of the College of Computer Science & Software Engineering at Shenzhen University. He is also the director of the Big Data Institute and the deputy director of the National Engineering Laboratory for Big Data System Computing Technology. His main research interests include big data technology and applications. Prof. Huang has published over 200 research papers in conferences and journals. In 2006, he received the most influential paper award in the First Pacific-Asia Conference on Knowledge Discovery and Data Mining. Prof. Huang is known for his contributions to the development of a series of k-means type clustering algorithms in data mining, such as k-modes, fuzzy k-modes, k-prototypes, and w-k-means, which are widely cited and used, and some of which have been included in commercial software. He has extensive industry expertise in business intelligence and data mining, and has been involved in numerous consulting projects in Australia and China. E-mail: zx.huang@szu.edu.cn

Salman Salloum received the PhD degree from Shenzhen University, Shenzhen, China, in 2019, and the MS degree from Damascus University, Damascus, Syria, in 2013. He is currently an associate researcher with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China. From 2007 to 2014, he worked as an instructional designer and a project manager at ePedia-SY, a digital content company in Syria. He was also a tutor at Syrian Virtual University from 2012 to 2014. His current research focuses on cluster computing and approximate computing for big data analysis. E-mail: sultan@szu.edu.cn

Tamer Z. Emara is currently a PhD candidate at the Big Data Institute, Shenzhen University, China. He received the MS degree from Mansoura University, Egypt, in 2015, and the BS degree from Tanta University, Egypt, in 2005. He is now a lecturer at the Higher Institute of Engineering and Technology, Kafrelsheikh, Egypt. His main research interest is big data management. He is a member of IEEE and ACM. E-mail: tamer@szu.edu.cn

Sadatdiynov Kuanishbay is currently a PhD candidate at Shenzhen University, China. He received the BS and MS degrees from Tashkent University of Information Technologies, Uzbekistan, in 2012 and 2014, respectively. His research interests include edge computing, network architecture, and big data analysis. E-mail: kuanishbay@szu.edu.cn
