
Criticality Checkpoint Management Strategy Based on RDD Characteristics in Spark

Cited by: 6
Abstract: By default, Spark's fault tolerance relies on checkpoints set by the programmer and on recomputation along the lineage of resilient distributed datasets (RDDs). When data is lost, Spark recomputes the affected tasks from the RDD lineage to recover it. For complex applications with many iterations and large input data, this recovery can incur substantial computation overhead. In addition, recovery tasks are placed on nodes by data locality alone, without regard to each node's computing capability, which lengthens recovery time and keeps the cluster from reaching its full performance. To reduce the recovery cost, we establish the Spark execution model, the checkpoint model, and the RDD criticality model, and on this basis propose the criticality checkpoint management (CCM) strategy, which comprises a checkpoint setting algorithm, a failure recovery algorithm, and a cleaning algorithm. The checkpoint setting algorithm analyzes the attributes of the RDDs in a job and their influence on the job's recovery time, and stores the RDDs with high criticality as checkpoints. The failure recovery algorithm chooses suitable nodes to recompute the lost RDDs according to the computing capability of each node. The cleaning algorithm removes checkpoints with low criticality when disk space becomes insufficient. Experimental results show that, at the cost of a slight increase in execution time, the strategy selects RDDs worth backing up as checkpoints, effectively reduces the recovery overhead when a node fails, and improves the effective disk utilization of the nodes.
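The three parts of the CCM strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not give the criticality formula, the node capability measure, or the exact selection rules, so everything below is an assumption made for illustration: the `criticality` and capability scores are hypothetical precomputed inputs, and the threshold rule, the greedy lowest-first eviction, and all names (`Checkpoint`, `select_checkpoints`, `pick_recovery_node`, `clean_checkpoints`) are invented here and are not the paper's algorithms.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    rdd_id: str
    size_mb: float
    criticality: float  # hypothetical score; higher means more worth keeping


def select_checkpoints(candidates, threshold):
    """Checkpoint setting (assumed rule): keep RDDs whose score meets a threshold."""
    return [c for c in candidates if c.criticality >= threshold]


def pick_recovery_node(capability_by_node):
    """Failure recovery (assumed rule): pick the node with the highest capability score."""
    return max(capability_by_node, key=capability_by_node.get)


def clean_checkpoints(stored, needed_mb):
    """Cleaning (assumed rule): evict lowest-criticality checkpoints until
    at least needed_mb of disk space has been freed."""
    kept = sorted(stored, key=lambda c: c.criticality)
    evicted = []
    freed = 0.0
    while kept and freed < needed_mb:
        victim = kept.pop(0)  # lowest remaining criticality
        evicted.append(victim)
        freed += victim.size_mb
    return kept, evicted


# Hypothetical example data, not taken from the paper.
stored = [
    Checkpoint("rdd_a", 100, 0.9),
    Checkpoint("rdd_b", 50, 0.2),
    Checkpoint("rdd_c", 80, 0.5),
]
selected = select_checkpoints(stored, threshold=0.5)
node = pick_recovery_node({"node1": 2.0, "node2": 3.5})
kept, evicted = clean_checkpoints(stored, needed_mb=120)
```

In this toy run, freeing 120 MB evicts `rdd_b` and then `rdd_c`, leaving the highest-scored checkpoint `rdd_a` on disk.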
Authors: Ying Changtian (英昌甜); Yu Jiong (于炯); Bian Chen (卞琛); Wang Weiqing (王维庆); Lu Liang (鲁亮); Qian Yurong (钱育蓉) (Postdoctoral Research Station of Electrical Engineering, Xinjiang University, Urumqi 830046; School of Software, Xinjiang University, Urumqi 830008; School of Electrical Engineering, Xinjiang University, Urumqi 830046)
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2017, No. 12, pp. 2849-2863 (15 pages)
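Funding: National Natural Science Foundation of China (61262088, 61462079, 61363083, 61562086, 51667020); Natural Science Foundation of Xinjiang Uygur Autonomous Region (2017D01A20); Scientific Research Program of Higher Education Institutions of Xinjiang Uygur Autonomous Region (XJEDU2016S106)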
Keywords: memory computing; Spark; checkpoint management; failure recovery; RDD characteristics
About the authors: Ying Changtian (yingct@xju.edu.com), born in 1989. PhD, Xinjiang University. Student member of CCF. Her main research interests include parallel computing, distributed systems, and memory computing; Yu Jiong (corresponding author, yujiong@xju.edu.cn), born in 1964. Professor and PhD supervisor. Senior member of CCF. His main research interests include grid computing and parallel computing; Bian Chen, born in 1981. Associate professor, PhD. Senior member of CCF. His main research interests include parallel computing and distributed systems; Wang Weiqing (wwq59@xju.edu.cn), born in 1959. Professor and PhD supervisor. His main research interests include power system relay protection, wind power generation control, and grid connection technology; Lu Liang, born in 1990. PhD candidate at Xinjiang University. Student member of CCF. His main research interests include flow processing and real-time computing; Qian Yurong, born in 1981. Professor and master's supervisor. Senior member of CCF. Her main research interests include data mining.

