Bayesian serial revision method for RLLC cluster systems failure prediction

Abstract: Failure prediction plays an important role in many tasks, such as optimal resource management in large-scale systems. However, accurately predicting the number of failures of repairable large-scale long-running computing (RLLC) systems is challenging because of their reparability and scale. To address this challenge, a general Bayesian serial revision prediction method based on the Bootstrap approach and the moving average approach is put forward, which can accurately predict the number of failures. To demonstrate the performance gains of our method, extensive experiments are conducted on data from the Los Alamos National Laboratory (LANL) cluster, which is a typical RLLC system. Experimental results show that the prediction accuracy of our method is 80.2%, an improvement of 4% over some typical methods. Finally, the managerial implications of the models are discussed.
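The abstract names three ingredients (a Bayesian revision step, the Bootstrap, and a moving average) without giving the model itself. The sketch below is one plausible way to combine them and is not the authors' algorithm: it assumes failure counts per period are Poisson with a Gamma prior on the rate, smooths the history with a moving average, bootstraps the smoothed series to moment-match the prior, and then revises the posterior after each new observation. All function names, the window size, and the sample counts are hypothetical.

```python
import random
from statistics import mean, variance


def moving_average(counts, window):
    """Rolling mean over `window` consecutive periods."""
    return [
        sum(counts[i:i + window]) / window
        for i in range(len(counts) - window + 1)
    ]


def bootstrap_gamma_prior(values, n_resamples=1000, seed=0):
    """Bootstrap the sample mean and moment-match a Gamma(shape, rate) prior.

    Assumption for illustration: the prior on the failure rate is Gamma,
    with mean = shape / rate and variance = shape / rate**2.
    """
    rng = random.Random(seed)
    means = [
        mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]
    m = mean(means)
    v = max(variance(means), 1e-9)  # guard against a degenerate resample set
    rate = m / v
    shape = m * rate
    return shape, rate


def serial_revision(history, new_counts, window=3):
    """Predict each new period, then revise the posterior with its count.

    Gamma-Poisson conjugacy: observing y failures in one period turns
    Gamma(shape, rate) into Gamma(shape + y, rate + 1).
    """
    shape, rate = bootstrap_gamma_prior(moving_average(history, window))
    predictions = []
    for y in new_counts:
        predictions.append(shape / rate)  # posterior mean as point forecast
        shape += y
        rate += 1
    return predictions


if __name__ == "__main__":
    # Hypothetical monthly failure counts, not LANL data.
    past = [9, 11, 8, 12, 10, 9, 11, 10]
    upcoming = [10, 12, 9]
    for observed, predicted in zip(upcoming, serial_revision(past, upcoming)):
        print(f"predicted {predicted:.2f}, observed {observed}")
```

Because the bootstrap of a short, smoothed series yields a tight prior, the forecast in this sketch moves slowly as new counts arrive. A wider prior or a forgetting factor would make the revision more responsive, but those choices are outside what the abstract specifies.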
Source: Journal of Systems Engineering and Electronics (系统工程与电子技术(英文版)), indexed in SCIE, EI, CSCD. 2011, No. 2, pp. 238-246 (9 pages).
Funding: Supported by the National Natural Science Foundation of China (60701006, 60804054, 71071158).
Keywords: failure prediction, cluster systems, Bayesian approach, failure rate.
About the authors:
Qiang Liu (corresponding author) was born in 1983. He is a Ph.D. student in the College of Information System and Management at National University of Defense and Technology. He is also a joint student of the School of Computer Science at McGill University, Canada. His current research interests are system pattern recognition, system reliability, Bayesian methods, and prognostics and health management. E-mail: qiangliu.mcgill@gmail.com
Guang Jin was born in 1973. He is a Ph.D. and an associate professor at National University of Defense and Technology. His research interests are system modeling and simulation, system reliability estimation, experiment and evaluation, and prognostics and health management. E-mail: kingbayes@21cn.com
Jinglun Zhou was born in 1955. He is a Ph.D. and a professor at National University of Defense and Technology. His research interests are system reliability, risk assessment and evaluation, information management and decision, failure diagnostics, and prognostics and health management. E-mail: jlzhou@nudt.edu.cn
Quan Sun was born in 1973. He is a Ph.D. and an associate professor at National University of Defense and Technology. He is also a visiting scholar of the School of Industrial and Systems Engineering at Georgia Institute of Technology, USA. His research interests are risk assessment and evaluation, physics of reliability, and failure diagnostics. E-mail: quansun.nudt@gmail.com
Min Xi was born in 1979. He is a Ph.D. and a lecturer at Xi'an Jiaotong University. His research interests are system modeling and simulation, complex wireless sensor networks, and failure diagnostics. E-mail: ximin.xjtu@gmail.com
