Abstract
Failure prediction plays an important role in many tasks, such as optimal resource management in large-scale systems. However, accurately predicting the number of failures in repairable large-scale long-running computing (RLLC) systems is challenging because of their repairability and scale. To address this challenge, a general Bayesian serial revision prediction method based on the Bootstrap approach and the moving average approach is put forward, which can accurately predict the number of failures. To demonstrate the performance gains of our method, extensive experiments are conducted on data from the Los Alamos National Laboratory (LANL) cluster, a typical RLLC system. The experimental results show that the prediction accuracy of our method is 80.2%, a substantial improvement of 4% over some typical methods. Finally, the managerial implications of the models are discussed.
Funding
This work was supported by the National Natural Science Foundation of China (60701006; 60804054; 71071158).
Biographies
Qiang Liu (corresponding author) was born in 1983. He is a Ph.D. student in the College of Information System and Management at National University of Defense and Technology. He is also a joint student of the School of Computer Science at McGill University, Canada. Currently his research interests are system pattern recognition, system reliability, Bayesian method, and prognostics and health management. E-mail: qiangliu.mcgill@gmail.com
Guang Jin was born in 1973. He is a Ph.D. and an associate professor in National University of Defense and Technology. His research interests are system modeling and simulation, system reliability estimation, experiment and evaluation, and prognostics and health management. E-mail: kingbayes@21cn.com
Jinglun Zhou was born in 1955. He is a Ph.D. and a professor in National University of Defense and Technology. His research interests are system reliability, risk assessment and evaluation, information management and decision, failure diagnostics, and prognostics and health management. E-mail: jlzhou@nudt.edu.cn
Quan Sun was born in 1973. He is a Ph.D. and an associate professor in National University of Defense and Technology. He is also a visiting scholar of the School of Industrial and Systems Engineering at Georgia Institute of Technology, USA. His research interests are risk assessment and evaluation, physics of reliability, and failure diagnostics. E-mail: quansun.nudt@gmail.com
Min Xi was born in 1979. He is a Ph.D. and a lecturer in Xi'an Jiaotong University. His research interests are system modeling and simulation, complex wireless sensor network, and failure diagnostics. E-mail: ximin.xjtu@gmail.com