Abstract
A parallel load forecasting method based on gradient boosting decision trees (GBDT) is proposed, using the Spark platform to analyze big data from the electricity user side. First, the historical load and weather data sets are partitioned for parallel processing, and feature extraction and transformation are applied to obtain the feature vectors required by the forecasting model. Next, the number of Spark cluster nodes is set and the block size of the Hadoop Distributed File System (HDFS) is tuned. Finally, the parameter-tuned GBDT model is deployed on the Spark distributed platform for training and prediction, and its forecasting accuracy is compared with that of other prediction models. The results show that choosing the HDFS storage block size appropriately effectively improves the cluster's efficiency in processing big data, and that the distributed GBDT algorithm has considerable advantages in both speed and accuracy, meeting the requirements of power load forecasting.
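To make the core idea of the abstract concrete, the following is a minimal single-machine sketch of gradient boosting with one-split regression trees (stumps) under squared loss. It is not the authors' Spark implementation: the function names (fit_stump, fit_gbt, predict, mse), the two features (hour of day, temperature), the hyperparameter values, and the synthetic load data are all assumptions made purely for illustration.

```python
from typing import List, Tuple

Stump = Tuple[int, float, float, float]  # (feature index, threshold, left value, right value)


def fit_stump(xs: List[Tuple[float, ...]], residuals: List[float]) -> Stump:
    """Find the single split that minimizes squared error on the residuals."""
    best = None
    for j in range(len(xs[0])):
        values = sorted(set(x[j] for x in xs))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for x, r in zip(xs, residuals) if x[j] <= thr]
            right = [r for x, r in zip(xs, residuals) if x[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:  # no feature varies: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump: Stump, x: Tuple[float, ...]) -> float:
    j, thr, left_value, right_value = stump
    return left_value if x[j] <= thr else right_value


def fit_gbt(xs, ys, n_trees: int = 30, learning_rate: float = 0.3):
    """Boosting loop: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial model is the mean load
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of squared loss
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(ys, preds) -> float:
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Synthetic data (hypothetical): features are (hour of day, temperature in deg C).
xs = [(float(h), 18.0 + (h % 12)) for h in range(24)]
ys = [50.0 + 2.0 * t + (10.0 if 8 <= h <= 20 else 0.0) for h, t in xs]

model = fit_gbt(xs, ys)
mse_model = mse(ys, [predict(model, x) for x in xs])
mse_baseline = mse(ys, [model[0]] * len(ys))  # error of predicting the mean load
print(f"baseline MSE = {mse_baseline:.2f}, boosted MSE = {mse_model:.2f}")
```

In a distributed setting such as the one the abstract describes, the split search would run over data partitions rather than an in-memory list; this sketch only shows the boosting loop itself.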
Authors
许贤泽
刘静
施元
谭盛煌
XU Xianze; LIU Jing; SHI Yuan; TAN Shenghuang (School of Electronic Information, Wuhan University, Wuhan 430072, China)
Source
《华中科技大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core Journal
2019, No. 5, pp. 84-89 (6 pages)
Journal of Huazhong University of Science and Technology(Natural Science Edition)
Funding
National Natural Science Foundation of China (Grant No. 51705375)
Keywords
load forecasting
distributed computing
big data
gradient boosting decision tree
Spark platform
About the author
XU Xianze (born 1967), male, professor. E-mail: xxz@whu.edu.cn