Abstract
Data asset valuation is of strategic significance to the development of data as a factor of production. To clarify the contribution of individual data asset valuation indicators and to balance the accuracy and interpretability of machine learning models, a prediction framework combining data preprocessing and feature selection with a back propagation neural network (data preprocessing-feature selection-back propagation neural network, DP-FS-BP) is proposed, and the Shapley additive explanations (SHAP) algorithm is used to explain the indicator contributions of the prediction model. Taking transaction block data collected from the Youe data network (优易数据网) as an example, data preprocessing and feature selection are first applied to clean the data and select indicators. The processed data and the original data are then compared on linear regression, support vector machine (SVM), decision tree, k-nearest neighbors (KNN), random forest, XGBoost, and DP-FS-BP models in terms of the coefficient of determination R^2, root mean squared error (RMSE), and mean absolute error (MAE). The results show that the DP-FS-BP model achieves the best prediction results, with a significant advantage in prediction accuracy over the other models. Explaining the BP neural network model with the SHAP algorithm shows that scientific research technology and data sample size have mean absolute SHAP values of 209.25 and 191.24, ranking first and second, respectively. Visualizing the contribution of each feature to the output provides a decision-making basis for establishing a corresponding data asset value evaluation index system.
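The three evaluation metrics named in the abstract (R^2, RMSE, MAE) can be illustrated with a minimal sketch. It uses the standard textbook definitions only; the function name `regression_metrics` and the sample numbers are hypothetical and are not taken from the paper, which may have computed these values with a library instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE for two equal-length numeric sequences.

    Hypothetical helper for illustration; not the authors' code.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_true = sum(y_true) / n
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the targets around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


if __name__ == "__main__":
    # Made-up values, only to show the call.
    print(regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 3.0]))
```

A higher R^2 and lower RMSE and MAE indicate a better fit, which is the sense in which the abstract compares the seven models.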
Authors
ZHOU Cui-ping (周翠平); LI Shao-bo (李少波); ZHANG Yi-zong (张仪宗); YUAN Pan-liang (袁攀亮); LIAO Zi-hao (廖子豪); ZHANG Xing-xing (张星星)
Affiliations: State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China; School of Mechanical Engineering, Guizhou University, Guiyang 550025, China
Source
Science Technology and Engineering (《科学技术与工程》)
Peking University Core Journal (北大核心)
2024, No. 33, pp. 14317-14329 (13 pages)
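Funding
Central Government Guided Local Science and Technology Development Fund, Reserve Project (黔科合中地引[2023]002)
National Natural Science Foundation of China, General Program (52275480)
Guizhou Provincial Higher Education Integrated Research Platform Project (黔教合KY字[2020]005)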
Keywords
data preprocessing
feature selection
model interpretability
back propagation neural network
contribution rate
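About the authors
First author: ZHOU Cui-ping (b. 1994), female, Han ethnicity, from Bijie, Guizhou; master's student. Research interest: data as a factor of production. E-mail: 15761635645@163.com. Corresponding author: LI Shao-bo (b. 1973), male, Han ethnicity, from Yueyang, Hunan; PhD, professor, doctoral supervisor. Research interests: big data and intelligent manufacturing. E-mail: lishaobo@guz.edu.cn.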