Abstract
To improve the efficiency and reduce the complexity of cleaning multi-source heterogeneous data, and to address the large amount of imprecise data found in multi-source heterogeneous data environments, a hierarchical-reduction classified cleaning method is proposed. An importance-measurement algorithm reduces the data hierarchically at the data source layer, the data attribute layer, and the data tuple layer, using the importance degree of data sources and weight tags on data attributes and tuples. A TAN (Tree Augmented Bayes Network) is then constructed following the idea of classification algorithms, and the probability values of the data are used to complete the classified cleaning of imprecise data. Experiments show that the proposed method effectively improves both the accuracy and the efficiency of imprecise data cleaning.
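The abstract does not say how the TAN is built. As a rough illustration only, the textbook construction of a tree-augmented network scores each attribute pair by its conditional mutual information given the class, then keeps a maximum spanning tree over those scores. The sketch below assumes discrete attributes stored as tuples; the function names `conditional_mutual_information` and `tan_structure` are hypothetical and are not taken from the paper, and the paper's own construction may differ.

```python
from collections import Counter
from itertools import combinations
from math import log


def conditional_mutual_information(rows, labels, i, j):
    """Estimate I(X_i; X_j | C) from empirical counts (natural log)."""
    n = len(rows)
    n_xyc = Counter((r[i], r[j], c) for r, c in zip(rows, labels))
    n_xc = Counter((r[i], c) for r, c in zip(rows, labels))
    n_yc = Counter((r[j], c) for r, c in zip(rows, labels))
    n_c = Counter(labels)
    cmi = 0.0
    for (x, y, c), count in n_xyc.items():
        # p(x,y,c) * log( p(x,y|c) / (p(x|c) * p(y|c)) ), written with raw counts
        cmi += (count / n) * log(count * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)]))
    return cmi


def tan_structure(rows, labels, root=0):
    """Return {attribute: parent attribute} for the TAN tree.

    The class variable is an implicit extra parent of every attribute.
    The root attribute has parent None.
    """
    d = len(rows[0])
    weight = {
        (i, j): conditional_mutual_information(rows, labels, i, j)
        for i, j in combinations(range(d), 2)
    }
    # Prim's algorithm for a maximum spanning tree, grown from the root,
    # so every edge is directed away from the root.
    parent = {root: None}
    while len(parent) < d:
        u, v = max(
            ((u, v) for u in parent for v in range(d) if v not in parent),
            key=lambda e: weight[(min(e), max(e))],
        )
        parent[v] = u
    return parent
```

Once the structure is fixed, each attribute's conditional probability table is estimated given its parent attribute and the class, and a low posterior probability for an observed value could then flag a candidate imprecise entry. That last step is an assumption about how the probability values might be used, not a description of the paper's procedure.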
Authors
YANG Shang-lin (杨尚林)
NONG Ying-xiong (农英雄)
HUANG Ru-wei (黄汝维)
CHEN Ning-jiang (陈宁江)
LIANG Bi-rui (梁碧枘)
School of Computer and Electronic Information, Guangxi University, Nanning 530004, China; Information Center of China Tobacco Guangxi Industrial Co., Ltd., Nanning 530001, China
Source
《广西大学学报(自然科学版)》
CAS
PKU Core Journal (北大核心)
2018, No. 3, pp. 1053-1061 (9 pages)
Journal of Guangxi University (Natural Science Edition)
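Funding
National Natural Science Foundation of China (61762008)
Guangxi Key Research and Development Program (桂科AB17195014)
Nanning Science and Technology Development Program (20173161)
Guangxi Natural Science Foundation (2017GXNSFAA198141, 2016GXNSFAA380115)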
Keywords
data cleaning
attribute reduction
TAN network
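About the author
Corresponding author: CHEN Ning-jiang (陈宁江, b. 1975), male, from Nanning, Guangxi; professor at Guangxi University, Ph.D.; E-mail: chnj@gxu.edu.cn.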