Can Automatic Classification Help to Increase Accuracy in Data Collection?

Abstract
Purpose: The authors aim to test the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets.
Design/methodology/approach: The paper is centered on cleaning datasets gathered from publishers and online resources by the use of specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested in comparison with manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) and analyzed two properties: accuracy and recall. We assessed not only each algorithm individually, but also their combinations through a voting scheme. We also tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms.
Findings: We found that the performance of the algorithms varies with the size of the training sample. However, for the classification exercise in this paper, the best performing algorithms were SVM and Boosting. The combination of these two algorithms achieved a high agreement on coverage and was highly accurate. This combination performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks.
Research limitations: The dataset gathered has significantly more records related to the topic of interest than records on unrelated topics. This may affect the performance of some algorithms, especially in their identification of unrelated papers.
Practical implications: Although the classification achieved by this means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms. This can be of great help when the dataset is big. With the help of accuracy, recall, and coverage measures, it is possible to estimate the error involved in this classification, which could open the possibility of incorporating these algorithms in software specifically designed for data cleaning and classification.
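The three measures named in the abstract can be illustrated with a minimal sketch. The definitions below are assumptions, since the abstract does not spell them out: accuracy is the share of records whose predicted label matches the manual code, recall is the share of manually coded relevant records that a classifier recovers, and coverage is the share of records on which two classifiers agree. The label lists and the function names are hypothetical and are not taken from the paper.

```python
def accuracy(truth, pred):
    """Share of records whose predicted label equals the manual code."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def recall(truth, pred, positive=1):
    """Share of manually coded relevant records that the classifier found."""
    relevant = [p for t, p in zip(truth, pred) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant)


def agreement(pred_a, pred_b, truth):
    """Coverage (share of records where both classifiers agree) and
    accuracy measured only on those agreed records (assumed definitions)."""
    agreed = [(a, t) for a, b, t in zip(pred_a, pred_b, truth) if a == b]
    coverage = len(agreed) / len(truth)
    agreed_accuracy = sum(a == t for a, t in agreed) / len(agreed)
    return coverage, agreed_accuracy


# Hypothetical labels: 1 = related to the topic, 0 = unrelated.
manual = [1, 1, 1, 0, 1, 0, 1, 1]   # manual coding
svm    = [1, 1, 0, 0, 1, 1, 1, 1]   # illustrative SVM output
boost  = [1, 1, 1, 0, 1, 1, 0, 1]   # illustrative Boosting output

print("SVM accuracy:", accuracy(manual, svm))
print("SVM recall:", recall(manual, svm))
print("coverage, accuracy on agreed records:", agreement(svm, boost, manual))
```

Under these assumed definitions, records on which the two classifiers disagree are the natural candidates for manual review, which is one way a voting scheme could reduce manual coding.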

Authors: Frederique Lang, Diego Chavarro, Yuxian Liu

Affiliations: Science Policy Research Unit (SPRU); Tongji University Library

Source: Journal of Data and Information Science (数据与情报科学学报, English edition), 2016, No. 3, pp. 42-58 (17 pages)

Funding: Supported by the National Natural Science Foundation of China (NSFC) (Grant No. 71173154), the National Social Science Fund of China (NSSFC) (Grant No. 08BZX076), and the Fundamental Research Funds for the Central Universities.

Keywords: Disambiguation; Machine learning; Data cleaning; Classification; Accuracy; Recall; Coverage

Classification number: TP274.2 [Automation and Computer Technology: Detection Technology and Automatic Devices]

Corresponding author: Yuxian Liu (E-mail: yxliu@tongji.edu.cn).
