
Research on SCI Address Field Data Cleaning Method Based on Word2Vec

Cited by: 15
Abstract [Purpose/Significance] This study designs a data cleaning scheme for the SCI address field. It introduces the Word2Vec word vector model into the cleaning process and uses the contextual information in the address field to identify different spellings of the same institution name, ultimately building an "institution name mapping table" that achieves the cleaning. [Method/Process] First, the SCI address field data are preprocessed, and the information in each address is assembled into proper-noun tokens according to regular patterns. Next, a Word2Vec model is trained, and the trained model is combined with cosine similarity to find spellings similar to the institution name to be cleaned. Finally, the "institution name mapping table" is built to complete the cleaning. [Result/Conclusion] Empirical analysis shows that, first, at the same threshold the method identifies institutions more accurately than traditional character matching; second, it performs well at recognizing variants and abbreviations of institution names; third, it runs nearly 40 times as fast as the traditional character matching algorithm. The Word2Vec word vector model therefore has practical value in data cleaning: drawing on the context of the SCI address field, it can pick out look-alike, variant, and abbreviated forms of a specified institution name, and so serves data normalization.
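The three steps in the abstract (preprocessing, similarity search, mapping table) can be illustrated with a minimal Python sketch. It is not the author's implementation. The tokenization rule (split on commas, join the words of each segment with underscores), the function names, the example address, the threshold of 0.9, and the toy vectors are all assumptions made for illustration. In practice the vectors would come from a Word2Vec model trained on the token lists, for example with a library such as gensim.

```python
import math


def address_to_tokens(address):
    """Turn one address string into one token per comma-separated segment.

    Assumed rule (not from the paper): lowercase each segment and join its
    words with underscores so a multi-word name becomes a single token.
    """
    segments = [seg.strip() for seg in address.split(",")]
    return ["_".join(seg.lower().split()) for seg in segments if seg]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_mapping_table(target, vectors, threshold):
    """Map every token whose similarity to the target meets the threshold."""
    target_vec = vectors[target]
    mapping = {}
    for token, vec in vectors.items():
        if token == target:
            continue
        if cosine_similarity(target_vec, vec) >= threshold:
            mapping[token] = target
    return mapping


# Hypothetical vectors standing in for trained Word2Vec output.
toy_vectors = {
    "chinese_acad_sci": [0.90, 0.10, 0.05],
    "cas": [0.88, 0.12, 0.07],
    "chinese_acad_of_sci": [0.85, 0.15, 0.02],
    "wuhan_univ": [0.10, 0.95, 0.20],
}

tokens = address_to_tokens(
    "Chinese Acad Sci, Wuhan Inst Virol, Wuhan 430071, Peoples R China"
)
mapping_table = build_mapping_table("chinese_acad_sci", toy_vectors, 0.9)
print(tokens)
print(mapping_table)
```

With these toy numbers, the abbreviation and the variant spelling land in the mapping table while the unrelated institution does not. Real results depend entirely on the trained vectors and the chosen threshold.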
Author Sun Yuan (孙源) (Wuhan Documentation and Information Center, CAS, Wuhan 430071)
Source Journal of Intelligence (《情报杂志》), CSSCI, Peking University Core Journal, 2019, No. 2, pp. 195-200 (6 pages)
Keywords data cleaning; Word2Vec; word vector model; SCI address field
About the author Sun Yuan (ORCID: 0000-0001-5526-4124), male, born 1988, master's degree; research interests: data mining and natural language processing.
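  • Related literature
References: 5
Second-level references: 58
Co-citing documents: 168
Co-cited documents: 167
Citing documents: 15
Second-level citing documents: 33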
