Abstract
Entity resolution (ER) is a key research topic in data fusion. Traditional ER methods mainly target small datasets and focus on the accuracy of the resolution results. With the arrival of the big data era, their high time complexity makes them ill-suited to massive datasets, and quickly filtering out valuable data has become the more pressing concern in big data environments. This paper proposes an ER method suited to big data environments that performs resolution through data blocking, intra-block pattern matching, and inter-block pattern matching. The pattern matching uses a fast pattern scanning algorithm that improves the computational efficiency of ER while losing as little accuracy as possible. Experiments on the DBLP dataset using the Spark framework verify that the method achieves good time efficiency while maintaining the quality of ER.
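The three stages named in the abstract (blocking, intra-block matching, inter-block matching) can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the abstract does not describe the fast pattern scanning algorithm, so a plain token-set Jaccard similarity stands in for it, and the blocking key, the threshold, the function names, and the sample records are all hypothetical choices made for illustration. Spark is left out so the sketch runs with the standard library alone.

```python
from collections import defaultdict
from itertools import combinations


def tokens(title):
    """Lowercased word set of a title (stand-in for the paper's 'pattern')."""
    return frozenset(title.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def block_key(title):
    """Hypothetical blocking key: first character of the title."""
    return title.strip().lower()[:1]


def resolve(records, threshold=0.6):
    """Cluster record ids (dict: id -> title) that likely denote one entity."""
    sigs = {rid: tokens(title) for rid, title in records.items()}
    keys = {rid: block_key(title) for rid, title in records.items()}

    # Stage 1: data blocking, so that most pairs are never compared.
    blocks = defaultdict(list)
    for rid in records:
        blocks[keys[rid]].append(rid)

    # Union-find structure that accumulates the matches.
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    # Stage 2: intra-block matching, comparing pairs within each block.
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if jaccard(sigs[a], sigs[b]) >= threshold:
                union(a, b)

    # Stage 3: inter-block matching (assumed form): compare one
    # representative per cluster, only across different blocks.
    reps = sorted({find(rid) for rid in records})
    for a, b in combinations(reps, 2):
        if keys[a] != keys[b] and jaccard(sigs[a], sigs[b]) >= threshold:
            union(a, b)

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())


sample = {
    1: "Entity resolution for big data",
    2: "Entity resolution for big data sets",
    3: "Big data entity resolution",
    4: "Pattern matching on graphs",
}
print(resolve(sample))  # [[1, 2, 3], [4]]
```

Records 1 and 2 share a block and are merged in the intra-block pass; record 3 falls into a different block and is only caught by the inter-block pass, which is the case that pass exists for. Comparing only cluster representatives keeps the inter-block step cheaper than an all-pairs comparison, at the cost of possibly missing matches that a non-representative record would have revealed.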
Authors
熊安萍
詹妮
邹毅
龙林波
Xiong Anping1, Zhan Ni2, Zou Yi3, Long Linbo1 (1. School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; 2. School of Software Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; 3. Chongqing Municipal Public Security Bureau of Network Security Corps, Chongqing 401121, China)
Source
《计算机应用与软件》
PKU Core Journal
2018, No. 8, pp. 87-92, 97 (7 pages)
Computer Applications and Software
Funding
Chongqing Basic Science and Frontier Technology Research Project (cstc2017jcyjAX0164)
Keywords
Entity resolution
Data fusion
Big data
Pattern matching
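About the authors
Xiong Anping, professor; main research areas: massive information processing and big data security. Zhan Ni, master's degree. Zou Yi, senior engineer. Long Linbo, Ph.D.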