A Method of Entity Resolution Based on Pattern Matching in Big Data Environment (大数据环境下一种基于模式匹配的实体统一方法) Cited by: 4
Abstract Entity resolution (ER) is a key research topic in data fusion. Traditional ER methods mainly target small data sets and focus on the accuracy of the resolution results. With the arrival of the big data era, their high time complexity makes them hard to apply to massive data sets, and quickly filtering out valuable data has become the more pressing concern in a big data environment. This paper proposes an ER method suited to big data environments that performs entity resolution through data blocking, intra-block pattern matching, and inter-block pattern matching. The pattern matching uses a fast pattern-scanning algorithm that improves the computational efficiency of ER while losing as little precision as possible. Experiments on the DBLP data set with the Spark framework verify that the method achieves good time efficiency while maintaining ER quality.
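The three stages named in the abstract (data blocking, intra-block matching, inter-block matching) can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the abstract does not specify the fast pattern-scanning algorithm, so token-set Jaccard similarity is swapped in as a placeholder matcher, and the blocking key (first letter of the title), the `threshold` value, and all function names are hypothetical. Spark is not used here.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercase word set, a crude stand-in for a record 'pattern'."""
    return frozenset(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def blocking_key(record):
    """Hypothetical blocking key: first letter of the title."""
    return record["title"].strip().lower()[:1]


def resolve(records, threshold=0.6):
    """Block records, match inside blocks, then match entities across blocks."""
    # Step 1: data blocking.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)

    # Union-find over record indices tracks which records share an entity.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    sigs = [tokens(r["title"]) for r in records]

    # Step 2: intra-block matching (placeholder matcher: Jaccard).
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if jaccard(sigs[i], sigs[j]) >= threshold:
                union(i, j)

    # Step 3: inter-block matching, one representative per entity.
    reps = sorted({find(i) for i in range(len(records))})
    for i, j in combinations(reps, 2):
        different_block = blocking_key(records[i]) != blocking_key(records[j])
        if different_block and jaccard(sigs[i], sigs[j]) >= threshold:
            union(i, j)

    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())
```

Calling `resolve` on a list of dicts with a `"title"` field returns lists of record indices, one list per resolved entity. Comparing only one representative per entity in the inter-block step is a simplification chosen to keep the number of cross-block comparisons small; the paper may handle this differently.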
Authors Xiong Anping (熊安萍)1, Zhan Ni (詹妮)2, Zou Yi (邹毅)3, Long Linbo (龙林波)1 (1. School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; 2. School of Software Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; 3. Chongqing Municipal Public Security Bureau of Network Security Corps, Chongqing 401121, China)
Source Computer Applications and Software (《计算机应用与软件》), PKU Core Journal, 2018, No. 8, pp. 87-92, 97 (7 pages in total)
Funding Chongqing Basic Science and Frontier Technology Research Project (cstc2017jcyjAX0164)
Keywords Entity resolution; Data fusion; Big data; Pattern matching
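About the authors Xiong Anping, professor; main research areas: massive information processing and big data security. Zhan Ni, master's degree. Zou Yi, senior engineer. Long Linbo, Ph.D.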
