Research on method to detect reduplicative Chinese short texts (cited by: 4)
Abstract: To address redundancy among Chinese short texts, this paper proposes an effective de-duplication algorithm framework. Because short texts are both massive in volume and brief in length, and because Chinese differs from English, the framework introduces the Bloom Filter, the Trie tree, and the SimHash algorithm. In the first stage, a Bloom Filter or a Trie tree removes exact duplicates; in the second stage, the SimHash algorithm detects near-duplicates. The paper designs the parameters of the framework and confirms its feasibility and soundness through simulation experiments.
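The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the paper's parameters, so the character-bigram features, the MD5-based 64-bit fingerprint, the Hamming threshold of 3, and all names (`Trie`, `simhash`, `hamming`, `deduplicate`) are assumptions made for the example. Stage one uses a Trie here; a Bloom Filter could take its place at the cost of occasional false positives.

```python
import hashlib


class Trie:
    """Character-level trie used as an exact-match set of texts."""

    def __init__(self):
        self.root = {}

    def add(self, text):
        """Insert text; return True if it was new, False if already stored."""
        node = self.root
        for ch in text:
            node = node.setdefault(ch, {})
        if "" in node:  # the empty-string key marks the end of a stored text
            return False
        node[""] = True
        return True


def simhash(text, bits=64, n=2):
    """Unweighted SimHash fingerprint over character n-grams (assumed features)."""
    grams = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    counts = [0] * bits
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 when more n-gram hashes set it than cleared it.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_distance=3):
    """Keep the first occurrence of each text, dropping exact and near duplicates."""
    seen = Trie()
    fingerprints = []
    kept = []
    for text in texts:
        if not seen.add(text):
            continue  # stage 1: exact duplicate
        fp = simhash(text)
        if any(hamming(fp, other) <= max_distance for other in fingerprints):
            continue  # stage 2: near duplicate (linear scan, for clarity only)
        fingerprints.append(fp)
        kept.append(text)
    return kept
```

Calling `deduplicate(list_of_texts)` returns the surviving texts in their original order. The linear scan over stored fingerprints keeps the sketch short; a corpus of realistic size would need an index over fingerprints instead.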
Authors: Gao Xiang (高翔), Li Bing (李兵)
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2014, No. 16, pp. 192-197 (6 pages)
Funding: Humanities and Social Sciences Project of the Ministry of Education (No. 11YJA870017)
Keywords: text de-duplication; Chinese short texts; Bloom Filter; Trie tree; SimHash algorithm
About the authors: Gao Xiang, male, master's degree; Li Bing, male, Ph.D., associate professor. E-mail: gx8600@126.com
