Abstract
This paper proposes an effective algorithm framework for de-duplicating Chinese short texts. Given the huge volume and brevity of short texts, as well as the differences between Chinese and English, the framework combines Bloom Filter, Trie tree, and the SimHash algorithm. In the first stage, a Bloom Filter or a Trie tree removes exact duplicates; in the second stage, the SimHash algorithm detects near duplicates. The parameters of the framework are designed, and simulation experiments confirm its feasibility and rationality.
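The two-stage framework the abstract describes can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the `BloomFilter`, `simhash`, and `dedup` helpers are hypothetical names, and the MD5-based hashing, 64-bit fingerprint, character-level tokenization, and Hamming-distance threshold of 3 are all assumptions for demonstration; the paper designs its own parameters.

```python
import hashlib


def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint from a token list (assumed 64-bit, MD5-based)."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Hamming distance between two integer fingerprints."""
    return bin(a ^ b).count("1")


class BloomFilter:
    """Minimal Bloom filter using k salted MD5 hashes over a fixed bit array."""

    def __init__(self, size=2**20, hashes=4):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size // 8)

    def _positions(self, text):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{text}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, text):
        for p in self._positions(text):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, text):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(text))


def dedup(texts, threshold=3):
    """Two-stage dedup: Bloom filter for exact duplicates (stage 1),
    SimHash Hamming distance for near duplicates (stage 2)."""
    seen_exact = BloomFilter()
    kept, fingerprints = [], []
    for t in texts:
        if t in seen_exact:        # stage 1: drop exact duplicates
            continue
        seen_exact.add(t)
        fp = simhash(list(t))      # character-level tokens (crude for Chinese)
        if any(hamming(fp, f) <= threshold for f in fingerprints):
            continue               # stage 2: drop near duplicates
        fingerprints.append(fp)
        kept.append(t)
    return kept
```

In a production version of the framework, the character-level tokenizer would be replaced by a Chinese word segmenter, and the linear fingerprint scan in stage 2 by a bucketed index over fingerprint fragments, which is the usual way SimHash lookups are made scalable.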
Source
《计算机工程与应用》 (Computer Engineering and Applications)
CSCD
2014, No. 16, pp. 192-197 (6 pages)
Funding
Humanities and Social Sciences Research Project of the Ministry of Education (No.11YJA870017)
Keywords
text de-duplication
Chinese short texts
Trie tree
SimHash algorithm
Bloom Filter
About the Authors
Gao Xiang (高翔), male, M.S.; Li Bing (李兵), male, Ph.D., Associate Professor. E-mail: gx8600@126.com