Abstract
In the keyword calculation and extraction stage of Chinese text similarity de-duplication, word segmentation yields terms that are high-dimensional, sparse, and lacking in semantics. Most of these terms carry no practical meaning; they add noise to the computation and hinder de-duplication. It is therefore necessary to extract text features that represent the main content of the text. To address this problem, an improved keyword extraction method is proposed that combines word frequency, the mutual-information correlation between terms, and their semantic similarity. The method jointly considers the statistical characteristics of candidate words and the relevance and similarity between terms, and is applied to the SimHash text similarity computation model. Experimental results show that feature extraction based on this model achieves high precision, recall, and F1 in similar-text de-duplication, outperforming traditional methods.
Source
Computer Technology and Development
2015, No. 12, pp. 22-27 (6 pages)
Funding
Supported by the National Natural Science Foundation of China (61170120)
Keywords
text feature
similarity calculation
mutual information
SimHash
feature extraction
text de-duplication
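About the authors
Shi Yan (石雁, born 1986), male, master's degree, CCF member; research interests: search engines and recommender systems.
Li Chaofeng (李朝锋), professor, Ph.D., master's supervisor; research interests: image processing, artificial intelligence, and pattern recognition.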