Abstract
Duplicate Web pages returned by search engines waste storage resources and add to the burden of browsing for users. Based on the characteristics of Web page duplication and of Web page text itself, a dynamic method for detecting duplicate Web pages is proposed. The method represents the main text of a Web page as a catalogue structure tree, and on that basis implements a dynamic feature extraction algorithm and a layer fingerprint algorithm for computing similarity. Experimental results show that the method accurately detects both fully duplicated and partially duplicated Web pages.
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journal (北大核心)
2007, No. 11, pp. 2854-2856 (3 pages)
Funding
National Natural Science Foundation of China (60473139, 60775041)
Natural Science Foundation of Shanxi Province (20051034)
Keywords
layer fingerprint
text structure
detection and elimination of duplicate Web pages
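About the authors
Wei Lixia (魏丽霞) (1981-), female, native of Fanshi, Shanxi; master's student; main research interest: natural language processing (goodwlx@163.com).
Zheng Jiaheng (郑家恒) (1948-), female, native of Hunan; professor and doctoral supervisor; main research interest: natural language processing.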