Abstract
Duplicate web pages returned by search engines waste storage resources and add to the user's browsing burden. Based on the characteristics of duplicated web pages, a topic-based deduplication method is proposed. The method uses chunking to extract the topic of each page's main text, then computes the similarity between topics and removes the duplicate pages. Experiments show that the method accurately detects both fully duplicated and partially duplicated web pages.
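The similarity step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the chunk-based topic extraction procedure, the term weighting, or the similarity threshold, so everything below is an assumption made for illustration: topic terms are taken as already extracted, each topic is a plain term-frequency vector, similarity is cosine similarity in the vector space model, and the names `topic_vector`, `cosine_similarity`, `deduplicate`, and the threshold `0.8` are hypothetical.

```python
import math
from collections import Counter


def topic_vector(topic_terms):
    """Build a term-frequency vector from a page's topic terms.

    Assumption: the terms were already extracted from the page text;
    the paper's chunk-based extraction is not reproduced here.
    """
    return Counter(topic_terms)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(pages, threshold=0.8):
    """Return the ids of pages kept after removing near-duplicates.

    pages: list of (page_id, topic_terms) in ranking order.
    A page is dropped when its topic vector reaches the (hypothetical)
    threshold against any page that was already kept.
    """
    kept = []
    for page_id, terms in pages:
        vec = topic_vector(terms)
        if all(cosine_similarity(vec, kept_vec) < threshold
               for _, kept_vec in kept):
            kept.append((page_id, vec))
    return [page_id for page_id, _ in kept]


pages = [
    ("a", ["web", "page", "duplicate", "detection"]),
    ("b", ["web", "page", "duplicate", "detection", "topic"]),
    ("c", ["football", "match", "score"]),
]
print(deduplicate(pages))
```

In this sketch the first page of each group of similar topics is kept, so the order of the input decides which copy survives; a real system would likely choose the representative by some quality criterion.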
Source
Computer Development & Applications (《电脑开发与应用》)
2008, No. 4, pp. 4-6, 25 (4 pages in total)
Funding
National Natural Science Foundation of China (Grant No. 60775041)
Keywords
chunk
vector space
detection and elimination of similar web pages
topic
About the author
Fan Yong (樊勇), male, born in 1979, master's student. Research interest: natural language processing.