
Detection and Elimination of Similar Web Pages Based on Topic

Cited by: 2
Abstract: Duplicate web pages returned by search engines not only waste storage resources but also add to the user's browsing burden. Based on the characteristics of duplicated web pages, a topic-based deduplication method is proposed. The method uses chunking to extract the topic of each page's main text, then computes the similarity between topics and removes the duplicate pages. Experimental results show that the method accurately detects both fully duplicated and partially duplicated web pages.
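Authors: Fan Yong (樊勇), Zheng Jiaheng (郑家恒)
Source: Computer Development & Applications (《电脑开发与应用》), 2008, No. 4, pp. 4-6, 25 (4 pages)
Funding: National Natural Science Foundation of China (60775041)
Keywords: chunk; vector space; detection and elimination of similar web pages (web page deduplication); topic
About the author: Fan Yong, male, born 1979, master's student; research interest: natural language processing.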
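
The abstract describes the pipeline only at a high level: extract a topic for each page, compare topics, and drop the duplicates. The sketch below illustrates the comparison and removal steps under stated assumptions. It is not the authors' implementation. The chunk-based topic extraction is not reproduced: each page is assumed to arrive already reduced to a list of topic terms. Cosine similarity over term-frequency vectors is an assumption suggested by the "vector space" keyword, and the names `topic_vector`, `cosine_similarity`, `deduplicate`, and the threshold value 0.8 are hypothetical.

```python
import math
from collections import Counter


def topic_vector(topic_terms):
    """Build a term-frequency vector from a page's topic terms (assumed given)."""
    return Counter(topic_terms)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty topic matches nothing
    return dot / (norm_a * norm_b)


def deduplicate(pages, threshold=0.8):
    """Keep the first page of each group of similar pages.

    pages: list of (page_id, topic_terms) pairs.
    threshold: hypothetical cutoff; pages whose topic similarity to an
    already kept page reaches it are treated as duplicates.
    """
    kept = []
    for page_id, terms in pages:
        vec = topic_vector(terms)
        if all(cosine_similarity(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((page_id, vec))
    return [page_id for page_id, _ in kept]


pages = [
    ("a", ["web", "page", "dedup", "topic"]),
    ("b", ["web", "page", "dedup", "topic"]),   # full duplicate of "a"
    ("c", ["football", "match", "score"]),      # unrelated page
    ("d", ["web", "page", "dedup", "search"]),  # partial overlap with "a"
]
unique_ids = deduplicate(pages)
```

Lowering `threshold` makes the filter more aggressive toward partially overlapping pages such as "d"; the right value would have to be tuned on real data, which this page does not provide.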

