
Web page deduplication based on Web page text structure (cited by: 13)

Detection and elimination of similar Web pages based on text structure
Abstract: Duplicate Web pages returned by search engines not only waste storage resources but also add to the burden on users browsing the results. Based on the characteristics of duplicated Web pages and of Web page text itself, a dynamic method for Web page deduplication is proposed. The method represents the body text of a Web page as a catalogue structure tree and, on that basis, implements a dynamic feature extraction algorithm and a layer fingerprint algorithm for computing similarity. Experiments show that the method accurately detects both fully duplicated and partially duplicated Web pages.
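Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2007, No. 11, pp. 2854-2856 (3 pages)
Funding: National Natural Science Foundation of China (60473139, 60775041); Natural Science Foundation of Shanxi Province (20051034)
Keywords: layer fingerprint; text structure; detection and elimination of similar Web pages
About the authors: WEI Lixia (魏丽霞, b. 1981), female, from Fanshi, Shanxi; master's student; main research interest: natural language processing (goodwlx@163.com). ZHENG Jiaheng (郑家恒, b. 1948), female, from Hunan; professor and doctoral supervisor; main research interest: natural language processing.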
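
The abstract describes representing a page's body text as a tree and comparing fingerprints layer by layer, but this record does not give the algorithm's details. The following is only a minimal illustrative sketch of that general idea, not the authors' method. The paragraph/sentence split, the use of MD5, the length-weighted score, and the names `Node`, `build_tree`, `fingerprint`, and `similarity` are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import re
from dataclasses import dataclass, field


def fingerprint(text: str) -> str:
    """Hash of the text with all whitespace removed (MD5 is an arbitrary choice)."""
    normalized = re.sub(r"\s+", "", text)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


@dataclass
class Node:
    """One node of a hypothetical structure tree: document, paragraph, or sentence."""
    text: str
    children: list[Node] = field(default_factory=list)

    @property
    def fp(self) -> str:
        return fingerprint(self.text)


def build_tree(body: str) -> Node:
    """Assumed layout: blank lines separate paragraphs, end punctuation ends sentences."""
    root = Node(body)
    for para in re.split(r"\n\s*\n", body):
        para = para.strip()
        if not para:
            continue
        p_node = Node(para)
        for sent in re.findall(r"[^.!?。!?]+[.!?。!?]?", para):
            sent = sent.strip()
            if sent:
                p_node.children.append(Node(sent))
        root.children.append(p_node)
    return root


def similarity(a: Node, b: Node) -> float:
    """Share of a's text that also appears in b, compared one layer at a time.

    Equal fingerprints short-circuit to 1.0; otherwise each child of a is
    scored against its best-matching child of b, weighted by text length.
    The measure is asymmetric (containment of a in b).
    """
    if a.fp == b.fp:
        return 1.0
    if not a.children or not b.children:
        return 0.0
    total = sum(len(c.text) for c in a.children)
    matched = sum(
        len(c.text) * max(similarity(c, d) for d in b.children)
        for c in a.children
    )
    return matched / total
```

In this sketch a fully duplicated page is caught at the root with a single fingerprint comparison, while a partially duplicated page gets a score between 0 and 1 from the paragraph and sentence layers. A deduplication pipeline would then compare that score against a threshold, whose value would have to be tuned and is not stated in this record.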

References (7)

  • 1 China Internet Network Information Center. The 19th Statistical Report on Internet Development in China [EB/OL]. [2007-05-05]. http://www.cnnic.net.cn/index/OE/00/11/index.htm.
  • 2 WANG Jianyong, XIE Zhengmao, LEI Ming, LI Xiaoming. Research and evaluation of near-replica Web page detection algorithms [J]. Acta Electronica Sinica, 2000, 28(z1): 130-132. (cited by: 21)
  • 3 MANBER U. Finding similar files in a large file system [C/OL] // Proceedings of the Winter 1994 USENIX Technical Conference. 1994: 1-10 [2007-05-10]. http://manber.com/publications.html.
  • 4 BRIN S, DAVIS J, GARCIA-MOLINA H. Copy detection mechanisms for digital documents [C/OL] // Proceedings of the ACM SIGMOD Annual Conference. 1995: 398-409 [2007-05-10]. http://www-db.stanford.edu/pub/brin/1995/copy.ps.
  • 5 HEINTZE N. Scalable document fingerprinting [C/OL] // Proceedings of the 2nd USENIX Workshop on Electronic Commerce. 1996: 191-200 [2007-05-10]. http://www.cs.cmu.edu/afs/cs/user/nch/www/koala/main.html.
  • 6 BRODER A Z, GLASSMAN S C, MANASSE M S, et al. Syntactic clustering of the Web [C/OL] // Proceedings of the 6th ACM International Conference on World Wide Web. USA: ACM Press, 1997: 1157-1166 [2007-05-10]. http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC-1997-015-html/.
  • 7 FENG Shicong, SHAN Songwei, GONG Bihong, ZHANG Zhigang, LI Xiaoming. Research on the "Tianwang" directory navigation service [J]. Journal of Computer Research and Development, 2004, 41(4): 653-659. (cited by: 8)


