Design of a Focused Crawler Based on the Improved Genetic Algorithm (基于改进遗传算法的聚焦爬虫设计) Cited by: 3
Abstract This paper presents a design method for a focused crawler whose core components are a crawling controller and a page analysis filter. Starting from the topic to be retrieved, the crawler is guided by a crawling strategy that builds on an improved genetic algorithm and combines the strengths of content-evaluation and link-structure search strategies. Each URL awaiting crawling is treated as a genetic individual; individual fitness is assessed with a vector space model (VSM) built on the topic word set; crossover and mutation are realized by introducing new URLs; and links that share the same URL prefix are handled as a niche. Experimental results show that the crawler performs well.
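The abstract gives no formulas, so the following is only a minimal sketch of two of the ideas it names: scoring a candidate URL against a topic word set with a vector space model, and grouping URLs with a common prefix into niches. The function names (`fitness`, `niche_key`, `select`), the use of cosine similarity, the choice of host plus directory as the "prefix", and the keep-top-k-per-niche rule are all assumptions made for illustration and are not taken from the paper.

```python
import math
import posixpath
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def fitness(text_terms, topic_weights):
    """Cosine similarity between a term-frequency vector and the topic vector.

    Assumption: only terms that appear in the topic word set are counted.
    """
    tf = Counter(t for t in text_terms if t in topic_weights)
    dot = sum(tf[t] * w for t, w in topic_weights.items())
    norm_doc = math.sqrt(sum(c * c for c in tf.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_weights.values()))
    if norm_doc == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_doc * norm_topic)


def niche_key(url):
    """Hypothetical niche identifier: host plus the directory part of the path."""
    parts = urlsplit(url)
    return parts.netloc + posixpath.dirname(parts.path)


def select(candidates, topic_weights, per_niche=2):
    """Keep the fittest `per_niche` URLs in each niche, ranked by fitness.

    `candidates` maps each URL to the terms associated with it
    (for example its anchor text); this input shape is an assumption.
    """
    niches = defaultdict(list)
    for url, terms in candidates.items():
        niches[niche_key(url)].append((fitness(terms, topic_weights), url))
    survivors = []
    for members in niches.values():
        members.sort(reverse=True)
        survivors.extend(members[:per_niche])
    survivors.sort(reverse=True)
    return [url for _, url in survivors]


if __name__ == "__main__":
    topic = {"crawler": 1.0, "genetic": 1.0}
    frontier = {
        "http://a.com/x/1.html": ["crawler"],
        "http://a.com/x/2.html": ["crawler", "genetic"],
        "http://a.com/x/3.html": ["cat"],
        "http://b.org/y.html": ["genetic"],
    }
    for url in select(frontier, topic):
        print(url)
```

Capping the survivors per niche is one simple way to keep a single site or directory from dominating the population; the paper may handle niches differently.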
Source Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal, 2010, No. 5, pp. 126-129 (4 pages)
Funding Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ091309)
Keywords focused crawler; crawling controller; topic relevancy; data extraction
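About the authors Fan Huilian (范会联, born 1971), male, from Shizhu, Chongqing; master's degree; associate professor; CCF member (E200013523M); research interests: software engineering and intelligent information processing. Mailing address: School of Mathematics and Computer Science, Yangtze Normal University, Chongqing 408100; Tel: 13330383538; E-mail: fhlmx@163.com. Li Xianli (李献礼): professor; research interests: nonlinear algorithms and data mining. Zeng Guangpu (曾广朴): lecturer; research interests: network information systems and data mining.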