Abstract
Web forums have become one of the dominant channels for publishing and exchanging information on the Internet. Both retrieval and mining of forum content depend on crawling it first, yet traditional crawlers, which target static Web pages with a breadth-first strategy, cannot fetch forum content effectively. Exploiting the structural characteristics of forums, this paper proposes a "board-topic correlation judgment" (BTCJ) algorithm and a crawling strategy based on board expansion. Experiments show that the method significantly outperforms the breadth-first strategy in both precision and recall (coverage) when crawling forums. It also generalizes well: in practice it has been applied to more than 12,000 forums of various types.
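The abstract reports results in terms of precision and recall but does not give the exact definitions used. As a minimal sketch, assuming the standard set-based definitions and a hand-labelled set of topic pages, the two measures for a single crawl could be computed as follows. The function name and the toy URLs are hypothetical and do not come from the paper.

```python
def precision_recall(crawled, relevant):
    """Return (precision, recall) of a crawl against a labelled reference set.

    Assumes the standard definitions: precision is the share of crawled
    pages that are relevant, recall is the share of relevant pages crawled.
    """
    crawled, relevant = set(crawled), set(relevant)
    hits = len(crawled & relevant)
    precision = hits / len(crawled) if crawled else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: these URLs are made up for illustration only.
relevant = {"/topic/1", "/topic/2", "/topic/3", "/topic/4"}
crawled = {"/topic/1", "/topic/2", "/topic/3", "/login", "/profile/9"}

p, r = precision_recall(crawled, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case the crawl wastes two of its five fetches on non-topic pages, which is the kind of loss a forum-aware strategy is meant to reduce.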
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core Journal (北大核心)
2007, No. 6, pp. 80-82 (3 pages)
Funding
National Basic Research Program of China ("973" Program), project "Large-Scale Text Content Computing" (2004CB318109)
Keywords
Internet forums (WWW forums)
Information crawling
Dynamic Web pages
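About the authors
Li Kui (李魁, b. 1982), male, master's student; main research interests: information retrieval, natural language processing; E-mail: ibucan@126.com
Cheng Xueqi (程学旗), research fellow
Guo Yan (郭岩), assistant research fellow
Zhang Kai (张凯), assistant research fellow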