Abstract
With the continued development of network information technology, the web now holds large volumes of unstructured data of many kinds, commonly referred to as big data. Such data cannot easily be stored in a local database for access and processing. It has become increasingly clear that efficiently obtaining the latest useful information from a varied and noisy web is essential. Collecting information by hand is laborious, which led to the emergence of web crawler technology. However, existing search engines still have shortcomings in topic similarity judgment and in page ranking algorithms. This paper therefore applies the PageRank algorithm to a topic crawler and builds a vertical search engine.
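For readers unfamiliar with PageRank, the following is a minimal illustrative sketch of the standard power-iteration form of the algorithm, with a ranked list of the kind a crawler could use to order pages. It is not the authors' implementation: the function name `pagerank`, the damping factor 0.85, the tolerance, and the toy link graph are all assumptions made for illustration, and the topic-similarity component mentioned in the abstract is not modeled.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank.

    links: dict mapping a page to the list of pages it links to (assumed input format).
    Returns a dict mapping each page to its rank; the ranks sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        # Each page passes an equal share of its damped rank to every outlink.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical four-page link graph, used only to exercise the function.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)

# Pages ordered from highest to lowest rank, e.g. as a crawl or result ordering.
frontier = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph, page `c` receives links from three pages and ranks first, while page `d` has no inbound links and ranks last.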
Authors
于林轩
李业丽
曾庆涛
YU Linxuan; LI Yeli; ZENG Qingtao (Integrated Laboratory for Applied Research and Services of Key Technologies in Press and Publication Field, Beijing Institute of Graphic Communication, Beijing 102600, China)
Source
《北京印刷学院学报》
2020, Issue 10, pp. 143-147 (5 pages)
Journal of Beijing Institute of Graphic Communication
Funding
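Beijing Science and Technology Innovation Service Capacity Building Project (PXM2016_014223_000025)
Guangdong Province Major Science and Technology Special Project (190826175545233)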