Abstract
To improve the efficiency and harvest ratio of topic-focused web crawlers, this paper proposes an information search method based on topic-semantic URLs. The method maps each seed URL onto a topic node of a topic tree and expands the seed URL's semantics with the topic text along the topic path, guiding the crawler to fetch topic pages efficiently and accurately. During crawling, it also uses link-importance and page-importance factors to automatically select and breed new high-quality seed URLs. The paper focuses on the principle of this search method and its implementation in the system. Experimental results show that the method effectively improves the search efficiency and harvest ratio of the web crawler, and that its selection and breeding of seed links performs well.
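The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the term-overlap relevance measure, the linear weighting `alpha`, the threshold `SEED_THRESHOLD`, the example topic tree, and the example URL are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TopicNode:
    """A node of the topic tree; `text` holds the node's topic keywords."""
    text: str
    parent: Optional["TopicNode"] = None
    children: list = field(default_factory=list)

    def add_child(self, text: str) -> "TopicNode":
        child = TopicNode(text, parent=self)
        self.children.append(child)
        return child


def topic_path_terms(node: TopicNode) -> set:
    """Collect the topic text on the path from `node` up to the root."""
    terms = set()
    current = node
    while current is not None:
        terms.update(current.text.lower().split())
        current = current.parent
    return terms


def relevance(text: str, terms: set) -> float:
    """Assumed measure: fraction of topic terms that occur in `text`."""
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return len(words & terms) / len(terms)


def seed_score(anchor_text: str, page_text: str, terms: set,
               alpha: float = 0.5) -> float:
    """Assumed linear mix of link importance and page importance."""
    link_importance = relevance(anchor_text, terms)
    page_importance = relevance(page_text, terms)
    return alpha * link_importance + (1 - alpha) * page_importance


# Hypothetical topic tree and seed-to-node mapping.
root = TopicNode("sports")
football = root.add_child("football")
league = football.add_child("league")
seed_topics = {"http://example.com/league": league}

# Expand the seed URL's semantics with the text along its topic path.
terms = topic_path_terms(seed_topics["http://example.com/league"])

# Score a candidate link found while crawling (made-up texts).
score = seed_score("football league table",
                   "sports news football league results", terms)

SEED_THRESHOLD = 0.8  # assumed cut-off for promoting a URL to a new seed
is_new_seed = score >= SEED_THRESHOLD
```

In this sketch a candidate URL is promoted to a seed when its combined score reaches the threshold; a real crawler would replace the term-overlap measure with whatever link and page importance factors the paper defines.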
Source
《计算机应用与软件》
CSCD
2015, Issue 6, pp. 42-45 (4 pages)
Computer Applications and Software
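Funding
Scientific Research Project of the Hunan Provincial Department of Education (10C1064)
Scientific Research Project of Huaihua University (HHUY2010-18)
Key Discipline Construction Project of Huaihua University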