Abstract
Based on the link structure of Web pages, this paper proposes a method that predicts a page's topic relevance by passing topic information between pages, addressing the channel blockage and missed pages that affect topic-focused crawlers. A relevance value is first passed along each link according to its anchor text: if the anchor text indicates that the link is relevant, the relevance threshold value is passed on unchanged; if it indicates that the link is irrelevant, the value is multiplied by a genetic ratio before being passed on. Whenever a relevant page is encountered during propagation, the link's relevance value is restored to its initial value. Experiments evaluate the recall and precision of the algorithm, and recall improves significantly.
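The propagation rule described in the abstract can be illustrated with a minimal sketch. The abstract gives no concrete numbers, so the initial value, the genetic ratio, and the cutoff below which links are no longer followed are all assumptions chosen for illustration, as are the function names `propagate` and `should_follow`. This is one possible reading of the rule, not the authors' implementation.

```python
# Hypothetical constants: the abstract does not state actual values.
INITIAL_VALUE = 1.0   # assumed starting relevance value
GENETIC_RATIO = 0.5   # assumed attenuation factor (the "genetic ratio")
MIN_VALUE = 0.2       # assumed cutoff below which a link is not followed


def propagate(parent_value: float, anchor_relevant: bool, page_relevant: bool) -> float:
    """Return the relevance value carried forward through one link."""
    # Relevant anchor text: pass the value on unchanged.
    # Irrelevant anchor text: attenuate it by the genetic ratio.
    value = parent_value if anchor_relevant else parent_value * GENETIC_RATIO
    # Reaching a relevant page restores the value to its initial level.
    if page_relevant:
        value = INITIAL_VALUE
    return value


def should_follow(value: float) -> bool:
    """Assumed stopping rule: keep crawling while the value is at or above the cutoff."""
    return value >= MIN_VALUE


if __name__ == "__main__":
    # Walk a chain of irrelevant links and print the value after each hop.
    value = INITIAL_VALUE
    for hop in range(1, 4):
        value = propagate(value, anchor_relevant=False, page_relevant=False)
        print(hop, value, should_follow(value))
```

Under these assumed constants, a crawler can pass through two consecutive irrelevant links before the value falls below the cutoff, which is how the scheme would let it reach relevant pages hidden behind irrelevant ones while still bounding how far it wanders.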
Source
《计算机系统应用》 (Computer Systems & Applications)
2010, No. 3, pp. 49-52 (4 pages)
Keywords
web crawler
search engine
topic relevance
genetic inheritance
crawling