Abstract
Tunneling has long been a difficult problem in topical web crawling. After analysing the topical features of web pages and the shortcomings of general tunneling crawl algorithms, this paper proposes a crawling algorithm that uses topical similarity to guide the web crawler through tunnels, and uses a Naive Bayes classifier to improve the accuracy of the topical similarity computation. Experimental results show that the proposed tunneling technique achieves considerably higher precision and recall than the general tunneling technique.
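The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a two-class multinomial Naive Bayes model with Laplace smoothing as the topical similarity score, whitespace tokenization, a best-first frontier ordered by the parent page's score, and a fixed budget of consecutive off-topic pages the crawler may pass through. The names `NaiveBayesTopic`, `crawl`, `fetch`, `threshold`, `max_tunnel` and `max_pages` are hypothetical.

```python
import heapq
import math
from collections import Counter


class NaiveBayesTopic:
    """Two-class multinomial Naive Bayes with Laplace smoothing (illustrative).

    Assumes both training lists are non-empty.
    """

    def __init__(self, on_topic_docs, off_topic_docs):
        self.counts = {True: Counter(), False: Counter()}
        for doc in on_topic_docs:
            self.counts[True].update(doc.lower().split())
        for doc in off_topic_docs:
            self.counts[False].update(doc.lower().split())
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        n_on, n_off = len(on_topic_docs), len(off_topic_docs)
        self.log_prior = {True: math.log(n_on / (n_on + n_off)),
                          False: math.log(n_off / (n_on + n_off))}

    def _log_joint(self, words, label):
        # log P(label) + sum of log P(word | label), ignoring unknown words
        total = sum(self.counts[label].values()) + len(self.vocab)
        return self.log_prior[label] + sum(
            math.log((self.counts[label][w] + 1) / total)
            for w in words if w in self.vocab)

    def similarity(self, text):
        """Posterior probability that `text` is on-topic, in [0, 1]."""
        words = text.lower().split()
        diff = self._log_joint(words, False) - self._log_joint(words, True)
        return 1.0 / (1.0 + math.exp(min(diff, 700.0)))  # clamp avoids overflow


def crawl(seeds, fetch, classifier, threshold=0.5, max_tunnel=2, max_pages=100):
    """Best-first crawl; `fetch(url)` is assumed to return (text, links).

    Links found on off-topic pages are still followed, but only through at
    most `max_tunnel` consecutive off-topic pages. Returns on-topic URLs in
    the order they were visited.
    """
    # Heap entries: (negated priority, consecutive off-topic ancestors, url)
    frontier = [(-1.0, 0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant, visited = [], 0
    while frontier and visited < max_pages:
        _, depth, url = heapq.heappop(frontier)
        text, links = fetch(url)
        visited += 1
        score = classifier.similarity(text)
        if score >= threshold:
            relevant.append(url)
            next_depth = 0            # an on-topic page resets the tunnel
        else:
            next_depth = depth + 1
            if next_depth > max_tunnel:
                continue              # tunnel budget exhausted, do not expand
        for link in links:
            if link not in seen:
                seen.add(link)
                # Children inherit the parent's similarity as their priority
                heapq.heappush(frontier, (-score, next_depth, link))
    return relevant
```

In this sketch, links from low-similarity pages stay in the frontier with low priority instead of being discarded, which is what lets the crawler reach an on-topic page that is only linked from off-topic ones.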
Source
《计算机工程与科学》
CSCD
Peking University Core Journals (北大核心)
2009, No. 10, pp. 126-128 (3 pages)
Computer Engineering & Science
Funding
Supported by the Guangxi Natural Science Foundation (project 桂科青0832101)