摘要
介绍了一个基于语景图的Web主题爬取器的初步设计。描述了NB分类器的文本学习的向量空间模型——Bernoulli模型及NaiveBayes分类器设计提出了简化的前端队列优先排序的设计方案,即下载文档的归一化文档向量与查询向量的余弦相似度,作为层内下载文档的排序准则,以便与各层队列中文档的类似然率得分排序进行对比。介绍了自动实现爬取结果与主题分类目录的集成设想。
This paper designes a focused crawler using context graph. The crawler is based on a set of Naive Bayes classifiers, which adopt both VSM and probability model for design comparison purpose. The frontier priority queue within a layer of the context graph is sorted by the cosine similarity between a downloaded normalized document vector and the query vector. An approach to classifying search results into a pre-defined category is presented.
出处
《计算机工程》
EI
CAS
CSCD
北大核心
2006年第12期208-209,228,共3页
Computer Engineering
关键词
主题爬取
机器学习
语景图
Focused crawling
Machine learning
Context graph