Abstract
This paper proposes an improved web page classification method based on the hyperlink structure of the Web, making full use of information from adjacent pages (both inbound and outbound links). The method has five steps: (1) preprocess the training set and extract text and hyperlink structure information; (2) extract feature vectors and train a full-text classifier on the Web pages; (3) create a virtual document for each page from the anchor text and extended anchor text of its inbound links, and train a virtual-document classifier on these virtual documents in place of the full text; (4) combine the outputs of the two classifiers with a Naive Bayes method to obtain a preliminary classification; (5) revise the preliminary classification using the pages that the page links out to, yielding the final classification. A prototype automatic web page classification system based on this method was implemented, and classification experiments show that the method effectively improves classification performance.
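The abstract does not state the formulas behind steps (3) to (5), so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that the two classifiers are conditionally independent given the class, so that their posteriors can be multiplied and divided by the class prior, and that the revision step mixes a page's preliminary distribution with the average distribution of its outlinked pages. The names `build_virtual_document`, `NaiveBayesText`, `combine`, `revise` and the weight `alpha` are hypothetical.

```python
import math
from collections import Counter


def build_virtual_document(inlinks):
    """Step 3: join (anchor text, extended anchor text) pairs of all inbound links."""
    return " ".join(f"{anchor} {context}".strip() for anchor, context in inlinks)


class NaiveBayesText:
    """Tiny multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict_proba(self, doc):
        log_post = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.prior[c])
            for w in doc.lower().split():
                if w in self.vocab:  # words unseen in training are ignored
                    lp += math.log((self.counts[c][w] + 1) / total)
            log_post[c] = lp
        # Normalize in log space to avoid underflow on long documents.
        m = max(log_post.values())
        z = sum(math.exp(v - m) for v in log_post.values())
        return {c: math.exp(v - m) / z for c, v in log_post.items()}


def combine(p_full, p_virtual, prior):
    """Step 4 (assumed form): P(c | both) is proportional to P(c|full) * P(c|virtual) / P(c)."""
    scores = {c: p_full[c] * p_virtual[c] / prior[c] for c in prior}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


def revise(p_page, p_outlinked, alpha=0.7):
    """Step 5 (assumed form): blend with the mean distribution of outlinked pages."""
    if not p_outlinked:
        return dict(p_page)
    n = len(p_outlinked)
    return {
        c: alpha * p_page[c] + (1 - alpha) * sum(p[c] for p in p_outlinked) / n
        for c in p_page
    }
```

In use, one `NaiveBayesText` would be fitted on page full text and another on virtual documents; their `predict_proba` outputs are passed to `combine` together with the class prior, and the result is passed to `revise` along with the preliminary distributions of the pages it links to.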
Source
《泉州师范学院学报》
2008, No. 4, pp. 25-29, 47 (6 pages)
Journal of Quanzhou Normal University