Abstract
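This paper makes full use of both the hyperlink relations and the text of web pages and proposes an inductive semi-supervised learning algorithm for web page classification: the graph-based Co-training algorithm for web page classification, GCo-training for short. The effectiveness of the algorithm is also proved theoretically. Within the Co-training framework, GCo-training iteratively learns a semi-supervised classifier based on a graph constructed from hyperlink information and a Bayes classifier based on text features. The graph-based semi-supervised classifier uses only a small amount of labeled data, yet reaches relatively high prediction accuracy by mining the large amount of relational information among the data, so it can supply the Bayes classifier with a large amount of label information; conversely, once the Bayes classifier has learned from this label information, it can supply useful information to the graph-based classifier. During the iterations the two classifiers help each other and steadily improve their respective performance, after which the Bayes classifier can be used to predict the classes of large numbers of unseen data. Experiments on the WebKB dataset show that GCo-training outperforms both the Co-training algorithm that uses text features and anchor-text features and the EM-based Bayes algorithm.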
This paper proposes a novel inductive semi-supervised algorithm for web page classification, named GCo-training, which exploits both the text of web pages and the hyperlinks among them. Under the Co-training framework, GCo-training iteratively trains two classifiers: a graph-based semi-supervised classifier built on the hyperlinks among web pages and a Bayes classifier built on the text of web pages. On the one hand, the graph-based semi-supervised classifier obtains high accuracy from a small set of labeled examples by exploiting the links among web pages, and can therefore augment the labeled examples available to the Bayes classifier. On the other hand, the Bayes classifier can also provide labeled examples for the graph-based classifier after it learns from the labeled set augmented by the graph-based classifier. The two classifiers thus help each other and improve their respective performance during training. Finally, the Bayes classifier can classify a large number of unseen examples. We test GCo-training, a Co-training algorithm based on words occurring on web pages and words occurring in hyperlinks, and an EM-based Bayes algorithm on the WebKB dataset. Experimental results show that GCo-training performs much better than the other algorithms.
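The training loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the graph-based classifier or the rule for exchanging labels, so this sketch assumes simple label propagation over the hyperlink adjacency matrix, a multinomial naive Bayes classifier over word counts, and a "most confident predictions first" selection rule. All function names and parameters (`alpha`, `per_round`, `rounds`) are hypothetical.

```python
import numpy as np

def propagate_labels(adj, labels, n_classes, alpha=0.8, n_iter=50):
    """Graph view (assumed: plain label propagation). labels uses -1 for unlabeled."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1                      # avoid dividing by zero for isolated pages
    trans = adj / deg                      # row-normalised hyperlink graph
    seed = np.zeros((adj.shape[0], n_classes))
    known = labels >= 0
    seed[known, labels[known]] = 1.0
    scores = seed.copy()
    for _ in range(n_iter):
        scores = alpha * (trans @ scores) + (1 - alpha) * seed
    return scores

def fit_naive_bayes(counts, labels, n_classes):
    """Text view: multinomial naive Bayes with add-one smoothing."""
    n_known = (labels >= 0).sum()
    log_prior = np.zeros(n_classes)
    log_lik = np.zeros((n_classes, counts.shape[1]))
    for c in range(n_classes):
        rows = counts[labels == c]
        log_prior[c] = np.log((rows.shape[0] + 1.0) / (n_known + n_classes))
        word_totals = rows.sum(axis=0) + 1.0
        log_lik[c] = np.log(word_totals / word_totals.sum())
    return log_prior, log_lik

def predict_naive_bayes(model, counts):
    """Return class probabilities for each row of word counts."""
    log_prior, log_lik = model
    joint = counts @ log_lik.T + log_prior
    joint = joint - joint.max(axis=1, keepdims=True)
    probs = np.exp(joint)
    return probs / probs.sum(axis=1, keepdims=True)

def add_most_confident(labels, scores, per_round):
    """Assumed exchange rule: label the per_round most confident unlabeled pages."""
    new_labels = labels.copy()
    unlabeled = np.flatnonzero(labels < 0)
    if unlabeled.size == 0:
        return new_labels
    conf = scores[unlabeled].max(axis=1)
    order = np.argsort(-conf)[:per_round]
    pick = unlabeled[order][conf[order] > 0]   # skip pages with no evidence at all
    new_labels[pick] = scores[pick].argmax(axis=1)
    return new_labels

def gco_training_sketch(adj, counts, labels, n_classes, rounds=5, per_round=2):
    """Alternate the two views; return the final text classifier and labels."""
    labels = labels.copy()
    for _ in range(rounds):
        # The graph classifier labels pages for the Bayes classifier ...
        graph_scores = propagate_labels(adj, labels, n_classes)
        labels = add_most_confident(labels, graph_scores, per_round)
        # ... and the Bayes classifier labels pages for the graph classifier.
        model = fit_naive_bayes(counts, labels, n_classes)
        text_probs = predict_naive_bayes(model, counts)
        labels = add_most_confident(labels, text_probs, per_round)
    return fit_naive_bayes(counts, labels, n_classes), labels
```

Because the returned model depends only on word counts, it can be applied to pages that were never part of the hyperlink graph, which is the sense in which the approach is inductive.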
Source
《电子学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2009, No. 10, pp. 2173-2180, 2219 (9 pages)
Acta Electronica Sinica
Funding
National Natural Science Foundation of China (No.60602064, No.60702062)
Key Project of the Ministry of Education (No.108115)
National Basic Research Program of China (973 Program) (No.2006CB705707)
National High Technology Research and Development Program of China (863 Program) (No.2007AA12Z223)
National Ministries and Commissions Science and Technology Project (No.51307040103)
Program for Changjiang Scholars and Innovative Research Team in University, Ministry of Education (No.IRT0645)