Graph Based Co-Training Algorithm for Web Page Classification

Cited by: 9
Abstract: This paper proposes GCo-training (Graph based Co-training algorithm for web page classification), an inductive semi-supervised learning algorithm for web page classification that exploits both the hyperlinks among web pages and the text within them, and it proves the effectiveness of the algorithm theoretically. Within the Co-training framework, GCo-training iteratively trains two classifiers: a graph-based semi-supervised classifier built on a graph constructed from hyperlink information, and a Bayes classifier built on text features. The graph-based classifier needs only a small number of labeled examples; by mining the abundant relational information among the data, it reaches relatively high prediction accuracy and can supply the Bayes classifier with many additional labeled examples. In turn, the Bayes classifier, once trained on this augmented labeled set, provides useful labeled examples back to the graph-based classifier. The two classifiers thus help each other and steadily improve their respective performance over the iterations, after which the Bayes classifier can be used to predict the classes of large numbers of unseen examples. Experiments on the Web→KB dataset show that GCo-training outperforms both a Co-training algorithm that uses page-text and anchor-text features and an EM-based Bayes algorithm.
Source: Acta Electronica Sinica (《电子学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2009, No. 10, pp. 2173-2180, 2219 (9 pages in total)
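Funding: National Natural Science Foundation of China (No.60602064, No.60702062); Key Project of the Ministry of Education (No.108115); National 973 Key Basic Research Development Program (No.2006CB705707); National 863 High-Tech Research and Development Program (No.2007AA12Z223); National Ministries and Commissions Science and Technology Project (No.51307040103); Ministry of Education Changjiang Scholars and Innovative Research Team Program (No.IRT0645)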
Keywords: graph; semi-supervised; Co-training; inductive; web page classification
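
The iterative scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the graph-based classifier, the Bayes variant, or how examples are selected each round, so everything below is an assumption: the graph view uses a simple normalized label propagation as a stand-in, the text view uses multinomial naive Bayes with add-one smoothing, and each view labels its `per_round` most confident unlabeled pages. All function and parameter names are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def propagate_labels(adj, labels, n_classes, alpha=0.9, n_iter=50):
    """Graph view (assumed stand-in): normalized label propagation over hyperlinks."""
    adj = np.maximum(adj, adj.T).astype(float)      # treat hyperlinks as undirected
    deg = adj.sum(axis=1)
    deg[deg == 0] = 1.0                             # avoid division by zero for isolated pages
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    s = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    y0 = np.zeros((len(labels), n_classes))
    known = labels >= 0                             # -1 marks an unlabeled page
    y0[known, labels[known]] = 1.0
    f = y0.copy()
    for _ in range(n_iter):
        f = alpha * s @ f + (1 - alpha) * y0
    return f                                        # per-class scores for every page

def fit_naive_bayes(x, y, n_classes):
    """Text view (assumed): multinomial naive Bayes with add-one smoothing."""
    log_prior = np.log((np.bincount(y, minlength=n_classes) + 1.0) / (len(y) + n_classes))
    counts = np.vstack([x[y == c].sum(axis=0) for c in range(n_classes)]) + 1.0
    log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_prior, log_lik

def predict_proba_nb(model, x):
    log_prior, log_lik = model
    joint = x @ log_lik.T + log_prior
    joint -= joint.max(axis=1, keepdims=True)       # stabilize the exponentials
    p = np.exp(joint)
    return p / p.sum(axis=1, keepdims=True)

def gco_training_sketch(adj, x, labels, n_classes, rounds=5, per_round=2):
    """Alternate the two views; each adds its most confident predictions to the labeled pool."""
    labels = labels.copy()
    for _ in range(rounds):
        for view in ("graph", "text"):
            unlabeled = np.flatnonzero(labels < 0)
            if unlabeled.size == 0:
                break
            if view == "graph":
                scores = propagate_labels(adj, labels, n_classes)
            else:
                known = labels >= 0
                model = fit_naive_bayes(x[known], labels[known], n_classes)
                scores = predict_proba_nb(model, x)
            conf = scores[unlabeled].max(axis=1)
            pick = unlabeled[np.argsort(-conf)][:per_round]
            pick = pick[scores[pick].max(axis=1) > 0]   # skip pages the view knows nothing about
            labels[pick] = scores[pick].argmax(axis=1)
    known = labels >= 0
    # The final naive Bayes model is the inductive classifier for unseen pages.
    return fit_naive_bayes(x[known], labels[known], n_classes), labels

# Toy data (invented for illustration): two link chains, two vocabularies, one seed label each.
adj = np.zeros((8, 8))
for i, j in [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]:
    adj[i, j] = 1
x = np.array([[3, 2, 0, 0], [2, 3, 0, 0], [4, 1, 0, 0], [1, 4, 0, 0],
              [0, 0, 3, 2], [0, 0, 2, 3], [0, 0, 4, 1], [0, 0, 1, 4]])
seed = np.array([0, -1, -1, -1, 1, -1, -1, -1])
nb_model, final_labels = gco_training_sketch(adj, x, seed, n_classes=2)
unseen = predict_proba_nb(nb_model, np.array([[5, 0, 0, 0], [0, 0, 0, 5]])).argmax(axis=1)
```

In this sketch both views write into one shared label pool, a common simplification of Co-training. Only the naive Bayes model is returned for prediction, which mirrors the abstract's point that the text classifier is what makes the method inductive: it can score pages that were never part of the hyperlink graph.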