
Hierarchical text classification with non-labeled web data
Abstract: Traditional text classification methods require a labeled corpus to train classifiers, but labeling a corpus manually is costly and time-consuming. To address this, the paper trains text classifiers on Web data that carries no category labels and proposes a hierarchical text classification method based on unlabeled Web data. The method constructs Web queries by combining category knowledge with topic hierarchy information, searches several kinds of Web data for relevant documents, and extracts learning samples from them, which provides the classification basis for supervised learning. The classifiers are then trained with a hierarchical support vector machine. Experimental results show that the method can learn classifiers from unlabeled Web data and achieves good classification results, with performance close to that of supervised classification methods trained on labeled samples.
Source: CAAI Transactions on Intelligent Systems (《智能系统学报》), indexed in CSCD and the Peking University Core Journals list, 2014, No. 3, pp. 330-335 (6 pages)
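Funding: National High Technology Research and Development Program of China ("863" Program) (2010AA012505, 2011AA010702, 2012AA01A401, 2012AA01A402); National Basic Research Program of China (2013CB329601, 2013CB329602); National Natural Science Foundation of China (60933005, 91124002); National Key Technology R&D Program of China (2012BAH38B04); National 242 Information Security Program (2011A010)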
Keywords: hierarchical text classification; topic hierarchy; classification without labeled data; support vector machine
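
The abstract says that Web queries are built by combining category knowledge with topic hierarchy information, but it does not give the exact construction. The following is a minimal sketch of one plausible reading, in which a query joins the topic name, a few descriptive keywords, and the names of the topic's ancestors. The hierarchy, the keywords, and the function names `ancestors` and `build_query` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional

# Hypothetical toy topic hierarchy: child -> parent (None marks a top-level topic).
PARENT: Dict[str, Optional[str]] = {
    "sports": None,
    "football": "sports",
    "tennis": "sports",
    "science": None,
    "physics": "science",
}

# Hypothetical category knowledge: a few descriptive keywords per category.
KEYWORDS: Dict[str, List[str]] = {
    "football": ["goal", "league"],
    "tennis": ["racket", "grand slam"],
    "physics": ["quantum", "particle"],
}


def ancestors(topic: str) -> List[str]:
    """Return the ancestors of a topic, nearest parent first."""
    path: List[str] = []
    parent = PARENT[topic]
    while parent is not None:
        path.append(parent)
        parent = PARENT[parent]
    return path


def build_query(topic: str) -> str:
    """Join the topic name, its keywords, and its ancestor names into one query string."""
    terms = [topic] + KEYWORDS.get(topic, []) + ancestors(topic)
    # Quote multi-word terms so a search engine would treat them as phrases.
    return " ".join(f'"{t}"' if " " in t else t for t in terms)


# One query per topic; the documents returned for a query would serve as
# (noisy) training samples for that topic's classifier.
queries = {topic: build_query(topic) for topic in PARENT}
```

Including ancestor names is one way to use the hierarchy to disambiguate a short category name. The retrieval step and the hierarchical support vector machine training are outside the scope of this sketch.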

