Hierarchical text classification with non-labeled web data

Original title: 基于无标记Web数据的层次式文本分类

Abstract: Traditional text classification methods need a labeled corpus to train classifiers, but labeling a corpus by hand is costly and time-consuming. This paper instead trains text classifiers on web data that carries no class labels, and proposes a hierarchical text classification method based on such data. The method builds web queries by combining category knowledge with topic hierarchy information, searches several kinds of web data for relevant documents, and extracts training samples from them, which gives supervised learning a basis for classification. A hierarchical support vector machine is then used to learn the classifiers. Experimental results show that the method can learn classifiers from unlabeled web data and achieves good classification results, with performance close to that of supervised classification trained on labeled samples.
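
The query-construction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy hierarchy, the keyword table, and the function names `path_to_root` and `build_query` are all hypothetical, and the rule used here (quote the category name, its ancestors, and any per-category keywords) is only one plausible way to combine category knowledge with topic hierarchy information.

```python
from typing import Dict, List, Optional

# Hypothetical toy topic hierarchy: child -> parent (None marks a root).
PARENT: Dict[str, Optional[str]] = {
    "sports": None,
    "football": "sports",
    "tennis": "sports",
    "science": None,
    "physics": "science",
}


def path_to_root(category: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return [category, its parent, its grandparent, ...] up to the root."""
    path: List[str] = []
    node: Optional[str] = category
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def build_query(
    category: str,
    parent: Dict[str, Optional[str]],
    keywords: Optional[Dict[str, List[str]]] = None,
) -> str:
    """Build a search query string for one category.

    Assumption: a query is the category name, its ancestors in the
    hierarchy, and optional per-category keywords, each quoted.
    """
    extra = (keywords or {}).get(category, [])
    terms = path_to_root(category, parent) + list(extra)
    # dict.fromkeys removes duplicate terms while keeping their order.
    return " ".join(f'"{t}"' for t in dict.fromkeys(terms))


if __name__ == "__main__":
    for name in PARENT:
        print(name, "->", build_query(name, PARENT, {"football": ["goal"]}))
```

In this sketch the ancestor terms narrow an ambiguous leaf name to its intended topic. The documents retrieved for each query would then serve as training samples for that node's classifier.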

Authors: HE Li (何力), TAN Shuang (谭霜), JIA Yan (贾焰), HAN Weihong (韩伟红)

Affiliation: College of Computer, National University of Defense Technology (国防科学技术大学计算机学院)

Source: CAAI Transactions on Intelligent Systems (《智能系统学报》), 2014, No. 3, pp. 330-335 (6 pages). Indexed in CSCD and the Peking University Core Journals list.

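Funding: National "863" Program of China (2010AA012505, 2011AA010702, 2012AA01A401, 2012AA01A402); National Key Basic Research Development Program of China (2013CB329601, 2013CB329602); National Natural Science Foundation of China (60933005, 91124002); National Science and Technology Support Program of China (2012BAH38B04); National 242 Information Security Program of China (2011A010)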

Keywords: hierarchical text classification; topic hierarchy; classification without labeled data; support vector machine

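Classification codes: TP391.1 [Automation and Computer Technology: Computer Application Technology]; TP181 [Automation and Computer Technology: Control Theory and Control Engineering]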

