Abstract
Traditional text classification methods require a labeled corpus to train classifiers, but labeling a corpus manually is costly and time-consuming. To address this, this paper trains the text classifier on Web data that carries no category labels and proposes a hierarchical text classification method based on unlabeled Web data. The method constructs Web queries by combining category knowledge with topic hierarchy information, searches several kinds of Web data for relevant documents, and extracts learning samples from them, which provides the classification basis for supervised learning; the classifiers are then learned with a hierarchical support vector machine. Experimental results show that the method can learn classifiers from unlabeled Web data and achieves good classification results, with performance close to that of supervised classification trained on labeled samples.
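The abstract does not say how the Web queries are actually built, so the following is only a minimal sketch of one plausible reading: each topic's query joins the labels on its path from the root (the hierarchy information) with a few descriptive keywords (the category knowledge). The toy hierarchy, the keyword lists, and the names `PARENT`, `KEYWORDS`, `path_to_root`, and `build_query` are all hypothetical and do not come from the paper.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: child -> parent (None marks a root topic).
PARENT: Dict[str, Optional[str]] = {
    "sports": None,
    "football": "sports",
    "tennis": "sports",
}

# Hypothetical category knowledge: a few descriptive keywords per topic.
KEYWORDS: Dict[str, List[str]] = {
    "sports": ["athlete"],
    "football": ["goal", "league"],
    "tennis": ["racket"],
}


def path_to_root(topic: str) -> List[str]:
    """Return the labels from the root down to `topic`."""
    path: List[str] = []
    node: Optional[str] = topic
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def build_query(topic: str) -> str:
    """Join ancestor labels with the topic's own keywords into one query string."""
    terms = path_to_root(topic) + KEYWORDS.get(topic, [])
    # dict.fromkeys drops repeated terms while keeping first-seen order.
    return " ".join(dict.fromkeys(terms))


if __name__ == "__main__":
    for topic in PARENT:
        print(topic, "->", build_query(topic))
```

Under this assumption, documents retrieved for a query would be treated as training samples for that topic and passed to the hierarchical classifier. The retrieval and training steps are left out because the abstract gives no details about them.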
Source
CAAI Transactions on Intelligent Systems (《智能系统学报》)
CSCD
Peking University Core Journal (北大核心)
2014, No. 3, pp. 330-335 (6 pages)
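Funding
National "863" Program of China (2010AA012505, 2011AA010702, 2012AA01A401, 2012AA01A402)
National Key Basic Research Program of China (2013CB329601, 2013CB329602)
National Natural Science Foundation of China (60933005, 91124002)
National Key Technology R&D Program of China (2012BAH38B04)
National 242 Information Security Program of China (2011A010)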
Keywords
hierarchical text classification
topic hierarchy
classification without labeled data
support vector machine