Abstract
Document representation methods in traditional text classification are generally based on full-text (bag-of-words) analysis. Because they ignore domain-specific semantic features, they are poorly suited to domain-specific text classification tasks. This paper proposes a feature selection method based on extracting domain-specific terms through corpus comparison, and combines it with an SVM classifier to build a text classification system for specific domains that can be easily applied to other domains. The system achieved the highest score among runs from 19 groups in the evaluation of the Genomics Track Categorization Task at the 2005 Text REtrieval Conference (TREC).
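The abstract does not say how the corpus comparison is scored, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a smoothed log-ratio of relative term frequencies between a domain corpus and a general corpus; the function names (`domain_terms`, `to_feature_vector`), the whitespace tokenizer, and the toy corpora are all hypothetical. The SVM training step is omitted.

```python
import math
from collections import Counter


def term_counts(docs):
    """Count lowercase whitespace tokens across docs; return (counts, total)."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return counts, sum(counts.values())


def domain_terms(domain_docs, general_docs, top_k=5, min_count=1):
    """Rank terms by a smoothed log-ratio of domain vs. general frequency.

    Assumption: add-one smoothing over the joint vocabulary. The paper's
    actual scoring function is not given in the abstract.
    """
    d_counts, d_total = term_counts(domain_docs)
    g_counts, g_total = term_counts(general_docs)
    vocab_size = len(set(d_counts) | set(g_counts))
    scores = {}
    for term, count in d_counts.items():
        if count < min_count:
            continue
        p_domain = (count + 1) / (d_total + vocab_size)
        p_general = (g_counts[term] + 1) / (g_total + vocab_size)
        scores[term] = math.log(p_domain / p_general)
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def to_feature_vector(doc, terms):
    """Represent a document by the counts of the selected domain terms."""
    counts = Counter(doc.lower().split())
    return [counts[t] for t in terms]


# Toy corpora, invented for illustration only.
domain_docs = [
    "the gene expression in mouse embryo",
    "mutant mouse gene shows altered expression",
]
general_docs = [
    "the market shows the price of oil",
    "the team shows the result in the report",
]

terms = domain_terms(domain_docs, general_docs, top_k=3)
vector = to_feature_vector("gene expression in the mouse", terms)
```

Vectors built this way could then be passed to any SVM implementation as training features. Common words such as "the" score low because they are at least as frequent in the general corpus as in the domain corpus.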
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journals (北大核心)
2007, Issue 5, pp. 895-899 (5 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60305006)
Keywords
text classification
document representation
feature selection
domain-specific