Abstract
Most current Chinese text categorization algorithms use words, or mappings of words, as feature terms and ignore the syntactic and semantic characteristics of Chinese, which limits classification performance. To address this, a sense-group-based categorization algorithm for Chinese text is proposed. Rules derived from Chinese dependency parsing results are used to extract sense groups, which then serve as the feature terms representing a text. A Support Vector Machine (SVM) is trained on the training set, and the resulting category sense group library is used to classify test documents. Experimental results show that, compared with word-based categorization, the sense group algorithm improves classification performance by 3 percentage points on average and reaches an average precision of 97%.
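The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the extraction rules nor the SVM setup, so the relation set `GROUP_RELATIONS` (LTP-style labels chosen only for illustration), the function names, and the toy dependency triples are all assumptions. A simple count-overlap score stands in for the SVM step so that the example runs without third-party libraries.

```python
from collections import Counter, defaultdict

# Hypothetical rule: only these dependency relations form a sense group.
# The paper's actual extraction rules are not stated in the abstract.
GROUP_RELATIONS = {"SBV", "VOB", "ATT"}

def extract_sense_groups(dependencies):
    """Turn (head, relation, dependent) triples into head_dependent features."""
    return [f"{head}_{dep}" for head, rel, dep in dependencies
            if rel in GROUP_RELATIONS]

def build_library(labeled_docs):
    """Count sense groups per category, giving {category: Counter}."""
    library = defaultdict(Counter)
    for category, dependencies in labeled_docs:
        library[category].update(extract_sense_groups(dependencies))
    return library

def classify(library, dependencies):
    """Pick the category whose library overlaps most with the document.

    This overlap score is a stand-in for the SVM used in the paper.
    """
    groups = extract_sense_groups(dependencies)
    scores = {cat: sum(counts[g] for g in groups)
              for cat, counts in library.items()}
    return max(scores, key=scores.get)

# Toy training data: hand-written dependency triples, not real parser output.
train = [
    ("sports", [("赢得", "SBV", "球队"), ("赢得", "VOB", "比赛"),
                ("赢得", "ADV", "昨天")]),
    ("finance", [("上涨", "SBV", "股价"), ("股价", "ATT", "银行")]),
]
library = build_library(train)
label = classify(library, [("赢得", "VOB", "比赛"), ("比赛", "ATT", "足球")])
print(label)
```

In a fuller version, the triples would come from a dependency parser and the sense-group counts would be vectorized and passed to an SVM classifier rather than compared by raw overlap.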
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
2013, No. 8, pp. 204-207, 214 (5 pages)
Funding
Key Project of the National High Technology Research and Development Program of China (863 Program) (2009AA01Z433)
Keywords
text categorization
sense group
Support Vector Machine (SVM)
semantic concept
dependency parsing
category sense group library