摘要
研究了中文文本分类中的文本表示方法,提出了对中文文本表示因素的分析框架,并通过对3个数据集实验结果的分析,确定了各种文本表示因素对分类效果的影响.直接使用汉字进行划分也可以获得较好的分类效果;简单的不使用很大词库的分词和使用大词库的分词,以及复杂的分词对分类效果影响不大;仅使用01表示特征是否出现也可以获得比较好的分类效果;采用综合了合理的向量取值(如使用合适的归一化算法)可以较大幅度地提高分类准确率等.这些结论为后续的应用提供了指导原则.
We investigated the representation methods for text classification, proposed the framework of analyzing Chinese text representation algorithms, analyzed the influence of text representation, and obtained the influence of variable text representation factors on classification effect. Using Chinese characters can directly obtain better effect than expected ; there is little difference on classification effect among splitting articles with smaller or huger dictionary or even by complicated splitting algorithm; and classification with only 01 to represent whether a feature is presented in a text or not can lead to not bad effect. We also found it can greatly improve classification effect to use reasonable vector value such as suitable formalization algorithm. These conclusions have provided instructions to contifurther applications.
出处
《中国科学院研究生院学报》
CAS
CSCD
北大核心
2009年第3期400-407,共8页
Journal of the Graduate School of the Chinese Academy of Sciences
基金
国家863研究计划(2006AA01Z454)项目资助
关键词
中文文本分类
文本表示
向量化
Chinese text classification, text presentation, vectorization