Abstract
This paper analyzes the causes of misclassification in text categorization from the perspective of information granularity and, drawing on human cognitive style, proposes an information-granularity-based method for classifying texts from overlapping classes. The method transforms the granularity space that describes the training set and repartitions the training samples, increasing the differences among them and thereby enriching the prior knowledge available for classification. Following the characteristics of human cognition, it then builds a hierarchical classifier on the repartitioned training set. Experiments on corpora from different domains and of different types quantify the effect of the degree of class overlap on classification performance and evaluate the new method. The results show that the method effectively improves classification performance, especially when the degree of class overlap is high.
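The abstract does not specify the granularity transformation or the base classifier, so the following is only a minimal sketch of the general idea under stated assumptions: overlapping fine-grained classes are merged by hand into a coarser class (the hypothetical `groups` mapping), a top-level classifier separates the coarse classes, and a second-level classifier separates the overlapping classes within each group. The nearest-centroid classifier, the class names, and the toy data are illustrative stand-ins and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class CentroidClassifier:
    """Nearest-centroid classifier (assumed stand-in for the base classifier)."""

    def fit(self, texts, labels):
        self.centroids = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.centroids[label].update(vectorize(text))
        return self

    def predict(self, text):
        vec = vectorize(text)
        return max(self.centroids, key=lambda c: cosine(vec, self.centroids[c]))


class TwoLevelClassifier:
    """Coarse-to-fine classifier over hand-specified groups of overlapping classes."""

    def __init__(self, groups):
        # groups: hypothetical mapping {coarse label: set of overlapping fine labels}
        self.fine_to_coarse = {f: g for g, fines in groups.items() for f in fines}

    def fit(self, texts, labels):
        # Level 1: relabel samples at the coarser granularity.
        coarse = [self.fine_to_coarse.get(lab, lab) for lab in labels]
        self.top = CentroidClassifier().fit(texts, coarse)
        # Level 2: one classifier per group, trained only on that group's samples.
        self.sub = {}
        for group in set(self.fine_to_coarse.values()):
            pairs = [(t, lab) for t, lab in zip(texts, labels)
                     if self.fine_to_coarse.get(lab) == group]
            if pairs:
                sub_texts = [t for t, _ in pairs]
                sub_labels = [lab for _, lab in pairs]
                self.sub[group] = CentroidClassifier().fit(sub_texts, sub_labels)
        return self

    def predict(self, text):
        coarse = self.top.predict(text)
        if coarse in self.sub:
            return self.sub[coarse].predict(text)
        return coarse


# Toy data: "football" and "basketball" share vocabulary (overlapping classes).
train_texts = [
    "team match score goal striker",
    "team match score dunk rebound",
    "stock market bank interest",
]
train_labels = ["football", "basketball", "finance"]
model = TwoLevelClassifier({"sports": {"football", "basketball"}})
model.fit(train_texts, train_labels)
print(model.predict("team score dunk"))
```

In this sketch the grouping is supplied manually; how the paper derives the repartition from the granularity space is not stated in the abstract and is left open here.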
Source
《情报学报》
CSSCI
PKU Core Journals (北大核心)
2011, No. 4, pp. 339-346 (8 pages)
Journal of the China Society for Scientific and Technical Information
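Funding
Supported by the National 863 Program project "Research on Key Technologies for Online Public Opinion Situation Analysis and Early Warning" (网络舆情态势分析与预警关键技术研究)
Keywords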
information granularity
text categorization
cognitive style