Abstract
The Chinese news subject classification specification defines a very large number of categories (thousands of subjects). Applying it directly to news classification leads to large models and long training times, and the whole model must be retrained whenever some of the categories change. Because the categories in the specification are organized hierarchically, hierarchical classification is a natural remedy. This paper studies hierarchical classification of Chinese news and improves it in two ways: 1) per-layer feature calculation, which recomputes the feature vector of a news item at each layer so that it is represented appropriately with respect to the categories being distinguished there; 2) error control, which addresses the case where a misclassification at an upper layer prevents a news item from reaching its correct category. Experiments show that hierarchical classification is about 4% more accurate than flat classification, that hierarchical classification with repeated feature-weight calculation is about 3% more accurate than plain hierarchical classification, and that hierarchical classification with error control likewise improves on plain hierarchical classification by roughly 3%.
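The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: a nearest-centroid classifier stands in for the support vector machine named in the keywords, TF-IDF stands in for the unspecified feature-weighting scheme, and "error control" is approximated by descending into the top `beam` upper-layer candidates instead of only the best one. The names `NodeClassifier`, `HierarchicalClassifier`, `beam`, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf(tokens, idf):
    """Return an L2-normalised TF-IDF vector (a dict) for one tokenised document."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


class NodeClassifier:
    """Nearest-centroid classifier for one node of the category tree.

    IDF weights are computed only from the documents that reach this node,
    so the same news item gets a different feature vector at each layer.
    """

    def __init__(self, docs, labels):
        n = len(docs)
        df = Counter(t for d in docs for t in set(d))
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        sums = defaultdict(lambda: defaultdict(float))
        for d, y in zip(docs, labels):
            for t, w in tfidf(d, self.idf).items():
                sums[y][t] += w
        self.centroids = {}
        for y, c in sums.items():
            norm = math.sqrt(sum(w * w for w in c.values())) or 1.0
            self.centroids[y] = {t: w / norm for t, w in c.items()}

    def scores(self, tokens):
        """Cosine similarity between the document and each class centroid."""
        v = tfidf(tokens, self.idf)
        return {y: sum(w * c.get(t, 0.0) for t, w in v.items())
                for y, c in self.centroids.items()}


class HierarchicalClassifier:
    """Two-layer top-down classifier; samples are (tokens, top_label, leaf_label)."""

    def __init__(self, samples):
        self.root = NodeClassifier([s[0] for s in samples], [s[1] for s in samples])
        by_top = defaultdict(list)
        for tokens, top, leaf in samples:
            by_top[top].append((tokens, leaf))
        # One child classifier per top-level category, trained on its own documents only.
        self.children = {
            top: NodeClassifier([d for d, _ in items], [leaf for _, leaf in items])
            for top, items in by_top.items()
        }

    def classify(self, tokens, beam=2):
        """beam=1 is plain top-down classification; beam>1 is the assumed error control."""
        top_scores = self.root.scores(tokens)
        candidates = sorted(top_scores, key=top_scores.get, reverse=True)[:beam]
        best = None
        for top in candidates:
            for leaf, s in self.children[top].scores(tokens).items():
                combined = top_scores[top] * s
                if best is None or combined > best[0]:
                    best = (combined, top, leaf)
        return best[1], best[2]


# Hypothetical toy data, already tokenised.
TRAIN = [
    (["football", "match", "goal"], "sports", "football"),
    (["football", "league", "goal"], "sports", "football"),
    (["basketball", "match", "dunk"], "sports", "basketball"),
    (["basketball", "league", "dunk"], "sports", "basketball"),
    (["stock", "market", "shares"], "finance", "stocks"),
    (["stock", "price", "shares"], "finance", "stocks"),
    (["bank", "loan", "rate"], "finance", "banking"),
    (["bank", "market", "rate"], "finance", "banking"),
]
model = HierarchicalClassifier(TRAIN)
print(model.classify(["goal", "football", "league"]))
```

Because each node is trained only on its own subtree, changing the categories under one branch would require retraining just that branch's classifier, which is the maintenance benefit the abstract attributes to the hierarchical approach.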
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2010, No. 10, pp. 165-168, 180 (5 pages)
Funding
Supported by the National 973 Program of China (No. 2007CB310803)
Keywords
Hierarchical classification
Support vector machine
Chinese news subject classification specification
Feature calculation
Error control