Abstract
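To address the poor classification performance caused by the sparse features, high feature dimensionality, and uneven distribution of title text, this paper proposes a bidirectional feature selection algorithm that exploits the structural information of the classification system, and implements title classification on top of it. The method presupposes a classification system with a strict hierarchy. It uses the co-occurrence and distribution relationships between categories and words to select feature words and candidate categories bidirectionally, building a category vector space. By analyzing the category semantic information revealed by the distribution of a title's feature words across the hierarchical category vector space, it determines the level at which the text belongs and the candidate categories at that level. A classifier is then applied to the titles that could not be classified this way. Experiments on a manually indexed dataset show that, without corpus expansion or external knowledge bases, the method can still effectively determine the level of a text and classify across multiple levels of disciplines; by identifying category semantic information, it also reduces the number of candidate categories and improves classification efficiency.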
This paper proposes an efficient headline classification method that uses the structure of the classification system to address the poor classification performance caused by headlines' sparse features, high feature dimensionality, and uneven sample distribution. The method presupposes a classification system with a strict hierarchy. First, it uses a feature selection method based on hierarchical category information to build a multilayer vector space. Second, it analyzes the distribution of feature words in the vector space to determine the level at which a headline is located and the categories to which it belongs. Finally, it uses a classifier to classify the headlines that could not be classified in the previous step. Experiments on a manually indexed dataset show that the multilayer vector space can effectively determine the level at which a headline is located, achieve classification at multiple levels, and improve headline classification accuracy by identifying category semantic information.
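The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_share` threshold, the share-based scoring, and the toy data are all assumptions made for illustration. It builds one word-to-category map per hierarchy level from co-occurrence counts, keeps only words concentrated in a category, and then looks for the deepest level at which a title's words give evidence for some category.

```python
from collections import Counter, defaultdict


def build_level_spaces(samples, min_share=0.6):
    """Build one {word: {category: share}} map per hierarchy level.

    samples: list of (tokens, path), where path is a tuple of category
    names from the top level down. min_share is an assumed threshold:
    a (word, category) pair is kept only if that category accounts for
    at least this share of the word's occurrences at that level.
    """
    depth = max(len(path) for _, path in samples)
    spaces = []
    for level in range(depth):
        cooc = defaultdict(Counter)  # word -> category -> co-occurrence count
        for tokens, path in samples:
            if level < len(path):
                for word in set(tokens):
                    cooc[word][path[level]] += 1
        space = {}
        for word, cats in cooc.items():
            total = sum(cats.values())
            kept = {c: n / total for c, n in cats.items() if n / total >= min_share}
            if kept:  # words with no dominant category are dropped as features
                space[word] = kept
        spaces.append(space)
    return spaces


def candidate_categories(tokens, spaces):
    """Return (level, ranked candidates) for the deepest level with evidence.

    Returns (None, []) when no feature word matches at any level, which is
    where a conventional classifier would take over.
    """
    for level in range(len(spaces) - 1, -1, -1):
        scores = Counter()
        for word in set(tokens):
            for cat, share in spaces[level].get(word, {}).items():
                scores[cat] += share
        if scores:
            return level, [cat for cat, _ in scores.most_common()]
    return None, []


# Toy, hand-made training titles (hypothetical data).
samples = [
    (["neural", "network", "training"], ("Engineering", "Computer Science")),
    (["network", "protocol", "design"], ("Engineering", "Computer Science")),
    (["bridge", "design", "load"], ("Engineering", "Civil Engineering")),
    (["protein", "folding"], ("Science", "Biology")),
]
spaces = build_level_spaces(samples)

print(candidate_categories(["neural", "network"], spaces))  # (1, ['Computer Science'])
print(candidate_categories(["design"], spaces))             # (0, ['Engineering'])
print(candidate_categories(["unknown"], spaces))            # (None, [])
```

In this toy data the word "design" is split evenly between two second-level categories, so it is dropped as a second-level feature and the title falls back to the first level. Titles with no matching feature words return no level at all, which corresponds to the case the abstract hands to a classifier.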
Source
《计算机应用研究》
CSCD
Peking University Core Journals (北大核心)
2016, Issue 7, pp. 2030-2033 (4 pages)
Application Research of Computers
Funding
Project supported by a provincial- or ministerial-level laboratory open fund (B2014)
Keywords
headline classification
feature selection
hierarchical classification system
co-occurrence analysis
vector space