Abstract
Distance metric learning is widely used in classification, but it often fails to achieve the expected results when applied to text classification. The reason is that text classification usually represents documents with the vector space model (VSM) and TF*IDF weighting, so every text vector has the same dimension and is normalized, while traditional distance metric learning uses Euclidean distance as its similarity criterion, and that value is very sensitive to the value of each dimension. Inspired by the LMNN algorithm in distance metric learning, a cosine distance metric learning algorithm, called CS-LMNN, is proposed to adapt LMNN to text classification. Because class skew is common in text classification training data, a pseudo-K-nearest-neighbor classification algorithm is combined with CS-LMNN to perform the classification. The algorithm first uses CS-LMNN to learn a distance metric on the training data and then, based on the learned result, classifies the test data with the pseudo-K-nearest-neighbor algorithm. Experimental results show that the algorithm can effectively improve classification accuracy.
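The classification stage described in the abstract can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the abstract does not give the CS-LMNN objective or the exact pseudo-K-nearest-neighbor rule, so the learning step is omitted and `L` stands in for a hypothetical learned linear transformation (identity when `None`). The per-class rule shown here, a sum of the k smallest within-class distances weighted by 1/i, is one common form of the pseudo nearest neighbor rule and is assumed rather than taken from the source. All names and the toy data are hypothetical.

```python
import numpy as np


def cosine_distance(x, y, L=None):
    """Return 1 - cos(Lx, Ly). L is a hypothetical learned linear map."""
    if L is not None:
        x, y = L @ x, L @ y
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        return 1.0  # convention chosen here for zero vectors
    return 1.0 - float(x @ y) / denom


def pseudo_knn_predict(X_train, y_train, x, k=2, L=None):
    """Assumed pseudo-K-nearest-neighbor rule.

    For each class, take the k training points of that class closest to x
    and sum their distances with weights 1, 1/2, ..., 1/k. The class with
    the smallest weighted sum is predicted. Because every class contributes
    the same number of neighbors, a large class cannot win by count alone.
    Assumes every class has at least k training samples.
    """
    best_label, best_score = None, np.inf
    for label in np.unique(y_train):
        members = X_train[y_train == label]
        dists = np.sort([cosine_distance(m, x, L) for m in members])[:k]
        weights = 1.0 / np.arange(1, len(dists) + 1)
        score = float(weights @ dists)
        if score < best_score:
            best_label, best_score = label, score
    return best_label


# Toy, skewed training set: four samples of class 0, two of class 1.
X_train = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.1, 0.2],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.1],
])
y_train = np.array([0, 0, 0, 0, 1, 1])

query = np.array([0.2, 0.8, 0.0])
print(pseudo_knn_predict(X_train, y_train, query, k=2))
```

In this toy run the query points mostly along the second axis, so the minority class has the smaller weighted distance even though it has fewer training samples. Replacing `L=None` with a matrix learned by a metric learning step would change only the distance computation, not the decision rule.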
Source
《计算机工程与设计》
CSCD
Peking University Core Journals (北大核心)
2013, No. 6, pp. 2200-2203, 2211 (5 pages in total)
Computer Engineering and Design
Funding
National High Technology Research and Development Program of China (863 Program), Grant No. 2011AA040605
Keywords
cosine
distance metric learning
pseudo-K-nearest neighbor
text classification
vector space model