
Pseudo-K-nearest neighbor text classification algorithm based on cosine distance metric learning (cited by: 19)
Abstract Distance metric learning is widely used in classification, but it often fails to deliver the expected results when applied to text classification. Text is usually represented with the TF*IDF scheme of the vector space model (VSM), which produces normalized vectors of identical dimension, whereas traditional distance metric learning uses Euclidean distance as its similarity criterion, and the value of that distance is very sensitive to the value in each dimension. Inspired by the LMNN algorithm from distance metric learning, this paper proposes a cosine distance metric learning algorithm adapted to text classification, called CS-LMNN. Because class skew is common among samples in text classification, a pseudo-K-nearest neighbor classification algorithm is combined with CS-LMNN to perform the classification. The algorithm first applies CS-LMNN to the training data to learn a distance metric, then classifies the test data with the pseudo-K-nearest neighbor algorithm according to the training result. Experimental results show that the algorithm effectively improves classification accuracy.
Source Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2013, Issue 6, pp. 2200-2203, 2211 (5 pages)
Funding National High Technology Research and Development Program of China (863 Program) (2011AA040605)
Keywords cosine; distance metric learning; pseudo-K-nearest neighbor; text classification; vector space model
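
The classification step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the pseudo-K-nearest neighbor rule, so the version below is an assumption: for each class, the k nearest training samples of that class are found, their distances are combined with weights 1/i, and the class with the smallest weighted distance wins. The names `cosine_distance`, `pseudo_knn_predict`, and the matrix `L` (a stand-in for a linear transformation that a method such as CS-LMNN might learn) are hypothetical and not taken from the paper; no metric learning is performed here.

```python
import numpy as np


def cosine_distance(a, b, L=None):
    """Return 1 - cos(La, Lb). L is a hypothetical learned linear map (optional)."""
    if L is not None:
        a, b = L @ a, L @ b
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 1.0  # assumption: an all-zero vector is treated as dissimilar
    return 1.0 - float(a @ b) / denom


def pseudo_knn_predict(x, X_train, y_train, k=3, L=None):
    """Assumed pseudo-nearest-neighbor rule, not the paper's exact procedure.

    Every class contributes its own k nearest samples, so a large class
    cannot outvote a small one simply by having more members.
    """
    best_label, best_score = None, np.inf
    for label in np.unique(y_train):
        members = X_train[y_train == label]
        # Distances to the k nearest samples of this class, in ascending order.
        dists = np.sort([cosine_distance(x, m, L) for m in members])[:k]
        weights = 1.0 / np.arange(1, len(dists) + 1)
        # Dividing by the weight sum keeps classes with fewer than k samples comparable.
        score = float(weights @ dists) / float(weights.sum())
        if score < best_score:
            best_label, best_score = label, score
    return best_label


# Toy, skewed training set: three samples of class 0, one of class 1.
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]])
y_train = np.array([0, 0, 0, 1])
print(pseudo_knn_predict(np.array([0.1, 0.9]), X_train, y_train, k=2))
```

In this toy set a plain majority vote over the three nearest neighbors would always include at least two class-0 samples, whereas the per-class rule compares each class on its own nearest members.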

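References (9)

  • 1 Kilian Q Weinberger, Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification [J]. Journal of Machine Learning Research, 2009, 10: 207-244.
  • 2 Liu Yang, Rong Jin. Distance metric learning: A comprehensive survey [R]. Technical Report. Department of Computer Science and Engineering, Michigan State University, 2006.
  • 3 Xiong Zhongyang, Yang Yinghui, Zhang Yufang. Improvement of a density-based training-sample pruning method for kNN classifiers [J]. Journal of Computer Applications, 2010, 30(3): 799-801. Cited by: 13
  • 4 Zeng Yong, Yang Yupu. Research on generalized nearest-neighbor pattern classification [D]. Shanghai: Shanghai Jiao Tong University, 2009.
  • 5 Zhang Hailong, Wang Lianzhi. Research on feature selection methods for automatic text classification [J]. Computer Engineering and Design, 2006, 27(20): 3840-3841. Cited by: 45
  • 6 Li Xiaohong. Feature word extraction methods in Chinese text classification [J]. Computer Engineering and Design, 2009, 30(17): 4127-4129. Cited by: 16
  • 7 Weinberger K, Chapelle O. Large margin taxonomy embedding with an application to document categorization [C] // Advances in Neural Information Processing Systems 21. Vancouver, British Columbia, Canada, 2009: 1737-174.
  • 8 Jiao Qingzheng, Wei Chengjian. Text classification method that adjusts the probability standard deviation with distribution weights [J]. Journal of Computer Applications, 2009, 29(12): 3303-3306. Cited by: 2
  • 9 Weinberger K Q, Saul L K. Distance metric learning for large margin nearest neighbor classification [J]. The Journal of Machine Learning Research, 2009, 10: 207-244.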

Secondary references: 39

Co-citing literature: 78

Co-cited literature: 130

Citing literature: 19

Secondary citing literature: 50
