Fast KNN classification algorithm under big data (cited by: 29)
Abstract: The testing complexity of the K-nearest neighbor (KNN) algorithm is at least linear, which makes it very inefficient on big data samples. To address this problem, the paper proposes a fast KNN classification algorithm for big data. The algorithm introduces a training process into KNN: a clustering method with linear complexity partitions the big data samples into blocks. During testing, the algorithm finds the block nearest to the test sample and uses it as the new training set for KNN classification. This greatly reduces the testing overhead of KNN and makes it applicable to large data sets. Experiments show that the proposed algorithm classifies markedly faster than classical KNN while keeping its classification accuracy close to that of classical KNN.
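The two stages described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the linear-complexity clustering method, so a few iterations of plain k-means stand in for it here, and the function names `build_blocks` and `fast_knn_predict`, the Euclidean distance, and the majority vote are all assumptions made for the example.

```python
import numpy as np
from collections import Counter


def build_blocks(X, n_blocks, n_iter=10, seed=0):
    """Training stage: split the samples into blocks.

    Assumption: a fixed number of k-means iterations stands in for the
    linear-complexity clustering method, which the abstract does not name.
    """
    rng = np.random.default_rng(seed)
    # Start from n_blocks distinct training samples as initial centers.
    centers = X[rng.choice(len(X), size=n_blocks, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every sample to every center: (n, n_blocks).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(n_blocks):
            members = X[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # Final assignment against the returned centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    # Drop centers that ended up with no samples, so every block is non-empty.
    keep, assign = np.unique(assign, return_inverse=True)
    return centers[keep], assign


def fast_knn_predict(x, X, y, centers, assign, k=3):
    """Testing stage: run KNN only inside the block nearest to x."""
    nearest = ((centers - x) ** 2).sum(axis=1).argmin()
    idx = np.flatnonzero(assign == nearest)       # samples in that block
    d2 = ((X[idx] - x) ** 2).sum(axis=1)
    top = idx[np.argsort(d2)[:k]]                 # k nearest within the block
    # Majority vote over the labels of those neighbors.
    return Counter(y[top].tolist()).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: two well-separated groups of four points each.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    centers, assign = build_blocks(X, n_blocks=2)
    print(fast_knn_predict(np.array([0.5, 0.5]), X, y, centers, assign))
    print(fast_knn_predict(np.array([10.5, 10.5]), X, y, centers, assign))
```

In this sketch each query is compared against the block centers and then only against the samples of one block, rather than against every training sample. The trade-off is that a true nearest neighbor lying just across a block boundary is missed, which is consistent with the abstract's claim of accuracy that is close to, rather than identical to, classical KNN.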
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2016, No. 4, pp. 1003-1006, 1023 (5 pages)
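Funding: National Natural Science Foundation of China (61450001, 61263035, 61573270); National "863" Program (2012AA011005); National "973" Program (2013CB329404); Guangxi Natural Science Foundation (2012GXNSFGA060004, 2014jj AA70175, 2015GXNSFAA139306, 2015GXNSFCB13901); Guangxi Bagui Innovation Team; Guangxi Hundred Talents Program; Key Project of Scientific and Technological Research of Guangxi Universities (2013ZD04)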
Keywords: K-nearest neighbor (KNN); testing complexity; big data; blocking; cluster centers