Abstract
The testing complexity of the K-nearest neighbor (KNN) algorithm is at least linear in the number of training samples, which makes it inefficient on large data sets. To address this problem, this paper proposes a fast KNN classification algorithm for big data. The algorithm introduces a training stage into KNN: a clustering method of linear complexity partitions the training samples into blocks. At test time, the algorithm finds the block nearest to the test sample and uses only that block as the training set for KNN classification. This greatly reduces the testing overhead of KNN and makes it applicable to large data sets. Experimental results show that the proposed algorithm achieves accuracy close to that of classical KNN while classifying significantly faster.
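The two-stage procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering method, so a fixed number of k-means passes (each linear in the number of samples) stands in for it, and the names `fit_blocks`, `predict`, `n_blocks`, and `n_iter` are hypothetical.

```python
import math
from collections import Counter


def nearest_index(point, candidates):
    """Return the index of the candidate closest to `point` (Euclidean)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


def fit_blocks(X, y, n_blocks, n_iter=10):
    """Training stage: partition (X, y) into blocks.

    Assumption: k-means with a fixed number of passes stands in for the
    linear-complexity clustering; requires n_blocks <= len(X).
    """
    # Deterministic initialisation from evenly spaced samples (an assumption).
    step = max(1, len(X) // n_blocks)
    centers = [list(X[i * step]) for i in range(n_blocks)]
    assign = [0] * len(X)
    for _ in range(n_iter):
        # Each pass costs O(len(X) * n_blocks) distance evaluations.
        assign = [nearest_index(x, centers) for x in X]
        for b in range(n_blocks):
            members = [X[i] for i in range(len(X)) if assign[i] == b]
            if members:  # keep the old center if a block is empty
                centers[b] = [sum(col) / len(members) for col in zip(*members)]
    blocks = [[] for _ in range(n_blocks)]
    for x, label, b in zip(X, y, assign):
        blocks[b].append((x, label))
    return centers, blocks


def predict(query, centers, blocks, k=3):
    """Test stage: run KNN only inside the block with the nearest center."""
    non_empty = [b for b in range(len(blocks)) if blocks[b]]
    b = min(non_empty, key=lambda i: math.dist(query, centers[i]))
    neighbours = sorted(blocks[b], key=lambda s: math.dist(query, s[0]))[:k]
    # Majority vote among the k nearest samples of the selected block.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


# Toy data: two well-separated groups (illustrative only).
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
y = ["a", "a", "a", "b", "b", "b"]
centers, blocks = fit_blocks(X, y, n_blocks=2)
print(predict((0.05, 0.1), centers, blocks))
print(predict((5.0, 5.1), centers, blocks))
```

With this design a query is compared against `n_blocks` centers plus the samples of a single block, rather than against every training sample. The trade-off is that true neighbours lying in an adjacent block are missed, which is consistent with the abstract's claim of accuracy that is close to, rather than identical with, classical KNN.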
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journal
2016, No. 4, pp. 1003-1006, 1023 (5 pages)
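Funding
National Natural Science Foundation of China (61450001, 61263035, 61573270)
National High Technology Research and Development Program of China (863 Program) (2012AA011005)
National Basic Research Program of China (973 Program) (2013CB329404)
Guangxi Natural Science Foundation (2012GXNSFGA060004, 2014jj AA70175, 2015GXNSFAA139306, 2015GXNSFCB13901)
Guangxi Bagui Innovation Team
Guangxi Hundred Talents Program and Key Project of Science and Technology Research for Guangxi Universities (2013ZD04)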
Keywords
K-nearest neighbor (KNN)
testing complexity
big data
blocking
cluster centers