Abstract
To address the shortcomings of the traditional k-nearest neighbor (K-NN) classification algorithm, namely its large computational cost and low processing efficiency on high-dimensional massive data sets, this paper rewrites the Map and Reduce functions on the Hadoop platform using the MapReduce distributed programming model, and extends traditional K-NN with principal component analysis of the data set and distance weighting when predicting data in the critical region. First, principal component analysis is applied to high-dimensional data to reduce its dimensionality and thus improve running efficiency. Second, the concepts of a complete region and a critical region are introduced in the prediction stage; in the critical region, the k nearest neighbors belonging to n classes are distance-weighted, improving accuracy. Finally, running the algorithm in a Hadoop cluster environment further improves its efficiency on massive data. Experimental results show that the algorithm greatly improves computational efficiency and accuracy when processing massive data.
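The two core ingredients of the method, PCA-based dimensionality reduction and distance-weighted K-NN voting, can be sketched as follows. This is a minimal single-machine illustration, not the paper's Hadoop/MapReduce implementation; all function and variable names are ours:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project data onto its top principal components (dimensionality reduction)."""
    Xc = X - X.mean(axis=0)
    # Eigen-decomposition of the covariance matrix; keep the largest eigenvalues
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]

def weighted_knn_predict(X_train, y_train, x, k=3):
    """Distance-weighted K-NN: closer neighbors cast larger votes (weight 1/d)."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    votes = {}
    for i in idx:
        w = 1.0 / (d[i] + 1e-9)  # small epsilon avoids division by zero
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    # Return the class with the largest accumulated weight
    return max(votes, key=votes.get)
```

In the paper's scheme this weighted vote is applied only to query points falling in the critical region (where neighbor classes are mixed); points in the complete region take the single surrounding class directly.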
Authors
蒋华
韩飞
王鑫
王慧娇
JIANG Hua;HAN Fei;WANG Xin;WANG Hui-jiao(School of Computer and Information Security,Guilin University of Electronic Technology,Guilin 541000,China)
Source
《微电子学与计算机》
CSCD
Peking University Core Journal (北大核心)
2018, No. 10, pp. 36-40, 45 (6 pages)
Microelectronics & Computer
Funding
2016 Guangxi Universities Young and Middle-aged Teachers Basic Ability Improvement Project (ky2016YB150)
Guilin University of Electronic Technology Graduate Education Innovation Program (2017YJCX48)