Big Data Classification Algorithm Based on MapReduce to Improve K-NN

Cited by: 8
Abstract: The traditional k-nearest neighbor (K-NN) classification algorithm is computationally expensive and handles high-dimensional, massive data sets inefficiently. To address these shortcomings, this paper rewrites the Map and Reduce functions under the MapReduce distributed programming model on the Hadoop platform, and proposes two refinements to traditional K-NN: principal component analysis (PCA) of the data set, and distance weighting when predicting data in the critical region. First, PCA is applied to the high-dimensional data to reduce its dimensionality and thereby improve running efficiency. Second, the concepts of a complete region and a critical region are introduced in the prediction stage; in the critical region, the k nearest neighbors, which span n classes, are distance-weighted to improve accuracy. Finally, the algorithm is run in a Hadoop cluster environment to further improve its efficiency on massive data. Experimental results show that the algorithm substantially improves both computational efficiency and accuracy when processing massive data.
Authors: JIANG Hua (蒋华); HAN Fei (韩飞); WANG Xin (王鑫); WANG Hui-jiao (王慧娇) (School of Computer and Information Security, Guilin University of Electronic Technology, Guilin 541000, China)
Source: Microelectronics & Computer (《微电子学与计算机》), CSCD, Peking University Core Journal, 2018, No. 10, pp. 36-40, 45 (6 pages)
Funding: 2016 Guangxi Universities Young and Middle-aged Teachers' Basic Ability Improvement Project (ky2016YB150); Guilin University of Electronic Technology Graduate Education Innovation Program Project (2017YJCX48)
Keywords: big data; MapReduce; K-NN; critical area; PCA; distance weighting
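
The abstract names the ingredients of the method but not its formulas, so the following is only a minimal sketch of how the MapReduce-style neighbor search and the critical-region distance weighting might fit together. It is not the authors' implementation. The function names (`map_local_knn`, `reduce_global_knn`, `classify`) are hypothetical, and two details are assumptions made for illustration: a query is treated as lying in the "complete region" when all k neighbors share one class, and neighbors in the "critical region" are weighted by inverse distance. The PCA step is omitted, and the splits are plain Python lists rather than Hadoop input splits.

```python
import heapq
import math
from collections import defaultdict


def map_local_knn(split, query, k):
    """Map step: k nearest (distance, label) pairs within one data split."""
    candidates = [(math.dist(features, query), label) for features, label in split]
    return heapq.nsmallest(k, candidates)


def reduce_global_knn(local_results, k):
    """Reduce step: merge the per-split candidates into the global k nearest."""
    merged = [pair for result in local_results for pair in result]
    return heapq.nsmallest(k, merged)


def classify(neighbors, eps=1e-9):
    """Label a query from its k nearest (distance, label) pairs."""
    labels = {label for _, label in neighbors}
    if len(labels) == 1:
        # Assumed "complete region": every neighbor agrees, no weighting needed.
        return labels.pop()
    # Assumed "critical region": accumulate inverse-distance weights per class.
    weights = defaultdict(float)
    for distance, label in neighbors:
        weights[label] += 1.0 / (distance + eps)
    return max(weights, key=weights.get)


# Toy data: two splits stand in for blocks processed by separate mappers.
splits = [
    [((0.0, 0.0), "a"), ((5.0, 5.0), "a")],
    [((1.0, 0.0), "b"), ((1.0, 1.0), "b")],
]
query = (0.1, 0.0)
k = 3

local_results = [map_local_knn(split, query, k) for split in splits]
neighbors = reduce_global_knn(local_results, k)
predicted = classify(neighbors)
print(predicted)
```

In this toy case the three nearest neighbors are one very close "a" point and two more distant "b" points, so an unweighted majority vote would pick "b" while the inverse-distance weighting picks "a". This is the kind of boundary case that distance weighting in a critical region is meant to handle.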