Abstract
To address the shortcomings of the traditional k-nearest neighbor (K-NN) classification algorithm, namely its large computational cost and low processing efficiency on high-dimensional massive data sets, this paper rewrites the Map and Reduce functions on the Hadoop platform using the MapReduce distributed programming model, and extends traditional K-NN with principal component analysis of the data set and distance weighting when predicting data in the critical region. First, principal component analysis is applied to high-dimensional data to reduce its dimensionality and thus improve running efficiency. Second, the concepts of a complete region and a critical region are introduced in the prediction stage; in the critical region, the k nearest neighbors belonging to n classes are distance-weighted, improving accuracy. Finally, running the algorithm in a Hadoop cluster environment further improves its efficiency on massive data. Experimental results show that the algorithm greatly improves computational efficiency and accuracy when processing massive data.
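The two core ingredients of the method, PCA-based dimensionality reduction and distance-weighted K-NN voting, can be sketched as follows. This is a minimal single-machine illustration, not the paper's Hadoop/MapReduce implementation; all function and variable names are ours:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project data onto its top principal components (dimensionality reduction)."""
    Xc = X - X.mean(axis=0)
    # Eigen-decomposition of the covariance matrix; keep the largest eigenvalues
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]

def weighted_knn_predict(X_train, y_train, x, k=3):
    """Distance-weighted K-NN: closer neighbors cast larger votes (weight 1/d)."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    votes = {}
    for i in idx:
        w = 1.0 / (d[i] + 1e-9)  # small epsilon avoids division by zero
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    # Return the class with the largest accumulated weight
    return max(votes, key=votes.get)
```

In the paper's scheme this weighted vote is applied only to query points falling in the critical region (where neighbor classes are mixed); points in the complete region take the single surrounding class directly.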
Authors
蒋华
韩飞
王鑫
王慧娇
JIANG Hua;HAN Fei;WANG Xin;WANG Hui-jiao(School of Computer and Information Security,Guilin University of Electronic Technology,Guilin 541000,China)
Source
《微电子学与计算机》
CSCD
Peking University Core Journal (北大核心)
2018, No. 10, pp. 36-40, 45 (6 pages)
Microelectronics & Computer
Funding
2016 Guangxi Universities Young and Middle-aged Teachers Basic Ability Improvement Project (ky2016YB150)
Guilin University of Electronic Technology Graduate Education Innovation Program (2017YJCX48)