Abstract
Condensed Nearest Neighbors (CNN) is an instance selection algorithm proposed by Hart for K-Nearest Neighbors (K-NN); its purpose is to reduce the memory requirements and computational burden of K-NN. In the worst case, however, the time complexity of CNN is O(n^3), where n is the number of instances in the training set. When CNN is applied in a big data setting, this high time complexity becomes the bottleneck of its application. To address this problem, this paper proposes a parallelized CNN algorithm based on MapReduce. The parallelized CNN is implemented in the Hadoop environment and compared experimentally with the original CNN algorithm on 6 data sets. The experimental results show that the proposed algorithm is effective and efficient, and that it can overcome the problem described above.
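For readers unfamiliar with the condensing step, the following is a minimal single-machine sketch of a Hart-style CNN procedure, not the MapReduce implementation evaluated in the paper. It assumes 1-NN classification with Euclidean distance; the function names `nearest_label` and `condense` are hypothetical and chosen only for illustration. The loop structure shows where the O(n^3) worst case comes from: up to n passes, each over n instances, each compared against up to n stored instances.

```python
import math

def nearest_label(store, x):
    """Return the label of the stored instance closest to x (1-NN, Euclidean)."""
    best = min(store, key=lambda item: math.dist(item[0], x))
    return best[1]

def condense(train):
    """Hart-style condensing (illustrative sketch, assumptions noted above).

    train is a list of (feature_tuple, label) pairs. An instance is added to
    the store whenever the current store misclassifies it; passes repeat
    until a full pass adds nothing.
    """
    store = [train[0]]   # seed the store with the first instance
    kept = {0}           # indices already in the store, so nothing is added twice
    changed = True
    while changed:                              # at most n passes
        changed = False
        for i, (x, y) in enumerate(train):      # n instances per pass
            if i in kept:
                continue
            if nearest_label(store, x) != y:    # up to n distance computations
                store.append((x, y))
                kept.add(i)
                changed = True
    return store
```

Calling `condense` on a list of labeled points returns the retained subset, which can then stand in for the full training set when classifying with 1-NN.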
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2017, No. 12, pp. 2678-2682 (5 pages)
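Funding
Supported by the National Natural Science Foundation of China (71371063)
Supported by the Natural Science Foundation of Hebei Province (F2017201026)
Supported by a project of the Zhejiang Province Top Priority Discipline of Computer Science and Technology (Zhejiang Normal University)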
Keywords
condensed nearest neighbors
K-nearest neighbors
instance selection
MapReduce