Abstract
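To improve the ability of the k-nearest neighbor (KNN) algorithm to handle large data sets, this paper uses the MapReduce parallel programming model, together with the characteristics of the KNN algorithm itself, to give a parallel implementation of KNN on the Hadoop platform. The parallelization is realized through three functions: Map, Combine and Reduce. The Map function computes the similarity between each test sample and the training samples; the Combine function acts as a local Reduce operation that cuts intermediate computation and communication overhead; the Reduce function uses the intermediate results produced by these functions to determine the k nearest neighbors and make the classification decision. Experimental results show that, compared with the earlier single-machine approach, the parallel KNN algorithm implemented on a Hadoop cluster achieves good speedup and good scalability.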
To improve the ability of the KNN algorithm to process massive data, a new technique based on the Hadoop platform is used. Considering the characteristics of the KNN algorithm itself, the parallelization of KNN is implemented with the MapReduce programming model. Three functions are designed for this purpose: Map, Combine and Reduce. The Map function evaluates the similarity between each test instance and the training dataset. To reduce the computational cost and save network bandwidth, the Combine function is used as a local Reduce operation. The Reduce function produces the KNN classification from the intermediate results. Experiments on the Hadoop platform show that the method achieves excellent linear speedup as the number of compute nodes increases, as well as good scalability.
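The division of work among the Map, Combine and Reduce functions can be illustrated with a minimal in-process sketch. It is not the paper's Hadoop implementation: it assumes Euclidean distance as the similarity measure, a majority vote for the classification step, and a training set that is already divided into splits. All function names (`map_fn`, `combine_fn`, `reduce_fn`, `run_knn`) are hypothetical.

```python
import heapq
import math
from collections import Counter, defaultdict


def map_fn(train_split, test_set):
    """Map: emit (test_id, (distance, label)) for every test/training pair in one split.

    Assumption: Euclidean distance stands in for the similarity measure.
    """
    for test_id, test_vec in test_set:
        for train_vec, label in train_split:
            yield test_id, (math.dist(test_vec, train_vec), label)


def combine_fn(test_id, values, k):
    """Combine (local Reduce): keep only the k closest candidates of this split."""
    return test_id, heapq.nsmallest(k, values)


def reduce_fn(test_id, values, k):
    """Reduce: merge candidates from all splits, take the k nearest, and vote."""
    nearest = heapq.nsmallest(k, values)
    votes = Counter(label for _, label in nearest)
    return test_id, votes.most_common(1)[0][0]


def run_knn(train_splits, test_set, k):
    """Simulate the job in one process: map and combine per split, then reduce."""
    shuffled = defaultdict(list)
    for split in train_splits:
        local = defaultdict(list)
        for test_id, pair in map_fn(split, test_set):
            local[test_id].append(pair)
        for test_id, pairs in local.items():
            _, kept = combine_fn(test_id, pairs, k)
            # Only k candidates per test sample leave each split.
            shuffled[test_id].extend(kept)
    return dict(reduce_fn(t, v, k) for t, v in shuffled.items())
```

Because each split forwards at most k candidates per test sample, the data reaching the reduce step grows with the number of splits rather than with the size of the training set, which is the saving the Combine function is meant to provide.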
Source
《南京航空航天大学学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2013, No. 4, pp. 550-555 (6 pages)
Journal of Nanjing University of Aeronautics & Astronautics
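Funding
Supported by the National Natural Science Foundation of China (61173143)
Supported by the Natural Science Foundation of Jiangsu Province (BK2010380)
Supported by the China Postdoctoral Science Foundation (2012M511303)
Supported by the Priority Academic Program Development of Jiangsu Higher Education Institutions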