
Parallelization of Condensed Nearest Neighbor Algorithm with MapReduce (cited by: 1)
Abstract: Condensed Nearest Neighbor (CNN) is an instance selection algorithm proposed by Hart for K-Nearest Neighbors (K-NN), aiming to reduce the memory requirements and computational burden of K-NN. In the worst case, however, the computational time complexity of CNN is O(n³), where n is the number of instances in the training set. When CNN is applied in a big-data environment, this high time complexity becomes a bottleneck. To address this problem, this paper proposes a MapReduce-based parallelization of the CNN algorithm. The parallelized CNN is implemented in a Hadoop environment and compared experimentally with the original CNN on six data sets. The experimental results show that the proposed algorithm is both effective and efficient, and overcomes the problem described above.
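The abstract describes Hart's CNN rule and its MapReduce parallelization but gives no pseudocode. Below is a minimal pure-Python sketch of both ideas; the function names (`condense`, `parallel_condense`), the choice of seeding the condensed set with the first instance, and the map/reduce decomposition (condense each partition independently, then condense the union of the local results) are illustrative assumptions, not the paper's exact method.

```python
def condense(X, y, max_passes=10):
    """Hart's Condensed Nearest Neighbor (CNN) instance selection.

    Greedily builds a subset S of the training set such that every
    training instance is correctly classified by its 1-NN in S.
    The repeated nearest-neighbor scans give the O(n^3) worst case
    mentioned in the abstract.
    """
    def nn_label(x, subset):
        # Label of x's nearest neighbor (squared Euclidean) within subset.
        best = min(subset, key=lambda j: sum((a - b) ** 2 for a, b in zip(X[j], x)))
        return y[best]

    keep = [0]  # seed S with the first instance (an illustrative choice)
    for _ in range(max_passes):
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            if nn_label(X[i], keep) != y[i]:  # misclassified -> absorb into S
                keep.append(i)
                changed = True
        if not changed:  # a full pass with no absorptions: CNN has converged
            break
    return sorted(keep)


def parallel_condense(X, y, n_partitions=4):
    """Hypothetical MapReduce-style decomposition, simulated sequentially:
    map = run CNN on each data partition independently;
    reduce = union the local condensed sets and condense once more.
    The paper's actual Hadoop job layout may differ.
    """
    parts = [list(range(p, len(X), n_partitions)) for p in range(n_partitions)]
    union = []
    for idx in parts:  # "map" phase: each partition is independent
        local = condense([X[i] for i in idx], [y[i] for i in idx])
        union.extend(idx[j] for j in local)
    # "reduce" phase: condense the merged local prototypes
    final = condense([X[i] for i in union], [y[i] for i in union])
    return sorted(union[j] for j in final)
```

Condensing each partition locally is embarrassingly parallel, which is what makes the MapReduce mapping natural; the reduce-side condense then removes prototypes made redundant across partitions.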
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》, CSCD, Peking University core journal), 2017, No. 12, pp. 2678-2682 (5 pages)
Funding: Supported by the National Natural Science Foundation of China (71371063), the Natural Science Foundation of Hebei Province (F2017201026), and the Zhejiang Provincial Key Discipline of Computer Science and Technology (Zhejiang Normal University)
Keywords: condensed nearest neighbors; K-nearest neighbors; instance selection; MapReduce
