
Spark Based Condensed Nearest Neighbor Algorithm (基于Spark的压缩近邻算法)
Cited by: 2
Abstract K-nearest neighbors (K-NN) is a lazy learning algorithm: no classification model needs to be trained before it is used to classify data. K-NN is conceptually simple and easy to implement, but computationally expensive, because classifying a test instance requires computing the distance between that instance and every instance in the training set. The condensed nearest neighbors (CNN) algorithm can overcome this drawback of K-NN. However, CNN is inherently iterative, so its efficiency becomes very low on big data sets. To address this problem, this paper proposes an algorithm named Spark CNN. In a big data environment, Spark CNN is substantially more efficient than a MapReduce-based CNN; experiments on five big data sets confirm this conclusion.
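To make the iterative behavior described in the abstract concrete, below is a minimal NumPy sketch of the classic condensed nearest neighbor selection rule (Hart's CNN) that the paper builds on. This is an illustration only, not the paper's Spark implementation; the function name condensed_nearest_neighbor and the toy data are invented for this example.

import numpy as np

def condensed_nearest_neighbor(X, y, seed=0):
    # Classic CNN instance selection (Hart, 1968): iteratively grow a
    # "store" S; any sample misclassified by a 1-NN rule over the
    # current store is added to it. Returns indices of the condensed subset.
    rng = np.random.default_rng(seed)
    # Seed the store with one randomly chosen instance per class.
    store = [int(rng.choice(np.flatnonzero(y == c))) for c in np.unique(y)]
    changed = True
    while changed:                       # repeat full passes until stable
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            # 1-NN rule: distance from X[i] to every stored instance.
            dists = np.linalg.norm(X[store] - X[i], axis=1)
            nearest = store[int(np.argmin(dists))]
            if y[nearest] != y[i]:       # misclassified, so it must be kept
                store.append(i)
                changed = True
    return np.array(sorted(store))

# Toy usage on a small two-class dataset.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
print(condensed_nearest_neighbor(X, y))

The point relevant to the paper is the outer while loop: each pass depends on the store produced by the previous pass. This cross-iteration dependency is what makes a MapReduce formulation slow (each iteration re-reads data from disk) and is presumably what Spark's in-memory caching alleviates.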
Authors ZHANG Su-fang¹ (张素芳), ZHAI Jun-hai² (翟俊海), WANG Ting-ting² (王婷婷), HAO Pu² (郝璞), WANG Cong² (王聪), ZHAO Chun-ling² (赵春玲) (1. Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding, Hebei 071000, China; 2. Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China)
Source Computer Science (《计算机科学》; CSCD; Peking University core journal), 2018, No. B06, pp. 406-410 (5 pages)
Funding National Natural Science Foundation of China (71371063); Natural Science Foundation of Hebei Province (F2017201026); Natural Science Research Program of Hebei University (799207217071); Hebei University Undergraduate Innovation Training Program (2017071)
Keywords Condensed nearest neighbors; Big data; Instance selection; Iterative calculation; Lazy learning

