Spark-based Parallel Outlier Detection Algorithm of K-nearest Neighbor (基于Spark平台的并行KNN异常检测算法) Cited by: 9
Abstract With the arrival of the big data era, outlier detection has attracted extensive attention. The traditional K-nearest neighbor (KNN) outlier detection algorithm is limited by processing speed and computing resources when handling massive, high-dimensional data on a single machine, and MapReduce on Hadoop does not support iterative or in-memory computation well. To address these problems, this paper proposes a Spark-based parallel KNN outlier detection algorithm, named SPKNN. The algorithm first partitions and broadcasts the data set. In the map stage, it finds the local K nearest neighbors of the data within each partition. In the reduce stage, it merges the map outputs to determine the global K nearest neighbors and computes the outlier degree from them; the n objects with the highest outlier degree are reported as outliers. Compared with the traditional KNN outlier detection algorithm, the performance of SPKNN scales approximately linearly with computing resources while preserving detection accuracy. Compared with other parallel outlier detection algorithms, it needs no additional data expansion, supports iterative computation, and reduces I/O cost by caching intermediate results in memory. Experimental results show that SPKNN improves the efficiency and scalability of KNN outlier detection on large-scale data sets.
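The map/reduce flow described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' SPKNN implementation and does not use Spark: plain Python lists stand in for partitions, and the full indexed data set stands in for the broadcast variable. The abstract does not state how the outlier degree is computed, so the sum of distances to the K nearest neighbors is used here as an assumption. All function names (`local_knn`, `merge_knn`, `top_n_outliers`) are hypothetical.

```python
import heapq
import math
from functools import reduce


def local_knn(partition, queries, k):
    """Map step: for every query point, keep the k smallest distances
    to the points held by one partition (excluding the point itself)."""
    result = {}
    for qi, q in queries:
        dists = [math.dist(q, p) for pi, p in partition if pi != qi]
        result[qi] = heapq.nsmallest(k, dists)
    return result


def merge_knn(a, b, k):
    """Reduce step: merge two sets of local candidates per query point
    and keep only the k smallest distances overall."""
    return {qi: heapq.nsmallest(k, a[qi] + b[qi]) for qi in a}


def top_n_outliers(points, k, n, num_partitions):
    # The indexed data set plays the role of the broadcast copy.
    indexed = list(enumerate(points))
    # Round-robin partitioning (an assumption; any partitioning works here).
    partitions = [indexed[i::num_partitions] for i in range(num_partitions)]
    # Map: local k nearest distances per partition.
    local = [local_knn(part, indexed, k) for part in partitions]
    # Reduce: global k nearest distances.
    global_knn = reduce(lambda a, b: merge_knn(a, b, k), local)
    # Assumed outlier degree: sum of distances to the k nearest neighbors.
    scores = {qi: sum(d) for qi, d in global_knn.items()}
    # Indices of the n points with the highest outlier degree.
    return heapq.nlargest(n, scores, key=scores.get)
```

Because the reduce step keeps the k smallest candidates from the union of all partitions, the global result does not depend on how many partitions are used, which is what lets the work be spread across nodes without changing the detected outliers.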
Authors FENG Gui-lan (冯贵兰); ZHOU Wen-gang (周文刚) (Modern Education Technology Center, Civil Aviation Flight University of China, Guanghan, Sichuan 618307, China; Institute of Flight Technology, Civil Aviation Flight University of China, Guanghan, Sichuan 618307, China)
Source Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2018, Issue B11, pp. 349-352, 366 (5 pages)
Funding Supported by the Civil Aviation Flight Data Analysis Research Project (XM2852)
Keywords Spark; Parallel; K-nearest neighbors; Outlier detection
