Abstract
Outlier detection is a fundamental problem in data mining and has been widely applied to weather forecasting, network intrusion detection, and telecommunications and credit card fraud detection. The density-based outlier detection algorithm LOF offers good detection performance and applicability, but it is computationally expensive and not efficient enough, and when computing the distance between objects it ignores the fact that different attributes contribute differently to outlierness. To address these shortcomings, this paper proposes iLOF*, an efficient improved LOF algorithm. iLOF* uses a grid to reduce the data set, which improves running efficiency; in addition, when computing the distance between objects, it introduces information entropy to assign different weights to different attributes, which improves accuracy. Furthermore, iLOF* is parallelized with the MapReduce computing framework, which further improves its efficiency on large-scale data sets. Experimental results verify the effectiveness and efficiency of iLOF*.
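The abstract names two ingredients, grid-based data reduction and entropy-weighted distances, without giving formulas. The sketch below shows one plausible reading of each and is not the paper's actual method. Everything in it is an assumption made for illustration: the function names, equal-width binning for the entropy estimate, giving higher-entropy attributes larger weights, and discarding points that fall in dense grid cells.

```python
import math
from collections import Counter


def attribute_entropy(values, bins=10):
    """Shannon entropy (bits) of one attribute after equal-width binning.

    Assumption: the binning scheme is illustrative, not taken from the paper.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant attribute carries no information
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_weights(data, bins=10):
    """Normalize per-attribute entropies into weights that sum to 1.

    Assumption: higher entropy gets a larger weight; the paper may differ.
    """
    columns = list(zip(*data))
    entropies = [attribute_entropy(col, bins) for col in columns]
    total = sum(entropies)
    if total == 0:
        return [1.0 / len(columns)] * len(columns)
    return [e / total for e in entropies]


def weighted_distance(p, q, weights):
    """Euclidean distance with one weight per attribute."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))


def grid_reduce(data, cell_size, dense_threshold):
    """Keep only points lying in sparse grid cells.

    Assumption: points in cells holding at least dense_threshold points are
    treated as clearly normal and dropped before any LOF computation.
    """
    cells = {}
    for point in data:
        key = tuple(math.floor(x / cell_size) for x in point)
        cells.setdefault(key, []).append(point)
    return [p for pts in cells.values() if len(pts) < dense_threshold for p in pts]
```

Under these assumptions, a pipeline would call `grid_reduce` first and then compute LOF scores on the remaining points using `weighted_distance` in place of the plain Euclidean distance.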
Source
Computer Systems & Applications (《计算机系统应用》)
2015, No. 12, pp. 233-238 (6 pages)
Keywords
data mining
outlier detection
local outlier factor
information entropy
parallelization