摘要
大数据的价值实现,归根到底还是依赖于数据挖掘技术.而在很多领域中,海量数据的非常规模式往往更具分析价值.离群检测,也叫异常检测,是用于挖掘海量数据中非常规模式的一项关键技术,广泛应用于网络入侵检测、公共卫生、医疗监控等领域.基于索引的离群检测算法通常具有较高的检测速度,然而现有的大多数基于索引的检测算法并非完全基于距离,导致通用性降低.较高的抽象能力使得度量空间具有比多维空间更广泛的适用范围,在其基础上设计的算法具有更高的通用性.而最新的度量空间基于索引的离群检测算法iORCA算法通过随机选取支撑点,基于数据到单支撑点的距离建立索引,并应用终止规则(Stopping rule)以期提前结束离群检测并得到正确的结果,多数情况下该机制起到加快检测速度的重要作用.然而iORCA算法未提供支撑点选取算法导致检测结果不稳定,且未能充分利用距离三角不等性减少距离计算次数.针对这些问题,文中指出基于距离的离群点定义应结合使用完全基于距离的离群检测算法,以确保算法的通用性,由此提出了度量空间离群检测的概念.在此基础上明确了支撑点选取的两大目标,即边缘支撑点和密集支撑点,并提出基于多种支撑点的度量空间离群检测算法VPOD.考虑到两个支撑点选取目标难以同时达到,VPOD算法分别予以选取,在近似的密集区域选取支撑点,即密集支撑点,对应使用终止规则,然后用FFT(Farthest-First Traversal)算法另选取若干支撑点,即边缘支撑点,与数据集计算距离而形成支撑点空间,利用距离三角不等性,使距离计算次数显著减少,从而提高检测速度.实验表明该算法能在可接受的时间范围内建立索引,并能高效检测离群点,加速比达2.05,最高达3.54,距离计算次数平均减少51.14%,最高达89.46%,同时保持对多种常见的基于距离的离群点定义的兼容.
The realization of the value of big data,is still dependent on data mining technology in the final analysis.In many areas,the unconventional model of massive data is usually more valuable for analysis.Outlier Detection,also known as Anomaly Detection,is a key technique to discover abnormal patterns from mass data.Outlier detection techniques have been widely applied in many fields,such as network intrusion detection,public health and medical monitoring.Index based outlier detection algorithm usually has higher detection speed.However,most existing index based outlier detection algorithms are not completely based on distance,resulting in the weakening of universal property.In other words,these algorithms can be only applied in multidimensional dataset.Metric space is a set with a distance function which satisfies the distance triangle inequality.Instead of domain specific information,the requirement to apply metric space is so simple that only need to define the distance function.Because of better universal abstraction,metric space has a wider range of application than the multidimensional space,and the algorithm designed on the basis of it is more universal.The latest index based outlier detection algorithm in metric space,namely iORCA,randomly selects a single pivot and builds index upon the distances from data to it,then the algorithm can terminate ahead of time with the correct result in use of stopping rule.In most cases,this mechanism is effective and can save detection time.However,the detection result of iORCA is not stable because of its lack of pivot selection method.Further,it does not exploit Triangle Inequality to reduce distance calculation times.Focusing on these problems,in this paper,we pointed out that the distance based outlier definition should be applied together with completely distance based outlier detection algorithm,in order to guarantee the universal property,and proposed the definition of Outlier Detection in Metric Space.Further,we well defined the two goals of pivot selection,which are border pivot and dense pivot.Based on these goals,Various Pivots based Outlier Detection(VPOD)algorithm is proposed.In consideration of difficulty to achieve the two goals of pivot selection,VPOD selects the two kinds of pivots separately.On one hand,VPOD selects a single pivot in approximately dense region,which is dense pivot,related to the application of stopping rule.On the other hand,several pivots will be selected by Farthest First Traversal algorithm,which are border pivots.Then VPOD will calculate the distance of all the objects of dataset to these pivots,in order to converts the dataset from metric space to a pivot space.With the help of distance triangle inequality,the distance calculation times can be significantly reduced,with the result of higher detection speed.Experimental results show that VPOD can build the metric space index in acceptable time,and achieves a2.05speed up over iORCA on average,and in certain cases,up to3.54.The distance calculation times are reduced by51.14%on average,and up to8946%.In addition,VPOD has not lost the compatibility to the several most popular distance based outlier definitions.
作者
许红龙
唐颂
毛睿
沈婧
刘刚
陈国良
XU Hong-Long;TANG Song;MAO Rui;SHEN Jing;LIU Gang;CHEN Guo-Liang(School of Mathematics and Big Data,Foshan University,Foshan,Guangdong528000;Guangdong Province Key Laboratory of Popular High Performance Computers,College of Computer Science and Software Engineering,Shenzhen University,Shenzhen,Guangdong518060;College of Chemistry,Nankai University,Tianjin300071)
出处
《计算机学报》
EI
CSCD
北大核心
2017年第12期2839-2855,共17页
Chinese Journal of Computers
基金
国家“八六三”高技术研究发展计划项目基金(2015AA015305)
国家自然科学基金委-广东联合项目(U1301252,U1501254)
广东省重点实验室建设情况考评项目(2017B030314073)
广东省自然科学基金(2015A030313636)
深圳市科技计划项目(CXZZ20140418182638764)资助~~
关键词
离群检测
度量空间
索引
支撑点选取
三角不等性
outlier detection
metric space
index
pivot selection
triangle inequality