Various Pivots Based Outlier Detection Algorithm in Metric Space (基于多种支撑点的度量空间离群检测算法)

Cited by: 4

Abstract: Realizing the value of big data ultimately depends on data mining. In many fields, the unconventional patterns in massive data are often the most valuable to analyze. Outlier detection, also known as anomaly detection, is a key technique for discovering such patterns in massive data, and it is widely applied in fields such as network intrusion detection, public health, and medical monitoring. Index-based outlier detection algorithms usually achieve high detection speed. However, most existing index-based algorithms are not based purely on distance, which limits their generality: they can be applied only to multidimensional datasets. A metric space is a set with a distance function that satisfies the triangle inequality. Applying it requires no domain-specific information, only the definition of a distance function. Because of this higher level of abstraction, metric space has a wider range of application than multidimensional space, and algorithms designed on it are more general.

The latest index-based outlier detection algorithm in metric space, iORCA, randomly selects a single pivot, builds an index on the distances from the data to that pivot, and applies a stopping rule so that detection can terminate early while still returning the correct result. In most cases this mechanism plays an important role in speeding up detection. However, iORCA provides no pivot selection method, which makes its detection results unstable, and it does not fully exploit the triangle inequality to reduce the number of distance calculations.

To address these problems, this paper points out that a distance-based outlier definition should be paired with a purely distance-based outlier detection algorithm in order to guarantee generality, and on this basis proposes the concept of outlier detection in metric space. It then identifies two goals of pivot selection, namely border pivots and dense pivots, and proposes the Various Pivots based Outlier Detection (VPOD) algorithm. Because the two goals are difficult to achieve at the same time, VPOD selects the two kinds of pivots separately. First, it selects a single pivot in an approximately dense region, the dense pivot, which is used with the stopping rule. Second, it selects several further pivots with the Farthest-First Traversal (FFT) algorithm, the border pivots. VPOD then computes the distances from every object in the dataset to these pivots, which maps the dataset from the metric space into a pivot space. With the triangle inequality, the number of distance calculations is significantly reduced, which raises detection speed.

Experimental results show that VPOD builds its index in acceptable time and detects outliers efficiently. It achieves an average speedup of 2.05 over iORCA, and up to 3.54 in certain cases. The number of distance calculations is reduced by 51.14% on average and by up to 89.46%. VPOD also remains compatible with several of the most common distance-based outlier definitions.
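The two ingredients the abstract names for border pivots, Farthest-First Traversal and triangle-inequality pruning in pivot space, can be illustrated with a minimal Python sketch. This is not the authors' VPOD implementation: the dense pivot and the stopping rule are omitted, the outlier score is assumed to be the distance to the k-th nearest neighbor (one common distance-based definition), and all names (`farthest_first_pivots`, `to_pivot_space`, `lower_bound`, `kth_nn_distance`, `POINTS`, `euclid`) are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Metric = Callable[[T, T], float]


def farthest_first_pivots(data: Sequence[T], dist: Metric, k: int,
                          start: int = 0) -> List[int]:
    """Pick k pivot indices by Farthest-First Traversal (assumes k <= len(data)).

    Each new pivot is the object farthest from all pivots chosen so far.
    The starting index is an arbitrary assumption of this sketch.
    """
    pivots = [start]
    # min_dist[i] = distance from data[i] to its nearest chosen pivot
    min_dist = [dist(x, data[start]) for x in data]
    while len(pivots) < k:
        nxt = max(range(len(data)), key=min_dist.__getitem__)
        pivots.append(nxt)
        for i, x in enumerate(data):
            min_dist[i] = min(min_dist[i], dist(x, data[nxt]))
    return pivots


def to_pivot_space(data: Sequence[T], dist: Metric,
                   pivots: Sequence[int]) -> List[List[float]]:
    """Map every object to its vector of distances to the pivots."""
    return [[dist(x, data[p]) for p in pivots] for x in data]


def lower_bound(row_a: Sequence[float], row_b: Sequence[float]) -> float:
    """Triangle inequality: d(a, b) >= |d(a, p) - d(b, p)| for every pivot p."""
    return max(abs(a - b) for a, b in zip(row_a, row_b))


def kth_nn_distance(q: int, data: Sequence[T], dist: Metric,
                    table: Sequence[Sequence[float]], k: int) -> Tuple[float, int]:
    """Distance from data[q] to its k-th nearest neighbor (assumes len(data) > k).

    A candidate is skipped without computing its real distance when its
    pivot-space lower bound cannot beat the current k-th best distance.
    Returns the distance and the number of real distance computations.
    """
    best: List[float] = []  # ascending, at most k entries
    computed = 0
    for i in range(len(data)):
        if i == q:
            continue
        if len(best) == k and lower_bound(table[q], table[i]) >= best[-1]:
            continue  # pruned by the triangle inequality
        best.append(dist(data[q], data[i]))
        computed += 1
        best.sort()
        del best[k:]
    return best[-1], computed


# Toy data: four clustered points and one obvious outlier.
POINTS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
euclid = math.dist  # any metric works; Euclidean is only an example
```

Only the metric `dist` is ever applied to the data, so the same sketch would work for strings under edit distance or any other metric. The pivot table costs one distance computation per object per pivot up front, and each pruned candidate then saves one computation during detection.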

Authors: XU Hong-Long (许红龙); TANG Song (唐颂); MAO Rui (毛睿); SHEN Jing (沈婧); LIU Gang (刘刚); CHEN Guo-Liang (陈国良)

Affiliations: School of Mathematics and Big Data, Foshan University, Foshan, Guangdong 528000; Guangdong Province Key Laboratory of Popular High Performance Computers, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060; College of Chemistry, Nankai University, Tianjin 300071

Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2017, No. 12, pp. 2839-2855 (17 pages)

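Funding: Supported by the National High Technology Research and Development Program of China ("863" Program) (2015AA015305), the National Natural Science Foundation of China-Guangdong Joint Fund (U1301252, U1501254), the Guangdong Provincial Key Laboratory Construction Evaluation Project (2017B030314073), the Natural Science Foundation of Guangdong Province (2015A030313636), and the Shenzhen Science and Technology Program (CXZZ20140418182638764).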

Keywords: outlier detection; metric space; index; pivot selection; triangle inequality

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
