Research on Frequency-Based Outlier Detection (cited by: 2)

Research on Frequency-based Outlier Mining

Abstract: Distance-based outlier detection algorithms have important applications in many fields, but their low efficiency limits wider adoption. To address this problem, the paper analyzes the index-based detection algorithm and the cell-based algorithm and, inspired by frequent itemset mining and applying statistical principles, proposes an improved distance-based outlier detection algorithm, Index Unit Based-on-Distance Outlier Mining (IU-BDOM). In the data set to be mined, the fewer times an itemset occurs, the more likely it is to be an outlier; that is, the lower the frequency, the higher the likelihood of being an outlier. The algorithm therefore starts detection from the items with the lowest frequency, which avoids the cost of examining high-frequency data that are certainly not outliers. To speed up the algorithm further and make it parallelizable, a hypercube replaces the traditional hypersphere when counting the neighbors of each object o in the data set, so that the distance computation is split across dimensions and carried out independently in each one, with different weights assigned to different dimensions. In addition, the mining task is parallelized on the Greenplum distributed database, which greatly improves mining efficiency. Experiments confirm the effectiveness of these improvements.
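The abstract gives no pseudocode for IU-BDOM, so the following is only a minimal Python sketch of how the two ideas it describes (lowest-frequency-first scanning and a weighted hypercube neighborhood) might fit together. Everything specific here is an assumption made for illustration: the names `in_hypercube` and `find_outliers`, the `min_neighbors` threshold, counting frequency over identical records, and the early-exit pruning rule. It runs on a single machine and does not attempt the Greenplum parallelization.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def in_hypercube(a: Point, b: Point, weights: Sequence[float], d: float) -> bool:
    """Return True if b lies in the weighted hypercube of half-width d around a.

    Each dimension is tested on its own, so the check stops at the first
    dimension that fails; no cross-dimension sum (as in a hypersphere) is needed.
    """
    return all(w * abs(x - y) <= d for x, y, w in zip(a, b, weights))


def find_outliers(points: Sequence[Point], weights: Sequence[float],
                  d: float, min_neighbors: int) -> List[Point]:
    """Return the distinct points that have fewer than min_neighbors neighbors.

    Assumption (not from the paper): a record that occurs c times already has
    c - 1 neighbors at distance zero, so frequent records can be skipped.
    """
    freq = Counter(points)
    outliers = []
    # Scan distinct records from the lowest frequency upward.
    for p, count in sorted(freq.items(), key=lambda kv: kv[1]):
        if count - 1 >= min_neighbors:
            # Sorted ascending: every remaining record is at least as frequent,
            # so none of them can be an outlier either.
            break
        neighbors = count - 1
        for q, q_count in freq.items():
            if q == p:
                continue
            if in_hypercube(p, q, weights, d):
                neighbors += q_count
                if neighbors >= min_neighbors:
                    break  # enough neighbors found; stop counting early
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers


data = [(0.0, 0.0)] * 5 + [(0.1, 0.1)] * 3 + [(10.0, 10.0), (0.2, 0.0)]
print(find_outliers(data, weights=(1.0, 1.0), d=0.5, min_neighbors=3))
```

In this toy data set only the isolated record `(10.0, 10.0)` is reported; the five copies of `(0.0, 0.0)` are never compared against anything, which is the saving the abstract attributes to starting from the lowest frequencies.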

Authors: Zhu Dongsheng, Wu Qingbo, Tan Yusong

Affiliation: College of Computer, National University of Defense Technology

Source: Computer Technology and Development (《计算机技术与发展》), 2013, No. 5, pp. 10-13 (4 pages)

Funding: National "Core Electronic Devices, High-end Generic Chips and Basic Software" (核高基) Program project (2012ZX01040001)

Keywords: outlier detection; frequent itemsets; distance-based; Greenplum

Classification number: TP912.3 [Automation and Computer Technology]







