
Research on Frequency-based Outlier Mining (基于频数的孤立点检测研究)

Cited by: 2
Abstract: Distance-based outlier detection algorithms have important applications in many fields, but their low efficiency limits their wider adoption. To address this problem, this paper analyzes index-based and cell-based detection algorithms and, inspired by frequent itemset mining and applying statistical principles, proposes an improved distance-based outlier detection algorithm (Index Unit Based-on-Distance Outlier Mining, IU-BDOM). In the data set to be mined, the fewer times an itemset appears, the more likely it is to be an outlier; that is, the lower the frequency, the more likely the item is an outlier. The algorithm therefore starts detection from the items with the lowest frequency, which saves the cost of examining high-frequency data that certainly are not outliers. To speed up detection further and make the algorithm parallel, a hypercube replaces the traditional hypersphere when counting the neighbors of each object o in the data set, so that the distance computation is split across dimensions and carried out independently in each one, with different dimensions given different weights. In addition, the Greenplum distributed database is used to parallelize the mining task, which greatly improves mining efficiency. Experiments confirm the effectiveness of the improvement.
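The two ideas in the abstract (visiting records from the lowest frequency upward, and counting neighbors with a per-dimension hypercube test) can be illustrated with a minimal Python sketch. This is not the paper's IU-BDOM implementation: the function name `hypercube_outliers`, the parameters `radius`, `weights`, and `max_neighbors`, and the rule that the half-width in dimension i equals `radius * weights[i]` are all assumptions made for illustration, since the abstract does not say how the weights are applied.

```python
from collections import Counter


def hypercube_outliers(points, radius, weights, max_neighbors):
    """Return distinct records that have at most `max_neighbors` neighbors.

    Assumption (not from the paper): a record x is a neighbor of o when
    |x[i] - o[i]| <= radius * weights[i] holds in every dimension i,
    i.e. x lies inside an axis-aligned hypercube centered on o.
    """
    half_widths = [radius * w for w in weights]
    freq = Counter(map(tuple, points))

    # Visit distinct records from the rarest to the most frequent.
    ordered = sorted(freq.items(), key=lambda kv: kv[1])

    outliers = []
    for o, count in ordered:
        # Identical copies of o are neighbors of each other, so a record
        # seen more than max_neighbors + 1 times cannot be an outlier.
        if count - 1 > max_neighbors:
            break  # every remaining record is at least as frequent

        neighbors = count - 1
        for x, c in freq.items():
            if x == o:
                continue
            # Each dimension is tested independently of the others.
            if all(abs(a - b) <= h for a, b, h in zip(x, o, half_widths)):
                neighbors += c
                if neighbors > max_neighbors:
                    break  # already too many neighbors to be an outlier

        if neighbors <= max_neighbors:
            outliers.append(o)
    return outliers


if __name__ == "__main__":
    data = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (10, 10)]
    print(hypercube_outliers(data, radius=1.5, weights=(1.0, 1.0), max_neighbors=1))
```

Because the per-dimension comparisons do not depend on one another, this form of the neighbor test is the part that could be distributed across workers or expressed as range predicates in a database. The sketch itself runs on a single machine and makes no use of Greenplum.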
Source: Computer Technology and Development (《计算机技术与发展》), 2013, No. 5, pp. 10-13 (4 pages)
Funding: National "Core Electronic Devices, High-end Generic Chips and Basic Software" (核高基) Program of China (2012ZX01040001)
Keywords: outlier detection; frequent itemsets; distance-based; Greenplum