
Efficient nested-loop based outlier detection algorithm for large data sets
Abstract: Most existing outlier detection algorithms scale poorly and cannot be applied effectively to large data sets. To address this problem, and building on existing distance-based outlier detection algorithms, a norm information table (called a "mode table" in the original English abstract) is designed as a storage structure. Using a vector inner product inequality together with suitable storage allocation and scheduling strategies, an efficient outlier detection algorithm, DBoda, is proposed. The algorithm applies two strategies. First, during preprocessing it stores the norm information of every point, which reduces the number of inter-point distance computations. Second, it optimizes the nested-loop method to further reduce I/O overhead. Theoretical analysis and experimental results show that the proposed algorithm has low time consumption and is suitable for large data sets, and that it effectively addresses the time complexity and scalability problems of outlier detection.
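The abstract does not state the exact inequality or the layout of the norm information table, so the following is only a minimal sketch of the general idea, not the DBoda algorithm itself. It assumes the standard distance-based definition (a point is an outlier if at most `max_neighbors` other points lie within `radius` of it) and assumes the pruning rule is the reverse triangle inequality, | ‖p‖ − ‖q‖ | ≤ ‖p − q‖, which follows from the Cauchy-Schwarz inner product inequality. The names `norm_table`, `find_outliers`, `radius`, and `max_neighbors` are hypothetical, and the sketch runs entirely in memory, so it does not model the paper's storage allocation or I/O scheduling.

```python
import math


def norm_table(points):
    """Precompute the Euclidean norm of every point (assumed table contents)."""
    return [math.sqrt(sum(c * c for c in p)) for p in points]


def find_outliers(points, radius, max_neighbors):
    """Return indices of points with at most max_neighbors other points within radius."""
    norms = norm_table(points)
    outliers = []
    for i, p in enumerate(points):
        count = 0
        is_outlier = True
        for j, q in enumerate(points):
            if i == j:
                continue
            # Pruning (assumed rule): | ||p|| - ||q|| | <= ||p - q||, so a norm gap
            # larger than radius proves q is not a neighbor; skip the distance.
            if abs(norms[i] - norms[j]) > radius:
                continue
            # Reuse stored norms: ||p - q||^2 = ||p||^2 + ||q||^2 - 2 * (p . q).
            dot = sum(a * b for a, b in zip(p, q))
            dist_sq = norms[i] ** 2 + norms[j] ** 2 - 2.0 * dot
            if dist_sq <= radius * radius:
                count += 1
                # Early exit of the inner loop: p already has too many neighbors.
                if count > max_neighbors:
                    is_outlier = False
                    break
        if is_outlier:
            outliers.append(i)
    return outliers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
    print(find_outliers(pts, radius=1.0, max_neighbors=0))
```

In this toy data set the point (5.0, 5.0) is never compared by full distance against the cluster near the origin, because the norm gap alone already exceeds the radius. That is the kind of saving the stored norm information is meant to provide.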
Source: Journal of Southeast University (Natural Science Edition), 2006, No. 3, pp. 463-466 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
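Funding: National Natural Science Foundation of China (70371015); Specialized Research Fund for the Doctoral Program of Higher Education (20040286009); special project funded by the Audit Research Institute of the National Audit Office (SK2006007).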
Keywords: large data set; norm information table (mode table); vector inner product inequality; outlier detection
