
GridOF: An Efficient Outlier Detection Algorithm for Very Large Datasets

Cited by: 28
Abstract: Outlier detection is an important technique in knowledge discovery in databases, since identifying the rare instances in a dataset can lead to unexpected and useful knowledge. Existing outlier detection algorithms, however, are unsatisfactory in both time and space efficiency when applied to large datasets. Starting from an analysis of how outliers are distributed in a dataset, and building on a grid partition of the data space, this paper studies approximate density computation at the level of data hypercubes (grid cells) together with a strategy for filtering out the dense main body of the data. It gives a method that replaces the laborious point-to-point computation of density function values with a simple adjusted mean approximation. The resulting grid-based algorithm, GridOF, first filters out crowded grid cells and then finds outliers by computing the adjusted mean approximation of the density function. It significantly reduces time and space complexity while keeping sufficient detection accuracy, and experimental results indicate that it is applicable and effective for outlier detection on large-scale datasets.
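Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2003, No. 11, pp. 1586-1592 (7 pages)
Funding: National Natural Science Foundation of China (7997009); Natural Science Foundation of the Jiangsu Provincial Department of Education (02KJB520012)
Keywords: outlier detection; adjusted mean approximation; GridOF algorithm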
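
The abstract names two steps (filtering out crowded grid cells, then scoring the remaining points with a cell-level density approximation) but does not give the formulas. The following is only a minimal Python sketch of that general idea under stated assumptions. The function name `grid_outlier_candidates`, the parameters `cell_width`, `dense_threshold`, and `score_threshold`, and the neighbourhood-mean score are all hypothetical choices made for illustration; they are not the paper's adjusted mean approximation or the published GridOF procedure.

```python
from collections import Counter
from itertools import product
from math import floor


def grid_outlier_candidates(points, cell_width, dense_threshold, score_threshold):
    """Illustrative grid-based outlier filter (an assumption, not the published GridOF)."""
    # Map every point to the integer coordinates of its hypercube (grid cell).
    cell_of = [tuple(floor(x / cell_width) for x in p) for p in points]
    counts = Counter(cell_of)

    dims = len(points[0]) if points else 0
    # Offsets to the cell itself and all adjacent cells (3**dims of them).
    offsets = list(product((-1, 0, 1), repeat=dims))

    def neighbourhood_mean(cell):
        # Mean point count over the cell and its adjacent cells; empty cells count as 0.
        total = sum(
            counts.get(tuple(c + o for c, o in zip(cell, off)), 0)
            for off in offsets
        )
        return total / len(offsets)

    candidates = []
    for point, cell in zip(points, cell_of):
        # Step 1: drop points that lie in crowded cells (the dense main body).
        if counts[cell] >= dense_threshold:
            continue
        # Step 2: use cell counts, not point-to-point distances, as a density proxy.
        if neighbourhood_mean(cell) <= score_threshold:
            candidates.append(point)
    return candidates


# Hypothetical data: one dense cluster, one point just beside it, one far away.
cluster = [(0.05 * i, 0.04 * i) for i in range(20)]
sample = cluster + [(1.2, 0.5), (10.5, 10.5)]
result = grid_outlier_candidates(sample, cell_width=1.0,
                                 dense_threshold=5, score_threshold=0.5)
```

In this sketch the point beside the cluster sits in a sparse cell but is not reported, because the neighbouring dense cell raises its score; only the isolated point remains. Averaging over adjacent cells is one simple way to avoid flagging points on the edge of a dense region, and the thresholds would need tuning for any real dataset.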

References (7)

  • 1. D Hawkins. Identification of Outliers. London: Chapman and Hall, 1980.
  • 2. T Johnson, I Kwok, R Ng. Fast computation of 2-dimensional depth contours. In: Proc of the 4th Int'l Conf on Knowledge Discovery and Data Mining. New York: AAAI Press, 1998. 224-228.
  • 3. E M Knorr, R T Ng. Algorithms for mining distance-based outliers in large datasets. In: Proc of the 24th Int'l Conf on Very Large Databases. New York: Morgan Kaufmann, 1998. 392-403.
  • 4. D Yu, G Sheikholeslami, A Zhang. Findout: Finding outliers in very large datasets. Department of Computer Science and Engineering, State University of New York at Buffalo, Tech Rep: 99-03, 1999. http://www.cse.buffalo.edu/tech-reports.
  • 5. M Breunig, H Kriegel, R T Ng et al. LOF: Identifying density-based local outliers. In: Proc of ACM SIGMOD Int'l Conf on Management of Data. Dallas, Texas: ACM Press, 2000. 93-104.
  • 6. M Joshi, R Agarwal, V Kumar. Mining needles in a haystack: Classifying rare classes via two-phase rule induction. In: Proc of ACM SIGMOD Int'l Conf on Management of Data. Santa Barbara, CA: ACM Press, 2001. 91-102.
  • 7. H Samet. The Design and Analysis of Spatial Data Structures. Boston, MA: Addison-Wesley, 1990.

Co-cited documents: 222

Citing documents: 28

Second-level citing documents: 343
