Abstract
Most existing outlier detection algorithms scale poorly and cannot be applied effectively to large data sets. To address this problem, the paper builds on existing distance-based outlier detection algorithms: it designs a norm (modulus) information table as a storage structure, exploits the vector inner product inequality, and adopts suitable storage allocation and scheduling strategies, yielding an efficient outlier detection algorithm called DBoda. The algorithm uses two strategies. First, during preprocessing it stores the norm information of every data point, which reduces the number of point-to-point distance computations. Second, it optimizes the nested-loop method to further reduce I/O overhead. Theoretical analysis and experimental results indicate that DBoda has a low running time and is suitable for large data sets, so it effectively addresses both the time complexity and the scalability problems of outlier detection.
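The abstract does not spell out how the stored norm information avoids distance computations. One plausible reading, offered here only as an illustrative sketch and not as the DBoda implementation, follows from the inner product inequality |x·y| ≤ ‖x‖‖y‖, which implies ‖x − y‖ ≥ |‖x‖ − ‖y‖|. If the precomputed norms of two points differ by more than the distance threshold, the points cannot be neighbors and the full distance need not be computed. The sketch assumes the common distance-based definition of an outlier (a point with fewer than `min_neighbors` other points within `radius`); the function name `db_outliers` and its parameters are hypothetical, and the paper's storage allocation and I/O scheduling are not modeled.

```python
import math


def norm(p):
    """Euclidean norm of a point given as a sequence of floats."""
    return math.sqrt(sum(x * x for x in p))


def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def db_outliers(points, radius, min_neighbors):
    """Nested-loop distance-based outlier detection with norm pruning.

    Assumption (not from the source): a point is an outlier when fewer
    than `min_neighbors` other points lie within `radius` of it.
    Returns (outlier indices, number of full distance computations).
    """
    # Preprocessing: one norm per point, playing the role of a norm table.
    norms = [norm(p) for p in points]
    outliers = []
    computed = 0
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i == j:
                continue
            # ||p - q|| >= | ||p|| - ||q|| |, so a norm gap larger than
            # the radius rules q out without computing the distance.
            if abs(norms[i] - norms[j]) > radius:
                continue
            computed += 1
            if dist(p, q) <= radius:
                neighbors += 1
                if neighbors >= min_neighbors:
                    break  # p is certainly not an outlier; stop early
        if neighbors < min_neighbors:
            outliers.append(i)
    return outliers, computed


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (10.0, 10.0)]
    print(db_outliers(data, radius=1.0, min_neighbors=2))
```

In this toy data set the isolated point is rejected purely on the norm test, so no full distance is ever computed for it. The pruning is only a necessary condition: points with similar norms can still be far apart, and those pairs fall through to the ordinary distance check.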
Source
Journal of Southeast University (Natural Science Edition)
EI
CAS
CSCD
PKU Core
2006, No. 3, pp. 463-466 (4 pages)
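Funding
National Natural Science Foundation of China (70371015)
Specialized Research Fund for the Doctoral Program of Higher Education (20040286009)
Special Project of the Audit Research Institute, National Audit Office (SK2006007)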
Keywords
large data set
norm information table (mode table)
vector inner product inequality
outlier detection