Attribute Group Weight-based Outlier Detection for Categorical Data
Abstract: Attribute grouping is an effective technique in high-dimensional outlier detection because it alleviates the "curse of dimensionality". However, existing attribute-grouping outlier detection methods reflect neither the differences among attribute groups nor the degree to which each group deviates, which seriously degrades the effectiveness and performance of high-dimensional outlier detection. We use the cumulative sum of information entropy to characterize the differences among attribute groups and propose an attribute group weight-based outlier detection method for categorical data. First, an attribute group deviation factor is defined from data pattern frequency and code length and used as the criterion for merging attribute groups; it captures the deviation degree of attribute groups and improves search efficiency during attribute grouping. Second, attribute group weights are defined from the cumulative sum of information entropy, so that the differences among attribute groups are reflected. Third, the outlier score function is redefined in terms of the attribute group weights, and an outlier detection algorithm for categorical data is built on it. Finally, experiments on UCI, NTU, KEEL, and synthetic datasets show that the algorithm achieves high detection accuracy and efficiency and scales well, making it applicable to outlier detection on high-dimensional, massive categorical datasets.
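The abstract gives no formulas, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It makes several assumptions that do not come from the paper: the attribute groups are supplied by hand (the deviation-factor merging step is not reproduced), a group's weight is the sum of the Shannon entropies of its attributes normalised over all groups, and a row's score is the weighted sum of the code lengths, -log2(frequency), of its value pattern in each group. The names `entropy`, `group_weights`, and `outlier_scores` are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def group_weights(rows, groups):
    """ASSUMPTION: weight = summed attribute entropy of a group, normalised to sum to 1."""
    sums = [sum(entropy([r[a] for r in rows]) for a in g) for g in groups]
    total = sum(sums)
    if total == 0:  # every attribute is constant: fall back to equal weights
        return [1.0 / len(groups)] * len(groups)
    return [s / total for s in sums]


def outlier_scores(rows, groups):
    """ASSUMPTION: score = sum over groups of weight * code length of the row's pattern."""
    n = len(rows)
    scores = [0.0] * n
    for w, g in zip(group_weights(rows, groups), groups):
        patterns = [tuple(r[a] for a in g) for r in rows]
        freq = Counter(patterns)
        for i, p in enumerate(patterns):
            # Rare patterns have long code lengths and raise the score.
            scores[i] += w * -math.log2(freq[p] / n)
    return scores


# Toy data: three categorical attributes, grouped by hand as {0, 1} and {2}.
rows = [
    ("red", "small", "yes"),
    ("red", "small", "yes"),
    ("red", "small", "no"),
    ("blue", "large", "no"),
    ("blue", "large", "yes"),
    ("green", "small", "no"),  # the (green, small) pattern occurs only once
]
groups = [(0, 1), (2,)]

weights = group_weights(rows, groups)
scores = outlier_scores(rows, groups)
most_outlying = max(range(len(rows)), key=scores.__getitem__)
print(weights, scores, most_outlying)
```

In this toy setup the last row receives the highest score because its pattern in the first group is the rarest. Whether higher entropy should raise or lower a group's weight is a design choice the abstract does not settle; the sketch raises it.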
Authors: ZHANG Kai-qi; SONG Yi-jing; CHEN Xin (School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China)
Source: Computer Technology and Development, 2023, No. 11, pp. 20-27 (8 pages)
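Funding: Shanxi Province Basic Research Program (202103021223267); Science and Technology Innovation Program of Higher Education Institutions in Shanxi Province (2021L297); Taiyuan University of Science and Technology Scientific Research Start-up Fund (20212053, 20222107).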
Keywords: outlier detection; attribute grouping; categorical data; attribute group weight; deviation factor