Abstract
Outlier detection is an important research direction in data mining, but most outlier mining algorithms become inefficient when applied to high-dimensional data sets. This paper proposes LEAWCD, an outlier mining algorithm based on attribute entropy and weighted cosine similarity. The algorithm first uses local attribute entropy to identify the local outlier attributes of each object within its k-neighborhood, and sets the attribute weight vector automatically from the deviation degree of each outlier attribute. It then applies a weighted form of cosine similarity, a measure that remains effective for high-dimensional data, to quantify each object's outlier degree within its k-neighborhood, which yields local outlier detection in high-dimensional data. Finally, experiments on celestial spectrum data provided by the National Astronomical Observatory show that LEAWCD scales well and achieves high detection precision.
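As a rough illustration of the weighted cosine similarity step described in the abstract, the sketch below scores an object by how dissimilar it is from its k most similar neighbors. It is not the LEAWCD implementation: the abstract gives no formulas, so the function names `weighted_cosine` and `outlier_score`, the use of a single weight vector `w` supplied by the caller (the paper derives weights from local attribute entropy), and the scoring rule "one minus the mean similarity to the k nearest objects" are all assumptions made for this example.

```python
import math


def weighted_cosine(x, y, w):
    """Cosine similarity with non-negative per-attribute weights w (assumed form)."""
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    norm_x = math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)))
    norm_y = math.sqrt(sum(wi * yi * yi for wi, yi in zip(w, y)))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for all-zero weighted vectors
    return num / (norm_x * norm_y)


def outlier_score(data, i, k, w):
    """Hypothetical score: 1 - mean weighted cosine similarity to the k most similar objects."""
    sims = sorted(
        (weighted_cosine(data[i], data[j], w) for j in range(len(data)) if j != i),
        reverse=True,
    )
    nearest = sims[:k]  # the k-neighborhood, taken here as the k most similar objects
    return 1.0 - sum(nearest) / len(nearest)


if __name__ == "__main__":
    # Toy data: three objects pointing roughly along the first axis, one along the second.
    data = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0]]
    w = [1.0, 1.0]  # equal weights; an entropy-based scheme would set these per attribute
    for idx in range(len(data)):
        print(idx, round(outlier_score(data, idx, 2, w), 4))
```

On the toy data, the last object receives a much higher score than the other three, since its direction differs from all of its neighbors. Raising the weight of an attribute increases its influence on both the numerator and the norms, which is how an attribute judged to be outlying could be emphasized.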
Source
Journal of Taiyuan University of Science and Technology
2014, No. 3, pp. 171-175 (5 pages)
Funding
Youth Fund Project of Taiyuan University of Science and Technology (20093015)
Keywords
attribute entropy
cosine similarity
outlier data
celestial spectra