Abstract
Traditional clustering algorithms usually use the distance between samples as the criterion of similarity, so the choice of distance measure is critical to the clustering result. However, traditional distance measures ignore the influence of different data attributes on clustering. To address this problem, this paper builds on K-means and proposes a fast K-means algorithm based on attribute weighting (FAWK). First, a dispersion function that reflects differences among attribute characteristics is defined and used to weight the attributes. Second, the distance between data attributes is computed from the weighted attribute characteristics, and the sum of the weighted attribute distances over all attributes is taken as the similarity distance between samples. Then, the weighted attribute distance is used as the partition criterion of FAWK to cluster the data. Finally, the proposed algorithm is compared with existing methods on 8 UCI data sets and the LAMOST stellar spectral data set. The experimental results show that FAWK requires fewer iterations, runs in less time, achieves higher clustering accuracy, and produces partitions closer to the true partition of the data sets.
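The pipeline described in the abstract (weight each attribute by a dispersion measure, sum the weighted per-attribute distances, then run K-means with that distance) can be sketched roughly as follows. This is an illustrative sketch only, not the FAWK algorithm itself: the abstract does not give the dispersion function, so `dispersion_weights` below substitutes a hypothetical choice (standard deviation of each min-max scaled attribute, normalized to sum to 1). All function names, the squared-difference per-attribute distance, and the random initialization are assumptions as well.

```python
import numpy as np


def minmax_scale(X):
    """Scale every attribute to [0, 1]; constant attributes map to 0."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant attributes
    return (X - lo) / span


def dispersion_weights(X_scaled):
    """Hypothetical dispersion-based weights (NOT the paper's formula).

    Assumption: an attribute's weight is its standard deviation on the
    scaled data, normalized so that all weights sum to 1.
    """
    disp = X_scaled.std(axis=0)
    total = disp.sum()
    if total == 0:
        # All attributes are constant: fall back to uniform weights.
        return np.full(X_scaled.shape[1], 1.0 / X_scaled.shape[1])
    return disp / total


def weighted_distances(X, centers, w):
    """Sum over attributes j of w[j] * (x[j] - c[j])**2 for each pair."""
    diff = X[:, None, :] - centers[None, :, :]
    return (w * diff ** 2).sum(axis=2)


def weighted_kmeans(X, k, max_iter=100, seed=0):
    """K-means on scaled data using the weighted attribute distance."""
    Xs = minmax_scale(X)
    w = dispersion_weights(Xs)
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct samples chosen at random.
    centers = Xs[rng.choice(len(Xs), size=k, replace=False)]
    labels = None
    for n_iter in range(1, max_iter + 1):
        new_labels = weighted_distances(Xs, centers, w).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = Xs[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels, centers, w, n_iter


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
            [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]
    labels, centers, w, n_iter = weighted_kmeans(data, k=2)
    print("labels:", labels, "weights:", w, "iterations:", n_iter)
```

The returned centers are in the scaled attribute space. Swapping in the paper's actual dispersion function would only require replacing `dispersion_weights`; the rest of the loop is ordinary K-means with a weighted distance.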
Authors
ZHAO Guowei (赵国伟)
CAI Jianghui (蔡江辉)
YANG Haifeng (杨海峰)
XUN Yaling (荀亚玲)
School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024
Source
Computer & Digital Engineering (《计算机与数字工程》)
2021, No. 5, pp. 930-935 (6 pages)
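Funding
Supported by the National Youth Science Foundation Project (No. 61602335)
and the Shanxi Province Key Research and Development Project (Nos. 201803D121059, 201903D121116).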