Abstract
Because clustering is easily distorted by outliers, this paper proposes a local density outlier detection k-means algorithm: outliers are first detected in the data set with a local density outlier detection method and removed, and k-means clustering is then applied to the remaining data. The effectiveness of the algorithm is evaluated with the Davies-Bouldin (DB) index, the Dunn index, and the Silhouette index. Experiments on artificially generated data sets and UCI data sets show that k-means applied after outlier removal yields better clustering results than k-means applied to the original data. The method is also applied to epidemic data: a cluster analysis is carried out on the number of confirmed COVID-19 cases on February 18, 2020 in 24 provinces, municipalities, and autonomous regions, including Anhui, Beijing, Fujian, and Guangdong. Here too, k-means after outlier removal gives better clustering results than k-means on the original data. The results can support practical decision-making and help reduce economic losses.
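The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the local density outlier detection step behaves like the Local Outlier Factor (LOF), and the toy data, the neighbourhood size `k=3`, the cutoff `1.5`, the farthest-point initialisation, and all function names are hypothetical choices made for the example. Only the Davies-Bouldin index is computed; the Dunn and Silhouette indices are omitted for brevity.

```python
import math

def lof_scores(points, k=3):
    """LOF score per point; values well above 1 suggest an outlier."""
    neigh = []
    for i, p in enumerate(points):
        d = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neigh.append(d[:k])  # k nearest neighbours as (distance, index)
    k_dist = [n[-1][0] for n in neigh]  # distance to the k-th neighbour
    # Local reachability density: inverse of the mean reachability distance.
    lrd = [1.0 / (sum(max(k_dist[j], d) for d, j in n) / len(n) + 1e-12)
           for n in neigh]
    # LOF: mean density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for _, j in n) / (len(n) * lrd[i])
            for i, n in enumerate(neigh)]

def init_centers(points, k):
    """Deterministic farthest-point initialisation (an assumption of this sketch)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=100):
    """Plain Lloyd iterations; returns (labels, centers)."""
    centers = init_centers(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centre if a cluster happens to be empty.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels, centers

def davies_bouldin(points, labels, centers):
    """Davies-Bouldin index (lower is better); assumes no cluster is empty."""
    k = len(centers)
    scatter = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        scatter.append(sum(math.dist(p, centers[c]) for p in members) / len(members))
    return sum(max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                   for j in range(k) if j != i) for i in range(k)) / k

# Hypothetical toy data: two compact groups plus one isolated point.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
group_b = [(10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
data = group_a + group_b + [(5, 30)]

# Step 1: detect and remove outliers (cutoff 1.5 is an illustrative choice).
scores = lof_scores(data, k=3)
outliers = [p for p, s in zip(data, scores) if s > 1.5]
cleaned = [p for p, s in zip(data, scores) if s <= 1.5]

# Step 2: k-means on the raw and on the cleaned data, then compare DB indices.
labels_raw, centers_raw = kmeans(data, 2)
labels_clean, centers_clean = kmeans(cleaned, 2)
db_raw = davies_bouldin(data, labels_raw, centers_raw)
db_clean = davies_bouldin(cleaned, labels_clean, centers_clean)
print("outliers:", outliers)
print("DB raw:", round(db_raw, 3), "DB cleaned:", round(db_clean, 3))
```

On this toy data the isolated point captures a whole cluster when k-means runs on the raw data, whereas after its removal the two compact groups are separated. A lower Davies-Bouldin value indicates more compact, better separated clusters.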
Authors
刘凤
戴家佳
胡阳
LIU Feng; DAI Jia-jia; HU Yang (School of Mathematics and Statistics, Guizhou University, Guiyang 550025, China)
Source
《重庆工商大学学报(自然科学版)》
2021, Issue 4, pp. 30-35 (6 pages)
Journal of Chongqing Technology and Business University: Natural Science Edition
Funding
Guizhou Provincial Innovation Team for Data-Driven Modeling, Learning and Optimization (黔科合平台人才〔2020〕5016).