Automatic k-means clustering algorithm based on data sampling (基于数据抽样的自动k-means聚类算法)
Abstract: The traditional k-means algorithm requires the value of k as input and is difficult to apply to ultra-large-scale data sets. Building on previous studies, this paper addresses both problems. Information entropy is first introduced into the distance calculation. For an ultra-large-scale data set, data sampling is then used: a sample of the optimal size is drawn and clustered, cluster validity indexes are evaluated on the sample clustering, and the value of k required by the algorithm is obtained from them. Finally, the entropy-based distance formula is used to cluster the full ultra-large data set with that k. Experiments show that the algorithm removes the need to supply k to traditional k-means and, through data sampling, obtains the k value for clustering an ultra-large data set automatically without affecting clustering quality.
Source: Modern Electronics Technique (《现代电子技术》), 2014, No. 8, pp. 19-21 (3 pages)
Keywords: k-means algorithm; information entropy; optimal sample extraction; validity index
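The procedure outlined in the abstract (entropy-based distance, clustering a sample, choosing k by a validity index, then clustering the full data set) might be sketched roughly as follows. This is only an illustration under several assumptions that the abstract does not confirm: the entropy-weight method is used to derive per-feature weights for a weighted Euclidean distance, the silhouette width stands in for the validity index, the sample size is passed in as a plain parameter rather than computed by the paper's optimal-sample-size rule, and farthest-point seeding is used to initialize centers. All function names (`entropy_weights`, `kmeans`, `silhouette`, `auto_kmeans`) are hypothetical.

```python
import math
import random


def entropy_weights(data):
    """Entropy-weight method (an assumed choice): one weight per feature, summing to 1."""
    n, dims = len(data), len(data[0])
    spread = []
    for j in range(dims):
        col = [row[j] for row in data]
        lo, hi = min(col), max(col)
        if hi == lo:  # a constant feature carries no information
            spread.append(0.0)
            continue
        scaled = [(v - lo) / (hi - lo) for v in col]
        total = sum(scaled)
        entropy = -sum((s / total) * math.log(s / total) for s in scaled if s > 0)
        spread.append(1.0 - entropy / math.log(n))
    if sum(spread) == 0:
        return [1.0 / dims] * dims
    return [s / sum(spread) for s in spread]


def dist(a, b, w):
    """Weighted Euclidean distance with per-feature weights w."""
    return math.sqrt(sum(wi * (x - y) ** 2 for x, y, wi in zip(a, b, w)))


def kmeans(data, k, w, rng, iters=100):
    """Lloyd's k-means with farthest-point seeding (the seeding is an assumption)."""
    centers = [list(rng.choice(data))]
    while len(centers) < k:
        far = max(data, key=lambda p: min(dist(p, c, w) for c in centers))
        centers.append(list(far))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c], w)) for p in data]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centers[c])  # keep the old center of an empty cluster
        if updated == centers:
            break
        centers = updated
    return centers, labels


def silhouette(data, labels, k, w):
    """Mean silhouette width; returns -1.0 if any cluster is empty."""
    clusters = [[p for p, lab in zip(data, labels) if lab == c] for c in range(k)]
    if any(not members for members in clusters):
        return -1.0
    scores = []
    for p, lab in zip(data, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(p, q, w) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q, w) for q in clusters[c]) / len(clusters[c])
                for c in range(k) if c != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def auto_kmeans(data, sample_size, k_max, seed=0):
    """Pick k on a random sample, then cluster the full data set with that k."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    w = entropy_weights(sample)  # weights estimated on the sample only
    best_k, best_score = 2, -2.0
    for k in range(2, k_max + 1):
        _, labels = kmeans(sample, k, w, rng)
        score = silhouette(sample, labels, k, w)
        if score > best_score:
            best_k, best_score = k, score
    centers, labels = kmeans(data, best_k, w, rng)
    return best_k, centers, labels
```

A call such as `auto_kmeans(points, sample_size=90, k_max=5)` takes a list of equal-length numeric tuples and returns the chosen k together with the centers and labels for the full data set. The quadratic-cost silhouette computation is run only on the sample, which is the reason sampling makes the search over k affordable on large data.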