Abstract
The traditional k-means algorithm has two problems: the value of k must be supplied as input, and clustering must be carried out on ultra-large-scale data sets. To address them, this work builds on previous studies. Information entropy is first introduced into the distance calculation. Data sampling is then applied to the ultra-large-scale data set: a sample of the optimal sample size is drawn and clustered. Validity indexes are evaluated on the sample clustering to obtain the k value the algorithm requires, and the entropy-based distance formula is then used to cluster the full ultra-large data set. Experiments show that the algorithm removes the traditional k-means algorithm's need for k as input, and that data sampling allows the k value for clustering an ultra-large data set to be obtained automatically without degrading clustering quality.
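The pipeline summarized in the abstract can be sketched roughly as follows. The abstract does not give the exact entropy-based distance formula, the validity index, or the rule for the optimal sample size, so everything specific here is an assumption made for illustration: the standard entropy weight method supplies per-feature weights for a weighted Euclidean distance, the Calinski-Harabasz index stands in for the unspecified validity index, and `sample_size` and `k_max` are hypothetical fixed parameters. All function names are likewise hypothetical.

```python
import numpy as np


def entropy_weights(X):
    """Entropy weight method (assumed): lower-entropy features get larger weights.
    X must be non-negative, e.g. min-max scaled."""
    n = X.shape[0]
    col_sums = X.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    P = X / col_sums
    logP = np.zeros_like(P)
    np.log(P, out=logP, where=P > 0)          # treat 0 * log(0) as 0
    e = -(P * logP).sum(axis=0) / np.log(n)   # normalized entropy per feature
    d = np.clip(1.0 - e, 0.0, None)
    if d.sum() == 0:
        return np.full(X.shape[1], 1.0 / X.shape[1])
    return d / d.sum()


def kmeans(X, k, rng, n_init=5, max_iter=100):
    """Plain Lloyd's k-means with k-means++ seeding; keeps the best of n_init runs."""
    best = None
    for _ in range(n_init):
        centers = [X[rng.integers(len(X))]]
        for _ in range(1, k):
            d2 = ((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(axis=1)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        centers = np.array(centers)
        for _ in range(max_iter):
            labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        inertia = ((X - centers[labels]) ** 2).sum()
        if best is None or inertia < best[0]:
            best = (inertia, labels, centers)
    return best[1], best[2]


def calinski_harabasz(X, labels, centers):
    """Validity index (assumed stand-in): between- over within-cluster dispersion."""
    n, k = len(X), len(centers)
    counts = np.bincount(labels, minlength=k)
    B = (counts * ((centers - X.mean(axis=0)) ** 2).sum(axis=1)).sum()
    W = ((X - centers[labels]) ** 2).sum()
    return (B / (k - 1)) / (W / (n - k))


def auto_kmeans(X, sample_size=300, k_max=8, seed=0):
    """Pick k on a random sample, then cluster the full data set with that k."""
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span                      # min-max scaling
    idx = rng.choice(len(Xs), size=min(sample_size, len(Xs)), replace=False)
    sample = Xs[idx]
    w = entropy_weights(sample)
    scale = np.sqrt(w)          # scaling by sqrt(w) gives a w-weighted squared distance
    scores = {}
    for k in range(2, k_max + 1):
        labels, centers = kmeans(sample * scale, k, rng)
        scores[k] = calinski_harabasz(sample * scale, labels, centers)
    best_k = max(scores, key=scores.get)
    labels, _ = kmeans(Xs * scale, best_k, rng)
    return best_k, w, labels
```

Calling `auto_kmeans(X)` on an `(n, d)` NumPy array returns the selected k, the feature weights, and a label for every row. The sketch keeps the full distance matrix in memory, so a genuinely ultra-large data set would need the final assignment step to run in chunks.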
Source
Modern Electronics Technique (《现代电子技术》)
2014, No. 8, pp. 19-21 (3 pages)