期刊文献+

基于遗传算法的文档聚类算法的设计与仿真(英文) 被引量:4

Design and simulation of a document clustering algorithm based on genetic algorithm
下载PDF
导出
摘要 在各种聚类算法中,K-means是一种基于划分的经典算法.但是由于K-means方法对于初始中心点的选择非常敏感,有可能导致聚类结果收敛于局部,本文提出了一种基于遗传算法来对类中心点进行全局寻优的文档聚类算法.在传统相似度计算的方法中,文档相似矩阵为绝大部分元素为0的稀疏矩阵,忽略了关键字之间的部分相似性,影响了文档之间的相似度.为此,本文改变了传统相似度计算的方法,通过关键字之间的部分相似度,设计出更加精确的文档相似度计算公式.在遗传算法的设计中,将K个类中心点组成的矩阵作为初始个体,采用浮点数进行编码;适应度函数采用所有类内距离的均方差之和加1的倒数表示,当类内均方差之和越小,则个体的适应度越大,被选择进入下一代的概率也越大.通过选择、交叉和变异等步骤对聚类的中心点进行反复迭代寻优,最终找到最优的类中心点.通过实验仿真,K-means收敛速度快,聚类的平均目标函数大于genetic algorithm(GA)且正确率明显小于GA.本文提出的GA算法的分类正确率能达到98%以上,与传统的K-means方法相比,聚类的准确性更高,说明本文提出的算法是一种行之有效的文档聚类方法. Among various document algorithms, K-means is a classical one. However it is a greedy algorithm, which is sensitive to the choice of cluster center and is much easier to result in local optimization. As genetic algorithm (GA) is a global convergence algorithm and the best cluster center can be found easily, a new dynamic document clustering method based on GA is presented in this paper. Reviewing all kinds of traditional document clustering methods, the partial similarity of keywords was not taken into account, so the document similar matrix is a sparse matrix. To some extent, the accuracy of document similarity is influenced. In this paper, some new formulas are given which are improved based on the traditional method. The formulas take the partial similarity of keywords into account, thus improving the accuracy of the calculation of similarity. In this algorithm, the single individual is presented by a matrix which consists of K cluster centers. All individuals are encoded by floating-point numbers. The reciprocal of the sum of mean square deviation of intra class distance plus one is adopted as the fitness function. The smaller the fitness function, the littler probability that the individual can be selected to enter the next generation. The optimal cluster center is finally found by the following iteration process: selection, crossover, mutation and so on. The simulation results show that the accuracy of this classification can reach over 98 percent and the algorithm is superior to K-means in performance. Thus, the algorithm of this paper is an effective method of document clustering.
出处 《南京大学学报(自然科学版)》 CAS CSCD 北大核心 2009年第3期432-438,共7页 Journal of Nanjing University(Natural Science)
基金 National Natural Science Foundation of China(10771076)
关键词 文档聚类 遗传算法 相似度 类中心 document clustering, genetic algorithm, similarity, cluster center
  • 相关文献

参考文献13

  • 1杨建林.基于文献集相似度的分类方法[J].情报学报,1999,18(S1):92-94. 被引量:5
  • 2Casillas A,Gonzdlez de Lena M T,Martínez R.Document clustering into an unknown number of clusters using a genetic algorithm.International Conference on Text Speech and Dialogue,2003,43-49.
  • 3林春燕,朱东华.科学文献的模糊聚类算法[J].计算机应用,2004,24(11):66-67. 被引量:9
  • 4苗建新,吉根林.GML文档结构聚类算法Clu-GML[J].南京大学学报(自然科学版),2008,44(2):188-194. 被引量:8
  • 5Selim S Z,Ismail M A.K-means-type algorithms:a generalized convergence theorem characterization of local optimality.IEEE Transactions Pattern Analysis and Machine Intelligence,1984,6(1):81-87.
  • 6Bradley P S,Fayyad U M.Refining initial points for K-means clustering.Advance in Knowledge Discovery and Data Mining.Cambridge:MIT Press,1996.
  • 7Raymond T N,Han J W.Efficient and effective clustering methods for spatial data mining.Proceeding of the 20th VLDB Conference Santiago,Chile,1994,144-155.
  • 8索红光,王玉伟.一种用于文本聚类的改进k-means算法[J].山东大学学报(理学版),2008,43(1):60-64. 被引量:34
  • 9Shi Z.Efficient online spherical K-means lustering.Proceedings of the 2005 IEEE International Joint Conference on Neural Networks.Montreal,IEEE Press,2005,3180-3185.
  • 10曹付元,梁吉业,姜广.基于邻域模型的K-means初始聚类中心选择算法[J].计算机科学,2008,35(11):181-184. 被引量:6

二级参考文献58

共引文献116

同被引文献60

引证文献4

二级引证文献35

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部