
Research on Improving the K-means Algorithm Based on Co-citation Analysis of Academic Literature  Cited by: 4

Improvement of K-means Algorithm Based on Co-citation Analysis
Abstract  K-means is a widely used clustering algorithm, but choosing the number of clusters K and the initial cluster centers remains difficult. This paper proposes an improved K-means algorithm that selects both from a co-citation analysis of academic literature. The algorithm has two steps. The first step performs co-citation analysis on the set of academic papers to obtain a co-citation matrix, then runs hierarchical clustering on that matrix. At each iteration it records the distance between the papers merged into one cluster and the difference in that distance between two consecutive iterations. When this difference reaches its maximum, the number of clusters at that point is taken as K, and the cluster centers at that point are taken as the initial centers for the second step. The second step runs K-means on the content of the papers. Experiments comparing the method with classic K-means and with an improved K-means based on agglomerative hierarchical clustering show that the proposed algorithm achieves better clustering quality.
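The two-step procedure described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation. The abstract does not specify how co-citation counts become distances, which linkage the hierarchical step uses, or in which space the initial centers are computed. The sketch assumes a 1 minus Salton-cosine distance, average linkage, and centers taken as means of content vectors. All function names and the toy data are hypothetical, and every cited document is assumed to be cited at least once.

```python
from itertools import combinations
from math import sqrt


def cocitation_distance(reference_lists, n_docs):
    """Co-citation counts turned into distances (assumed: 1 - Salton cosine)."""
    C = [[0] * n_docs for _ in range(n_docs)]
    for refs in reference_lists:
        refs = sorted(set(refs))
        for i in refs:
            C[i][i] += 1              # times document i is cited
        for i, j in combinations(refs, 2):
            C[i][j] += 1              # times i and j are cited together
            C[j][i] += 1
    # Assumes every document is cited at least once (non-zero diagonal).
    return [[0.0 if i == j else 1.0 - C[i][j] / sqrt(C[i][i] * C[j][j])
             for j in range(n_docs)] for i in range(n_docs)]


def agglomerate(D):
    """Average-linkage agglomeration; partitions[t] is the clustering before merge t."""
    clusters = [[i] for i in range(len(D))]
    merge_distances, partitions = [], []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [D[i][j] for i in clusters[a] for j in clusters[b]]
            d = sum(pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        partitions.append([list(c) for c in clusters])
        merge_distances.append(d)
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merge_distances, partitions


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(X, centers, max_iter=100):
    """Plain Lloyd iterations started from the supplied centers."""
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centers)),
                      key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
                  for p in X]
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(X, labels) if lab == k]
            new_centers.append(mean_vector(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def two_step_kmeans(reference_lists, X):
    """Step 1 picks K and seeds from co-citation; step 2 clusters content vectors X."""
    merge_distances, partitions = agglomerate(cocitation_distance(reference_lists, len(X)))
    gaps = [b - a for a, b in zip(merge_distances, merge_distances[1:])]
    t = max(range(len(gaps)), key=gaps.__getitem__)   # largest jump between merges
    seeds = partitions[t + 1]                          # clustering just before the jump
    centers = [mean_vector([X[i] for i in c]) for c in seeds]
    labels, _ = kmeans(X, centers)
    return len(seeds), labels


# Toy data (hypothetical): each inner list is the reference list of one citing paper.
reference_lists = [[0, 1, 2], [0, 1], [1, 2], [0, 2],
                   [3, 4, 5], [3, 4], [4, 5], [3, 5], [2, 3]]
# Stand-in content vectors (e.g. term weights) for the six cited documents.
X = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]
k, labels = two_step_kmeans(reference_lists, X)
print(k, labels)
```

On this toy input the largest jump in merge distance occurs at the final merge, so the sketch selects K = 2 and the K-means step keeps documents 0-2 and 3-5 in separate clusters. The sketch needs at least three documents, since it compares consecutive merge distances.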
Source  Journal of the China Society for Scientific and Technical Information (《情报学报》; CSSCI, Peking University Core Journal), 2012, No. 1, pp. 82-94 (13 pages)
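Funding  This work was supported by the National Social Science Fund of China project "Research on Relevance Integration in Chinese Academic Information Retrieval Systems" (Grant No. 10CTQ027), the Ministry of Education Humanities and Social Sciences Research Planning Fund project "User-Oriented Relevance Criteria and Their Application" (Grant No. 07JA870006), and a cooperative research project of the Institute of Scientific and Technical Information of China.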
Keywords  K-means algorithm; number of clusters (K value); initial cluster centers; co-citation; document clustering

