Similarity matrix-based K-means algorithm for text clustering


Abstract: The K-means algorithm is one of the most widely used algorithms in cluster analysis. To address the problem caused by the random selection of initial center points in the traditional algorithm, this paper proposes an improved K-means algorithm based on the similarity matrix. The improved algorithm avoids the random selection of initial center points, so it provides effective initial points for the clustering process and reduces the fluctuation in clustering results that arises from the choice of initial points, yielding better clustering quality. The experimental results also show that the F-measure of the improved K-means algorithm is greatly improved and that the clustering results are more stable.
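The abstract does not spell out how the similarity matrix is used to choose the initial centers, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes documents are already represented as term-weight vectors, uses cosine similarity, and applies a hypothetical selection rule: take the document with the highest total similarity first, then repeatedly add the document least similar to the centers chosen so far. The function names `cosine_similarity_matrix` and `select_initial_centers` and the toy data are assumptions made for this example.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity matrix for term-weight vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0 or norms[j] == 0:
                continue  # similarity stays 0 for all-zero vectors
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def select_initial_centers(sim, k):
    """Deterministically pick k document indices as initial centers.

    Hypothetical rule (an assumption, not taken from the paper): start with
    the document that has the highest total similarity to all documents, then
    repeatedly add the document whose maximum similarity to the centers
    already chosen is lowest.
    """
    n = len(sim)
    first = max(range(n), key=lambda i: sum(sim[i]))
    centers = [first]
    while len(centers) < k:
        candidates = [i for i in range(n) if i not in centers]
        nxt = min(candidates, key=lambda i: max(sim[i][c] for c in centers))
        centers.append(nxt)
    return centers


# Toy term-weight vectors: documents 0 and 1 share a topic, as do 2 and 3.
docs = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.8],
    [0.0, 0.0, 0.9, 1.0],
]
sim = cosine_similarity_matrix(docs)
centers = select_initial_centers(sim, k=2)
print(centers)
```

Because the selection depends only on the similarity matrix, repeated runs on the same data give the same starting centers, which is the property the abstract credits for the more stable clustering results. The chosen indices would then seed an ordinary K-means loop in place of random initial points.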

Authors: 曹奇敏, 郭巧, 吴向华

Affiliation: School of Automation

Source: Journal of Beijing Institute of Technology (English Edition), indexed in EI and CAS; 2015, No. 4, pp. 566-572 (7 pages)

Keywords: text clustering; K-means algorithm; similarity matrix; F-measure

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]



