
Research on a K-means Text Clustering Algorithm Based on Automatic Subspace Variable Weighting (Cited by: 1)

STUDY ON K-MEANS TEXT CLUSTERING ALGORITHM BASED ON SUBSPACE VARIABLE SELF-WEIGHTING
Abstract  The traditional K-means algorithm is fast and easy to apply to text clustering, but it relies on all variables equally, so its clustering results are often unsatisfactory. To overcome this weakness, an improved K-means text clustering algorithm is proposed: during the K-means clustering process, it automatically computes and assigns a weight to each keyword within every cluster, with the more important keywords receiving larger weights. Through experiments and tests, an improved algorithm based on automatic subspace variable weighting was obtained that is suited to text data clustering analysis; it can effectively cluster large-scale, high-dimensional and sparse text data and produces clustering results of relatively high quality. The experimental results show that the K-means text clustering algorithm based on automatic subspace variable weighting is an effective clustering algorithm for large-scale text data.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, PKU Core Journal, 2008, No. 8, pp. 251-253 (3 pages)
Keywords: Text clustering; K-means; Variable weighting; Subspace
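
The abstract above does not give the exact weight-update formula, so the following Python sketch only illustrates the general idea of K-means with per-cluster, automatically computed feature weights, using an entropy-style weight update in the spirit of the EWKM algorithm listed among the co-cited literature. The function name weighted_kmeans and the smoothing parameter gamma are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weighted_kmeans(X, k, gamma=1.0, n_iter=50, seed=0):
    """Minimal sketch of K-means with per-cluster feature weights.

    X     : (n_docs, n_terms) dense document-term matrix (e.g. TF-IDF).
    k     : number of clusters.
    gamma : assumed smoothing parameter; smaller values concentrate the
            weights on fewer terms per cluster.
    Returns (labels, centers, weights); weights has shape (k, n_terms)
    and each row sums to 1.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    weights = np.full((k, d), 1.0 / d)            # start from equal weights

    for _ in range(n_iter):
        # 1. Assign each document to the cluster with the smallest
        #    feature-weighted squared Euclidean distance.
        dist = np.stack(
            [((X - centers[j]) ** 2 * weights[j]).sum(axis=1) for j in range(k)],
            axis=1)
        labels = dist.argmin(axis=1)

        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:                 # keep old center for empty clusters
                continue
            # 2. Re-compute the cluster center as the mean of its documents.
            centers[j] = members.mean(axis=0)
            # 3. Re-compute the feature weights: terms with small
            #    within-cluster dispersion D receive larger weights.
            D = ((members - centers[j]) ** 2).sum(axis=0)
            w = np.exp(-(D - D.min()) / gamma)
            weights[j] = w / w.sum()

    return labels, centers, weights
```

Applied to a TF-IDF matrix (e.g. `labels, centers, weights = weighted_kmeans(tfidf.toarray(), k=10)`), each row of `weights` indicates which terms that cluster's subspace emphasizes, matching the abstract's description of important keywords receiving larger weights.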
  • Related Literature

References (5)

Secondary References (20)

  • 1 Gao Xinbo, Ji Hongbing. A fuzzy C-means clustering algorithm based on feature weighting [J]. Journal of Xidian University, 2000, 27(10): 80-83.
  • 2 HUANG Zhe-xue. Extensions to the k-means algorithm for clustering large data sets with categorical values [J]. Data Mining and Knowledge Discovery, 1998, 2(1): 283-304.
  • 3 HUANG Zhe-xue. Clustering large data sets with mixed numeric and categorical values [A]. Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining [C]. Singapore: World Scientific, 1997: 21-34.
  • 4 HAN Jia-wei, KAMBER M. Data Mining: Concepts and Techniques [M]. Beijing: Higher Education Press, 2001.
  • 5 Sebastiani F. Machine learning in automated text categorization [J]. ACM Computing Surveys, 2002, 34(1): 1-47.
  • 6 Mitchell T. Machine Learning [M]. New York: McGraw Hill, 1996.
  • 7 Fierro R D, Berry M W. Efficient computation of the Riemannian SVD in TLS problems in information retrieval [A]. In: van Huffel S, Lemmerling P (Eds.). Total Least Squares and Errors-In-Variables Modeling: Analysis, Algorithms, and Applications [C]. Boston: Kluwer Academic Publishers, 2002: 349-360.
  • 8 Hofmann T. Gaussian latent semantic models for collaborative filtering [C]. 26th Annual International ACM SIGIR Conference, 2003.
  • 9 HAN J, KAMBER M. Data Mining: Concepts and Techniques [M]. Translated by Fan Ming, Meng Xiaofeng. Beijing: China Machine Press, 2001.
  • 10 BERSON A, SMITH S, THEARLING K. Building Data Mining Applications for CRM [M]. Beijing: Posts & Telecom Press, 2001.

Co-citing Literature (31)

Co-cited Literature (7)

  • 1 Gao G, Wu J, Yang Z. A fuzzy subspace clustering algorithm for clustering high dimensional data [C]//Li X, Zaiane O R, Li Z. Proc of the ADMA. Berlin, Heidelberg: Springer-Verlag, 2006: 271-278.
  • 2 Sung A H, Mukkamala S. Feature selection for intrusion detection using neural networks and support vector machines [J]. Journal of the Transportation Research Board of the National Academies, 2005, 1822: 33-39.
  • 3 Jing L, Ng M K, Huang J Z. An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data [J]. IEEE Trans on Knowledge and Data Engineering, 2007, 19(8): 1-16.
  • 4 Hotelling H. Analysis of a complex of statistical variables into principal components [J]. Journal of Educational Psychology, 1933, 24(6): 417-441.
  • 5 Chu Y, Chen Y, Yang D, et al. Reducing redundancy in subspace clustering [J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(10): 1432-1446.
  • 6 Shan Shimin, Yan Yan, Zhang Xianchao. Subspace clustering algorithm based on k-most-similar clustering [J]. Computer Engineering, 2009, 35(14): 4-6. (Cited by: 8)
  • 7 Chen Lifei, Guo Gongde, Jiang Qingshan. An adaptive soft subspace clustering algorithm [J]. Journal of Software, 2010, 21(10): 2513-2523. (Cited by: 33)

Citing Literature (1)

Secondary Citing Literature (2)
