
Improved K-means algorithm based on Latent Dirichlet Allocation for short text clustering
Cited by: 6
Abstract: Short text clustering is hampered by the sparsity of feature words and by the complexity of processing a high-dimensional space. Because microblog posts are limited in length and their features are sparse, the resulting feature vectors are high-dimensional, which blurs the clustering results. This paper uses the Latent Dirichlet Allocation (LDA) topic model to model the training data and extends the features of the original microblog posts with topic terms, which enriches the text features used for clustering and improves the clustering results. The experiments combine the K-means and Canopy clustering algorithms to process the text data and propose the LKC algorithm, which compensates for the sensitivity of K-means to the choice of initial cluster centers. The results show higher accuracy and a higher clustering F1-measure: the F1 value improved by 10%, and the accuracy improved by 2%.
Authors: FENG Jing (冯靖); MO Xiu-liang (莫秀良); WANG Chun-dong (王春东) (School of Computer Science and Engineering, Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin 300384, China)
Source: Journal of Tianjin University of Technology (《天津理工大学学报》), 2018, No. 3, pp. 7-11 (5 pages)
Funding: Tianjin Municipal Science and Technology Commission fund (15JCYBJC15600)
Keywords: short text; LDA; K-means clustering; Canopy clustering
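
The abstract says that Canopy clustering is combined with K-means to reduce the sensitivity of K-means to its initial cluster centers. The record does not give the details of the LKC algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each document has already been turned into a numeric feature vector (for example, LDA topic proportions). The function names, the threshold `t2`, and the toy data are all hypothetical. Only the tight Canopy threshold is used, because the sketch needs the canopy centers and not the canopy memberships.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2, seed=0):
    """Pick canopy centers using only the tight threshold t2 (hypothetical name)."""
    rng = random.Random(seed)
    remaining = list(points)
    centers = []
    while remaining:
        # Take a random remaining point as the next canopy center.
        center = remaining.pop(rng.randrange(len(remaining)))
        centers.append(center)
        # Points within t2 of the center can no longer become centers.
        remaining = [p for p in remaining if distance(p, center) > t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Lloyd's K-means, started from the given centers instead of random ones."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: distance(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy vectors standing in for per-document topic proportions (made-up data).
docs = [
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85],
]
centers = canopy_centers(docs, t2=0.5)
labels, final_centers = kmeans(docs, centers)
print(len(centers), labels)
```

In this sketch the number of clusters is not chosen by hand: it is the number of canopies, so the threshold `t2` controls how many clusters K-means receives and where it starts.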
