Improved text clustering algorithm based on probabilistic latent semantic analysis (cited by: 14)
Abstract: The Probabilistic Latent Semantic Analysis (PLSA) model is trained with the Expectation Maximization (EM) algorithm. Because the model parameters are initialized randomly, the clustering result tends to overfit and depends heavily on the initial values, and the iteration converges to a local maximum rather than a global one. The authors derive probabilities from the parameters of the Latent Semantic Analysis (LSA) model and use them to initialize the parameters of the PLSA model for document clustering. The improved algorithm effectively addresses the random-initialization problem of EM. Experiments show that the proposed method clearly improves both the Normalized Mutual Information (NMI) and the accuracy of text clustering.
Source: Journal of Computer Applications (《计算机应用》), CSCD, PKU Core Journal, 2011, No. 3, pp. 674-676, 693 (4 pages)
Funding: China Postdoctoral Science Foundation (20070420711); Chongqing Science and Technology Commission Foundation (2008BB2191)
Keywords: document clustering; Probabilistic Latent Semantic Analysis (PLSA); parameter initialization; Latent Semantic Analysis (LSA)
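
The idea summarized in the abstract, replacing the random start of EM with values derived from LSA, can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how the LSA parameters are turned into probabilities, so the mapping used here (absolute values of the truncated SVD factors, smoothed and renormalized) is an assumption, and the names `lsa_init`, `plsa_em`, and `cluster_labels` are hypothetical. The EM updates follow the standard symmetric PLSA formulation with P(z), P(d|z), and P(w|z).

```python
import numpy as np

def lsa_init(X, k, eps=1e-6):
    """Derive PLSA starting values from a rank-k truncated SVD of X (docs x words).

    Assumption: absolute values + renormalization stand in for the paper's
    (unspecified here) way of making the LSA parameters probabilistic.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    p_z = s[:k] / s[:k].sum()                    # P(z), shape (k,)
    p_d_z = np.abs(U[:, :k]).T + eps             # shape (k, n_docs)
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)    # rows are P(d|z)
    p_w_z = np.abs(Vt[:k]) + eps                 # shape (k, n_words)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)    # rows are P(w|z)
    return p_z, p_d_z, p_w_z

def plsa_em(X, p_z, p_d_z, p_w_z, n_iter=50, tiny=1e-12):
    """Run EM for symmetric PLSA from the given starting values.

    Uses a dense (k, n_docs, n_words) array, so it only suits small data.
    """
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z) P(d|z) P(w|z).
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post = joint / (joint.sum(axis=0, keepdims=True) + tiny)
        # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w).
        weighted = X[None, :, :] * post
        p_w_z = weighted.sum(axis=1)
        p_d_z = weighted.sum(axis=2)
        p_z = weighted.sum(axis=(1, 2))
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + tiny
        p_d_z /= p_d_z.sum(axis=1, keepdims=True) + tiny
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z

def cluster_labels(p_z, p_d_z):
    """Assign each document to the topic z that maximizes P(z) P(d|z)."""
    return np.argmax(p_z[:, None] * p_d_z, axis=0)

# Toy term-count matrix (4 documents x 4 words), made up for illustration.
X = np.array([[5, 4, 0, 0],
              [4, 5, 0, 0],
              [0, 0, 3, 1],
              [0, 0, 1, 3]], dtype=float)
p_z, p_d_z, p_w_z = plsa_em(X, *lsa_init(X, k=2))
print(cluster_labels(p_z, p_d_z))
```

Because the starting values are computed deterministically from the data, repeated runs give the same clustering, which is the property the abstract contrasts with random initialization.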