Document Clustering Based on Probabilistic Topic Model
Abstract: To cluster corpora of ordinary text documents and corpora of digital books effectively, clustering algorithms based on the standard LDA (Latent Dirichlet Allocation) model and on the TC-LDA model are proposed, respectively. TC-LDA extends LDA for digital-book corpora by jointly modeling topics from a book's table of contents (Contents) and its body text (Texts). Unlike traditional clustering methods, topic-model-based methods place documents in the same cluster when they share a common topic. Experimental results show that clustering algorithms built on topic analysis outperform traditional clustering algorithms.
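Source: Acta Electronica Sinica (《电子学报》), 2012, No. 11, pp. 2346-2350 (5 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China, Young Scientists Fund (No. 61103171, No. 61103099); Zhejiang Provincial Public Welfare Technology Application Research Program (No. 2011C31048).
Keywords: topic model; LDA model; TC-LDA model; document clustering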
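
The abstract states only that documents sharing a topic are grouped together; it does not give the assignment rule or the TC-LDA details. The following is a minimal sketch of one plausible reading for the plain LDA case: estimate each document's topic distribution with a small collapsed Gibbs sampler, then put each document in the cluster of its most probable topic. The function names `lda_gibbs` and `cluster_by_dominant_topic`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are all assumptions made for illustration, not taken from the paper.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA.

    docs is a list of token lists. Returns theta, the estimated
    document-topic distribution (one row per document).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    n_dk = [[0] * num_topics for _ in docs]       # topic counts per document
    n_kw = [[0] * V for _ in range(num_topics)]   # word counts per topic
    n_k = [0] * num_topics                        # total tokens per topic
    z = []                                        # topic of every token

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    # Resample each token's topic from its full conditional.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Smoothed per-document topic proportions.
    return [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


def cluster_by_dominant_topic(theta):
    """Assumed rule: a document joins the cluster of its most probable topic."""
    clusters = defaultdict(list)
    for d, row in enumerate(theta):
        dominant = max(range(len(row)), key=row.__getitem__)
        clusters[dominant].append(d)
    return dict(clusters)


# Toy corpus (hypothetical): two documents about clustering, two about books.
docs = [
    "cluster document cluster topic model".split(),
    "topic model cluster document".split(),
    "book chapter library book catalog".split(),
    "library catalog book chapter".split(),
]
theta = lda_gibbs(docs, num_topics=2)
clusters = cluster_by_dominant_topic(theta)
print(clusters)
```

Hard assignment to the dominant topic is the simplest choice. A rule that lets a document join every topic whose probability exceeds a threshold would be closer to the "one or more common topics" wording, at the cost of overlapping clusters.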
