

The documents classification algorithm based on LDA
Abstract: Latent Dirichlet Allocation (LDA) is a classic topic model that can extract latent topics from large document collections and classify text. The model assumes that if a document is relevant to a topic, then every word in the document is relevant to that topic. On large-scale real-world data, this assumption widens the scope of each topic, so the latent semantics of topic words cannot be located accurately, which limits the robustness and effectiveness of the model. To address this weakness, this paper proposes gLDA, a document topic classification algorithm that adds a topic-category distribution parameter to LDA in order to narrow the scope from which each document's topics are generated: each document is generated from the category to which it is most relevant. Gibbs sampling is used for approximate inference. Preliminary experiments on the Reuters-21578 dataset and the Fudan University text corpus show that the model improves classification performance to some degree over traditional topic classification models.
Source: Journal of Tianjin University of Technology (《天津理工大学学报》), 2014, No. 4, pp. 28-31 (4 pages)
Funding: National Natural Science Foundation of China (61202169, 61170027)
Keywords: topic model; LDA; text classification
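
The abstract states that Gibbs sampling is used for approximate inference but gives no equations. As a point of orientation only, the following is a minimal sketch of collapsed Gibbs sampling for plain LDA, not the paper's gLDA: the topic-category distribution parameter is not specified in this record and is therefore not modelled. The function name `lda_gibbs`, its parameters, and the default values of `alpha` and `beta` are assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative; NOT the paper's gLDA).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (doc_topic, topic_word, assignments), where the first two are count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assignments = []

    # Randomly initialise the topic assignment of every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, assignments


# Toy corpus: word ids 0-1 dominate two documents, word ids 2-3 the other two.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 0, 1, 1, 0], [3, 2, 2, 3, 2]]
doc_topic, topic_word, _ = lda_gibbs(docs, num_topics=2, vocab_size=4)
print(doc_topic)
```

Each row of `doc_topic` holds one document's per-topic token counts, so a document could be labelled by its dominant topic. A category-restricted variant in the spirit of the abstract would presumably limit which topics a document may draw from, but that step is left out here because the record does not describe it.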










