An LDA-Based Text Classification Algorithm

Original title: 基于LDA的文本分类算法 (The documents classification algorithm based on LDA)

Abstract: Latent Dirichlet Allocation (LDA) is a classic topic model that can extract latent topics from large document collections and classify text. The model assumes that if a document is relevant to a topic, then all words in the document are relevant to that topic. On large-scale real-world data, however, this assumption broadens the scope of each topic, so the latent semantics of topic words cannot be located precisely, which limits the model's robustness and effectiveness. To address this drawback, this paper proposes gLDA, a new document topic classification algorithm. gLDA adds a topic-category distribution parameter to LDA to narrow the scope from which each document is generated, so that a document is generated from the category to which it is most relevant, improving classification accuracy. Gibbs sampling is used for approximate inference. Results on the Reuters-21578 dataset and the Fudan University text corpus show that the model improves classification performance to some degree over traditional topic classification models.
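The abstract names Gibbs sampling as the inference method but gives no update equations. As a point of reference only, the following is a minimal sketch of collapsed Gibbs sampling for standard LDA, not for gLDA: the topic-category distribution parameter described in the paper is not modeled here. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the integer word-id encoding are assumptions made for illustration and do not come from the paper.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Collapsed Gibbs sampling for standard LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids
          in the range [0, vocab_size).
    Returns (doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                                # total tokens per topic
    z = []                                                # topic assignment per token

    # Assign every token to a random topic and record the counts.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised full conditional p(z = t | all other assignments).
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_dk, n_kw
```

Calling `lda_gibbs([[0, 1, 0, 1], [2, 3, 3, 2]], num_topics=2, vocab_size=4)` returns the per-document topic counts and per-topic word counts, from which smoothed topic proportions can be estimated. A category-restricted variant in the spirit of gLDA would additionally constrain which topics a document may draw from, but the abstract does not give enough detail to reproduce that step.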

Authors: He Jinqun (何锦群), Liu Pengjie (刘朋杰)

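Affiliations: School of Computer and Communication Engineering, Tianjin University of Technology; Key Laboratory of Mobile Computing and Data Mining; Key Laboratory of Computer Vision and System, Ministry of Education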

Source: Journal of Tianjin University of Technology (《天津理工大学学报》), 2014, No. 4, pp. 28-31 (4 pages)

Funding: National Natural Science Foundation of China (61202169, 61170027)

Keywords: topic model; LDA; text classification

Classification number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
