Topic Discovery and Clustering for Online Journals Based on LDA Algorithm (original title: 基于LDA模型的网络刊物主题发现与聚类)

Cited by: 4
Abstract: With the spread of smart devices, demand for text topic mining is growing across many domains. Topic modeling is the core of text topic mining. The LDA (latent Dirichlet allocation) generative model is a probabilistic model built on a Bayesian framework; it relies on semantic association and handles the extraction of latent topics from text well. The paper describes and analyzes in some depth the core techniques of the text clustering process, including the LDA generative model, data sampling, and model evaluation. Topic discovery and clustering experiments were carried out on 2,794 learning journals from an online education platform, and a vocabulary of 3,800 terms was built. Topic clustering was solved in two steps using the k-means algorithm and the union vector method (UVM). A general method for text mining experiments is proposed, along with an improvement to the text distance algorithm used in hierarchical clustering. The experimental results show that the overall topic similarity of the platform's journals is fairly good, but the topics are so concentrated that the content of many journals is not distinctive, which makes it harder for users to locate topics.
Authors: YANG Chuanchun (杨传春); ZHANG Bingxue (张冰雪); LI Rende (李仁德); GUO Qiang (郭强)
Affiliations: Research Center of Complex Systems Science, University of Shanghai for Science and Technology, Shanghai 200093, China; MPA Education Center, University of Shanghai for Science and Technology, Shanghai 200093, China
Source: Journal of University of Shanghai for Science and Technology (《上海理工大学学报》), 2019, No. 3, pp. 273-280, 306 (9 pages in total). Indexed in: CAS, CSCD, PKU Core (北大核心).
Keywords: LDA model; generative model; topic discovery; hierarchical clustering; text mining
References: 8

Secondary references: 71

Co-citing documents: 164

Co-cited documents: 51

Citing documents: 4

Secondary citing documents: 14
