Selection Method of LDA Optimal Topic Number Based on Frequent Word Network
Original title: 基于频繁词网络的LDA最优主题个数选取方法
Cited by: 5
Abstract: The latent Dirichlet allocation (LDA) topic model is widely used in large-scale document processing, typically for topic extraction, sentiment analysis, and text dimensionality reduction. Such models use an algorithm similar to expectation-maximization to extract low-dimensional semantic distributions from a document collection and combine the distribution of each dimension to form topics. During model construction, the initial number of topics K strongly affects both the iterative process and the result. To address this problem, and based on the observation that the number of document clusters (that is, the number of communities) is consistent with the number of latent topics in the document set, a method is proposed that uses the number of communities found in a frequent word set network as the number of topics supplied to the LDA topic model. The method builds frequent word pairs from the documents, constructs a word co-occurrence network from them, partitions that network with an unsupervised community partition algorithm, and takes the resulting number of communities as the number of LDA topics. Experimental results show that the method can specify the topic number K automatically, significantly improves topic precision and recall, and yields more independent topics.
Authors: LI Fei-fei (李菲菲); WANG Yi-zhi (王移芝) (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
Source: Computer Technology and Development (《计算机技术与发展》), 2018, No. 8, pp. 1-5 (5 pages)
Funding: National Natural Science Foundation of China (K13A300050)
Keywords: latent Dirichlet allocation (LDA); topic model; frequent word network; clustering; community partition
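
The pipeline summarized in the abstract (frequent word pairs, then a word co-occurrence network, then community partition, then K) can be illustrated with a minimal sketch. The abstract does not name the community partition algorithm or the support threshold the authors use, so everything here is an assumption for illustration: the function names, the `min_support` parameter, and the choice of a simple weighted label propagation as a stand-in for the unsupervised partition step.

```python
from collections import Counter
from itertools import combinations


def frequent_word_pairs(docs, min_support=2):
    """Count each word pair by the number of documents containing both words,
    and keep pairs that reach min_support (an assumed threshold)."""
    counts = Counter()
    for tokens in docs:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def build_network(pairs):
    """Build a weighted, undirected co-occurrence graph as an adjacency dict."""
    graph = {}
    for (u, v), weight in pairs.items():
        graph.setdefault(u, {})[v] = weight
        graph.setdefault(v, {})[u] = weight
    return graph


def label_propagation(graph, max_iter=100):
    """Stand-in community partition (not necessarily the paper's algorithm):
    each node repeatedly adopts the label with the largest total edge weight
    among its neighbours; ties go to the alphabetically smallest label."""
    labels = {node: node for node in graph}
    for _ in range(max_iter):
        changed = False
        for node in sorted(graph):
            score = Counter()
            for neighbour, weight in graph[node].items():
                score[labels[neighbour]] += weight
            best = min(score, key=lambda lab: (-score[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


def estimate_topic_number(docs, min_support=2):
    """Use the number of detected communities as the LDA topic number K."""
    graph = build_network(frequent_word_pairs(docs, min_support))
    return len(set(label_propagation(graph).values()))


if __name__ == "__main__":
    # Toy corpus of tokenized documents with two obvious themes.
    docs = [
        ["apple", "banana", "cherry"],
        ["apple", "banana", "cherry"],
        ["dog", "eagle", "fox"],
        ["dog", "eagle", "fox"],
    ]
    print(estimate_topic_number(docs))  # prints 2
```

The resulting value could then be passed as the topic count to an LDA implementation, for example as `num_topics` in gensim's `LdaModel` or `n_components` in scikit-learn's `LatentDirichletAllocation`. On real corpora the estimate will depend heavily on tokenization, stop-word removal, and the support threshold.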