Abstract
The LDA (latent Dirichlet allocation) topic model is widely used in large-scale document processing, typically for topic extraction, sentiment analysis, and text dimensionality reduction. These models use an expectation-maximization-style algorithm to extract low-dimensional semantic distributions from a document collection and combine the dimensions of those distributions to form topics. During model construction, the initial number of topics K strongly affects both the iterative process and the final result. To address this problem, based on the observation that the number of document clusters (i.e., communities) is consistent with the number of latent topics in a document set, we propose a method that uses the number of communities detected in a frequent-word-set network to specify the number of topics for the LDA topic model. The method extracts frequent word pairs from the documents, builds a word co-occurrence network from those pairs, partitions the network with an unsupervised community-detection algorithm, and takes the resulting number of communities as the number of topics for the LDA model. Experimental results show that the method can set the topic number K automatically, significantly improves topic precision and recall, and yields more independent topics.
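The pipeline described in the abstract (frequent word pairs, then a word co-occurrence network, then a community count used as K) can be sketched in a few lines. The sample documents, the `min_support` threshold, and the use of connected components as a crude stand-in for the paper's unsupervised community-detection step (the abstract does not name the specific algorithm) are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(docs, min_support=2):
    """Count word pairs that co-occur within a document; keep pairs
    appearing in at least min_support documents (the frequent word set)."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.split()))  # unique words, stable pair order
        for pair in combinations(words, 2):
            counts[pair] += 1
    return {p for p, c in counts.items() if c >= min_support}

def community_count(pairs):
    """Number of connected components in the co-occurrence graph,
    a simplistic proxy for a real community-detection algorithm."""
    parent = {}  # union-find over the words appearing in frequent pairs

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    return len({find(x) for x in parent})

# Toy corpus: two obvious themes (fruit vs. cars).
docs = [
    "apple banana fruit",
    "apple fruit juice",
    "car engine wheel",
    "car wheel road",
]
pairs = frequent_pairs(docs, min_support=2)
K = community_count(pairs)  # would then be passed to LDA as the topic count
print(K)  # 2
```

On this toy corpus only the pairs (apple, fruit) and (car, wheel) reach the support threshold, giving two components and hence K = 2, which would then be handed to an LDA trainer as its topic-count parameter.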
Authors
LI Fei-fei (李菲菲) and WANG Yi-zhi (王移芝), School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Source
Computer Technology and Development (《计算机技术与发展》), 2018, No. 8, pp. 1-5 (5 pages)
Funding
National Natural Science Foundation of China (K13A300050)