Abstract
Traditional topic models based on the multinomial distribution do not capture word burstiness in documents well. To address this while also exploiting the temporal information inherent in a corpus, a continuous-time topic model based on the Dirichlet Compound Multinomial (DCM) distribution is proposed. The model uses the DCM distribution to model word burstiness and the Beta distribution to characterize the temporal features of the corpus; its parameters are estimated by Gibbs sampling and fixed-point iteration. Experimental results show that, when the preset number of topics is small, the model generalizes clearly better than the ToT and DCMLDA models, and that it can effectively reveal the latent evolution of topics in the corpus.
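As a rough illustration of why a DCM distribution suits bursty text, the sketch below evaluates the standard DCM (Polya) probability mass function for a single document's word counts and compares it with a plain multinomial. It is not the paper's full model: the topic structure, the Beta-distributed timestamps, and the Gibbs sampler are all omitted, and the function names `dcm_log_pmf` and `multinomial_log_pmf`, the three-word vocabulary, and the parameter values are assumptions made only for this example.

```python
from math import lgamma, log, exp


def dcm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a DCM (Polya) distribution.

    counts: non-negative integer counts, one per vocabulary word.
    alpha:  positive Dirichlet parameters, one per vocabulary word.
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient n! / prod(x_w!)
    log_coeff = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Gamma(A) / Gamma(A + n)
    log_norm = lgamma(a) - lgamma(a + n)
    # prod Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_terms = sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return log_coeff + log_norm + log_terms


def multinomial_log_pmf(counts, probs):
    """Log-probability of a word-count vector under a multinomial distribution."""
    n = sum(counts)
    log_coeff = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    return log_coeff + sum(x * log(p) for x, p in zip(counts, probs) if x > 0)


# Hypothetical toy document: one word repeated four times (a "burst").
bursty_doc = [4, 0, 0]
alpha = [0.1, 0.1, 0.1]        # small alpha: mass concentrates on few words
uniform = [1 / 3, 1 / 3, 1 / 3]  # the mean word distribution implied by alpha

print("DCM:        ", exp(dcm_log_pmf(bursty_doc, alpha)))
print("Multinomial:", exp(multinomial_log_pmf(bursty_doc, uniform)))
```

With these toy values the DCM assigns the bursty document a probability of about 0.24, against about 0.012 for the multinomial with the same mean. Once a word has appeared, the DCM makes its reappearance more likely, which is the burstiness effect the abstract refers to.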
Source
《计算机工程》
CAS
CSCD
PKU Core (Peking University Core Journals)
2016, No. 11, pp. 195-201 (7 pages)
Computer Engineering
Funding
National Natural Science Foundation of China (61462022)