Abstract
To address the content sparsity and lack of contextual information inherent in short texts, this paper proposes LF-BTM, a biterm topic model that incorporates word vector features and is built on the biterm topic model (BTM). The model introduces a latent feature model so that rich word vector information can compensate for content sparsity. In the modified generative process, each word of a biterm is generated under the joint influence of the topic-word multinomial distribution and the latent feature model. The model parameters are estimated by Gibbs sampling. Experimental results on real-world short-text datasets show that, by drawing on word vectors pre-trained on external, general-purpose large-scale corpora, the model discovers topics with significantly improved semantic coherence.
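The joint influence of the two components on word generation can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes a two-component mixture in the style of latent-feature topic models: with weight `lam` a word comes from a softmax over dot products between a topic vector and the word vectors, otherwise from the topic-word multinomial. The names `lam`, `tau`, `mixed_word_distribution`, and `generate_biterm` are hypothetical and not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_word_distribution(phi_t, tau_t, word_vectors, lam):
    """Word distribution for one topic under the assumed mixture.

    phi_t        : topic-word multinomial for topic t (sums to 1)
    tau_t        : topic vector for topic t (assumed parameter)
    word_vectors : one pre-trained vector per vocabulary word
    lam          : assumed mixing weight of the latent feature component
    """
    scores = [sum(a * b for a, b in zip(tau_t, vec)) for vec in word_vectors]
    latent = softmax(scores)
    return [(1.0 - lam) * p + lam * q for p, q in zip(phi_t, latent)]


def generate_biterm(theta, phi, tau, word_vectors, lam, rng):
    """Draw a topic from theta, then both words of the biterm from that topic."""
    z = rng.choices(range(len(theta)), weights=theta)[0]
    dist = mixed_word_distribution(phi[z], tau[z], word_vectors, lam)
    w1, w2 = rng.choices(range(len(dist)), weights=dist, k=2)
    return z, w1, w2
```

Setting `lam = 0` in this sketch recovers the plain BTM word distribution, while `lam = 1` relies on the word vectors alone; estimating the parameters by Gibbs sampling, as the paper does, is outside the scope of the sketch.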
Source
《计算机应用研究》
CSCD
PKU Core Journal (北大核心)
2017, Issue 7, pp. 2055-2058 (4 pages)
Application Research of Computers
Funding
National Natural Science Foundation of China (Grant No. 61462022)