Abstract
LDA (Latent Dirichlet Allocation) does not incorporate word vectors during training, whereas LF-LDA (Latent Feature LDA) uses Word2vec word vectors to improve the topic distributions of documents. However, representing a document by its topic distribution alone ignores the contextual information carried by feature words. We therefore propose a text representation that combines the topic vectors generated by LF-LDA with Word2vec word vectors. In addition, we propose a document representation based solely on the topic vectors generated by LF-LDA. Classification results on a Stack Overflow short-text dataset show that the LF-LDA-plus-Word2vec representation outperforms both the LDA-plus-Word2vec representation and the representation based on the LF-LDA topic distribution alone, and that the topic-vector-based representation outperforms the LDA model.
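The abstract describes combining LF-LDA topic vectors with Word2vec word vectors into a single document representation. A minimal sketch of one plausible combination, using placeholder NumPy arrays in place of trained LF-LDA and Word2vec outputs (the exact fusion rule, names, and dimensions here are assumptions, not the paper's specification):

```python
import numpy as np

# Toy dimensions: K topics, word/topic vectors of size d -- assumed for illustration.
K, d = 4, 8
rng = np.random.default_rng(0)

# Topic vectors as LF-LDA would produce (one d-dim vector per topic) -- placeholders.
topic_vectors = rng.normal(size=(K, d))

# Word2vec vectors for the words of one document -- placeholders.
doc_word_vectors = rng.normal(size=(5, d))

# The document's topic distribution theta, as inferred by LF-LDA -- placeholder.
theta = rng.dirichlet(np.ones(K))

# Topic view: expectation of the topic vectors under theta.
topic_part = theta @ topic_vectors          # shape (d,)

# Word view: mean of the document's word vectors.
word_part = doc_word_vectors.mean(axis=0)   # shape (d,)

# Combined representation: concatenate the two views into one 2*d feature vector.
doc_repr = np.concatenate([topic_part, word_part])
print(doc_repr.shape)
```

The concatenated vector can then be fed to any standard classifier; concatenation is only one reasonable way to fuse the two views, chosen here for simplicity.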
Source
《电子技术(上海)》
2017, No. 7, pp. 1-5 (5 pages)
Electronic Technology