Abstract
Topic evolution analysis mines how the content of a topic changes over time. Working from the assumption that a topic's content can be represented by keywords, this paper trains word2vec on 750,000 news and microblog texts to obtain a word vector model. The processed text stream is fed into the model to obtain the word vectors of all words along the time series, and K-means is used to cluster those vectors, which yields the topic keywords and allows the topic evolution to be visualized. Experiments compare the topic extraction results with those of the PLSA and LDA topic models and find that the word vector approach performs better than both. In addition, collecting a sufficiently large and varied corpus makes it possible to train a model with fairly strong generalization ability, which benefits research on real-time topic evolution analysis.
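The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the 2-D vectors below are hand-made stand-ins for word2vec embeddings, the words are invented for illustration, and the names `kmeans` and `topic_keywords` are hypothetical helpers. It only shows how K-means over word vectors can group words into topic keyword sets, under those assumptions.

```python
import math

# Toy 2-D "word vectors" (assumption: stand-ins for real word2vec embeddings,
# which would have far more dimensions and come from a trained model).
vectors = {
    "earthquake": (0.90, 0.10),
    "election":   (0.10, 0.90),
    "rescue":     (0.80, 0.20),
    "vote":       (0.20, 0.80),
    "aftershock": (0.85, 0.15),
    "candidate":  (0.15, 0.95),
}

def kmeans(vectors, k, iters=20):
    """Plain Lloyd's K-means; the first k words (dict order) seed the centroids."""
    words = list(vectors)
    centroids = [list(vectors[w]) for w in words[:k]]
    assign = {}
    for _ in range(iters):
        # Assignment step: each word goes to its nearest centroid (Euclidean).
        new_assign = {
            w: min(range(k), key=lambda c: math.dist(vectors[w], centroids[c]))
            for w in words
        }
        if new_assign == assign:
            break  # converged: no word changed cluster
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[w] for w in words if assign[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

def topic_keywords(assign, centroids, vectors, top_n=3):
    """For each cluster, list up to top_n words nearest to its centroid."""
    topics = []
    for c, centroid in enumerate(centroids):
        members = [w for w in vectors if assign[w] == c]
        members.sort(key=lambda w: math.dist(vectors[w], centroid))
        topics.append(members[:top_n])
    return topics

assign, centroids = kmeans(vectors, k=2)
topics = topic_keywords(assign, centroids, vectors)
for i, words in enumerate(topics):
    print(f"topic {i}: {words}")
```

Running the same clustering separately on the vocabulary of each time window would give one keyword set per window, which is one way the keyword sets could be compared over time. How the paper slices the text stream and chooses the number of clusters is not stated in the abstract.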
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
PKU Core (北大核心)
2016, No. 11, pp. 2368-2374 (7 pages)
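Funding
National Social Science Fund of China project (12BYY045)
Guangdong Province Philosophy and Social Sciences "Twelfth Five-Year" Planning Project (GD15YTS01)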