Abstract
For large-scale temporal document collections, existing methods for discovering similar and different topics cannot identify, track, and analyze how topics change over time. To address this, a topic change detection method for temporal document corpora is proposed. The method discovers both similar topics and different topics in a temporal document corpus. An improved joint Nonnegative Matrix Factorization (NMF) algorithm extracts topic sets from multiple data sets. To avoid introducing noise topics, the topic entropy of every topic is computed so that only high-quality topics are retained, and word clouds and trend graphs are used to analyze how topics change. Experimental results on two real data sets, 20Newsgroups and LTN2011, show that the method effectively discovers similar and different topics in temporal document collections, and that the extracted topics are of good quality and high accuracy.
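The entropy-based filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's definition of topic entropy, so this example assumes the Shannon entropy of each topic's normalized word weights, and further assumes that lower entropy (weight concentrated on few words) indicates a higher-quality topic. The names `topic_entropy`, `select_topics`, and `max_entropy` are hypothetical and are not taken from the paper.

```python
import math

def topic_entropy(weights):
    """Shannon entropy (in nats) of one topic's word weights.

    Assumption: the nonnegative weights are normalized into a
    probability distribution over words before the entropy is taken.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("a topic needs at least one positive word weight")
    probs = [w / total for w in weights if w > 0]  # skip zeros: 0 * log 0 = 0
    return -sum(p * math.log(p) for p in probs)

def select_topics(topic_word_weights, max_entropy):
    """Return indices of topics whose entropy is at most max_entropy.

    Assumption: low-entropy topics are the ones kept; the threshold
    is a hypothetical tuning parameter.
    """
    return [
        i for i, row in enumerate(topic_word_weights)
        if topic_entropy(row) <= max_entropy
    ]

# Toy topic-word matrix: one peaked topic, one uniform (noisy) topic.
topics = [
    [0.90, 0.05, 0.05, 0.00],
    [0.25, 0.25, 0.25, 0.25],
]
kept = select_topics(topics, max_entropy=1.0)
```

In this toy input the uniform topic has the maximum possible entropy for four words, so a threshold below that value filters it out while the peaked topic is retained.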
Source
《计算机工程》
CAS
CSCD
Peking University Core Journals (北大核心)
2018, Issue 1, pp. 35-43 (9 pages)
Computer Engineering
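Funding
Scientific Research Program of the Science and Technology Commission of Shanghai Municipality (16511102702)
Project of the Shanghai Municipal Commission of Economy and Informatization (150643)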
Keywords
Joint Nonnegative Matrix Factorization (NMF)
topic model
temporal similar and different topics
high-quality topic
topic change detection