Abstract
Periodic topic mining is an active research problem in data mining. Most existing studies are limited to time-series databases and cannot be applied directly to text data. To address this shortcoming, this paper proposes a partition-based periodic topic mining method (PTMP). First, topics are divided into three types: periodic topics, background topics, and bursty topics. The time-stamp distribution of each periodic topic is then modeled as a mixture of Gaussian distributions. To alleviate the problem of background noise, the time-stamps of background topics are generated from a uniform distribution, while the time-stamps of bursty topics are generated from a Gaussian distribution. Fitting this mixture model to time-stamped text data reveals the periodic topics along with their time distributions. To show the effectiveness of the method, several representative datasets were collected, including Seminar, DBLP, and Flickr.
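The three time-stamp distributions named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how the periodic mixture is parameterized, so the code assumes equally weighted Gaussian components centered at `offset + k * period` with a shared standard deviation. All function and parameter names are hypothetical.

```python
import math


def gaussian_pdf(t, mean, std):
    """Density of a normal distribution N(mean, std^2) at time t."""
    z = (t - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def periodic_density(t, offset, period, std, t_start, t_end):
    """Periodic topic: mixture of Gaussians over time-stamps.

    Assumption (not stated in the abstract): components are equally
    weighted, share one std, and sit at offset + k * period for every
    integer k whose center falls inside [t_start, t_end].
    """
    first_k = math.ceil((t_start - offset) / period)
    last_k = math.floor((t_end - offset) / period)
    centers = [offset + k * period for k in range(first_k, last_k + 1)]
    if not centers:
        raise ValueError("no period falls inside the time span")
    return sum(gaussian_pdf(t, c, std) for c in centers) / len(centers)


def background_density(t, t_start, t_end):
    """Background topic: uniform density over the whole time span."""
    return 1.0 / (t_end - t_start) if t_start <= t <= t_end else 0.0


def bursty_density(t, peak, std):
    """Bursty topic: a single Gaussian around the burst time."""
    return gaussian_pdf(t, peak, std)


if __name__ == "__main__":
    # Hypothetical weekly topic observed over 28 days, peaking on day 2 of each week.
    for day in (9.0, 12.5):
        print(day,
              periodic_density(day, 2.0, 7.0, 0.5, 0.0, 28.0),
              background_density(day, 0.0, 28.0))
```

Under these assumptions, a time-stamp near a recurring peak scores higher under the periodic density than under the uniform background, and a time-stamp between peaks scores lower. That contrast is how the uniform component can absorb background noise.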
Source
Microcomputer Applications (《微型电脑应用》)
2014, Issue 8, pp. 21-26 (6 pages)
Keywords
Periodic Topic
Data Mining
Mixture of Gaussian Distributions
Noise
Time-Stamps