期刊文献+

面向热点话题时间序列的有效聚类算法研究 被引量:31

An Efficient and Effective Clustering Algorithm for Time Series of Hot Topics
下载PDF
导出
摘要 聚类热度时间序列是揭示和建模网络热点话题形成与发展的重要过程.Leskovec等人在2010年提出面向话题时间序列的K_SC聚类算法,其精确度较高且能较好地刻画话题内在发展趋势特征.但K_SC算法具有对初始类矩阵中心高度敏感、高时间复杂度等特性,使其难以在实际高维大数据集上应用.文中结合小波变换技术,提出一个新的迭代式聚类算法WKSC,主要提出两个创新:(1)用Haar小波变换将原始时间序列进行压缩,降低原始时间序列的维度,从而降低了算法的时间复杂度;(2)在Haar反小波变换中,将低维聚类返回得到的矩阵中心作为高维聚类的初始矩阵中心,在迭代聚类过程中优化了对初始矩阵中心高敏感性的问题,提高了聚类的效果.文中分别采用国内外3个数据集作为测试样本,进行了大量的实验.实验结果表明WKSC算法能显著降低聚类的时间复杂度,同时改进聚类效果.WKSC算法可很好的应用于大量高维热点话题的模式分析. Hot degree time series clustering is very important for revealing and modeling development process of hot topics in Web sites. In 2010, Leskovec and his colleagues proposed a K-Spectral Centroid (K_SC) time series clustering algorithm, which has higher accuracy and can be used to better describe the trend of hot topics. But K_SC algorithm is sensitive to the initialization of cluster centers and has high time complexity. Therefore, it is difficult to directly apply K_SC to high dimensional data. Based on wavelet transform technology, a new iteration clustering algorithm--WKSC is proposed in this paper, which has two improvements: (1) the original time series are compressed by Haar wavelets transform to lower dimensions of the original time series. WKSC algorithm groups topics based on lower dimensions time series and the time complexity is reduced; (2) the clustering results from previous iteration of K_SC are used as the initial assignment at the high level, then the high sensitivity to cluster centers is solved. Three datasets from different sources were selected and comprehensive experiments were conducted. Experimental results show that WKSC algorithm can significantly reduce time complexity, and improve the quality of clustering result, which means WKSC algorithm can l^e used on massive and high dimension hot topics.
出处 《计算机学报》 EI CSCD 北大核心 2012年第11期2337-2347,共11页 Chinese Journal of Computers
基金 国家自然科学基金(61170112) 北京市属高等学校科学技术与研究生教育创新工程建设项目(PXM2012_014213_000037)资助~~
关键词 聚类 时间序列 热点话题 小波 clustering time series hot topics wavelet
  • 相关文献

参考文献17

  • 1Szabo G, Huberman B A. Predicting the popularity of online content. Communications of the ACM, 2010, 53(8): 80-88.
  • 2Kumar R, Novak J, Raghavan Pet al. On the bursty evolu- tion of blogspace//Proceedings of the World Wide Web Conference on Internet and Web Information Systems. Washington, USA, 2005:159-178.
  • 3Mei Q, Liu C, Su H et al. A probabilistic approach to spatiotemporal theme pattern mining on weblogs//Proceedings of the 15th International Conference on World Wide Web. New York, USA, 2006: 533-542.
  • 4Crane R, Sornette D. Robust dynamic classes revealed by measuring the response function of a social system//Proceed ings of the National Academy of Sciences of the United States of America. Washington, USA, 2008:15649-15653.
  • 5Malmgren R D, Stouffer D B, Motter A E et al. A Poisso- nian explanation for heavy tails in email communication// Proceedings of the National Academy of Sciences of the Unit- ed States of America. New York, USA, 2008:18153-18158.
  • 6Barabasi A L. The origin of bursts and heavy tails in human dynamics. Nature, 2005, 435(7039): 207-211.
  • 7Yang J, Leskovec J. Patterns of temporal variation in online media//Proeeedings of the 4th ACM International Conference on Web Search and Data Mining. New York, USA, 2011: 177-186.
  • 8Chan F K P, Fu A W C, Yu C. Haar wavelets for efficient similarity search of time-series: With and without time war ping. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(3) : 686-705.
  • 9李斌,谭立湘,章劲松,庄镇泉.面向数据挖掘的时间序列符号化方法研究[J].电路与系统学报,2000,5(2):9-14. 被引量:29
  • 10李爱国,覃征.在线分割时间序列数据[J].软件学报,2004,15(11):1671-1679. 被引量:27

二级参考文献35

  • 1Das Gautam, Lin David, Mannila Heikki, et al. Rule Discovery from Time Series. In: Proc. Fourth Annual Conference on Knowledge Discovery and Data Mining.
  • 2Andrea Baraldi, Ethem Alpaydm. Simplified ART: A New Class of ART Algorithms. Technique Report of INTERNATIONAL COMPUTER SCIENCE INSTITUTE , California: Berkeley, 1998.
  • 3Keogh E. Fast Similarity Search in the Presence of Longitudinal Scaling in Time Series Databases. In: Proceedings of the 9th international Conference on Tools with Artificial Intelligence. [s. 1.]: IEEE Press, 1997. 578-584
  • 4Keogh E. Fast similarity search in the presence of longitudinal scaling in time series databases[C]. In:Proceedings of the IEEE 9th International Conference on Tools with Artificial Intelligence,Washington: IEEE Computer Society, 1997. 578-584
  • 5Keogh E, Folias T. The UCR Time Series Data Mining Archive[EB/OL]. http://www. cs. ucr. edu/- eamonn/TSDMA/index.html. Irvine, CA: University of California, Department of Informarion and Computer Science, 2002
  • 6Prat K B, Fink E. Search for patterns in compressed time series[J]. International Journal of Image and Graphics, 2002, 2(1): 89-106
  • 7Keogh E J, Chakrabarti K, Pazzani M J. Sharad Mehrotra. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases[J]. Knowl. Inf. Syst,2001, 3(3): 263-286
  • 8Yi B K,Faloustsos C. Fast Time Sequence Indexing for Arbitrary Lp Norms [C]. In:Proceedings of the 26th International Conference on Very Large Data Bases, San Francisco: Morgan Kaufmann Publishers Inc, 2000. 385-394
  • 9Xiao Hui, Feng Xiao-Fei, Hu Yun-Fu. A new segmented time warping distance for data mining in time series database [C]. In:Proceedings of 2004 International Conference on Machine Learning and Cybernetics,Shanghai, China, 2004. 1277-1281
  • 10Keogh E,Kasetty S.On the need for time series data mining benchmarks:A survey and empirical demonstration//Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.Edmonton,Alberta,Canada,2002:102-111

共引文献144

同被引文献294

引证文献31

二级引证文献184

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部