期刊文献+

基于流信息距离的多文本流热点挖掘 被引量:5

Mining Hotspots from Multiple Text Streams Based on Stream Information Distance
下载PDF
导出
摘要 把文本流中的热点区分为局部热点和全局热点,分析了二者的相关性,并将Kolmogorov复杂度应用于多文本流中的热点挖掘.首先,定义了基于Kolmogorov复杂度的冗余信息的概念,并论证了文本流存在局部热点的必要条件是冗余信息超过某个阈值;其次,基于条件Kolmogorov复杂度提出了一个相似性度量指标——流信息距离(stream information distance,简称SID),以衡量不同文本流之间的相似度;并借鉴计算生物学领域中的种系发生树的思想,提出了一种基于层次聚类的多文本流全局热点挖掘启发式算法.在合成和真实数据集的实验,验证了算法的收敛性、有效性和规模可伸缩性. This paper characterizes the local and global hotspots in text streams and elaborates their correlation. The paper then applies Kolmogorov complexity to mining the hotspots in multiple text streams. The Redundant Information is defined based on Kolmogorov complexity, and it has been demonstrated that the Redundant Information exceeding a threshold is necessary for the local hotspots. Secondly, a similarity metric, termed as Stream Information Distance (SID), is suggested based on the conditional Kolmogorov complexity to quantify the similarity between different text streams. Borrowing ideas of Phylogeny originated from Computational Biology, a heuristic algorithm based on hierarchical clustering is proposed to mine the global hostspots from multiple text streams. Finally, the convergency, effectiveness, and scalability of this algorithm are validated by the extensive experiments over synthetic and real data set.
出处 《软件学报》 EI CSCD 北大核心 2011年第8期1761-1770,共10页 Journal of Software
基金 国家自然科学基金(600773169) 国家科技支撑计划(2006BAI05A01)
关键词 热点挖掘 多文本流 流信息距离 冗余信息 Kolmogorov复杂度 hotspot mining multiple text streams stream information distance redundant information Kolmogorov complexity
  • 相关文献

参考文献21

  • 1Parikh N, Sundaresan N. Scalable and near real-time burst detection from eCommerce queries. In: Proc.of the ACM KDD 2008. New York: ACM, 2008.972-980. [doi: 10.1145/1401890.1402006].
  • 2Lappas T, Arai B, Platakis M, Kotsakos D, Gunopulos D. On burstiness-aware search for document sequences. In: Proc.of the ACM KDD 2009. New York: ACM, 2009. 477-485.
  • 3Fung GPC, Yu JX, Yu PS, Lu HJ. Parameter free bursty events detection in text streams. In: Proc. of the VLDB 2005. Trondheim: VLDB Endowment, 2005. 181 - 192.
  • 4Allan J, Carbonell J, Doddington G, Yamron J, Yang YM. Topic detection and tracking pilot study: Final report. In: Proc. of the DARPA Broadcast News Transcription and Understanding Workshop. Arlington: NSF, 1998.65-74.
  • 5Brants T, Chen F, Farahat A. A system for new event detection. In: Proc. of the 26th ACM SIGIR Int'l Conf. on Research and Development in Information Retrieval (SIGIR 2003). Toronto: ACM, 2003. 102-112. [doi: 10.1145/860435.860495].
  • 6Yang YM, Ault T, Pierce T, Lattimer CW. Improving text categorization methods for event tracking. In: Belkin NJ, Ingwersen P, Leong MK, eds. Proc. of the SIGIR 2000. Athens: ACM, 2000.65-72. [doi: 10.1145/345508.345550].
  • 7Kleinberg J. Bursty and hierarchical structure in streams. In: Proc. of KDD 2002. Edmonton: ACM, 2002.91-101. [doi: 10.1145/ 775047.775061 ].
  • 8Kumar R, Novak J, Raghavan P, Tomkins A. On the bursty evolution of blogspace. In: Proc. of the WWW 2003. Budapest: ACM, 2003. 568-576. [doi: 10.1145/775152.775233].
  • 9He Q, Chang KY, Lim EP. Analyzing feature trajectories for event detection. In: Proc. of the SIGIR 2007. Amsterdam: ACM, 2007. 186-197. [doi: 10.1145/1277741.1277779].
  • 10Vlachos M, Meek C, Vagcna Z, Gunopulos D. Identifying similarities, periodicities and bursts for online search queries. In: Proc. of the SIGMOD 2004. New York: ACM, 2004. 131-142. [doi: 10.1145/1007568.1007586].

同被引文献48

引证文献5

二级引证文献17

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部