
Parallel Implementing Bursty Events Detection Using Map Reduce
(Chinese title: 突发事件检测的MapReduce并行化实现)

Cited by: 3
Abstract: [Objective] To detect bursty events in a specific domain accurately and quickly from text streams in a big data environment. [Methods] Kleinberg's burst detection method and the LDA topic model are extended to the MapReduce parallel framework, yielding parallel corpus preprocessing, parallel bursty word detection, parallel bursty document filtering, and parallel topic extraction. [Results] In simulation experiments on a news text stream, the parallel method reaches at best a precision (P) of 87.50%, a recall (R) of 77.78%, and an F-measure (F, the harmonic mean of P and R) of 82.35% for domain-specific bursty event detection. [Limitations] A MapReduce-based parallel method is poorly suited to online, real-time detection of bursty events in large-scale dynamic text streams. [Conclusions] Compared with traditional serial bursty event detection, the distributed parallel method preserves the correctness of the detection results while scaling well and substantially improving performance.
Source: New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, 2015, No. 2, pp. 46-54 (9 pages)
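Funding: This work is one of the research outputs of the following projects: National Social Science Fund of China project "Research on Library Semantic Cloud Services Based on Linked Data" (No. 12CTQ009); National Social Science Fund of China major project "Research on a Rapid-Response Intelligence System for Emergency Decision-Making" (No. 13&ZD174); National Natural Science Foundation of China general program project "Research on Knowledge Organization Models and Applications for Knowledge Services" (No. 71273126); Jiangsu Provincial Social Science Fund youth project "Research on Digital Reading Promotion Based on Semantic Cloud Services" (No. 14TQC003).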
Keywords: bursty event detection; MapReduce; distributed processing; LDA; topic model
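
The F-measure quoted in the abstract is described there as the harmonic mean of precision and recall. The following is a minimal sketch, assuming the standard balanced F-measure definition (the record does not spell out the formula, and the function name `f_measure` is illustrative only), that checks whether the three reported figures are consistent with one another:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Best values reported in the abstract, expressed as fractions.
reported_p = 0.8750
reported_r = 0.7778
reported_f = 0.8235

# Recompute F from the reported P and R (assumed definition, see lead-in).
computed_f = f_measure(reported_p, reported_r)
print(f"F = {computed_f:.4f}")
```

Under this assumed definition, the value recomputed from the reported precision and recall agrees with the reported 82.35% once rounded to two decimal places in percent.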

