Microblog events detection and tracking with incremental hierarchical DBSCAN based on representative posts using cloud framework

Original title (Chinese): 云计算环境下基于代表点增量层次密度聚类的微博事件检测及跟踪

Cited by: 3


Abstract: To extract news events from the large volume of real-time posts generated by microblogging services, a complete microblog event detection and tracking algorithm for a cloud computing environment is proposed. First, a new term-weighting method based on the number of forwards and comments each post receives is used to represent microblog text in the Vector Space Model (VSM). Keywords are then extracted with RIHDBSCAN (Incremental Hierarchical DBSCAN based on Representative posts), which yields the detection and tracking of news events. Because a single node cannot process massive microblog data quickly and efficiently, the algorithm is deployed on Hadoop, a cloud computing platform. Experiments on real data collected from the Sina microblogging platform show that the proposed weighting method outperforms TF-IDF (Term Frequency-Inverse Document Frequency) and UF-ITUF (User Frequency-Inverse Thread User Frequency), and that the cloud framework improves processing speed, making the approach suitable for analysis and mining of massive datasets.
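The abstract does not give the weighting formula, so the following is only an illustrative sketch of the general idea: a term weight in the vector space model that is scaled by how often a post was forwarded and commented on. The function names, the logarithmic `engagement_factor`, and the smoothed IDF are all assumptions made for this example and should not be read as the authors' method.

```python
import math
from collections import Counter


def engagement_factor(forwards, comments):
    # Hypothetical boost (an assumption, not the paper's formula):
    # grows slowly with forwards + comments and equals 1.0 when both are 0.
    return 1.0 + math.log1p(forwards + comments)


def weighted_vectors(posts):
    """Build one sparse term vector per post.

    posts: list of (tokens, forwards, comments) tuples.
    Each weight is tf * smoothed idf * engagement_factor.
    """
    n = len(posts)
    # Document frequency: number of posts containing each term.
    df = Counter()
    for tokens, _, _ in posts:
        df.update(set(tokens))

    vectors = []
    for tokens, forwards, comments in posts:
        tf = Counter(tokens)
        boost = engagement_factor(forwards, comments)
        vec = {
            term: (count / len(tokens))
            * (math.log((1 + n) / (1 + df[term])) + 1.0)
            * boost
            for term, count in tf.items()
        }
        vectors.append(vec)
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy posts: identical text, very different engagement, plus an unrelated post.
posts = [
    (["earthquake", "sichuan"], 100, 50),
    (["earthquake", "sichuan"], 0, 0),
    (["concert"], 0, 0),
]
vectors = weighted_vectors(posts)
```

In this sketch the heavily forwarded post gets larger term weights than the identical post with no engagement. Cosine similarity ignores that uniform scaling, so a weighting of this kind would only influence clustering through steps that depend on vector magnitude, such as choosing representative posts or ranking keywords.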

Authors: Feng Yong (冯永), Han Nan (韩楠), Jia Dongfeng (贾东风)

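Affiliations: Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education (Chongqing University); College of Computer Science, Chongqing University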

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 12, pp. 3559-3562, 3595 (5 pages)

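Funding: National Natural Science Foundation of China (61103114); National Key Technology R&D Program (2012BAH19F00); Fundamental Research Funds for the Central Universities (106112013CDJZR185502); Chongqing Higher Education Teaching Reform Research Key Project (112023)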

Keywords: microblog; events detection; Density-Based Spatial Clustering of Applications with Noise (DBSCAN); cloud computing; Hadoop platform; representative post

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]


