Abstract
To extract news events from the large volume of real-time posts generated by microblogging services, a complete event detection and tracking algorithm for a cloud computing environment is proposed. First, a new term-weighting method based on the number of forwards and comments a post receives is used to represent microblog text in the Vector Space Model (VSM). Keywords are then extracted with RIHDBSCAN (Incremental Hierarchical DBSCAN based on Representative posts), which realizes the detection and tracking of news events. Because a single node cannot process massive microblog data quickly and efficiently, the algorithm is deployed on Hadoop, a cloud computing platform. Experiments on real data collected from the Sina microblogging platform show that the proposed weighting method achieves higher performance than TF-IDF (Term Frequency-Inverse Document Frequency) and UF-ITUF (User Frequency-Inverse Thread User Frequency), and that the cloud framework improves processing speed. The approach is therefore suitable for analysis and mining of massive datasets.
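The abstract does not give the weighting formula itself, so the following is only a minimal sketch of what a forward- and comment-aware term weight could look like. It assumes a plain TF-IDF weight multiplied by a logarithmic engagement boost; the function name `popularity_weighted_vectors`, the field names `tokens`, `forwards` and `comments`, and the boost formula are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def popularity_weighted_vectors(posts):
    """Build sparse VSM vectors (term -> weight) for a list of posts.

    Each post is a dict with 'tokens' (list of str), 'forwards' (int)
    and 'comments' (int). The weight is TF * IDF * boost, where the
    boost is an ASSUMED formula, not the one defined in the paper.
    """
    n_posts = len(posts)

    # Document frequency: in how many posts each term appears.
    doc_freq = Counter()
    for post in posts:
        doc_freq.update(set(post["tokens"]))

    vectors = []
    for post in posts:
        term_counts = Counter(post["tokens"])
        total_terms = len(post["tokens"]) or 1  # avoid division by zero

        # Assumed engagement boost: equals 1.0 for a post with no
        # forwards or comments and grows logarithmically otherwise.
        boost = 1.0 + math.log1p(post["forwards"] + post["comments"])

        vector = {}
        for term, count in term_counts.items():
            tf = count / total_terms
            idf = math.log(n_posts / doc_freq[term]) + 1.0
            vector[term] = tf * idf * boost
        vectors.append(vector)
    return vectors


if __name__ == "__main__":
    sample = [
        {"tokens": ["earthquake", "rescue"], "forwards": 100, "comments": 50},
        {"tokens": ["earthquake", "rescue"], "forwards": 0, "comments": 0},
    ]
    for vec in popularity_weighted_vectors(sample):
        print(vec)
```

In this sketch two posts with identical text receive different vectors when their engagement differs, which is the general idea of letting widely forwarded posts contribute more to keyword extraction. The resulting vectors would then be passed to a clustering step such as RIHDBSCAN, which is not shown here.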
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal list (北大核心)
2013, Issue 12, pp. 3559-3562, 3595 (5 pages)
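Funding
National Natural Science Foundation of China (61103114)
National Key Technology R&D Program of China (2012BAH19F00)
Fundamental Research Funds for the Central Universities (106112013CDJZR185502)
Key Project of Chongqing Higher Education Teaching Reform Research (112023)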
Keywords
microblog
event detection
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
cloud computing
Hadoop platform
representative post