Abstract
This paper designs and implements a Hadoop-based system for monitoring online public opinion. The system uses HDFS as its underlying storage layer and builds an HBase-based distributed database on top of it to store and manage public opinion data in a unified way. First, a distributed web crawler based on MapReduce collects the data, addressing the low efficiency and poor scalability of single-machine crawlers. Second, a two-stage clustering algorithm that combines Canopy with K-means overcomes the shortcomings of using K-means alone and improves the efficiency and accuracy of text clustering. Finally, a query-based topic tracking strategy tracks and analyzes hot topics. Simulation experiments show that, compared with traditional K-means, the Canopy-Kmeans method lowers the miss rate and the false alarm rate by 1.24% and 0.09%, respectively, and lowers the minimum normalized cost by 1.681%. By providing visualized public opinion analysis reports, the system gives enterprises and other organizations scientific, systematic technical support for tracking hot topics promptly and formulating public opinion strategies.
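The two-stage clustering named in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's MapReduce implementation. It assumes dense numeric vectors and Euclidean distance, whereas the paper clusters text and its distance measure is not stated here. The function names and the thresholds `t1` and `t2` are hypothetical. The idea shown is the common one: a cheap Canopy pass proposes the number of clusters and their initial centers, and K-means then refines them.

```python
import math


def distance(a, b):
    """Euclidean distance (an assumption; text clustering often uses cosine)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def canopy_centers(points, t1, t2):
    """Stage 1: rough Canopy pass with loose threshold t1 > tight threshold t2.

    Returns one initial center per canopy, so the number of canopies
    also fixes k for the K-means stage.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        # Every point within t1 of the seed belongs to this canopy.
        members = [seed] + [p for p in remaining if distance(seed, p) < t1]
        centers.append(mean(members))
        # Points within t2 of the seed cannot seed another canopy.
        remaining = [p for p in remaining if distance(seed, p) >= t2]
    return centers


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))


def kmeans(points, centers, max_iter=100):
    """Stage 2: standard K-means, started from the Canopy centers."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Keep the old center if a cluster ends up empty.
        updated = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    labels = [nearest(p, centers) for p in points]
    return centers, labels


if __name__ == "__main__":
    docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    initial = canopy_centers(docs, t1=5.0, t2=3.0)
    final_centers, labels = kmeans(docs, initial)
    print(len(initial), labels)
```

In this sketch the thresholds control how many canopies form, so they would need tuning for real document vectors. A distributed version would run both stages as MapReduce jobs rather than as in-memory loops.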
Source
《计算机技术与发展》
2016, Issue 2, pp. 144-149 (6 pages)
Computer Technology and Development
Funding
Shandong Academy of Sciences Youth Fund Project (2013QN036)
Shandong Province Science and Technology Development Plan (2013GGX10127, 2014GGX101013)