Abstract
With the rapid growth of internet data, a system architecture based on a distributed platform is proposed to address the problem of collecting and analyzing that data effectively. The architecture consists of three modules: a crawler module, a Web module, and a distributed platform. The crawler module is responsible for data collection; the Web module handles simple jobs and presents the analysis results visually; the distributed platform provides data storage and the computation of complex jobs. Together, the three modules offer a practical solution for crawling, storing, and analyzing massive amounts of data on the internet. Finally, an application case on the social networking site Sina Weibo verifies the usability of the proposed distributed public opinion analysis architecture.
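The division of responsibilities among the three modules can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the in-memory partitions standing in for the distributed platform, the map/reduce job shape, and the term-frequency job are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Post(NamedTuple):
    """A crawled item (hypothetical shape)."""
    author: str
    text: str


class CrawlerModule:
    """Data collection. A real crawler would fetch pages over the network;
    here the source is an in-memory list (assumption)."""

    def __init__(self, source: Iterable[Post]):
        self._source = list(source)

    def crawl(self) -> List[Post]:
        return list(self._source)


class DistributedPlatform:
    """Stand-in for storage and complex-job computation: posts are spread
    over partitions, and a job is a map step per partition plus a reduce."""

    def __init__(self, num_partitions: int = 2):
        self.partitions: List[List[Post]] = [[] for _ in range(num_partitions)]

    def store(self, posts: Iterable[Post]) -> None:
        # Round-robin placement across partitions.
        for i, post in enumerate(posts):
            self.partitions[i % len(self.partitions)].append(post)

    def size(self) -> int:
        return sum(len(p) for p in self.partitions)

    def run_job(self, map_fn: Callable, reduce_fn: Callable):
        partials = [map_fn(part) for part in self.partitions]
        return reduce_fn(partials)


class WebModule:
    """Handles simple jobs itself, delegates complex ones to the platform,
    and renders results for display."""

    def __init__(self, platform: DistributedPlatform):
        self.platform = platform

    def post_count(self) -> int:
        # Simple job: answered without a map/reduce pass.
        return self.platform.size()

    def top_terms(self, n: int) -> List[Tuple[str, int]]:
        # Complex job: per-partition term counts, merged in the reduce step.
        def count_terms(part: List[Post]) -> Counter:
            return Counter(w for post in part for w in post.text.lower().split())

        def merge(partials: List[Counter]) -> Counter:
            return sum(partials, Counter())

        return self.platform.run_job(count_terms, merge).most_common(n)

    @staticmethod
    def render(counts: List[Tuple[str, int]]) -> str:
        # Text bar chart as a placeholder for the visual display.
        return "\n".join(f"{term:<10} {'#' * n} {n}" for term, n in counts)


if __name__ == "__main__":
    posts = [
        Post("a", "traffic jam downtown"),
        Post("b", "traffic delays again"),
        Post("c", "downtown traffic cleared"),
    ]
    platform = DistributedPlatform(num_partitions=2)
    platform.store(CrawlerModule(posts).crawl())
    web = WebModule(platform)
    print(web.post_count())
    print(web.render(web.top_terms(2)))
```

In this sketch the Web module never touches the partitions directly for complex work; it only submits a map function and a reduce function, which mirrors the split between simple jobs and complex jobs described above.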
Source
Telecommunications Science (《电信科学》)
Peking University Core Journal list (北大核心)
2013, No. 7, pp. 66-71 (6 pages)
Keywords
distributed system architecture
public opinion analysis
crawler
visualization