Abstract
This paper addresses current problems in collecting, mining, and analyzing big data from micro-blog platforms. It introduces and analyzes the theory and technology underlying data collection platforms and studies four aspects: the functional structure and back-end database design of the collection platform, page crawling and parsing, the design of anti-crawler application techniques, and the distributed strategy design. On this basis, a distributed micro-blog data collection platform is designed and implemented, and a master-slave system architecture is given. The user only needs to enter the ID of the micro-blog page to be crawled and select the type of data to collect in order to obtain the required data. In testing, the system proved inexpensive to build and delivered high crawling performance; it can be used for basic data collection in research such as public opinion analysis and data surveys based on micro-blog data.
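The master-slave arrangement mentioned in the abstract can be pictured with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption for illustration: the names `CrawlTask`, `fetch_stub`, `worker`, and `run_master` are hypothetical, the data-type labels are made up, threads in one process stand in for separate slave nodes, and the fetch step is a stub that performs no network access.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlTask:
    page_id: str    # ID of the micro-blog page to crawl (entered by the user)
    data_type: str  # type of data requested; the labels used below are hypothetical


def fetch_stub(task):
    """Placeholder for the real crawl-and-parse step; performs no network I/O."""
    return {"page_id": task.page_id, "data_type": task.data_type, "items": []}


def worker(tasks, results):
    """Slave loop: take tasks from the shared queue until a None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:
            break
        results.put(fetch_stub(task))


def run_master(requests, n_workers=3):
    """Master: start the slaves, enqueue every task, then collect one result per task."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for req in requests:
        tasks.put(req)
    for _ in threads:
        tasks.put(None)  # one stop sentinel per slave
    for t in threads:
        t.join()
    return [results.get() for _ in requests]
```

A call such as `run_master([CrawlTask("1001", "posts")])` returns one result dictionary per submitted task. Result order is not guaranteed, because the slaves finish independently.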
Authors
严慧
彭绪富
朱小婉
熊旭辉
董叶豪
YAN Hui; PENG Xu-fu; ZHU Xiao-wan; XIONG Xu-hui; DONG Ye-hao (College of Computer Science and Technology, Hubei Normal University, Huangshi 435002, China; College of Arts and Science, Hubei Normal University, Huangshi 435002, China; College of Educational Science, Hubei Normal University, Huangshi 435002, China)
Source
《湖北师范大学学报(自然科学版)》
2019, No. 1, pp. 19-25 (7 pages in total)
Journal of Hubei Normal University: Natural Science
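Funding
Hubei Provincial Program for Outstanding Young and Middle-aged Science and Technology Innovation Teams in Higher Education Institutions (T201430)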