
Design and implementation of distributed data collection system based on Scrapy-Redis
Cited by: 7
Abstract: To address the collection, mining, and analysis of big data from the micro-blog (Weibo) platform, this paper introduces and analyzes the theory and technologies underlying a collection platform. It studies four aspects: the platform's functional structure and back-end database design, page crawling and parsing, the design of techniques for handling anti-crawler mechanisms, and the distributed strategy design. On this basis, a distributed micro-blog data collection platform with a master-slave system architecture is designed and implemented. The user only needs to enter the ID of the micro-blog page to be crawled and select the type of data to collect in order to obtain the desired data. Testing shows that the system is inexpensive to build and achieves high crawling performance, and that it can supply the basic data collection for research such as public opinion analysis and data surveys based on micro-blog data.
Authors: YAN Hui; PENG Xu-fu; ZHU Xiao-wan; XIONG Xu-hui; DONG Ye-hao (College of Computer Science and Technology, Hubei Normal University, Huangshi 435002, China; College of Arts and Science, Hubei Normal University, Huangshi 435002, China; College of Educational Science, Hubei Normal University, Huangshi 435002, China)
Source: Journal of Hubei Normal University (Natural Science), 2019, No. 1, pp. 19-25 (7 pages)
Funding: Hubei Province Program for Outstanding Young and Middle-aged Science and Technology Innovation Teams in Institutions of Higher Education (T201430)
Keywords: micro-blog platform; data collection; distributed; web crawler; Scrapy-Redis
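
The master-slave arrangement mentioned in the abstract can be illustrated with a small sketch. Scrapy-Redis keeps the pending request queue and the set of request fingerprints in Redis so that several crawler processes share them; the code below replaces Redis with in-memory Python structures so it runs without a server. It is not the authors' implementation: the class and function names, the example page ID, and the data type labels ("posts", "comments") are all assumptions made for illustration.

```python
import hashlib
from collections import deque


class Master:
    """In-memory stand-in for the shared queue and fingerprint set
    that a Scrapy-Redis deployment would keep in Redis (assumption)."""

    def __init__(self):
        self.queue = deque()  # pending requests shared by all workers
        self.seen = set()     # fingerprints of requests already scheduled

    @staticmethod
    def fingerprint(page_id, data_type):
        # Hash the request identity so duplicates are cheap to detect.
        key = f"{page_id}:{data_type}".encode("utf-8")
        return hashlib.sha1(key).hexdigest()

    def push(self, page_id, data_type):
        """Enqueue a request unless an identical one was scheduled before."""
        fp = self.fingerprint(page_id, data_type)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append((page_id, data_type))
        return True

    def pop(self):
        """Return the next pending request, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


def run_workers(master, worker_names):
    """Let workers take turns pulling requests; record who handled what."""
    handled = []
    turn = 0
    while True:
        request = master.pop()
        if request is None:
            break
        worker = worker_names[turn % len(worker_names)]
        handled.append((worker, request))
        turn += 1
    return handled


if __name__ == "__main__":
    master = Master()
    # Hypothetical user input: a page ID plus the data types to collect.
    for data_type in ("posts", "comments", "posts"):
        master.push("1234567890", data_type)
    for worker, request in run_workers(master, ["slave-1", "slave-2"]):
        print(worker, request)
```

The repeated "posts" request is dropped by the fingerprint check, so each page and data type pair is handed to exactly one worker. In a real deployment the workers would be separate processes on different machines, and the fetching, parsing, and anti-crawler handling would happen where this sketch only records the assignment.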
