Abstract
With the rapid development of Internet technology and the fast growth of network data, collecting and analyzing massive data quickly and effectively has become an urgent problem in big data analysis and its applications. This paper implements a distributed web crawler with a master-slave structure based on the Scrapy framework and deploys it with the open-source project Scrapy-Redis to crawl and analyze topics on Zhihu. In total, 44,346 topics, 94,688 answers, and 31,202 user records were crawled, and a multidimensional analysis using visualization techniques was carried out from three perspectives: topics, answers, and users. The results show a significant linear correlation between the topics discussed in an open online Q&A community and factors such as users' gender, geographical distribution, and professional background. The method can be extended to big data applications such as automatic pattern recognition and online public opinion prediction.
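The master-slave crawling idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the shared request queue and the set of request fingerprints, which a Scrapy-Redis deployment keeps in Redis, can be imitated by in-memory structures, and that workers take turns instead of running on separate machines. The `LINKS` graph, the `SharedFrontier` class, and the `crawl` function are hypothetical names introduced only for this illustration.

```python
import hashlib
from collections import deque

# Hypothetical link graph standing in for real pages (not data from the paper).
LINKS = {
    "topic/root": ["topic/a", "topic/b"],
    "topic/a": ["topic/b", "topic/c"],
    "topic/b": ["topic/root"],
    "topic/c": [],
}


class SharedFrontier:
    """In-memory stand-in for the shared queue and fingerprint set (assumption)."""

    def __init__(self):
        self.queue = deque()  # pending URLs, shared by all workers
        self.seen = set()     # fingerprints of every URL ever enqueued

    def push(self, url):
        # Deduplicate by fingerprint so no URL is scheduled twice.
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self.seen:
            return False
        self.seen.add(fingerprint)
        self.queue.append(url)
        return True

    def pop(self):
        return self.queue.popleft() if self.queue else None


def crawl(seed, n_workers=2):
    frontier = SharedFrontier()
    frontier.push(seed)  # the master seeds the shared queue
    handled = {w: [] for w in range(n_workers)}
    worker = 0
    while True:
        url = frontier.pop()
        if url is None:
            break
        handled[worker].append(url)      # this worker "fetches" the page
        for link in LINKS.get(url, []):  # and feeds discovered links back
            frontier.push(link)
        worker = (worker + 1) % n_workers  # round-robin stands in for parallelism
    return handled


result = crawl("topic/root")
```

Because every worker reads from and writes to the same frontier, each page is fetched once regardless of which worker discovers it. That shared, deduplicated queue is the property the master-slave design relies on.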
Authors
LI Guang-min (李光敏)
LI Ping (李平)
WANG Cong (汪聪)
Affiliations: College of Computer Science and Technology, Hubei Normal University, Huangshi 435002, China; College of Math and Statistics, Huanggang Normal University, Huanggang 438000, China
Source
Journal of Hubei Normal University: Natural Science (《湖北师范大学学报(自然科学版)》)
2019, No. 3, pp. 1-7 (7 pages)
Funding
Key Project of the Scientific Research Program of the Hubei Provincial Department of Education (D20172502)