Abstract
To cope with the sheer volume and frequent updating of information on the Internet, a data acquisition system based on the Scrapy crawler framework is designed and implemented. The system not only collects data according to each user's own needs but also lets users perform simple management of their own collection tasks. The key technologies used in development are introduced, and the system's framework design, functional modules, and database design are discussed. The system is developed with the Django MTV pattern; the underlying data collection layer uses Scrapy, an asynchronous web crawling framework written in Python. Web pages are parsed with a combination of XPath and Python regular expressions. The jQuery tree plug-in zTree provides tree-structured task management, and Bootstrap is used for the page styling and for the combined query of data by task name and keyword. The system consists of six main modules: web page parsing, data processing, system login, task creation, task management, and data query. Finally, the data interaction between the browser and the server and the implementation of web page data location and parsing are analyzed.
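The combination of XPath and regular expressions mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: to stay self-contained it uses the limited XPath subset in Python's standard `xml.etree.ElementTree` rather than Scrapy's selectors, and the page fragment, the field names, and the `parse_items` helper are all hypothetical. XPath locates the nodes of interest, and regular expressions then pull structured fields out of the free text inside them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a fetched page.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="news">
      <h2>Campus network upgrade</h2>
      <span class="meta">Posted: 2018-03-15 | Views: 120</span>
    </div>
    <div class="news">
      <h2>Library opening hours</h2>
      <span class="meta">Posted: 2018-04-02 | Views: 87</span>
    </div>
  </body>
</html>
"""

# Regular expressions for the fields embedded in the free-text meta line.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
VIEWS_RE = re.compile(r"Views:\s*(\d+)")


def parse_items(page):
    """Locate items with XPath, then extract fields with regexes."""
    root = ET.fromstring(page.strip())
    items = []
    # XPath step: select every news block by its class attribute.
    for node in root.findall(".//div[@class='news']"):
        title = node.findtext("h2", default="").strip()
        meta = node.findtext("span[@class='meta']", default="")
        # Regex step: pull the date and view count out of the meta text.
        date_match = DATE_RE.search(meta)
        views_match = VIEWS_RE.search(meta)
        items.append({
            "title": title,
            "date": date_match.group(0) if date_match else None,
            "views": int(views_match.group(1)) if views_match else None,
        })
    return items


if __name__ == "__main__":
    for item in parse_items(SAMPLE_PAGE):
        print(item)
```

In a Scrapy spider the same split would normally be written with the response's selector methods instead of `ElementTree`; the division of labor between locating nodes and cleaning their text stays the same.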
Authors
杨君
陈春玲
余瀚
YANG Jun; CHEN Chun-ling; YU Han (School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210003, China)
Source
《计算机技术与发展》
2018, No. 10, pp. 177-181 (5 pages)
Computer Technology and Development
Funding
National Natural Science Foundation of China (11501302)