Abstract
ScrapySharp is a .NET data-collection framework that extends the HtmlAgilityPack library; it can simulate web-browser operations and supports parsing HTML nodes with CSS selectors. ScrapySharp is efficient and easy to use, but its ability to simulate a browser is limited, whereas the Selenium automated testing framework offers powerful browser-control capabilities. By studying development-environment setup, the combined use of ScrapySharp and Selenium, methods for collecting JSON data, anti-anti-crawler techniques, and bulk data storage, this paper arrives at a data-collection solution based on C# + ScrapySharp + Selenium.
Author
YE Wen-quan (叶文全)
Department of Information, Minbei Vocational and Technical College, Nanping, Fujian 353000, China
Source
Journal of Hubei University of Education (《湖北第二师范学院学报》)
2019, No. 8, pp. 44-48 (5 pages)
Funding
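Minbei Vocational and Technical College institutional research project "Development of a Big-Data-Based Multi-Platform Data Analysis System for Cross-Border E-Commerce" (MJKA1907)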