Abstract
Taking web data scraping from the Jingdong platform as an example, this paper studies how to improve the efficiency with which web crawler technology captures web page data, so that the captured data can then be mined and analyzed. The crawler is built mainly on a distributed system in which multiple computers run multiple threads simultaneously, which significantly improves the efficiency of data capture. The page information of the Jingdong platform is analyzed and uniformly classified, and the product information under each category is crawled. Once the page content has been retrieved, a parser rebuilds the DOM tree of the page, and jQuery selectors extract product information by selecting on different tag names and identifier names. The extracted data are filtered and integrated, then subjected to data mining and data analysis to predict trends in the e-commerce industry and thereby guide the decisions of e-commerce operations teams.
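The pipeline the abstract describes (retrieve pages in parallel, rebuild the document structure, select elements by tag and identifier, then filter the results) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names jQuery selectors and a distributed cluster, whereas the sketch below substitutes Python's standard-library `HTMLParser` for the selector step and a single-machine thread pool for the cluster. The markup in `SAMPLE_PAGES`, the class names `item`, `name`, and `price`, and the functions `fetch`, `crawl`, and `crawl_all` are all hypothetical, and no real Jingdong page structure or URL is assumed.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Hypothetical markup standing in for downloaded pages; the real page
# structure is not given in the abstract.
SAMPLE_PAGES = {
    "page-1": '<ul><li class="item"><span class="name">Phone A</span>'
              '<span class="price">1999.00</span></li></ul>',
    "page-2": '<ul><li class="item"><span class="name">Laptop B</span>'
              '<span class="price">5499.00</span></li>'
              '<li class="item"><span class="name">Tablet C</span>'
              '<span class="price">2299.00</span></li></ul>',
}


class ProductParser(HTMLParser):
    """Collects the text of name/price spans inside each <li class="item">."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per <li class="item">
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "item":
            self.products.append({})
        elif tag == "span" and cls in ("name", "price") and self.products:
            self._field = cls

    def handle_data(self, data):
        if self._field:
            current = self.products[-1]
            current[self._field] = current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def fetch(page_id):
    # Stand-in for an HTTP request (assumption: no network access here).
    return SAMPLE_PAGES[page_id]


def crawl(page_id):
    """Fetch one page, parse it, and keep only complete records."""
    parser = ProductParser()
    parser.feed(fetch(page_id))
    parser.close()
    return [
        {"name": p["name"].strip(), "price": float(p["price"])}
        for p in parser.products
        if "name" in p and "price" in p
    ]


def crawl_all(page_ids, workers=4):
    """Crawl several pages concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(crawl, page_ids))
    return [product for page in pages for product in page]


if __name__ == "__main__":
    for product in crawl_all(sorted(SAMPLE_PAGES)):
        print(product)
```

In a distributed setting of the kind the abstract describes, the list of page identifiers would presumably be partitioned across machines, with each machine running its own pool of worker threads; that coordination layer is outside the scope of this sketch.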
Source
Jingji Shuxue (Journal of Quantitative Economics)
2018, Issue 1, pp. 77-85 (9 pages)
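Funding
National Social Science Fund of China, Key Project "Research on the Cultivation Mechanism of New Strategic Industries" (11AJL008)
Hunan Provincial Social Science Fund, Major Project "Research on Business Model Innovation in Strategic Emerging Industries" (12ZDA10)
National Soft Science Project "Research on the Integration and Interaction Mechanism between Producer Services and Strategic Emerging Industries" (2014GXS4D136)
Special Research Project on Generalized Virtual Economy (GX2014-1003-Y)
Hunan Provincial Enterprise Management and Investment Base Project (16JDZD01)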