Abstract
Efficient distributed acquisition of web data, combined with big data analysis and artificial intelligence, can give traditional industries more scientific and accurate means of analysis and prediction for construction and management. Taking investment cost prediction for electric power construction in Jiangsu Province as the application background, this paper studies deep web crawling based on the mainstream Python language and the distributed crawler framework Scrapy. A crawling strategy is designed according to the structure of the deep web, and a parallel web data capture system is implemented to collect, on a large scale, macroeconomic data for the cities of Jiangsu Province, including GDP, population, enterprise classification, community construction, and transportation construction. Natural language processing and regular expressions are then used to clean and process the structured and unstructured data obtained, and the data are finally presented through visualization.
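As an illustration of the regular-expression cleaning step mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that scraped text contains figures written with Chinese scale words (万, 亿, 万亿) and the units 元 or 人; the function name `extract_values`, the pattern, and the sample sentence are all hypothetical.

```python
import re

# Assumed pattern (hypothetical): a number with optional thousands separators
# and decimals, an optional Chinese scale word, then a base unit (元 or 人).
_VALUE_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(万亿|亿|万)?\s*(元|人)")

# Multipliers for the scale words; None means no scale word was present.
_MULTIPLIERS = {None: 1, "万": 10**4, "亿": 10**8, "万亿": 10**12}


def extract_values(text):
    """Return a list of (value, base_unit) tuples found in raw scraped text."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip leftover HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    results = []
    for number, scale, unit in _VALUE_RE.findall(text):
        # findall yields "" for an unmatched optional group, so map it to None
        value = float(number.replace(",", "")) * _MULTIPLIERS[scale or None]
        results.append((value, unit))
    return results


# Hypothetical sample input, not data from the paper
sample = "<p>全市GDP为 16,907.85 亿元,常住人口 949.11 万人</p>"
records = extract_values(sample)
```

Each tuple in `records` pairs a value normalized to its base unit with that unit, which is one plausible form in which cleaned figures could be passed on to later analysis or visualization.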
Authors
张震宇
王婷
任腾云
赵琳
王纪军
ZHANG Zhen-yu; WANG Ting; REN Teng-yun; ZHAO Lin; WANG Ji-jun (Jiangsu Electric Power Information Technology Co., Ltd., Nanjing 215000, China)
Source
《自动化技术与应用》
2023, No. 7, pp. 119-122 (4 pages)
Techniques of Automation and Applications
Keywords
distributed computing
big data
crawler framework
investment cost