Abstract
Intelligent analysis of Changbai Mountain ecological data requires crawling a large amount of web data. Such data are likely to contain missing, duplicate, abnormal, and noisy records, so the crawled data must be cleaned before use. In this paper, a crawler is designed and data cleaning is implemented. Experiments show that crawling one million records takes less than 30 minutes.
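The abstract does not describe how the cleaning is carried out. As a rough illustration only, the sketch below shows one common way to handle three of the defects it names (missing values, duplicates, and abnormal values) on crawled records. The function name `clean_records`, the field names, and the median-absolute-deviation outlier rule with its threshold are all assumptions made for this example, not the authors' method.

```python
from statistics import median


def clean_records(records, key_fields, value_field, max_dev=5.0):
    """Hypothetical cleaner: drop incomplete rows, duplicates, then outliers."""
    # 1. Missing values: drop rows where any required field is None or empty.
    required = list(key_fields) + [value_field]
    complete = [
        r for r in records
        if all(r.get(f) not in (None, "") for f in required)
    ]

    # 2. Duplicates: keep only the first row seen for each key.
    seen = set()
    unique = []
    for r in complete:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(r)

    if not unique:
        return []

    # 3. Abnormal values: drop rows whose distance from the median exceeds
    #    max_dev times the median absolute deviation (an assumed rule).
    values = [float(r[value_field]) for r in unique]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return unique  # no spread to judge outliers against
    return [
        r for r in unique
        if abs(float(r[value_field]) - med) / mad <= max_dev
    ]


# Made-up sample rows; "site" and "temp" are illustrative field names.
sample = [
    {"site": "A", "temp": 1.0},
    {"site": "A", "temp": 1.0},    # duplicate key
    {"site": "B", "temp": None},   # missing value
    {"site": "C", "temp": 1.5},
    {"site": "D", "temp": 2.0},
    {"site": "E", "temp": 0.5},
    {"site": "F", "temp": 500.0},  # abnormal value
]
cleaned = clean_records(sample, key_fields=["site"], value_field="temp")
```

Filtering incomplete rows before deduplicating means a complete copy of a record is never discarded in favor of an incomplete one that happened to be crawled first.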
Authors
郑国勋
姚学坤
陈冠澎
胥政尧
ZHENG Guo-xun (School of Computer Technology & Engineering, Changchun Institute of Technology, Changchun 130012, China)
Source
《长春工程学院学报(自然科学版)》
2021, No. 4, pp. 82-86, 124 (6 pages in total)
Journal of Changchun Institute of Technology: Natural Sciences Edition
Funding
Central Government Guided Local Science and Technology Development Fund Project (202002029JC).
Keywords
Changbai Mountain ecological data
data crawler
missing values
data cleaning