
Query and Statistical Analysis of Mass Automatic Station Data Based on SparkSQL in Hadoop Environment (cited by: 12)
Abstract: Under Hadoop's distributed computing and storage architecture, custom ETL data-cleaning rules are used to merge the hourly single-station files of mass automatic station data into large files, grouped by year and station number, and to store them in HDFS. The SparkSQL parallel computing framework is then used to compute daily statistical values of common meteorological elements. The results show that data processing and retrieval are significantly faster than with a relational database. With SparkSQL, statistical queries over multiple meteorological elements, multiple stations, and long time series all achieve second-level response times, and the advantage grows as the number of stations and the time span increase. The approach supports this kind of meteorological data service more efficiently and offers a new way to move mass meteorological data processing from relational databases to a distributed big-data architecture.
Authors: Huang Zhi (黄志); Zhan Liqun (詹利群); Ren Xiaowei (任晓炜); Li Tao (李涛) (Guangxi Meteorological Information Center, Nanning 530022)
Source: Meteorological Science and Technology (《气象科技》), 2019, No. 5, pp. 768-772, 871 (6 pages)
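Funding: Supported by the National Archives Administration project (2016-X-06), "Construction of the Guangxi Meteorological Digital Archive Based on Hadoop Big Data Processing"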
Keywords: Hadoop; HDFS; SparkSQL; ETL
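
The two steps named in the abstract, merging hourly records by year and station number and then deriving daily statistics, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library in place of HDFS and SparkSQL, and the record layout, the station IDs (`S0001`, `S0002`), the temperature values, and the function names are all hypothetical, since the record does not describe the file format. In SparkSQL the second step would typically be expressed as a `GROUP BY` aggregation over station and date.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical hourly records: (station_id, observation time, air temperature in deg C).
# Station IDs and values are made up for illustration only.
HOURLY = [
    ("S0001", "2018-07-01 00:00", 26.1),
    ("S0001", "2018-07-01 01:00", 25.7),
    ("S0001", "2018-07-02 00:00", 27.3),
    ("S0002", "2018-07-01 00:00", 24.0),
    ("S0002", "2019-01-01 00:00", 12.5),
]


def merge_key(station_id, timestamp):
    """Return the (year, station) key under which an hourly record is merged."""
    year = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").year
    return (year, station_id)


def merge_by_year_and_station(records):
    """Group hourly records into one bucket per (year, station).

    Each bucket stands in for one merged large file in the abstract's description.
    """
    merged = defaultdict(list)
    for station_id, timestamp, temp in records:
        merged[merge_key(station_id, timestamp)].append((timestamp, temp))
    return dict(merged)


def daily_stats(rows):
    """Compute daily mean, max, and min from one bucket's (timestamp, temp) rows."""
    by_day = defaultdict(list)
    for timestamp, temp in rows:
        by_day[timestamp[:10]].append(temp)  # first 10 chars are YYYY-MM-DD
    return {
        day: {"mean": round(mean(vals), 2), "max": max(vals), "min": min(vals)}
        for day, vals in sorted(by_day.items())
    }


if __name__ == "__main__":
    merged = merge_by_year_and_station(HOURLY)
    for key in sorted(merged):
        print(key, daily_stats(merged[key]))
```

Keying the merge on (year, station) means a query for one station over a long period reads a few large buckets rather than many small hourly files, which is the access pattern the abstract says benefits most from the distributed setup.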