Abstract
With the rapid development of remote sensing technology, the volume of acquired remote sensing data is growing explosively, which poses a major challenge for processing it. This paper adopts Spark, an in-memory distributed computing framework, to address the problem, with Hadoop YARN as the resource scheduler and HDFS as the distributed storage system. Spark is an open-source distributed computing framework built on the abstraction of resilient distributed datasets (RDD), and it has an advanced directed acyclic graph (DAG) execution engine that supports cyclic data flow. Data loaded into memory in the first iteration can be reused in subsequent iterations. This makes Spark particularly well suited to multi-iteration big-data analysis methods and gives it a clear advantage over MapReduce, which has to load the data in every iteration. The framework is applied to massive remote sensing data analysis to validate a multi-iteration singular value decomposition (SVD) algorithm. The experiments show that, as the number of iterations increases, Spark-based SVD is significantly more efficient than the MapReduce implementation, usually by one order of magnitude.
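The load-once, iterate-many-times idea described in the abstract can be illustrated with a small sketch. This is not the paper's Spark implementation: it is a plain-Python power iteration that estimates only the largest singular value of a tiny matrix, and `load_matrix`, `top_singular_value`, and the `reuse_loaded_data` flag are hypothetical names introduced here for illustration. The call counter stands in for the cost of reading data from storage, which a MapReduce-style job would pay in every iteration.

```python
import math


def load_matrix():
    """Hypothetical stand-in for reading a data block from storage."""
    load_matrix.calls += 1
    return [[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]]


load_matrix.calls = 0  # counts how often the data is (re)loaded


def matvec(a, x):
    """Compute A x for a row-major matrix a."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def rmatvec(a, y):
    """Compute A^T y for a row-major matrix a."""
    return [sum(a[i][j] * y[i] for i in range(len(a))) for j in range(len(a[0]))]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def top_singular_value(iterations, reuse_loaded_data=True):
    """Estimate the largest singular value by power iteration on A^T A."""
    data = load_matrix()  # first (and, when reusing, only) load
    v = [1.0] * len(data[0])
    scale = norm(v)
    v = [x / scale for x in v]
    for _ in range(iterations):
        if not reuse_loaded_data:
            data = load_matrix()  # assumed per-iteration reload cost
        w = rmatvec(data, matvec(data, v))  # w = (A^T A) v
        scale = norm(w)
        v = [x / scale for x in w]
    return norm(matvec(data, v))  # ||A v|| for the converged unit vector v
```

Both modes return the same estimate; only the number of loads differs, one in total when the loaded data is reused versus one per iteration plus the initial load otherwise.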
Source
《微型电脑应用》
2015, Issue 8, pp. 65-67, 6 (3 pages in total)
Microcomputer Applications
Funding
National Natural Science Foundation of China
(71331005)