摘要
随着移动互联网时代的到来,越来越多的含地理位置信息的空间数据需要处理,如何在海量的空间数据中进行常见的几何查询成为一个挑战,凸包问题因其在模式识别、图像处理、统计学、地理信息系统、博弈论、图论等领域中被广泛应用成为近些年研究的一个热点。凸包问题的研究始于单机版的算法,进而过渡到Hadoop等基于硬盘的分布式系统,但是受限于单节点的计算存储能力的瓶颈以及Hadoop平台基于硬盘的特性,其计算性能尚不能达到人们的在线实时计算的需求。研究基于内存的分布式计算框架Spark下的凸包问题,给出基于Spark平台的凸包查询整体框架,框架从查询接口、语法解析和物理执行等多方面结合SparkSQL引擎。随后,给出基于Andrew单调链算法的单机算法CHStand,分析单机算法并行度上的问题后,提出基于Spark的CHSpark算法,进一步优化算法并提出一种Spark平台下的优化算法CHGeom。通过实验对比说明三种算法的相对性能提升,实验发现Spark平台下的解决方案相对传统的单机平台下的解决方案有着较大的性能提升,所提算法具有良好的拓展性和广泛的实际应用价值。
With the arrival of mobile Internet era,more and more spatial data need to be stored and processed.How to conduct spatial geometrical queries on massive spatial data becomes a challenge in recent years.The convex hull problem becomes a focus for research on its wide usage in visual pattern matching,image processing,statistics,geographic information system,game theory and graph theory fields.The solution for convex hull problem starts with algorithms running on single computer,and then comes to some disk-based distributed systems,such as Hadoop.But bottlenecked by the storage and computing ability for single computer,the frameworks can not satisfy the demand for data scientists.This paper introduces an in-memory cluster computing framework Apache Spark to solve convex hull problem,the framework optimizes SparkSQL for geometrical operation from query interface,syntax parsing and physical execution.Firstly,a standalone algorithm CHStand on single computer is brought up to solve this problem,which is a na?ve solution on Spark named CHSpark.Then an optimized CHGeom algorithm is come up to solve the convex hull problem.Additional experiments show that CHGeom runs much faster than algorithms on single computer.Some ideas mentioned in this paper can be easily extended to other spatial geometrical operation.
作者
李格非
马蔚吟
李力
LI Gefei;MA Weiyin;LI Li(Department of Computer Science and Engineering,Shanghai Jiao Tong University,Shanghai 200240,China;School of Basic Medical Sciences,Nanjing Medical University,Nanjing 211166,China;School of Software,Shanghai Jiao Tong University,Shanghai 200240,China)
出处
《计算机工程与应用》
CSCD
北大核心
2018年第22期67-73,112,共8页
Computer Engineering and Applications
基金
国家重点基础研究发展计划(973计划)(No.2015CB352403)
国家自然科学基金(No.61729202
No.U1636210
No.61373156
No.91438121
No.61672351)
国家重点研发计划(No.2016YFB0700502)
上海市科学创新行动计划(No.15JC1402400)
微软亚洲研究院
关键词
Spark平台
分布式计算
空间几何查询
凸包运算
Spark platform
distributed computing
spatial geometrical query
convex hull operation