Abstract
Apache Spark is a widely used framework for large-scale data processing, designed to be fast, general-purpose, and easy to use. Spark was proposed as an in-memory computing framework to address the inefficiency of MapReduce on iterative machine learning algorithms and interactive data mining. It retains the scalability, fault tolerance, and compatibility of MapReduce while making up for its shortcomings in these applications. Because it uses memory-based cluster computing, Spark can run such workloads up to 100 times faster than Hadoop MapReduce when the data is held in memory. This paper introduces the basic concepts, components, and deployment modes of Spark, analyzes its core abstractions and programming model, and gives related programming examples.