Abstract
With the rapid development of the Internet of Things and advances in technology, data volumes across industries are growing at an unprecedented speed and scale, and quickly extracting valuable information from massive data has become a key concern for enterprises. Spark, currently the most popular open-source big-data processing framework, is constrained by the complexity of its underlying mechanisms and by cluster resources, so applications often suffer from problems such as insufficient memory and long task execution times. To address this, this paper analyzes and summarizes Spark application performance from four aspects: development principles, partitioning and input data formats, cluster parallelism, and the structured APIs, with the goals of optimizing resource allocation and improving development efficiency.
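To make the four tuning aspects concrete, the sketch below (not from the paper; the paths, column names, and parameter values are illustrative assumptions) shows how a Spark application might apply them in Scala: configuring cluster parallelism, reading a columnar input format, expressing the computation through the structured DataFrame API so the Catalyst optimizer can rewrite it, and repartitioning before output.

```scala
import org.apache.spark.sql.SparkSession

object TuningSketch {
  def main(args: Array[String]): Unit = {
    // Cluster parallelism: size shuffle partitions to the cluster's cores
    // (200 here is an illustrative assumption, not a recommendation).
    val spark = SparkSession.builder()
      .appName("tuning-sketch")
      .config("spark.sql.shuffle.partitions", "200")
      .config("spark.default.parallelism", "200")
      .getOrCreate()

    // Input format: Parquet is columnar, so a query reads only the columns
    // it needs. The HDFS path is hypothetical.
    val events = spark.read.parquet("hdfs:///data/events")

    // Structured API: relational operators the Catalyst optimizer can
    // reorder and push down, unlike opaque RDD lambdas.
    val counts = events
      .filter(events("value") > 0) // hypothetical column
      .groupBy("category")         // hypothetical column
      .count()

    // Partitioning: rebalance before the write so output files are evenly sized.
    counts.repartition(8).write.parquet("hdfs:///data/event_counts")

    spark.stop()
  }
}
```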
Authors
WEI Tongbian, WU Jiangbo, SU De, ZHANG Liang, WEI Tongming
(Guangxi Key Laboratory of Automobile Four New Features, SAIC GM Wuling Automobile Co., Ltd., Liuzhou, Guangxi 545007, China)
Source
Information & Computer (《信息与电脑》)
2022, No. 2, pp. 53-55 (3 pages)
Keywords
Internet of Things
value
calculation
Spark
parallelism