摘要
大数据具备数据量大、富于多样性的特点。因此在大数据分析方面,无论是对处理速度还是实时性都具有较高的要求。数据挖掘技术是从海量数据里采用某种建模算法,用来寻找隐藏在数据背后的信息,从而让大数据产生更大的价值。Spark框架是一个针对超大数据集合的低延迟的集群分布式计算系统。本文基于该框架,对大数据挖掘技术进行了具体研究,首先完成了基于Yarn部署上Spark集群搭建,然后提出并实现了并行Apriori算法,该算法成功补充了Spark MLlib分布式机器学习库中所缺乏的关联分析问题的分布式算法。
Because big data have the characteristics of large amount of data and rich diversity, it must be demanding large data analysis both in processing speed and real-time requirements. Data mining technology is to use some modeling algorithm from massive data, to look for hidden information behind the data, so that big data can produce greater value. Spark framework is a low latency cluster distributed computing system for super large data sets. Based on the framework, this paper studies the big data mining technology. This paper designs and implements the Yarn deployment on the Spark cluster firstly, and then proposes and implements parallel Apriori algorithm. This algorithm successfully adds to the distributed algorithm of association analysis by the lack of Spark MLlib distributed machine learning repository.
出处
《微型电脑应用》
2017年第6期64-66,80,共4页
Microcomputer Applications