Abstract
Because the classical Apriori algorithm and its subsequent improvements cannot effectively handle large-scale data sets, two improved algorithms based on the Hadoop-MapReduce programming model are proposed: HAprioriK and HApriori2. HAprioriK requires k MapReduce jobs, whereas HApriori2 needs only 2 to find all frequent k-itemsets over the entire data set. Both improved algorithms exploit the computational advantages of the Hadoop platform and can easily process large volumes of data. The effectiveness of the improved algorithms is evaluated on IBM synthetic data sets. The experimental results show that HApriori2 can effectively mine frequent itemsets across data sets of different sizes and different support thresholds, and achieves better performance than HAprioriK.
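The core of any MapReduce-based Apriori variant is a support-counting job: mappers emit (candidate itemset, 1) pairs for every candidate contained in a transaction, and reducers sum the counts and filter by the minimum support. The following is a minimal single-machine sketch of one such pass, in the spirit of the HAprioriK scheme described above (one job per itemset size k); all function and variable names are illustrative assumptions, not taken from the paper, and candidates are generated by brute force rather than by Apriori's pruning of infrequent (k-1)-itemsets.

```python
from itertools import combinations
from collections import defaultdict

def map_phase(transaction, candidates):
    """Mapper: emit (candidate_itemset, 1) for each candidate
    contained in this transaction."""
    tx = set(transaction)
    for c in candidates:
        if set(c) <= tx:
            yield (c, 1)

def reduce_phase(mapped_pairs):
    """Reducer: sum the counts emitted for each candidate itemset."""
    counts = defaultdict(int)
    for itemset, one in mapped_pairs:
        counts[itemset] += one
    return counts

def frequent_itemsets(transactions, k, min_support):
    """One support-counting pass for size-k candidates, i.e. one
    MapReduce job in an HAprioriK-style scheme. Brute-force candidate
    generation is used here only to keep the sketch short."""
    items = sorted({i for t in transactions for i in t})
    candidates = [tuple(c) for c in combinations(items, k)]
    pairs = (p for t in transactions for p in map_phase(t, candidates))
    counts = reduce_phase(pairs)
    return {c: n for c, n in counts.items() if n >= min_support}

transactions = [("bread", "milk"), ("bread", "butter"),
                ("bread", "milk", "butter"), ("milk", "butter")]
print(frequent_itemsets(transactions, 2, 2))
```

In a real Hadoop job the mapper and reducer run on distributed splits of the transaction file; the two-job HApriori2 variant avoids repeating this pass once per k.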
Authors
XIE Jian-feng, SUN Jian-wei (Department of No.5 System, North China Institute of Computing Technology, Beijing 100083, China)
Source
Information Technology (《信息技术》), 2018, No. 4, pp. 129-133, 140 (6 pages)