Abstract
Frequent pattern mining is an important problem in pattern recognition and has long received wide attention from researchers. The FP-Growth algorithm is widely used for frequent pattern mining because it is fast and efficient. However, it relies on local in-memory operation, which makes it difficult to scale to large data sets. To address this problem, this paper studies frequent pattern mining on large-scale data sets in a distributed environment. Building on the Spark framework, it improves the FP-Growth algorithm by optimizing the support counting and grouping steps, and it implements distributed computation with dynamic allocation of computing resources. Intermediate results produced during the computation are kept in memory, which reduces I/O overhead and improves the running efficiency of the algorithm. Experimental results show that the improved distributed algorithm outperforms the traditional FP-Growth algorithm on large-scale data.
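The abstract does not spell out how support counting and grouping are optimized, so the following is only a minimal sketch of the general partitioning idea behind distributed FP-Growth variants, written in plain Python with no Spark dependency. All function names are hypothetical, the round-robin grouping is an assumption rather than the paper's scheme, and brute-force subset enumeration stands in for building a local FP-tree on each worker.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_support(transactions):
    """Count how many transactions contain each item."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return counts


def assign_groups(frequent_items, num_groups):
    """Assumed grouping: spread frequent items round-robin over groups."""
    return {item: r % num_groups for r, item in enumerate(frequent_items)}


def group_dependent_shards(transactions, rank, group_of):
    """For each transaction, send each group the prefix that ends at the
    last (lowest-support) item belonging to that group."""
    shards = defaultdict(list)
    for t in transactions:
        items = sorted((i for i in set(t) if i in rank), key=rank.get)
        seen = set()
        for pos in range(len(items) - 1, -1, -1):
            g = group_of[items[pos]]
            if g not in seen:
                seen.add(g)
                shards[g].append(items[:pos + 1])
    return shards


def mine_shard(shard, group_id, group_of, min_support):
    """Stand-in for local FP-tree mining: count every itemset whose
    last item belongs to this group, then keep the frequent ones."""
    counts = Counter()
    for items in shard:
        for pos, item in enumerate(items):
            if group_of[item] != group_id:
                continue
            prefix = items[:pos]
            for k in range(len(prefix) + 1):
                for combo in combinations(prefix, k):
                    counts[combo + (item,)] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def grouped_frequent_itemsets(transactions, min_support, num_groups):
    """Run the four steps in sequence; each shard could be mined independently."""
    counts = count_support(transactions)
    frequent = sorted((i for i, c in counts.items() if c >= min_support),
                      key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(frequent)}
    group_of = assign_groups(frequent, num_groups)
    shards = group_dependent_shards(transactions, rank, group_of)
    result = {}
    for g, shard in shards.items():
        result.update(mine_shard(shard, g, group_of, min_support))
    return result


if __name__ == "__main__":
    demo = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
    for pattern, support in sorted(grouped_frequent_itemsets(demo, 3, 2).items()):
        print(pattern, support)
```

Because every itemset is counted only in the shard that owns its last item, the shards can be mined independently and their results merged without double counting, which is what makes this kind of scheme suitable for a cluster.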
Source
Computer Applications and Software (《计算机应用与软件》)
2017, No. 9, pp. 273-278 (6 pages)
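Funding
National Natural Science Foundation of China (71371013)
Anhui University of Technology Young Teachers Research Fund (QZ201420)
Natural Science Foundation of the Anhui Provincial Department of Education (KJ2016A087)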