Abstract
Can-tree-based incremental association rule algorithms in big data environments suffer from excessive space consumption by the tree structure, the inability to set the support threshold dynamically, and time-consuming data transfer between the Map and Reduce stages. To address these problems, a MapReduce-based parallel association rules incremental mining algorithm using information entropy and genetic algorithm (MR-PARIMIEG) is proposed. First, a similar items merging strategy based on information entropy (SIM-IE) is designed to merge similar data items, and the Can-tree is constructed from the merged data set, which reduces the space occupied by the tree structure. Second, a dynamic support threshold obtaining strategy using genetic algorithm (DST-GA) is proposed to obtain a relatively optimal dynamic support threshold in the big data environment; frequent itemsets are mined according to this threshold, which avoids the time spent mining redundant frequent patterns. Finally, during MapReduce parallel processing, the parallel LZO data compression algorithm is used to compress the Map-side output, which reduces the volume of transmitted data and ultimately improves the running speed of the algorithm. Experimental simulation results show that MR-PARIMIEG performs well when mining frequent itemsets in big data environments and is suitable for parallel processing of relatively large data sets.
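For readers unfamiliar with the Can-tree, the sketch below illustrates the general idea the abstract builds on: transactions are inserted in a fixed canonical item order, so a later batch of transactions can be added without reordering the existing tree. This is only a minimal single-machine illustration under stated assumptions. It uses plain lexicographic order as the canonical order, and it does not implement SIM-IE, DST-GA, LZO compression, or any MapReduce step. The names `CanTreeNode`, `CanTree`, `insert`, `node_count`, and `support` are hypothetical and not taken from the paper.

```python
class CanTreeNode:
    """One node of the prefix tree: an item label and a transaction count."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


class CanTree:
    """Minimal Can-tree sketch (assumption: canonical order = lexicographic)."""

    def __init__(self):
        self.root = CanTreeNode()

    def insert(self, transaction):
        # Items are sorted in a fixed order that does not depend on frequency,
        # so inserting new transactions never forces a tree reordering.
        node = self.root
        for item in sorted(set(transaction)):
            child = node.children.get(item)
            if child is None:
                child = CanTreeNode(item)
                node.children[item] = child
            child.count += 1
            node = child

    def node_count(self):
        # Number of non-root nodes, a rough proxy for the tree's space use.
        stack, total = [self.root], 0
        while stack:
            node = stack.pop()
            total += len(node.children)
            stack.extend(node.children.values())
        return total

    def support(self, item):
        # Support count of a single item: sum of counts over its nodes.
        stack, total = list(self.root.children.values()), 0
        while stack:
            node = stack.pop()
            if node.item == item:
                total += node.count
            stack.extend(node.children.values())
        return total


# Initial batch of transactions (illustrative data only).
tree = CanTree()
for t in [["a", "b", "c"], ["a", "b"], ["b", "c"]]:
    tree.insert(t)

# Incremental batch: inserted directly into the existing tree.
tree.insert(["c", "a"])
```

Because shared prefixes are stored once, merging similar items before construction (as SIM-IE is described as doing) would shrink the number of distinct paths and therefore the value reported by `node_count`.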
Authors
MAO Yimin (毛伊敏)
DENG Qianhu (邓千虎)
CHEN Zhigang (陈志刚)
Affiliations: School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China; College of Computer Science and Engineering, Central South University, Changsha 410083, China
Source
Journal on Communications (《通信学报》)
EI
CSCD
Peking University Core Journal
2021, No. 5, pp. 122-136 (15 pages)
Funding
National Natural Science Foundation of China (No. 41562019, No. 61762046)
National Key Research and Development Program of China (No. 2018YFC1504705)
Keywords
Can-tree
information entropy
big data
incremental mining
data compression