期刊文献+

基于信息熵与遗传算法的并行关联规则增量挖掘算法 被引量:21

Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm
下载PDF
导出
摘要 针对大数据环境下基于Can树的增量关联规则算法存在树结构空间占用过大、支持度阈值无法动态设置以及Map与Reduce阶段数据传输耗时等问题,提出了一种基于信息熵和遗传算法的并行关联规则增量挖掘算法MR-PARIMIEG。首先,该算法设计基于信息熵的相似项合并策略(SIM-IE)来合并相似数据项,并根据合并后的数据集进行Can树构造,从而减少树结构的空间占用;其次,提出基于遗传算法的DST-GA策略获取大数据环境下相对最优的动态支持度阈值,根据此阈值进行频繁项集挖掘,避免了冗余的频繁模式挖掘导致的时间消耗;最后,在MapReduce并行化运算过程中使用并行LZO数据压缩算法对Map端输出数据进行压缩,从而减少传输的数据规模,最终提升算法的运行速度。实验仿真结果表明,MR-PARIMIEG在大数据环境下进行频繁项集挖掘时具有较好的性能表现,适用于对较大规模的数据集进行并行化处理。 Aiming at the problems that in the big data environment,the Can-tree based incremental association rule algorithm had problems such as too much space occupation of the tree structure,inability to dynamically set the support threshold,and too much time consumption during the data transfer process between the Map and Reduce stages,the Map Reduce-based parallel association rules incremental mining algorithm using information entropy and genetic algorithm(MR-PARIMIEG)was proposed.Firstly,a similar items merging based on information entropy(SIM-IE)was designed to merge similar data items,and a Can tree based on the merged data set was constructed,thereby reducing the space occupation of the tree structure.Secondly,the dynamic support threshold obtaining using genetic algorithm(DST-GA)was proposed to obtain the relatively optimal dynamic support threshold in the big data environment,and frequent itemset mining was performed according to this threshold to avoid the unnecessary time consumption caused by mining redundant frequent patterns.Finally,in the process of MapReduce parallel operation,the parallel LZO data compression algorithm was used to compress the output data of the Map stage,thereby reducing the size of the transmitted data,and finally improving the running speed of the algorithm.Experimental simulation results show that MR-PARIMIEG has better performance when mining frequent item sets in the big data environment,and it is suitable for parallel processing of larger data sets.
作者 毛伊敏 邓千虎 陈志刚 MAO Yimin;DENG Qianhu;CHEN Zhigang(School of Information Engineering,Jiangxi University of Science and Technology,Ganzhou 341000,China;College of Computer Science and Engineering,Central South University,Changsha 410083,China)
出处 《通信学报》 EI CSCD 北大核心 2021年第5期122-136,共15页 Journal on Communications
基金 国家自然科学基金资助项目(No.41562019,No.61762046) 国家重点研发计划基金资助项目(No.2018YFC1504705)。
关键词 Can树 信息熵 大数据 增量挖掘 数据压缩 Can-tree information entropy big data incremental mining data compression
  • 相关文献

参考文献3

二级参考文献5

共引文献11

同被引文献227

引证文献21

二级引证文献15

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部