Abstract

Conventional Bootstrap sampling and Bagging ensemble learning are usually implemented serially, which is computationally inefficient, prevents sample reuse, and scales poorly, making them unsuitable for building large-scale Bagging ensembles with many component models. Inspired by distributed big data computing, this paper proposes a new Bootstrap Sample Partition (BSP) big data model and a distributed ensemble learning method. A distributed generation algorithm expresses the training data as a set of distributed Bootstrap samples stored as HDFS files, providing the data foundation for subsequent distributed ensemble learning. The distributed ensemble learning method randomly selects multiple BSP data blocks from the BSP data model and reads them into the Java virtual machines of the cluster nodes; a serial algorithm then computes statistics or trains a model on each selected block, independently and in parallel with the other virtual machines. All sub-results are finally collected at the master node to produce the ensemble result, and a quality-based selection of sub-results can optionally be applied at this stage to further improve prediction performance. Both BSP data model generation and component model building use a non-MapReduce computing paradigm: each data block is processed independently, which eliminates data communication overhead among the computing nodes. The proposed algorithms are implemented in the open-source Spark system as new operators that Spark applications can call. Experiments show that the new method efficiently generates the BSP data model of a training dataset and improves the reusability of data samples; in large-scale Bagging ensemble learning experiments built on supervised machine learning algorithms, it increases computational efficiency by more than 50% while improving prediction accuracy by about 2%.
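The workflow described in the abstract (generate bootstrap sample blocks, train one component model per block in parallel, optionally keep only the best-scoring sub-results, then combine them by majority vote at the master node) can be sketched in plain Python. Everything below, including the function names and the trivial majority-label "learner", is an illustrative stand-in under stated assumptions, not the paper's actual Spark operator API; in the real system the blocks live in HDFS and each block is processed on a separate cluster node.

```python
import random
from collections import Counter

def generate_bsp_blocks(data, num_blocks, block_size, seed=0):
    # Each block is a bootstrap sample drawn with replacement from the
    # training data. In the paper's system a distributed algorithm
    # generates the blocks and stores them as HDFS files; here they
    # are kept in memory for illustration.
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in range(block_size)]
            for _ in range(num_blocks)]

def train_component_model(block):
    # Stand-in "component model": predicts the majority label of its
    # block. In the real system any serial learner (e.g. a decision
    # tree) is trained on each block, independently of the others.
    labels = [label for _, label in block]
    return Counter(labels).most_common(1)[0][0]

def select_by_quality(sub_results, scores, keep):
    # Optional quality-selection step from the abstract: keep only the
    # `keep` best-scoring sub-results before combining them.
    ranked = sorted(zip(scores, sub_results), reverse=True)
    return [r for _, r in ranked[:keep]]

def bagging_predict(component_predictions):
    # Master-node step: combine the sub-results by majority vote.
    return Counter(component_predictions).most_common(1)[0][0]

# Toy labelled data: (feature, label) pairs.
data = [(i, "pos" if i % 10 else "neg") for i in range(100)]
blocks = generate_bsp_blocks(data, num_blocks=7, block_size=30)
# In the distributed setting each block is processed on a different
# node with no inter-node communication; a plain loop stands in here.
models = [train_component_model(b) for b in blocks]
print(bagging_predict(models))
```

The key property the sketch mirrors is that each block's computation touches only its own data, so the parallel phase needs no communication between workers; only the small sub-results travel back to the master.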
Authors

LUO Kaijing, ZHANG Yuming, HE Yulin, HUANG Zhexue (Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China)
Source

Big Data Research (《大数据》), 2024, Issue 3, pp. 93-108 (16 pages)
Funding

National Natural Science Foundation of China (No. 61972261)
General Program of the Natural Science Foundation of Guangdong Province (No. 2023A1515011667)
Shenzhen Basic Research Key Project (No. JCYJ20220818100205012)
Shenzhen Basic Research General Project (No. JCYJ20210324093609026)