Instance selection algorithm for big data based on random forest and voting mechanism (cited by: 7)

Abstract: To address instance selection for big data, an instance selection algorithm based on Random Forest (RF) and a voting mechanism is proposed. First, the big dataset is divided into two subsets: the first is large, and the second is small or medium-sized. The large subset is then partitioned into q smaller subsets, which are deployed to q cloud computing nodes, and the small or medium-sized subset is broadcast to all q nodes. Next, each node trains a random forest on its local subset and uses that forest to select instances from the second subset; the instances selected at the individual nodes are merged to form the selected subset for that round. This process is repeated p times, yielding p subsets of selected instances. Finally, the p subsets are used for voting to obtain the final set of selected instances. The proposed algorithm was implemented on two big data platforms, Hadoop and Spark, and the implementation mechanisms of the two platforms were compared. In addition, the proposed algorithm was compared with the Condensed Nearest Neighbor (CNN) and Reduced Nearest Neighbor (RNN) algorithms on 6 large datasets. Experimental results show that, on larger datasets, the proposed algorithm achieves higher test accuracy and lower time consumption than these two algorithms. These results indicate that the proposed algorithm has good generalization ability and high operational efficiency in big data processing and can effectively solve the instance selection problem for big data.
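The partition, select, merge, and vote steps described in the abstract can be sketched in a single process as follows. This is only an illustration under several assumptions that the abstract does not confirm: the q cloud nodes are simulated by a plain loop rather than Hadoop or Spark, a small ensemble of bootstrap-trained decision stumps stands in for a real random forest, an instance is selected when the ensemble's votes on it are not unanimous, and the final vote keeps instances chosen in more than half of the p rounds. All function names and default parameters are hypothetical.

```python
import random
from collections import Counter


def majority(labels, fallback):
    """Most common label, or `fallback` when `labels` is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else fallback


def train_stump(data, rng):
    """Toy 'tree': a one-split stump fitted on a bootstrap sample, using one random feature."""
    sample = [rng.choice(data) for _ in data]
    feature = rng.randrange(len(sample[0][0]))
    overall = majority([y for _, y in sample], None)
    best = None
    for x, _ in sample:  # every observed value is a candidate threshold
        t = x[feature]
        left = majority([y for xi, y in sample if xi[feature] <= t], overall)
        right = majority([y for xi, y in sample if xi[feature] > t], overall)
        errors = sum(y != (left if xi[feature] <= t else right) for xi, y in sample)
        if best is None or errors < best[0]:
            best = (errors, feature, t, left, right)
    return best[1:]  # (feature, threshold, left_label, right_label)


def forest_votes(forest, x):
    """Class votes cast by every stump for one instance."""
    return Counter(left if x[f] <= t else right for f, t, left, right in forest)


def select_once(large, small, q, n_trees, rng):
    """One round: split `large` into q parts, train an ensemble per part, select from `small`."""
    shuffled = large[:]
    rng.shuffle(shuffled)
    parts = [shuffled[i::q] for i in range(q)]  # q local subsets, one per simulated node
    selected = set()
    for part in parts:
        forest = [train_stump(part, rng) for _ in range(n_trees)]
        for idx, (x, _) in enumerate(small):
            # Assumed criterion: keep instances on which the ensemble is not unanimous.
            if len(forest_votes(forest, x)) > 1:
                selected.add(idx)
    return selected  # union over nodes = merged subset for this round


def select_instances(large, small, q=3, p=5, n_trees=15, seed=0):
    """Repeat the round p times, then vote over the p selected subsets."""
    rng = random.Random(seed)
    rounds = [select_once(large, small, q, n_trees, rng) for _ in range(p)]
    counts = Counter(idx for chosen in rounds for idx in chosen)
    # Assumed voting rule: keep an instance chosen in more than half of the p rounds.
    return sorted(idx for idx, c in counts.items() if c > p / 2)


if __name__ == "__main__":
    demo_rng = random.Random(1)
    points = [(demo_rng.uniform(-1, 1), demo_rng.uniform(-1, 1)) for _ in range(360)]
    data = [(x, int(x[0] + x[1] > 0)) for x in points]  # synthetic two-class data
    large, small = data[:300], data[300:]
    kept = select_instances(large, small)
    print(len(kept), "of", len(small), "instances kept")
```

Each instance is a `(features, label)` pair, and the function returns indices into the second (small or medium-sized) subset. In a distributed setting, the loop over `parts` would correspond to the per-node work, and `small` would be the broadcast subset.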
Authors: ZHOU Xiang; ZHAI Junhai; HUANG Yajie; SHEN Ruicai; HOU Yingzhen (College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China; Hebei Key Laboratory of Machine Learning and Computational Intelligence (Hebei University), Baoding, Hebei 071002, China)
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2021, No. 1, pp. 74-80 (7 pages)
Funding: Key Research and Development Program of Hebei Province (19210310D); Hebei University Graduate Innovation Funding Project (hbu2020ss045).
Keywords: big data; instance selection; decision tree; Random Forest (RF); voting mechanism