Random Forest Algorithm Based on PCA and Hierarchical Selection Under Spark

Abstract: In big-data settings, the random forest algorithm suffers from three problems: a large covariance matrix, insufficient coverage of feature information in subspaces, and high communication overhead between nodes. To address them, a parallel random forest algorithm based on PCA and hierarchical subspace selection, PLA-PRF (PCA and subspace layer sampling on parallel random forest algorithm), is proposed. First, for the initial feature set, a PCA-based matrix factorization strategy (MFS) compresses the original feature set and extracts principal component features, which solves the problem of the large covariance matrix in the feature transformation process. Second, based on the principal component features, an error-constrained hierarchical subspace construction algorithm (EHSCA) selects pheromone features layer by layer and constructs feature subspaces, which solves the problem of insufficient coverage of subspace feature information. Third, for the parallel training of decision trees in the Spark environment, a data reuse strategy (DRS) vertically partitions the RDD data and combines it with an index table to achieve feature reuse, which solves the problem of high node communication overhead. Experimental results show that PLA-PRF achieves better classification performance and higher parallelization efficiency.
Authors: LEI Chen; MAO Yimin (School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China)
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2022, No. 6, pp. 118-127 (10 pages)
Funding: National Key Research and Development Program of China (2018YFC1504705); National Natural Science Foundation of China (41562019); Science and Technology Project of the Jiangxi Provincial Department of Education (GJJ151528, GJJ151531).
Keywords: random forest; Spark; principal component analysis (PCA); layer sampling; error constraint; data partition; data reuse
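
The abstract names two ideas that a small example may make more concrete: picking each tree's features layer by layer so that every layer is represented, and reusing vertically partitioned (column-wise) data through an index table instead of copying rows. The sketch below is only an illustration under stated assumptions, not the paper's EHSCA or DRS. The error constraint is omitted because the abstract does not specify it, plain Python lists stand in for Spark RDDs, and the names `layered_subspace`, `tree_view`, `scores`, and `columns` are hypothetical.

```python
import random

def layered_subspace(scores, n_layers, per_layer, rng):
    """Pick up to `per_layer` feature indices from each of `n_layers` score-ranked layers."""
    # Rank feature indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Split the ranking into contiguous layers of near-equal size.
    size, extra = divmod(len(ranked), n_layers)
    subspace, start = [], 0
    for layer in range(n_layers):
        end = start + size + (1 if layer < extra else 0)
        members = ranked[start:end]
        # Sample without replacement so every non-empty layer contributes features.
        subspace.extend(rng.sample(members, min(per_layer, len(members))))
        start = end
    return sorted(subspace)

def tree_view(columns, row_index, feature_ids):
    """Assemble one tree's training rows from shared columns plus an index table."""
    return [[columns[f][r] for f in feature_ids] for r in row_index]

# Hypothetical toy data: 6 features stored column-wise (vertical partition), 4 rows.
columns = {f: [f * 10 + r for r in range(4)] for f in range(6)}
# Assumed per-feature scores, e.g. explained-variance ratios of principal components.
scores = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]

rng = random.Random(0)
features = layered_subspace(scores, n_layers=3, per_layer=1, rng=rng)
rows = [rng.randrange(4) for _ in range(4)]  # bootstrap row indices (the index table)
sample = tree_view(columns, rows, features)
print(features, rows, sample)
```

In this sketch every tree keeps only a list of row indices and feature ids, while the columns themselves are stored once and shared. That is the sense in which feature data is reused rather than copied for each tree.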