Abstract
In big data processing projects, the volume of collected high-dimensional data grows exponentially, and data preprocessing has become a bottleneck for data analysis and knowledge mining. Principal Component Analysis (PCA) is currently the most widely used dimensionality reduction algorithm and performs well on large sparse matrices in particular, but it usually involves large-scale, complex computation. The parallel PCA algorithm based on the MapReduce parallel processing framework of the Hadoop big data platform uses map and reduce steps to distribute the complex computation across multiple processors for parallel processing. Validation experiments show that, as the data set grows and with an appropriately chosen number of distributed computing nodes, the speedup ratio of the parallel PCA method can increase by about 30% and the time consumption can decrease by about 21%.
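The abstract does not spell out how the computation is split, so the following is only a minimal single-machine sketch of one common way to phrase PCA as map and reduce steps, not the authors' implementation. It assumes each partition is a list of equal-length numeric rows; the map step emits the row count, column sums, and sum of outer products, the reduce step adds these statistics, and the covariance matrix is built from the totals. All function names are hypothetical, and power iteration is swapped in for a full eigendecomposition purely to keep the example dependency-free.

```python
from functools import reduce
import math


def map_partition(rows):
    """Map step: per-partition row count, column sums, and sum of outer products."""
    d = len(rows[0])
    col_sums = [0.0] * d
    gram = [[0.0] * d for _ in range(d)]
    for x in rows:
        for i in range(d):
            col_sums[i] += x[i]
            for j in range(d):
                gram[i][j] += x[i] * x[j]
    return len(rows), col_sums, gram


def reduce_stats(a, b):
    """Reduce step: add the statistics of two partitions element-wise."""
    n = a[0] + b[0]
    col_sums = [u + v for u, v in zip(a[1], b[1])]
    gram = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, col_sums, gram


def covariance(stats):
    """Sample covariance from the aggregated count, sums, and outer products."""
    n, col_sums, gram = stats
    d = len(col_sums)
    mean = [s / n for s in col_sums]
    return [[(gram[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=200):
    """Power iteration for the leading eigenpair (assumes a non-zero covariance
    matrix and a start vector not orthogonal to the leading eigenvector)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigval, v


def parallel_pca_sketch(partitions):
    """Run map over the partitions, reduce the results, then extract the first component."""
    stats = reduce(reduce_stats, map(map_partition, partitions))
    return top_component(covariance(stats))


if __name__ == "__main__":
    # Two hypothetical partitions of points lying on the line y = 2x.
    parts = [[[0.0, 0.0], [1.0, 2.0]], [[2.0, 4.0], [3.0, 6.0]]]
    eigval, direction = parallel_pca_sketch(parts)
    print(eigval, direction)
```

In this formulation only the small d-by-d statistics travel between the map and reduce steps, which is why the approach suits many rows spread over several nodes; on a real Hadoop cluster the built-in `map` and `reduce` calls would be replaced by the framework's own job definitions.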
Authors
CHEN Yan (陈燕)
CHEN Ya-lin (陈亚林)
ZHENG Jun (郑军)
School of Mathematics and Information Science, Guiyang University, Guiyang 550002, Guizhou, China; School of Management Science, Nanjing University of Finance & Economics, Nanjing 210046, Jiangsu, China
Source
Journal of Guiyang University: Natural Sciences (《贵阳学院学报(自然科学版)》)
2019, Issue 4, pp. 92-96 (5 pages)
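Funding
2019 Municipal Science and Technology Bureau and Guiyang University Special Fund for Science and Technology [Project No. GYU-KYZ[2019~2020]PT06-02]
Ministry of Education Youth Fund Project: "Research on Coal-Related Industrial Policy under Water Resource Constraints: Mechanism, Model and Simulation" [Project No. 18YJCZH016]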