Abstract
Partial least squares (PLS) is one of the multivariate statistical process monitoring methods commonly used in modern industrial processes. However, running PLS regression on large-scale industrial process data on a single PC is computation-intensive and slow. To address this, a parallel PLS algorithm based on the MapReduce framework is proposed on the Hadoop cloud platform. Guided by an analysis of time complexity, the cross-validation step of PLS is the part that is parallelized. The algorithm is validated by regression analysis of Tennessee-Eastman process simulation data on a fully distributed three-node Hadoop cluster built from three ordinary PCs. The experimental results show that, for large-scale industrial process data, the proposed algorithm preserves accuracy while significantly reducing modeling time, achieves an approximately linear speedup as the number of PCs increases, and can be scaled up easily.
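The idea of parallelizing cross-validation can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the Hadoop API: the map and reduce phases are simulated sequentially in plain Python. It assumes a single response variable (PLS1 with NIPALS deflation) and picks the component count with the smallest PRESS, which is a simplification, since the abstract does not state the selection criterion. All function names (`fit_pls1`, `map_fold`, `reduce_press`, `cross_validate`) are hypothetical.

```python
import math
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def column(M, j):
    return [row[j] for row in M]

def fit_pls1(X, y, max_comp):
    """Fit PLS1 by NIPALS deflation; returns (x_mean, y_mean, components)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(column(X, j)) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    comps = []
    for _ in range(max_comp):
        w = [dot(column(E, j), f) for j in range(m)]  # weight direction X^T y
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:      # nothing left to explain: stop early
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                # scores
        tt = dot(t, t)
        p = [dot(column(E, j), t) / tt for j in range(m)]  # X loadings
        q = dot(f, t) / tt                                 # y loading
        E = [[row[j] - ti * p[j] for j in range(m)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps

def predict_by_h(model, x, max_comp):
    """Predictions for one sample using 1, 2, ..., max_comp components."""
    x_mean, y_mean, comps = model
    e = [xi - mi for xi, mi in zip(x, x_mean)]
    y_hat, out = y_mean, []
    for h in range(max_comp):
        if h < len(comps):
            w, p, q = comps[h]
            t = dot(e, w)
            y_hat += q * t
            e = [ei - t * pi for ei, pi in zip(e, p)]
        out.append(y_hat)
    return out

def map_fold(k, X, y, n_folds, max_comp):
    """Map task (assumed design): fold k is held out; emit (h, PRESS_h) pairs."""
    train = [i for i in range(len(X)) if i % n_folds != k]
    test = [i for i in range(len(X)) if i % n_folds == k]
    model = fit_pls1([X[i] for i in train], [y[i] for i in train], max_comp)
    press = [0.0] * max_comp
    for i in test:
        for h, y_hat in enumerate(predict_by_h(model, X[i], max_comp)):
            press[h] += (y[i] - y_hat) ** 2
    return [(h + 1, v) for h, v in enumerate(press)]

def reduce_press(pairs):
    """Reduce task: sum the partial PRESS values for each component count h."""
    total = defaultdict(float)
    for h, v in pairs:
        total[h] += v
    return dict(total)

def cross_validate(X, y, n_folds, max_comp):
    pairs = []
    for k in range(n_folds):  # on a cluster, each k could run as its own map task
        pairs.extend(map_fold(k, X, y, n_folds, max_comp))
    press = reduce_press(pairs)
    return press, min(press, key=press.get)

# Small synthetic example (not Tennessee-Eastman data): y is linear in X.
X = [[i % 4, (i * i) % 5, (3 * i) % 7] for i in range(12)]
y = [2 * a - b + 0.5 * c for a, b, c in X]
press, best = cross_validate(X, y, n_folds=3, max_comp=3)
```

Because the folds are independent of one another, the only data exchanged between the phases in this sketch are the small `(h, PRESS_h)` pairs, which is what makes cross-validation a natural candidate for a map/reduce split.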
Authors
王德政
张益农
杨帆
WANG Dezheng; ZHANG Yinong; YANG Fan (Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China; Department of Automation, Tsinghua University, Beijing 100084, China)
Source
《计算机工程与应用》
CSCD
Peking University Core Journals (北大核心)
2018, Issue 24, pp. 61-65, 175 (6 pages in total)
Computer Engineering and Applications
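Funding
National Natural Science Foundation of China (No. 61433001)
Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. CIT&TCD20150314)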