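Abstract: [Objective] AI for Science methods are profoundly reshaping the landscape of scientific computing. By combining physical models, artificial intelligence, and high-performance computing, they tackle the high-dimensional problems of traditional scientific computing through data fitting, extending the temporal and spatial scales of high-accuracy scientific computing by orders of magnitude and driving a shift in the research paradigm. [Methods] Focusing on molecular dynamics with first-principles accuracy, this paper proposes an HPC+AI-driven AI for Science computing platform. In response to the changes and challenges that AI for Science brings to the workflow, it describes the key technologies and procedures for building such a platform from four aspects: scientific data generation and dataset preparation, configuration-space exploration and training-sample labeling, efficient training of AI for Science models, and efficient large-scale inference. [Results] Building on an integrated AI for Science workflow, and targeting the representative application of HPC+AI-driven molecular dynamics with first-principles accuracy, the platform proposes a Kalman-filter-based active learning strategy; improves a quasi-second-order AI model training method, cutting training time from days to minutes; and applies a fifth-order polynomial AI model compression technique that, on the same hardware, increases the system size handled in model inference by one order of magnitude and improves time-to-solution by a factor of 3 to 9. [Conclusions] Integrating this work yields an AI for Science computing platform that can be used for molecular dynamics with first-principles accuracy. [Limitations and Outlook] AI for Science computing methods and workflows are still developing rapidly, and major challenges remain in high-accuracy data, more general AI models, and efficient computing methods; these will be important directions for future work.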
Funding: Supported by the National Key Research and Development Program of China (No. 2017YFB0202105, 2016YFB0201305, 2016YFB0200803, 2016YFB0200300) and the National Natural Science Foundation of China (No. 61521092, 91430218, 31327901, 61472395, 61432018).
Abstract: Performance models provide insightful perspectives for predicting performance and proposing optimization guidance. Although there has been much research, pinpointing the bottlenecks of various memory access patterns and achieving highly accurate prediction for both regular and irregular programs on various hardware configurations remain non-trivial. This work proposes a novel model called process-RAM-feedback (PRF) to quantify the overhead of computation and data transmission time on general-purpose multi-core processors. The PRF model predicts the instruction cost on a single core with a directed acyclic graph (DAG), and the transmission time of memory accesses between levels of the memory hierarchy with a newly designed cache simulator. Using performance modeling and a feedback optimization method, this paper applies the PRF model to analyze and optimize convolution, sparse matrix-vector multiplication, and sn-sweep as case studies, covering kernels that range from typical regular ones to irregular and data-dependent ones. Through the PRF model, it obtains optimization guidance for various sparsity structures, algorithm designs, and instruction-set support on different data sizes.
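As a rough illustration of how a DAG can bound single-core instruction cost, the sketch below computes the critical-path length of a small dependency graph. It is not the PRF model itself, whose details the abstract does not give. The node names, the latencies, and the function `critical_path_cost` are hypothetical, and the sketch assumes unlimited issue width, so only data dependencies limit the schedule.

```python
def critical_path_cost(latency, deps):
    """Return the longest dependency chain, in cycles, of an instruction DAG.

    latency: dict mapping node -> latency in cycles (hypothetical values).
    deps:    dict mapping node -> list of nodes it must wait for.
    """
    finish = {}  # node -> earliest finish time in cycles

    def visit(node, active):
        if node in finish:
            return finish[node]
        if node in active:
            raise ValueError("dependency graph contains a cycle")
        active = active | {node}
        # A node can start only after all of its predecessors have finished.
        start = max((visit(p, active) for p in deps.get(node, ())), default=0)
        finish[node] = start + latency[node]
        return finish[node]

    return max(visit(node, frozenset()) for node in latency)


# Hypothetical multiply-add: two loads feed a multiply, then add, then store.
latency = {"load_a": 4, "load_b": 4, "mul": 3, "add": 1, "store": 1}
deps = {"mul": ["load_a", "load_b"], "add": ["mul"], "store": ["add"]}
cost = critical_path_cost(latency, deps)  # 4 + 3 + 1 + 1 = 9 cycles
```

The two loads overlap, so the estimate is 9 cycles rather than the 13 obtained by summing all latencies; a real model would also have to account for limited execution ports and memory traffic.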
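Abstract: In ocean data assimilation, the singular value decomposition (SVD) used for matrix inversion in the ensemble optimal interpolation method is very time-consuming. This work optimizes the inversion step of ensemble optimal interpolation by replacing SVD with LU, Cholesky, and QR decomposition in turn. First, LU decomposition (or Cholesky or QR decomposition) yields the corresponding triangular (or orthogonal) matrices; the factors are then used to compute the required inverse. Because the algorithmic complexity of LU, Cholesky, and QR decomposition is much lower than that of SVD, the improved assimilation program gains substantially in performance. Numerical results show that, relative to SVD, each of the three decompositions improves the computational efficiency of ensemble optimal interpolation at least twofold. Notably, of the four decompositions, Cholesky gives the best overall performance for the assimilation program.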
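The factor-then-solve idea in this abstract can be sketched in plain Python for the Cholesky case. This is a minimal illustration, not the authors' assimilation code. It assumes the matrix being inverted is symmetric positive definite, as is typical for an observation-space covariance of the form HPH^T + R. The matrix `A`, the vector `innovation`, and both function names are made up for the example.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a must be SPD)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def cholesky_solve(a, b):
    """Solve a x = b without forming the inverse of a explicitly."""
    L = cholesky(a)
    n = len(b)
    # Forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Backward substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


# Hypothetical 3x3 SPD matrix standing in for H P H^T + R.
A = [[4.0, 2.0, 0.6], [2.0, 5.0, 1.0], [0.6, 1.0, 3.0]]
innovation = [1.0, 2.0, 3.0]
weights = cholesky_solve(A, innovation)
```

Solving against the triangular factors avoids both the SVD and an explicit inverse; in production code the same steps would normally be delegated to an optimized LAPACK routine rather than written by hand.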
Funding: Supported by the National Natural Science Foundation of China (No. 61432018).
Abstract: 3D reverse time migration in tilted transversely isotropic media (3D RTM-TTI) is the most precise model for complex seismic imaging. However, the vast computing time of 3D RTM-TTI prevents it from being widely used. This work addresses the problem by providing parallel solutions for 3D RTM-TTI on multi-core and many-core processors. After data parallelism and memory optimization, the hot-spot function of 3D RTM-TTI gains a 35.99× speedup on two Intel Xeon CPUs, 89.75× on one Intel Xeon Phi, and 89.92× on one NVIDIA K20 GPU, compared with a serial CPU baseline. This study makes RTM-TTI practical in industry. Since the computation pattern in RTM is a stencil, the approaches also benefit a wide range of stencil-based applications.
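For readers unfamiliar with the stencil pattern mentioned at the end of this abstract, the following minimal sketch applies a 7-point Laplacian stencil on a 3D grid. It is a generic illustration, not the RTM-TTI kernel. The function name, the flat row-major layout, and the test field are assumptions made for this example.

```python
def laplacian_7pt(u, nx, ny, nz):
    """Apply a 7-point Laplacian stencil to the interior of a 3D grid.

    u is a flat list in row-major order, so point (i, j, k) lives at
    index (i * ny + j) * nz + k. Boundary points are left at zero.
    """
    def idx(i, j, k):
        return (i * ny + j) * nz + k

    out = [0.0] * (nx * ny * nz)
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                c = idx(i, j, k)
                # Each output point reads itself and its six face neighbors.
                out[c] = (
                    u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                    + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                    + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]
                    - 6.0 * u[c]
                )
    return out


# Test field u = i^2 + j^2 + k^2, whose discrete Laplacian is 6 in the interior.
n = 4
field = [float(i * i + j * j + k * k)
         for i in range(n) for j in range(n) for k in range(n)]
result = laplacian_7pt(field, n, n, n)
```

Every interior point is independent of the others within one sweep, which is what makes the pattern a natural fit for data parallelism; the neighbor reads along the slowest-varying axis are far apart in memory, which is why memory optimization matters as much as parallelism.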
Funding: Supported by the National Key Research and Development Program of China (No. 2016YFB0201305), the National Science and Technology Major Project (No. 2013ZX0102-8001-001-001), and the National Natural Science Foundation of China (No. 91430218, 31327901, 61472395, 61272134, 61432018).
Abstract: Clustering data with varying densities and complicated structures is important, yet many existing clustering algorithms struggle with this problem, because varying densities and complicated structures make any single algorithm perform badly on some parts of the data. On the assumption that denser parts probably carry more information, an algorithm that clusters starting from the high-density part is proposed. It begins from a tiny distance to find the highest density-connected partition and form the corresponding super cores; the distance is then iteratively increased by a global heuristic method to cluster parts with different densities. The mean silhouette coefficient indicates clustering performance. A denoising function is implemented to eliminate the influence of noise and outliers. Many challenging experiments indicate that the algorithm performs well on data with widely varying densities and extremely complex structures. It determines the optimal number of clusters automatically, requires no background knowledge, is easy to tune, and is robust against noise and outliers.
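The mean silhouette coefficient named in this abstract as the performance indicator can be computed as in the sketch below. It follows the standard definition, s = (b - a) / max(a, b) per point, and is not taken from the paper's implementation. The function name and the toy data are hypothetical.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    a = mean distance from a point to the other members of its cluster
    b = smallest mean distance from the point to any other cluster
    Points in singleton clusters score 0 by convention.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster contributes 0
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: two tight pairs far apart (hypothetical).
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(pts, [0, 0, 1, 1])  # pairs grouped correctly
bad = mean_silhouette(pts, [0, 1, 0, 1])   # pairs split across clusters
```

Values near 1 indicate compact, well-separated clusters and negative values indicate points assigned to the wrong cluster, so comparing the score across candidate partitions is one way to pick among them.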
Funding: Supported by the National Basic Research Program of China (No. 2012CB316502), the National High Technology Research and Development Program of China (No. 2009AA01A129), and the National Natural Science Foundation of China (No. 60921002).
Abstract: The wide acceptance of medical image processing and the resulting data deluge call for faster and more efficient systems. Owing to recent advances in heterogeneous architectures, there has been a resurgence of research on FPGA-based and GPGPU-based accelerator design. This paper quantitatively analyzes the workload, computational intensity, and memory performance of a single-particle 3D reconstruction application called EMAN. It parallelizes the application on CUDA GPGPU architectures, decoupling the memory operations from the computing flow and orchestrating the thread-data mapping to reduce the overhead of off-chip memory operations. It then explores FPGA-based accelerator design by offloading computing-intensive kernels to dedicated hardware modules. Furthermore, a customized memory subsystem is designed to facilitate the decoupling and optimization of computing-dominated data access patterns. The proposed accelerator design strategies are evaluated against a parallelized program on a 4-core CPU. The CUDA version on a GTX480 shows a speedup of about 6 times, and the stream architecture implemented on a Xilinx Virtex LX330 FPGA achieves a speedup of 2.54 times. Measured in terms of power efficiency, the FPGA-based accelerator outperforms the 4-core CPU and the GTX480 by 7.3 times and 3.4 times, respectively.