Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these adv...Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge, often relying on trial-and-error optimization methods. This paper introduces an optimization technique for CUDA programs through a novel Data Layout strategy, aimed at restructuring memory data arrangement to significantly enhance data access locality. Focusing on the dynamic programming algorithm for chained matrix multiplication—a critical operation across various domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT)—this technique facilitates more localized access. We specifically illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique’s broader applicability and its potential to address some of the most pressing computational challenges in GPU-accelerated applications. Our findings reveal a remarkable reduction in memory consumption and a substantial 50% decrease in execution time for CUDA programs utilizing this technique, thereby setting a new benchmark for optimization in GPU computing.展开更多
The trajectory optimization of an unpowered reentry vehicle via artificial emotion memory optimization(AEMO)is discussed.Firstly,reentry dynamics are established based on multiple constraints and parameterized control...The trajectory optimization of an unpowered reentry vehicle via artificial emotion memory optimization(AEMO)is discussed.Firstly,reentry dynamics are established based on multiple constraints and parameterized control variables with finite dimensions are designed.If the constraint is not satisfied,a distance measure and an adaptive penalty function are used to address this scenario.Secondly,AEMO is introduced to solve the trajectory optimization problem.Based on the theories of biology and cognition,the trial solutions based on emotional memory are established.Three search strategies are designed for realizing the random search of trial solutions and for avoiding becoming trapped in a local minimum.The states of the trial solutions are determined according to the rules of memory enhancement and forgetting.As the iterations proceed,the trial solutions with poor quality will gradually be forgotten.Therefore,the number of trial solutions is decreased,and the convergence of the algorithm is accelerated.Finally,a numerical simulation is conducted,and the results demonstrate that the path and terminal constraints are satisfied and the method can realize satisfactory performance.展开更多
Spark is a distributed data processing framework based on memory.Memory allocation is a focus question of Spark research.A good memory allocation scheme can effectively improve the efficiency of task execution and mem...Spark is a distributed data processing framework based on memory.Memory allocation is a focus question of Spark research.A good memory allocation scheme can effectively improve the efficiency of task execution and memory resource utilization of the Spark.Aiming at the memory allocation problem in the Spark2.x version,this paper optimizes the memory allocation strategy by analyzing the Spark memory model,the existing cache replacement algorithms and the memory allocation methods,which is on the basis of minimizing the storage area and allocating the execution area according to the demand.It mainly including two parts:cache replacement optimization and memory allocation optimization.Firstly,in the storage area,the cache replacement algorithm is optimized according to the characteristics of RDD Partition,which is combined with PCA dimension.In this section,the four features of RDD Partition are selected.When the RDD cache is replaced,only two most important features are selected by PCA dimension reduction method each time,thereby ensuring the generalization of the cache replacement strategy.Secondly,the memory allocation strategy of the execution area is optimized according to the memory requirement of Task and the memory space of storage area.In this paper,a series of experiments in Spark on Yarn mode are carried out to verify the effectiveness of the optimization algorithm and improve the cluster performance.展开更多
With the rapid development of big data and artificial intelligence(AI),the cloud platform architecture system is constantly developing,optimizing,and improving.As such,new applications,like deep computing and high-per...With the rapid development of big data and artificial intelligence(AI),the cloud platform architecture system is constantly developing,optimizing,and improving.As such,new applications,like deep computing and high-performance computing,require enhanced computing power.To meet this requirement,a non-uniform memory access(NUMA)configuration method is proposed for the cloud computing system according to the affinity,adaptability,and availability of the NUMA architecture processor platform.The proposed method is verified based on the test environment of a domestic central processing unit(CPU).展开更多
As a large amount of data is increasingly generated from edge devices,such as smart homes,mobile phones,and wearable devices,it becomes crucial for many applications to deploy machine learning modes across edge device...As a large amount of data is increasingly generated from edge devices,such as smart homes,mobile phones,and wearable devices,it becomes crucial for many applications to deploy machine learning modes across edge devices.The execution speed of the deployed model is a key element to ensure service quality.Considering a highly heterogeneous edge deployment scenario,deep learning compiling is a novel approach that aims to solve this problem.It defines models using certain DSLs and generates efficient code implementations on different hardware devices.However,there are still two aspects that are not yet thoroughly investigated yet.The first is the optimization of memory-intensive operations,and the second problem is the heterogeneity of the deployment target.To that end,in this work,we propose a system solution that optimizes memory-intensive operation,optimizes the subgraph distribution,and enables the compiling and deployment of DNN models on multiple targets.The evaluation results show the performance of our proposed system.展开更多
This paper presents the architecture of a Convolution Neural Network(CNN)accelerator based on a newprocessing element(PE)array called a diagonal cyclic array(DCA).As demonstrated,it can significantly reduce the burden...This paper presents the architecture of a Convolution Neural Network(CNN)accelerator based on a newprocessing element(PE)array called a diagonal cyclic array(DCA).As demonstrated,it can significantly reduce the burden of repeated memory accesses for feature data and weight parameters of the CNN models,which maximizes the data reuse rate and improve the computation speed.Furthermore,an integrated computation architecture has been implemented for the activation function,max-pooling,and activation function after convolution calculation,reducing the hardware resource.To evaluate the effectiveness of the proposed architecture,a CNN accelerator has been implemented for You Only Look Once version 2(YOLOv2)-Tiny consisting of 9 layers.Furthermore,the methodology to optimize the local buffer size with little sacrifice of inference speed is presented in this work.We implemented the proposed CNN accelerator using a Xilinx Zynq ZCU102 Ultrascale+Field Programmable Gate Array(FPGA)and ISE Design Suite.The FPGA implementation uses 34,336 Look Up Tables(LUTs),576 Digital Signal Processing(DSP)blocks,and an on-chip memory of only 58 KB,and it could achieve accuracies of 57.92% and 56.42% mean Average Precession@0.5 thresholds for intersection over union(mAP@0.5)using quantized 16-bit and 8-bit full integer data manipulation with only 0.68% as a loss for 8-bit version and computation time of 137.9 and 69 ms for each input image respectively using a clock speed of 200 MHz.These speeds are expected to be doubled five times using a clock speed of 1GHz if implemented in a silicon System on Chip(SoC)using a sub-micron process.展开更多
This letter exploits fundamental characteristics of a wavelet transform image to form a progressive octave-based spatial resolution. Each wavelet subband is coded based on zeroblock and quardtree partitioning ordering...This letter exploits fundamental characteristics of a wavelet transform image to form a progressive octave-based spatial resolution. Each wavelet subband is coded based on zeroblock and quardtree partitioning ordering scheme with memory optimization technique. The method proposed in this letter is of low complexity and efficient for Internet plug-in software.展开更多
Spark,a distributed computing platform,has rapidly developed in the field of big data.Its in-memory computing feature reduces disk read overhead and shortens data processing time,making it have broad application prosp...Spark,a distributed computing platform,has rapidly developed in the field of big data.Its in-memory computing feature reduces disk read overhead and shortens data processing time,making it have broad application prospects in large-scale computing applications such as machine learning and image processing.However,the performance of the Spark platform still needs to be improved.When a large number of tasks are processed simultaneously,Spark’s cache replacementmechanismcannot identify high-value data partitions,resulting inmemory resources not being fully utilized and affecting the performance of the Spark platform.To address the problem that Spark’s default cache replacement algorithm cannot accurately evaluate high-value data partitions,firstly the weight influence factors of data partitions are modeled and evaluated.Then,based on this weighted model,a cache replacement algorithm based on dynamic weighted data value is proposed,which takes into account hit rate and data difference.Better integration and usage strategies are implemented based on LRU(LeastRecentlyUsed).Theweight update algorithm updates the weight value when the data partition information changes,accurately measuring the importance of the partition in the current job;the cache removal algorithm clears partitions without useful values in the cache to releasememory resources;the weight replacement algorithm combines partition weights and partition information to replace RDD partitions when memory remaining space is insufficient.Finally,by setting up a Spark cluster environment,the algorithm proposed in this paper is experimentally verified.Experiments have shown that this algorithmcan effectively improve cache hit rate,enhance the performance of the platform,and reduce job execution time by 7.61%compared to existing improved algorithms.展开更多
COVID’19 has caused the entire universe to be in existential healthcrisis by spreading globally in the year 2020. The lungs infection is detected inComputed Tomography (CT) images which provide the best way to increa...COVID’19 has caused the entire universe to be in existential healthcrisis by spreading globally in the year 2020. The lungs infection is detected inComputed Tomography (CT) images which provide the best way to increasethe existing healthcare schemes in preventing the deadly virus. Nevertheless,separating the infected areas in CT images faces various issues such as lowintensity difference among normal and infectious tissue and high changes inthe characteristics of the infection. To resolve these issues, a new inf-Net (LungInfection Segmentation Deep Network) is designed for detecting the affectedareas from the CT images automatically. For the worst segmentation results,the Edge-Attention Representation (EAR) is optimized using AdaptiveDonkey and Smuggler Optimization (ADSO). The edges which are identifiedby the ADSO approach is utilized for calculating dissimilarities. An IFCM(Intuitionistic Fuzzy C-Means) clustering approach is applied for computingthe similarity of the EA component among the generated edge maps andGround-Truth (GT) edge maps. Also, a Semi-Supervised Segmentation(SSS) structure is designed using the Randomly Selected Propagation (RP)technique and Inf-Net, which needs only less number of images and unlabelleddata. Semi-Supervised Multi-Class Segmentation (SSMCS) is designed usinga Bi-LSTM (Bi-Directional Long-Short-Term-memory), acquires all theadvantages of the disease segmentation done using Semi Inf-Net and enhancesthe execution of multi-class disease labelling. The newly designed SSMCSapproach is compared with existing U-Net++, MCS, and Semi-Inf-Net.factors such as MAE (Mean Absolute Error), Structure measure, Specificity(Spec), Dice Similarity coefficient, Sensitivity (Sen), and Enhance-AlignmentMeasure are considered for evaluation purpose.展开更多
Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at t...Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores.展开更多
The radiation damage effect of key structural materials is one of the main research subjects of the numerical reactor.From the perspective of experimental safety and feasibility,Molecular Dynamics(MD)in the materials ...The radiation damage effect of key structural materials is one of the main research subjects of the numerical reactor.From the perspective of experimental safety and feasibility,Molecular Dynamics(MD)in the materials field is an ideal method for simulating the radiation damage of structural materials.The Crystal-MD represents a massive parallel MD simulation software based on the key material characteristics of reactors.Compared with the Large-scale Atomic/Molecurlar Massively Parallel Simulator(LAMMPS)and ITAP Molecular Dynamics(IMD)software,the Crystal-MD reduces the memory required for software operation to a certain extent,but it is very time-consuming.Moreover,the calculation results of the Crystal-MD have large deviations,and there are also some problems,such as memory limitation and frequent communication during its migration and optimization.In this paper,in order to solve the above problems,the memory access mode of the Crystal-MD software is studied.Based on the memory access mode,a memory access optimization strategy is proposed for a unique architecture of China’s supercomputer Sunway Taihu Light.The proposed optimization strategy is verified by the experiments,and experimental results show that the running speed of the Crystal-MD is increased significantly by using the proposed optimization strategy.展开更多
Computational Fluid Dynamics-Discrete Element Method is used to model gas-solid systems in several applications in energy,pharmaceutical and petrochemical industries.Computational performance bot-tlenecks often limit ...Computational Fluid Dynamics-Discrete Element Method is used to model gas-solid systems in several applications in energy,pharmaceutical and petrochemical industries.Computational performance bot-tlenecks often limit the problem sizes that can be simulated at industrial scale.The data structures used to store several millions of particles in such large-scale simulations have a large memory footprint that does not fit into the processor cache hierarchies on current high-performance-computing platforms,leading to reduced computational performance.This paper specifically addresses this aspect of memory access bottlenecks in industrial scale simulations.The use of space-flling curves to improve memory access patterns is described and their impact on computational performance is quantified in both shared and distributed memory parallelization paradigms.The Morton space flling curve applied to uniform grids and k-dimensional tree partitions are used to reorder the particle data-structure thus improving spatial and temporal locality in memory.The performance impact of these techniques when applied to two benchmark problems,namely the homogeneous-cooling-system and a fluidized-bed,are presented.These optimization techniques lead to approximately two-fold performance improvement in particle focused operations such as neighbor-list creation and data-exchange,with~1.5 times overall improvement in a fluidization simulation with 1.27 million particles.展开更多
For studying and optimizing the performance of general-purpose computing on graphics processing units(GPGPU)based on single instruction multiple threads(SIMT)processor about the neural network application,this work co...For studying and optimizing the performance of general-purpose computing on graphics processing units(GPGPU)based on single instruction multiple threads(SIMT)processor about the neural network application,this work contributes a self-developed SIMT processor named Pomelo and correlated assembly program.The parallel mechanism of SIMT computing mode and self-developed Pomelo processor is briefly introduced.A common convolutional neural network(CNN)is built to verify the compatibility and functionality of the Pomelo processor.CNN computing flow with task level and hardware level optimization is adopted on the Pomelo processor.A specific algorithm for organizing a Z-shaped memory structure is developed,which addresses reducing memory access in mass data computing tasks.Performing the above-combined adaptation and optimization strategy,the experimental result demonstrates that reducing memory access in SIMT computing mode plays a crucial role in improving performance.A 6.52 times performance is achieved on the 4 processing elements case.展开更多
文摘Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge, often relying on trial-and-error optimization methods. This paper introduces an optimization technique for CUDA programs through a novel Data Layout strategy, aimed at restructuring memory data arrangement to significantly enhance data access locality. Focusing on the dynamic programming algorithm for chained matrix multiplication—a critical operation across various domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT)—this technique facilitates more localized access. We specifically illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique’s broader applicability and its potential to address some of the most pressing computational challenges in GPU-accelerated applications. Our findings reveal a remarkable reduction in memory consumption and a substantial 50% decrease in execution time for CUDA programs utilizing this technique, thereby setting a new benchmark for optimization in GPU computing.
基金supported by the Defense Science and Technology Key Laboratory Fund of Luoyang Electro-optical Equipment Institute,Aviation Industry Corporation of China(6142504200108).
文摘The trajectory optimization of an unpowered reentry vehicle via artificial emotion memory optimization(AEMO)is discussed.Firstly,reentry dynamics are established based on multiple constraints and parameterized control variables with finite dimensions are designed.If the constraint is not satisfied,a distance measure and an adaptive penalty function are used to address this scenario.Secondly,AEMO is introduced to solve the trajectory optimization problem.Based on the theories of biology and cognition,the trial solutions based on emotional memory are established.Three search strategies are designed for realizing the random search of trial solutions and for avoiding becoming trapped in a local minimum.The states of the trial solutions are determined according to the rules of memory enhancement and forgetting.As the iterations proceed,the trial solutions with poor quality will gradually be forgotten.Therefore,the number of trial solutions is decreased,and the convergence of the algorithm is accelerated.Finally,a numerical simulation is conducted,and the results demonstrate that the path and terminal constraints are satisfied and the method can realize satisfactory performance.
文摘Spark is a distributed data processing framework based on memory.Memory allocation is a focus question of Spark research.A good memory allocation scheme can effectively improve the efficiency of task execution and memory resource utilization of the Spark.Aiming at the memory allocation problem in the Spark2.x version,this paper optimizes the memory allocation strategy by analyzing the Spark memory model,the existing cache replacement algorithms and the memory allocation methods,which is on the basis of minimizing the storage area and allocating the execution area according to the demand.It mainly including two parts:cache replacement optimization and memory allocation optimization.Firstly,in the storage area,the cache replacement algorithm is optimized according to the characteristics of RDD Partition,which is combined with PCA dimension.In this section,the four features of RDD Partition are selected.When the RDD cache is replaced,only two most important features are selected by PCA dimension reduction method each time,thereby ensuring the generalization of the cache replacement strategy.Secondly,the memory allocation strategy of the execution area is optimized according to the memory requirement of Task and the memory space of storage area.In this paper,a series of experiments in Spark on Yarn mode are carried out to verify the effectiveness of the optimization algorithm and improve the cluster performance.
基金the National Key Research and Development Program of China(No.2017YFC0212100)National High-tech R&D Program of China(No.2015AA015308).
文摘With the rapid development of big data and artificial intelligence(AI),the cloud platform architecture system is constantly developing,optimizing,and improving.As such,new applications,like deep computing and high-performance computing,require enhanced computing power.To meet this requirement,a non-uniform memory access(NUMA)configuration method is proposed for the cloud computing system according to the affinity,adaptability,and availability of the NUMA architecture processor platform.The proposed method is verified based on the test environment of a domestic central processing unit(CPU).
基金supported by the National Natural Science Foundation of China(U21A20519)。
文摘As a large amount of data is increasingly generated from edge devices,such as smart homes,mobile phones,and wearable devices,it becomes crucial for many applications to deploy machine learning modes across edge devices.The execution speed of the deployed model is a key element to ensure service quality.Considering a highly heterogeneous edge deployment scenario,deep learning compiling is a novel approach that aims to solve this problem.It defines models using certain DSLs and generates efficient code implementations on different hardware devices.However,there are still two aspects that are not yet thoroughly investigated yet.The first is the optimization of memory-intensive operations,and the second problem is the heterogeneity of the deployment target.To that end,in this work,we propose a system solution that optimizes memory-intensive operation,optimizes the subgraph distribution,and enables the compiling and deployment of DNN models on multiple targets.The evaluation results show the performance of our proposed system.
基金supported by the National Research Foundation of Korea(NRF)grant funded by the Korea government(MSIT)(No.2022R1A5A8026986)supported by the Institute of Information&communications Technology Planning&Evaluation(IITP)grant funded by the Korean government(MSIT)(No.2020-0-01304,Development of Self-learnable Mobile Recursive Neural Network Processor Technology)supported by the MSIT(Ministry of Science and ICT),Korea,under the Grand Information Technology Research Center support program(IITP-2023-2020-0-01462)'supervised by the IITP(Institute for Information&communications Technology Planning&Evaluation)and supported by the National Research Foundation of Korea(NRF)grant funded by the Korea government(MSIT)(No.2021R1F1A1061314).
文摘This paper presents the architecture of a Convolution Neural Network(CNN)accelerator based on a newprocessing element(PE)array called a diagonal cyclic array(DCA).As demonstrated,it can significantly reduce the burden of repeated memory accesses for feature data and weight parameters of the CNN models,which maximizes the data reuse rate and improve the computation speed.Furthermore,an integrated computation architecture has been implemented for the activation function,max-pooling,and activation function after convolution calculation,reducing the hardware resource.To evaluate the effectiveness of the proposed architecture,a CNN accelerator has been implemented for You Only Look Once version 2(YOLOv2)-Tiny consisting of 9 layers.Furthermore,the methodology to optimize the local buffer size with little sacrifice of inference speed is presented in this work.We implemented the proposed CNN accelerator using a Xilinx Zynq ZCU102 Ultrascale+Field Programmable Gate Array(FPGA)and ISE Design Suite.The FPGA implementation uses 34,336 Look Up Tables(LUTs),576 Digital Signal Processing(DSP)blocks,and an on-chip memory of only 58 KB,and it could achieve accuracies of 57.92% and 56.42% mean Average Precession@0.5 thresholds for intersection over union(mAP@0.5)using quantized 16-bit and 8-bit full integer data manipulation with only 0.68% as a loss for 8-bit version and computation time of 137.9 and 69 ms for each input image respectively using a clock speed of 200 MHz.These speeds are expected to be doubled five times using a clock speed of 1GHz if implemented in a silicon System on Chip(SoC)using a sub-micron process.
文摘This letter exploits fundamental characteristics of a wavelet transform image to form a progressive octave-based spatial resolution. Each wavelet subband is coded based on zeroblock and quardtree partitioning ordering scheme with memory optimization technique. The method proposed in this letter is of low complexity and efficient for Internet plug-in software.
基金the National Natural Science Foundation of China(61872284)Key Research and Development Program of Shaanxi(2023-YBGY-203,2023-YBGY-021)+3 种基金Industrialization Project of Shaanxi ProvincialDepartment of Education(21JC017)“Thirteenth Five-Year”National Key R&D Program Project(Project Number:2019YFD1100901)Natural Science Foundation of Shannxi Province,China(2021JLM-16,2023-JC-YB-825)Key R&D Plan of Xianyang City(L2023-ZDYF-QYCX-021)。
文摘Spark,a distributed computing platform,has rapidly developed in the field of big data.Its in-memory computing feature reduces disk read overhead and shortens data processing time,making it have broad application prospects in large-scale computing applications such as machine learning and image processing.However,the performance of the Spark platform still needs to be improved.When a large number of tasks are processed simultaneously,Spark’s cache replacementmechanismcannot identify high-value data partitions,resulting inmemory resources not being fully utilized and affecting the performance of the Spark platform.To address the problem that Spark’s default cache replacement algorithm cannot accurately evaluate high-value data partitions,firstly the weight influence factors of data partitions are modeled and evaluated.Then,based on this weighted model,a cache replacement algorithm based on dynamic weighted data value is proposed,which takes into account hit rate and data difference.Better integration and usage strategies are implemented based on LRU(LeastRecentlyUsed).Theweight update algorithm updates the weight value when the data partition information changes,accurately measuring the importance of the partition in the current job;the cache removal algorithm clears partitions without useful values in the cache to releasememory resources;the weight replacement algorithm combines partition weights and partition information to replace RDD partitions when memory remaining space is insufficient.Finally,by setting up a Spark cluster environment,the algorithm proposed in this paper is experimentally verified.Experiments have shown that this algorithmcan effectively improve cache hit rate,enhance the performance of the platform,and reduce job execution time by 7.61%compared to existing improved algorithms.
文摘COVID’19 has caused the entire universe to be in existential healthcrisis by spreading globally in the year 2020. The lungs infection is detected inComputed Tomography (CT) images which provide the best way to increasethe existing healthcare schemes in preventing the deadly virus. Nevertheless,separating the infected areas in CT images faces various issues such as lowintensity difference among normal and infectious tissue and high changes inthe characteristics of the infection. To resolve these issues, a new inf-Net (LungInfection Segmentation Deep Network) is designed for detecting the affectedareas from the CT images automatically. For the worst segmentation results,the Edge-Attention Representation (EAR) is optimized using AdaptiveDonkey and Smuggler Optimization (ADSO). The edges which are identifiedby the ADSO approach is utilized for calculating dissimilarities. An IFCM(Intuitionistic Fuzzy C-Means) clustering approach is applied for computingthe similarity of the EA component among the generated edge maps andGround-Truth (GT) edge maps. Also, a Semi-Supervised Segmentation(SSS) structure is designed using the Randomly Selected Propagation (RP)technique and Inf-Net, which needs only less number of images and unlabelleddata. Semi-Supervised Multi-Class Segmentation (SSMCS) is designed usinga Bi-LSTM (Bi-Directional Long-Short-Term-memory), acquires all theadvantages of the disease segmentation done using Semi Inf-Net and enhancesthe execution of multi-class disease labelling. The newly designed SSMCSapproach is compared with existing U-Net++, MCS, and Semi-Inf-Net.factors such as MAE (Mean Absolute Error), Structure measure, Specificity(Spec), Dice Similarity coefficient, Sensitivity (Sen), and Enhance-AlignmentMeasure are considered for evaluation purpose.
基金Project(2008AA01A201) supported the National High-tech Research and Development Program of ChinaProjects(60833004, 60633050) supported by the National Natural Science Foundation of China
文摘Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores.
基金supported by the National Key R&D Program of China(No.2017YFB0202003)。
文摘The radiation damage effect of key structural materials is one of the main research subjects of the numerical reactor.From the perspective of experimental safety and feasibility,Molecular Dynamics(MD)in the materials field is an ideal method for simulating the radiation damage of structural materials.The Crystal-MD represents a massive parallel MD simulation software based on the key material characteristics of reactors.Compared with the Large-scale Atomic/Molecurlar Massively Parallel Simulator(LAMMPS)and ITAP Molecular Dynamics(IMD)software,the Crystal-MD reduces the memory required for software operation to a certain extent,but it is very time-consuming.Moreover,the calculation results of the Crystal-MD have large deviations,and there are also some problems,such as memory limitation and frequent communication during its migration and optimization.In this paper,in order to solve the above problems,the memory access mode of the Crystal-MD software is studied.Based on the memory access mode,a memory access optimization strategy is proposed for a unique architecture of China’s supercomputer Sunway Taihu Light.The proposed optimization strategy is verified by the experiments,and experimental results show that the running speed of the Crystal-MD is increased significantly by using the proposed optimization strategy.
文摘Computational Fluid Dynamics-Discrete Element Method is used to model gas-solid systems in several applications in energy,pharmaceutical and petrochemical industries.Computational performance bot-tlenecks often limit the problem sizes that can be simulated at industrial scale.The data structures used to store several millions of particles in such large-scale simulations have a large memory footprint that does not fit into the processor cache hierarchies on current high-performance-computing platforms,leading to reduced computational performance.This paper specifically addresses this aspect of memory access bottlenecks in industrial scale simulations.The use of space-flling curves to improve memory access patterns is described and their impact on computational performance is quantified in both shared and distributed memory parallelization paradigms.The Morton space flling curve applied to uniform grids and k-dimensional tree partitions are used to reorder the particle data-structure thus improving spatial and temporal locality in memory.The performance impact of these techniques when applied to two benchmark problems,namely the homogeneous-cooling-system and a fluidized-bed,are presented.These optimization techniques lead to approximately two-fold performance improvement in particle focused operations such as neighbor-list creation and data-exchange,with~1.5 times overall improvement in a fluidization simulation with 1.27 million particles.
基金the Scientific Research Program Funded by Shaanxi Provincial Education Department(20JY058)。
文摘For studying and optimizing the performance of general-purpose computing on graphics processing units(GPGPU)based on single instruction multiple threads(SIMT)processor about the neural network application,this work contributes a self-developed SIMT processor named Pomelo and correlated assembly program.The parallel mechanism of SIMT computing mode and self-developed Pomelo processor is briefly introduced.A common convolutional neural network(CNN)is built to verify the compatibility and functionality of the Pomelo processor.CNN computing flow with task level and hardware level optimization is adopted on the Pomelo processor.A specific algorithm for organizing a Z-shaped memory structure is developed,which addresses reducing memory access in mass data computing tasks.Performing the above-combined adaptation and optimization strategy,the experimental result demonstrates that reducing memory access in SIMT computing mode plays a crucial role in improving performance.A 6.52 times performance is achieved on the 4 processing elements case.