Journal Articles
13 articles found
1. Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique
Authors: Neda Seifi, Abdullah Al-Mamun. Journal of Computer and Communications, 2024, Issue 5, pp. 124-139 (16 pages)
Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge, often relying on trial-and-error optimization methods. This paper introduces an optimization technique for CUDA programs through a novel Data Layout strategy, aimed at restructuring memory data arrangement to significantly enhance data access locality. Focusing on the dynamic programming algorithm for chained matrix multiplication—a critical operation across various domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT)—this technique facilitates more localized access. We specifically illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique’s broader applicability and its potential to address some of the most pressing computational challenges in GPU-accelerated applications. Our findings reveal a remarkable reduction in memory consumption and a substantial 50% decrease in execution time for CUDA programs utilizing this technique, thereby setting a new benchmark for optimization in GPU computing.
Keywords: Data Layout Optimization; CUDA Performance Optimization; GPU Memory Optimization; Dynamic Programming; Matrix Multiplication; Memory Access Pattern Optimization in CUDA
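The abstract above does not spell out the exact layout, but the general idea behind data-layout optimization for the matrix-chain DP table can be sketched: cells on the same anti-diagonal are computed in the same parallel wave, so storing each diagonal contiguously lets neighbouring threads touch adjacent memory. The following is a hedged, CPU-side C++ sketch of such a diagonal-major index mapping; the function names and the assumption that diagonals are the unit of parallel work are ours, not the paper's.

```cpp
// Hypothetical sketch: diagonal-major layout for the matrix-chain DP table.
// Cells on anti-diagonal d = j - i form one parallel wave; storing each
// diagonal contiguously makes accesses within a wave adjacent in memory.
#include <cstdint>
#include <cstdio>
#include <limits>
#include <vector>
#include <algorithm>

// Offset of the first cell of diagonal d in an n x n upper-triangular table.
inline std::size_t diag_base(std::size_t n, std::size_t d) {
    return d * n - d * (d - 1) / 2;            // sum of lengths of diagonals 0..d-1
}
// Cell (i, j) with j = i + d maps to diag_base(n, d) + i.
inline std::size_t diag_index(std::size_t n, std::size_t i, std::size_t j) {
    return diag_base(n, j - i) + i;
}

int main() {
    // dims[0..n] are matrix dimensions; cost[(i,j)] = min #multiplications for A_i..A_j.
    std::vector<std::size_t> dims = {30, 35, 15, 5, 10, 20, 25};
    std::size_t n = dims.size() - 1;
    std::vector<uint64_t> cost(n * (n + 1) / 2, 0);

    for (std::size_t d = 1; d < n; ++d) {            // one "wave" per diagonal
        for (std::size_t i = 0; i + d < n; ++i) {    // on a GPU: one thread per i
            std::size_t j = i + d;
            uint64_t best = std::numeric_limits<uint64_t>::max();
            for (std::size_t k = i; k < j; ++k) {
                uint64_t c = cost[diag_index(n, i, k)] + cost[diag_index(n, k + 1, j)]
                           + dims[i] * dims[k + 1] * dims[j + 1];
                best = std::min(best, c);
            }
            cost[diag_index(n, i, j)] = best;        // writes are contiguous within the wave
        }
    }
    std::printf("min multiplications: %llu\n",
                (unsigned long long)cost[diag_index(n, 0, n - 1)]);
    return 0;
}
```

On a GPU, the inner loop over i would map to the threads of one kernel launch per diagonal, which is where the contiguous layout would pay off as coalesced accesses.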
2. Trajectory optimization of a reentry vehicle based on artificial emotion memory optimization (Cited by 2)
Authors: FU Shengnan, WANG Liang, XIA Qunli. Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2021, Issue 3, pp. 668-680 (13 pages)
The trajectory optimization of an unpowered reentry vehicle via artificial emotion memory optimization (AEMO) is discussed. Firstly, reentry dynamics are established based on multiple constraints, and parameterized control variables with finite dimensions are designed. If a constraint is not satisfied, a distance measure and an adaptive penalty function are used to address this scenario. Secondly, AEMO is introduced to solve the trajectory optimization problem. Based on theories of biology and cognition, trial solutions based on emotional memory are established. Three search strategies are designed for realizing the random search of trial solutions and for avoiding becoming trapped in a local minimum. The states of the trial solutions are determined according to the rules of memory enhancement and forgetting. As the iterations proceed, trial solutions with poor quality are gradually forgotten; therefore, the number of trial solutions decreases and the convergence of the algorithm is accelerated. Finally, a numerical simulation is conducted, and the results demonstrate that the path and terminal constraints are satisfied and the method achieves satisfactory performance.
Keywords: trajectory optimization; adaptive penalty function; artificial emotion memory optimization (AEMO); multiple constraints
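The abstract mentions that constraint violations are handled through a distance measure and an adaptive penalty function. The exact AEMO penalty schedule is not given, so the following is only a minimal, generic C++ sketch of that pattern, with an assumed weight that grows with the iteration count.

```cpp
// Generic adaptive-penalty sketch (assumed schedule, not the paper's formula).
#include <algorithm>
#include <cstdio>
#include <vector>

// Distance of a trial solution from feasibility: sum of constraint violations.
// Convention: g_i <= 0 means constraint i is satisfied.
double violation(const std::vector<double>& g) {
    double d = 0;
    for (double gi : g) d += std::max(0.0, gi);
    return d;
}

// Penalized objective; the penalty weight grows with the iteration count so
// infeasible trial solutions are tolerated early and rejected later.
double penalized_cost(double cost, const std::vector<double>& g, int iteration) {
    double rho = 10.0 * (1.0 + iteration);   // adaptive (growing) penalty weight
    return cost + rho * violation(g);
}

int main() {
    std::vector<double> g = {0.2, -0.5};     // first constraint violated by 0.2
    std::printf("iter 1: %.2f  iter 50: %.2f\n",
                penalized_cost(3.0, g, 1), penalized_cost(3.0, g, 50));
    return 0;
}
```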
3. A Dynamic Memory Allocation Optimization Mechanism Based on Spark (Cited by 2)
Authors: Suzhen Wang, Shanshan Geng, Zhanfeng Zhang, Anshan Ye, Keming Chen, Zhaosheng Xu, Huimin Luo, Gangshan Wu, Lina Xu, Ning Cao. Computers, Materials & Continua (SCIE, EI), 2019, Issue 8, pp. 739-757 (19 pages)
Spark is a distributed data processing framework based on memory. Memory allocation is a focal question in Spark research: a good memory allocation scheme can effectively improve the efficiency of task execution and the memory resource utilization of Spark. Aiming at the memory allocation problem in Spark 2.x, this paper optimizes the memory allocation strategy by analyzing the Spark memory model, the existing cache replacement algorithms, and the existing memory allocation methods, on the basis of minimizing the storage area and allocating the execution area according to demand. The approach mainly includes two parts: cache replacement optimization and memory allocation optimization. Firstly, in the storage area, the cache replacement algorithm is optimized according to the characteristics of RDD partitions, combined with PCA dimensionality reduction. In this part, four features of an RDD partition are selected; when the RDD cache is replaced, only the two most important features are selected by the PCA dimensionality reduction method each time, thereby ensuring the generalization of the cache replacement strategy. Secondly, the memory allocation strategy of the execution area is optimized according to the memory requirements of tasks and the memory space of the storage area. A series of experiments in Spark on YARN mode are carried out to verify the effectiveness of the optimization algorithm and the improvement in cluster performance.
Keywords: memory calculation; memory allocation optimization; cache replacement optimization
4. Research on optimization of virtual machine memory access based on NUMA architecture (Cited by 2)
Authors: He Mujun, Zheng Linjiang, Yang Kai, Liu Runfeng, Liu Weining. High Technology Letters (EI, CAS), 2021, Issue 4, pp. 347-356 (10 pages)
With the rapid development of big data and artificial intelligence (AI), the cloud platform architecture system is constantly developing, optimizing, and improving. As such, new applications, like deep computing and high-performance computing, require enhanced computing power. To meet this requirement, a non-uniform memory access (NUMA) configuration method is proposed for the cloud computing system according to the affinity, adaptability, and availability of the NUMA architecture processor platform. The proposed method is verified based on the test environment of a domestic central processing unit (CPU).
Keywords: cloud computing; virtualization; non-uniform memory access (NUMA); virtual machine; memory access optimization
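As a host-level illustration of the locality principle this entry builds on (not the paper's VM configuration method), the sketch below binds execution and memory to a single NUMA node with libnuma. It assumes a Linux system with libnuma installed (link with -lnuma).

```cpp
// Hedged sketch: pin the current thread and a buffer to one NUMA node so that
// memory accesses stay node-local, the same locality goal as NUMA-aware VM placement.
#include <numa.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::printf("NUMA not available on this system\n");
        return 0;
    }
    int node = 0;                                  // assumed target node for this sketch
    numa_run_on_node(node);                        // restrict this thread to node 0's CPUs
    std::size_t bytes = 64 * 1024 * 1024;
    void* buf = numa_alloc_onnode(bytes, node);    // allocate from node 0's local memory
    if (buf) {
        std::memset(buf, 0, bytes);                // touch pages so they are placed now
        std::printf("allocated %zu bytes on node %d (of %d nodes)\n",
                    bytes, node, numa_max_node() + 1);
        numa_free(buf, bytes);
    }
    return 0;
}
```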
5. Optimization and Deployment of Memory-Intensive Operations in Deep Learning Model on Edge
Authors: Peng XU, Jianxin ZHAO, Chi Harold LIU. 计算机科学 (Computer Science) (CSCD, PKU Core), 2023, Issue 2, pp. 3-12 (10 pages)
As a large amount of data is increasingly generated from edge devices, such as smart homes, mobile phones, and wearable devices, it becomes crucial for many applications to deploy machine learning models across edge devices. The execution speed of the deployed model is a key element in ensuring service quality. Considering a highly heterogeneous edge deployment scenario, deep learning compiling is a novel approach that aims to solve this problem: it defines models using certain DSLs and generates efficient code implementations on different hardware devices. However, two aspects are not yet thoroughly investigated. The first is the optimization of memory-intensive operations, and the second is the heterogeneity of the deployment targets. To that end, in this work, we propose a system solution that optimizes memory-intensive operations, optimizes the subgraph distribution, and enables the compiling and deployment of DNN models on multiple targets. The evaluation results show the performance of the proposed system.
Keywords: Memory optimization; Deep compiler; Computation optimization; Model deployment; Edge computing
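A typical example of the memory-intensive operations referred to above is a chain of element-wise layers whose unfused form materializes intermediate tensors. The C++ sketch below contrasts an unfused and a fused bias-add + ReLU pipeline; it is a generic illustration of operator fusion, not the system proposed in the paper.

```cpp
// Generic operator-fusion illustration: the fused version avoids writing an
// intermediate tensor back to memory, which is what a DL compiler automates.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Unfused: two passes over memory, one intermediate tensor materialized.
std::vector<float> bias_then_relu_unfused(const std::vector<float>& x, float bias) {
    std::vector<float> tmp(x.size()), out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) tmp[i] = x[i] + bias;             // pass 1
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = std::max(0.0f, tmp[i]);  // pass 2
    return out;
}

// Fused: one pass, no intermediate tensor.
std::vector<float> bias_then_relu_fused(const std::vector<float>& x, float bias) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = std::max(0.0f, x[i] + bias);
    return out;
}

int main() {
    std::vector<float> x = {-2.0f, 0.5f, 3.0f};
    auto a = bias_then_relu_unfused(x, 1.0f);
    auto b = bias_then_relu_fused(x, 1.0f);
    std::printf("results match: %s\n", std::equal(a.begin(), a.end(), b.begin()) ? "yes" : "no");
    return 0;
}
```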
6. CNN Accelerator Using Proposed Diagonal Cyclic Array for Minimizing Memory Accesses
Authors: Hyun-Wook Son, Ali A. Al-Hamid, Yong-Seok Na, Dong-Yeong Lee, Hyung-Won Kim. Computers, Materials & Continua (SCIE, EI), 2023, Issue 8, pp. 1665-1687 (23 pages)
This paper presents the architecture of a Convolutional Neural Network (CNN) accelerator based on a new processing element (PE) array called a diagonal cyclic array (DCA). As demonstrated, it can significantly reduce the burden of repeated memory accesses for feature data and the weight parameters of CNN models, which maximizes the data reuse rate and improves computation speed. Furthermore, an integrated computation architecture has been implemented for the activation function, max-pooling, and the activation function after convolution calculation, reducing hardware resources. To evaluate the effectiveness of the proposed architecture, a CNN accelerator has been implemented for You Only Look Once version 2 (YOLOv2)-Tiny, consisting of 9 layers. A methodology to optimize the local buffer size with little sacrifice of inference speed is also presented in this work. We implemented the proposed CNN accelerator using a Xilinx Zynq ZCU102 UltraScale+ Field Programmable Gate Array (FPGA) and the ISE Design Suite. The FPGA implementation uses 34,336 Look-Up Tables (LUTs), 576 Digital Signal Processing (DSP) blocks, and an on-chip memory of only 58 KB. It achieves accuracies of 57.92% and 56.42% mean Average Precision at a 0.5 intersection-over-union threshold (mAP@0.5) using quantized 16-bit and 8-bit full-integer data, respectively, with only a 0.68% loss for the 8-bit version, and computation times of 137.9 ms and 69 ms per input image, respectively, at a clock speed of 200 MHz. These speeds are expected to be about five times higher at a clock speed of 1 GHz if the design is implemented in a silicon System on Chip (SoC) using a sub-micron process.
Keywords: CNN accelerator; systolic array; memory optimization; YOLOv2-tiny; mAP@0.5
7. SPATIALLY SCALABLE RESOLUTION IMAGE CODING METHOD WITH MEMORY OPTIMIZATION BASED ON WAVELET TRANSFORM
Authors: Wang Na, Zhang Li, Zhou Xiao'an, Jia Chuanying, Li Xia. Journal of Electronics (China), 2005, Issue 1, pp. 94-97 (4 pages)
This letter exploits fundamental characteristics of a wavelet-transformed image to form a progressive octave-based spatial resolution. Each wavelet subband is coded based on a zeroblock and quadtree partitioning ordering scheme with a memory optimization technique. The proposed method is of low complexity and efficient for Internet plug-in software.
Keywords: Memory optimization; Spatial resolution scalability; Wavelet transform; Quadtree partitioning
8. Research on Performance Optimization of Spark Distributed Computing Platform
Authors: Qinlu He, Fan Zhang, Genqing Bian, Weiqi Zhang, Zhen Li. Computers, Materials & Continua (SCIE, EI), 2024, Issue 5, pp. 2833-2850 (18 pages)
Spark, a distributed computing platform, has rapidly developed in the field of big data. Its in-memory computing feature reduces disk read overhead and shortens data processing time, giving it broad application prospects in large-scale computing applications such as machine learning and image processing. However, the performance of the Spark platform still needs to be improved. When a large number of tasks are processed simultaneously, Spark's cache replacement mechanism cannot identify high-value data partitions, so memory resources are not fully utilized and the performance of the Spark platform suffers. To address the problem that Spark's default cache replacement algorithm cannot accurately evaluate high-value data partitions, the weight influence factors of data partitions are first modeled and evaluated. Then, based on this weighted model, a cache replacement algorithm based on dynamic weighted data value is proposed, which takes into account hit rate and data differences; better integration and usage strategies are implemented on top of LRU (Least Recently Used). The weight update algorithm updates the weight value when the data partition information changes, accurately measuring the importance of the partition in the current job; the cache removal algorithm clears partitions without useful value from the cache to release memory resources; and the weight replacement algorithm combines partition weights and partition information to replace RDD partitions when the remaining memory space is insufficient. Finally, by setting up a Spark cluster environment, the proposed algorithm is experimentally verified. Experiments show that this algorithm can effectively improve the cache hit rate, enhance the performance of the platform, and reduce job execution time by 7.61% compared with existing improved algorithms.
Keywords: Spark; memory optimization; memory replacement strategy
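Spark itself runs on the JVM, so the following C++ sketch is only a language-agnostic illustration of the eviction idea described above: keep a dynamic weight per cached RDD partition and evict the lowest-weight partition when memory runs short. The specific weight formula (recency, hit count, and size terms) is an assumption, not the paper's model.

```cpp
// Illustrative weighted cache replacement: evict the lowest-weight partition.
#include <cstdint>
#include <cstdio>
#include <limits>
#include <string>
#include <unordered_map>

struct PartitionInfo {
    uint64_t last_access;   // logical clock of most recent use
    uint64_t hits;          // how often the partition was reused
    double   size_mb;       // storage cost
};

class WeightedCache {
    std::unordered_map<std::string, PartitionInfo> cache_;
    double capacity_mb_, used_mb_ = 0;
    uint64_t clock_ = 0;

    // Higher weight = more valuable to keep. Scoring terms are assumptions.
    double weight(const PartitionInfo& p) const {
        double recency = 1.0 / double(clock_ - p.last_access + 1);
        return 0.5 * recency + 0.4 * double(p.hits) - 0.1 * p.size_mb;
    }

public:
    explicit WeightedCache(double capacity_mb) : capacity_mb_(capacity_mb) {}

    void put(const std::string& id, double size_mb) {
        while (used_mb_ + size_mb > capacity_mb_ && !cache_.empty()) {
            auto victim = cache_.begin();
            double worst = std::numeric_limits<double>::max();
            for (auto it = cache_.begin(); it != cache_.end(); ++it)
                if (weight(it->second) < worst) { worst = weight(it->second); victim = it; }
            std::printf("evicting %s\n", victim->first.c_str());
            used_mb_ -= victim->second.size_mb;
            cache_.erase(victim);
        }
        cache_[id] = {++clock_, 0, size_mb};
        used_mb_ += size_mb;
    }

    bool get(const std::string& id) {
        auto it = cache_.find(id);
        if (it == cache_.end()) return false;
        it->second.last_access = ++clock_;
        ++it->second.hits;
        return true;
    }
};

int main() {
    WeightedCache cache(100.0);
    cache.put("rdd0_p0", 60); cache.get("rdd0_p0");   // reused partition gains weight
    cache.put("rdd1_p0", 30);
    cache.put("rdd2_p0", 40);                         // forces eviction of the lowest-weight one
    return 0;
}
```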
9. Covid-19 CT Lung Image Segmentation Using Adaptive Donkey and Smuggler Optimization Algorithm (Cited by 1)
Authors: P. Prabu, K. Venkatachalam, Ala Saleh Alluhaidan, Radwa Marzouk, Myriam Hadjouni, Sahar A. El_Rahman. Computers, Materials & Continua (SCIE, EI), 2022, Issue 4, pp. 1133-1152 (20 pages)
COVID-19 caused the entire world to be in an existential health crisis by spreading globally in the year 2020. The lung infection is detected in Computed Tomography (CT) images, which provide the best way to enhance existing healthcare schemes in preventing the deadly virus. Nevertheless, separating the infected areas in CT images faces various issues, such as the low intensity difference between normal and infectious tissue and high variation in the characteristics of the infection. To resolve these issues, a new Inf-Net (Lung Infection Segmentation Deep Network) is designed for detecting the affected areas from CT images automatically. For the worst segmentation results, the Edge-Attention Representation (EAR) is optimized using Adaptive Donkey and Smuggler Optimization (ADSO). The edges identified by the ADSO approach are utilized for calculating dissimilarities. An IFCM (Intuitionistic Fuzzy C-Means) clustering approach is applied for computing the similarity of the EA component between the generated edge maps and Ground-Truth (GT) edge maps. Also, a Semi-Supervised Segmentation (SSS) structure is designed using the Randomly Selected Propagation (RP) technique and Inf-Net, which needs only a small number of images and unlabelled data. Semi-Supervised Multi-Class Segmentation (SSMCS) is designed using a Bi-LSTM (Bi-Directional Long Short-Term Memory); it acquires all the advantages of the disease segmentation done using Semi-Inf-Net and enhances the execution of multi-class disease labelling. The newly designed SSMCS approach is compared with the existing U-Net++, MCS, and Semi-Inf-Net. Factors such as MAE (Mean Absolute Error), Structure Measure, Specificity (Spec), Dice Similarity Coefficient, Sensitivity (Sen), and Enhanced-Alignment Measure are considered for evaluation purposes.
Keywords: adaptive donkey and smuggler optimization; bi-directional long short-term memory; coronavirus disease 2019; randomly selected propagation; semi-supervised learning
10. Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
Authors: 邓林 (Deng Lin), 窦勇 (Dou Yong). Journal of Central South University (SCIE, EI, CAS), 2011, Issue 2, pp. 490-498 (9 pages)
Developing parallel applications on heterogeneous processors faces the challenge of the 'memory wall', due to the limited capacity of local storage, limited bandwidth, and long latency of memory access. Aiming at this problem, a parallelization approach is proposed with six memory optimization schemes for CG, four of which target various kinds of sparse matrix-vector multiplication (SPMV) operations. Conducted on an IBM QS20, the parallelization approach reaches up to 21 and 133 times speedups with sizes A and B, respectively, compared with a single Power processor element. Finally, it is concluded that the peak memory access bandwidth of the Cell BE can be reached in SPMV, that simple computation is more efficient on heterogeneous processors, and that loop unrolling can hide local storage access latency while executing scalar operations on SIMD cores.
Keywords: multi-core processor; NAS; parallelization; CG; memory optimization
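The abstract singles out SPMV and loop unrolling. As a generic illustration (not the paper's Cell BE code), the following C++ routine computes a CSR-format sparse matrix-vector product with a 4-way unrolled inner loop, the kind of kernel such memory-optimization schemes target.

```cpp
// Generic CSR SpMV with 4-way manual unrolling of the inner loop.
#include <cstddef>
#include <cstdio>
#include <vector>

// y = A * x, with A stored in CSR (row_ptr, col_idx, vals).
void spmv_csr_unroll4(const std::vector<std::size_t>& row_ptr,
                      const std::vector<std::size_t>& col_idx,
                      const std::vector<double>& vals,
                      const std::vector<double>& x,
                      std::vector<double>& y) {
    for (std::size_t r = 0; r + 1 < row_ptr.size(); ++r) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t k = row_ptr[r], end = row_ptr[r + 1];
        for (; k + 4 <= end; k += 4) {        // unrolled body: 4 independent accumulators
            s0 += vals[k]     * x[col_idx[k]];
            s1 += vals[k + 1] * x[col_idx[k + 1]];
            s2 += vals[k + 2] * x[col_idx[k + 2]];
            s3 += vals[k + 3] * x[col_idx[k + 3]];
        }
        for (; k < end; ++k) s0 += vals[k] * x[col_idx[k]];   // remainder loop
        y[r] = s0 + s1 + s2 + s3;
    }
}

int main() {
    // 2x2 matrix [[4, 1], [0, 3]] in CSR form, multiplied by x = [1, 2].
    std::vector<std::size_t> row_ptr = {0, 2, 3}, col_idx = {0, 1, 1};
    std::vector<double> vals = {4, 1, 3}, x = {1, 2}, y(2);
    spmv_csr_unroll4(row_ptr, col_idx, vals, x, y);
    std::printf("y = [%g, %g]\n", y[0], y[1]);   // expect [6, 6]
    return 0;
}
```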
11. Memory Access Optimization of Molecular Dynamics Simulation Software Crystal-MD on Sunway Taihu Light
Authors: Jianjiang Li, Jie Lin, Panpan Du, Kai Zhang, Jie Wu. Tsinghua Science and Technology (SCIE, EI, CAS, CSCD), 2021, Issue 3, pp. 296-308 (13 pages)
The radiation damage effect of key structural materials is one of the main research subjects of the numerical reactor. From the perspective of experimental safety and feasibility, Molecular Dynamics (MD) in the materials field is an ideal method for simulating the radiation damage of structural materials. Crystal-MD is a massively parallel MD simulation software package based on the key material characteristics of reactors. Compared with the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) and the ITAP Molecular Dynamics (IMD) software, Crystal-MD reduces the memory required for software operation to a certain extent, but it is very time-consuming. Moreover, the calculation results of Crystal-MD have large deviations, and there are also problems, such as memory limitations and frequent communication, during its migration and optimization. In this paper, in order to solve the above problems, the memory access mode of the Crystal-MD software is studied. Based on this memory access mode, a memory access optimization strategy is proposed for the unique architecture of China's supercomputer Sunway Taihu Light. The proposed optimization strategy is verified by experiments, and the results show that the running speed of Crystal-MD is increased significantly by using the proposed strategy.
Keywords: molecular dynamics simulation; Crystal-MD; Sunway Taihu Light; memory access optimization
12. Memory access optimization for particle operations in computational fluid dynamics-discrete element method simulations
Authors: Deepthi Vaidhynathan, Hariswaran Sitaraman, Ray Grout, Thomas Hauser, Christine M. Hrenya, Jordan Musser. Particuology (SCIE, EI, CAS, CSCD), 2023, Issue 7, pp. 97-110 (14 pages)
Computational Fluid Dynamics-Discrete Element Method (CFD-DEM) is used to model gas-solid systems in several applications in the energy, pharmaceutical, and petrochemical industries. Computational performance bottlenecks often limit the problem sizes that can be simulated at industrial scale. The data structures used to store several millions of particles in such large-scale simulations have a large memory footprint that does not fit into the processor cache hierarchies on current high-performance-computing platforms, leading to reduced computational performance. This paper specifically addresses this aspect of memory access bottlenecks in industrial-scale simulations. The use of space-filling curves to improve memory access patterns is described, and their impact on computational performance is quantified in both shared and distributed memory parallelization paradigms. The Morton space-filling curve applied to uniform grids and k-dimensional tree partitions is used to reorder the particle data structure, thus improving spatial and temporal locality in memory. The performance impact of these techniques when applied to two benchmark problems, namely the homogeneous cooling system and a fluidized bed, is presented. These optimization techniques lead to an approximately two-fold performance improvement in particle-focused operations such as neighbor-list creation and data exchange, with a ~1.5 times overall improvement in a fluidization simulation with 1.27 million particles.
Keywords: CFD-DEM; memory access optimization; spatial reordering; performance optimization
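The Morton (Z-order) reordering described above can be sketched directly: map each particle's grid cell to an interleaved-bit key and sort the particle array by that key, so spatially close particles become close in memory. The 10-bit-per-axis encoding and the Particle fields below are illustrative assumptions, not the paper's data structures.

```cpp
// Minimal Morton-order (Z-order) spatial reordering of a particle array.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Spread the low 10 bits of v so there are two zero bits between each bit.
static uint32_t expand_bits(uint32_t v) {
    v &= 0x3ff;
    v = (v | (v << 16)) & 0x030000FF;
    v = (v | (v <<  8)) & 0x0300F00F;
    v = (v | (v <<  4)) & 0x030C30C3;
    v = (v | (v <<  2)) & 0x09249249;
    return v;
}
// 30-bit 3D Morton code from integer cell coordinates in [0, 1024).
static uint32_t morton3d(uint32_t x, uint32_t y, uint32_t z) {
    return (expand_bits(x) << 2) | (expand_bits(y) << 1) | expand_bits(z);
}

struct Particle {
    double x, y, z;     // position
    double vx, vy, vz;  // velocity (dragged along by the reorder)
};

void reorder_by_morton(std::vector<Particle>& particles, double domain, int cells) {
    std::sort(particles.begin(), particles.end(),
              [&](const Particle& a, const Particle& b) {
                  auto cell = [&](double c) {
                      return (uint32_t)std::min<int>(cells - 1, (int)(c / domain * cells));
                  };
                  return morton3d(cell(a.x), cell(a.y), cell(a.z)) <
                         morton3d(cell(b.x), cell(b.y), cell(b.z));
              });
}

int main() {
    std::vector<Particle> p = {{0.90, 0.90, 0.10, 0, 0, 0},
                               {0.10, 0.10, 0.10, 0, 0, 0},
                               {0.11, 0.12, 0.09, 0, 0, 0}};
    reorder_by_morton(p, 1.0, 1024);   // nearby particles become neighbors in memory
    for (const auto& q : p) std::printf("%.2f %.2f %.2f\n", q.x, q.y, q.z);
    return 0;
}
```

After the reorder, neighbor-list construction and pairwise force loops walk particles that are already clustered by spatial cell, which is where the cache-locality gain reported in the abstract comes from.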
13. Convolutional neural network adaptation and optimization method in SIMT computing mode
Authors: Feng Zhenfu, Zhang Yaying, Yang Lele, Xing Lidong. The Journal of China Universities of Posts and Telecommunications (EI, CSCD), 2024, Issue 2, pp. 105-112 (8 pages)
For studying and optimizing the performance of general-purpose computing on graphics processing units (GPGPU) based on a single instruction multiple threads (SIMT) processor for neural network applications, this work contributes a self-developed SIMT processor named Pomelo and a correlated assembly program. The parallel mechanism of the SIMT computing mode and the self-developed Pomelo processor is briefly introduced. A common convolutional neural network (CNN) is built to verify the compatibility and functionality of the Pomelo processor. A CNN computing flow with task-level and hardware-level optimization is adopted on the Pomelo processor. A specific algorithm for organizing a Z-shaped memory structure is developed, which addresses reducing memory accesses in mass data computing tasks. Applying the above combined adaptation and optimization strategy, the experimental results demonstrate that reducing memory accesses in the SIMT computing mode plays a crucial role in improving performance. A 6.52 times performance improvement is achieved in the 4-processing-element case.
Keywords: parallel computing; single instruction multiple threads (SIMT); convolutional neural network (CNN); memory optimization
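The exact Z-shaped memory structure used on the Pomelo processor is not described in the abstract, so the sketch below only illustrates the general idea of re-tiling a row-major feature map so that the pixels a small convolution window touches sit in one contiguous block; the tile size and the tile-major ("Z"-like sweep) ordering are assumptions.

```cpp
// Illustrative tile-major relayout of a feature map to localize conv-window accesses.
#include <cstddef>
#include <cstdio>
#include <vector>

// Map (row, col) in an H x W feature map to an offset in a tile-major layout
// with T x T tiles laid out left-to-right, top-to-bottom.
std::size_t tiled_offset(std::size_t row, std::size_t col,
                         std::size_t W, std::size_t T) {
    std::size_t tiles_per_row = (W + T - 1) / T;
    std::size_t tile_id = (row / T) * tiles_per_row + (col / T);
    return tile_id * T * T + (row % T) * T + (col % T);
}

int main() {
    const std::size_t H = 8, W = 8, T = 4;
    std::vector<float> rowmajor(H * W), tiled(H * W);
    for (std::size_t i = 0; i < H * W; ++i) rowmajor[i] = float(i);

    // One-time relayout; afterwards a 3x3 window stays mostly inside one tile.
    for (std::size_t r = 0; r < H; ++r)
        for (std::size_t c = 0; c < W; ++c)
            tiled[tiled_offset(r, c, W, T)] = rowmajor[r * W + c];

    std::printf("element (5,6) lives at tiled offset %zu\n", tiled_offset(5, 6, W, T));
    return 0;
}
```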