Abstract: Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge, often relying on trial-and-error optimization methods. This paper introduces an optimization technique for CUDA programs through a novel Data Layout strategy, which restructures the arrangement of data in memory to enhance data access locality. The technique is applied to the dynamic programming algorithm for chained matrix multiplication, a critical operation across domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT), where it facilitates more localized access. We specifically illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique’s broader applicability to GPU-accelerated applications. Our findings show a reduction in memory consumption and a 50% decrease in execution time for CUDA programs that use this technique.
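The abstract does not describe the layout itself, so the following is only a minimal Python sketch of one well-known option, not the paper's method. It assumes a diagonal-major layout: the upper triangle of the dynamic programming table is stored in a flat array so that all cells on the same diagonal, which can be computed independently of each other, sit next to each other in memory. The names `diag_index` and `matrix_chain_cost` are hypothetical.

```python
def diag_index(n, i, j):
    """Flat index of table cell (i, j), i <= j, in diagonal-major order.

    Diagonal d = j - i holds n - d cells, so it starts at
    n + (n - 1) + ... + (n - d + 1) = d * n - d * (d - 1) / 2.
    """
    d = j - i
    return d * n - d * (d - 1) // 2 + i


def matrix_chain_cost(dims):
    """Minimum scalar multiplications to multiply a chain of matrices.

    Matrix i has shape dims[i] x dims[i + 1]. Only the upper triangle of
    the table is stored: n * (n + 1) / 2 cells instead of n * n.
    """
    n = len(dims) - 1  # number of matrices in the chain
    if n < 1:
        raise ValueError("dims must describe at least one matrix")
    cost = [0] * (n * (n + 1) // 2)  # diagonal 0 (single matrices) costs 0
    for d in range(1, n):            # process the table diagonal by diagonal
        for i in range(n - d):       # cells of one diagonal are contiguous
            j = i + d
            cost[diag_index(n, i, j)] = min(
                cost[diag_index(n, i, k)]
                + cost[diag_index(n, k + 1, j)]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[diag_index(n, 0, n - 1)]


print(matrix_chain_cost([10, 30, 5, 60]))
```

In a CUDA version of this sketch, the inner loop over `i` would map to threads, so neighboring threads would write neighboring addresses. Whether that matches the layout evaluated in the paper is an assumption.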
Funding: The National Key Research and Development Program of China (No. 2017YFC0212100) and the National High-tech R&D Program of China (No. 2015AA015308).
Abstract: With the rapid development of big data and artificial intelligence (AI), the cloud platform architecture system is constantly developing, optimizing, and improving. As a result, new applications such as deep computing and high-performance computing require enhanced computing power. To meet this requirement, a non-uniform memory access (NUMA) configuration method is proposed for the cloud computing system according to the affinity, adaptability, and availability of the NUMA architecture processor platform. The proposed method is verified in the test environment of a domestic central processing unit (CPU).
Funding: Supported by the National Natural Science Foundation of China (61271235) and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions-Information and Communication Engineering.
Abstract: A quality of service (QoS) guaranteed cross-layer resource allocation algorithm that considers the physical layer, the medium access control (MAC) layer, and call admission control (CAC) simultaneously is proposed for the full IP orthogonal frequency division multiple access (OFDMA) communication system. It can ensure the quality of multimedia services in full IP networks. The algorithm converts the physical layer resources, such as subcarriers and transmission power, and the QoS metrics into equivalent bandwidth, which the base station can distribute across all three layers. In this way, the QoS requirements in terms of bit error rate (BER), transmission delay, and dropping probability can be guaranteed by the cross-layer optimal equivalent bandwidth allocation. The numerical results show that the proposed algorithm has higher spectrum efficiency than the existing systems.
Funding: This work was supported by the National Natural Science Foundation of China (Grant Nos. 51978147, 52378046).
Abstract: This study establishes an evaluation and optimization framework for the public transit network based on social network analysis and a greedy algorithm, aiming to explore a quantitative approach to improving access to urban parks through public transit optimization. Social network analysis and the ArcGIS platform are used to build a public transit network model within Nanjing Old City and analyze its overall network structure characteristics. The study also focuses on a method to improve the convenience of reaching regional and city-level parks by public transit by increasing access and connecting points accordingly. A greedy algorithm is introduced to generate an optimized solution for improving public transit accessibility to regional and city-level parks, consequently enhancing their utilization. The major findings include: (1) The greedy algorithm effectively enhances the performance of the public transit network, but its benefits gradually diminish as more stations are added. (2) Strategically adding stations enhances the performance of most public transit access points, creating efficient pathways for other stations to directly reach these access points and enter regional and city-level parks. (3) The optimized public transit network model offers guidance for the planning and layout of regional and city-level parks. The site selection for new parks should prioritize establishing connections with the "hubs" in the public transit network. The proposed optimization of the public transit network in this study is specific to a single type of urban park, but subsequent research could extend the optimization of public transit accessibility to more urban public resources.
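The abstract gives neither the network model nor the objective function, so the following is only a hedged Python sketch of the general greedy idea. It assumes the network is an undirected graph of stops, that accessibility is scored as the total hop count from every stop to its nearest park access point, and that each improvement is one new link taken from a candidate list. The names `total_hops`, `with_link`, and `greedy_links`, and the toy network, are hypothetical.

```python
from collections import deque


def total_hops(adj, access_points):
    """Sum over all stops of the hop count to the nearest access point."""
    dist = {s: 0 for s in access_points}
    queue = deque(access_points)
    while queue:  # multi-source breadth-first search
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Stops that cannot reach any access point get a fixed penalty.
    return sum(dist.get(s, len(adj)) for s in adj)


def with_link(adj, u, v):
    """Return a copy of the network with an extra undirected link u-v."""
    new = {s: set(ns) for s, ns in adj.items()}
    new[u].add(v)
    new[v].add(u)
    return new


def greedy_links(adj, access_points, candidates, budget):
    """Repeatedly add the candidate link that lowers the score the most."""
    adj = {s: set(ns) for s, ns in adj.items()}
    remaining = list(candidates)
    score = total_hops(adj, access_points)
    chosen = []
    for _ in range(budget):
        if not remaining:
            break
        best_score, best = min(
            (total_hops(with_link(adj, u, v), access_points), (u, v))
            for u, v in remaining
        )
        if best_score >= score:  # no candidate helps any more
            break
        adj = with_link(adj, *best)
        remaining.remove(best)
        chosen.append((best, score - best_score))  # (link, gain)
        score = best_score
    return chosen


# Toy network: access point P at the end of the line A-B-C-D-E-F.
line = {"P": {"A"}, "A": {"P", "B"}, "B": {"A", "C"}, "C": {"B", "D"},
        "D": {"C", "E"}, "E": {"D", "F"}, "F": {"E"}}
picks = greedy_links(line, ["P"], [("P", "C"), ("P", "D"), ("P", "F")], 3)
print(picks)
```

On this toy network the gain of each successive link shrinks, which mirrors finding (1), although the study's real network and metrics will differ.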
Abstract: The Computational Fluid Dynamics-Discrete Element Method is used to model gas-solid systems in several applications in the energy, pharmaceutical, and petrochemical industries. Computational performance bottlenecks often limit the problem sizes that can be simulated at industrial scale. The data structures used to store several million particles in such large-scale simulations have a large memory footprint that does not fit into the processor cache hierarchies on current high-performance-computing platforms, leading to reduced computational performance. This paper specifically addresses these memory access bottlenecks in industrial-scale simulations. The use of space-filling curves to improve memory access patterns is described, and their impact on computational performance is quantified in both shared and distributed memory parallelization paradigms. The Morton space-filling curve applied to uniform grids and k-dimensional tree partitions are used to reorder the particle data structure, thus improving spatial and temporal locality in memory. The performance impact of these techniques when applied to two benchmark problems, namely the homogeneous cooling system and a fluidized bed, is presented. These optimization techniques lead to an approximately two-fold performance improvement in particle-focused operations such as neighbor-list creation and data exchange, with a ~1.5 times overall improvement in a fluidization simulation with 1.27 million particles.
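As a rough illustration of Morton reordering on a uniform grid, the Python sketch below bins each particle into a grid cell, interleaves the bits of the three cell indices into a single key, and sorts particles by that key so that particles close in space tend to be close in memory. It assumes non-negative coordinates and at most 10 bits per axis. The names `morton3d` and `morton_order` and the cell size are hypothetical, not taken from the paper.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three cell indices into one key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)      # x bit goes to position 3b
        key |= ((iy >> b) & 1) << (3 * b + 1)  # y bit goes to position 3b + 1
        key |= ((iz >> b) & 1) << (3 * b + 2)  # z bit goes to position 3b + 2
    return key


def morton_order(positions, cell_size, bits=10):
    """Return particle indices sorted by the Morton key of their grid cell.

    Assumes every coordinate is non-negative and falls inside a grid of
    2**bits cells per axis.
    """
    def key(n):
        ix, iy, iz = (int(c // cell_size) for c in positions[n])
        return morton3d(ix, iy, iz, bits)

    return sorted(range(len(positions)), key=key)


positions = [(3.5, 3.5, 3.5), (0.1, 0.2, 0.3), (1.5, 0.5, 0.5), (0.5, 1.5, 0.5)]
order = morton_order(positions, cell_size=1.0)
reordered = [positions[n] for n in order]  # particle array in Morton order
print(order)
```

Applying the same permutation to every per-particle array (velocities, radii, and so on) keeps the data structure consistent after the reorder.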
Funding: Supported by the National Key R&D Program of China (No. 2017YFB0202003).
Abstract: The radiation damage effect of key structural materials is one of the main research subjects of the numerical reactor. From the perspective of experimental safety and feasibility, Molecular Dynamics (MD) in the materials field is an ideal method for simulating the radiation damage of structural materials. Crystal-MD is a massively parallel MD simulation software package based on the key material characteristics of reactors. Compared with the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) and ITAP Molecular Dynamics (IMD) software, Crystal-MD reduces the memory required for software operation to a certain extent, but it is very time-consuming. Moreover, the calculation results of Crystal-MD have large deviations, and there are also problems such as memory limitation and frequent communication during its migration and optimization. To solve these problems, this paper studies the memory access mode of the Crystal-MD software. Based on the memory access mode, a memory access optimization strategy is proposed for the unique architecture of China’s Sunway TaihuLight supercomputer. The proposed optimization strategy is verified experimentally, and the results show that it significantly increases the running speed of Crystal-MD.