Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge, often relying on trial-and-error optimization methods. This paper introduces an optimization technique for CUDA programs through a novel Data Layout strategy, aimed at restructuring memory data arrangement to significantly enhance data access locality. Focusing on the dynamic programming algorithm for chained matrix multiplication, a critical operation across various domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT), this technique facilitates more localized access. We specifically illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique’s broader applicability and its potential to address some of the most pressing computational challenges in GPU-accelerated applications. Our findings reveal a remarkable reduction in memory consumption and a substantial 50% decrease in execution time for CUDA programs utilizing this technique, thereby setting a new benchmark for optimization in GPU computing.
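To make the idea of a locality-oriented data layout for the chained matrix multiplication dynamic program concrete, the following is a minimal CPU-side sketch in Python. It is not the layout proposed in the paper, whose details the abstract does not give. It assumes one plausible scheme: storing only the upper triangle of the cost table, diagonal by diagonal, so that the cells of one diagonal (which are independent of each other and could be computed in parallel) sit next to each other in memory. The names `diag_index` and `matrix_chain_cost` are hypothetical.

```python
def diag_index(i, j, n):
    """Map cell (i, j), with i <= j, of an n x n upper-triangular table
    to its position in a flat, diagonal-major array (assumed layout)."""
    d = j - i  # diagonal number: 0 is the main diagonal
    # Diagonals 0..d-1 hold n, n-1, ..., n-d+1 cells respectively.
    return d * n - d * (d - 1) // 2 + i


def matrix_chain_cost(dims):
    """Minimum number of scalar multiplications needed to multiply a chain
    of matrices, where matrix t has shape dims[t] x dims[t + 1]."""
    n = len(dims) - 1  # number of matrices in the chain
    if n <= 0:
        return 0
    # Only n(n+1)/2 cells are stored instead of a full n x n table.
    table = [0] * (n * (n + 1) // 2)
    for d in range(1, n):            # process one diagonal at a time
        for i in range(n - d):       # cells on a diagonal are independent
            j = i + d
            table[diag_index(i, j, n)] = min(
                table[diag_index(i, k, n)]
                + table[diag_index(k + 1, j, n)]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return table[diag_index(0, n - 1, n)]


if __name__ == "__main__":
    # Three matrices: 10x30, 30x5, 5x60
    print(matrix_chain_cost([10, 30, 5, 60]))
```

In this sketch the flat array holds n(n+1)/2 entries instead of n², and the inner loop over `i` walks consecutive addresses when writing a diagonal. Whether this corresponds to the memory and time savings reported in the abstract is not something the sketch can confirm.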
Studies to date have failed to consider gage disc cutters’ variable cutting depth and the constraints of cutter-head welds, and have ignored the coupling mechanism between the profile of the full-face rock tunnel-boring machine (TBM) cutter-head and the assembled radius layout of the disc cutters. To solve these problems, an adaptive design method for studying cutter layout was proposed. Taking the bearing stress of the outermost gage disc cutter as an index, the profile of the cutter-head was determined. Using a genetic algorithm and based on the principles of equal life and equal wear, the assembled radii of the cutters were optimally designed. Boundary conditions of non-interference between the cutters, manholes, muck buckets and welding lines were given when a star layout pattern was used on cutters. The cutter-head comprehensive evaluation model was established by adopting relative optimization improvement degree of evaluation indices to achieve dimensional consistency. Taking the MB264-311-8030 mm type TBM cutter-head as an example, the calculations show that, compared with the original layout scheme, among the 51 disc cutters, the largest gap of the cutters’ assembled radii is only 25.8 mm, which is 0.64% of the cutter-head’s radius and is negligible. The cutter-head’s unbalanced radial force decreases by 62.41%, the overturning moment decreases by 33.22%, and the cutter group’s centroid shift increases by only 18.48%. Each index is better than or approximately equal to the original cutter-head layout scheme, and the equivalent stress and deformation are both smaller; these results fully verify the feasibility and effectiveness of the method.
Funding: Projects (51275339, 51575379, 51675374) supported by the National Natural Science Foundation of China; Project (2013CB035402) supported by the National Hi-tech Research and Development Program of China