A novel slow-down set waveform is proposed to improve set performance, and a 1 kb phase change random access memory chip fabricated in a 13 nm CMOS technology is implemented to investigate the set performance of different set programming strategies based on this new set pulse. The amplitude difference (I1 - I2) of the set pulse is shown to be a crucial parameter for set programming. We observe and analyze the cell characteristics at different values of I1 - I2 by means of thermal simulations and high-resolution transmission electron microscopy, which reveal that incomplete set programming occurs when the proposed slow-down pulse is applied with an improperly high I1 - I2, leaving an amorphous residue in the active region. We also discuss a programming method that avoids this degradation of set performance.
Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge that often relies on trial-and-error optimization. This paper introduces an optimization technique for CUDA programs based on a novel Data Layout strategy, which restructures the arrangement of data in memory to improve data access locality. The technique is applied to the dynamic programming algorithm for chained matrix multiplication, a critical operation across domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT), where it makes memory accesses more localized. We illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique's broader applicability and its potential to address pressing computational challenges in GPU-accelerated applications. Our findings show a reduction in memory consumption and a 50% decrease in execution time for CUDA programs that use this technique.
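To make the dynamic programming problem concrete, the following is a minimal Python sketch of the classic matrix-chain cost recurrence. It is not the paper's CUDA implementation, and the abstract does not describe the paper's actual Data Layout. As an assumption made purely for illustration, the table is stored as a flat array in diagonal-major order (only the upper triangle, one diagonal after another), so that all cells filled in the same pass sit next to each other in memory. The names `diag_index` and `matrix_chain_cost` are hypothetical.

```python
def diag_index(n, i, j):
    """Flat offset of cell (i, j), with i <= j, in a diagonal-major upper triangle.

    Assumed layout (illustration only): diagonal d = j - i is stored after
    diagonals 0..d-1, which hold n, n-1, ..., n-d+1 cells respectively.
    """
    d = j - i
    return d * n - d * (d - 1) // 2 + i


def matrix_chain_cost(dims):
    """Minimum scalar multiplications to multiply a chain of matrices.

    Matrix k has shape dims[k] x dims[k + 1], so the chain has len(dims) - 1 matrices.
    """
    n = len(dims) - 1
    if n <= 0:
        return 0
    # Only the upper triangle is stored: n(n+1)/2 cells instead of n*n.
    cost = [0] * (n * (n + 1) // 2)
    for d in range(1, n):              # d = chain length - 1, one diagonal per pass
        for i in range(n - d):
            j = i + d
            cost[diag_index(n, i, j)] = min(
                cost[diag_index(n, i, k)]
                + cost[diag_index(n, k + 1, j)]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)   # k is the split point of the chain
            )
    return cost[diag_index(n, 0, n - 1)]
```

For example, `matrix_chain_cost([10, 30, 5, 60])` evaluates a chain of three matrices with shapes 10x30, 30x5, and 5x60. Cells on the same diagonal do not depend on each other, which is why a diagonal is a natural unit of parallel work on a GPU; whether the paper exploits this in the same way is not stated in the abstract.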
Based on the semidefinite programming relaxation of the CDMA maximum likelihood multiuser detection problem, a detection strategy using the successive quadratic programming algorithm is presented. Coupled with the randomized cut generation scheme, a suboptimal solution of the multiuser detection problem is obtained. Simulations demonstrate that the successive quadratic programming algorithm often yields BER performance similar to that of the previously reported interior point methods based on semidefinite programming, while the average CPU time of this approach is significantly reduced.
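For orientation, the sketch below shows the combinatorial problem that the semidefinite relaxation approximates. It is the exhaustive maximum likelihood baseline, not the paper's successive quadratic programming method. It assumes a synchronous CDMA model with unit user amplitudes, in which the detector minimizes b^T R b - 2 y^T b over bit vectors b in {-1, +1}^K, where R is the signature cross-correlation matrix and y is the vector of matched-filter outputs. The function names are hypothetical.

```python
from itertools import product


def ml_objective(R, y, b):
    """Value of b^T R b - 2 y^T b for a candidate bit vector b."""
    K = len(b)
    quad = sum(b[i] * R[i][j] * b[j] for i in range(K) for j in range(K))
    lin = sum(y[i] * b[i] for i in range(K))
    return quad - 2 * lin


def ml_detect_exhaustive(R, y):
    """Try all 2^K bit vectors and return the one with the smallest objective.

    The cost grows exponentially with the number of users K, which is the
    motivation for relaxations such as semidefinite programming.
    """
    K = len(y)
    return min(product((-1, 1), repeat=K), key=lambda b: ml_objective(R, y, b))


# Two users with cross-correlation 0.2; y is the noiseless output for b = (1, -1).
R = [[1.0, 0.2], [0.2, 1.0]]
y = [0.8, -0.8]
detected = ml_detect_exhaustive(R, y)
```

Because this search visits every one of the 2^K candidates, it is only practical for a handful of users. Relaxation-based detectors trade exactness for polynomial running time and then round the relaxed solution back to a bit vector.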
Funding: Supported by the Strategic Priority Research Program of Chinese Academy of Sciences under Grant No XDA09020402; the National Key Basic Research Program of China under Grant Nos 2013CBA01900, 2010CB934300, 2011CBA00607, and 2011CB932804; the National Integrate Circuit Research Program of China under Grant No 2009ZX02023-003; the National Natural Science Foundation of China under Grant Nos 61176122, 61106001, 61261160500, and 61376006; and the Science and Technology Council of Shanghai under Grant Nos 12nm0503701, 13DZ2295700, 12QA1403900, and 13ZR1447200.