Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge, often relying on trial-and-error optimization methods. This paper introduces an optimization technique for CUDA programs through a novel Data Layout strategy, aimed at restructuring memory data arrangement to significantly enhance data access locality. The technique focuses on the dynamic programming algorithm for chained matrix multiplication, a critical operation across various domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT), and facilitates more localized access. We specifically illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique’s broader applicability and its potential to address some of the most pressing computational challenges in GPU-accelerated applications. Our findings reveal a remarkable reduction in memory consumption and a substantial 50% decrease in execution time for CUDA programs utilizing this technique, thereby setting a new benchmark for optimization in GPU computing.
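As a rough illustration of how a data layout can localize accesses in the chained matrix multiplication dynamic program, the sketch below stores only the upper triangle of the cost table, one diagonal after another. This diagonal-major layout is an assumption chosen for illustration, not necessarily the layout used in the paper, and the names `diag_index` and `matrix_chain_cost` are hypothetical. The code is plain Python rather than CUDA so that it can be run anywhere.

```python
def diag_index(n, i, j):
    """Flat offset of DP cell (i, j), with i <= j, in diagonal-major order.

    Diagonal d = j - i holds n - d cells, stored contiguously (assumed layout).
    """
    d = j - i
    return d * n - d * (d - 1) // 2 + i


def matrix_chain_cost(dims):
    """Minimum number of scalar multiplications for A_0 * ... * A_{n-1},
    where A_k has shape dims[k] x dims[k + 1]."""
    n = len(dims) - 1
    if n <= 0:
        return 0
    # Upper triangle only: n(n+1)/2 cells instead of n*n.
    cost = [0] * (n * (n + 1) // 2)
    for d in range(1, n):            # process one diagonal at a time
        for i in range(n - d):       # cells on a diagonal are independent
            j = i + d
            cost[diag_index(n, i, j)] = min(
                cost[diag_index(n, i, k)]
                + cost[diag_index(n, k + 1, j)]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[diag_index(n, 0, n - 1)]
```

In this layout, the cells that could be computed in parallel (one diagonal) sit next to each other in memory, which is the kind of locality a GPU thread block would benefit from; how much it helps on real hardware is not something this sketch measures.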
Due to the recent trend of software intelligence in the Fourth Industrial Revolution, deep learning has become a mainstream workload for modern computer systems. As the data size of deep learning workloads keeps growing, managing limited memory capacity efficiently becomes important. In this paper, we analyze memory accesses in deep learning workloads and identify several characteristics that distinguish them from traditional workloads. First, when comparing instruction and data accesses, data access accounts for 96%–99% of total memory accesses in deep learning workloads, which is quite different from traditional workloads. Second, when comparing read and write accesses, write access dominates, accounting for 64%–80% of total memory accesses. Third, although write access makes up the majority of memory accesses, it shows a low access bias of 0.3 in the Zipf parameter. Fourth, in predicting re-access, recency is important in read access, but frequency provides more accurate information in write access. Based on these observations, we introduce a Non-Volatile Random Access Memory (NVRAM)-accelerated memory architecture for deep learning workloads, and present a new memory management policy for this architecture. By considering the memory access characteristics of deep learning workloads, the proposed policy improves memory performance by 64.3% on average compared to the CLOCK policy.
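For readers unfamiliar with the baseline named above, the following is a minimal sketch of the classic CLOCK (second-chance) replacement policy. It shows only the baseline, not the policy proposed in the paper; the class name `Clock` and its fields are hypothetical.

```python
class Clock:
    """CLOCK (second-chance) page replacement over a fixed number of frames."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = []    # page held by each frame
        self.ref = []      # reference bit of each frame
        self.where = {}    # page -> frame index
        self.hand = 0      # next frame the clock hand inspects
        self.faults = 0

    def access(self, page):
        """Touch a page; return True on a hit, False on a fault."""
        if page in self.where:                 # hit: set the reference bit
            self.ref[self.where[page]] = 1
            return True
        self.faults += 1
        if len(self.pages) < self.capacity:    # a free frame is available
            self.where[page] = len(self.pages)
            self.pages.append(page)
            self.ref.append(1)
            return False
        while self.ref[self.hand]:             # referenced pages get a second chance
            self.ref[self.hand] = 0
            self.hand = (self.hand + 1) % self.capacity
        del self.where[self.pages[self.hand]]  # evict the unreferenced victim
        self.pages[self.hand] = page
        self.ref[self.hand] = 1
        self.where[page] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False
```

CLOCK tracks only one recency bit per frame, so it cannot use the access frequency that the abstract reports as the better predictor for writes.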
With the rapid development of big data and artificial intelligence (AI), the cloud platform architecture system is constantly developing, optimizing, and improving. As such, new applications, like deep computing and high-performance computing, require enhanced computing power. To meet this requirement, a non-uniform memory access (NUMA) configuration method is proposed for the cloud computing system according to the affinity, adaptability, and availability of the NUMA architecture processor platform. The proposed method is verified based on the test environment of a domestic central processing unit (CPU).
Most transactional memory (TM) research has focused on multi-core processors, and some on clusters, leaving non-uniform memory access (NUMA) systems unexplored. Existing TM implementations suffer significant performance degradation on NUMA systems because they ignore the slower remote memory accesses. To solve this problem, a latency-based conflict detection method and a forecasting-based conflict prevention method are proposed. Using these techniques, a NUMA-aware TM system is presented. Experimental results show that, by reducing remote memory accesses and the transaction abort rate, the NUMA-aware strategies deliver good practical TM performance on NUMA systems.
Memory access fast switching structures in a cluster are studied, and three kinds of fast switching structures (FS, LR2 SS, and LAPS) are proposed. A mixed simulation test bench is constructed and used to gather statistics on the data access delay of these three structures in various cases. Finally, the structures are implemented on a Xilinx FPGA development board, and the DCT, FFT, SAD, IME, FME, and de-blocking filtering algorithms are mapped onto them. Compared with available architectures, the proposed structures have lower data access delay and smaller area.
Reducing the process variation is a significant concern for resistive random access memory (RRAM). Due to its ultrahigh integration density, RRAM arrays are prone to lithographic variation during the lithography process, introducing electrical variation among different RRAM devices. In this work, an optical physical verification methodology for the RRAM array is developed, and the effects of different layout parameters on important electrical characteristics are systematically investigated. The results indicate that the RRAM devices can be categorized into three clusters according to their locations and lithography environments. The read resistance is more sensitive to the locations in the array (~30%) than the SET/RESET voltage (<10%). The increase in the RRAM device length and the application of the optical proximity correction technique can help to reduce the variation to less than 10%, whereas it reduces RRAM read resistance by 4×, resulting in higher power and area consumption. As such, we provide design guidelines to minimize the electrical variation of RRAM arrays due to the lithography process.
Embedded memory, which heavily relies on the manufacturing process, has been widely adopted in various industrial applications. As the field of embedded memory continues to evolve, innovative strategies are emerging to enhance performance. Among them, resistive random access memory (RRAM) has gained significant attention due to its numerous advantages over traditional memory devices, including high speed (<1 ns), high density (4F²·n⁻¹), high scalability (~nm), and low power consumption (~pJ). This review focuses on the recent progress of embedded RRAM in industrial manufacturing and its potential applications. It provides a brief introduction to the concepts and advantages of RRAM, discusses the key factors that impact its industrial manufacturing, and presents the commercial progress driven by cutting-edge nanotechnology, which has been pursued by many semiconductor giants. Additionally, it highlights the adoption of embedded RRAM in emerging applications within the realm of the Internet of Things and future intelligent computing, with a particular emphasis on its role in neuromorphic computing. Finally, the review discusses the current challenges and provides insights into the prospects of embedded RRAM in the era of big data and artificial intelligence.
Graph convolutional neural networks (GCNs) have emerged as an effective approach to extending deep learning for graph data analytics, but they are computationally challenging given the irregular graphs and the large number of nodes in a graph. GCNs involve chain sparse-dense matrix multiplications with six loops, which results in a large design space for GCN accelerators. Prior work on GCN acceleration either employs limited loop optimization techniques, or determines the design variables based on random sampling, which can hardly exploit data reuse efficiently, thus degrading system efficiency. To overcome this limitation, this paper proposes GShuttle, a GCN acceleration scheme that maximizes memory access efficiency to achieve high performance and energy efficiency. GShuttle systematically explores loop optimization techniques for GCN acceleration, and quantitatively analyzes the design objectives (e.g., required DRAM accesses and SRAM accesses) by analytical calculation based on multiple design variables. GShuttle further employs two approaches, pruned search space sweeping and greedy search, to find the optimal design variables under certain design constraints. We demonstrated the efficacy of GShuttle by evaluation on five widely used graph datasets. The experimental simulations show that GShuttle reduces the number of DRAM accesses by a factor of 1.5 and saves energy by a factor of 1.7 compared with the state-of-the-art approaches.
Computational Fluid Dynamics-Discrete Element Method is used to model gas-solid systems in several applications in the energy, pharmaceutical, and petrochemical industries. Computational performance bottlenecks often limit the problem sizes that can be simulated at industrial scale. The data structures used to store several millions of particles in such large-scale simulations have a large memory footprint that does not fit into the processor cache hierarchies on current high-performance-computing platforms, leading to reduced computational performance. This paper specifically addresses this aspect of memory access bottlenecks in industrial-scale simulations. The use of space-filling curves to improve memory access patterns is described, and their impact on computational performance is quantified in both shared and distributed memory parallelization paradigms. The Morton space-filling curve applied to uniform grids and k-dimensional tree partitions are used to reorder the particle data structure, thus improving spatial and temporal locality in memory. The performance impact of these techniques when applied to two benchmark problems, namely the homogeneous cooling system and a fluidized bed, is presented. These optimization techniques lead to an approximately two-fold performance improvement in particle-focused operations such as neighbor-list creation and data exchange, with ~1.5 times overall improvement in a fluidization simulation with 1.27 million particles.
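The Morton reordering mentioned above can be sketched as follows: each particle is binned into a uniform grid cell, the cell's integer coordinates are bit-interleaved into a single Morton code, and particles are sorted by that code so that spatial neighbors tend to be stored near each other. The function names, the `cell_size` parameter, and the assumption of non-negative coordinates are illustrative choices, not details from the paper.

```python
def morton3(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three non-negative cell indices."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def morton_order(positions, cell_size, bits=10):
    """Return particle indices sorted by the Morton code of their grid cell.

    Assumes all coordinates are non-negative (hypothetical simplification).
    """
    def key(i):
        ix, iy, iz = (int(c // cell_size) for c in positions[i])
        return morton3(ix, iy, iz, bits)

    return sorted(range(len(positions)), key=key)
```

Applying the returned permutation to every per-particle array (positions, velocities, and so on) gives the reordered data structure; the k-dimensional tree variant described in the abstract is not covered by this sketch.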
As one of the most notorious programming errors, memory access errors still hurt modern software security. In particular, they are hidden deep in important software systems written in memory-unsafe languages like C/C++. Much work has been proposed to detect bugs leading to memory access errors. However, existing approaches cannot handle two challenges. First, they are not able to tackle fine-grained memory access errors, e.g., data overflow inside one data structure. These errors are usually overlooked for a long time since they happen inside one memory block and do not lead to a program crash. Second, most existing approaches rely on source code or debugging information to recover memory boundary information, so they cannot be directly applied to the detection of memory access errors in binary code. However, searching for memory access errors in binary code is a very common scenario in software vulnerability detection and exploitation. To overcome these challenges, we propose Memory Access Integrity (MAI), a dynamic method to detect fine-grained memory access errors in off-the-shelf binary executables. The core idea is to recover a fine-grained accessing policy between memory access behaviors and memory ranges, and then detect memory access errors based on the policy. The key insight in our work is that memory accessing patterns reveal information for recovering the boundary of memory objects and the accessing policy. Based on this recovered information, our method maintains a new memory model to simulate the life cycle of memory objects and reports errors when any accessing policy is violated. We evaluate our tool on popular CTF datasets and real-world software. Compared with the state-of-the-art detection tool, the evaluation result demonstrates that our tool can detect fine-grained memory access errors effectively and efficiently. As a practical impact, our tool has detected three 0-day memory access errors in an audio decoder.
The radiation damage effect of key structural materials is one of the main research subjects of the numerical reactor. From the perspective of experimental safety and feasibility, Molecular Dynamics (MD) in the materials field is an ideal method for simulating the radiation damage of structural materials. Crystal-MD is a massively parallel MD simulation software based on the key material characteristics of reactors. Compared with the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) and ITAP Molecular Dynamics (IMD) software, Crystal-MD reduces the memory required for software operation to a certain extent, but it is very time-consuming. Moreover, the calculation results of Crystal-MD have large deviations, and there are also some problems, such as memory limitation and frequent communication, during its migration and optimization. In this paper, to solve the above problems, the memory access mode of the Crystal-MD software is studied. Based on the memory access mode, a memory access optimization strategy is proposed for the unique architecture of China’s Sunway TaihuLight supercomputer. The proposed optimization strategy is verified by experiments, and the experimental results show that the running speed of Crystal-MD is increased significantly by using the proposed optimization strategy.
General purpose graphics processing units (GPGPUs) can be used to improve computing performance considerably for regular applications. However, irregular memory access exists in many applications, and the benefits of graphics processing units (GPUs) are less substantial for irregular applications. In recent years, several studies have presented solutions to remove static irregular memory access. However, eliminating dynamic irregular memory access with software remains a serious challenge. A pure software solution without hardware extensions or offline profiling is proposed to eliminate dynamic irregular memory access, especially indirect memory access. Data reordering and index redirection are suggested to reduce the number of memory transactions, thereby improving the performance of GPU kernels. To improve the efficiency of data reordering, the reordering operation is offloaded to the GPU, which reduces its overhead and the associated data transfer. By concurrently executing the compute unified device architecture (CUDA) streams of data reordering and the data processing kernel, the overhead of data reordering can be reduced further. After these optimizations, the volume of memory transactions can be reduced by 16.7%-50% compared with CUSPARSE-based benchmarks, and the performance of irregular kernels can be improved by 9.64%-34.9% using an NVIDIA Tesla P4 GPU.
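The idea of data reordering with index redirection can be illustrated on a simple gather `data[idx[i]]`. The sketch below moves elements into first-use order and rewrites the index array to match, then counts how many distinct memory segments each group of 32 consecutive accesses touches as a crude stand-in for GPU memory transactions. It runs on the CPU in plain Python; the first-use ordering, the segment size, and the function names are assumptions for illustration, not the paper's actual scheme.

```python
def reorder(data, idx):
    """Place elements in first-use order and redirect the indices to match."""
    new_pos = {}                          # old index -> new index
    for old in idx:
        if old not in new_pos:
            new_pos[old] = len(new_pos)
    for old in range(len(data)):          # keep elements that are never accessed
        new_pos.setdefault(old, len(new_pos))
    new_data = [None] * len(data)
    for old, new in new_pos.items():
        new_data[new] = data[old]
    new_idx = [new_pos[old] for old in idx]
    return new_data, new_idx


def transactions(idx, warp=32, segment=32):
    """Count distinct segments touched by each group of `warp` accesses.

    `segment` is measured in elements (a simplified transaction model).
    """
    return sum(
        len({i // segment for i in idx[w:w + warp]})
        for w in range(0, len(idx), warp)
    )
```

After reordering, `new_data[new_idx[i]]` returns the same value as `data[idx[i]]`, but consecutive accesses fall into far fewer segments under this simplified model.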
Using three-dimensional (3D) computer-aided design simulation technology, the recovery mechanism of single event upset and the effects of spacing and hit angle on the recovery are studied. It is found that multi-node charge collection plays a key role in recovery: when charge sharing is shielded by adding guard rings, the recovery effect does not appear. It is also shown that the upset linear energy transfer (LET) threshold remains constant while the recovery LET threshold increases as the spacing increases. Additionally, the effect of the incident angle on recovery is analysed, and it is shown that a larger angle brings about a stronger charge sharing effect, thus strengthening the recovery ability.
The era of information explosion is coming, and information needs to be continuously stored and randomly accessed over long periods, which constitutes an insurmountable challenge for existing data centers. At present, computing devices use the von Neumann architecture with separate computing and memory units, which exposes the “memory bottleneck”. Nonvolatile memristors can perform data storage and in-memory computing at the same time and promise to overcome this bottleneck. Phase-change random access memory (PCRAM) is regarded as one of the best solutions for next-generation non-volatile memory. Owing to its high speed, good data retention, high density, and low power consumption, PCRAM has broad commercial prospects in in-memory computing applications. In this review, the research progress of phase-change materials and device structures for PCRAM, as well as the most critical performance metrics for a universal memory, such as speed, capacity, and power consumption, are reviewed. By comparing the advantages and disadvantages of phase-change optical disks and PCRAM, a new concept of optoelectronic hybrid storage based on phase-change material is proposed. Its feasibility as a universal memory replacing existing memory technologies is also discussed.
An optimized device structure for reducing the RESET current of phase-change random access memory (PCRAM) with a blade-type like (BTL) phase change layer is proposed. The electro-thermal behavior of the BTL cell and of the blade heater contactor structure during the RESET operation is compared using three-dimensional finite element modeling. The simulation results show that the programming region of the phase change layer in the BTL cell is much smaller, and that the thermal and electrical distributions of the BTL cell are more concentrated on the TiN/GST interface. The results indicate that the BTL cell has the advantages of increasing the heating efficiency, decreasing the power consumption, and reducing the RESET current from 0.67 mA to 0.32 mA. Therefore, the BTL cell is appropriate for high-performance PCRAM devices with lower power consumption and lower RESET current.
This paper investigated phase change Si1Sb2Te3 material for application in chalcogenide random access memory. Current-voltage measurements were conducted to determine the threshold current of the phase change from the amorphous phase to the polycrystalline phase. The film has a threshold current of about 0.155 mA, which is smaller than the 0.31 mA of Ge2Sb2Te5 film. Amorphous Si1Sb2Te3 changes to a face-centred-cubic structure at ~180 °C and to a hexagonal structure at ~270 °C. The annealing-temperature dependence of the electrical resistivity of Si1Sb2Te3 film was studied by the four-point probe method. Data retention of the films was characterized as well.
Recent progress in magnetic tunnel junctions with perpendicular magnetic anisotropy (PMA) is reviewed and summarized. First, the concept and source of PMA are introduced. Next, a historical overview is given of PMA materials used as magnetic electrodes, such as the RE-TM alloys TbFeCo and GdFeCo, novel tetragonal manganese alloys Mn-Ga, L1₀-ordered (Co, Fe)/Pt alloy, multilayer film [Co, Fe, CoFe/Pt, Pd, Ni, Au]N, and ultra-thin magnetic metal/oxidized barrier. The rest of the article focuses on the optimization and fabrication of CoFeB/MgO/CoFeB p-MTJs, which are thought to have high potential to meet the main demands of non-volatile magnetic random access memory.
In this letter, Ta/HfO/BN/TiN resistive switching devices are fabricated, and each exhibits low power consumption and high uniformity. The reset current is reduced for the HfO/BN bilayer device compared with that for the Ta/HfO/TiN structure. Furthermore, the reset current decreases with increasing BN thickness. The HfO layer is the dominant switching layer, while the low-permittivity and high-resistivity BN layer acts as a barrier to electron injection into the TiN electrode. The current conduction mechanism of the low resistance state in the HfO/BN bilayer device is space-charge-limited current (SCLC), while it is Ohmic conduction in the HfO device.
Synergistic effects of the total ionizing dose (TID) on the single event upset (SEU) sensitivity in static random access memories (SRAMs) were studied by using protons. The total dose was accumulated with high flux protons during the TID exposure, and the SEU cross section was tested with low flux protons at several accumulated dose steps. Because of the radiation-induced off-state leakage current increase of the CMOS transistors, the noise margin became asymmetric and the memory imprint effect was observed.
Funding: Supported in part by the National Research Foundation of Korea (NRF) grant (No. 2019R1A2C1009275) and by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub).
Funding: The National Key Research and Development Program of China (No. 2017YFC0212100) and the National High-tech R&D Program of China (No. 2015AA015308).
Funding: Projects (61003075, 61170261) supported by the National Natural Science Foundation of China.
Funding: Supported by the National Natural Science Foundation of China (61272120, 61634004, 61602377), the Shaanxi Provincial Co-ordination Innovation Project of Science and Technology (2016KTZDGY02-04-02), and the Scientific Research Program Funded by Shaanxi Provincial Education Department (17JK0689).
Funding: Supported in part by the Open Fund of the State Key Laboratory of Integrated Chips and Systems, Fudan University, and in part by the National Science Foundation of China under Grant Nos. 62304133 and 62350610271.
Funding: Supported by the Key-Area Research and Development Program of Guangdong Province (Grant No. 2021B0909060002), the National Natural Science Foundation of China (Grant Nos. 62204219, 62204140), and the Major Program of Natural Science Foundation of Zhejiang Province (Grant No. LDT23F0401). Thanks to Professor Zhang Yishu from Zhejiang University, Professor Gao Xu from Soochow University, and Professor Zhong Shuai from Guangdong Institute of Intelligence Science and Technology for their support.
Abstract: Embedded memory, which relies heavily on the manufacturing process, has been widely adopted in various industrial applications. As the field of embedded memory continues to evolve, innovative strategies are emerging to enhance performance. Among them, resistive random access memory (RRAM) has gained significant attention due to its numerous advantages over traditional memory devices, including high speed (<1 ns), high density (4F²·n⁻¹), high scalability (~nm), and low power consumption (~pJ). This review focuses on the recent progress of embedded RRAM in industrial manufacturing and its potential applications. It provides a brief introduction to the concepts and advantages of RRAM, discusses the key factors that impact its industrial manufacturing, and presents the commercial progress driven by cutting-edge nanotechnology, which has been pursued by many semiconductor giants. Additionally, it highlights the adoption of embedded RRAM in emerging applications within the realm of the Internet of Things and future intelligent computing, with a particular emphasis on its role in neuromorphic computing. Finally, the review discusses the current challenges and provides insights into the prospects of embedded RRAM in the era of big data and artificial intelligence.
Funding: Supported by the U.S. National Science Foundation under Grant Nos. CCF-2131946, CCF-1953980, and CCF-1702980.
Abstract: Graph convolutional neural networks (GCNs) have emerged as an effective approach to extending deep learning for graph data analytics, but they are computationally challenging given the irregular graphs and the large number of nodes in a graph. GCNs involve chain sparse-dense matrix multiplications with six loops, which results in a large design space for GCN accelerators. Prior work on GCN acceleration either employs limited loop optimization techniques or determines the design variables based on random sampling, which can hardly exploit data reuse efficiently, thus degrading system efficiency. To overcome this limitation, this paper proposes GShuttle, a GCN acceleration scheme that maximizes memory access efficiency to achieve high performance and energy efficiency. GShuttle systematically explores loop optimization techniques for GCN acceleration, and quantitatively analyzes the design objectives (e.g., required DRAM accesses and SRAM accesses) by analytical calculation based on multiple design variables. GShuttle further employs two approaches, pruned search space sweeping and greedy search, to find the optimal design variables under certain design constraints. We demonstrated the efficacy of GShuttle by evaluation on five widely used graph datasets. The experimental simulations show that GShuttle reduces the number of DRAM accesses by a factor of 1.5 and saves energy by a factor of 1.7 compared with the state-of-the-art approaches.
Abstract: The Computational Fluid Dynamics-Discrete Element Method is used to model gas-solid systems in several applications in the energy, pharmaceutical, and petrochemical industries. Computational performance bottlenecks often limit the problem sizes that can be simulated at industrial scale. The data structures used to store several million particles in such large-scale simulations have a large memory footprint that does not fit into the processor cache hierarchies on current high-performance-computing platforms, leading to reduced computational performance. This paper specifically addresses memory access bottlenecks in industrial-scale simulations. The use of space-filling curves to improve memory access patterns is described, and their impact on computational performance is quantified in both shared and distributed memory parallelization paradigms. The Morton space-filling curve applied to uniform grids and k-dimensional tree partitions are used to reorder the particle data structure, thus improving spatial and temporal locality in memory. The performance impact of these techniques when applied to two benchmark problems, namely the homogeneous cooling system and a fluidized bed, is presented. These optimization techniques lead to an approximately two-fold performance improvement in particle-focused operations such as neighbor-list creation and data exchange, with a ~1.5 times overall improvement in a fluidization simulation with 1.27 million particles.
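As a rough illustration of the Morton reordering described in the abstract above, the following Python sketch sorts particles by the Morton code of the uniform-grid cell they fall in. It is a minimal sketch, not the authors' implementation: the function names `morton3d` and `morton_order`, the `cell_size` parameter, the 10-bit-per-axis limit, and the assumption that all coordinates are non-negative are hypothetical choices made for this example.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three cell indices into one Morton code.

    Bit b of ix goes to position 3*b, of iy to 3*b + 1, of iz to 3*b + 2,
    so cells that are close in 3D space tend to get nearby codes.
    """
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def morton_order(positions, cell_size, bits=10):
    """Return particle indices sorted by the Morton code of their grid cell.

    Assumes (for this sketch only) non-negative coordinates and a uniform
    grid with cubic cells of edge length `cell_size`.
    """
    def key(i):
        x, y, z = positions[i]
        return morton3d(int(x // cell_size),
                        int(y // cell_size),
                        int(z // cell_size), bits)

    return sorted(range(len(positions)), key=key)


if __name__ == "__main__":
    particles = [(3.5, 0.2, 0.1), (0.1, 0.1, 0.1), (0.2, 1.5, 0.3), (1.2, 0.4, 0.0)]
    order = morton_order(particles, cell_size=1.0)
    # Apply the permutation so that spatial neighbours sit close in memory.
    reordered = [particles[i] for i in order]
    print(order, reordered)
```

Applying the returned permutation to every per-particle array (positions, velocities, and so on) keeps the arrays consistent with one another; how often to re-sort as particles move is a separate tuning decision that the sketch does not address.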
Abstract: As one of the most notorious classes of programming errors, memory access errors still hurt modern software security. In particular, they are hidden deep in important software systems written in memory-unsafe languages like C/C++. Many approaches have been proposed to detect bugs leading to memory access errors. However, all existing works lack the ability to handle two challenges. First, they are not able to tackle fine-grained memory access errors, e.g., data overflow inside one data structure. These errors are usually overlooked for a long time since they happen inside one memory block and do not lead to a program crash. Second, most existing works rely on source code or debugging information to recover memory boundary information, so they cannot be directly applied to the detection of memory access errors in binary code. However, searching for memory access errors in binary code is a very common scenario in software vulnerability detection and exploitation. In order to overcome these challenges, we propose Memory Access Integrity (MAI), a dynamic method to detect fine-grained memory access errors in off-the-shelf binary executables. The core idea is to recover a fine-grained accessing policy between memory access behaviors and memory ranges, and then detect memory access errors based on the policy. The key insight in our work is that memory accessing patterns reveal information for recovering the boundary of memory objects and the accessing policy. Based on this recovered information, our method maintains a new memory model to simulate the life cycle of memory objects and reports errors when any accessing policy is violated. We evaluate our tool on popular CTF datasets and real-world software. Compared with the state-of-the-art detection tool, the evaluation result demonstrates that our tool can detect fine-grained memory access errors effectively and efficiently. As a practical impact, our tool has detected three 0-day memory access errors in an audio decoder.
Funding: Supported by the National Key R&D Program of China (No. 2017YFB0202003).
Abstract: The radiation damage effect of key structural materials is one of the main research subjects of the numerical reactor. From the perspective of experimental safety and feasibility, Molecular Dynamics (MD) in the materials field is an ideal method for simulating the radiation damage of structural materials. Crystal-MD is a massively parallel MD simulation software package based on the key material characteristics of reactors. Compared with the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) and ITAP Molecular Dynamics (IMD) software, Crystal-MD reduces the memory required for software operation to a certain extent, but it is very time-consuming. Moreover, the calculation results of Crystal-MD have large deviations, and there are also some problems, such as memory limitation and frequent communication, during its migration and optimization. In this paper, to solve the above problems, the memory access mode of the Crystal-MD software is studied. Based on the memory access mode, a memory access optimization strategy is proposed for the unique architecture of China's Sunway TaihuLight supercomputer. The proposed optimization strategy is verified by experiments, and the experimental results show that the running speed of Crystal-MD is increased significantly by using the proposed optimization strategy.
Funding: Project supported by the National Key Research and Development Program of China (No. 2018YFB1003500).
Abstract: General purpose graphics processing units (GPGPUs) can be used to improve computing performance considerably for regular applications. However, irregular memory access exists in many applications, and the benefits of graphics processing units (GPUs) are less substantial for irregular applications. In recent years, several studies have presented solutions to remove static irregular memory access. However, eliminating dynamic irregular memory access with software remains a serious challenge. A pure software solution without hardware extensions or offline profiling is proposed to eliminate dynamic irregular memory access, especially for indirect memory access. Data reordering and index redirection are suggested to reduce the number of memory transactions, thereby improving the performance of GPU kernels. To improve the efficiency of data reordering, the data-reordering operation is offloaded to the GPU to reduce overhead and data transfer. By concurrently executing the compute unified device architecture (CUDA) streams of data reordering and the data processing kernel, the overhead of data reordering can be reduced. After these optimizations, the volume of memory transactions can be reduced by 16.7%-50% compared with CUSPARSE-based benchmarks, and the performance of irregular kernels can be improved by 9.64%-34.9% using an NVIDIA Tesla P4 GPU.
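The idea of data reordering plus index redirection from the abstract above can be sketched on the CPU in a few lines of Python. This is only an illustrative model under stated assumptions, not the paper's GPU implementation: the helper names `reorder_and_redirect` and `count_transactions` are hypothetical, and the toy `group` and `segment` sizes stand in for a warp and a memory segment without matching any real hardware.

```python
def reorder_and_redirect(data, idx):
    """Lay out `data` in order of first use by `idx` and remap the indices.

    Returns (reordered, redirected) such that
    reordered[redirected[i]] == data[idx[i]] for every i.
    """
    new_pos = {}      # old position -> new position
    reordered = []
    for old in idx:   # elements that are actually accessed, in access order
        if old not in new_pos:
            new_pos[old] = len(reordered)
            reordered.append(data[old])
    for old, value in enumerate(data):  # untouched elements go at the end
        if old not in new_pos:
            new_pos[old] = len(reordered)
            reordered.append(value)
    redirected = [new_pos[old] for old in idx]
    return reordered, redirected


def count_transactions(idx, group=4, segment=4):
    """Toy cost model: distinct `segment`-sized blocks touched per access group."""
    total = 0
    for start in range(0, len(idx), group):
        total += len({i // segment for i in idx[start:start + group]})
    return total


if __name__ == "__main__":
    data = list(range(10, 26))            # 16 elements
    idx = [0, 5, 10, 15, 1, 6, 11, 12]    # scattered indirect accesses
    reordered, redirected = reorder_and_redirect(data, idx)
    print(count_transactions(idx), count_transactions(redirected))
```

In this toy model the gathered values are unchanged while consecutive accesses fall into fewer segments; on a real GPU the benefit would also depend on the cost of the reordering pass itself, which the abstract says is hidden by running it in a concurrent CUDA stream.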
Funding: Supported by the State Key Program of the National Natural Science Foundation of China (Grant No. 60836004) and the National Natural Science Foundation of China (Grant Nos. 61076025 and 61006070).
Abstract: Using computer-aided design three-dimensional (3D) simulation technology, the recovery mechanism of single event upset and the effects of spacing and hit angle on the recovery are studied. It is found that multi-node charge collection plays a key role in recovery: when the charge sharing is shielded by adding guard rings, the recovery effect is not exhibited. It is also indicated that the upset linear energy transfer (LET) threshold remains constant while the recovery LET threshold increases as the spacing increases. Additionally, the effect of incident angle on recovery is analysed, and it is shown that a larger angle brings about a stronger charge sharing effect, thus strengthening the recovery ability.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 21773291, 61904118, and 22002102), the Natural Science Foundation of Jiangsu Province, China (Grant Nos. BK20190935 and BK20190947), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant Nos. 19KJA210005, 19KJB510012, 19KJB120005, and 19KJB430034), the Fund from the Suzhou Key Laboratory for Nanophotonic and Nanoelectronic Materials and Its Devices (Grant No. SZS201812), the Science Fund from the Jiangsu Key Laboratory for Environment Functional Materials, and the State Key Laboratory of Transducer Technology, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences.
Abstract: The era of information explosion is coming, and information needs to be continuously stored and randomly accessed over long periods, which constitutes an insurmountable challenge for existing data centers. At present, computing devices use the von Neumann architecture with separate computing and memory units, which exposes the shortcoming known as the "memory bottleneck". Nonvolatile memristors can realize data storage and in-memory computing at the same time and promise to overcome this bottleneck. Phase-change random access memory (PCRAM) is regarded as one of the best solutions for next-generation non-volatile memory. Owing to its high speed, good data retention, high density, and low power consumption, PCRAM has broad commercial prospects in in-memory computing applications. In this review, the research progress of phase-change materials and device structures for PCRAM, as well as the most critical performance metrics for a universal memory, such as speed, capacity, and power consumption, are reviewed. By comparing the advantages and disadvantages of the phase-change optical disk and PCRAM, a new concept of optoelectronic hybrid storage based on phase-change material is proposed. Furthermore, its feasibility to replace existing memory technologies as a universal memory is also discussed.
Funding: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No XDA09020402, the National Integrate Circuit Research Program of China under Grant No 2009ZX02023-003, the National Natural Science Foundation of China under Grant Nos 61261160500, 61376006, 61401444 and 61504157, and the Science and Technology Council of Shanghai under Grant Nos 14DZ2294900, 15DZ2270900 and 14ZR1447500.
Abstract: An optimized device structure for reducing the RESET current of phase-change random access memory (PCRAM) with a blade-type like (BTL) phase change layer is proposed. The electrical-thermal behaviors of the BTL cell and the blade heater contactor structure during the RESET operation are compared using three-dimensional finite element modeling. The simulation results show that the programming region of the phase change layer in the BTL cell is much smaller, and the thermal and electrical distributions of the BTL cell are more concentrated on the TiN/GST interface. The results indicate that the BTL cell has the advantages of increasing the heating efficiency, decreasing the power consumption, and reducing the RESET current from 0.67 mA to 0.32 mA. Therefore, the BTL cell is appropriate for high-performance PCRAM devices with lower power consumption and lower RESET current.
Abstract: This paper investigates the phase-change material Si1Sb2Te3 for application in chalcogenide random access memory. Current-voltage measurements were conducted to determine the threshold current of the phase change from the amorphous phase to the polycrystalline phase. The film has a threshold current of about 0.155 mA, which is smaller than the value of 0.31 mA for Ge2Sb2Te5 film. Amorphous Si1Sb2Te3 changes to a face-centred-cubic structure at ~180 °C and to a hexagonal structure at ~270 °C. The annealing-temperature-dependent electrical resistivity of the Si1Sb2Te3 film was studied by the four-point probe method. Data retention of the films was characterized as well.
Funding: Supported by the State Key Project of Fundamental Research of Ministry of Science and Technology, China (Grant No. 2010CB934400) and the National Natural Science Foundation of China (Grant Nos. 51229101 and 11374351).
Abstract: Recent progress in magnetic tunnel junctions with perpendicular magnetic anisotropy (PMA) is reviewed and summarized. First, the concept and source of PMA are introduced. Next, a historical overview is offered of PMA materials used as magnetic electrodes, such as the RE-TM alloys TbFeCo and GdFeCo, novel tetragonal manganese alloys Mn-Ga, L10-ordered (Co, Fe)/Pt alloy, multilayer film [Co, Fe, CoFe/Pt, Pd, Ni, Au]N, and ultra-thin magnetic metal/oxidized barrier. The rest of the article focuses on the optimization and fabrication of CoFeB/MgO/CoFeB p-MTJs, which are thought to have high potential to meet the main demands of non-volatile magnetic random access memory.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61274113, 11204212, 61404091, 51502203, and 51502204), the Tianjin Natural Science Foundation, China (Grant Nos. 14JCZDJC31500 and 14JCQNJC00800), and the Tianjin Science and Technology Developmental Funds of Universities and Colleges, China (Grant No. 20130701).
Abstract: In this letter, Ta/HfO/BN/TiN resistive switching devices are fabricated, and they exhibit low power consumption and high uniformity. The reset current is reduced for the HfO/BN bilayer device compared with that for the Ta/HfO/TiN structure. Furthermore, the reset current decreases with increasing BN thickness. The HfO layer is the dominant switching layer, while the low-permittivity and high-resistivity BN layer acts as a barrier to electron injection into the TiN electrode. The current conduction mechanism of the low resistance state in the HfO/BN bilayer device is space-charge-limited current (SCLC), while it is Ohmic conduction in the HfO device.
Funding: Supported by the Open Foundation of State Key Laboratory of Electronic Thin Films and Integrated Devices, China (Grant No. KFJJ201306).
Abstract: Synergistic effects of the total ionizing dose (TID) on the single event upset (SEU) sensitivity in static random access memories (SRAMs) were studied by using protons. The total dose was accumulated with high-flux protons during the TID exposure, and the SEU cross section was tested with low-flux protons at several accumulated dose steps. Because of the radiation-induced increase in the off-state leakage current of the CMOS transistors, the noise margin became asymmetric and the memory imprint effect was observed.