Funding: Supported by the National Natural Science Foundation of China (Nos. 61802304, 61834005, 61772417, 61602377) and the Shaanxi Province Key R&D Plan (No. 2021GY-029).
Abstract: Deep learning algorithms have been widely used in computer vision, natural language processing, and other fields. However, as deep learning models continue to grow in scale, their storage and computing requirements rise accordingly, and processors based on the von Neumann architecture have gradually exposed significant shortcomings such as high power consumption and long latency. To alleviate this problem, large-scale processing systems are shifting from a traditional computing-centric model to a data-centric model. A near-memory computing array architecture based on a shared buffer is proposed in this paper to improve system performance. It supports instructions with store-calculation integration characteristics, reducing data movement between the processor and main memory, and further improves the processing speed of the algorithm through data reuse. The proposed architecture is verified and tested through a parallel realization of the convolutional neural network (CNN) algorithm. The experimental results show that, at a frequency of 110 MHz, the calculation speed of a single convolution operation is increased by 66.64% on average compared with a CNN architecture that performs parallel calculations on a field-programmable gate array (FPGA). The processing speed of the whole convolution layer is improved by 8.81% compared with a reconfigurable array processor that does not support near-memory computing.
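The data-reuse idea the abstract mentions can be illustrated in software. In a sliding-window convolution, adjacent output positions read overlapping input rows, so staging a few rows in a small buffer (the software analogue of an on-chip shared buffer) lets each row fetched from main memory serve many multiply-accumulates. This is only a minimal sketch of that access pattern, not the paper's hardware; the function name and buffer layout are hypothetical.

```python
def conv2d_rowbuf(image, kernel):
    """2D valid convolution; K rows of `image` are staged in a small
    row buffer so each row loaded from 'main memory' is reused across
    every output column and across K output rows (data reuse)."""
    H, W = len(image), len(image[0])
    K = len(kernel)
    out = [[0.0] * (W - K + 1) for _ in range(H - K + 1)]
    row_buf = image[:K]  # holds only K rows at a time, like an on-chip buffer
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            acc = 0.0
            for ki in range(K):
                for kj in range(K):
                    acc += row_buf[ki][j + kj] * kernel[ki][kj]
            out[i][j] = acc
        if i + K < H:  # slide the window: drop the oldest row, load one new row
            row_buf = row_buf[1:] + [image[i + K]]
    return out
```

With a K×K kernel, each input row is fetched once but contributes to up to K output rows, which is the kind of reuse a near-memory buffer exploits to cut processor-memory traffic.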
Funding: This work is supported by the National Natural Science Foundation of China (No. 21405109) and the Seed Foundation of the State Key Laboratory of Precision Measurement Technology and Instruments, China (No. Pilt1710).
Abstract: Flash memories and semiconductor p-n junctions are two elementary but incompatible building blocks of most electronic and optoelectronic devices. The pressing demand to efficiently transfer massive data between memories and logic circuits, as well as for high data storage capability and device integration density, has fueled the rapid growth of technique and material innovations. Two-dimensional (2D) materials are considered one of the most promising candidates to address this challenge. However, a key requirement for building functional devices from 2D materials is effective and accurate control of the carrier polarity, concentration, and spatial distribution in the atomically thin structures. Here, a non-volatile opto-electrical doping approach is demonstrated, which enables reversibly writing spatially resolved doping patterns into the MoTe2 conductance channel through a MoTe2/hexagonal boron nitride (h-BN) heterostructure. Based on the doping effect induced by the combination of electrostatic modulation and ultraviolet light illumination, a 3-bit flash memory and various homojunctions are successfully developed on the same MoTe2/BN heterostructure. The flash memory achieved eight well-distinguished memory states with a maximum on/off ratio over 10^4, and each state showed negligible decay over a retention time of 2,400 s. The heterostructure also allowed the formation of p-p, n-n, p-n, and n-p homojunctions and free transitions among these states. The MoTe2 p-n homojunction, with a rectification ratio of 10^3, exhibited excellent photodetection and photovoltaic performance. Having the memory device and the p-n junction built on the same structure makes it possible to bring memory and computational circuits onto the same chip, one step further toward realizing near-memory computing.
Funding: The National Key Research and Development Program of China under Grant Nos. 2018YFB0204400, 2016YFB0201305, 2016YFB0200803, 2016YFB0200300, and XDC01030000; the National Natural Science Foundation of China under Grant Nos. 6197237 and 61702483; and the CAS QYZDJ-SSW-JSC035 Funding.
Abstract: Genomic sequence alignment is the most critical and time-consuming step in genomic analysis. Alignment algorithms generally follow a seed-and-extend model. Acceleration of the extension phase of sequence alignment (e.g., the Smith-Waterman algorithm) has been well explored in computing-centric architectures on field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), and graphics processing units (GPU). Compared with the extension phase, the seeding phase is more critical and essential; however, it is memory-bound, i.e., dominated by fine-grained random memory accesses with limited parallelism on conventional systems. In this paper, we argue that the processing-in-memory (PIM) concept could be a viable solution to these problems. This paper describes "PIM-Align", an application-driven near-data processing architecture for sequence alignment. To achieve memory-capacity-proportional performance by taking advantage of 3D-stacked dynamic random access memory (DRAM) technology, we propose a lightweight message mechanism between different memory partitions and a specialized hardware prefetcher for the memory access patterns of sequence alignment. Our evaluation shows that the proposed architecture can achieve 20x and 1820x speedup when compared with the best available ASIC implementation and with software running on a 32-thread CPU, respectively.
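The seed-and-extend model the abstract refers to can be sketched in a few lines of software. The seeding phase performs one hash-table lookup per k-mer of the read (the fine-grained random memory accesses that make seeding memory-bound), while the extension phase is a sequential comparison from each candidate position. This is a minimal illustrative sketch under simplifying assumptions (ungapped extension, exact k-mer seeds), not the PIM-Align hardware; all names and the seed length are hypothetical.

```python
from collections import defaultdict

K = 5  # seed (k-mer) length, chosen arbitrarily for this demo

def build_index(reference):
    """Seeding data structure: map each k-mer to its positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - K + 1):
        index[reference[i:i + K]].append(i)
    return index

def seed_and_extend(read, reference, index):
    """Return (ref_start, match_len) of the best ungapped extension found."""
    best = (-1, 0)
    for j in range(len(read) - K + 1):       # seeding: random index lookups
        for pos in index.get(read[j:j + K], ()):
            start = pos - j                  # candidate alignment start
            if start < 0:
                continue
            length = 0                       # extension: sequential scan
            while (start + length < len(reference) and length < len(read)
                   and reference[start + length] == read[length]):
                length += 1
            if length > best[1]:
                best = (start, length)
    return best
```

Note the contrast in access patterns: `index.get(...)` jumps to scattered locations (hard to prefetch, cheap to do near the data), while the extension loop streams contiguously, which is why extension accelerates well on conventional architectures but seeding motivates near-data processing.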