As a computing paradigm that combines temporal and spatial computation, dynamic reconfigurable computing offers advantages in flexibility, energy efficiency, and area efficiency, attracting interest from both academia and industry. However, dynamic reconfigurable computing is not yet mature because of several unsolved problems. This work introduces the concept, architecture, and compilation techniques of dynamic reconfigurable computing. It also discusses the major open challenges and points out potential applications.
This paper describes a new specialized Reconfigurable Cryptographic for Block ciphers Architecture (RCBA). In RCBA, application-specific computation pipelines can be configured according to the characteristics of block cipher processing, which delivers high performance for cryptographic applications. RCBA adopts a coarse-grained reconfigurable architecture that mixes an appropriate amount of static configuration with dynamic configuration. RCBA has been implemented on an Altera FPGA, and representative block cipher algorithms such as DES, Rijndael, and RC6 have been mapped onto the RCBA architecture successfully. System performance has been analyzed, and the analysis demonstrates that the RCBA architecture achieves greater flexibility and efficiency than other implementations.
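The abstract does not specify RCBA's internal organization, so the following is only a toy sketch of the general idea of assembling an application-specific cipher pipeline from coarse-grained stages. The stage functions, the 16-bit block width, and the toy S-box are illustrative assumptions, not RCBA's actual hardware units.

```python
# Toy illustration of configuring a block-cipher pipeline from coarse-grained
# stages. The stages and the 16-bit block width are illustrative only.

def substitute(block: int, key: int) -> int:
    # toy S-box applied nibble-wise; key unused (uniform stage signature)
    sbox = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
            0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
    out = 0
    for i in range(4):
        out |= sbox[(block >> (4 * i)) & 0xF] << (4 * i)
    return out

def permute(block: int, key: int) -> int:
    # toy permutation: rotate the 16-bit block left by 5
    return ((block << 5) | (block >> 11)) & 0xFFFF

def mix_key(block: int, key: int) -> int:
    return block ^ (key & 0xFFFF)

# A "configuration" is simply an ordered list of stages; different ciphers
# would select different stages and repetition counts.
def run_pipeline(config, block, round_keys):
    for stage, key in zip(config, round_keys):
        block = stage(block, key)
    return block

config = [mix_key, substitute, permute, mix_key]        # one toy round
print(hex(run_pipeline(config, 0x1234, [0xA5A5, 0, 0, 0x0F0F])))
```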
This paper introduces a new datapath architecture for reconfigurable processors. The proposed datapath is based on a Network-on-Chip (NoC) approach and facilitates tight coupling of all functional units. Reconfigurable functional elements can be dynamically allocated for application-specific optimizations, enabling polymorphic computing. Using a modified network simulator, the performance of several NoC topologies and parameter settings is investigated with standard benchmark programs, including fine-grained and coarse-grained computations. Simulation results highlight the flexibility and scalability of the proposed polymorphic NoC processor for a wide range of application domains.
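As a rough companion to the topology comparison described above, the sketch below computes average hop counts for k x k mesh and torus NoCs under uniform random traffic and dimension-order routing. It is a generic back-of-the-envelope model, not the modified network simulator used in the paper.

```python
# Average-hop-count estimates for k x k mesh and torus NoC topologies,
# assuming uniform random traffic and dimension-order (XY) routing.

def mesh_hops(ax, ay, bx, by):
    return abs(ax - bx) + abs(ay - by)

def torus_hops(ax, ay, bx, by, k):
    dx, dy = abs(ax - bx), abs(ay - by)
    return min(dx, k - dx) + min(dy, k - dy)   # wrap-around links shorten paths

def average_hops(k, hop_fn):
    nodes = [(x, y) for x in range(k) for y in range(k)]
    total = pairs = 0
    for a in nodes:
        for b in nodes:
            if a != b:
                total += hop_fn(*a, *b)
                pairs += 1
    return total / pairs

for k in (4, 8):
    mesh = average_hops(k, mesh_hops)
    torus = average_hops(k, lambda ax, ay, bx, by: torus_hops(ax, ay, bx, by, k))
    print(f"{k}x{k}: mesh {mesh:.2f} hops, torus {torus:.2f} hops")
```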
In order to improve the concurrent access performance of a web-based spatial computing system in a cluster, a parallel scheduling strategy based on the multi-core environment is proposed, which includes two levels of parallel processing mechanisms: one evenly allocates tasks to each server node in the cluster, and the other implements load balancing inside a server node. Based on this strategy, a new web-based spatial computing model is designed in this paper, which focuses on a task response ratio calculation method, a request queue buffer mechanism, and a thread scheduling strategy. Experimental results show that the new model can fully exploit the multi-core computing advantage of each server node in a concurrent access environment and improve the average hits per second, average I/O hits, CPU utilization, and throughput. A speed-up ratio analysis of the traditional model and the new one shows that the new model has the best performance. The performance of the multi-core server nodes in the cluster is optimized, and the resource utilization and parallel processing capabilities are enhanced. The more CPU cores available, the higher the parallel processing capability obtained.
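The abstract names a task response ratio calculation but does not give its formula. The sketch below assumes a highest-response-ratio-next style ratio, (waiting time + estimated service time) / estimated service time, and a least-loaded-node rule; both are illustrative assumptions rather than the paper's method.

```python
import time

# Sketch of a two-level dispatcher: queued requests are ordered by an assumed
# response ratio, and each chosen request goes to the least-loaded server node.

class Request:
    def __init__(self, rid, est_service):
        self.rid = rid
        self.est_service = est_service        # estimated service time (s)
        self.arrival = time.monotonic()

    def response_ratio(self, now):
        waiting = now - self.arrival
        return (waiting + self.est_service) / self.est_service

def dispatch(queue, node_loads):
    """Pick the request with the highest response ratio and the node with the
    lowest outstanding load; return (request, node_index)."""
    now = time.monotonic()
    best = max(queue, key=lambda r: r.response_ratio(now))
    queue.remove(best)
    node = min(range(len(node_loads)), key=node_loads.__getitem__)
    node_loads[node] += best.est_service
    return best, node

queue = [Request(i, est) for i, est in enumerate([0.2, 1.5, 0.4])]
loads = [0.0, 0.0, 0.0, 0.0]                  # four hypothetical server nodes
while queue:
    req, node = dispatch(queue, loads)
    print(f"request {req.rid} -> node {node}")
```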
Due to the diversity of graph computing applications, the power-law distribution of graph data, and the high compute-to-memory ratio, traditional architectures face significant challenges of poor flexibility, imbalanced workload distribution, and inefficient memory access when executing graph computing tasks. A graph computing accelerator, GraphApp, based on a reconfigurable processing element (PE) array is proposed to address these challenges. GraphApp utilizes 16 reconfigurable PEs for parallel computation and employs tiled data: by dividing the data into appropriately sized tiles, load balancing is achieved and the overall efficiency of parallel computation is enhanced. Additionally, it preprocesses graph data using the compressed sparse columns independently (CSCI) data compression format to alleviate the low memory access efficiency caused by the high memory-access-to-computation ratio. Lastly, GraphApp is evaluated using triangle counting (TC) and depth-first search (DFS) algorithms. Performance is analyzed by measuring the execution time of these algorithms on GraphApp against two typical graph frameworks, Ligra and GraphBIG, using six datasets from the Stanford Network Analysis Project (SNAP) database. The results show that GraphApp achieves a maximum performance improvement of 30.86% over Ligra and 20.43% over GraphBIG when processing the same datasets.
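The CSCI format is specific to GraphApp and not defined in the abstract. As a stand-in, the sketch below builds an ordinary compressed sparse column (CSC) structure and splits its columns into tiles with roughly equal edge counts, illustrating how tiling can balance work across PEs.

```python
# Plain CSC storage of a directed graph plus an edge-balanced column tiling.
# This is a generic stand-in for the idea, not the paper's CSCI format.

def build_csc(num_vertices, edges):
    """edges: list of (src, dst); columns are indexed by dst."""
    col_ptr = [0] * (num_vertices + 1)
    for _, dst in edges:
        col_ptr[dst + 1] += 1
    for v in range(num_vertices):               # prefix sums -> column offsets
        col_ptr[v + 1] += col_ptr[v]
    row_idx = [0] * len(edges)
    fill = col_ptr[:-1].copy()
    for src, dst in edges:
        row_idx[fill[dst]] = src
        fill[dst] += 1
    return col_ptr, row_idx

def tile_columns(col_ptr, num_tiles):
    """Split columns into num_tiles ranges with roughly equal edge counts,
    so each PE receives a similar amount of work."""
    total, target = col_ptr[-1], col_ptr[-1] / num_tiles
    tiles, start = [], 0
    for t in range(1, num_tiles):
        end = start
        while end < len(col_ptr) - 1 and col_ptr[end] < t * target:
            end += 1
        tiles.append((start, end))
        start = end
    tiles.append((start, len(col_ptr) - 1))
    return tiles

edges = [(0, 1), (0, 2), (2, 1), (3, 1), (1, 3), (2, 3), (3, 0)]
col_ptr, row_idx = build_csc(4, edges)
print(col_ptr, row_idx)
print(tile_columns(col_ptr, 2))
```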
The concept and advantages of reconfigurable technology are introduced. A reconfigurable macro processor (RMP) architecture based on an FPGA array and a DSP is put forward and has been implemented. Two image algorithms are developed: template-based automatic target recognition and zone labeling. The first estimates motion direction against an infrared image background; the second is a line-extraction algorithm based on image zone labeling and a phase grouping technique. Each is a 'hardware' function that can be called by the DSP from a high-level algorithm; in effect, it is a hardware algorithm of the DSP. Experimental results show that reconfigurable computing technology based on RMP is an ideal means of accelerating high-speed image processing tasks, and high real-time performance is obtained in both applications on RMP.
Polymorphic computing is widely seen as the next evolutionary step in designing advanced computing architectures. This paper presents a brief history of reconfigurable and polymorphic computing and highlights recent trends and challenges. A novel polymorphic architecture featuring programmable memory event triggers and a new concept of control agents is proposed. This architecture can provide dynamic load balancing, distributed control, separated memory and processing fabrics, configurable memory blocks, and task-optimized computation.
Traditional digital processing approaches are based on semiconductor transistors, which suffer from high power consumption that worsens with technology node scaling. To solve this problem definitively, a number of emerging non-volatile nanodevices are under intense investigation. Meanwhile, novel computing circuits are being invented to exploit the full potential of these nanodevices. The combination of non-volatile nanodevices with suitable computing paradigms has many merits compared with structures based on complementary metal-oxide-semiconductor (CMOS) transistor technology, such as zero standby power, ultra-high density, non-volatility, and acceptable access speed. In this paper, we overview and compare the computing paradigms based on the emerging nanodevices towards ultra-low dissipation.
Deep learning algorithms have been widely used in computer vision, natural language processing, and other fields. However, due to the ever-increasing scale of deep learning models, the requirements for storage and computing performance keep rising, and processors based on the von Neumann architecture have gradually exposed significant shortcomings such as high power consumption and long latency. To alleviate this problem, large-scale processing systems are shifting from a traditional computing-centric model to a data-centric model. A near-memory computing array architecture based on a shared buffer is proposed in this paper to improve system performance. It supports instructions with store-calculation integration, reducing data movement between the processor and main memory, and data reuse further improves the processing speed of the algorithm. The proposed architecture is verified and tested through a parallel realization of a convolutional neural network (CNN) algorithm. The experimental results show that, at a frequency of 110 MHz, the speed of a single convolution operation is increased by 66.64% on average compared with a CNN architecture that performs parallel calculations on a field programmable gate array (FPGA). The processing speed of the whole convolution layer is improved by 8.81% compared with a reconfigurable array processor that does not support near-memory computing.
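To make the data-reuse argument concrete, the toy model below counts main-memory reads for a 2D convolution with and without a row buffer. The buffer organization and sizes are illustrative assumptions, not the paper's shared-buffer design.

```python
# Toy model of convolution data reuse: count main-memory reads with and
# without a k-row line buffer. The buffering scheme is illustrative only.

def conv2d_reads_naive(h, w, k):
    # every output pixel re-reads its full k x k window from main memory
    return (h - k + 1) * (w - k + 1) * k * k

def conv2d_reads_buffered(h, w, k):
    # a k-row line buffer lets each input element be read from memory once
    return h * w

h, w, k = 32, 32, 3
naive, buffered = conv2d_reads_naive(h, w, k), conv2d_reads_buffered(h, w, k)
print("naive reads:   ", naive)
print("buffered reads:", buffered)
print("reduction:      %.1fx" % (naive / buffered))
```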
Reconfiguration is the key to producing an applicable ternary optical computer (TOC). The method used to implement the reconfiguration function determines whether a TOC can move into applied fields. In this work, a design of the reconfiguration circuit based on a field programmable gate array (FPGA) is proposed, and the structure of the entire hardware system is discussed.
To simplify the fabrication process and increase the versatility of neuromorphic systems, the reconfiguration concept has attracted much attention. Here, we developed a novel electrochemical VO_(2) (EC-VO_(2)) device, which can be reconfigured as synapses or leaky integrate-and-fire (LIF) neurons. Ionic dynamic doping contributes to the resistance changes of VO_(2), which enables reversible modulation of device states. Analog resistance switching and tunable LIF functions were both measured on the same device to demonstrate the capacity for reconfiguration. Based on the reconfigurable EC-VO_(2), a simulated spiking neural network model exhibited excellent performance using low-precision weights and tunable output neurons, with a final accuracy of 91.92%.
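For readers unfamiliar with LIF dynamics, the following is a standard software leaky integrate-and-fire model. The time constant, threshold, and input values are generic choices, not parameters extracted from the EC-VO_(2) device.

```python
# Generic leaky integrate-and-fire (LIF) neuron model; all parameters are
# illustrative and not fitted to the EC-VO_(2) device.

def lif_run(input_current, dt=1e-3, tau=20e-3, v_rest=0.0,
            v_reset=0.0, v_th=1.0, r_m=1.0):
    v = v_rest
    spikes = []
    for t, i_in in enumerate(input_current):
        # membrane potential leaks toward v_rest and integrates the input
        dv = (-(v - v_rest) + r_m * i_in) * (dt / tau)
        v += dv
        if v >= v_th:                 # fire and reset
            spikes.append(t * dt)
            v = v_reset
    return spikes

# a constant suprathreshold input for 100 ms produces a regular spike train
print(lif_run([1.2] * 100))
```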
The present research attempted a Large-Eddy Simulation (LES) of airflow over a steep, three-dimensional isolated hill using the latest multi-core, multi-CPU systems. It was found that 1) turbulence simulations using approximately 50 million grid points are feasible, and 2) the use of this system achieved a high computation speed, exceeding the speed of parallel computation attained by a single CPU on one of the latest supercomputers. Furthermore, LES was conducted using multi-GPU systems. These simulations revealed the following findings: 1) a multi-GPU environment using the NVIDIA Tesla M2090 or M2075 could simulate turbulence in a model with approximately 50 million grid points, and 2) the computation speed achieved by the multi-GPU environments exceeded that of parallel computation using four to six CPUs of one of the latest supercomputers.
Reconfigurable computing systems can be reconfigured at runtime and support partial reconfiguration, which makes it possible to execute tasks in a true multitasking manner. To manage such systems at runtime, a reconfigurable operating system is needed. The main part of this operating system is the resource management unit, which performs on-line scheduling and placement of hardware tasks at runtime. Reconfiguration overhead is an important obstacle that limits the performance of on-line scheduling algorithms in reconfigurable computing systems and increases the overall execution time. Configuration reusing (task reusing) can decrease reconfiguration overhead considerably, particularly in periodic applications or applications in which the probability of task recurrence is high. In this paper, we present a technique called reusing-based scheduling (RBS) for on-line scheduling and placement, in which configuration reusing is the main characteristic, in order to reduce reconfiguration overhead and decrease the total execution time of the tasks. Several experiments have been conducted on the proposed algorithm, and the results show considerable improvement in the overall execution time of the tasks.
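The core idea of configuration reuse can be illustrated in a few lines: before loading a task's configuration, check whether it is already resident on the fabric and skip the reconfiguration overhead if so. The fixed overhead, fabric capacity, and LRU eviction below are illustrative assumptions, not the RBS algorithm itself.

```python
# Sketch of configuration reuse during on-line scheduling: an arriving task
# whose bitstream is already resident skips the reconfiguration overhead.

from collections import OrderedDict

RECONFIG_OVERHEAD = 5     # time units to load a configuration (assumed)
MAX_RESIDENT = 3          # how many configurations fit on the fabric (assumed)

resident = OrderedDict()  # config_id -> None, ordered by recency of use

def schedule(task_config, exec_time):
    """Return the time this task occupies the fabric."""
    if task_config in resident:
        resident.move_to_end(task_config)     # reuse: no reconfiguration
        return exec_time
    if len(resident) >= MAX_RESIDENT:
        resident.popitem(last=False)          # evict least recently used
    resident[task_config] = None
    return RECONFIG_OVERHEAD + exec_time

workload = ["fft", "fir", "fft", "aes", "fft", "fir"]
total = sum(schedule(cfg, exec_time=10) for cfg in workload)
print("total execution time:", total)  # reuse avoids three reconfigurations here
```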
Neuromorphic computing aims to achieve artificial intelligence by mimicking the mechanisms of the biological neurons and synapses that make up the human brain. However, the possibility of using one reconfigurable memristor as both an artificial neuron and a synapse still requires intensive research. In this work, an Ag/SrTiO_(3) (STO)/Pt memristor with a low operating voltage is manufactured and reconfigured as both a neuron and a synapse for a neuromorphic computing chip. By modulating the compliance current, two types of resistance switching, volatile and nonvolatile, can be obtained in the amorphous STO thin film; this is attributed to the manipulation of the Ag conductive filament. Furthermore, by regulating electrical pulses and designing bionic circuits, the neuronal function of leaky integrate-and-fire, as well as synaptic biomimicry with spike-timing-dependent plasticity and paired-pulse facilitation neural regulation, are successfully realized. This study shows that reconfigurable devices based on STO thin films are promising for the application of neuromorphic computing systems.
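As a companion to the synaptic measurements described above, the sketch below evaluates a textbook pair-based STDP weight-update rule. The amplitudes and time constants are generic values, not fitted to the Ag/STO/Pt device.

```python
import math

# Textbook pair-based STDP rule: the weight change depends on the time
# difference between pre- and post-synaptic spikes. Constants are generic.

def stdp_dw(delta_t, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """delta_t = t_post - t_pre in ms; positive -> potentiation, negative -> depression."""
    if delta_t >= 0:
        return a_plus * math.exp(-delta_t / tau_plus)
    return -a_minus * math.exp(delta_t / tau_minus)

for dt in (-40, -10, -1, 1, 10, 40):
    print(f"delta_t = {dt:+4d} ms  ->  dw = {stdp_dw(dt):+.5f}")
```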
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention for accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, the DEM algorithm is shown to have high scalability in multi-thread parallel computation on a CPU.
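The abstract does not detail the proposed multi-thread algorithm, so the sketch below shows only the standard linked-cell neighbor search that keeps DEM contact detection memory linear in the particle count; cells (or blocks of cells) are natural units to hand to separate threads.

```python
# Linked-cell neighbor search for DEM-style contact detection: memory stays
# O(N), and each cell can be processed independently by a worker thread.
# This is a generic illustration, not the algorithm proposed in the paper.

from collections import defaultdict
from itertools import product

def build_cells(positions, cell_size):
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        cells[(int(x // cell_size), int(y // cell_size), int(z // cell_size))].append(idx)
    return cells

def contact_pairs(positions, radius):
    """Return index pairs closer than 2*radius, searching only neighbor cells."""
    cell_size = 2.0 * radius
    cells = build_cells(positions, cell_size)
    pairs = []
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in cells.get((cx + dx, cy + dy, cz + dz), []):
                for i in members:
                    if i < j:                 # count each pair once
                        px, py, pz = positions[i]
                        qx, qy, qz = positions[j]
                        d2 = (px - qx) ** 2 + (py - qy) ** 2 + (pz - qz) ** 2
                        if d2 < (2.0 * radius) ** 2:
                            pairs.append((i, j))
    return pairs

pts = [(0.0, 0.0, 0.0), (0.15, 0.0, 0.0), (1.0, 1.0, 1.0), (0.0, 0.18, 0.0)]
print(contact_pairs(pts, radius=0.1))
```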
This paper focuses on the design process for reconfigurable architecture. Our contribution introduces a new temporal partitioning algorithm based on a typical mathematical flow for solving the temporal partitioning problem. The algorithm optimizes the data transfers required between design partitions and the reconfiguration overhead. Results show that our algorithm considerably decreases the communication cost and the latency compared with other well-known algorithms.
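To make the cost terms concrete, the sketch below runs a greedy list-based temporal partitioner over a small dataflow graph under an area budget and counts the inter-partition data transfers that the paper's algorithm seeks to minimize. The greedy rule is a simplification, not the proposed mathematical formulation.

```python
# Greedy temporal partitioning of a dataflow graph under an area budget,
# reporting the inter-partition data transfers that drive reconfiguration cost.

from graphlib import TopologicalSorter

def temporal_partition(nodes_area, edges, area_budget):
    """nodes_area: {node: area}; edges: list of (src, dst) data dependences."""
    deps = {n: set() for n in nodes_area}
    for src, dst in edges:
        deps[dst].add(src)
    order = list(TopologicalSorter(deps).static_order())

    partitions, current, used = [], [], 0
    for node in order:                          # fill partitions in dependence order
        if used + nodes_area[node] > area_budget and current:
            partitions.append(current)
            current, used = [], 0
        current.append(node)
        used += nodes_area[node]
    partitions.append(current)

    where = {n: i for i, part in enumerate(partitions) for n in part}
    transfers = sum(1 for src, dst in edges if where[src] != where[dst])
    return partitions, transfers

area = {"a": 3, "b": 2, "c": 4, "d": 2, "e": 3}
deps = [("a", "c"), ("b", "c"), ("c", "d"), ("c", "e")]
parts, cost = temporal_partition(area, deps, area_budget=6)
print(parts, "inter-partition transfers:", cost)
```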
Reconfigurable computing tries to achieve a balance between the high efficiency of custom computing and the flexibility of general-purpose computing. This paper presents the implementation techniques in LEAP, a coarse-grained reconfigurable array, and proposes a speculative execution mechanism for dynamic loop scheduling with the goal of one iteration per cycle, together with implementation techniques to support decoupled synchronization between the token generator and the collector. This paper also introduces techniques for exploiting both intra- and inter-iteration data dependences, with the help of two instructions for special data reuse in loop-carried dependences. The experimental results show that the number of memory accesses is reduced to, on average, 3% of that of a RISC processor simulator with no memory optimization. In a practical image matching application, the LEAP architecture achieves a speedup of about 34 times in execution cycles compared with general-purpose processors.