Funding: Supported by the National High Technology Research and Development Program of China (Grant No. 2012AA012701).
Abstract: The computational capability of a coarse-grained reconfigurable array (CGRA) can be significantly restrained by data and context memory bandwidth bottlenecks. Traditionally, two methods have been used to resolve this problem. One method loads the context into the CGRA at run time; it occupies very little on-chip memory but induces very large latency, which leads to low computational efficiency. The other method adopts a multi-context structure: the context is loaded into on-chip context memory at the boot phase, and broadcasting the pointer of a set of contexts changes the hardware configuration on a cycle-by-cycle basis. The size of the context memory induces a large area overhead in multi-context structures, which severely restricts application complexity. This paper proposes a Predictable Context Cache (PCC) architecture to address these context issues by buffering the context inside the CGRA. In this architecture, context is dynamically transferred into the CGRA. Utilizing a PCC significantly reduces the on-chip context memory, and the complexity of the applications running on the CGRA is no longer restricted by its size. For the data bandwidth issue, data preloading is the most frequently used approach to hide input-data latency and speed up data transmission. Rather than fundamentally reducing the amount of input data, it processes data transfer and computation in parallel. However, data preloading cannot work efficiently because data transmission becomes the critical path as the reconfigurable array scales up. This paper therefore also presents a Hierarchical Data Memory (HDM) architecture as a solution to this efficiency problem. In this architecture, high internal bandwidth is provided to buffer both reused input data and intermediate data. HDM relieves external memory of the data transfer burden, so performance is significantly improved. Using PCC and HDM, experiments running mainstream video decoding programs achieved performance improvements of 13.57%–19.48% with a reasonable memory size. H.264 high-profile video decoding at 1080p@35.7 fps can be achieved on the PCC and HDM architecture at a 200 MHz working frequency. Further, the size of the on-chip context memory no longer restricts complex applications, which execute efficiently on the PCC and HDM architecture.
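The abstract does not spell out how the PCC is organized, but the core idea (buffer recently used contexts inside the array and fetch the rest on demand) can be illustrated with a toy cache model. The Python sketch below is purely illustrative; the direct-mapped placement, line count, and all names are assumptions, not the paper's design.

```python
# Minimal sketch of a context cache (illustrative only; the paper's actual
# PCC organization, indexing, and prefetch logic are not specified in this
# abstract, so every detail below is an assumption).

class ContextCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines    # which context ID occupies each line
        self.lines = [None] * num_lines   # cached context words
        self.hits = 0
        self.misses = 0

    def fetch(self, ctx_id, load_from_external):
        """Return the context for ctx_id, loading it on a miss."""
        idx = ctx_id % self.num_lines     # direct-mapped placement (assumed)
        if self.tags[idx] == ctx_id:
            self.hits += 1                # context already buffered in the CGRA
        else:
            self.misses += 1              # dynamic transfer from external memory
            self.lines[idx] = load_from_external(ctx_id)
            self.tags[idx] = ctx_id
        return self.lines[idx]

# Usage: replay a hypothetical per-cycle context-ID trace, report the hit rate.
cache = ContextCache(num_lines=64)
for cid in [0, 1, 2, 0, 1, 2, 3, 0]:
    cache.fetch(cid, load_from_external=lambda c: f"context-{c}")
print(f"hits={cache.hits} misses={cache.misses}")
```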
Funding: Supported by the Innovative Program of the Chinese Academy of Sciences (No. KGCY-SYW-407-02) and the Grand International Cooperation Foundation of the Shanghai Science and Technology Commission (No. 052207046).
Abstract: A fault-tolerant spaceborne mass memory architecture is presented, based entirely on commercial-off-the-shelf components. The highly modularized and scalable memory kernel supports hierarchical design and is well suited to a redundancy structure. Error correcting code (ECC) and periodic scrubbing are used to deal with bit errors induced by single-event upsets. For 8-bit-wide devices, the parallel Reed-Solomon (10, 8) code can perform coder/decoder calculations in one clock cycle, achieving a data rate of several Gb/s...
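As background for the RS(10, 8) code mentioned above: it appends 2 parity symbols to 8 data symbols and can correct any single-symbol error, which pairs naturally with 8-bit-wide devices. Below is a minimal functional sketch using the Python reedsolo package; it only mimics the code's behavior in software and says nothing about the paper's single-cycle hardware coder/decoder.

```python
# Minimal host-side sketch of RS(10, 8): 8 data bytes + 2 parity bytes,
# correcting any single-byte (symbol) error.  Functional illustration only,
# not the paper's hardware implementation.
from reedsolo import RSCodec

rsc = RSCodec(2)                       # 2 parity symbols -> RS(n, n-2)

data = bytearray(range(8))             # 8-byte word group from an 8-bit device
codeword = rsc.encode(data)            # 10 bytes: data + 2 parity
assert len(codeword) == 10

codeword[3] ^= 0xFF                    # inject a single-event-upset-style error

# Recent reedsolo versions return (message, message+ecc, errata positions).
repaired = rsc.decode(codeword)[0]
assert bytes(repaired) == bytes(data)  # the single-symbol error is corrected
print("corrected:", repaired.hex())
```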
Funding: Supported by the National High Technology Research and Development Program of China (863 Program) (YJCB2008023WL).
Abstract: Efficiency and linearity of the microwave power amplifier are critical for mobile communication systems. A memory polynomial baseband predistorter based on an indirect learning architecture is presented for improving the linearity of an envelope tracking (ET) amplifier in a wireless transmitter. To deal with the large peak-to-average ratio (PAR) problem, a clipping procedure is applied to the input signal. The system performance is then verified by simulation. For a single-carrier wideband code division multiple access (WCDMA) signal with 16-quadrature amplitude modulation (16-QAM), about a 2% improvement in error vector magnitude (EVM) is achieved at an average output power of 45.5 dBm and a gain of 10.6 dB, with an adjacent channel leakage ratio (ACLR) of -64.55 dBc at a 5 MHz frequency offset. Moreover, a three-carrier WCDMA signal and a third-generation (3G) Long Term Evolution (LTE) signal are used as test signals to demonstrate the performance of the proposed linearization scheme with signals of different bandwidths.
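For reference, the memory polynomial model used in this line of work commonly takes the form z(n) = sum over k=1..K, m=0..M of a_km x(n-m)|x(n-m)|^(k-1), and indirect learning fits the coefficients by least squares on a postdistorter that maps the amplifier output back to its input. A minimal NumPy sketch under those standard assumptions follows; K, M, the toy amplifier model, and all names are illustrative, not the paper's configuration (gain normalization of the feedback signal is omitted for brevity).

```python
# Minimal sketch of a memory polynomial and its indirect-learning
# least-squares fit.  The model z(n) = sum_{k,m} a_km x(n-m)|x(n-m)|^(k-1)
# is the standard form from the predistortion literature; K, M, and all
# names are assumptions, not the paper's exact configuration.
import numpy as np

def mp_basis(x, K, M):
    """Build the memory-polynomial regression matrix for input x."""
    N = len(x)
    cols = []
    for m in range(M + 1):
        xd = np.concatenate([np.zeros(m, dtype=complex), x[:N - m]])  # delay m
        for k in range(1, K + 1):
            cols.append(xd * np.abs(xd) ** (k - 1))
    return np.column_stack(cols)

def fit_postdistorter(pa_out, pa_in, K, M):
    """Indirect learning: regress PA input on PA output for DPD coefficients."""
    Phi = mp_basis(pa_out, K, M)
    a, *_ = np.linalg.lstsq(Phi, pa_in, rcond=None)
    return a

# Toy usage with a memoryless cubic "PA" nonlinearity (illustrative only).
rng = np.random.default_rng(0)
x = (rng.standard_normal(4096) + 1j * rng.standard_normal(4096)) * 0.2
y = x - 0.05 * x * np.abs(x) ** 2        # hypothetical PA behavior
a = fit_postdistorter(y, x, K=5, M=2)    # train postdistorter on (y -> x)
z = mp_basis(x, K=5, M=2) @ a            # copy coefficients to the predistorter
```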
Funding: This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62125202 and U22B2022.
Abstract: Image bitmaps, i.e., data containing pixels and visual perception, are widely used in emerging applications for pixel operations while consuming a great deal of memory space and energy. Compared with legacy DRAM (dynamic random access memory), non-volatile memories (NVMs) are suitable for bitmap storage due to their salient features of high density and intrinsic durability. However, writing NVMs suffers from higher energy consumption and latency than read accesses. Existing precise or approximate compression schemes in NVM controllers show limited performance for bitmaps due to the irregular data patterns and variance in bitmaps. We observe pixel-level similarity when writing bitmaps, owing to the analogous contents of adjacent pixels. By exploiting this pixel-level similarity, we propose SimCom, an approximate similarity-aware compression scheme in the NVM module controller, to efficiently compress data for each write access on the fly. The idea behind SimCom is to compress runs of consecutive similar words into pairs of base words and run lengths. The storage costs for small runs are further mitigated by reusing the least significant bits of the base words. SimCom adaptively selects an appropriate compression mode for various bitmap formats, thus achieving an efficient trade-off between quality and memory performance. We implement SimCom on GEM5/zsim with NVMain and evaluate its performance with real-world image/video workloads. Our results demonstrate the efficacy and efficiency of SimCom with an efficient quality-performance trade-off.
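From the description above, the core of SimCom is packing runs of similar consecutive words into (base word, run length) pairs. The Python sketch below illustrates only that base+run idea; the similarity test, word width, and threshold are assumptions, and the real scheme's adaptive mode selection and LSB reuse for small runs are not modeled.

```python
# Minimal sketch of similarity-aware base+run compression for one write block.
# A new word joins the current run if it differs from the run's base word only
# in its low-order bits (the threshold below is an assumption).

LSB_BITS = 4                              # assumed "approximation" width

def similar(a, b, lsb_bits=LSB_BITS):
    """True if two 32-bit words match in all but the low-order bits."""
    return (a >> lsb_bits) == (b >> lsb_bits)

def compress(words):
    """Compress a word sequence into (base, run_length) pairs."""
    pairs = []
    for w in words:
        if pairs and similar(pairs[-1][0], w):
            base, run = pairs[-1]
            pairs[-1] = (base, run + 1)   # extend the current run
        else:
            pairs.append((w, 1))          # start a new run with w as its base
    return pairs

def decompress(pairs):
    """Approximate inverse: every word in a run is replayed as its base."""
    return [base for base, run in pairs for _ in range(run)]

# Adjacent pixels with small low-bit differences collapse into one pair:
# two pairs result, base 0x11223340 with run 3 and base 0x0A0B0C0D with run 1.
block = [0x11223340, 0x11223342, 0x11223345, 0x0A0B0C0D]
print(compress(block))
```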
Funding: Project supported by the National Natural Science Foundation of China (Nos. 60976023, 61234003) and the Special Funds for the Major State Basic Research Project of China (No. 2011CB932902).
Abstract: This paper presents a novel compact memory in the processing element (PE) for single-instruction multiple-data (SIMD) vision chips. The PE memory is constructed with 8×8 register cells, where one latch in the slave stage is shared by eight latches in the master stage. The memory supports simultaneous read and write to the same address in one clock cycle. Its compact area of 14.33 μm²/bit promises a higher integration level for the processor. A prototype chip with a 64×64 PE array was fabricated in a UMC 0.18 μm CMOS technology. Five types of PE memory cell structure are designed and compared. The test results demonstrate that the proposed PE memory architecture satisfies the requirements of vision chips in high-speed real-time vision applications, such as 1000 fps edge extraction.
Funding: Supported in part by the U.S. National Science Foundation under Grant Nos. CCF-2008000, CNS-1730488, and CCF-2008907, and by the U.S. Department of Homeland Security under Grant No. 2017-ST-062-000002.
Abstract: Accesses Per Cycle (APC), Concurrent Average Memory Access Time (C-AMAT), and Layered Performance Matching (LPM) are three memory performance models that consider both data locality and memory access concurrency. The APC model measures the throughput of a memory architecture and therefore reflects the quality of service (QoS) of a memory system. The C-AMAT model provides a recursive expression for the memory access delay and can therefore be used to identify potential bottlenecks in a memory hierarchy. The LPM method transforms a global memory system optimization into localized optimizations at each memory layer by matching the data access demands of the applications with the underlying memory system design. These three models were proposed separately in prior work. This paper re-examines them under one coherent mathematical framework. More specifically, we present a new memory-centric view of data accesses: we divide the memory cycles at each memory layer into four distinct categories and use them to recursively define the memory access latency and concurrency along the memory hierarchy. This perspective offers new insights, with a clear formulation of memory performance that considers both locality and concurrency, so the performance model can be easily understood and applied in engineering practice. As such, the memory-centric approach helps establish a unified mathematical foundation for model-driven performance analysis and optimization of contemporary and future memory systems.
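To make the locality/concurrency distinction concrete: the classic AMAT ignores overlap, while C-AMAT (as introduced in the earlier C-AMAT work, C-AMAT = H/C_H + pMR * pAMP/C_M, with hit concurrency C_H, pure miss rate pMR, average pure miss penalty pAMP, and pure-miss concurrency C_M) divides each term by the concurrency it enjoys, and APC is its reciprocal. A Python sketch on made-up numbers:

```python
# Minimal sketch relating the three models on toy numbers.  The C-AMAT
# formula below is recalled from the prior C-AMAT literature, and every
# number is invented purely for illustration.

def amat(hit_time, miss_rate, miss_penalty):
    """Classic, concurrency-blind average memory access time (cycles)."""
    return hit_time + miss_rate * miss_penalty

def c_amat(hit_time, c_h, pure_miss_rate, pure_miss_penalty, c_m):
    """Concurrent AMAT: overlapped hits and misses shrink the effective cost."""
    return hit_time / c_h + pure_miss_rate * pure_miss_penalty / c_m

def apc(c_amat_cycles):
    """APC (accesses per cycle) is the reciprocal of C-AMAT."""
    return 1.0 / c_amat_cycles

t = amat(hit_time=3, miss_rate=0.05, miss_penalty=100)   # 8.0 cycles
c = c_amat(hit_time=3, c_h=4, pure_miss_rate=0.02,
           pure_miss_penalty=100, c_m=2)                 # 1.75 cycles
print(t, c, apc(c))   # concurrency turns 8.0 cycles into an effective 1.75
```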