The implementation method of the IEEE 802.11 Medium Access Control (MAC) protocol is mainly based on DSP (Digital Signal Processor)/ ARM (Advanced Reduced instruction set computer Machine) processor or DSP/ARM IP (Int...The implementation method of the IEEE 802.11 Medium Access Control (MAC) protocol is mainly based on DSP (Digital Signal Processor)/ ARM (Advanced Reduced instruction set computer Machine) processor or DSP/ARM IP (Intellectual Property) core. This paper presents a method based on Nios II soft-core processor embedded in Altera’s Cyclone FPGA (Field Programmable Gate Array) and MicroC/OS-II RTOS (Real-Time Operation System). The benefits and drawbacks of above methods are compared, and then the method presented in this paper is described. The hardware and software partitioning are discussed; the hardware architecture is also illustrated and the MAC software programming is described in detail. The presented method has some advantages, such as low cost, easy-implementation and very suitable for the implementation of IEEE 802.11 MAC in research stage.展开更多
Matrix multiplication plays a pivotal role in the symmetric cipher algorithms, but it is one of the most complex and time consuming units, its performance directly affects the efficiency of cipher algorithms. Combined...Matrix multiplication plays a pivotal role in the symmetric cipher algorithms, but it is one of the most complex and time consuming units, its performance directly affects the efficiency of cipher algorithms. Combined with the characteristics of VLIW processor and matrix multiplication of symmetric cipher algorithms, this paper extracted the reconfigurable elements and analyzed the principle of matrix multiplication, then designed the reconfigurable architecture of matrix multiplication of VLIW processor further, at last we put forward single instructions for matrix multiplication between 4×1 and 4×4 matrix or two 4×4 matrix over GF(2~8), through the instructions extension, the instructions could support larger dimension operations. The experiment shows that the instructions we designed supports different dimensions matrix multiplication and improves the processing speed of multiplication greatly.展开更多
Network processors are used in the core node of network to flexibly process packet streams. With the increase of performance, the power of network processor increases fast, and power and cooling become a bottleneck. A...Network processors are used in the core node of network to flexibly process packet streams. With the increase of performance, the power of network processor increases fast, and power and cooling become a bottleneck. Architecture-level power conscious design must go beyond low-level circuit design. Architectural power and performance tradeoff should be considered at the same time. Simulation is an efficient method to design modem network processor before making chip. In order to achieve the tradeoff between performance and power, the processor simulator is used to design the architecture of network processor. Using Netbeneh, Commubench benchmark and processor simulator-SimpleScalar, the performance and power of network processor are quantitatively evaluated. New performance tradeoff evaluation metric is proposed to analyze the architecture of network processor. Based on the high performance lnteI IXP 2800 Network processor eonfignration, optimized instruction fetch width and speed ,instruction issue width, instruction window size are analyzed and selected. Simulation resuits show that the tradeoff design method makes the usage of network processor more effectively. The optimal key parameters of network processor are important in architecture-level design. It is meaningful for the next generation network processor design.展开更多
The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Co...The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Consequently,the single-core processor cannot meet the requirements of LTE system.This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing.The test results prove that this approach works quite well.展开更多
Simulators are generally used during the design of computer architectures. Typically, different simulators with different levels of complexity, speed and accuracy are used. However, for early design space exploration,...Simulators are generally used during the design of computer architectures. Typically, different simulators with different levels of complexity, speed and accuracy are used. However, for early design space exploration, simulators with less complexity, high simulation speed and reasonable accuracy are desired. It is also required that these simulators have a short development time and that changes in the design require less effort in the implementation in order to perform experiments and see the effects of changes in the design. These simulators are termed high-level simulators in the context of computer architecture. In this paper, we present multiple levels of abstractions in a high-level simulation of a general-purpose many-core system, where the objective of every level is to improve the accuracy in simulation without significantly affecting the complexity and simulation speed.展开更多
A detailed experiment of 1-pixel bit reconfigurable ternary optical processor (TOP) is proposed in the paper. 42 basic operation units (BOUs) and 28 typical logic operators of the TOP are realized in the experimen...A detailed experiment of 1-pixel bit reconfigurable ternary optical processor (TOP) is proposed in the paper. 42 basic operation units (BOUs) and 28 typical logic operators of the TOP are realized in the experiment. Results of the test cases elaborately cover the every combination of BOUs and all the nine inputs of ternary processor. Both the experiment process and results analysis are given in this paper. The experimental results demonstrate that the theory of reconfiguring a TOP is valid and that the reconfiguration circuitry is effective.展开更多
A novel clock structure of a low-power 16-bit very large instruction word (VLIW) digital signal processor (DSP) was proposed. To improve deterministic clock gating and to solve the drawback of conventional clock gatin...A novel clock structure of a low-power 16-bit very large instruction word (VLIW) digital signal processor (DSP) was proposed. To improve deterministic clock gating and to solve the drawback of conventional clock gating circuit in high speed circuit, a distributed and early clock gating method was developed on its instruction fetch & decoder unit, its pipelined data-path unit and its super-Harvard memory interface unit. The core was implemented following the Synopsys back-end flow under TSMC (Taiwan Silicon manufacture corporation) 0.18-μm 1.8-V 1P6M process, with a core size of 2 mm×2 mm. Result shows that it can run under 200 MHz with a power performance around 0.3 mW/MIPS. Meanwhile, only 39.7% circuit is active simultaneously in average, compared to its non-gating counterparts.展开更多
In order to eliminate the energy waste caused by the traditional static hardware multithreaded processor used in real-time embedded system working in the low workload situation, the energy efficiency of the hardware m...In order to eliminate the energy waste caused by the traditional static hardware multithreaded processor used in real-time embedded system working in the low workload situation, the energy efficiency of the hardware multithread is discussed and a novel dynamic multithreaded architecture is proposed. The proposed architecture saves the energy wasted by removing idle threads without manipulation on the original architecture, fulfills a seamless switching mechanism which protects active threads and avoids pipeline stall during power mode switching. The report of an implemented dynamic multithreaded processor with 45 nm process from synthesis tool indicates that the area of dynamic multithreaded architecture is only 2.27% higher than the static one in achieving dynamic power dissipation, and consumes 1.3% more power in the same peak performance.展开更多
The paper proposes a low power non-volatile baseband processor with wake-up identification(WUI) receiver for LR-WPAN transceiver.It consists of WUI receiver,main receiver,transmitter,non-volatile memory(NVM) and power...The paper proposes a low power non-volatile baseband processor with wake-up identification(WUI) receiver for LR-WPAN transceiver.It consists of WUI receiver,main receiver,transmitter,non-volatile memory(NVM) and power management module.The main receiver adopts a unified simplified synchronization method and channel codec with proactive Reed-Solomon Bypass technique,which increases the robustness and energy efficiency of receiver.The WUI receiver specifies the communication node and wakes up the transceiver to reduce average power consumption of the transceiver.The embedded NVM can backup/restore the states information of processor that avoids the loss of the state information caused by power failure and reduces the unnecessary power of repetitive computation when the processor is waked up from power down mode.The baseband processor is designed and verified on a FPGA board.The simulated power consumption of processor is 5.1uW for transmitting and 28.2μW for receiving.The WUI receiver technique reduces the average power consumption of transceiver remarkably.If the transceiver operates 30 seconds in every 15 minutes,the average power consumption of the transceiver can be reduced by two orders of magnitude.The NVM avoids the loss of the state information caused by power failure and energy waste caused by repetitive computation.展开更多
A kind of pseudo Gray code presentation of test patterns based on accumulation generators is presented and a low power test scheme is proposed to test computational function modules with contiguous subspace in very la...A kind of pseudo Gray code presentation of test patterns based on accumulation generators is presented and a low power test scheme is proposed to test computational function modules with contiguous subspace in very large scale integration (VLSI), especially in digital signal processors (DSP). If test patterns from accumulators for the modules are encoded in the pseudo Gray code presentation, the switching activities of the modules are reduced, and the decrease of the test power consumption is resulted in. Results of experimentation based on FPGA show that the test approach can reduce dynamic power consumption by an average of 17.40% for 8-bit ripple carry adder consisting of 3-2 counters. Then implementation of the low power test in hardware is exploited. Because of the reuse of adders, introduction of additional XOR logic gates is avoided successfully. The design minimizes additional hardware overhead for test and needs no adjustment of circuit structure. The low power test can detect any combinational stuck-at fault within the basic building block without any degradation of original circuit performance.展开更多
Under the direction of design space theory,in this paper we discuss the design of a superscalar pipelining using the way of multiple issues,and the implement of a superscalar based RISC DSP architecture,SDSP.Furthermo...Under the direction of design space theory,in this paper we discuss the design of a superscalar pipelining using the way of multiple issues,and the implement of a superscalar based RISC DSP architecture,SDSP.Furthermore,in this paper we discuss the validity of instruction prefetch,the branch prediction,the depth of instruction window and other issues that can affect the performance of superscalar DSP.展开更多
Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which h...Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing ele- ments (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and share-memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.展开更多
As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performan...As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve N-core processors’ yield by providing M spare cores. In such architecture, topology is an important factor because it greatly affects the processors’ performance. The concept of logical topology and a topology reconfiguration problem are introduced, which is able to transparently provide target topology with lowest performance degradation as the presence of faulty cores on-chip. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that PRCS can give solutions with average 13.8% degradation with negligible computing time.展开更多
Purpose–The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core pr...Purpose–The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units(GPUs)and multi-core processors(MCPs).Design/methodology/approach–For distributed genetic algorithm(GA)models,the paper proposes a method where an island’s ID number is added to the header of data transferred by this island for use in fault detection.Findings–The paper has shown that the processing time of the proposed idea is practically negligible in applications and also shown that an optimal solution can be obtained even with a single stuck-at fault or a transient fault,and that increasing the number of parallel threads makes the system less susceptible to faults.Originality/value–The study described in this paper is a new approach to increase the sustainability of application program using distributed GA on GPUs and MCPs.展开更多
文摘The implementation method of the IEEE 802.11 Medium Access Control (MAC) protocol is mainly based on DSP (Digital Signal Processor)/ ARM (Advanced Reduced instruction set computer Machine) processor or DSP/ARM IP (Intellectual Property) core. This paper presents a method based on Nios II soft-core processor embedded in Altera’s Cyclone FPGA (Field Programmable Gate Array) and MicroC/OS-II RTOS (Real-Time Operation System). The benefits and drawbacks of above methods are compared, and then the method presented in this paper is described. The hardware and software partitioning are discussed; the hardware architecture is also illustrated and the MAC software programming is described in detail. The presented method has some advantages, such as low cost, easy-implementation and very suitable for the implementation of IEEE 802.11 MAC in research stage.
基金supported in part by open project foundation of State Key Laboratory of Cryptology National Natural Science Foundation of China (NSFC) under Grant No. 61272492, No. 61572521 and No. 61309008Natural Science Foundation for Young of Shaanxi Province under Grant No. 2013JQ8013
文摘Matrix multiplication plays a pivotal role in the symmetric cipher algorithms, but it is one of the most complex and time consuming units, its performance directly affects the efficiency of cipher algorithms. Combined with the characteristics of VLIW processor and matrix multiplication of symmetric cipher algorithms, this paper extracted the reconfigurable elements and analyzed the principle of matrix multiplication, then designed the reconfigurable architecture of matrix multiplication of VLIW processor further, at last we put forward single instructions for matrix multiplication between 4×1 and 4×4 matrix or two 4×4 matrix over GF(2~8), through the instructions extension, the instructions could support larger dimension operations. The experiment shows that the instructions we designed supports different dimensions matrix multiplication and improves the processing speed of multiplication greatly.
基金Sponsored by the National Defence Research Foundation of China(Grant No.413460303).
文摘Network processors are used in the core node of network to flexibly process packet streams. With the increase of performance, the power of network processor increases fast, and power and cooling become a bottleneck. Architecture-level power conscious design must go beyond low-level circuit design. Architectural power and performance tradeoff should be considered at the same time. Simulation is an efficient method to design modem network processor before making chip. In order to achieve the tradeoff between performance and power, the processor simulator is used to design the architecture of network processor. Using Netbeneh, Commubench benchmark and processor simulator-SimpleScalar, the performance and power of network processor are quantitatively evaluated. New performance tradeoff evaluation metric is proposed to analyze the architecture of network processor. Based on the high performance lnteI IXP 2800 Network processor eonfignration, optimized instruction fetch width and speed ,instruction issue width, instruction window size are analyzed and selected. Simulation resuits show that the tradeoff design method makes the usage of network processor more effectively. The optimal key parameters of network processor are important in architecture-level design. It is meaningful for the next generation network processor design.
文摘The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Consequently,the single-core processor cannot meet the requirements of LTE system.This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing.The test results prove that this approach works quite well.
文摘Simulators are generally used during the design of computer architectures. Typically, different simulators with different levels of complexity, speed and accuracy are used. However, for early design space exploration, simulators with less complexity, high simulation speed and reasonable accuracy are desired. It is also required that these simulators have a short development time and that changes in the design require less effort in the implementation in order to perform experiments and see the effects of changes in the design. These simulators are termed high-level simulators in the context of computer architecture. In this paper, we present multiple levels of abstractions in a high-level simulation of a general-purpose many-core system, where the objective of every level is to improve the accuracy in simulation without significantly affecting the complexity and simulation speed.
基金Project supported by the National Natural Science Foundation of China(Grant No.61073049)the Shanghai Leading Academic Discipline Project(Grant No.J50103)the Doctorate Foundation of Education Ministry of China(Grant No.20093108110016)
文摘A detailed experiment of 1-pixel bit reconfigurable ternary optical processor (TOP) is proposed in the paper. 42 basic operation units (BOUs) and 28 typical logic operators of the TOP are realized in the experiment. Results of the test cases elaborately cover the every combination of BOUs and all the nine inputs of ternary processor. Both the experiment process and results analysis are given in this paper. The experimental results demonstrate that the theory of reconfiguring a TOP is valid and that the reconfiguration circuitry is effective.
基金The Research Project of China Military Department (No6130325)
文摘A novel clock structure of a low-power 16-bit very large instruction word (VLIW) digital signal processor (DSP) was proposed. To improve deterministic clock gating and to solve the drawback of conventional clock gating circuit in high speed circuit, a distributed and early clock gating method was developed on its instruction fetch & decoder unit, its pipelined data-path unit and its super-Harvard memory interface unit. The core was implemented following the Synopsys back-end flow under TSMC (Taiwan Silicon manufacture corporation) 0.18-μm 1.8-V 1P6M process, with a core size of 2 mm×2 mm. Result shows that it can run under 200 MHz with a power performance around 0.3 mW/MIPS. Meanwhile, only 39.7% circuit is active simultaneously in average, compared to its non-gating counterparts.
基金supported partially by the National High Technical Research and Development Program of China (863 Program) under Grants No. 2011AA040101, No. 2008AA01Z134the National Natural Science Foundation of China under Grants No. 61003251, No. 61172049, No. 61173150+2 种基金the Doctoral Fund of Ministry of Education of China under Grant No. 20100006110015Beijing Municipal Natural Science Foundation under Grant No. Z111100054011078the 2012 Ladder Plan Project of Beijing Key Laboratory of Knowledge Engineering for Materials Science under Grant No. Z121101002812005
文摘In order to eliminate the energy waste caused by the traditional static hardware multithreaded processor used in real-time embedded system working in the low workload situation, the energy efficiency of the hardware multithread is discussed and a novel dynamic multithreaded architecture is proposed. The proposed architecture saves the energy wasted by removing idle threads without manipulation on the original architecture, fulfills a seamless switching mechanism which protects active threads and avoids pipeline stall during power mode switching. The report of an implemented dynamic multithreaded processor with 45 nm process from synthesis tool indicates that the area of dynamic multithreaded architecture is only 2.27% higher than the static one in achieving dynamic power dissipation, and consumes 1.3% more power in the same peak performance.
基金supported in part by the National Natural Science Foundation of China(No.61306027)
文摘The paper proposes a low power non-volatile baseband processor with wake-up identification(WUI) receiver for LR-WPAN transceiver.It consists of WUI receiver,main receiver,transmitter,non-volatile memory(NVM) and power management module.The main receiver adopts a unified simplified synchronization method and channel codec with proactive Reed-Solomon Bypass technique,which increases the robustness and energy efficiency of receiver.The WUI receiver specifies the communication node and wakes up the transceiver to reduce average power consumption of the transceiver.The embedded NVM can backup/restore the states information of processor that avoids the loss of the state information caused by power failure and reduces the unnecessary power of repetitive computation when the processor is waked up from power down mode.The baseband processor is designed and verified on a FPGA board.The simulated power consumption of processor is 5.1uW for transmitting and 28.2μW for receiving.The WUI receiver technique reduces the average power consumption of transceiver remarkably.If the transceiver operates 30 seconds in every 15 minutes,the average power consumption of the transceiver can be reduced by two orders of magnitude.The NVM avoids the loss of the state information caused by power failure and energy waste caused by repetitive computation.
基金supported by the National Natural Science Foundation of China under Grant No.90407007University Science Foundation of China under Grant No R0820207
文摘A kind of pseudo Gray code presentation of test patterns based on accumulation generators is presented and a low power test scheme is proposed to test computational function modules with contiguous subspace in very large scale integration (VLSI), especially in digital signal processors (DSP). If test patterns from accumulators for the modules are encoded in the pseudo Gray code presentation, the switching activities of the modules are reduced, and the decrease of the test power consumption is resulted in. Results of experimentation based on FPGA show that the test approach can reduce dynamic power consumption by an average of 17.40% for 8-bit ripple carry adder consisting of 3-2 counters. Then implementation of the low power test in hardware is exploited. Because of the reuse of adders, introduction of additional XOR logic gates is avoided successfully. The design minimizes additional hardware overhead for test and needs no adjustment of circuit structure. The low power test can detect any combinational stuck-at fault within the basic building block without any degradation of original circuit performance.
文摘Under the direction of design space theory,in this paper we discuss the design of a superscalar pipelining using the way of multiple issues,and the implement of a superscalar based RISC DSP architecture,SDSP.Furthermore,in this paper we discuss the validity of instruction prefetch,the branch prediction,the depth of instruction window and other issues that can affect the performance of superscalar DSP.
文摘Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing ele- ments (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and share-memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
基金the National Natural Science Foundation of China (Nos. 60633060, 60606008, and 60576031)the National Key Basic Research and Development (973) Program of China (973)(Nos. 2005CB321604 and 2005CB321605)the fund of Chinese Academy of Sciences (No. 20074010) due to the President Scholarship
文摘As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve N-core processors’ yield by providing M spare cores. In such architecture, topology is an important factor because it greatly affects the processors’ performance. The concept of logical topology and a topology reconfiguration problem are introduced, which is able to transparently provide target topology with lowest performance degradation as the presence of faulty cores on-chip. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that PRCS can give solutions with average 13.8% degradation with negligible computing time.
文摘Purpose–The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units(GPUs)and multi-core processors(MCPs).Design/methodology/approach–For distributed genetic algorithm(GA)models,the paper proposes a method where an island’s ID number is added to the header of data transferred by this island for use in fault detection.Findings–The paper has shown that the processing time of the proposed idea is practically negligible in applications and also shown that an optimal solution can be obtained even with a single stuck-at fault or a transient fault,and that increasing the number of parallel threads makes the system less susceptible to faults.Originality/value–The study described in this paper is a new approach to increase the sustainability of application program using distributed GA on GPUs and MCPs.