Abstract: This paper describes a microprogrammed architecture for an embedded coprocessor that can control IEEE 1149.1 to IEEE 1149.7 test infrastructures, and explains how to expand the supported test command set. The coprocessor uses a fast simplex link (FSL) channel to interface with a 32-bit MicroBlaze CPU, but it can work with any microprocessor core that accepts this simple FIFO-based interface method. The implementation cost (logic resource usage on a Xilinx Spartan-6 FPGA) and the performance data (operating frequency) are presented for a test command set comprising two parts: 1) the full IEEE 1149.1 structural test operations; 2) a subset of IEEE 1149.7 operations selected to illustrate the implementation of advanced scan formats.
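As background for the IEEE 1149.1 operations mentioned above, the following is a minimal sketch of the standard 16-state TAP controller that any such test controller has to drive through the TMS line. It is not the paper's microcode, which the abstract does not give; the names `TAP_NEXT` and `walk` are hypothetical and chosen only for illustration.

```python
# Standard IEEE 1149.1 TAP controller state diagram.
# Each state maps to (next state when TMS=0, next state when TMS=1).
# The table name and helper below are illustrative assumptions.
TAP_NEXT = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE":    ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_DR_SCAN":   ("CAPTURE_DR", "SELECT_IR_SCAN"),
    "CAPTURE_DR":       ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR":         ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR":         ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR":         ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR":         ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_IR_SCAN":   ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR":       ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR":         ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR":         ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR":         ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR":         ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
}


def walk(state, tms_bits):
    """Return the TAP state reached after one TCK cycle per TMS bit."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state
```

For example, `walk("TEST_LOGIC_RESET", [0, 1, 0, 0])` returns `"SHIFT_DR"`, and five consecutive TMS=1 cycles return the controller to `TEST_LOGIC_RESET` from any state.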
Funding: Supported by the National Natural Science Foundation of China (69973034) and the National High Technology Research and Development Program of China (2002AA141050)
Abstract: A GF(p) elliptic curve cryptographic coprocessor is proposed and implemented on a Field Programmable Gate Array (FPGA). The coprocessor focuses on point multiplication, the most critical, complicated and time-consuming operation. Coordinate conversion and a fast multiplication algorithm for two large integers are used to avoid frequent inversions and to accelerate the field multiplications used in point multiplication. Hardware parallelism is taken into account in the implementation of point multiplication. The coprocessor implemented on a XILINX XC2V3000 computes a point multiplication for an arbitrary point on a curve defined over GF(2^(192) − 2^(64) − 1) at a frequency of 10 MHz in 4.40 ms in the average case and 5.74 ms in the worst case. Under the same conditions, the coprocessor implemented on a XILINX XC2V4000 takes 2.2 ms in the average case and 2.88 ms in the worst case.
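The idea of converting coordinates to avoid inversions can be sketched in software. The code below is a minimal illustration that assumes Jacobian projective coordinates and a plain left-to-right double-and-add loop; the abstract does not state which coordinate system or scalar-multiplication algorithm the hardware uses, and all function names are hypothetical. It needs Python 3.8 or later for `pow(Z, -1, p)`.

```python
INFINITY = (1, 1, 0)  # Jacobian point at infinity (Z == 0)


def jacobian_double(P, a, p):
    """Double a Jacobian point on y^2 = x^3 + a*x + b over GF(p)."""
    X, Y, Z = P
    if Z == 0 or Y == 0:
        return INFINITY
    Y2 = Y * Y % p
    S = 4 * X * Y2 % p
    M = (3 * X * X + a * pow(Z, 4, p)) % p
    X3 = (M * M - 2 * S) % p
    Y3 = (M * (S - X3) - 8 * Y2 * Y2) % p
    Z3 = 2 * Y * Z % p
    return (X3, Y3, Z3)


def jacobian_add(P, Q, a, p):
    """Add two Jacobian points without any field inversion."""
    X1, Y1, Z1 = P
    X2, Y2, Z2 = Q
    if Z1 == 0:
        return Q
    if Z2 == 0:
        return P
    Z1Z1, Z2Z2 = Z1 * Z1 % p, Z2 * Z2 % p
    U1, U2 = X1 * Z2Z2 % p, X2 * Z1Z1 % p
    S1, S2 = Y1 * Z2 * Z2Z2 % p, Y2 * Z1 * Z1Z1 % p
    if U1 == U2:
        # Same x-coordinate: either P == Q (double) or P == -Q (infinity).
        return jacobian_double(P, a, p) if S1 == S2 else INFINITY
    H, R = (U2 - U1) % p, (S2 - S1) % p
    H2 = H * H % p
    H3 = H * H2 % p
    V = U1 * H2 % p
    X3 = (R * R - H3 - 2 * V) % p
    Y3 = (R * (V - X3) - S1 * H3) % p
    Z3 = H * Z1 * Z2 % p
    return (X3, Y3, Z3)


def to_affine(P, p):
    """Convert back to affine (x, y); returns None for the point at infinity."""
    X, Y, Z = P
    if Z == 0:
        return None
    z_inv = pow(Z, -1, p)  # the single inversion of the whole computation
    z_inv2 = z_inv * z_inv % p
    return (X * z_inv2 % p, Y * z_inv2 * z_inv % p)


def scalar_mult(k, point, a, p):
    """Compute k * point with left-to-right double-and-add."""
    result = INFINITY
    base = (point[0], point[1], 1)
    for bit in bin(k)[2:]:
        result = jacobian_double(result, a, p)
        if bit == "1":
            result = jacobian_add(result, base, a, p)
    return to_affine(result, p)
```

On the small textbook curve y^2 = x^3 + 2x + 2 over GF(17) with base point (5, 1), `scalar_mult(2, (5, 1), 2, 17)` returns `(6, 3)`. The same functions accept the prime p = 2**192 - 2**64 - 1 from the abstract once the corresponding curve parameters are supplied.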
Funding: This project received funding from the NSTIP Strategic Technologies program under Grant Number 14-415 ELE1448-10, King Abdul Aziz City of Science and Technology of the Kingdom of Saudi Arabia.
Abstract: We propose a flexible coprocessor key-authentication architecture for 80/112-bit security-related applications over the GF(2^(m)) field, employing the elliptic-curve Diffie-Hellman (ECDH) protocol. For flexibility, a serial input/output interface is used to load and produce the secret, public, and shared keys sequentially. Moreover, to reduce hardware resources and to achieve a reasonable time for cryptographic computations, we propose a finite field digit-serial multiplier architecture using combined shift and accumulate techniques. Furthermore, two finite-state-machine controllers are used to perform the control functions efficiently. The proposed coprocessor architecture over GF(2^(163)) and GF(2^(233)) is described in Verilog and then implemented on a Xilinx Virtex-7 FPGA (field-programmable gate array) device. For GF(2^(163)) and GF(2^(233)), the proposed flexible coprocessor uses 1351 and 1789 slices, the achieved clock frequency is 250 and 235 MHz, the time for one public key computation is 40.50 and 79.20 μs, and the time for one shared key generation is 81.00 and 158.40 μs, respectively. Similarly, the power consumed over GF(2^(163)) and GF(2^(233)) is 0.91 and 1.37 mW, respectively. The proposed coprocessor architecture outperforms state-of-the-art ECDH designs in terms of hardware resources.
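A digit-serial shift-and-accumulate multiplier can be modeled in a few lines of software. The sketch below is a generic most-significant-digit-first version and is not the paper's architecture, whose details the abstract does not give. The function names, the default digit size of 4 bits, and the choice of the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 for GF(2^(163)) are all assumptions made for illustration.

```python
def _times_x(v, poly, m):
    """Multiply v by x in GF(2^m); poly includes the x^m term."""
    v <<= 1
    if (v >> m) & 1:
        v ^= poly
    return v


def gf2m_mul_digit_serial(a, b, poly, m, digit=4):
    """Return a*b in GF(2^m), consuming `digit` bits of b per step.

    Operands are integers below 2**m whose bits are polynomial
    coefficients. Each step shifts the accumulator by one digit and
    then accumulates the partial product of a with the current digit.
    """
    n_digits = -(-m // digit)  # ceiling division
    acc = 0
    for i in reversed(range(n_digits)):
        # Shift: acc <- acc * x^digit (mod poly)
        for _ in range(digit):
            acc = _times_x(acc, poly, m)
        d = (b >> (i * digit)) & ((1 << digit) - 1)
        # Accumulate: acc <- acc + a * d (mod poly)
        partial = a
        for j in range(digit):
            if (d >> j) & 1:
                acc ^= partial
            partial = _times_x(partial, poly, m)
    return acc


# Assumed reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1
POLY_163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
```

With `digit=1` the routine degenerates to a bit-serial multiplier, so larger digit sizes trade more logic per step for fewer steps. As a small check in GF(2^8) with the AES polynomial, `gf2m_mul_digit_serial(0x57, 0x83, 0x11B, 8)` returns `0xC1`.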
Abstract: With the increasing demand for flexible and efficient implementations of image and video processing algorithms, a good tradeoff between hardware and software design methods is needed. This paper uses a HW-SW codesign method to implement an H.264 decoder in an SoC with an ARM core, a multimedia processor, and a deblocking filter coprocessor. Thanks to the parallel processing features of the multimedia processor, the number of clock cycles in the decoding process can be dramatically reduced, and the dedicated hardware deblocking filter coprocessor further improves efficiency. With a maximum clock frequency of 150 MHz, the whole system achieves real-time processing speed and flexibility.
Abstract: The paper describes an efficient direct method to solve an equation Ax = b, where A is a sparse matrix, on the Intel® Xeon Phi™ coprocessor. The main challenge for such a system is how to engage all available threads (about 240) and how to reduce OpenMP* synchronization overhead, which is very expensive for hundreds of threads. The method consists of decomposing A into a product of lower-triangular, diagonal, and upper-triangular matrices, followed by solves of the resulting three subsystems. The main idea is based on the hybrid parallel algorithm used in the Intel® Math Kernel Library Parallel Direct Sparse Solver for Clusters [1]. Our implementation exploits a static scheduling algorithm during the factorization step to reduce OpenMP synchronization overhead. To engage all available threads effectively, a three-level parallelization approach is used. Furthermore, we demonstrate that our implementation can perform up to 100 times better on the factorization step and up to 65 times better in terms of overall performance on the 240 threads of the Intel® Xeon Phi™ coprocessor.
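The numerical core of the method, factoring A into lower-triangular, diagonal, and upper-triangular factors and then solving three subsystems, can be sketched as follows. This is a serial, dense, no-pivoting illustration under those simplifying assumptions; it shows none of the sparse storage, static scheduling, or three-level parallelism the abstract describes, and the function names are hypothetical.

```python
def ldu_factor(A):
    """Factor A = L * D * U (no pivoting).

    L is unit lower-triangular, U is unit upper-triangular, and D is
    returned as a list holding the diagonal entries.
    """
    n = len(A)
    W = [row[:] for row in A]  # working copy, reduced in place
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n):
        if W[k][k] == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for i in range(k + 1, n):
            L[i][k] = W[i][k] / W[k][k]
            for j in range(k, n):
                W[i][j] -= L[i][k] * W[k][j]
    D = [W[k][k] for k in range(n)]
    U = [[W[i][j] / D[i] if j >= i else 0.0 for j in range(n)]
         for i in range(n)]
    return L, D, U


def ldu_solve(L, D, U, b):
    """Solve L*D*U*x = b through three successive subsystems."""
    n = len(b)
    # 1) forward substitution: L z = b
    z = [0.0] * n
    for i in range(n):
        z[i] = b[i] - sum(L[i][j] * z[j] for j in range(i))
    # 2) diagonal solve: D y = z
    y = [z[i] / D[i] for i in range(n)]
    # 3) backward substitution: U x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))
    return x
```

A typical call is `L, D, U = ldu_factor(A)` followed by `x = ldu_solve(L, D, U, b)`. The factorization is done once and can be reused for several right-hand sides, which is why the factorization and solve steps are usually timed separately.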
Abstract: It is difficult for the on-board microprocessors of existing Automated External Defibrillators (AEDs) to accurately classify electrocardiographic signals (ECGs) mixed with cardiopulmonary resuscitation (CPR) artifacts in real time. To improve the recognition speed and accuracy for ECGs containing CPR artifacts, a new dedicated coprocessor system-on-chip (SoC) for defibrillators was designed. In this study, a microprocessor based on the RISC-V architecture was designed to provide hardware acceleration for ECG classification. In addition, an integrated algorithm combining Approximate Entropy (ApEn) and convolutional neural networks (CNNs) that can run on it was designed. Unlike traditional ECG classification algorithms, it can classify ECGs while chest compressions are being applied. The proposed coprocessor accelerates the computation of ApEn by a factor of 34 compared with pure software computation, and accelerates CNN-based ECG recognition by a factor of 33. The combined algorithm was used to classify ECGs with CPR artifacts and achieved a precision of 96%, significantly better than that of a CNN alone. The coprocessor can significantly improve the recognition efficiency and accuracy for ECGs containing CPR artifacts. It is suitable for automated external defibrillators and other medical devices that process one-dimensional physiological signals.
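For readers unfamiliar with Approximate Entropy, the following is a minimal software sketch of the standard textbook definition, which is the kind of computation the coprocessor is said to accelerate. It is not the paper's implementation. The function name and the default parameters `m=2` and `r=0.2` are assumptions, and `r` is treated here as an absolute tolerance.

```python
import math


def approximate_entropy(signal, m=2, r=0.2):
    """Approximate Entropy ApEn(m, r) of a 1-D sequence.

    Compares how often templates of length m that match within
    tolerance r (Chebyshev distance) still match at length m + 1.
    Self-matches are counted, so every logarithm is defined.
    """
    n = len(signal)

    def phi(dim):
        count = n - dim + 1
        templates = [signal[i:i + dim] for i in range(count)]
        total = 0.0
        for t in templates:
            matches = sum(
                1 for u in templates
                if max(abs(x - y) for x, y in zip(t, u)) <= r
            )
            total += math.log(matches / count)
        return total / count

    return phi(m) - phi(m + 1)
```

Regular signals give values near zero and irregular signals give larger values, which is what makes the measure usable as a feature for rhythm classification. The double loop over templates costs O(N^2) comparisons per call, which suggests why a hardware accelerator is attractive for real-time use.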
Funding: Supported by the National Natural Science Foundation of China (No. 60576027) and the National High-Tech Research and Development (863) Program of China (No. 2006AA01Z415)
Abstract: Most existing system-on-chip (SoC) architectures are for microprocessor-centric designs. They are not suitable for computing-intensive SoCs, which have their own configurability, extendibility, performance, and data exchange characteristics. This paper analyzes these characteristics and gives design principles for computing-intensive SoCs. Three architectures suitable for different situations are compared, and selection criteria are given. The architectural design of a high performance network security accelerator (HPNSA) is used to elaborate on the design techniques that fully exploit the performance potential of the architectures. A behavior-level simulation system is implemented in the C++ programming language to evaluate the HPNSA performance and to obtain the optimum system design parameters. Simulations show that this architecture provides high performance data transfer.