Abstract: The emerging mobile robot industry has spurred a flurry of interest in solving the simultaneous localization and mapping (SLAM) problem. However, existing SLAM platforms have difficulty meeting the real-time and low-power requirements imposed by mobile systems. Although specialized hardware is promising for achieving high performance and lowering power, designing an efficient accelerator for SLAM is severely hindered by the wide variety of SLAM algorithms. Based on our detailed analysis of representative SLAM algorithms, we observe that SLAM algorithms pose two challenges for designing efficient hardware accelerators: the large number of computational primitives and irregular control flows. To address these two challenges, we propose a hardware accelerator that features composable computation units classified as matrix, vector, scalar, and control units. In addition, we design a hierarchical instruction set for coping with a broad range of SLAM algorithms with irregular control flows. Experimental results show that, compared against an Intel x86 processor, our accelerator with an area of 7.41 mm² achieves, on average, 10.52x and 112.62x better performance and energy savings, respectively, across different datasets. Compared against a more energy-efficient ARM Cortex processor, our accelerator still achieves 33.03x and 62.64x better performance and energy savings, respectively.
Abstract: Secure computing paradigms impose new architectural challenges for general-purpose processors. Cryptographic processing is needed for secure communications, storage, and computations. We identify two categories of operations in symmetric-key and public-key cryptographic algorithms that are not common in previous general-purpose workloads: advanced bit operations within a word and multi-word operations. We define MOMR (Multiple Operands Multiple Results) execution, or datarich execution, as a unified solution to both challenges. It allows arbitrary n-bit permutations to be achieved in one or two cycles, rather than O(n) cycles as in existing RISC processors. It also enables significant acceleration of the multi-word multiplications needed by public-key ciphers. We propose two implementations of MOMR: one employs only hardware changes, while the other uses Instruction Set Architecture (ISA) support. We show that MOMR execution leverages available resources in typical multi-issue processors with minimal additional cost. Multi-issue processors enhanced with MOMR units provide additional speedup over standard multi-issue processors with the same datapath. MOMR is a general architectural solution for word-oriented processor architectures to incorporate datarich operations.
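To illustrate why an arbitrary n-bit permutation costs O(n) steps on a conventional instruction set, the following minimal Python sketch moves one bit per loop iteration. The function name `permute_bits` and the permutation encoding (a sequence mapping each destination bit to a source bit) are assumptions made for illustration; they are not taken from the MOMR proposal.

```python
from typing import Sequence


def permute_bits(word: int, perm: Sequence[int]) -> int:
    """Return a word whose bit i equals bit perm[i] of `word`.

    Each iteration extracts one source bit and deposits it at its
    destination, so an n-bit permutation takes n iterations. This mirrors
    the shift/mask/or sequence a conventional RISC core would execute.
    """
    result = 0
    for dest, src in enumerate(perm):
        bit = (word >> src) & 1   # extract the source bit
        result |= bit << dest     # deposit it at the destination position
    return result


# Hypothetical usage: reverse the bit order of an 8-bit word.
reverse8 = list(range(7, -1, -1))
reversed_word = permute_bits(0b00001101, reverse8)
```

The loop body runs once per bit, which is the linear cost that a one- or two-cycle permutation unit would avoid.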
Abstract: The combination of growing transistor counts and a limited power budget within a silicon die leads to the utilization wall problem (a.k.a. "Dark Silicon"); that is, only a small fraction of a chip can run at full speed during a given period of time. Designing accelerators for specific applications or algorithms is considered to be one of the most promising approaches to improving energy efficiency. However, most current design methods for accelerators are dedicated to certain applications or algorithms, which greatly constrains their applicability. In this paper, we propose a novel general-purpose many-accelerator architecture. Our contributions are two-fold. Firstly, we propose to cluster dataflow graphs (DFGs) of hotspot basic blocks (BBs) in applications. The DFG clusters are then used for accelerator design. This is because a DFG is the largest program unit that is not specific to a certain application. We analyze 17 benchmarks in SPEC CPU 2006, acquire over 300 hotspot DFGs by using the LLVM compiler tool, and divide them into 15 clusters based on graph similarity. Secondly, we introduce a function instruction set architecture (FISC) and illustrate how DFG accelerators can be integrated with a processor core and how they can be used by applications. Our results show that the proposed DFG clustering and FISC design can speed up SPEC benchmarks by 6.2x on average.
Abstract: Microprocessor development emphasizes hardware and software co-design. HW/SW co-design is a modern technique aimed at shortening the time to market when designing real-time and embedded systems. A key feature of this approach is the simultaneous development of the program tools and the target processor to match the software application. An effective co-design flow must therefore support the automatic generation of software toolkits without loss of optimization efficiency. This has resulted in a paradigm shift towards a language-based design methodology for microprocessor optimization and exploration. This paper proposes a formal grammar, UNI SPEC, which describes the translation rules from assembly to binary and supports the automatic generation of assemblers. Based on UNI SPEC, it implements two typical applications: automatically generating the assembler and the test suites.
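The idea of driving an assembler from declarative assembly-to-binary translation rules can be sketched roughly as follows. This is a minimal illustration, not UNI SPEC itself: the rule table format, the three mnemonics, the 16-bit word layout, and the helper names `parse_operand` and `assemble_line` are all hypothetical.

```python
import re

# Hypothetical translation rules: mnemonic -> (4-bit opcode, operand field widths).
# This rule format is an assumption for illustration; it is not UNI SPEC syntax.
RULES = {
    "add": (0x1, (4, 4, 4)),   # add rd, rs, rt
    "ld":  (0x2, (4, 4, 4)),   # ld  rd, rs, imm4
    "jmp": (0x3, (12,)),       # jmp imm12
}


def parse_operand(text):
    """Convert a register such as 'r3' or a decimal/hex immediate to an integer."""
    text = text.strip()
    if re.fullmatch(r"r\d+", text):
        return int(text[1:])
    return int(text, 0)


def assemble_line(line):
    """Translate one assembly line into a 16-bit word by looking up RULES."""
    mnemonic, _, rest = line.strip().partition(" ")
    opcode, widths = RULES[mnemonic]
    operands = [parse_operand(tok) for tok in rest.split(",")] if rest.strip() else []
    if len(operands) != len(widths):
        raise ValueError(f"{mnemonic}: expected {len(widths)} operands")
    word = opcode
    for value, width in zip(operands, widths):
        if not 0 <= value < (1 << width):
            raise ValueError(f"operand {value} does not fit in {width} bits")
        word = (word << width) | value   # append the field below the opcode
    return word


# Hypothetical usage.
words = [assemble_line(s) for s in ("add r1, r2, r3", "ld r1, r2, 5", "jmp 0x2a")]
```

Because the encoder only consults the rule table, changing the instruction set means editing the table rather than the assembler code, which is the property that makes automatic tool generation plausible.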
Abstract: To achieve high ASIP performance, this paper discusses different aspects of an ASIP instruction set specification, such as syntax, encoding, constraints, and behaviors, and introduces our ADL-model-based methodology for checking them. The automatic generation of test cases based on our straightforward instruction representation is shown, as is their efficient generation with good coverage. The constraint checker, a very important tool for programmers, is also verified. Results show that the toolkit can find errors in previously delivered tools, and that the introduced methodology verifies the feasibility of our instruction set specification.
Funding: Project (12R21414600) supported by the Shanghai Municipal Science and Technology Commission, China
Abstract: A new, efficient, adaptive virtual intermediate instruction set, V-IIS, is designed and implemented for an optimized dynamic binary translator (DBT) system. With the help of this powerful but previously little-studied component, DBTs can not only remove their dependence on specific machine(s) but also achieve better performance. Our systematic study and experimental evaluation demonstrate that a well-designed V-IIS, without affecting the other optimization measures, can bring a DBT's performance close to that of translators without intermediate instructions. This study is an important step towards the grand goal of high-performance "multi-source" and "multi-target" dynamic binary translation.
Abstract: The correlation among the instructions of a motor control digital signal processor (MCDSP) in the design is analyzed. A method is given for obtaining a minimum instruction set that replaces the complete instruction set during the generation of test procedures, in terms of the processor representation matrix between the micro-operations and instructions of the MCDSP.
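One way to picture selecting a reduced instruction set from an instruction/micro-operation matrix is a greedy cover, sketched below. The abstract does not describe the paper's actual procedure, so this is a swapped-in, well-known heuristic that does not guarantee a true minimum; the matrix contents, micro-operation names, and the function name `greedy_instruction_cover` are hypothetical.

```python
def greedy_instruction_cover(matrix):
    """Pick instructions until every micro-operation is exercised.

    `matrix` maps an instruction name to the set of micro-operations it
    uses (one row of a hypothetical instruction/micro-operation matrix).
    Greedy selection is a heuristic: the result covers all
    micro-operations but is not guaranteed to be the smallest such set.
    """
    uncovered = set().union(*matrix.values())
    chosen = []
    while uncovered:
        # Choose the instruction that exercises the most uncovered
        # micro-operations; ties are broken alphabetically.
        best = max(sorted(matrix), key=lambda ins: len(matrix[ins] & uncovered))
        chosen.append(best)
        uncovered -= matrix[best]
    return chosen


# Hypothetical matrix rows for illustration only.
example = {
    "MAC":  {"fetch", "mul", "add", "writeback"},
    "ADD":  {"fetch", "add", "writeback"},
    "LOAD": {"fetch", "mem_read", "writeback"},
    "MUL":  {"fetch", "mul", "writeback"},
}
subset = greedy_instruction_cover(example)
```

In this toy matrix, two instructions exercise every micro-operation, so tests would only need to target that subset rather than all four instructions.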
Funding: Supported by the National Natural Science Foundation of China (No. 60236020) and the Specialized Research Fund for the Doctoral Program of Higher Education of MOE, China (No. 20050003083)
Abstract: The cost of the central register file and the size of the program code limit the scalability of very long instruction word (VLIW) processors with increasing numbers of functional units. This paper presents the architectural design of a six-way VLIW digital signal processor (DSP) with clustered register files. The architecture uses a variable-length instruction set and supports dynamic instruction dispatching. The one-level memory system architecture of the processor includes 16-KB instruction and data caches and 16-KB instruction and data on-chip RAM. A compiler based on Open64 was developed for the system. Evaluations show that the processor is suitable for high-performance applications, with a high code density and small program code size.
Abstract: An Instruction Set Simulator (ISS) is a highly abstracted, executable model of a microarchitecture. It is widely used for verification and debugging during the development of microprocessors. However, with the emergence of Chip Multi-Processors, a single-core ISS cannot meet the needs of microprocessor development. In this paper, we first introduce our multi-core chip architecture and then propose a general methodology for extending a single-core ISS to a multi-core ISS (MCISS). On this basis, a real-time comparison environment is created for multi-core verification, and the problems of multi-core communication and synchronization are addressed gracefully. With the "save and restore" mechanism, verification and debugging are sped up greatly.
Funding: The Research Project of China Military Department (No. 6130325)
Abstract: This paper proposes an asynchronous complex pipeline based on the ARM-V3 instruction set. The Muller pipeline structure is used as a prototype, and the factors that may affect pipeline performance are analyzed. To balance the difficulty of asynchronous design and performance analysis, both completely asynchronous and partially asynchronous structures are designed and compared. Comparison with the well-known industrial product ARM922T shows that the partial and complete asynchronous complex pipelines achieve performance improvements of about 30% and 40%, respectively. The design methodologies can also be used in the design of other asynchronous pipelines.
Funding: Supported by the Industrial Internet Innovation and Development Project of the Ministry of Industry and Information Technology (No. GHBJ2004).
Abstract: A Taylor series expansion (TSE) based design for minimum mean-square error (MMSE) detection and QR decomposition (QRD) in multi-input multi-output (MIMO) systems is proposed on an application-specific instruction set processor (ASIP), which uses the TSE algorithm instead of resource-consuming reciprocal and reciprocal square root (RSR) operations. The aim is to give a high-performance implementation of both MMSE and QRD on one programmable platform. Furthermore, the instruction set architecture (ISA) and the allocation of data paths in a single instruction multiple data, very long instruction word (SIMD-VLIW) architecture are provided, offering more data parallelism and instruction parallelism for matrices of different dimensions and for different operation types. Meanwhile, multiple levels of numerical precision can be achieved through the flexible table size and expansion order in the TSE ISA. The ASIP has been implemented in a 28 nm CMOS process, and its frequency reaches 800 MHz. Experimental results show that the proposed design provides perfect numerical precision within the fixed bit-width of the ASIP, a matrix processing rate that exceeds the requirements of 5G systems, and a rate-area efficiency comparable with ASIC implementations.
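How a lookup table and a truncated Taylor series can stand in for a reciprocal unit, and how table size and expansion order trade off against precision, can be sketched as below. This is an illustration under stated assumptions rather than the paper's design: it uses floating point for readability instead of the ASIP's fixed bit-width, assumes the input is already range-reduced to [1, 2), and expands about segment midpoints; the names `make_table` and `tse_reciprocal` are hypothetical.

```python
def make_table(table_bits):
    """Precompute 1/x0 for the midpoint x0 of each of 2**table_bits segments of [1, 2)."""
    n = 1 << table_bits
    return [1.0 / (1.0 + (i + 0.5) / n) for i in range(n)]


def tse_reciprocal(x, table, order):
    """Approximate 1/x for x in [1, 2) with a Taylor series truncated at `order`.

    Uses 1/x = (1/x0) * sum_{k>=0} (-(x - x0)/x0)**k, where x0 is the
    midpoint of the table segment containing x. Only multiplications and
    additions are needed once 1/x0 has been read from the table.
    """
    n = len(table)
    i = int((x - 1.0) * n)           # segment index from the leading fraction bits
    x0 = 1.0 + (i + 0.5) / n         # expansion point (segment midpoint)
    inv_x0 = table[i]
    t = -(x - x0) * inv_x0           # series ratio; |t| <= 1/(2n)
    acc, term = 1.0, 1.0
    for _ in range(order):
        term *= t                    # next power of t
        acc += term
    return inv_x0 * acc


# Hypothetical usage: a 16-entry table with first- and third-order expansions.
table = make_table(4)
err_order1 = abs(tse_reciprocal(1.37, table, 1) - 1.0 / 1.37)
err_order3 = abs(tse_reciprocal(1.37, table, 3) - 1.0 / 1.37)
```

Raising either the table size or the expansion order shrinks the truncation error, which is one way to read the abstract's remark about multiple levels of numerical precision.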
Funding: Supported by the International S&T Cooperation Program of China (No. 2014DFR70280), the Key Research and Development Project of Shanxi Province (Nos. 201903D321012 and 201903D121023), and the Patent Promotion and Implementation Funding Research Project of Shanxi Province (No. 20210521).
Abstract: In the field of industrial process control, a fast-development platform for programmable logic controller (PLC) systems is designed to solve two main problems: the rapid development of PLC control systems and the programmability of the controlling software. The platform consists of hardware control and software monitoring, with an industrial computer as its core. Under the Windows environment, the platform establishes the control instruction set, develops the configuration and visual programming functions of the monitoring software, and integrates the PLC controller based on Visual Basic software. To achieve data monitoring, serial communication between the computer and the PLC is realized over RS-485 and RS-232 serial ports. The platform designs an intelligent instruction scheduling strategy by studying the encoding and decoding rules of the communication instruction set. It proposes a method for rapidly developing control programs by adopting the expert control mode, which enables clients to develop and modify programs conveniently by importing instructions in a non-coded manner. Experimental testing shows that the platform achieves both the rapid development of PLC control systems and the rapid modification of monitoring software.
Funding: Project (XDA-06010402) supported by the Strategic Priority Research Program of Chinese Academy of Sciences; Project (Y5S7061G51) supported by the Youth Innovation Promotion Association of Chinese Academy of Sciences
Abstract: 5G baseband signal processing places greater real-time and reliability requirements on hardware. Based on the architecture of the MaPU, a reconfigurable computing architecture is proposed according to the characteristics of 5G baseband signal processing. A dedicated instruction set for 5G baseband signal processing is proposed. The corresponding functional units are designed for reuse of hardware resources. A redirected register file is proposed to address latency and power consumption issues in the internetwork. A two-dimensional code compression scheme is proposed for cases in which the utilization of instruction memory is low. The access mode of the data memory is extended, which improves performance and reduces power consumption. The throughput of the 5G baseband processing algorithms is one to two orders of magnitude higher than that of the TMS320C6670, with less power consumption. The silicon area evaluated by layout is 5.8 mm², which is 1/6 of the MaPU's. The average power consumption is 0.7 W, which is 1/5 of the MaPU's.
Funding: Project supported by the Hi-Tech Research and Development Program (863) of China (No. 2002AA1Z1140) and the Fok Ying Tung Education Foundation (No. 94031), China
Abstract: The 32-bit extensible embedded processor RISC3200, originating from an RTL prototype core, is intended for low-cost consumer multimedia products. To incorporate the reduced instruction set and the multimedia extension instruction set in a unified pipeline, a scalable super-pipeline technique is adopted. Several other optimization techniques are proposed to boost the frequency and reduce the average CPI of the unified pipeline. Based on a data flow graph (DFG) with delay information, the critical path of a pipeline stage can be located and shortened. This paper presents a distributed data bypass unit and a centralized pipeline control scheme for achieving lower CPI. Synthesis and simulation showed that the optimization techniques enable RISC3200 to operate at 200 MHz with an average CPI of 1.16. The core was integrated into a media SOC chip taped out in SMIC 0.18-micron technology. Preliminary test results showed that the processor works as expected.
Abstract: A hardware-software co-simulation method for system on chip (SOC) design is discussed. It is based on an instruction set simulator (ISS) and an event-driven hardware simulator; a bus interface model written in C provides the interface between the two. The bus interface model and the ISS are linked into a single program, the software simulator, which communicates with the hardware simulator through Windows sockets. The implementation of the bus interface model and the synchronization between the hardware and software simulators are discussed in detail. Co-simulation control of the hardware simulator is also discussed.
Abstract: As the traditional RISC+ASIC/ASSP approach to network processor design cannot meet today's requirements, this paper describes an alternative approach, Reconfigurable Processing Architecture, which boosts performance to the ASIC level while preserving the programmability of the traditional RISC-based system. This paper covers both the hardware architecture and the software development environment architecture.
Abstract: We present a wireless endoscope system developed on the embedded Linux platform. The system consists of three parts: the endoscope capsule, the portable equipment, and the image workstation, which together make the system wireless and portable. The hardware design of the system and the software design of the portable equipment are described. In clinical use, patients swallow the endoscope capsule for gastrointestinal inspection. The portable equipment receives images of the gastrointestinal tract via the radio frequency module, while it sets and reads back parameters of the capsule via the low frequency module. Acquired image data can be saved as a file on a FAT32-formatted universal serial bus disk. Images can then be processed and reviewed on a Microsoft Windows PC workstation. The images acquired by this wireless endoscope system show clear details of the gastrointestinal tract. However, the endoscope capsule was large, making it difficult to swallow, so the printed circuit board design of the capsule needs to be improved.
Funding: Supported by the Basic Research Fund of Tsinghua University
Abstract: This paper illustrates the importance of the configuration of function units and of changes in an application's critical path when using instruction set extension (ISE) with multi-issue architectures. It also presents an approach for automatically identifying customized instructions for multi-issue architectures without constraints on the number of inputs and outputs. The approach identifies customized instructions using multiple attribute decision-making, based on the analysis of several attributes for each candidate node. Tests indicate that the approach achieves higher speedup ratios than previous approaches at a lower area cost. In addition, this approach provides designers with multiple candidate designs.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61925208, 61906179, U19B2019, and U20A20227; the Strategic Priority Research Program of Chinese Academy of Sciences under Grant No. XDB32050200; Beijing Academy of Artificial Intelligence (BAAI); the Chinese Academy of Sciences (CAS) Project for Young Scientists in Basic Research (YSBR-029); and the Youth Innovation Promotion Association CAS.
Abstract: The emerging mobile robot industry has spurred a flurry of interest in solving the simultaneous localization and mapping (SLAM) problem. However, existing SLAM platforms have difficulty in meeting the real-time and low-power requirements imposed by mobile systems. Though specialized hardware is promising with regard to achieving high performance and lowering the power, designing an efficient accelerator for SLAM is severely hindered by the wide variety of SLAM algorithms. Based on our detailed analysis of representative SLAM algorithms, we observe that SLAM algorithms pose two challenges for designing efficient hardware accelerators: the large number of computational primitives and irregular control flows. To address these two challenges, we propose a hardware accelerator that features composable computation units classified as the matrix, vector, scalar, and control units. In addition, we design a hierarchical instruction set for coping with a broad range of SLAM algorithms with irregular control flows. Experimental results show that, compared against an Intel x86 processor, our accelerator with an area of 7.41 mm² achieves on average 10.52x better performance and 112.62x better energy savings across different datasets. Compared against a more energy-efficient ARM Cortex processor, our accelerator still achieves 33.03x better performance and 62.64x better energy savings.
Abstract: Secure computing paradigms impose new architectural challenges for general-purpose processors. Cryptographic processing is needed for secure communications, storage, and computations. We identify two categories of operations in symmetric-key and public-key cryptographic algorithms that are not common in previous general-purpose workloads: advanced bit operations within a word and multi-word operations. We define MOMR (Multiple Operands Multiple Results) execution, or datarich execution, as a unified solution to both challenges. It allows arbitrary n-bit permutations to be achieved in one or two cycles, rather than O(n) cycles as in existing RISC processors. It also enables significant acceleration of the multi-word multiplications needed by public-key ciphers. We propose two implementations of MOMR: one employs only hardware changes, while the other uses Instruction Set Architecture (ISA) support. We show that MOMR execution leverages available resources in typical multi-issue processors with minimal additional cost. Multi-issue processors enhanced with MOMR units provide additional speedup over standard multi-issue processors with the same datapath. MOMR is a general architectural solution for word-oriented processor architectures to incorporate datarich operations.
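The O(n) cost quoted for conventional RISC processors can be seen in a minimal software sketch of an arbitrary bit permutation, where each destination bit is extracted and deposited individually. This is an illustration of the baseline only, not of the MOMR mechanism; the function name `permute_bits` and the convention that `perm[i]` names the source bit for destination bit `i` are assumptions made for this example.

```python
def permute_bits(word, perm):
    """Permute the bits of `word` one bit at a time.

    perm[i] gives the index of the source bit that lands in destination
    bit i (bit 0 is the least significant). The loop runs once per bit,
    which is the O(n) behaviour of a plain shift/mask/or sequence.
    """
    out = 0
    for dst, src in enumerate(perm):
        bit = (word >> src) & 1  # extract one source bit
        out |= bit << dst        # deposit it at its destination
    return out


# Usage: reverse the bit order of an 8-bit value.
reverse8 = [7, 6, 5, 4, 3, 2, 1, 0]
print(bin(permute_bits(0b00001011, reverse8)))  # prints 0b11010000
```

Each loop iteration corresponds to a handful of shift, mask, and OR instructions on a word-oriented machine, which is why a dedicated permutation unit that completes in one or two cycles is attractive.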
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 601173006 and 61221062, and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA06010403
Abstract: The combination of growing transistor counts and a limited power budget within a silicon die leads to the utilization wall problem (a.k.a. "Dark Silicon"): only a small fraction of the chip can run at full speed at any given time. Designing accelerators for specific applications or algorithms is considered one of the most promising approaches to improving energy efficiency. However, most current accelerator design methods are dedicated to particular applications or algorithms, which greatly constrains their applicability. In this paper, we propose a novel general-purpose many-accelerator architecture. Our contributions are two-fold. First, we propose to cluster the dataflow graphs (DFGs) of hotspot basic blocks (BBs) in applications. The DFG clusters are then used for accelerator design, because a DFG is the largest program unit that is not specific to a certain application. We analyze 17 benchmarks in SPEC CPU 2006, extract over 300 hotspot DFGs using the LLVM compiler tools, and divide them into 15 clusters based on graph similarity. Second, we introduce a function instruction set architecture (FISC) and illustrate how DFG accelerators can be integrated with a processor core and how they can be used by applications. Our results show that the proposed DFG clustering and FISC design can speed up SPEC benchmarks by 6.2X on average.
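Grouping DFGs by graph similarity can be sketched roughly as follows. The abstract does not say which similarity measure or clustering algorithm the authors use, so this sketch substitutes a Jaccard similarity over (producer opcode, consumer opcode) edge sets and a simple greedy threshold clustering; the names `jaccard` and `cluster_dfgs`, the threshold, and the example DFGs are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_dfgs(dfgs, threshold=0.5):
    """Greedy clustering of DFGs, each given as a set of opcode edges.

    A DFG joins the first existing cluster whose representative (its
    first member) is at least `threshold` similar; otherwise it starts
    a new cluster. Returns a list of lists of DFG names.
    """
    clusters = []  # list of (representative_edge_set, member_names)
    for name, edges in dfgs.items():
        for rep, members in clusters:
            if jaccard(rep, edges) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((edges, [name]))
    return [members for _, members in clusters]


# Usage with three made-up hotspot DFGs.
dfgs = {
    "a": {("load", "mul"), ("mul", "add"), ("add", "store")},
    "b": {("load", "mul"), ("mul", "add"), ("add", "shl")},
    "c": {("load", "cmp"), ("cmp", "select")},
}
print(cluster_dfgs(dfgs))  # prints [['a', 'b'], ['c']]
```

A greedy pass like this depends on input order and on the chosen threshold; it is meant only to make the notion of similarity-based grouping concrete.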