The combination of growing transistor counts and limited power budget within a silicon die leads to the utilization wall problem (a.k.a. "Dark Silicon"), that is only a small fraction of chip can run at full speed...The combination of growing transistor counts and limited power budget within a silicon die leads to the utilization wall problem (a.k.a. "Dark Silicon"), that is only a small fraction of chip can run at full speed during a period of time. Designing accelerators for specific applications or algorithms is considered to be one of the most promising approaches to improving energy-efficiency. However, most current design methods for accelerators are dedicated for certain applications or algorithms, which greatly constrains their applicability. In this paper, we propose a novel general-purpose many-accelerator architecture. Our contributions are two-fold. Firstly, we propose to cluster dataflow graphs (DFGs) of hotspot basic blocks (BBs) in applications. The DFG clusters are then used for accelerators design. This is because a DFC is the largest program unit which is not specific to a certain application. We analyze 17 benchmarks in SPEC CPU 2006, acquire over 300 DFGs hotspots by using LLVM compiler tool, and divide them into 15 clusters based on graph similarity. Secondly, we introduce a function instruction set architecture (FISC) and illustrate how DFG accelerators can be integrated with a processor core and how they can be used by applications. Our results show that the proposed DFG clustering and FISC design can speed up SPEC benchmarks 6.2X on average.展开更多
Real-time multi-media applications are increasingly mapped on modern embedded systems based on multiprocessor systems-on-chip (MPSoC). Tasks of the applications need to be mapped on the MPSoC resources efficiently i...Real-time multi-media applications are increasingly mapped on modern embedded systems based on multiprocessor systems-on-chip (MPSoC). Tasks of the applications need to be mapped on the MPSoC resources efficiently in order to satisity their performance constraints. Exploring all the possible mappings, i.e., tasks to resources combinations exhaustively may take days or weeks. Additionally, the exploration is performed at design-time, which cannot handle dynamism in applications and resources' status. A runtime mapping technique can cater for the dynamism but cannot guarantee for strict timing deadlines due to large computations involved at run-time. Thus, an approach performing feasible compute intensive exploration at design-time and using the explored results at run-time is required. This paper presents a solution in the same direction. Communicationaware design space exploration (CADSE) techniques have been proposed to explore different mapping options to be selected at run-time subject to desired performance and available MPSoC resources. Experiments show that the proposed techniques for exploration are faster over an exhaustive exploration and provides almost the same quality of results.展开更多
基金supported by the National Natural Science Foundation of China under Grant Nos.601173006,61221062the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No.XDA06010403
文摘The combination of growing transistor counts and limited power budget within a silicon die leads to the utilization wall problem (a.k.a. "Dark Silicon"), that is only a small fraction of chip can run at full speed during a period of time. Designing accelerators for specific applications or algorithms is considered to be one of the most promising approaches to improving energy-efficiency. However, most current design methods for accelerators are dedicated for certain applications or algorithms, which greatly constrains their applicability. In this paper, we propose a novel general-purpose many-accelerator architecture. Our contributions are two-fold. Firstly, we propose to cluster dataflow graphs (DFGs) of hotspot basic blocks (BBs) in applications. The DFG clusters are then used for accelerators design. This is because a DFC is the largest program unit which is not specific to a certain application. We analyze 17 benchmarks in SPEC CPU 2006, acquire over 300 DFGs hotspots by using LLVM compiler tool, and divide them into 15 clusters based on graph similarity. Secondly, we introduce a function instruction set architecture (FISC) and illustrate how DFG accelerators can be integrated with a processor core and how they can be used by applications. Our results show that the proposed DFG clustering and FISC design can speed up SPEC benchmarks 6.2X on average.
基金The authors would like to thank the reviewers for their feedback and suggestions. We also wish to mention that this work is partly supported by Singapore Ministry of Education Academic Research Fund Tier 1 (R-263-000-655-133) and National Natural Science Foundation of China (NSFC) (Grant No. 61173032).
文摘Real-time multi-media applications are increasingly mapped on modern embedded systems based on multiprocessor systems-on-chip (MPSoC). Tasks of the applications need to be mapped on the MPSoC resources efficiently in order to satisity their performance constraints. Exploring all the possible mappings, i.e., tasks to resources combinations exhaustively may take days or weeks. Additionally, the exploration is performed at design-time, which cannot handle dynamism in applications and resources' status. A runtime mapping technique can cater for the dynamism but cannot guarantee for strict timing deadlines due to large computations involved at run-time. Thus, an approach performing feasible compute intensive exploration at design-time and using the explored results at run-time is required. This paper presents a solution in the same direction. Communicationaware design space exploration (CADSE) techniques have been proposed to explore different mapping options to be selected at run-time subject to desired performance and available MPSoC resources. Experiments show that the proposed techniques for exploration are faster over an exhaustive exploration and provides almost the same quality of results.