Recent years,the deep learning algorithm has been widely deployed from cloud servers to terminal units.And researchers proposed various neural network accelerators and software development environments.In this article...Recent years,the deep learning algorithm has been widely deployed from cloud servers to terminal units.And researchers proposed various neural network accelerators and software development environments.In this article,we have reviewed the representative neural network accelerators.As an entirety,the corresponding software stack must consider the hardware architecture of the specific accelerator to enhance the end-to-end performance.And we summarize the programming environments of neural network accelerators and optimizations in software stack.Finally,we comment the future trend of neural network accelerator and programming environments.展开更多
To improve the inference efficiency of convolutional neural networks(CNN),the existing neural networks mainly adopt heuristic and dynamic programming algorithms to realize parallel scheduling among operators.Heuristic...To improve the inference efficiency of convolutional neural networks(CNN),the existing neural networks mainly adopt heuristic and dynamic programming algorithms to realize parallel scheduling among operators.Heuristic scheduling algorithms can generate local optima easily,while the dynamic programming algorithm has a long convergence time for complex structural models.This paper mainly studies the parallel scheduling between operators and proposes an inter-operator parallelism schedule(IOPS)scheduling algorithm that guarantees the minimum similar execution delay.Firstly,a graph partitioning algorithm based on the largest block is designed to split the neural network model into multiple subgraphs.Then,the operators that meet the conditions is replaced according to the defined operator replacement rules.Finally,the optimal scheduling method based on backtracking is used to schedule the computational graph.Network models such as Inception-v3,ResNet-50,and RandWire are selected for testing.The experimental results show that the algorithm designed in this paper can achieve a 1.6×speedup compared with the existing sequential execution methods.展开更多
This paper proposes a new Graphics Processing Unit(GPU)-accelerated storage format to speed up Sparse Matrix Vector Products(SMVPs) for Finite Element Method(FEM) analysis of electromagnetic problems.A new format call...This paper proposes a new Graphics Processing Unit(GPU)-accelerated storage format to speed up Sparse Matrix Vector Products(SMVPs) for Finite Element Method(FEM) analysis of electromagnetic problems.A new format called Modified Compile Time Optimization(MCTO) format is used to reduce much execution time and design for hastening the iterative solution of FEM equations especially when rows have uneven lengths.The MCTO-applied FEM is about 10 times faster than conventional FEM on a CPU,and faster than other row-major ordering formats on a GPU.Numerical results show that the proposed GPU-accelerated storage format turns out to be an excellent accelerator.展开更多
Symbolic execution is an effective way of systematically exploring the search space of a program,and is often used for automatic software testing and bug finding.The program to be analyzed is usually compiled into a b...Symbolic execution is an effective way of systematically exploring the search space of a program,and is often used for automatic software testing and bug finding.The program to be analyzed is usually compiled into a binary or an intermediate representation,on which symbolic execution is carried out.During this process,compiler optimizations influence the effectiveness and efficiency of symbolic execution.However,to the best of our knowledge,there exists no work on compiler optimization recommendation for symbolic execution with respect to(w.r.t.)modified condition/decision coverage(MC/DC),which is an important testing coverage criterion widely used for mission-critical software.This study describes our use of a state-of-the-art symbolic execution tool to carry out extensive experiments to study the impact of compiler optimizations on symbolic execution w.r.t.MC/DC.The results indicate that instruction combining(IC)optimization is the important and dominant optimization for symbolic execution w.r.t.MC/DC.We designed and implemented a support vector machine based optimization recommendation method w.r.t.IC(denoted as auto).The experiments on two standard benchmarks(Coreutils and NECLA)showed that auto achieves the best MC/DC on 67.47%of Coreutils programs and 78.26%of NECLA programs.展开更多
Stream Register File (SRF) is a large on-chip memory of the stream processor and its efficient management is essential for good performance. Current stream programming languages expose the management of SRF to the p...Stream Register File (SRF) is a large on-chip memory of the stream processor and its efficient management is essential for good performance. Current stream programming languages expose the management of SRF to the programmer, incurring heavy burden on the programmer and bringing difficulties to inheriting the legacy codes. SF95 is the language developed for FT64 which is the first 64-bit stream processor designed for scientific applications. SF95 conceals SRF from the programmer and leaves the management of SRF to its compiler. In this paper, we present a compiler approach named SRF Coloring to manage SRF automatically. The novelties of this paper are: first, it is the first time to use the graph coloring-based algorithm for the SRF management; second, an algorithm framework for SRF Coloring that is well suited to the FT64 architecture is proposed this framework is based on a well-understood graph coloring algorithm for register allocation, together with some modifications to deal with the unusual aspects of SRF problem; third, the SRF Coloring algorithm is implemented in SF95Compiler, a compiler designed for FT64 and SF95. The experimental results show that our approach represents a practical and promising solution to SRF allocation.展开更多
Existing methods of obtaining runtime feedback for structure data-layout optimization have several drawbacks, such as large overhead and difficulty composing training sets. As a result, structure data-layout optimizat...Existing methods of obtaining runtime feedback for structure data-layout optimization have several drawbacks, such as large overhead and difficulty composing training sets. As a result, structure data-layout optimization is not widely used. To overcome these drawbacks, a performance monitoring unit (PMU) sampling method was developed with much less overhead and better portability and usability. An algorithm was developed to correct incomplete and inaccurate PMU sampling. With the corrected PMU feedback, a structure data-layout optimizer achieved a 45.1% performance improvement compared to a design without data-layout optimization, which is 97.6% of the performance improvement achieved with instrumented feedback. Calculation of the PMU feedback increased the execution time by 12.3%, compared to the overhead for the instrumented feedback of 341.5%. Tests show that the PMU feedback is efficient and effective for structure data-layout optimization.展开更多
The highlevel Compiler Intermediate Language CIL is a generalpurpose descripotion language of parallel graph rewriting computational model intended for parallelimplementatioll of declarative languages on multiprocesso...The highlevel Compiler Intermediate Language CIL is a generalpurpose descripotion language of parallel graph rewriting computational model intended for parallelimplementatioll of declarative languages on multiprocessor systems. In this paper, wefirst outline a new Hybrid Execution Model(HEM) and corresponding parallel abstract machine PAM/TGR based on Extended parallel Graph Rewriting ComputationalModel EGRCM for implementing CIL language on distributed memory multiprocessorsystems. Then we focus on the compiling CIL language with various optindzing techniques such as pattern matching, rule indexing, node ordering and compile-time partialscheduling. The experimental results on a 16-node Thansputer Array demonstrates the effectiveness of our model and strategies.展开更多
Loop tiling (or loop blocking) is a well-known loop transformation to improve temporal locality in nested loops which perform matrix computations. When targeting caches that have low associativities, one of the key ...Loop tiling (or loop blocking) is a well-known loop transformation to improve temporal locality in nested loops which perform matrix computations. When targeting caches that have low associativities, one of the key challenges for loop tiling is to simultaneously minimize capacity misses and conflict misses. This paper analyzes the effect of the tile size and the array-dimension size on capacity misses and conflict misses. The analysis supports the approach of combining tile-size selection (to minimize capacity misses) with array padding (to minimize conflict misses).展开更多
In this paper,we focus on the compiling implementation of parallel logic language PARLOG and functional language ML on distributed memory multiprocessors.Under the graph rewriting framework, a Heterogeneous Parallel G...In this paper,we focus on the compiling implementation of parallel logic language PARLOG and functional language ML on distributed memory multiprocessors.Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model(HPGREM)is presented firstly.Then based on HPGREM,a parallel abstract machine PAM/TGR is described.Furthermore,several optimizing compilation schemes for executing declarative programs on transputer array are proposed. The performance statistics on a transputer array demonstrate the effectiveness of our model,parallel ab- stract machine,optimizing compilation strategies and compiler.展开更多
基金partially supported by the National Key Research and Development Program of China (under Grant 2017YFB1003101, 2018AAA0103300, 2017YFA0700900, 2017YFA0700902, 2017YFA0700901)the National Natural Science Foundation of China (under Grant 61732007, 61432016, 61532016, 61672491, 61602441, 61602446, 61732002, 61702478, and 61732020)+6 种基金Beijing Natural Science Foundation (JQ18013)National Science and Technology Major Project (2018ZX01031102)the Transformation and Transferof Scientific and Technological Achievements of Chinese Academy of Sciences (KFJ-HGZX-013)Key Research Projects in Frontier Science of Chinese Academy of Sciences (QYZDBSSW-JSC001)Strategic Priority Research Program of Chinese Academy of Science (XDB32050200, XDC01020000)Standardization Research Project of Chinese Academy of Sciences (BZ201800001)Beijing Academy of Artificial Intelligence (BAAI) and Beijing Nova Program of Science and Technology (Z191100001119093)
文摘Recent years,the deep learning algorithm has been widely deployed from cloud servers to terminal units.And researchers proposed various neural network accelerators and software development environments.In this article,we have reviewed the representative neural network accelerators.As an entirety,the corresponding software stack must consider the hardware architecture of the specific accelerator to enhance the end-to-end performance.And we summarize the programming environments of neural network accelerators and optimizations in software stack.Finally,we comment the future trend of neural network accelerator and programming environments.
基金Supported by the National Key Research and Development Project of China(No.2020AAA0104603)the National Natural Science Foundation of China(No.61834005,61772417)the Shaanxi Province Key R&D Plan(No.2021GY-029).
文摘To improve the inference efficiency of convolutional neural networks(CNN),the existing neural networks mainly adopt heuristic and dynamic programming algorithms to realize parallel scheduling among operators.Heuristic scheduling algorithms can generate local optima easily,while the dynamic programming algorithm has a long convergence time for complex structural models.This paper mainly studies the parallel scheduling between operators and proposes an inter-operator parallelism schedule(IOPS)scheduling algorithm that guarantees the minimum similar execution delay.Firstly,a graph partitioning algorithm based on the largest block is designed to split the neural network model into multiple subgraphs.Then,the operators that meet the conditions is replaced according to the defined operator replacement rules.Finally,the optimal scheduling method based on backtracking is used to schedule the computational graph.Network models such as Inception-v3,ResNet-50,and RandWire are selected for testing.The experimental results show that the algorithm designed in this paper can achieve a 1.6×speedup compared with the existing sequential execution methods.
基金Supported by the National Science Foundation of China(Nos.61272097,71203064,71103077)the Natural Science Foundation of Shanghai(No.12ZR1443000)+2 种基金the Funding Research and Innovation Project of Shanghai Municipal Education Commission(No.12ZZ182)the Fundamental Research Funds for the Central Universitiesand the Local Colleges and Universities "1025" Connotation Construction Project of Shanghai(No.nhky-2012-10)the Foundation of Shanghai University of Engineering Science(No.A-0501-13-012)
文摘This paper proposes a new Graphics Processing Unit(GPU)-accelerated storage format to speed up Sparse Matrix Vector Products(SMVPs) for Finite Element Method(FEM) analysis of electromagnetic problems.A new format called Modified Compile Time Optimization(MCTO) format is used to reduce much execution time and design for hastening the iterative solution of FEM equations especially when rows have uneven lengths.The MCTO-applied FEM is about 10 times faster than conventional FEM on a CPU,and faster than other row-major ordering formats on a GPU.Numerical results show that the proposed GPU-accelerated storage format turns out to be an excellent accelerator.
基金Project supported by the National Key R&D Program of China(No.2017YFB1001802)the National Natural Science Foundation of China(Nos.61472440,61632015,61690203,and 61532007)。
文摘Symbolic execution is an effective way of systematically exploring the search space of a program,and is often used for automatic software testing and bug finding.The program to be analyzed is usually compiled into a binary or an intermediate representation,on which symbolic execution is carried out.During this process,compiler optimizations influence the effectiveness and efficiency of symbolic execution.However,to the best of our knowledge,there exists no work on compiler optimization recommendation for symbolic execution with respect to(w.r.t.)modified condition/decision coverage(MC/DC),which is an important testing coverage criterion widely used for mission-critical software.This study describes our use of a state-of-the-art symbolic execution tool to carry out extensive experiments to study the impact of compiler optimizations on symbolic execution w.r.t.MC/DC.The results indicate that instruction combining(IC)optimization is the important and dominant optimization for symbolic execution w.r.t.MC/DC.We designed and implemented a support vector machine based optimization recommendation method w.r.t.IC(denoted as auto).The experiments on two standard benchmarks(Coreutils and NECLA)showed that auto achieves the best MC/DC on 67.47%of Coreutils programs and 78.26%of NECLA programs.
基金Supported by the National Natural Science Foundation of China under Grant Nos.60621003 and 60633050.
文摘Stream Register File (SRF) is a large on-chip memory of the stream processor and its efficient management is essential for good performance. Current stream programming languages expose the management of SRF to the programmer, incurring heavy burden on the programmer and bringing difficulties to inheriting the legacy codes. SF95 is the language developed for FT64 which is the first 64-bit stream processor designed for scientific applications. SF95 conceals SRF from the programmer and leaves the management of SRF to its compiler. In this paper, we present a compiler approach named SRF Coloring to manage SRF automatically. The novelties of this paper are: first, it is the first time to use the graph coloring-based algorithm for the SRF management; second, an algorithm framework for SRF Coloring that is well suited to the FT64 architecture is proposed this framework is based on a well-understood graph coloring algorithm for register allocation, together with some modifications to deal with the unusual aspects of SRF problem; third, the SRF Coloring algorithm is implemented in SF95Compiler, a compiler designed for FT64 and SF95. The experimental results show that our approach represents a practical and promising solution to SRF allocation.
文摘Existing methods of obtaining runtime feedback for structure data-layout optimization have several drawbacks, such as large overhead and difficulty composing training sets. As a result, structure data-layout optimization is not widely used. To overcome these drawbacks, a performance monitoring unit (PMU) sampling method was developed with much less overhead and better portability and usability. An algorithm was developed to correct incomplete and inaccurate PMU sampling. With the corrected PMU feedback, a structure data-layout optimizer achieved a 45.1% performance improvement compared to a design without data-layout optimization, which is 97.6% of the performance improvement achieved with instrumented feedback. Calculation of the PMU feedback increased the execution time by 12.3%, compared to the overhead for the instrumented feedback of 341.5%. Tests show that the PMU feedback is efficient and effective for structure data-layout optimization.
文摘The highlevel Compiler Intermediate Language CIL is a generalpurpose descripotion language of parallel graph rewriting computational model intended for parallelimplementatioll of declarative languages on multiprocessor systems. In this paper, wefirst outline a new Hybrid Execution Model(HEM) and corresponding parallel abstract machine PAM/TGR based on Extended parallel Graph Rewriting ComputationalModel EGRCM for implementing CIL language on distributed memory multiprocessorsystems. Then we focus on the compiling CIL language with various optindzing techniques such as pattern matching, rule indexing, node ordering and compile-time partialscheduling. The experimental results on a 16-node Thansputer Array demonstrates the effectiveness of our model and strategies.
基金This work is sponsored in part by National Science Foundation of USA under Grant Nos.ST-HEC-0444285,CCR-950254,ACI/ITR-0082834 and CCR-9975309,by Indiana 21st Century Fund,and by a donation from Sun Microsystems,Inc.
文摘Loop tiling (or loop blocking) is a well-known loop transformation to improve temporal locality in nested loops which perform matrix computations. When targeting caches that have low associativities, one of the key challenges for loop tiling is to simultaneously minimize capacity misses and conflict misses. This paper analyzes the effect of the tile size and the array-dimension size on capacity misses and conflict misses. The analysis supports the approach of combining tile-size selection (to minimize capacity misses) with array padding (to minimize conflict misses).
基金This work was partially supported by the National 863 High Technical Grant 863-306-101the National Doctoral Subject Foundation Grant 0249136.
文摘In this paper,we focus on the compiling implementation of parallel logic language PARLOG and functional language ML on distributed memory multiprocessors.Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model(HPGREM)is presented firstly.Then based on HPGREM,a parallel abstract machine PAM/TGR is described.Furthermore,several optimizing compilation schemes for executing declarative programs on transputer array are proposed. The performance statistics on a transputer array demonstrate the effectiveness of our model,parallel ab- stract machine,optimizing compilation strategies and compiler.