Funding: Supported by the National Key Research and Development Program of China (No. 2020YFB1506703), the National Natural Science Foundation of China (Grant Nos. 62072018 and 61732002), the State Key Laboratory of Software Development Environment (No. SKLSDE-2021ZX-06), and the Fundamental Research Funds for the Central Universities.
Abstract: The flourishing of deep learning frameworks and hardware platforms demands an efficient compiler that can hide the diversity in both software and hardware and so provide application portability. Among existing deep learning compilers, TVM is well known for its efficient code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor is a competitive candidate owing to its attractive computational power for both scientific computing and deep learning workloads. This paper combines these two trends. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures that require cross-compilation, such as Sunway. In addition, we leverage architecture features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory for data locality, to generate efficient code for deep learning workloads on Sunway. Experimental results show that the code generated by swTVM achieves a 1.79x improvement in inference latency on average compared with the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and the Sunway processor, particularly with productivity and efficiency in mind. We believe this work will encourage more people to embrace the power of deep learning and the Sunway many-core processor.
Abstract: The use of multi-core processors will become a trend in safety-critical systems. For safe execution of multi-threaded code, automatic code generation from a formal specification is a desirable method. Signal, a synchronous language dedicated to the functional description of safety-critical systems, provides sound semantics for deterministic concurrency. Although sequential code generation for Signal has been implemented in the Polychrony compiler, strategies for deterministic multi-threaded code generation are still far from mature. Moreover, existing code generation methods rely on a particular multi-thread library, which limits cross-platform execution. OpenMP is an application program interface (API) standard for parallel programming, supported by several mainstream compilers on different platforms. This paper presents a methodology for translating Signal programs to OpenMP-based multi-threaded C code. First, the intermediate representation of the core syntax of Signal is defined using synchronous guarded actions. Then, according to the compositional semantics of Signal equations, the Signal program is synthesized into a dependency graph (DG). After parallel tasks are extracted from the dependency graph, the Signal program can finally be translated into OpenMP-based C code that can be executed on multiple platforms.
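The step from dependency graph to parallel tasks can be illustrated with a small Python sketch. It is not the paper's algorithm: the function name `parallel_levels` and the dictionary encoding of the graph are assumptions made for illustration only. The sketch groups tasks into successive levels so that tasks within one level do not depend on each other, which is the property a generator would need before emitting them as concurrent OpenMP work.

```python
def parallel_levels(deps):
    """Group tasks into levels whose members are mutually independent.

    deps maps each task name to the set of tasks it depends on
    (a hypothetical encoding of a dependency graph).
    """
    # Collect every task, including those that appear only as dependencies.
    tasks = set(deps)
    for d in deps.values():
        tasks |= set(d)
    remaining = {t: set(deps.get(t, ())) for t in tasks}

    levels = []
    while remaining:
        # Tasks with no unmet dependencies can run concurrently.
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cyclic dependency")
        levels.append(ready)
        for t in ready:
            del remaining[t]
        # Mark the finished tasks as satisfied for everything still waiting.
        for d in remaining.values():
            d.difference_update(ready)
    return levels


if __name__ == "__main__":
    graph = {"c": {"a", "b"}, "d": {"c"}, "e": {"a"}}
    print(parallel_levels(graph))
```

Each returned level would correspond to a group of tasks that could be placed in one parallel region, with a barrier between consecutive levels.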
Funding: This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010902, and the National Natural Science Foundation of China under Grant No. 61303059.
Abstract: Computation reuse is known as an effective optimization technique. However, due to the complexity of modern GPU architectures, there is not yet enough understanding of how the interplay of computation reuse and hardware specifics affects application performance. In this paper, we propose an automatic code generator for a class of stencil codes with inherent computation reuse on GPUs. For such applications, the proper reuse of intermediate results, combined with careful register and on-chip local memory usage, has profound implications for performance. The current state of the art does not address this problem in depth, partly due to the lack of a good program representation that can expose all potential computation reuse. We leverage the computation overlap graph (COG), a simple representation of data dependence and data reuse with an "element view", to expose potential reuse opportunities. Using COG, we propose a portable code generation and tuning framework for GPUs. Compared with current state-of-the-art code generators, our experimental results show up to 56.7% performance improvement on modern GPUs such as the NVIDIA C2050.
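The kind of reuse at stake can be seen in a tiny CPU-side Python sketch. It is only an illustration and does not implement COG or the paper's GPU code generator; the function names `box3_naive` and `box3_reuse` are hypothetical. For a 3x3 box-sum stencil, the naive version adds nine cells per output, while the reuse version computes each vertical three-cell partial sum once and shares it among up to three neighbouring outputs.

```python
def box3_naive(grid):
    """3x3 box sum over interior cells, recomputing every term."""
    h, w = len(grid), len(grid[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i - 1][j - 1] = sum(
                grid[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)
            )
    return out


def box3_reuse(grid):
    """Same result, but vertical partial sums are computed once and reused."""
    h, w = len(grid), len(grid[0])
    # col[r][j] is the sum of the three vertically adjacent cells centred on
    # interior row r + 1, column j; each entry feeds up to three outputs.
    col = [
        [grid[i - 1][j] + grid[i][j] + grid[i + 1][j] for j in range(w)]
        for i in range(1, h - 1)
    ]
    return [[row[j - 1] + row[j] + row[j + 1] for j in range(1, w - 1)] for row in col]


if __name__ == "__main__":
    g = [[(3 * i + 7 * j) % 11 for j in range(6)] for i in range(5)]
    print(box3_naive(g) == box3_reuse(g))
```

On a GPU the shared partial sums would have to live in registers or on-chip local memory, which is where the trade-offs described in the abstract arise.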
Abstract: While grid computing receives more and more attention, it is not widely used, partly because it requires sophisticated development. This paper discusses a code generation framework for grid computing. We first introduce GBuilder as a rapid development tool for building grid computing applications, then present the details of the code generation framework. We then discuss a case study to show the advantages of the whole code generation process, including reduced development time and a lighter burden of complexity on grid application developers.
Funding: Fully supported by the Natural Science Foundation of Hubei Province in China (Grant No. 2021CFB482), the Basic Research Science and Technology Project of Xiangyang (High-tech Domain 2022ABH007013), and the Hubei Superior and Distinctive Discipline Group of "New Energy Vehicle and Smart Transportation".
Abstract: This paper aims to explore a simpler and more user-friendly way of generating software based on model-driven development. Previous studies have attempted to generate code from domain models, hoping to reduce coding time by increasing modeling time. However, as code tools become more advanced, it is challenging to improve efficiency because models are abstract while implementations are concrete. This paper proposes a novel approach that integrates ChatGPT as a plug-in into the whole R&D process and combines it with our code generation tool to enhance R&D efficiency. We have developed some demos to demonstrate the effectiveness of our approach. According to our evaluation, our approach can save more than 90% of the work in implementing the code generation tool, leaving only about 10% of the work for code review, code improvement, and unit testing.
Funding: Funded by the Double First Class Graduate Quality Curriculum Construction Project of Shanghai Jiao Tong University.
Abstract: In today’s digital era, algorithms have become an indispensable part of our daily lives and work. Algorithm education plays a crucial role in computer science and software engineering, aiming to cultivate students’ problem-solving skills and computational thinking. However, traditional algorithm education often requires significant time and effort from teachers, lacks interactivity, and provides limited examples. The rapid advancement of AI technology, particularly generative models and large language models (LLMs), has the potential to revolutionize computer education. Models like OpenAI’s GPT-4 and ChatGPT have conversational capabilities and contribute to various aspects of computer education. GPT-3.5, as an assistant in algorithm education, assists teachers in automatically generating explanations and algorithmic examples to enhance students’ understanding of algorithms. While existing research has certain limitations, such as focusing on specific scenarios and lacking comprehensive benchmark testing, this paper explores the role of ChatGPT (GPT-3.5) in algorithm education. By refining prompts and evaluating generative capabilities, the study demonstrates that GPT-3.5 holds significant potential as a teaching aid. With an average accuracy of 0.81, GPT-3.5 can generate explanations, code examples, and visualizations of the corresponding algorithms. Other tests, including algorithm problem-solving and example generation, also demonstrate the practicability of GPT-3.5 in algorithm education.
Abstract: Application developers today need to produce code that is error-free and whose performance is optimized for a plethora of devices. The performance of application code is studied, for example, by analyzing performance data obtained by executing the application with a tracing tool. Developers typically have favorite tools that they prefer to use. Unfortunately, target devices are based on different computing platforms with different performance probes, which makes it difficult to use the same tool across different multicore platforms. The Universal Tracing Interface for Multicore Processors (UTIMP) aims to provide an unchanging tracing interface that enables developers to perform the required tracing tasks on different multicore platforms, using their favorite tool when possible.
Abstract: The resolution characteristic of the GaAs/GaAlAs transmission photocathode is an important parameter in third-generation intensifiers. The modulation transfer function of the GaAs/GaAlAs transmission photocathode is derived from a simple two-dimensional diffusion equation. The theoretical resolution characteristic of a 2 μm thick GaAs/GaAlAs transmission photocathode is calculated. The relationship between resolution and the parameters of the GaAs/GaAlAs transmission photocathode is discussed. We conclude that the GaAs/GaAlAs transmission photocathode can be designed for maximum quantum efficiency, since the resulting sacrifice in resolution does not limit system performance.
Funding: Supported by the National High-Tech Research and Development (863) Program of China (No. 2003AA)
Abstract: The electronic control unit (ECU) in electrically powered hybrid and fuel cell vehicles is exceedingly complex. Rapid prototyping control is used to reduce development time and eliminate errors during software development. This paper describes a high-efficiency development method and a flexible tool chain suitable for various applications in automotive engineering. The control algorithm can be deployed directly from a Matlab/Simulink/Stateflow environment into the ECU hardware together with an OSEK real-time operating system (RTOS). The system has been successfully used to develop a 20-kW fuel cell system ECU based on a Motorola PowerPC 555 (MPC555) microcontroller. The total software development time is greatly reduced, and the code quality and reliability are greatly enhanced.
Abstract: Register allocation is a major step in all compilers. Various register allocation algorithms have been developed over the decades. This work describes a new class of rapid register allocation algorithms and presents experimental data on their behavior. Our research encourages avoiding graph construction and graph coloring, because exact graph coloring is nondeterministic polynomial time-complete (NP-complete) and therefore unsuitable for real-time tasks. In addition, practical graph-coloring algorithms tend to use polynomial-time heuristics. In dynamic compilation environments, their superlinear complexity makes them unsuitable for register allocation and code generation. Existing tools for code generation and register allocation do not completely fulfill the requirements of fast compilation. Existing approaches either do not optimize register allocation comprehensively with a sufficient degree of performance, or they require an unjustifiable amount of time and/or resources. Therefore, we propose a new class of register allocation and code generation algorithms that can be performed in linear time. These algorithms are based on the mathematical foundations of abstract interpretation and the computation of the level of abstraction. They have been implemented in a specialized library for just-in-time compilation. This library is specialized for the execution of common intermediate language (CIL) and low level virtual machine (LLVM) code, with a focus on embedded systems.
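To make the contrast with graph coloring concrete, the following Python sketch shows linear-scan allocation, a different and well-known fast technique. It is not the abstract-interpretation-based algorithm of this paper, whose details the abstract does not give. The name `linear_scan`, the live-interval encoding, and the `"spill"` marker are assumptions for illustration. The sketch walks live intervals in order of start point and never builds an interference graph; for simplicity it re-sorts the active list, so it is not strictly linear as written.

```python
def linear_scan(intervals, num_regs):
    """Assign registers to live intervals without an interference graph.

    intervals: dict mapping a variable name to (start, end), inclusive.
    Returns a dict mapping each name to a register index or "spill".
    """
    order = sorted(intervals, key=lambda v: intervals[v][0])
    free = list(range(num_regs))
    active = []  # (end, name) pairs currently holding a register, sorted by end
    assignment = {}
    for v in order:
        start, end = intervals[v]
        # Release registers of intervals that ended before this one starts.
        while active and active[0][0] < start:
            _, old = active.pop(0)
            free.append(assignment[old])
        if free:
            assignment[v] = free.pop(0)
            active.append((end, v))
            active.sort()
        else:
            # No register left: spill whichever interval ends last.
            last_end, last = active[-1]
            if last_end > end:
                assignment[v] = assignment[last]
                assignment[last] = "spill"
                active[-1] = (end, v)
                active.sort()
            else:
                assignment[v] = "spill"
    return assignment


if __name__ == "__main__":
    live = {"a": (0, 3), "b": (1, 2), "c": (4, 5)}
    print(linear_scan(live, 2))
    print(linear_scan(live, 1))
```

With two registers every variable receives a register; with one register the longer-lived `a` is spilled so that `b` and `c` can share the single register.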