Funding: Supported by the Strategic Pilot Science and Technology Project of Chinese Academy of Sciences (Category C) under Grant No. XDC05000000 and the Youth Program of the National Natural Science Foundation of China under Grant No. 61802368.
Abstract: Agile hardware design is gaining momentum, bringing new chips to the market in larger quantities and at a faster pace. However, it also poses new challenges for compiler developers, who must retarget existing compilers to these new chips in a shorter time than ever before. Currently, retargeting a compiler backend, e.g., an LLVM backend, to a new target requires compiler developers to manually write a set of target description files (totalling 10,300+ lines of code (LOC) for RISC-V in LLVM), which is error-prone and time-consuming. In this paper, we introduce a new approach, Automatic Target Description File Generation (ATG), which accelerates the generation of a compiler backend for a new target by generating its target description files automatically. Given a new target, ATG proceeds in two stages. First, ATG synthesizes a small list of target-specific properties and a list of code-layout templates from the target description files of a set of existing targets with similar instruction set architectures (ISAs). Second, ATG asks compiler developers to fill in the information for each instruction in the new target in tabular form, according to the synthesized list of target-specific properties, and then generates its target description files automatically according to the synthesized list of code-layout templates. The first stage can often be reused across new targets sharing similar ISAs. We evaluate ATG using nine RISC-V instruction sets drawn from a total of 1,029 instructions in LLVM 12.0. ATG enables compiler developers to generate compiler backends for these ISAs that emit the same assembly code as the existing compiler backends for RISC-V, but with significantly less development effort (each instruction is specified in terms of at most 61 target-specific properties).
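The abstract does not show what a property row or a code-layout template looks like. As a rough single-instruction illustration of the idea (the struct fields, record names, and emitted syntax below are hypothetical, not ATG's actual format), the following C++ sketch turns two rows of a RISC-V instruction property table into simplified TableGen-like records:

```cpp
// Hypothetical sketch of ATG's two inputs: a per-instruction property row
// (filled in by the developer in tabular form) and a code-layout template
// (synthesized from existing backends). All names are illustrative.
#include <cstdio>
#include <string>
#include <vector>

struct InstrProps {                 // one row of the tabular description
    std::string mnemonic;           // e.g., "add"
    std::string format;             // e.g., "R" (RISC-V R-type)
    unsigned    opcode;             // major opcode field
    unsigned    funct3, funct7;     // sub-opcode fields
    std::string asmOperands;        // e.g., "$rd, $rs1, $rs2"
};

// A code-layout template instantiated once per row, mimicking the rough
// shape of an LLVM .td record (heavily simplified).
std::string emitTdRecord(const InstrProps& p) {
    char buf[256];
    std::snprintf(buf, sizeof(buf),
        "def %s : RVInst%s<opcode=%#x, funct3=%#x, funct7=%#x, \"%s %s\">;",
        p.mnemonic.c_str(), p.format.c_str(),
        p.opcode, p.funct3, p.funct7,
        p.mnemonic.c_str(), p.asmOperands.c_str());
    return std::string(buf);
}

int main() {
    std::vector<InstrProps> table = {
        {"add", "R", 0x33, 0x0, 0x00, "$rd, $rs1, $rs2"},
        {"sub", "R", 0x33, 0x0, 0x20, "$rd, $rs1, $rs2"},
    };
    for (const auto& row : table)
        std::printf("%s\n", emitTdRecord(row).c_str());
}
```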
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2021ZD0110101, the National Natural Science Foundation of China under Grant Nos. 62090024, 61872043, and 61802368, and the Australian Research Council under Grant Nos. DP180104069 and DP210102409.
Abstract: Tensors are a popular programming interface for developing artificial intelligence (AI) algorithms. Layout refers to the order in which tensor data are placed in memory; it affects performance through data locality, and deep neural network libraries therefore adopt layout conventions. Since AI applications can use arbitrary layouts, and existing AI systems provide no programming abstraction that shields developers from the layout conventions of libraries, operator developers have to write a large amount of layout-related code, which reduces the efficiency of integrating new libraries or developing new operators. Furthermore, developers insert layout conversion operations inside operators to deal with the uncertainty of the input layout, thus losing opportunities for layout optimization. Based on the idea of polymorphism, we propose a layout-agnostic virtual tensor programming interface, the VTensor framework, which enables developers to write new operators without caring about the underlying physical layout of tensors. In addition, the VTensor framework performs global layout inference at runtime to transparently resolve the required layouts of virtual tensors, and applies runtime layout-oriented optimizations to globally minimize the number of layout transformation operations. Experimental results demonstrate that with VTensor, developers can avoid writing layout-dependent code. Compared with TensorFlow, for the 16 operations used in 12 popular networks, VTensor reduces the lines of code (LOC) required to write a new operation by 47.82% on average, and improves overall performance by 18.65% on average.
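As a rough illustration of the layout-agnostic idea (the class, methods, and layouts below are hypothetical, not VTensor's real API), the following C++ sketch shows a tensor that carries a layout tag and is physically transformed only when a required layout differs from the current one, so the operator body itself contains no layout-specific branches:

```cpp
// Hypothetical sketch of a layout-agnostic tensor: data carries a layout
// tag, and materialize() converts only when a different layout is actually
// required. Names are illustrative; this is not VTensor's real interface.
#include <algorithm>
#include <vector>

enum class Layout { NCHW, NHWC };

struct VTensor {
    std::vector<float> data;        // flat buffer
    int n, c, h, w;                 // logical shape
    Layout layout;

    // Return this tensor in the requested layout, transposing only
    // when the current physical layout differs (zero-cost otherwise).
    VTensor materialize(Layout want) const {
        if (want == layout) return *this;     // no conversion needed
        VTensor out{std::vector<float>(data.size()), n, c, h, w, want};
        for (int ni = 0; ni < n; ++ni)
            for (int ci = 0; ci < c; ++ci)
                for (int hi = 0; hi < h; ++hi)
                    for (int wi = 0; wi < w; ++wi) {
                        size_t nchw = ((size_t(ni)*c + ci)*h + hi)*w + wi;
                        size_t nhwc = ((size_t(ni)*h + hi)*w + wi)*c + ci;
                        if (want == Layout::NHWC) out.data[nhwc] = data[nchw];
                        else                      out.data[nchw] = data[nhwc];
                    }
        return out;
    }
};

// An operator written once, with no layout-dependent code: it states the
// layout its backend prefers and lets the framework decide whether a
// physical transformation is actually required for this input.
VTensor relu(const VTensor& x) {
    VTensor y = x.materialize(Layout::NHWC);  // backend convention, say NHWC
    for (float& v : y.data) v = std::max(v, 0.0f);
    return y;
}
```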
Funding: This work was supported by the National High Technology Research and Development 863 Program of China under Grant Nos. 2015AA011505, 2015AA015306, and 2012AA010902, the National Natural Science Foundation of China under Grant Nos. 61202055, 61221062, 61521092, 61303053, 61432016, 61402445, and 61672492, and the National Key Research and Development Program of China under Grant No. 2016YFB1000402.
Abstract: Frequent itemset mining (FIM) is a popular data mining problem adopted in many fields, such as commodity recommendation in the retail industry, log analysis in web search, and query recommendation (or related search). A large number of FIM algorithms have been proposed to obtain better performance, including parallelized algorithms for processing large data volumes. In addition, incremental FIM algorithms have been proposed to deal with incremental database updates. However, most of these incremental algorithms have low parallelism, causing low efficiency on huge databases. This paper presents two parallel incremental FIM algorithms, IncMiningPFP and IncBuildingPFP, implemented on the MapReduce framework. IncMiningPFP preserves the FP-tree mining results of the original pass and utilizes them for incremental calculations. In particular, we propose a method that generates a partial FP-tree in the incremental pass, in order to avoid unnecessary mining work. Furthermore, some of the incremental parallel tasks can be omitted when the inserted transactions include fewer items. IncBuildingPFP preserves the CanTrees built in the original pass and then adds new transactions to them during the incremental passes. Our experimental results show that IncMiningPFP achieves significant speedups over PFP (Parallel FP-Growth) and a sequential incremental algorithm (CanTree) for most incremental input databases, and that IncBuildingPFP achieves them in the remaining cases.
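The key property behind IncBuildingPFP's incremental pass is that a CanTree orders items canonically rather than by frequency, so a tree preserved from the original pass can absorb new transactions without restructuring. Below is a minimal single-machine C++ sketch of that insertion step only (the MapReduce parallelization and the mining itself are omitted, and all names are illustrative):

```cpp
// Simplified sketch of the CanTree idea: items are kept in one fixed
// (canonical) order, so a tree built in the original pass can absorb new
// transactions incrementally with no restructuring of existing paths.
#include <algorithm>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Node {
    int count = 0;
    std::map<std::string, std::unique_ptr<Node>> children;
};

struct CanTree {
    Node root;

    // Insert one transaction; items are first sorted into the canonical
    // (here: lexicographic) order, which is what keeps insertion valid
    // as the database grows.
    void insert(std::vector<std::string> tx) {
        std::sort(tx.begin(), tx.end());
        Node* cur = &root;
        for (const auto& item : tx) {
            auto& child = cur->children[item];
            if (!child) child = std::make_unique<Node>();
            child->count++;
            cur = child.get();
        }
    }
};

int main() {
    CanTree tree;
    // Original pass: build the tree once.
    tree.insert({"milk", "bread", "eggs"});
    tree.insert({"milk", "bread"});
    // Incremental pass: new transactions are simply added to the
    // preserved tree; earlier paths are never rebuilt.
    tree.insert({"bread", "eggs"});
}
```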
Funding: Supported in part by the National Basic Research 973 Program of China under Grant Nos. 2011CB302504 and 2011ZX01028-001-002, the National High Technology Research and Development 863 Program of China under Grant No. 2009AA01A129, the National Natural Science Foundation of China (NSFC) under Grant No. 60970024, and the Innovation Research Group of NSFC under Grant No. 60921002.
Abstract: In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods, which rely on circular queues implemented predominantly in indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning the multiple time steps of a stencil computation, so that circular queues can be implemented effectively, and in a balanced manner, using both shared memory and registers. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93x over methods that use circular queues implemented with shared memory only.
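As a rough CPU-side illustration of the rotation pattern being balanced (all names hypothetical; the paper's framework places these data in actual GPU registers and shared memory), the following C++ sketch streams a 1D 3-point stencil, rotating a 3-wide window through named scalars that stand in for registers while the swept lines live in arrays that stand in for shared memory:

```cpp
// Hypothetical CPU-side sketch of the hybrid circular-queue pattern: the
// sliding window of stencil neighbors rotates through scalars (standing in
// for GPU registers), avoiding repeated reads from the line buffers
// (standing in for indirectly addressable shared memory).
#include <cstdio>
#include <utility>
#include <vector>

// One time step of a 1D 3-point stencil: out[i] = avg of in[i-1..i+1].
void stencilStep(const std::vector<float>& in, std::vector<float>& out) {
    const size_t n = in.size();
    // "Register" part of the queue: a 3-wide window rotated in place,
    // so each interior cell is loaded from the line buffer exactly once.
    float left = in[0], center = in[1];
    out[0] = in[0];                            // boundary copied through
    for (size_t i = 1; i + 1 < n; ++i) {
        float right = in[i + 1];               // the one new load per step
        out[i] = (left + center + right) / 3.0f;
        left = center;                         // rotate the register queue
        center = right;
    }
    out[n - 1] = in[n - 1];                    // boundary copied through
}

int main() {
    std::vector<float> a(16, 0.0f), b(16, 0.0f);
    a[8] = 1.0f;                               // a single spike to diffuse
    // "Shared-memory" part of the queue: two line buffers recycled across
    // time steps instead of allocating a fresh buffer per step.
    for (int t = 0; t < 4; ++t) {
        stencilStep(a, b);
        std::swap(a, b);
    }
    std::printf("a[8] after 4 steps = %f\n", a[8]);
}
```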