期刊文献+
共找到4篇文章
< 1 >
每页显示 20 50 100
Towards optimized tensor code generation for deep learning on sunway many-core processor
1
作者 Mingzhen LI Changxi LIU +8 位作者 Jianjin LIAO Xuegui ZHENG Hailong YANG Rujun SUN Jun XU Lin GAN Guangwen YANG Zhongzhi LUAN Depei QIAN 《Frontiers of Computer Science》 SCIE EI CSCD 2024年第2期1-15,共15页
The flourish of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability.Among th... The flourish of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability.Among the existing deep learning compilers,TVM is well known for its efficiency in code generation and optimization across diverse hardware devices.In the meanwhile,the Sunway many-core processor renders itself as a competitive candidate for its attractive computational power in both scientific computing and deep learning workloads.This paper combines the trends in these two directions.Specifically,we propose swTVM that extends the original TVM to support ahead-of-time compilation for architecture requiring cross-compilation such as Sunway.In addition,we leverage the architecture features during the compilation such as core group for massive parallelism,DMA for high bandwidth memory transfer and local device memory for data locality,in order to generate efficient codes for deep learning workloads on Sunway.The experiment results show that the codes generated by swTVM achieve 1.79x improvement of inference latency on average compared to the state-of-the-art deep learning framework on Sunway,across eight representative benchmarks.This work is the first attempt from the compiler perspective to bridge the gap of deep learning and Sunway processor particularly with productivity and efficiency in mind.We believe this work will encourage more people to embrace the power of deep learning and Sunwaymany-coreprocessor. 展开更多
关键词 sunway processor deep learning compiler code generation performance optimization
原文传递
swSpAMM:optimizing large-scale sparse approximate matrix multiplication on Sunway Taihulight 被引量:1
2
作者 Xiaoyan LIU Yi LIU +3 位作者 Bohong YIN Hailong YANG Zhongzhi LUAN Depei QIAN 《Frontiers of Computer Science》 SCIE EI CSCD 2023年第4期29-41,共13页
Although matrix multiplication plays an essential role in a wide range of applications,previous works only focus on optimizing dense or sparse matrix multiplications.The Sparse Approximate Matrix Multiply(SpAMM)is an ... Although matrix multiplication plays an essential role in a wide range of applications,previous works only focus on optimizing dense or sparse matrix multiplications.The Sparse Approximate Matrix Multiply(SpAMM)is an algorithm to accelerate the multiplication of decay matrices,the sparsity of which is between dense and sparse matrices.In addition,large-scale decay matrix multiplication is performed in scientific applications to solve cutting-edge problems.To optimize large-scale decay matrix multiplication using SpAMM on supercomputers such as Sunway Taihulight,we present swSpAMM,an optimized SpAMM algorithm by adapting the computation characteristics to the architecture features of Sunway Taihulight.Specifically,we propose both intra-node and inter-node optimizations to accelerate swSpAMM for large-scale execution.For intra-node optimizations,we explore algorithm parallelization and block-major data layout that are tailored to better utilize the architecture advantage of Sunway processor.For inter-node optimizations,we propose a matrix organization strategy for better distributing sub-matrices across nodes and a dynamic scheduling strategy for improving load balance across nodes.We compare swSpAMM with the existing GEMM library on a single node as well as large-scale matrix multiplication methods on multiple nodes.The experiment results show that swSpAMM achieves a speedup up to 14.5×and 2.2×when compared to xMath library on a single node and 2D GEMM method on multiple nodes,respectively. 展开更多
关键词 approximate calculation sunway processor performance optimization
原文传递
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
3
作者 Jinying Kong Kai Nie +2 位作者 Qinglei Zhou Jinlong Xu Lin Han 《国际计算机前沿大会会议论文集》 2021年第1期180-189,共10页
The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow acce... The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow accessto thread private variables in the compilation of Sunway OpenMP programs, thispaper proposes a thread private variable access technique based on privilegedinstructions. The privileged instruction-based thread-private variable access techniquecentralizes the implementation of thread-private variables at the compilerlevel, eliminating the model switching overhead of invoking OS core processingand improving the speed of accessing thread-private variables. On the Sunway1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved 6.2% and6.8% running efficiency gains, respectively. The results show that the techniquesproposed in this paper can provide technical support for giving full play to theadvantages of Sunway’s high-performance multi-core processors. 展开更多
关键词 sunway high-performance multi-core processors OpenMP programming technique Privileged instruction-based thread-private variable access technique sunway 1621 processor
原文传递
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
4
作者 Kai Nie Qinglei Zhou +3 位作者 Hong Qian Jianmin Pang Jinlong Xu Yapeng Li 《国际计算机前沿大会会议论文集》 2021年第1期163-179,共17页
The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by hight... The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by highthread group control overhead in the compilation of Sunway OpenMP programs,this paper proposes the parallel region reconstruction technique. The parallelregion reconstruction technique expands the parallel scope of parallel regionsin OpenMP programs by parallel region merging and parallel region extending.Moreover, it reduces the number of parallel regions in OpenMP programs,decreases the overhead of frequent creation and convergence of thread groups,and converts standard fork-join model OpenMP programs to higher performanceSPMD modelOpenMP programs. On the Sunway 1621 server computer, NPB3.3-OMP and SPEC OMP2012 achieved 8.9% and 7.9% running efficiency improvementrespectively through parallel region reconstruction technique. As a result,the parallel region reconstruction technique is feasible and effective. It providestechnical support to fully exploit the multi-core parallelism advantage of Sunway’shigh-performance processors. 展开更多
关键词 sunway high-performance multi-core processors OpenMP programming technique Parallel domain reconstruction technique
原文传递
上一页 1 下一页 到第
使用帮助 返回顶部