8 articles found
Towards optimized tensor code generation for deep learning on sunway many-core processor
1
Authors: Mingzhen LI, Changxi LIU, Jianjin LIAO, Xuegui ZHENG, Hailong YANG, Rujun SUN, Jun XU, Lin GAN, Guangwen YANG, Zhongzhi LUAN, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2024, Issue 2, pp. 1-15 (15 pages)
The flourishing of deep learning frameworks and hardware platforms demands an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor presents itself as a competitive candidate thanks to its attractive computational power in both scientific computing and deep learning workloads. This paper combines the trends in these two directions. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures requiring cross-compilation, such as Sunway. In addition, we leverage architecture features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory for data locality, in order to generate efficient code for deep learning workloads on Sunway. Experimental results show that the code generated by swTVM achieves a 1.79x improvement in inference latency on average compared to the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and the Sunway processor, particularly with productivity and efficiency in mind. We believe this work will encourage more people to embrace the power of deep learning and the Sunway many-core processor.
Keywords: Sunway processor; deep learning compiler; code generation; performance optimization
Software approaches for resilience of high performance computing systems: a survey
2
Authors: Jie JIA, Yi LIU, Guozhen ZHANG, Yulin GAO, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2023, Issue 4, pp. 43-56 (14 pages)
With the scaling up of high-performance computing systems in recent years, their reliability has been declining continuously. Therefore, system resilience has been regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. First, a classification of software resilience approaches is presented; then we introduce major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. In addition, challenges posed by system scale and heterogeneous architectures are also discussed.
Keywords: resilience; high-performance computing; fault tolerance; challenge
swSpAMM: optimizing large-scale sparse approximate matrix multiplication on Sunway Taihulight
3
Authors: Xiaoyan LIU, Yi LIU, Bohong YIN, Hailong YANG, Zhongzhi LUAN, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2023, Issue 4, pp. 29-41 (13 pages)
Although matrix multiplication plays an essential role in a wide range of applications, previous works focus only on optimizing dense or sparse matrix multiplications. Sparse Approximate Matrix Multiply (SpAMM) is an algorithm to accelerate the multiplication of decay matrices, whose sparsity lies between that of dense and sparse matrices. In addition, large-scale decay matrix multiplication is performed in scientific applications to solve cutting-edge problems. To optimize large-scale decay matrix multiplication using SpAMM on supercomputers such as Sunway Taihulight, we present swSpAMM, an optimized SpAMM algorithm that adapts the computation characteristics to the architecture features of Sunway Taihulight. Specifically, we propose both intra-node and inter-node optimizations to accelerate swSpAMM for large-scale execution. For intra-node optimizations, we explore algorithm parallelization and a block-major data layout tailored to better utilize the architectural advantages of the Sunway processor. For inter-node optimizations, we propose a matrix organization strategy for better distributing sub-matrices across nodes and a dynamic scheduling strategy for improving load balance across nodes. We compare swSpAMM with the existing GEMM library on a single node as well as with large-scale matrix multiplication methods on multiple nodes. Experimental results show that swSpAMM achieves speedups of up to 14.5× and 2.2× when compared to the xMath library on a single node and the 2D GEMM method on multiple nodes, respectively.
Keywords: approximate calculation; Sunway processor; performance optimization
Several issues in exascale computing (cited by: 9)
4
Authors: Depei QIAN, Rui WANG. Scientia Sinica Informationis (中国科学:信息科学) (CSCD, PKU Core), 2020, Issue 9, pp. 1303-1326 (24 pages)
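Over the past 20-plus years, with sustained support from national science and technology programs, high-performance computing in China has made considerable progress and is now pushing toward the goal of EFlops-class (10^18 floating-point operations per second, "E-class" or exascale for short) high-performance computers. This paper briefly reviews the history of high-performance computing in China and, in view of the difficulties currently facing exascale computing, discusses the technical problems that most need to be studied and solved in seven areas: architecture, processors, interconnection networks, parallel operating systems, parallel programming, algorithms, and reliability.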
Keywords: exascale computer; heterogeneous architecture; many-core processor; interconnection network; parallel programming
Coordinating workload balancing and power switching in renewable energy powered data center (cited by: 1)
5
Authors: Xian LI, Rui WANG, Zhongzhi LUAN, Yi LIU, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2016, Issue 3, pp. 574-587 (14 pages)
There has been growing concern about the energy consumption and environmental impact of datacenters. Some pioneers have begun to power datacenters with renewable energy to offset their carbon footprint. However, it is challenging to integrate intermittent renewable energy into a datacenter power system. Grid-tied systems are widely deployed in renewable energy powered datacenters, but the drawbacks of the grid-tie inverter (e.g., harmonic disturbance and cost) hamper this design. Besides, the mixture of green load and brown load makes power management depend heavily on software measurement and monitoring, which often suffers from inaccuracy. We propose DualPower, a novel power provisioning architecture that enables green datacenters to integrate renewable power supply without grid-tie inverters. To optimize DualPower operation, we propose a specially designed power management framework to coordinate workload balancing with power supply switching. We evaluate three optimization schemes (LM, PS and JO) under different datacenter operation scenarios on our trace-driven simulation platform. The experimental results show that DualPower can be as efficient as a grid-tied system and has good scalability. In contrast to previous works, DualPower integrates renewable power at lower cost and maintains full availability of datacenter servers.
Keywords: renewable energy; green computing; power provisioning; power management
A novel index system describing program runtime characteristics for workload consolidation
6
Authors: Lin WANG, Depei QIAN, Rui WANG, Zhongzhi LUAN, Hailong YANG, Huaxiang ZHANG. Frontiers of Computer Science (SCIE, EI, CSCD), 2019, Issue 3, pp. 489-499 (11 pages)
Workload consolidation is a common method to improve resource utilization in clusters or data centers. In order to achieve efficient workload consolidation, the runtime characteristics of a program should be taken into consideration in scheduling. In this paper, we propose a novel index system for efficiently describing program runtime characteristics. With the help of this index system, programs can be classified by the following runtime characteristics: 1) dependence on multi-dimensional resources including CPU, disk I/O, memory and network I/O; and 2) impact on and vulnerability to resource sharing, embodied by resource usage and resource sensitivity. In order to verify the effectiveness of this novel index system in workload consolidation, a scheduling strategy, Sche-index, which uses the new index system for workload consolidation, is proposed. Experimental results show that, compared with the traditional least-loaded scheduling strategy, Sche-index can significantly improve both program performance and system resource utilization.
Keywords: index system; runtime characteristics; workload consolidation; cluster scheduling
User-level failure detection and auto-recovery of parallel programs in HPC systems
7
Authors: Guozhen ZHANG, Yi LIU, Hailong YANG, Jun XU, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2021, Issue 6, pp. 31-42 (12 pages)
As the mean time between failures (MTBF) continues to decline with the increasing number of components in large-scale high performance computing (HPC) systems, program failures are likely to occur during execution. Ensuring successful execution of HPC programs has become an issue that unprivileged users should be concerned with. From the user perspective, if a program failure cannot be detected and handled in time, it wastes resources and delays the progress of program execution. Unfortunately, unprivileged users are unable to perform program state checking, because execution is controlled by the job management system and their privileges are limited. Currently, automated tools for supporting user-level failure detection and auto-recovery of parallel programs in HPC systems are missing. This paper proposes an innovative method for unprivileged users to achieve failure detection of job execution and automatic resubmission of failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach. We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experimental results show that ARL can detect execution failures effectively on the Tianhe-2 system. In addition, the communication and performance overhead caused by ARL is negligible. The good scalability of ARL makes it applicable to large-scale HPC systems.
Keywords: high performance computing; parallel program; failure detection; failure auto-recovery
Accelerating the cryo-EM structure determination in RELION on GPU cluster
8
Authors: Xin YOU, Hailong YANG, Zhongzhi LUAN, Depei QIAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2022, Issue 3, pp. 21-39 (19 pages)
Cryo-electron microscopy (cryo-EM) is one of the most powerful technologies available today for structural biology. RELION (Regularized Likelihood Optimization), one of the most widely used software packages in this field, implements a Bayesian algorithm for cryo-EM structure determination. Many researchers have devoted effort to improving the performance of RELION to keep up with the ever-increasing volume of datasets to be analyzed. In this paper, we focus on performance analysis of the most time-consuming computation steps in RELION and identify their performance bottlenecks for specific optimizations. We propose several performance optimization strategies to improve the overall performance of RELION, including optimization of the expectation step, parallelization of the maximization step, acceleration of the computation of symmetries, and memory affinity optimization. Experimental results show that our proposed optimizations achieve significant speedups of RELION across representative datasets. In addition, we perform roofline model analysis to understand the effectiveness of our optimizations.
Keywords: cryo-EM; structure determination; performance optimization; GPU acceleration; RELION