Funding: Project supported by the National Key Research and Development Program of China (Grant Nos. 2017YFA0303202 and 2016YFA0300803), the National Natural Science Foundation of China (Grant Nos. 11904194, 11727808, and 11674159), and the Fundamental Research Funds for the Central Universities, China (Grant No. 020414380121).
Abstract: The spin Hall magnetoresistance (SMR) effect in Pt/Gd3Fe5O12 (GdIG) bilayers was systematically investigated. The sign of the SMR changes twice with increasing magnetic field in the vicinity of the magnetization compensation point (T_M) of GdIG. However, conventional SMR theory predicts an invariant SMR sign in a heterostructure composed of a heavy metal film in contact with a ferromagnetic or antiferromagnetic film. We conclude that the sign change arises because a relatively large field significantly enhances the magnetic moment of the Gd sub-lattice while leaving the moment of the Fe sub-lattice unchanged, so that a small net magnetic moment is induced at T_M. As a result, the Néel vector aligns with the field after the spin-flop transition, producing a bi-reorientation of the Néel vector. Theoretical calculations based on Néel's theory and SMR theory also support our conclusions. Our findings indicate that the Néel-vector direction of a ferrimagnet can be tuned across a wide range by a relatively low external field around T_M.
Funding: Supported by the National Key Research and Development Program of China (No. 2020YFB1506703), the National Natural Science Foundation of China (Grant Nos. 62072018 and 61732002), the State Key Laboratory of Software Development Environment (No. SKLSDE-2021ZX-06), and the Fundamental Research Funds for the Central Universities.
Abstract: The flourishing of deep learning frameworks and hardware platforms demands an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor is a competitive candidate owing to its attractive computational power in both scientific computing and deep learning workloads. This paper combines the trends in these two directions. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures that require cross-compilation, such as Sunway. In addition, we leverage architecture features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory for data locality, in order to generate efficient code for deep learning workloads on Sunway. The experimental results show that the code generated by swTVM achieves a 1.79× improvement in inference latency on average compared to the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and the Sunway processor, particularly with productivity and efficiency in mind. We believe this work will encourage more people to embrace the power of deep learning and the Sunway many-core processor.
Funding: Supported by the National Key Research and Development Program of China (2020YFB1506703), the National Natural Science Foundation of China (Grant Nos. 62072018 and 61732002), and the State Key Laboratory of Software Development Environment (SKLSDE-2021ZX-06).
Abstract: Although matrix multiplication plays an essential role in a wide range of applications, previous works focus only on optimizing dense or sparse matrix multiplication. The Sparse Approximate Matrix Multiply (SpAMM) is an algorithm that accelerates the multiplication of decay matrices, whose sparsity lies between that of dense and sparse matrices. In addition, large-scale decay matrix multiplication is performed in scientific applications to solve cutting-edge problems. To optimize large-scale decay matrix multiplication using SpAMM on supercomputers such as Sunway TaihuLight, we present swSpAMM, an optimized SpAMM algorithm that adapts the computation characteristics to the architecture features of Sunway TaihuLight. Specifically, we propose both intra-node and inter-node optimizations to accelerate swSpAMM for large-scale execution. For intra-node optimizations, we explore algorithm parallelization and a block-major data layout that are tailored to better utilize the architectural advantages of the Sunway processor. For inter-node optimizations, we propose a matrix organization strategy for better distributing sub-matrices across nodes and a dynamic scheduling strategy for improving load balance across nodes. We compare swSpAMM with the existing GEMM library on a single node as well as with large-scale matrix multiplication methods on multiple nodes. The experimental results show that swSpAMM achieves speedups of up to 14.5× and 2.2× when compared to the xMath library on a single node and the 2D GEMM method on multiple nodes, respectively.
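The general SpAMM idea mentioned in the abstract above can be illustrated with a minimal sketch. It shows only the textbook recursion (split each matrix into quadrants and skip any sub-product whose Frobenius-norm product falls below a threshold), not the swSpAMM implementation or any of its Sunway-specific optimizations. The function names, the `tau` threshold parameter, the `leaf` block size, and the power-of-two matrix size are assumptions made for illustration.

```python
import math


def frob_norm(M, r0, c0, n):
    """Frobenius norm of the n x n block of M whose top-left corner is (r0, c0)."""
    return math.sqrt(sum(M[r0 + i][c0 + j] ** 2
                         for i in range(n) for j in range(n)))


def spamm(A, B, C, ra, ca, rb, cb, n, tau, leaf=2):
    """Accumulate (block of A) @ (block of B) into C, pruning small products.

    The A block starts at (ra, ca), the B block at (rb, cb), and the result
    is added to the C block starting at (ra, cb). Norms are recomputed on
    every call for simplicity; a practical code would cache them per block.
    """
    # Prune: ||A_blk @ B_blk||_F <= ||A_blk||_F * ||B_blk||_F < tau.
    if frob_norm(A, ra, ca, n) * frob_norm(B, rb, cb, n) < tau:
        return
    if n <= leaf:
        # Small block: plain dense multiply-accumulate.
        for i in range(n):
            for j in range(n):
                C[ra + i][cb + j] += sum(A[ra + i][ca + k] * B[rb + k][cb + j]
                                         for k in range(n))
        return
    h = n // 2
    # Eight quadrant products: C_ij += A_ik @ B_kj.
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                spamm(A, B, C, ra + i, ca + k, rb + k, cb + j, h, tau, leaf)


def spamm_multiply(A, B, tau):
    """Approximate A @ B for square matrices whose size is a power of two."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    spamm(A, B, C, 0, 0, 0, 0, n, tau)
    return C


# Hypothetical decay matrix: entries shrink exponentially away from the diagonal.
n = 8
A = [[math.exp(-abs(i - j)) for j in range(n)] for i in range(n)]
C_approx = spamm_multiply(A, A, tau=1e-3)
```

With `tau = 0` nothing is pruned and the result is the exact product up to floating-point rounding; larger values of `tau` trade accuracy for skipped work, which is what makes the approach attractive for decay matrices.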
Funding: Supported by the National Natural Science Foundation of China (Grant No. 81972492); the National Science Fund for Young Scholars (Grant No. 21904107); the Zhejiang Provincial Natural Science Foundation for Distinguished Young Scholars (Grant No. LR19C050001); the Hangzhou Agriculture and Society Advancement Program (Grant No. 20190101A04); the Westlake Startup Grant; research funds from the National Cancer Centre Singapore and Singapore General Hospital, Singapore; the National Key R&D Program of China (Grant No. 2016YFC0901704); the Zhejiang Innovation Discipline Project of Laboratory Animal Genetic Engineering (Grant No. 201510); the Netherlands Cancer Society (Grant No. NKI 2014-6651); and the Netherlands Organization for Scientific Research (NWO)-Middelgroot (Grant No. 91116017).
Abstract: To address the increasing need for detecting and validating protein biomarkers in clinical specimens, mass spectrometry (MS)-based targeted proteomic techniques, including selected reaction monitoring (SRM), parallel reaction monitoring (PRM), and massively parallel data-independent acquisition (DIA), have been developed. For optimal performance, they require the fragment ion spectra of targeted peptides as prior knowledge. In this report, we describe an MS pipeline and spectral resource to support targeted proteomics studies for human tissue samples. To build the spectral resource, we integrated common open-source MS computational tools to assemble a freely accessible computational workflow based on Docker. We then applied the workflow to generate DPHL, a comprehensive DIA pan-human library, from 1096 data-dependent acquisition (DDA) MS raw files for 16 types of cancer samples. This extensive spectral resource was then applied to a proteomic study of 17 prostate cancer (PCa) patients. Thereafter, PRM validation was applied to a larger study of 57 PCa patients, and the differential expression of three proteins in prostate tumors was validated. As a second application, the DPHL spectral resource was applied to a study consisting of plasma samples from 19 diffuse large B cell lymphoma (DLBCL) patients and 18 healthy control subjects. Differentially expressed proteins between DLBCL patients and healthy control subjects were detected by DIA-MS and confirmed by PRM. These data demonstrate that DPHL supports DIA and PRM MS pipelines for robust protein biomarker discovery. DPHL is freely accessible at https://www.iprox.org/page/project.html?id=IPX0001400000.
Funding: This work was supported by the 863 Program of China (2012AA010902), the National Natural Science Foundation of China (Grant Nos. 61202425, 61133004 and 61361126011), the State Key Laboratory of Software Development Environment (SKLSDE-2013ZX-22), and the Fundamental Research Funds for the Central Universities.
Abstract: There has been growing concern about the energy consumption and environmental impact of datacenters. Some pioneers have begun to power datacenters with renewable energy to offset their carbon footprint. However, it is challenging to integrate intermittent renewable energy into the datacenter power system. Grid-tied systems are widely deployed in renewable-energy-powered datacenters, but the drawbacks of the grid-tie inverter (e.g., harmonic disturbance and high cost) hamper this design. Besides, the mixture of green load and brown load makes power management depend heavily on software measurement and monitoring, which often suffers from inaccuracy. We propose DualPower, a novel power provisioning architecture that enables green datacenters to integrate renewable power supply without grid-tie inverters. To optimize DualPower operation, we propose a specially designed power management framework to coordinate workload balancing with power supply switching. We evaluate three optimization schemes (LM, PS and JO) under different datacenter operation scenarios on our trace-driven simulation platform. The experimental results show that DualPower can be as efficient as a grid-tied system and has good scalability. In contrast to previous works, DualPower integrates renewable power at lower cost and maintains full availability of datacenter servers.
Funding: National Key Research and Development Program of China (2016YFB1000503), the National Natural Science Foundation of China (Grant Nos. 61133004, 61361126011, 61502019, 61732002, 61373081, 61772322), the China Postdoctoral Science Foundation (2017M622263), and the Natural Science Foundation of Shandong Province (ZR2015PF006).
Abstract: Workload consolidation is a common method to improve resource utilization in clusters or data centers. To achieve efficient workload consolidation, the runtime characteristics of a program should be taken into consideration in scheduling. In this paper, we propose a novel index system for efficiently describing program runtime characteristics. With the help of this index system, programs can be classified by the following runtime characteristics: 1) dependence on multi-dimensional resources including CPU, disk I/O, memory and network I/O; and 2) impact on and vulnerability to resource sharing, embodied by resource usage and resource sensitivity. To verify the effectiveness of this index system in workload consolidation, we propose Sche-index, a scheduling strategy that uses the new index system for workload consolidation. Experimental results show that, compared with the traditional least-loaded scheduling strategy, Sche-index can significantly improve both program performance and system resource utilization.
Funding: Supported by the National Key R&D Program of China (2020YFB1506703), the National Natural Science Foundation of China (Grant No. 62072018), and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing (2019A12).
Abstract: Cryo-electron microscopy (cryo-EM) is one of the most powerful technologies available today for structural biology. RELION (Regularized Likelihood Optimization), one of the most widely used software packages in this field, implements a Bayesian algorithm for cryo-EM structure determination. Many researchers have devoted effort to improving the performance of RELION to keep pace with the ever-increasing volume of datasets. In this paper, we focus on performance analysis of the most time-consuming computation steps in RELION and identify their performance bottlenecks for specific optimizations. We propose several performance optimization strategies to improve the overall performance of RELION, including optimization of the expectation step, parallelization of the maximization step, acceleration of the computation of symmetries, and memory affinity optimization. The experimental results show that our proposed optimizations achieve significant speedups of RELION across representative datasets. In addition, we perform roofline model analysis to understand the effectiveness of our optimizations.
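For readers unfamiliar with the roofline model mentioned above, a minimal sketch of its core formula follows: attainable performance is the smaller of the machine's peak compute rate and the product of memory bandwidth and arithmetic intensity. The function names and all hardware numbers here are hypothetical and are not taken from the paper.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: min(peak compute, memory bandwidth x arithmetic intensity).

    intensity is measured in floating-point operations per byte moved.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine: 1000 GFLOP/s peak compute, 100 GB/s memory bandwidth.
PEAK, BW = 1000.0, 100.0
low = attainable_gflops(PEAK, BW, 2.0)    # memory-bound kernel
high = attainable_gflops(PEAK, BW, 20.0)  # compute-bound kernel
```

A kernel whose arithmetic intensity lies below the ridge point is limited by memory traffic, so optimizations such as improved memory affinity are the ones expected to help; above the ridge point, only compute-side optimizations can raise performance.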