In this paper, a typical experiment is carried out based on a high-resolution air-sea coupled model, namely, the coupled ocean-atmosphere-wave-sediment transport (COAWST) model, on both heterogeneous many-core (SW) and homogeneous multicore (Intel) supercomputing platforms. We construct a hindcast of Typhoon Lekima on both the SW and Intel platforms, compare the simulation results between these two platforms, and compare the key elements of the atmospheric and ocean modules to reanalysis data. The comparative experiment in this typhoon case indicates that the domestic many-core computing platform and the general cluster yield almost no differences in the simulated typhoon path and intensity, and the differences in surface pressure (PSFC) in the WRF model and sea surface temperature (SST) in the short-range forecast are very small, whereas a major difference can be identified at high latitudes after the first 10 days. Further heat budget analysis verifies that the differences in SST after 10 days are mainly caused by shortwave radiation variations, as influenced by subsequently generated typhoons in the system. These typhoons generated in the hindcast after the first 10 days attain obviously different trajectories between the two platforms.
Recent architectures of multi-core systems may have a relatively large number of cores, typically ranging from tens to hundreds; such systems are therefore called many-core systems. They require an efficient interconnection network that addresses two major problems. The first is the overhead of power and area cost and its effect on scalability. The second is the high access latency caused by multiple cores' simultaneous accesses to the same shared module. This paper presents an interconnection scheme called N-conjugate Shuffle Clusters (NCSC), based on a multi-core multi-cluster architecture, to reduce the overhead of these two problems. NCSC eliminates the need for router devices and their complexity and hence reduces the power and area costs. It also redesigns and distributes the shared caches across the interconnection network to increase the ability for simultaneous access and hence reduce the access latency. For intra-cluster communication, Multi-port Content Addressable Memory (MPCAM) is used. The experimental results using four clusters of four cores each indicate that the average access latency for a write process is 1.14785±0.04532 ns, which is nearly equal to the latency of a write operation in MPCAM. Moreover, the average read latency is 1.26226±0.090591 ns within a cluster and around 1.92738±0.139588 ns for read access between cores from different clusters.
With the development of computer technology, network bandwidth and network traffic continue to increase. Given the large data flow, it is imperative to inspect network packets effectively. To find a deep packet inspection solution suited to the current network environment, this paper builds a deep packet inspection system on a many-core platform and thereby verifies that such a system can be implemented on a many-core platform with both high performance and low consumption. Testing and analysis of the system performance show that deep packet inspection based on the many-core platform TILE_Gx36 [1] [2] can process network traffic with a bandwidth of up to 4 Gbps. To a certain extent, this is an improvement over most current deep packet inspection systems based on the X86 platform.
Simulators are generally used during the design of computer architectures. Typically, different simulators with different levels of complexity, speed and accuracy are used. However, for early design space exploration, simulators with less complexity, high simulation speed and reasonable accuracy are desired. It is also required that these simulators have a short development time and that changes in the design require less effort in the implementation in order to perform experiments and see the effects of changes in the design. These simulators are termed high-level simulators in the context of computer architecture. In this paper, we present multiple levels of abstractions in a high-level simulation of a general-purpose many-core system, where the objective of every level is to improve the accuracy in simulation without significantly affecting the complexity and simulation speed.
Moore's law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to attack this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents and hardware-supported synchronization mechanisms, which together enable highly efficient use of on-chip resources. On the other hand, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software cooperative design methodology bridges high-end computing and mass programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.
Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that impact high-performance computing applications. We provide insights into the memory-relevant performance behaviours of the Phytium 2000+ system through micro-benchmarking. With the help of the well-known roofline model, we analyze the Phytium 2000+ system, taking both memory accesses and computations into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.
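The roofline model mentioned in the abstract above bounds attainable performance by the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity. The following is a minimal sketch of that bound only; the function names and all numbers used with it are hypothetical and are not measurements of Phytium 2000+.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity).

    peak_gflops   -- peak floating-point rate in GFLOP/s (hypothetical input)
    bandwidth_gbs -- sustained memory bandwidth in GB/s (hypothetical input)
    intensity     -- arithmetic intensity of the kernel in FLOP/byte
    """
    if peak_gflops <= 0 or bandwidth_gbs <= 0 or intensity < 0:
        raise ValueError("peak and bandwidth must be positive, intensity non-negative")
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs
```

With an assumed peak of 500 GFLOP/s and 100 GB/s of bandwidth, a kernel at 0.25 FLOP/byte is memory-bound, while one at 10 FLOP/byte reaches the compute roof; the ridge point sits at 5 FLOP/byte.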
As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve the yield of N-core processors by providing M spare cores. In such an architecture, topology is an important factor because it greatly affects the processor's performance. The concept of logical topology and a topology reconfiguration problem are introduced, so that a target topology can be provided transparently with the lowest performance degradation in the presence of faulty cores on chip. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that RRCS can give solutions with an average degradation of 13.8% in negligible computing time.
The short-range pair interaction consumes most of the CPU time in molecular dynamics (MD) simulations. The inherent computation sparsity makes it challenging to achieve a high-performance kernel on the emerging many-core architecture. In this paper, we present a highly efficient short-range force kernel on the Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by poor data locality and write conflicts. To enhance the data locality, we adopt a super cluster based neighbor list with an appropriate granularity that fits in the local memory of computing cores. In the absence of a low-overhead locking mechanism, using a data-privatization force array is a more feasible method to avoid write conflicts, but it results in a large data reduction overhead. We adopt a dual-slice partitioning scheme for both hardware resources and computing tasks, which utilizes on-chip data communication to reduce the data reduction overhead and provide load balancing. Moreover, we exploit single instruction multiple data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel obtains a performance speedup of 226x compared with the reference implementation and achieves 20% of the peak flop rate on the Sunway many-core processor.
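The data-privatization idea described in the abstract above can be sketched in a few lines: each worker accumulates into its own private force array, so no two workers ever write the same memory, and a reduction step then sums the private copies. This is a simplified sequential illustration with scalar (1-D) forces and a round-robin pair assignment, both of which are assumptions for the example; it does not reproduce the paper's dual-slice partitioning or any Sunway-specific mechanism.

```python
def pair_forces_privatized(pairs, n_particles, n_workers):
    """Accumulate pair forces using one private array per worker.

    pairs -- list of (i, j, f): force f acts on particle i, -f on particle j
    Returns the reduced per-particle force list.
    """
    # One private force array per (simulated) worker avoids write conflicts.
    private = [[0.0] * n_particles for _ in range(n_workers)]
    for k, (i, j, f) in enumerate(pairs):
        w = k % n_workers          # hypothetical round-robin work assignment
        private[w][i] += f
        private[w][j] -= f         # Newton's third law
    # Reduction: this extra pass is the overhead the abstract refers to.
    return [sum(private[w][p] for w in range(n_workers))
            for p in range(n_particles)]
```

The reduction cost grows with the number of workers times the number of particles, which is why reducing it matters on a processor with many cores.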
OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus has been extensively used. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. Typically, the use of OpenCL's local memory on multi-core/many-core CPUs may lead to the opposite performance effect, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that transforms GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable with the corresponding OpenMP performance.
Cache performance is a critical design constraint for modern many-core systems. Since the cache often works in a "black-box" manner, it is difficult for the software to reason about the cache behavior to match the running software to the underlying hardware. To better support code optimization, we need to understand and characterize the cache behavior. While cache performance characterization is heavily studied on traditional x86 architectures, there is little work for understanding the cache implementations on emerging ARMv8-based many-cores. This paper presents a comprehensive study to evaluate the cache architecture design on three representative ARMv8 multi-cores, Phytium 2000+, ThunderX2, and Kunpeng 920 (KP920). To this end, we develop wrBench, a micro-benchmark suite to measure the realized latency and bandwidth of caches at different memory hierarchies when performing core-to-core communication. Our evaluation provides inter-core latency and bandwidth in different cache levels and coherency states for the three ARMv8 many-cores. The quantitative performance data is shown in tables. We mine the characteristics of caches and coherency protocols by analyzing the data for the three processors, Phytium 2000+, ThunderX2, and KP920. Our paper also provides discussions and guidelines for optimizing memory access on ARMv8 many-cores.
Purpose – The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs). Design/methodology/approach – For distributed genetic algorithm (GA) models, the paper proposes a method where an island's ID number is added to the header of data transferred by this island for use in fault detection. Findings – The paper shows that the processing time of the proposed idea is practically negligible in applications, that an optimal solution can be obtained even with a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults. Originality/value – The study described in this paper is a new approach to increase the sustainability of application programs using distributed GA on GPUs and MCPs.
Unchecked breast cell growth, the cause of breast cancer, is one of the leading causes of death in women globally. The only method to avoid breast cancer-related deaths is through early detection and treatment. The proper classification of malignancies is one of the most significant challenges in the medical industry. Due to their high precision and accuracy, machine learning techniques are extensively employed for identifying and classifying various forms of cancer. The author of this review studied and implemented several data mining algorithms and compared their parameters and accuracy for breast cancer diagnosis, so that clinicians might use them to accurately detect cancer cells early on. This article introduces several techniques, including support vector machine (SVM), K star (K∗) classifier, Additive Regression (AR), Back Propagation Neural Network (BP), and Bagging. These algorithms are trained using a set of data that contains tumor parameters from breast cancer patients. Comparing the results, the author found that Support Vector Machine and Bagging had the highest precision and accuracy, respectively. The review also assesses the number of studies that provide machine learning techniques for breast cancer detection.
A discord is a refinement of the concept of an anomalous subsequence of a time series. Being one of the topical issues of time series mining, discord discovery is applied in a wide range of real-world areas (medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, etc.). In this article, we propose a novel parallel algorithm for discord discovery on a high-performance cluster with nodes based on many-core accelerators in the case when the time series cannot fit in the main memory. We assume that the time series is partitioned across the cluster nodes and achieve parallelization among the cluster nodes as well as within a single node. Within a cluster node, the algorithm employs a set of matrix data structures to store and index the subsequences of a time series, and to provide an efficient vectorization of computations on the accelerator. At each node, the algorithm processes its own partition and proceeds in two phases, namely candidate selection and discord refinement, with each phase requiring one linear scan through the partition. Then the local discords found are combined into the global candidate set and transmitted to each cluster node. Next, each node refines the global candidate set over its own partition, resulting in the local true discord set. Finally, the global true discord set is constructed as the intersection of the local true discord sets. The experimental evaluation on a real computer cluster with real and synthetic time series shows a high scalability of the proposed algorithm.
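To make the notion of a discord concrete, the sketch below finds the top-1 discord by brute force: the subsequence whose nearest non-overlapping neighbor is farthest away. It is a quadratic-time reference definition only, not the two-phase parallel algorithm of the abstract above; the plain (non-normalized) Euclidean distance and the function names are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_discord(series, m):
    """Return (start_index, distance) of the top-1 discord of length m.

    A subsequence only competes against non-overlapping ones (|i - j| >= m),
    which excludes trivial matches with its own shifted copies.
    """
    n = len(series) - m + 1
    if m < 1 or n < 1:
        raise ValueError("subsequence length must fit inside the series")
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) >= m:
                d = euclidean(series[i:i + m], series[j:j + m])
                nearest = min(nearest, d)
        # Keep the subsequence whose nearest neighbor is farthest away.
        if nearest != math.inf and nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist
```

On a periodic series with a single spike, every regular subsequence has an exact match elsewhere (distance 0), so the reported discord is one of the windows covering the spike.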
Increasing needs for the study of complex dynamical systems require computing solutions of a large number of ordinary and partial differential time-dependent equations in near real-time. Numerical integration algorithms, which are computationally expensive and inherently sequential, are typically used to compute solutions of ordinary and partial differential time-dependent equations. This presents challenges to studying complex dynamical systems in near real-time. This paper examines the challenges of computing solutions of ordinary differential time-dependent equations using the Parareal algorithm, which belongs to the class of parallel-in-time algorithms, on various high-performance computing accelerator-based architectures and associated programming models. The paper presents the code refactoring steps and performance analysis of the Parareal algorithm on two accelerator computing architectures, the Intel Xeon Phi CPU and Graphics Processing Unit many-core architectures, with the OpenMP, OpenACC, and CUDA programming models. The speedup and scaling performance analysis are used to demonstrate the suitability of Parareal to compute the solutions of a single ordinary differential time-dependent equation and a family of interdependent ordinary differential time-dependent equations. The speedup, weak and strong scaling results demonstrate the suitability of Graphics Processing Units with the CUDA programming model as the most efficient accelerator for computing solutions of ordinary differential time-dependent equations using parallel-in-time algorithms. Considering the time and effort required to refactor the code for execution on the accelerator architectures, the Graphics Processing Unit with the OpenACC programming model is the most efficient accelerator for computing solutions of ordinary differential time-dependent equations using parallel-in-time algorithms.
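The Parareal iteration referred to in the abstract above combines a cheap coarse propagator G with an accurate fine propagator F through the update U[n+1] = G(U_new[n]) + F(U_old[n]) - G(U_old[n]). The sketch below is a sequential illustration for a scalar ODE, assuming explicit Euler for both propagators (one step for G, `fine_steps` steps for F); these choices and all names are assumptions for the example, not the paper's implementation. In a real run the fine solves inside each iteration would execute in parallel, one per time slice.

```python
def euler(f, y, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with explicit Euler."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        y = y + h * f(t, y)
        t += h
    return y


def parareal(f, y0, t0, t1, slices, iterations, fine_steps=100):
    """Return (slice boundaries, solution values at those boundaries)."""
    ts = [t0 + (t1 - t0) * n / slices for n in range(slices + 1)]

    def coarse(y, a, b):
        return euler(f, y, a, b, 1)

    def fine(y, a, b):
        return euler(f, y, a, b, fine_steps)

    # Initial guess: one sequential sweep with the coarse propagator.
    u = [y0]
    for n in range(slices):
        u.append(coarse(u[n], ts[n], ts[n + 1]))

    for _ in range(iterations):
        # These per-slice solves are independent: the parallel-in-time part.
        f_old = [fine(u[n], ts[n], ts[n + 1]) for n in range(slices)]
        g_old = [coarse(u[n], ts[n], ts[n + 1]) for n in range(slices)]
        # Sequential correction sweep using only the cheap coarse propagator.
        new = [y0]
        for n in range(slices):
            new.append(coarse(new[n], ts[n], ts[n + 1]) + f_old[n] - g_old[n])
        u = new
    return ts, u
```

After as many iterations as there are time slices, the result matches the sequential fine solution up to rounding, so useful speedup requires convergence in far fewer iterations than slices.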
The Weighted Essentially Non-Oscillatory (WENO) scheme is a method for solving hyperbolic conservation laws, suitable for solving high-density fluid interface instability with strong intermittency. These problems have a large and complex flow structure. To fully utilize the computing power of High Performance Computing (HPC) systems, it is necessary to develop specific methodologies to optimize the performance of applications based on the particular system's architecture. The Sunway TaihuLight supercomputer is currently ranked as the fastest supercomputer in the world. This article presents a heterogeneous parallel algorithm design and performance optimization of a high-order WENO on Sunway TaihuLight. We analyzed the characteristics of kernel functions and proposed an appropriate heterogeneous parallel model. We also determined the best division strategy for computing tasks and implemented the parallel algorithm on Sunway TaihuLight. By using access optimization, data dependency elimination, and vectorization optimization, our parallel algorithm can achieve up to 172× speedup on a single node, and an additional 58× speedup on 64 nodes, with nearly linear scalability.
Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel(R) Many Integrated Core (MIC) Architecture provides not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely-studied algorithm; however, the conventional algorithm needs to traverse the data three times. In each pass, it computes multiple 1D FFTs along one of three dimensions, giving rise to plenty of strided memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the remaining dimensions respectively. The difference in the amount of TLB misses resulting from decomposition along different dimensions is analyzed in detail. [...]-level parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse of loc[...]. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance the performance. We evaluate the algorithm on the Intel(R) Xeon Phi(TM) coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which beats the vendor-specific Intel(R) MKL library by a factor of up to 2.22X.
Moore's law continues to grant computer architects ever more transistors in the foreseeable future, and parallelism is the key to continued performance scaling in modern microprocessors. In this paper, the achievements of our research project on parallel architecture, which is supported by the National Basic Research 973 Program of China, are systematically presented. The innovative approaches and techniques to solve the significant problems in parallel architecture design are summarized, including architecture-level optimization, compiler and language-supported technologies, reliability, power-performance efficient design, test and verification challenges, and platform building. Two prototype chips, a multi-heavy-core Godson-3 and a many-light-core Godson-T, are described to demonstrate the highly scalable and reconfigurable parallel architecture designs. We also present some of our achievements appearing in ISCA, MICRO, ISSCC, HPCA, PLDI, PACT, IJCAI, Hot Chips, DATE, IEEE Trans. VLSI, IEEE Micro, IEEE Trans. Computers, etc.
The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers are faced with a unique opportunity to introduce new architecture features as well as adequate compiler technology; together they may have a profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amenable performance optimization techniques on a many-core architecture, Godson-T. The Godson-T architecture has several unique features that are chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; 2) fine-grain memory-based synchronization (e.g., full-empty bits for fine-grain synchronization). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance in both execution time speedup and scalability, validates the value of the globally accessed SPM and the fine-grain synchronization mechanism (full-empty bits) under Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
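For readers unfamiliar with the kernel used in the case study above, the following is a minimal reference version of a 1-D Jacobi computation with double buffering. The three-point average with weight 1/3 and fixed boundary values are assumptions for this sketch; the abstract does not give the exact stencil coefficients, and none of the Godson-T optimizations (tiling, SPM placement, full-empty-bit synchronization) are shown.

```python
def jacobi_1d(u, steps):
    """Apply `steps` Jacobi sweeps of a 3-point average to the list u.

    Boundary elements u[0] and u[-1] are kept fixed. Every interior update
    reads only values from the previous sweep, which is what makes the
    interior loop parallelizable.
    """
    cur = list(u)
    nxt = list(u)          # second buffer; boundaries are already in place
    for _ in range(steps):
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur, nxt = nxt, cur  # swap buffers instead of copying
    return cur
```

Because each sweep depends on the whole previous sweep, a naive parallel version needs a global barrier per time step; tiling across time steps is one way to reduce that synchronization.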
During the era of global warming and highly urbanized development, extreme and high-impact weather as well as air pollution incidents influence everyday life and might even cause incalculable loss of life and property. Despite the vast development of atmospheric models, substantial numerical forecast biases still exist. To accurately predict extreme weather, severe air pollution, and abrupt climate change, a numerical atmospheric model must not only simulate meteorology and atmospheric composition simultaneously, involving many sophisticated physical and chemical processes, but also do so at high spatiotemporal resolution. Global integrated atmospheric simulation at spatial resolutions of a few kilometers remains challenging due to its intensive computational and input/output (I/O) requirements. Through multi-dimension-parallelism structuring, aggressive and finer-grained optimizing, manual vectorizing, and parallelized I/O fragmenting, an integrated Atmospheric Model Across Scales (iAMAS) was established on the new Sunway supercomputer platform to significantly increase the computational efficiency and reduce the I/O cost. The global 3-km atmospheric simulation for meteorology with online integrated aerosol feedbacks with iAMAS was scaled to 39,000,000 processor cores and achieved a speed of 0.82 simulation day per hour (SDPH) with routine I/O, which enabled us to perform a 5-day global weather forecast at 3-km horizontal resolution with online natural aerosol impacts. The results demonstrate the promising prospect that increasing the spatial resolution to a few kilometers with online integrated aerosol feedbacks may significantly improve the global weather forecast.
Funding: This work is supported by the National Key Research and Development Plan program of the Ministry of Science and Technology of China (No. 2016YFB0201100). Additionally, this work is supported by the National Laboratory for Marine Science and Technology (Qingdao) Major Project of the Aoshan Science and Technology Innovation Program (No. 2018ASKJ01-04) and the Open Foundation of the Key Laboratory of Marine Science and Numerical Simulation, Ministry of Natural Resources (No. 2021-YB-02).
文摘In this paper,a typical experiment is carried out based on a high-resolution air-sea coupled model,namely,the coupled ocean-atmosphere-wave-sediment transport(COAWST)model,on both heterogeneous many-core(SW)and homogenous multicore(Intel)supercomputing platforms.We construct a hindcast of Typhoon Lekima on both the SW and Intel platforms,compare the simulation results between these two platforms and compare the key elements of the atmospheric and ocean modules to reanalysis data.The comparative experiment in this typhoon case indicates that the domestic many-core computing platform and general cluster yield almost no differences in the simulated typhoon path and intensity,and the differences in surface pressure(PSFC)in the WRF model and sea surface temperature(SST)in the short-range forecast are very small,whereas a major difference can be identified at high latitudes after the first 10 days.Further heat budget analysis verifies that the differences in SST after 10 days are mainly caused by shortwave radiation variations,as influenced by subsequently generated typhoons in the system.These typhoons generated in the hindcast after the first 10 days attain obviously different trajectories between the two platforms.
文摘Recent architectures of multi-core systems may have a relatively large number of cores that typically ranges from tens to hundreds;therefore called many-core systems.Such systems require an efficient interconnection network that tries to address two major problems.First,the overhead of power and area cost and its effect on scalability.Second,high access latency is caused by multiple cores’simultaneous accesses of the same shared module.This paper presents an interconnection scheme called N-conjugate Shuffle Clusters(NCSC)based on multi-core multicluster architecture to reduce the overhead of the just mentioned problems.NCSC eliminated the need for router devices and their complexity and hence reduced the power and area costs.It also resigned and distributed the shared caches across the interconnection network to increase the ability for simultaneous access and hence reduce the access latency.For intra-cluster communication,Multi-port Content Addressable Memory(MPCAM)is used.The experimental results using four clusters and four cores each indicated that the average access latency for a write process is 1.14785±0.04532 ns which is nearly equal to the latency of a write operation in MPCAM.Moreover,it was demonstrated that the average read latency within a cluster is 1.26226±0.090591 ns and around 1.92738±0.139588 ns for read access between cores from different clusters.
Abstract: With the development of computer technology, network bandwidth and network traffic continue to increase. Given the large data flow, it is imperative to inspect network packets effectively. To find a deep packet inspection solution suited to the current network environment, this paper builds a deep packet inspection system on a many-core platform, and in this way verifies the feasibility of implementing a deep packet inspection system on a many-core platform with both high performance and low power consumption. Testing and analysis of the system performance show that deep packet inspection based on the many-core platform TILE_Gx36 [1] [2] can process network traffic at bandwidths of up to 4 Gbps. To a certain extent, the performance is improved compared with most current deep packet inspection systems based on the X86 platform.
Abstract: Simulators are generally used during the design of computer architectures. Typically, different simulators with different levels of complexity, speed and accuracy are used. However, for early design space exploration, simulators with low complexity, high simulation speed and reasonable accuracy are desired. It is also required that these simulators have a short development time and that design changes take little effort to implement, so that experiments can be performed and the effects of the changes observed. These simulators are termed high-level simulators in the context of computer architecture. In this paper, we present multiple levels of abstraction in a high-level simulation of a general-purpose many-core system, where the objective of every level is to improve the simulation accuracy without significantly affecting the complexity and simulation speed.
Funding: Supported by the National Basic Research 973 Program of China under Grant No. 2005CB321600, the National High-Tech Research and Development 863 Program of China under Grant No. 2009AA01Z103, the National Natural Science Foundation of China under Grant No. 60736012, the National Science Fund for Distinguished Young Scholars under Grant No. 60925009, and the Beijing Natural Science Foundation under Grant No. 4092044.
Abstract: Moore's law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to attack this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents and hardware-supported synchronization mechanisms, to realize the full potential for highly efficient on-chip resource utilization. On the other hand, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software cooperative design methodology bridges high-end computing and mainstream programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.
Abstract: Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
Funding: The National Key Research and Development Program of China under Grant No. 2018YFB0204301 and the National Natural Science Foundation of China under Grant Nos. 61972408 and 61602501.
Abstract: This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that impact high-performance computing applications. We provide insights into the memory-relevant performance behaviours of the Phytium 2000+ system through micro-benchmarking. With the help of the well-known roofline model, we analyze the Phytium 2000+ system, taking both memory accesses and computations into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.
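The roofline model mentioned in this abstract bounds a kernel's attainable performance by the lower of the compute peak and the product of memory bandwidth and arithmetic intensity. The sketch below illustrates that bound only; the peak and bandwidth figures are hypothetical placeholders, not measurements of Phytium 2000+, and the function names are chosen for illustration.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity):
    """Roofline bound: the lower of the compute roof and the memory roof.

    intensity is the arithmetic intensity in FLOPs per byte moved.
    """
    return min(peak_gflops, bandwidth_gb_s * intensity)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gb_s


if __name__ == "__main__":
    # Hypothetical machine parameters, for illustration only.
    PEAK, BANDWIDTH = 500.0, 100.0
    for name, ai in [("low-intensity kernel", 0.25), ("high-intensity kernel", 16.0)]:
        bound = attainable_gflops(PEAK, BANDWIDTH, ai)
        regime = "memory-bound" if ai < ridge_point(PEAK, BANDWIDTH) else "compute-bound"
        print(f"{name}: at most {bound} GFLOPS ({regime})")
```

Kernels whose intensity falls below the ridge point are limited by memory bandwidth, which is why a study of this kind pairs the model with cache and memory micro-benchmarks.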
Funding: The National Natural Science Foundation of China (Nos. 60633060, 60606008, and 60576031), the National Key Basic Research and Development (973) Program of China (Nos. 2005CB321604 and 2005CB321605), and the fund of the Chinese Academy of Sciences (No. 20074010) due to the President Scholarship.
Abstract: As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve the yield of N-core processors by providing M spare cores. In such an architecture, topology is an important factor because it greatly affects the processors' performance. The concept of logical topology and a topology reconfiguration problem are introduced; the goal is to transparently provide the target topology with the lowest performance degradation in the presence of faulty cores on-chip. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that RRCS can give solutions with an average degradation of 13.8% and negligible computing time.
Funding: The work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB0204102.
Abstract: The short-range pair interaction consumes most of the CPU time in molecular dynamics (MD) simulations. The inherent computation sparsity makes it challenging to achieve a high-performance kernel on emerging many-core architectures. In this paper, we present a highly efficient short-range force kernel on the Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by poor data locality and write conflicts. To enhance the data locality, we adopt a super-cluster-based neighbor list with an appropriate granularity that fits in the local memory of the computing cores. In the absence of a low-overhead locking mechanism, using a data-privatization force array is a more feasible method to avoid write conflicts, but it results in a large data reduction overhead. We adopt a dual-slice partitioning scheme for both hardware resources and computing tasks, which utilizes on-chip data communication to reduce the data reduction overhead and provide load balancing. Moreover, we exploit single instruction multiple data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel obtains a performance speedup of 226x compared with the reference implementation and achieves 20% of the peak flop rate on the Sunway many-core processor.
Funding: Project supported by the National Natural Science Foundation of China (No. 61272145) and the National High-Tech R&D Program (863) of China (No. 2012AA012706).
Abstract: OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus has been extensively used. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. Typically, the use of OpenCL's local memory on multi-core/many-core CPUs may have the opposite performance effect, because local-memory arrays no longer match well with the hardware and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that transforms GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable with the corresponding OpenMP performance.
Funding: Funded by the National Key Research and Development Program of China under Grant No. 2018YFB0204301 and the National Natural Science Foundation of China under Grant Nos. 61972408 and 61872294.
Abstract: Cache performance is a critical design constraint for modern many-core systems. Since the cache often works in a "black-box" manner, it is difficult for the software to reason about the cache behavior and to match the running software to the underlying hardware. To better support code optimization, we need to understand and characterize the cache behavior. While cache performance characterization is heavily studied on traditional x86 architectures, there is little work on understanding the cache implementations of emerging ARMv8-based many-cores. This paper presents a comprehensive study to evaluate the cache architecture design on three representative ARMv8 multi-cores: Phytium 2000+, ThunderX2, and Kunpeng 920 (KP920). To this end, we develop wrBench, a micro-benchmark suite to measure the realized latency and bandwidth of caches at different levels of the memory hierarchy when performing core-to-core communication. Our evaluation provides inter-core latency and bandwidth in different cache levels and coherency states for the three ARMv8 many-cores. The quantitative performance data is shown in tables. We mine the characteristics of caches and coherency protocols by analyzing the data for the three processors, Phytium 2000+, ThunderX2, and KP920. Our paper also provides discussions and guidelines for optimizing memory access on ARMv8 many-cores.
Abstract: Purpose – The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs). Design/methodology/approach – For distributed genetic algorithm (GA) models, the paper proposes a method in which an island's ID number is added to the header of the data transferred by that island, for use in fault detection. Findings – The paper shows that the processing time of the proposed idea is practically negligible in applications, that an optimal solution can be obtained even with a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults. Originality/value – The study described in this paper is a new approach to increasing the sustainability of application programs that use distributed GA on GPUs and MCPs.
Funding: The Deanship of Scientific Research at King Khalid University, for funding this work through the General Research Project under Grant Number (RGP2/230/44).
Abstract: Unchecked breast cell growth is the cause of breast cancer, one of the leading causes of death in women globally. The only way to avoid breast cancer-related deaths is early detection and treatment. The proper classification of malignancies is one of the most significant challenges in the medical industry. Due to their high precision and accuracy, machine learning techniques are extensively employed for identifying and classifying various forms of cancer. The author of this review studied and implemented several data mining algorithms and compared the parameters and accuracy of the various algorithms for breast cancer diagnosis, so that clinicians might use them to accurately detect cancer cells early on. This article introduces several techniques, including support vector machine (SVM), K star (K∗) classifier, Additive Regression (AR), Back Propagation Neural Network (BP), and Bagging. These algorithms are trained using a set of data that contains tumor parameters from breast cancer patients. Comparing the results, the author found that Support Vector Machine and Bagging had the highest precision and accuracy, respectively. The review also assesses the number of studies that provide machine learning techniques for breast cancer detection.
Funding: The Russian Foundation for Basic Research (Grant No. 20-07-00140) and the Ministry of Science and Higher Education of the Russian Federation (Government Order FENU-2020-0022).
Abstract: A discord is a refinement of the concept of an anomalous subsequence of a time series. As one of the topical issues of time series mining, discord discovery is applied in a wide range of real-world areas (medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, etc.). In this article, we propose a novel parallel algorithm for discord discovery on a high-performance cluster with nodes based on many-core accelerators, for the case when the time series cannot fit in main memory. We assume that the time series is partitioned across the cluster nodes and achieve parallelization among the cluster nodes as well as within a single node. Within a cluster node, the algorithm employs a set of matrix data structures to store and index the subsequences of a time series, and to provide efficient vectorization of computations on the accelerator. At each node, the algorithm processes its own partition in two phases, namely candidate selection and discord refinement, with each phase requiring one linear scan through the partition. Then the local discords found are combined into the global candidate set and transmitted to each cluster node. Next, each node refines the global candidate set over its own partition, resulting in the local true discord set. Finally, the global true discord set is constructed as the intersection of the local true discord sets. The experimental evaluation on a real computer cluster with real and synthetic time series shows high scalability of the proposed algorithm.
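For readers unfamiliar with the term, a discord is commonly defined as the subsequence whose distance to its nearest non-overlapping neighbour is largest. The sketch below is a brute-force illustration of that definition only. It is not the parallel two-phase algorithm of this paper, it uses plain Euclidean distance without normalization as a simplifying assumption, and the name `find_discord` is hypothetical.

```python
import math


def find_discord(series, m):
    """Return (start_index, distance) of the top discord of length m.

    Brute force, O(n^2 * m): for each subsequence, find the nearest
    neighbour among subsequences that do not overlap it, then keep the
    subsequence whose nearest neighbour is farthest away.
    """
    n = len(series) - m + 1
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) < m:  # skip overlapping (self-match) subsequences
                continue
            d = math.sqrt(sum((series[i + k] - series[j + k]) ** 2 for k in range(m)))
            nearest = min(nearest, d)
        if nearest != math.inf and nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist


if __name__ == "__main__":
    # A periodic signal with one injected anomaly at position 20.
    data = [0, 1, 0, -1] * 10
    data[20] = 5
    print(find_discord(data, 4))
```

The quadratic cost of this baseline is what motivates candidate selection, refinement, and partitioning across nodes when the series does not fit in memory.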
Abstract: Increasing needs for the study of complex dynamical systems require computing solutions of a large number of ordinary and partial differential time-dependent equations in near real-time. Numerical integration algorithms, which are computationally expensive and inherently sequential, are typically used to compute solutions of ordinary and partial differential time-dependent equations. This presents challenges to studying complex dynamical systems in near real-time. This paper examines the challenges of computing solutions of ordinary differential time-dependent equations using the Parareal algorithm, which belongs to the class of parallel-in-time algorithms, on various high-performance computing accelerator-based architectures and associated programming models. The paper presents the code refactoring steps and performance analysis of the Parareal algorithm on two accelerator computing architectures, the Intel Xeon Phi CPU and Graphics Processing Unit many-core architectures, with the OpenMP, OpenACC, and CUDA programming models. The speedup and scaling performance analysis are used to demonstrate the suitability of Parareal for computing the solutions of a single ordinary differential time-dependent equation and of a family of interdependent ordinary differential time-dependent equations. The speedup, weak and strong scaling results demonstrate the suitability of Graphical Processing Units with the CUDA programming model as the most efficient accelerator for computing solutions of ordinary differential time-dependent equations using parallel-in-time algorithms. Considering the time and effort required to refactor the code for execution on the accelerator architectures, the Graphical Processing Units with the OpenACC programming model is the most efficient accelerator for computing solutions of ordinary differential time-dependent equations using parallel-in-time algorithms.
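The Parareal iteration named in this abstract combines a cheap coarse propagator G with an accurate fine propagator F through the update U[n+1] = G(new U[n]) + F(old U[n]) - G(old U[n]). The following is a minimal serial sketch for a scalar ODE, assuming forward Euler for both propagators; it is not the refactored code evaluated in the paper, and all function names are illustrative.

```python
def euler(f, y, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with forward Euler."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        y = y + h * f(t, y)
        t += h
    return y


def serial_fine(f, y0, t0, t1, n_slices, fine_steps=100):
    """Reference solution: the fine propagator applied slice after slice."""
    ts = [t0 + i * (t1 - t0) / n_slices for i in range(n_slices + 1)]
    u = [y0]
    for n in range(n_slices):
        u.append(euler(f, u[n], ts[n], ts[n + 1], fine_steps))
    return ts, u


def parareal(f, y0, t0, t1, n_slices, iterations, fine_steps=100):
    """Parareal with a one-step Euler coarse solver and a multi-step fine solver."""
    ts = [t0 + i * (t1 - t0) / n_slices for i in range(n_slices + 1)]

    def coarse(y, a, b):
        return euler(f, y, a, b, 1)

    def fine(y, a, b):
        return euler(f, y, a, b, fine_steps)

    # Initial guess: one serial sweep of the coarse solver.
    u = [y0]
    for n in range(n_slices):
        u.append(coarse(u[n], ts[n], ts[n + 1]))

    for _ in range(iterations):
        # These per-slice solves are independent of each other; on real
        # hardware this is the part that runs in parallel.
        f_old = [fine(u[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        g_old = [coarse(u[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        # Serial correction sweep using only the cheap coarse solver.
        new = [y0]
        for n in range(n_slices):
            new.append(coarse(new[n], ts[n], ts[n + 1]) + f_old[n] - g_old[n])
        u = new
    return ts, u


if __name__ == "__main__":
    _, approx = parareal(lambda t, y: -y, 1.0, 0.0, 2.0, n_slices=10, iterations=3)
    print(approx[-1])
```

With as many iterations as time slices the result matches the serial fine solution up to rounding, so any speedup depends on converging in far fewer iterations.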
Funding: Supported by the National High-Tech Research and Development (863) Program of China (No. 2015AA015306), the Science and Technology Plan of Beijing Municipality (No. Z161100000216147), the National Natural Science Foundation of China (No. 61762074), the Youth Foundation Program of Qinghai University (No. 2016-QGY-5), and the National Natural Science Foundation of Qinghai Province (No. 2019-ZJ7034).
Abstract: A Weighted Essentially Non-Oscillatory (WENO) scheme is a method for solving hyperbolic conservation laws, suitable for high-density fluid interface instability problems with strong intermittency. These problems have a large and complex flow structure. To fully utilize the computing power of High Performance Computing (HPC) systems, it is necessary to develop specific methodologies to optimize the performance of applications based on the particular system's architecture. The Sunway TaihuLight supercomputer is currently ranked as the fastest supercomputer in the world. This article presents a heterogeneous parallel algorithm design and performance optimization of a high-order WENO scheme on Sunway TaihuLight. We analyzed the characteristics of the kernel functions and proposed an appropriate heterogeneous parallel model. We also determined the best division strategy for computing tasks and implemented the parallel algorithm on Sunway TaihuLight. By using access optimization, data dependency elimination, and vectorization optimization, our parallel algorithm can achieve up to 172× speedup on a single node, and an additional 58× speedup on 64 nodes, with nearly linear scalability.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61133005, 61272136, 61221062, 61402441, and 61432018, the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010903, and the Chinese Academy of Sciences Special Grant for Postgraduate Research, Innovation and Practice under Grant No. 11000GBF01.
Abstract: Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel(R) Many Integrated Core (MIC) Architecture offers not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely-studied algorithm; however, the conventional algorithm needs to traverse the data three times. In each pass, it computes multiple 1D FFTs along one of three dimensions, giving rise to plenty of strided memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the rest dimensions respectively. The difference in amount of TLB misses resulting from decomposition along different dimensions is analyzed in detail. [...]el parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse of loc[...]. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance the performance. We evaluate the algorithm on the Intel(R) Xeon Phi(TM) coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which beats the vendor-specific Intel(R) MKL library by a factor of up to 2.22X.
Funding: Supported by the National Basic Research 973 Program of China under Grant Nos. 2011CB302500 and 2005CB321600, and the National Natural Science Foundation of China under Grant No. 60921002.
Abstract: Moore's law continues to grant computer architects ever more transistors in the foreseeable future, and parallelism is the key to continued performance scaling in modern microprocessors. In this paper, the achievements of our research project on parallel architecture, which is supported by the National Basic Research 973 Program of China, are systematically presented. The innovative approaches and techniques to solve the significant problems in parallel architecture design are summarized, including architecture-level optimization, compiler and language-supported technologies, reliability, power-performance efficient design, test and verification challenges, and platform building. Two prototype chips, a multi-heavy-core Godson-3 and a many-light-core Godson-T, are described to demonstrate the highly scalable and reconfigurable parallel architecture designs. We also present some of our achievements appearing in ISCA, MICRO, ISSCC, HPCA, PLDI, PACT, IJCAI, Hot Chips, DATE, IEEE Trans. VLSI, IEEE Micro, IEEE Trans. Computers, etc.
Funding: Supported by the National Basic Research 973 Program of China under Grant No. 2005CB321602, the National Natural Science Foundation of China under Grant No. 60736012, and the National High Technology Research and Development 863 Program of China under Grant Nos. 2007AA01Z110 and 2009AA01Z103.
Abstract: The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers are faced with a unique opportunity to introduce new architecture features as well as adequate compiler technology; together they may have profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amenable performance optimization techniques on the many-core architecture Godson-T. The Godson-T architecture has several unique features that are chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; 2) fine-grain memory-based synchronization (e.g., full-empty bits for fine-grain synchronization). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance in both execution time speedup and scalability, validates the value of the globally accessed SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
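As a point of reference for the 1-D Jacobi computation used in this case study, the sketch below shows a plain double-buffered three-point sweep and a chunked variant that splits the interior into blocks, as one might assign to separate cores. It is an assumption-laden illustration: the averaging stencil, the fixed boundaries, and the function names are chosen here for clarity, and the chunking is simple spatial blocking, not the tiling scheme or SPM usage developed for Godson-T.

```python
def jacobi_1d(u, steps):
    """Apply `steps` Jacobi sweeps of a 3-point average; end points stay fixed."""
    cur = list(u)
    for _ in range(steps):
        nxt = list(cur)  # double buffering: read from cur, write to nxt
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_1d_chunked(u, steps, chunk):
    """Same computation with the interior split into blocks of `chunk` points.

    Every block reads only the previous buffer, so the blocks of one sweep
    are independent and could run on different cores, with a barrier
    between sweeps.
    """
    cur = list(u)
    for _ in range(steps):
        nxt = list(cur)
        for start in range(1, len(cur) - 1, chunk):
            stop = min(start + chunk, len(cur) - 1)
            for i in range(start, stop):
                nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


if __name__ == "__main__":
    print(jacobi_1d([0.0, 0.0, 3.0, 0.0, 0.0], 1))
```

Each block needs one neighbouring value from the adjacent block on either side, which is the kind of boundary exchange that fine-grain synchronization is meant to make cheap.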
Funding: Supported by the Strategic Priority Research Program of Chinese Academy of Sciences (XDB41000000), the Research Funds of the Double First-Class Initiative of University of Science and Technology of China (YD2080002007), and the National Natural Science Foundation of China (91837310, 42061134009, and 41775146).
Abstract: In the era of global warming and highly urbanized development, extreme and high-impact weather as well as air pollution incidents influence everyday life and might even cause incalculable loss of life and property. Despite the vast development of atmospheric models, substantial numerical forecast biases still exist. To accurately predict extreme weather, severe air pollution, and abrupt climate change, a numerical atmospheric model is required not only to simulate meteorology and atmospheric composition simultaneously, involving many sophisticated physical and chemical processes, but also to do so at high spatiotemporal resolution. Global integrated atmospheric simulation at spatial resolutions of a few kilometers remains challenging due to its intensive computational and input/output (I/O) requirements. Through multi-dimension-parallelism structuring, aggressive and finer-grained optimizing, manual vectorizing, and parallelized I/O fragmenting, an integrated Atmospheric Model Across Scales (iAMAS) was established on the new Sunway supercomputer platform to significantly increase the computational efficiency and reduce the I/O cost. The global 3-km atmospheric simulation for meteorology with online integrated aerosol feedbacks with iAMAS was scaled to 39,000,000 processor cores and achieved a speed of 0.82 simulation day per hour (SDPH) with routine I/O, which enabled us to perform a 5-day global weather forecast at 3-km horizontal resolution with online natural aerosol impacts. The results demonstrate the promising prospect that increasing the spatial resolution to a few kilometers with online integrated aerosol feedbacks may significantly improve the global weather forecast.