20 articles found
Typhoon Case Comparison Analysis Between Heterogeneous Many-Core and Homogenous Multicore Supercomputing Platforms
1
Authors: LIU Xin, YU Xiaolin, ZHAO Haoran, HAN Qiqi, ZHANG Jie, WANG Chengzhi, MA Weiwei, XU Da. Journal of Ocean University of China (SCIE, CAS, CSCD), 2023, Issue 2, pp. 324-334 (11 pages)
In this paper, a typical experiment is carried out based on a high-resolution air-sea coupled model, namely, the coupled ocean-atmosphere-wave-sediment transport (COAWST) model, on both heterogeneous many-core (SW) and homogenous multicore (Intel) supercomputing platforms. We construct a hindcast of Typhoon Lekima on both the SW and Intel platforms, compare the simulation results between these two platforms and compare the key elements of the atmospheric and ocean modules to reanalysis data. The comparative experiment in this typhoon case indicates that the domestic many-core computing platform and general cluster yield almost no differences in the simulated typhoon path and intensity, and the differences in surface pressure (PSFC) in the WRF model and sea surface temperature (SST) in the short-range forecast are very small, whereas a major difference can be identified at high latitudes after the first 10 days. Further heat budget analysis verifies that the differences in SST after 10 days are mainly caused by shortwave radiation variations, as influenced by subsequently generated typhoons in the system. These typhoons generated in the hindcast after the first 10 days attain obviously different trajectories between the two platforms.
Keywords: heterogeneous many-core supercomputing platform; homogenous multicore supercomputing platform; comparison analysis; typhoon case
A Scalable Interconnection Scheme in Many-Core Systems
2
Authors: Allam Abumwais, Mujahed Eleyat. Computers, Materials & Continua (SCIE, EI), 2023, Issue 10, pp. 615-632 (18 pages)
Recent architectures of multi-core systems may have a relatively large number of cores that typically ranges from tens to hundreds; they are therefore called many-core systems. Such systems require an efficient interconnection network that tries to address two major problems. First, the overhead of power and area cost and its effect on scalability. Second, the high access latency caused by multiple cores' simultaneous accesses of the same shared module. This paper presents an interconnection scheme called N-conjugate Shuffle Clusters (NCSC) based on multi-core multicluster architecture to reduce the overhead of the just mentioned problems. NCSC eliminated the need for router devices and their complexity and hence reduced the power and area costs. It also redesigned and distributed the shared caches across the interconnection network to increase the ability for simultaneous access and hence reduce the access latency. For intra-cluster communication, Multi-port Content Addressable Memory (MPCAM) is used. The experimental results using four clusters and four cores each indicated that the average access latency for a write process is 1.14785±0.04532 ns, which is nearly equal to the latency of a write operation in MPCAM. Moreover, it was demonstrated that the average read latency within a cluster is 1.26226±0.090591 ns and around 1.92738±0.139588 ns for read access between cores from different clusters.
Keywords: many-core; multi-core; N-conjugate shuffle; multi-port content addressable memory; interconnection network
Deep Packet Inspection Based on Many-Core Platform
3
Authors: Ya-Ru Zhan, Zhao-Shun Wang. Journal of Computer and Communications, 2015, Issue 5, pp. 1-6 (6 pages)
With the development of computer technology, network bandwidth and network traffic continue to increase. Considering the large data flow, it is imperative to perform inspection effectively on network packets. In order to find a deep packet inspection solution that is appropriate to the current network environment, this paper built a deep packet inspection system based on a many-core platform, and in this way verified the feasibility of implementing a deep packet inspection system on a many-core platform with both high performance and low consumption. After testing and analysis of the system performance, it has been found that deep packet inspection based on the many-core platform TILE_Gx36 [1] [2] can process network traffic with a bandwidth of up to 4 Gbps. To a certain extent, the performance has improved compared to most current deep packet inspection systems based on the x86 platform.
Keywords: Many-Core Platform; Deep Packet Inspection; Application Layer Protocol; TILE_Gx36
Multiple Levels of Abstraction in the Simulation of Microthreaded Many-Core Architectures
4
Authors: Irfan Uddin. Open Journal of Modelling and Simulation, 2015, Issue 4, pp. 159-190 (32 pages)
Simulators are generally used during the design of computer architectures. Typically, different simulators with different levels of complexity, speed and accuracy are used. However, for early design space exploration, simulators with less complexity, high simulation speed and reasonable accuracy are desired. It is also required that these simulators have a short development time and that changes in the design require less effort in the implementation in order to perform experiments and see the effects of changes in the design. These simulators are termed high-level simulators in the context of computer architecture. In this paper, we present multiple levels of abstraction in a high-level simulation of a general-purpose many-core system, where the objective of every level is to improve the accuracy in simulation without significantly affecting the complexity and simulation speed.
Keywords: High-Level Simulations; Multiple Levels of Abstraction; Design Space Exploration; Many-Core Systems
Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions (cited by: 12)
5
Authors: 范东睿, 袁楠, 张军超, 周永彬, 林伟, 宋风龙, 叶笑春, 黄河, 余磊, 龙国平, 张浩, 刘磊. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2009, Issue 6, pp. 1061-1073 (13 pages)
Moore's law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to attack this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents and hardware-supported synchronization mechanisms, to provide full potential for the high efficiency of the on-chip resource utilization. On the other hand, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software cooperating design methodology bridges high-end computing with mass programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.
Keywords: many-core; parallel computing; multithread; data communication; thread synchronization; runtime system
Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture (cited by: 13)
6
Authors: 郑方, 李宏亮, 吕晖, 过锋, 许晓红, 谢向辉. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2015, Issue 1, pp. 145-162 (18 pages)
Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, an efficient register-level communication mechanism, and a fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
Keywords: heterogeneous many-core processor; data stream transfer; register-level communication mechanism; hardware synchronization technique; processor prototype
Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+ (cited by: 3)
7
Authors: Jian-Bin Fang, Xiang-Ke Liao, Chun Huang, De-Zun Dong. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2021, Issue 1, pp. 33-43 (11 pages)
This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that impact high-performance computing applications. We provide insights into the memory-relevant performance behaviours of the Phytium 2000+ system through micro-benchmarking. With the help of the well-known roofline model, we analyze the Phytium 2000+ system, taking both memory accesses and computations into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.
Keywords: many-core architecture; memory-centric design; performance evaluation
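As a rough illustration of the roofline model mentioned in the abstract above, the sketch below bounds attainable performance by the smaller of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. The function names and the machine numbers are illustrative assumptions, not measurements of the Phytium 2000+.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Attainable GFLOP/s for a kernel with the given arithmetic
    intensity (FLOPs per byte moved), per the roofline model."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity above which a kernel is no longer memory-bound."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    # Hypothetical machine: 500 GFLOP/s peak, 100 GB/s memory bandwidth.
    peak, bw = 500.0, 100.0
    for ai in (0.25, 1.0, 5.0, 16.0):
        bound = "memory" if ai < ridge_point(peak, bw) else "compute"
        perf = roofline_gflops(peak, bw, ai)
        print(f"AI={ai:5.2f} FLOP/B -> {perf:6.1f} GFLOP/s ({bound}-bound)")
```

Kernels whose intensity lies left of the ridge point benefit mainly from better memory behaviour, which is why a memory-centric evaluation pairs micro-benchmarks with this kind of bound.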
Fault Tolerance Mechanism in Chip Many-Core Processors (cited by: 1)
8
Authors: 张磊, 韩银和, 李华伟, 李晓维. Tsinghua Science and Technology (SCIE, EI, CAS), 2007, Issue S1, pp. 169-174 (6 pages)
As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve N-core processors' yield by providing M spare cores. In such an architecture, topology is an important factor because it greatly affects the processors' performance. The concept of logical topology and a topology reconfiguration problem are introduced, making it possible to transparently provide the target topology with the lowest performance degradation in the presence of faulty cores on-chip. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that RRCS can give solutions with an average degradation of 13.8% and negligible computing time.
Keywords: chip many-core processors; yield; fault tolerance; reconfiguration; network-on-chip
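To make the yield argument behind N+M redundancy concrete, the sketch below computes the probability that at least N of the N+M fabricated cores are fault-free. It assumes every core is good independently with the same probability and ignores interconnect faults and topology constraints; these assumptions, the function name `chip_yield`, and the example numbers are illustrative and not taken from the paper.

```python
from math import comb


def chip_yield(n_required, m_spare, core_yield):
    """Probability that at least n_required of n_required + m_spare cores work,
    assuming independent cores that each work with probability core_yield."""
    total = n_required + m_spare
    return sum(
        comb(total, good) * core_yield**good * (1.0 - core_yield) ** (total - good)
        for good in range(n_required, total + 1)
    )


if __name__ == "__main__":
    # Hypothetical 16-core chip with a 95% per-core yield.
    for spares in (0, 1, 2, 4):
        print(f"N=16, M={spares}: chip yield = {chip_yield(16, spares, 0.95):.4f}")
```

Under this simple binomial model even a few spare cores raise the chip yield noticeably, which is the motivation for then asking how to reconfigure the topology around the faulty cores.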
Towards Efficient Short-Range Pair Interaction on Sunway Many-Core Architecture (cited by: 1)
9
Authors: Jun-Shi Chen, Hong An, Wen-Ting Han, Zeng Lin, Xin Liu. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2021, Issue 1, pp. 123-139 (17 pages)
The short-range pair interaction consumes most of the CPU time in molecular dynamics (MD) simulations. The inherent computation sparsity makes it challenging to achieve a high-performance kernel on the emerging many-core architecture. In this paper, we present a highly efficient short-range force kernel on the Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by the poor data locality and write conflicts. To enhance the data locality, we adopt a super cluster based neighbor list with an appropriate granularity that fits in the local memory of computing cores. In the absence of a low overhead locking mechanism, using a data-privatization force array is a more feasible method to avoid write conflicts, but results in a large overhead of data reduction. We adopt a dual-slice partitioning scheme for both hardware resources and computing tasks, which utilizes the on-chip data communication to reduce data reduction overhead and provide load balancing. Moreover, we exploit the single instruction multiple data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel obtains a performance speedup of 226x compared with the reference implementation and achieves 20% of peak flop rate on the Sunway many-core processor.
Keywords: molecular dynamics; Sunway; many-core; pair interaction; parallel algorithm
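The data-privatization idea described in the abstract above can be sketched as follows: each worker accumulates pair forces into its own private array, so no two workers write to the same location, and the private arrays are summed afterwards (the reduction step whose cost the abstract mentions). This is a minimal sequential sketch; the workers are simulated by a round-robin split, and the 1-D linear "spring" force, the function name, and the data are hypothetical stand-ins, not the paper's kernel.

```python
def pair_forces_privatized(positions, pairs, n_workers):
    """Accumulate pair forces using one private force array per worker,
    then reduce the private arrays into the final force array."""
    n = len(positions)
    private = [[0.0] * n for _ in range(n_workers)]

    for idx, (i, j) in enumerate(pairs):
        worker = idx % n_workers              # round-robin work split (assumption)
        force = positions[j] - positions[i]   # toy 1-D linear force on i towards j
        private[worker][i] += force           # writes touch only this worker's copy
        private[worker][j] -= force           # Newton's third law

    # Reduction: sum the private copies element by element.
    return [sum(private[w][k] for w in range(n_workers)) for k in range(n)]


if __name__ == "__main__":
    positions = [0.0, 1.0, 3.0]
    pairs = [(0, 1), (1, 2), (0, 2)]
    print(pair_forces_privatized(positions, pairs, n_workers=2))
```

The memory footprint and the reduction work both grow with the number of workers, which is the overhead a smarter partitioning of tasks and resources tries to cut down.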
Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations
10
Authors: Mei WEN, Da-fei HUANG, Chang-qing XUN, Dong CHEN. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2015, Issue 11, pp. 899-916 (18 pages)
OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus has been extensively used. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on the CPU performance. Typically, the use of OpenCL's local memory on multi-core/many-core CPUs may lead to an opposite performance effect, because local-memory arrays no longer match well with the hardware and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that makes this transformation of GPU-specific OpenCL kernels into a CPU-friendly form, which is accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable with the corresponding OpenMP performance.
Keywords: OpenCL; Performance portability; Multi-core/many-core CPU; Analysis-based transformation
wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems
11
Authors: 高琬蓉, 方建滨, 黄春, 徐传福, 王峥. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2023, Issue 6, pp. 1323-1338 (16 pages)
Cache performance is a critical design constraint for modern many-core systems. Since the cache often works in a "black-box" manner, it is difficult for the software to reason about the cache behavior to match the running software to the underlying hardware. To better support code optimization, we need to understand and characterize the cache behavior. While cache performance characterization is heavily studied on traditional x86 architectures, there is little work for understanding the cache implementations on emerging ARMv8-based many-cores. This paper presents a comprehensive study to evaluate the cache architecture design on three representative ARMv8 multi-cores, Phytium 2000+, ThunderX2, and Kunpeng 920 (KP920). To this end, we develop wrBench, a micro-benchmark suite to measure the realized latency and bandwidth of caches at different memory hierarchies when performing core-to-core communication. Our evaluation provides inter-core latency and bandwidth in different cache levels and coherency states for the three ARMv8 many-cores. The quantitative performance data is shown in tables. We mine the characteristics of caches and coherency protocols by analyzing the data for the three processors, Phytium 2000+, ThunderX2, and KP920. Our paper also provides discussions and guidelines for optimizing memory access on ARMv8 many-cores.
Keywords: ARMv8; many-core; cache architecture; microbenchmark; core-to-core communication
Parallelization and sustainability of distributed genetic algorithms on many-core processors
12
Authors: Yuji Sato, Mikiko Sato. International Journal of Intelligent Computing and Cybernetics (EI), 2014, Issue 1, pp. 2-23 (22 pages)
Purpose: The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs). Design/methodology/approach: For distributed genetic algorithm (GA) models, the paper proposes a method where an island's ID number is added to the header of data transferred by this island for use in fault detection. Findings: The paper has shown that the processing time of the proposed idea is practically negligible in applications and also shown that an optimal solution can be obtained even with a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults. Originality/value: The study described in this paper is a new approach to increase the sustainability of application programs using distributed GA on GPUs and MCPs.
Keywords: Evolutionary computation; Genetic algorithms; Fault identification; Many-core processors; Parallelization
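The abstract above only states that an island's ID is placed in the header of the data it transfers and used for fault detection. One way such a check could look is sketched below, assuming a ring migration topology in which each island knows which neighbour it should hear from. The `MigrationPacket` type, the `accept_migrants` function, and the policy of discarding mismatching packets are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MigrationPacket:
    island_id: int              # header: ID of the island that sent the data
    migrants: List[List[int]]   # payload: individuals encoded as gene lists


def accept_migrants(packet: MigrationPacket, expected_sender: int) -> List[List[int]]:
    """Return the migrants only if the header names the expected sender.

    A mismatching ID is treated as a detected fault and the payload is dropped,
    so the receiving island continues with its own population.
    """
    if packet.island_id != expected_sender:
        return []
    return packet.migrants


if __name__ == "__main__":
    n_islands = 4
    receiver = 2
    expected = (receiver - 1) % n_islands   # ring topology (assumption)
    good = MigrationPacket(island_id=1, migrants=[[1, 0, 1, 1]])
    corrupted = MigrationPacket(island_id=7, migrants=[[0, 0, 0, 0]])
    print(accept_migrants(good, expected))
    print(accept_migrants(corrupted, expected))
```

Because the check is a single integer comparison per transfer, its cost is small relative to fitness evaluation, which fits the abstract's remark that the added processing time is practically negligible.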
Comparative Evaluation of Data Mining Algorithms in Breast Cancer
13
Authors: Fuad A. M. Al-Yarimi. Computers, Materials & Continua (SCIE, EI), 2023, Issue 10, pp. 633-645 (13 pages)
Unchecked breast cell growth is one of the leading causes of death in women globally and is the cause of breast cancer. The only method to avoid breast cancer-related deaths is through early detection and treatment. The proper classification of malignancies is one of the most significant challenges in the medical industry. Due to their high precision and accuracy, machine learning techniques are extensively employed for identifying and classifying various forms of cancer. The author of this review studied and implemented several data mining algorithms and compared them against the reported parameters and accuracy of various algorithms for breast cancer diagnosis, so that clinicians might use them to accurately detect cancer cells early on. This article introduces several techniques, including support vector machine (SVM), K star (K*) classifier, Additive Regression (AR), Back Propagation Neural Network (BP), and Bagging. These algorithms are trained using a set of data that contains tumor parameters from breast cancer patients. Comparing the results, the author found that Support Vector Machine and Bagging had the highest precision and accuracy, respectively. The review also assesses the number of studies that provide machine learning techniques for breast cancer detection.
Keywords: many-core; multi-core; N-conjugate shuffle; multi-port content addressable memory; interconnection network
A Parallel Approach to Discords Discovery in Massive Time Series Data
14
Authors: Mikhail Zymbler, Alexander Grents, Yana Kraeva, Sachin Kumar. Computers, Materials & Continua (SCIE, EI), 2021, Issue 2, pp. 1867-1878 (12 pages)
A discord is a refinement of the concept of an anomalous subsequence of a time series. Being one of the topical issues of time series mining, discords discovery is applied in a wide range of real-world areas (medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, etc.). In this article, we propose a novel parallel algorithm for discords discovery on a high-performance cluster with nodes based on many-core accelerators in the case when the time series cannot fit in the main memory. We assumed that the time series is partitioned across the cluster nodes and achieved parallelization among the cluster nodes as well as within a single node. Within a cluster node, the algorithm employs a set of matrix data structures to store and index the subsequences of a time series, and to provide an efficient vectorization of computations on the accelerator. At each node, the algorithm processes its own partition and performs in two phases, namely candidate selection and discord refinement, with each phase requiring one linear scan through the partition. Then the local discords found are combined into the global candidate set and transmitted to each cluster node. Next, a node performs refinement of the global candidate set over its own partition, resulting in the local true discord set. Finally, the global true discords set is constructed as the intersection of the local true discord sets. The experimental evaluation on the real computer cluster with real and synthetic time series shows a high scalability of the proposed algorithm.
Keywords: Time series; discords discovery; computer cluster; many-core accelerator; vectorization
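For readers unfamiliar with discords, the sketch below finds the top discord by brute force, using the common definition of a discord as the subsequence whose nearest non-overlapping neighbour is farthest away. It is a quadratic-time, single-node reference for the concept only, not the two-phase parallel algorithm of the article; the function name, the plain Euclidean distance without normalization, and the toy series are assumptions.

```python
import math


def top_discord(series, m):
    """Return (start_index, distance) of the length-m subsequence whose
    nearest non-overlapping neighbour is farthest away (brute force)."""
    n = len(series) - m + 1
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) < m:          # skip overlapping (trivial) matches
                continue
            d = math.dist(series[i:i + m], series[j:j + m])
            nearest = min(nearest, d)
        if nearest != math.inf and nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist


if __name__ == "__main__":
    # Periodic toy series with a single spike at position 7.
    series = [0, 1, 0, 1, 0, 1, 0, 9, 0, 1, 0, 1, 0, 1, 0, 1]
    print(top_discord(series, m=4))
```

The nested scan over all subsequence pairs is what makes discord discovery expensive on long series and motivates candidate selection followed by refinement.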
Performance Analysis of Accelerator Architectures and Programming Models for Parareal Algorithm Solutions of Ordinary Differential Equations
15
Authors: Sumathi Lakshmiranganatha, Suresh S. Muknahallipatna. Journal of Computer and Communications, 2021, Issue 2, pp. 29-56 (28 pages)
Increasing needs for the study of complex dynamical systems require computing solutions of a large number of ordinary and partial differential time-dependent equations in near real-time. Numerical integration algorithms, which are computationally expensive and inherently sequential, are typically used to compute solutions of ordinary and partial differential time-dependent equations. This presents challenges to study complex dynamical systems in near real-time. This paper examines the challenges of computing solutions of ordinary differential time-dependent equations using the Parareal algorithm belonging to the class of parallel-in-time algorithms on various high-performance computing accelerator-based architectures and associated programming models. The paper presents the code refactoring steps and performance analysis of the Parareal algorithm on two accelerator computing architectures: the Intel Xeon Phi CPU and Graphics Processing Unit many-core architectures, and with OpenMP, OpenACC, and CUDA programming models. The speedup and scaling performance analysis are used to demonstrate the suitability of the Parareal to compute the solutions of a single ordinary differential time-dependent equation and a family of interdependent ordinary differential time-dependent equations. The speedup, weak and strong scaling results demonstrate the suitability of Graphics Processing Units with the CUDA programming model as the most efficient accelerator for computing solutions of ordinary differential time-dependent equations using parallel-in-time algorithms.
Considering the time and effort required to refactor the code for execution on the accelerator architectures, the Graphics Processing Units with the OpenACC programming model is the most efficient accelerator for computing solutions of ordinary differential time-dependent equations using parallel-in-time algorithms.
Keywords: Accelerators; Many-Core; Directive-Based; Time-Parallel; Scaling; Speedup
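A minimal sketch of the Parareal iteration may help to see where the parallelism comes from. It solves the scalar test equation dy/dt = λy, using one explicit Euler step per time slice as the coarse propagator and many small Euler steps as the fine propagator; the fine solves within one iteration are independent of each other and are the part an accelerator would run in parallel. The test equation, the propagators, and all names and parameters are illustrative assumptions, and the code runs sequentially.

```python
def coarse(y, dt, lam):
    """Coarse propagator G: one explicit Euler step over a whole time slice."""
    return y * (1.0 + lam * dt)


def fine(y, dt, lam, substeps=100):
    """Fine propagator F: many small explicit Euler steps over one time slice."""
    h = dt / substeps
    for _ in range(substeps):
        y += h * lam * y
    return y


def serial_fine(y0, lam, t_end, n_slices):
    """Reference: apply the fine propagator slice after slice, sequentially."""
    dt = t_end / n_slices
    ys = [y0]
    for n in range(n_slices):
        ys.append(fine(ys[n], dt, lam))
    return ys


def parareal(y0, lam, t_end, n_slices, n_iters):
    """Parareal: U[n+1] = G(U_new[n]) + F(U_old[n]) - G(U_old[n])."""
    dt = t_end / n_slices
    u = [y0]
    for n in range(n_slices):                 # initial guess: serial coarse sweep
        u.append(coarse(u[n], dt, lam))
    for _ in range(n_iters):
        # These fine solves are mutually independent (the parallel part).
        f_old = [fine(u[n], dt, lam) for n in range(n_slices)]
        g_old = [coarse(u[n], dt, lam) for n in range(n_slices)]
        new = [y0]
        for n in range(n_slices):             # cheap sequential correction sweep
            new.append(coarse(new[n], dt, lam) + f_old[n] - g_old[n])
        u = new
    return u


if __name__ == "__main__":
    ref = serial_fine(1.0, -1.0, 2.0, 8)
    for k in (0, 1, 2, 8):
        approx = parareal(1.0, -1.0, 2.0, 8, k)
        err = max(abs(a - b) for a, b in zip(approx, ref))
        print(f"iterations={k}: max deviation from serial fine solution = {err:.2e}")
```

After as many iterations as there are time slices the result coincides with the serial fine solution up to rounding, so a speedup is only possible when far fewer iterations suffice.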
Heterogeneous Parallel Algorithm Design and Performance Optimization for WENO on the Sunway TaihuLight Supercomputer (cited by: 4)
16
Authors: Jianqiang Huang, Wentao Han, Xiaoying Wang, Wenguang Chen. Tsinghua Science and Technology (SCIE, EI, CAS, CSCD), 2020, Issue 1, pp. 56-67 (12 pages)
A Weighted Essentially Non-Oscillatory scheme (WENO) is a solution to hyperbolic conservation laws, suitable for solving high-density fluid interface instability with strong intermittency. These problems have a large and complex flow structure. To fully utilize the computing power of High Performance Computing (HPC) systems, it is necessary to develop specific methodologies to optimize the performance of applications based on the particular system's architecture. The Sunway TaihuLight supercomputer is currently ranked as the fastest supercomputer in the world. This article presents a heterogeneous parallel algorithm design and performance optimization of a high-order WENO on Sunway TaihuLight. We analyzed characteristics of kernel functions, and proposed an appropriate heterogeneous parallel model. We also figured out the best division strategy for computing tasks, and implemented the parallel algorithm on Sunway TaihuLight. By using access optimization, data dependency elimination, and vectorization optimization, our parallel algorithm can achieve up to 172× speedup on one single node, and an additional 58× speedup on 64 nodes, with nearly linear scalability.
Keywords: parallel algorithms; Weighted Essentially Non-Oscillatory scheme (WENO); optimization; many-core; Sunway TaihuLight
Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon Phi™ Coprocessor (cited by: 2)
17
Authors: 刘益群, 李焱, 张云泉, 张先轶. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2014, Issue 6, pp. 989-1002 (14 pages)
Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel® Many Integrated Core (MIC) Architecture [...] not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely-studied algorithm; however, the conventional algorithm needs to traverse the [...] three times. In each pass, it computes multiple 1D FFTs along one of three dimensions, giving rise to plenty of strided memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce [...] of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into [...] sub-dimensions, and then combine the transform along each sub-dimension with one of the rest dimensions respectively. [...] difference in amount of TLB misses resulting from decomposition along different dimensions is analyzed in detail. [...]el parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse of loc[...]. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance the performance. We evaluate the algorithm on the Intel® Xeon Phi™ coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which [...] the vendor-specific Intel® MKL library by a factor of up to 2.22X.
Keywords: 3D-FFT; memory efficie[...]; many-core; Many Integrated Core; Intel® Xeon Phi™
New Methodologies for Parallel Architecture (cited by: 1)
18
作者 范东睿 李晓维 李国杰 《Journal of Computer Science & Technology》 SCIE EI CSCD 2011年第4期578-587,共10页
Moore's law continues to grant computer architects ever more transistors in the foreseeable future, and parallelism is the key to continued performance scaling in modern microprocessors. In this paper, the achievemen... Moore's law continues to grant computer architects ever more transistors in the foreseeable future, and parallelism is the key to continued performance scaling in modern microprocessors. In this paper, the achievements in our research project, which is supported by the National Basic Research 973 Program of China, on parallel architecture, are systematically presented. The innovative approaches and techniques to solve the significant problems in parallel architecture design are smnmarized, including architecture level optimization, compiler and language-supported technologies, reliability, power-performance efficient design, test and verification challenges, and platform building. Two prototype chips, a multi-heavy-core Godson-3 and a many-light-core Godson-T, are described to demonstrate the highly scalable and reconfigurable parallel architecture designs. We also present some of our achievements appearing in ISCA, MICRO, ISSCC, HPCA, PLDI, PACT, IJCAI, Hot Chips, DATE, IEEE Trans. VLSI, IEEE Micro, IEEE Trans. Computers, etc. 展开更多
Keywords: architecture, multi-core, many-core, parallelism
Landing Stencil Code on Godson-T (cited by: 1)
19
Authors: 崔慧敏, 王蕾, 范东睿, 冯晓兵. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2010, No. 4, pp. 886-894, 9 pages
The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers are faced with a unique opportunity to introduce new architecture features as well as adequate compiler technology; together they may have a profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amenable performance optimization techniques on a many-core architecture, Godson-T. The Godson-T architecture has several unique features that are chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; 2) fine-grain memory-based synchronization (e.g., full-empty bits for fine-grain synchronization). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance in both execution time speedup and scalability, validates the value of globally accessed SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
Keywords: many-core, stencil, Jacobi, compiler, SPM, fine-grain synchronization
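To make the 1-D Jacobi case study above more concrete, the following NumPy sketch contrasts a plain sweep with a tiled variant that advances several time steps per tile. It is only an illustration under stated assumptions: the three-point averaging stencil, the fixed boundary values, the function names, and the overlapped (redundant-halo) tiling scheme are all chosen here for simplicity and are not taken from the paper, which targets Godson-T scratchpad memories and full-empty-bit synchronization.

```python
import numpy as np

def jacobi_1d(u, steps):
    """Reference 1-D Jacobi: every step rereads the whole array."""
    u = u.copy()
    for _ in range(steps):
        # The right-hand side is evaluated into a temporary first,
        # so this is a true Jacobi update (old values only).
        u[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
    return u

def jacobi_1d_time_tiled(u, steps, tile):
    """Overlapped time tiling: each tile loads a halo of width `steps`
    and then performs all time steps on its small local buffer."""
    n = len(u)
    out = u.copy()
    for s in range(0, n, tile):
        e = min(s + tile, n)
        lo, hi = max(0, s - steps), min(n, e + steps)
        buf = u[lo:hi].copy()          # stands in for a core-local buffer
        for _ in range(steps):
            buf[1:-1] = (buf[:-2] + buf[1:-1] + buf[2:]) / 3.0
        # Only points at least `steps` away from an artificial cut are valid.
        out[s:e] = buf[s - lo:e - lo]
    return out
```

Both functions return the same values for the same input; the tiled version trades some redundant computation in the halos for working on a buffer small enough to stay in fast local memory.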
Establishing a non-hydrostatic global atmospheric modeling system at 3-km horizontal resolution with aerosol feedbacks on the Sunway supercomputer of China
20
Authors: Jun Gu, Jiawang Feng, Xiaoyu Hao, Tao Fang, Chun Zhao, Hong An, Junshi Chen, Mingyue Xu, Jian Li, Wenting Han, Chao Yang, Fang Li, Dexun Chen. Science Bulletin (SCIE, EI, CSCD), 2022, No. 11, pp. 1170-1181, 12 pages
During the era of global warming and highly urbanized development, extreme and high-impact weather as well as air pollution incidents influence everyday life and might even cause incalculable loss of life and property. Despite the vast development of atmospheric models, substantial numerical forecast biases still exist. To accurately predict extreme weather, severe air pollution, and abrupt climate change, a numerical atmospheric model must not only simulate meteorology and atmospheric composition simultaneously, involving many sophisticated physical and chemical processes, but also do so at high spatiotemporal resolution. Global integrated atmospheric simulation at spatial resolutions of a few kilometers remains challenging due to its intensive computational and input/output (I/O) requirements. Through multi-dimension-parallelism structuring, aggressive and finer-grained optimizing, manual vectorizing, and parallelized I/O fragmenting, an integrated Atmospheric Model Across Scales (iAMAS) was established on the new Sunway supercomputer platform to significantly increase the computational efficiency and reduce the I/O cost. The global 3-km atmospheric simulation for meteorology with online integrated aerosol feedbacks with iAMAS was scaled to 39,000,000 processor cores and achieved a speed of 0.82 simulation day per hour (SDPH) with routine I/O, which enabled us to perform a 5-day global weather forecast at 3-km horizontal resolution with online natural aerosol impacts. The results demonstrate the promising prospect that increasing the spatial resolution to a few kilometers with online integrated aerosol feedbacks may significantly improve the global weather forecast.
Keywords: non-hydrostatic global model, domestic supercomputer, convection-permitting resolution, online integrated aerosol, heterogeneous many-core architecture
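As a rough reading of the throughput figure in the abstract above, the short calculation below converts 0.82 simulation days per hour (SDPH) into the wall-clock time for a 5-day forecast. It assumes the reported rate is sustained for the whole run and ignores startup and queueing time; the function name is hypothetical.

```python
def wallclock_hours(forecast_days, sdph):
    """Wall-clock hours needed to simulate `forecast_days` at `sdph`
    simulated days per wall-clock hour (assumes a constant rate)."""
    return forecast_days / sdph

# 5-day forecast at the reported 0.82 SDPH: roughly six hours of wall-clock time.
hours = wallclock_hours(5, 0.82)
```

Under that assumption, a 5-day global forecast at 3-km resolution would take about 6.1 hours of wall-clock time.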