Efficient Task Flow Parallel System for the New-Generation Sunway Processor
Abstract: China's independently developed new-generation Sunway supercomputer has a more powerful memory system and higher computational density than its predecessor, the Sunway TaihuLight, yet its primary programming model remains bulk synchronous parallelism (BSP). The sequential task flow (STF) model uses dataflow information to parallelize serial programs into tasks automatically and achieves asynchronous parallelism through fine-grained synchronization between tasks. Compared with the global synchronization of BSP, STF offers higher parallelism and better load balance, giving users a new option for using the Sunway platform efficiently. On many-core systems, however, the runtime overhead of the STF model directly affects the performance of parallel programs. This paper first analyzes two characteristics of the new-generation Sunway processor that affect an efficient implementation of the STF model. Then, exploiting features unique to the processor architecture, it proposes an agent-based dataflow graph construction mechanism to meet the model's graph-construction requirements, and a lock-free centralized task scheduling mechanism to reduce scheduling overhead. Finally, based on these techniques, an efficient task flow parallel system is implemented for the AceMesh model. Experiments show that the task flow parallel system has significant advantages over traditional runtime support, with a maximum speedup of 2.37x in fine-grained task scenarios, and that AceMesh outperforms the OpenACC model on the Sunway platform, with a maximum speedup of 2.07x on typical applications.
Authors: FU You (傅游); DU Leiming (杜雷明); GAO Xiran (高希然); CHEN Li (陈莉). Affiliations: College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, China; State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing 100190, China.
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2024, No. 12, pp. 137-146 (10 pages).
Funding: Natural Science Foundation of Shandong Province (ZR2022MF274, ZR2021LZH004); National Key Research and Development Program of China (2017YFB0202002).
Keywords: Sequential task flow model; Heterogeneous many-core parallelism; Task scheduling; Dataflow parallelism; Bulk synchronous model