7 articles found
System-Enforced Deterministic Streaming for Efficient Pipeline Parallelism (Cited by: 2)
1
Authors: 张昱, 李兆鹏, 曹慧芳. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2015, No. 1, pp. 57-73 (17 pages)
Pipeline parallelism is a popular parallel programming pattern for emerging applications. However, programming pipelines directly on conventional multithreaded shared memory is difficult and error-prone. We present DStream, a C library that provides high-level abstractions of deterministic threads and streams for simply representing pipeline stage workers and their communications. The deterministic stream is built atop our proposed single-producer/multi-consumer (SPMC) virtual memory, which integrates synchronization with the virtual memory model to enforce determinism on shared memory accesses. We investigate various strategies for efficiently implementing DStream atop the SPMC memory, so that an infinite sequence of data items can be asynchronously published (fixed) and asynchronously consumed in order among adjacent stage workers. We have successfully transformed two representative pipeline applications, ferret and dedup, using DStream, and summarize the conversion rules. An empirical evaluation shows that the converted ferret performs on par with its Pthreads and TBB counterparts in terms of running time, while the converted dedup is close to 2.56X and 7.05X faster than the Pthreads counterpart and 1.06X and 3.9X faster than the TBB counterpart on 16 and 32 CPUs, respectively.
Keywords: deterministic parallelism; pipeline parallelism; single-producer/multi-consumer; virtual memory
Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview
2
Authors: 关磊, 李东升, 梁吉业, 王文剑, 葛可适, 卢锡城. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2024, No. 3, pp. 567-584 (18 pages)
Deep learning has become the cornerstone of artificial intelligence, playing an increasingly important role in human production and lifestyle. However, as the complexity of problem-solving increases, deep learning models become increasingly intricate, resulting in a proliferation of large language models with an astonishing number of parameters. Pipeline model parallelism (PMP) has emerged as one of the mainstream approaches to addressing the significant challenge of training “big models”. This paper presents a comprehensive review of PMP. It covers the basic concepts and main challenges of PMP. It also comprehensively compares synchronous and asynchronous pipeline schedules for PMP approaches, and discusses the main techniques to achieve load balance for both intra-node and inter-node training. Furthermore, the main techniques to optimize computation, storage, and communication are presented, and potential research directions are discussed.
Keywords: deep learning; pipeline schedule; load balance; multi-GPU system; pipeline model parallelism (PMP)
Fast Parallel Algorithm for Slicing STL Based on Pipeline (Cited by: 4)
3
Authors: MA Xulong, LIN Feng, YAO Bo. Chinese Journal of Mechanical Engineering (SCIE, EI, CAS, CSCD), 2016, No. 3, pp. 549-555 (7 pages)
In the additive manufacturing field, current research on data processing mainly focuses on slicing large STL files or complicated CAD models. A parallel algorithm has great advantages for improving efficiency and reducing slicing time. However, traditional algorithms cannot make full use of multi-core CPU hardware resources. This paper presents a fast parallel algorithm to speed up data processing. A pipeline mode is adopted to design the parallel algorithm, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of the number of threads and the number of layers are investigated in a series of experiments. The experimental results show that the number of threads and the number of layers are two significant factors in the speedup ratio. Speedup increases with the number of threads, in close agreement with Amdahl's law, and it also increases with the number of layers, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours in parallel. Another parallel algorithm based on data parallelism is used in the experiments to show that the pipeline parallel mode is more efficient. Finally, a case study demonstrates the performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm makes full use of multi-core CPU hardware and accelerates the slicing process; compared with the data parallel slicing algorithm, it achieves a much higher speedup ratio and efficiency.
Keywords: additive manufacturing; STL model; slicing algorithm; data parallel; pipeline parallel
A parallel pipeline connected-component labeling method for on-orbit space target monitoring
4
Authors: LI Zongling, ZHANG Qingjun, LONG Teng, ZHAO Baojun. Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2022, No. 5, pp. 1095-1107 (13 pages)
This paper designs a peripheral maximum gray difference (PMGD) image segmentation method, a connected-component labeling (CCL) algorithm based on dynamic run length (DRL), and a real-time streaming processor implementation for DRL-CCL. Their function and performance in the space target monitoring scenario are verified through an onboard experiment on the Tianzhou-3 cargo spacecraft (TZ-3). The PMGD image segmentation method can quickly segment an image into highly discrete and simple point targets, which greatly reduces the generation of equivalences and improves the real-time performance of DRL-CCL. Through parallel pipeline design, the storage of the streaming processor is optimized by 55% with no need for external memory, the logic is optimized by 60%, and the energy efficiency ratio is 12 times that of a graphics processing unit, 62 times that of a digital signal processor, and 147 times that of a personal computer. Analysis of the results for 8756 images processed on orbit shows a speed of up to 5.88 FPS and a target detection rate of 100%. Our algorithm and implementation method meet the requirements of lightweight, highly real-time, strongly robust, full-time, and stable operation in the space irradiation environment.
Keywords: Tianzhou-3 cargo spacecraft (TZ-3); connected-component labeling (CCL) algorithms; parallel pipeline processing; on-orbit space target detection; streaming processor
A parallel-pipelining software process model
5
Authors: 赵鹏, 龚鹏. Journal of Harbin Institute of Technology (New Series) (EI, CAS), 2007, No. 5, pp. 646-651 (6 pages)
A software process is a framework for the effective and timely delivery of a software system, and it plays a crucial role in software success. However, the development of large-scale software still faces a crisis of high risk, low quality, high cost, and long cycle time. This paper proposes a three-phase parallel-pipelining software process model for improving speed and productivity and reducing software costs and risks without sacrificing software quality. The model presents two strategies. One, based on subsystem-cost priority, prevents wasted development cost and also reduces software complexity; the other, which balances subsystem complexity, is designed to reduce software complexity in the later development stages. Moreover, the proposed function-detailed and workload-simplified subsystem pipelining software process model offers much higher parallelism than the concurrent incremental model. Finally, component-based product line technology not only ensures software quality and further reduces cycle time, software costs, and software risks, but also makes sufficient and rational use of previous software product resources and enhances the competitiveness of software development organizations.
Keywords: software process improvement; parallel pipelining; cost priority; product line
CHAUS: Scalable VM-Based Channels for Unbounded Streaming
6
Authors: Yu Zhang, Yu-Fen Yu, Hui-Fang Cao, Jian-Kang Chen, Qi-Liang Zhang. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2017, No. 6, pp. 1288-1304 (17 pages)
Stream processing is a special form of the dataflow execution model that offers extensive opportunities for optimization and automatic parallelism. A streaming application is represented by a graph of computation stages that communicate with each other via FIFO channels. In a shared-memory environment, a FIFO channel is classically a common, fixed-size synchronized buffer shared between the producer and the consumer. As the number of concurrent stage workers increases, the synchronization overheads, such as contention and waiting times, rise sharply and severely impair application performance. In this paper, we present a novel multithreaded model, CHAUS (Channel for Unbounded Streaming), which isolates memory between threads by default and provides a higher-level abstraction for scalable unicast or multicast communication between threads. The CHAUS model hides the underlying synchronization details, but requires the user to declare the producer-consumer relationship of a channel in advance. It is the duty of the runtime system to ensure reliable data transmission at data item granularity as declared. To achieve an unbounded buffer for streaming and reduce the synchronization overheads, we propose a virtual-memory-based solution to implement a scalable CHAUS channel. We check the programmability of CHAUS by successfully porting dedup and ferret from PARSEC as well as implementing a MapReduce library with a Phoenix-like API. The experimental results show that workloads built with CHAUS run faster than those built with Pthreads, and that CHAUS has the best scalability compared with two Pthreads versions. For three workloads, the CHAUS versions take no more than 0.17x the runtime of the Pthreads versions on both 16 and 32 cores.
Keywords: streaming; thread model; pipeline parallelism; unbounded channel; virtual memory
System architecture for high-performance permissioned blockchains (Cited by: 3)
7
Authors: Libo FENG, Hui ZHANG, Wei-Tek TSAI, Simeng SUN. Frontiers of Computer Science (SCIE, EI, CSCD), 2019, No. 6, pp. 1151-1165 (15 pages)
Blockchain (BC), as an emerging distributed database technology with advanced security and reliability, has attracted much attention from experts devoted to e-finance, intellectual property protection, the internet of things (IoT), and so forth. However, its inefficient transaction processing speed, which hinders the widespread adoption of BC, has not yet been well addressed. In this paper, we propose a novel architecture, called the Dual-Channel Parallel Broadcast model (DCPB), which could address this problem to a greater extent by using three methods: dual communication channels, parallel pipeline processing, and a block broadcast strategy. In the dual-channel model, one channel processes transactions, and the other engages in the execution of BFT. The parallel pipeline processing allows the system to operate asynchronously. The block generation strategy improves the efficiency and speed of processing. Extensive experiments on BeihangChain, a simplified prototype of a BC system, show that its transaction processing speed could be improved to 16K transactions per second, which could well support many real-world scenarios such as a BC-based energy trading system and the Micro-film copyright trading system at CCTV.
Keywords: blockchain; concurrency; performance; dual-channel model; parallel pipeline; consensus algorithm