Funding: This work was supported in part by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010901, the National Natural Science Foundation of China under Grant No. 61229201, and the China Postdoctoral Science Foundation under Grant No. 2012M521250.
Abstract: Pipeline parallelism is a popular parallel programming pattern for emerging applications. However, programming pipelines directly on conventional multithreaded shared memory is difficult and error-prone. We present DStream, a C library that provides high-level abstractions of deterministic threads and streams for simply representing pipeline stage workers and their communications. The deterministic stream is established atop our proposed single-producer/multi-consumer (SPMC) virtual memory, which integrates synchronization with the virtual memory model to enforce determinism on shared memory accesses. We investigate various strategies for efficiently implementing DStream atop the SPMC memory, so that an infinite sequence of data items can be asynchronously published (fixed) and asynchronously consumed in order among adjacent stage workers. We have successfully transformed two representative pipeline applications, ferret and dedup, using DStream, and summarize the conversion rules. An empirical evaluation shows that the converted ferret performs on par with its Pthreads and TBB counterparts in terms of running time, while the converted dedup is close to 2.56X and 7.05X faster than the Pthreads counterpart and 1.06X and 3.9X faster than the TBB counterpart on 16 and 32 CPUs, respectively.
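The publish-once, consume-in-order behavior described in this abstract can be illustrated with a small sketch. It is not DStream's API and does not use virtual memory: the class `Stream`, its methods, and `run_demo` are hypothetical names, and ordinary locking stands in for the SPMC memory. It only shows one producer appending items that several consumers each read in the same order.

```python
import threading


class Stream:
    """Append-only stream: one producer publishes, many consumers read in order.

    Hypothetical illustration; None is reserved as the end-of-stream marker.
    """

    def __init__(self):
        self._items = []          # a published item is never modified again
        self._closed = False
        self._cond = threading.Condition()

    def publish(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify_all()

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def read(self, index):
        """Block until item `index` is published; return None at end of stream."""
        with self._cond:
            while index >= len(self._items) and not self._closed:
                self._cond.wait()
            if index < len(self._items):
                return self._items[index]
            return None


def consume(stream, results):
    """Consumer loop: each consumer keeps its own cursor into the stream."""
    index = 0
    while True:
        item = stream.read(index)
        if item is None:
            break
        results.append(item)
        index += 1


def run_demo(n_items=5, n_consumers=3):
    stream = Stream()
    outputs = [[] for _ in range(n_consumers)]
    workers = [threading.Thread(target=consume, args=(stream, out)) for out in outputs]
    for worker in workers:
        worker.start()
    for i in range(n_items):
        stream.publish(i * i)
    stream.close()
    for worker in workers:
        worker.join()
    return outputs
```

Because items are only appended and each consumer advances a private cursor, every consumer sees the same sequence regardless of thread scheduling.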
Funding: Supported in part by the National Natural Science Foundation of China under Grant Nos. 62025208, U21A20473, U21A20513, 62076154, and 62302512, and the State Administration of Science, Technology, and Industry for National Defense of China under Grant No. WDZC20235250118.
Abstract: Deep learning has become the cornerstone of artificial intelligence, playing an increasingly important role in human production and daily life. However, as the complexity of the problems being solved increases, deep learning models become increasingly intricate, resulting in a proliferation of large language models with an astonishing number of parameters. Pipeline model parallelism (PMP) has emerged as one of the mainstream approaches to addressing the significant challenge of training "big models". This paper presents a comprehensive review of PMP. It covers the basic concepts and main challenges of PMP. It also comprehensively compares synchronous and asynchronous pipeline schedules for PMP approaches, and discusses the main techniques to achieve load balance for both intra-node and inter-node training. Furthermore, the main techniques to optimize computation, storage, and communication are presented, and potential research directions are discussed.
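One challenge of synchronous pipeline schedules is idle time while the pipeline fills. The sketch below is an illustration under stated assumptions, not taken from the paper: it assumes a forward-only schedule in which every stage takes one time step per micro-batch, and the function names are hypothetical. Under those assumptions the idle fraction equals (stages - 1) / (micro-batches + stages - 1).

```python
def forward_schedule(num_stages, num_microbatches):
    """Build a stage-by-time grid; grid[s][t] is the micro-batch run by stage s at step t."""
    steps = num_stages + num_microbatches - 1
    grid = [[None] * steps for _ in range(num_stages)]
    for stage in range(num_stages):
        for mb in range(num_microbatches):
            # Micro-batch mb reaches stage s only after passing the s earlier stages.
            grid[stage][stage + mb] = mb
    return grid


def bubble_fraction(grid):
    """Fraction of (stage, step) slots in which a stage sits idle."""
    cells = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return idle / cells
```

With 4 stages, a single micro-batch leaves three quarters of the slots idle, while splitting the batch into more micro-batches shrinks that share.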
Abstract: In the field of additive manufacturing, current research on data processing mainly focuses on the slicing of large STL files or complicated CAD models. A parallel algorithm has great advantages for improving efficiency and reducing slicing time. However, traditional algorithms cannot make full use of multi-core CPU hardware resources. In this paper, a fast parallel algorithm is presented to speed up data processing. A pipeline mode is adopted to design the parallel algorithm, and the complexity of the pipeline algorithm is analyzed theoretically. To evaluate the performance of the new algorithm, the effects of the number of threads and the number of layers are investigated through a series of experiments. The experimental results show that the number of threads and the number of layers are two significant factors affecting the speedup ratio. Speedup increases with the number of threads, in good agreement with Amdahl's law, and it also increases with the number of layers, in agreement with Gustafson's law. The new algorithm uses topological information to compute contours in parallel. Another parallel algorithm, based on data parallelism, is used in the experiments to show that the pipeline parallel mode is more efficient. Finally, a case study demonstrates the performance of the new parallel algorithm. Compared with the serial slicing algorithm, the new pipeline parallel algorithm makes full use of multi-core CPU hardware and accelerates the slicing process; compared with the data-parallel slicing algorithm, it achieves a much higher speedup ratio and efficiency.
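For reference, the two laws cited in this abstract can be written as short functions. This is a generic sketch of the textbook formulas, not the paper's own complexity analysis; the function and parameter names are chosen here for illustration.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: fixed problem size, `parallel_fraction` of the serial run time is parallelizable."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction, workers):
    """Gustafson's law: problem size grows with `workers`;
    `parallel_fraction` is measured on the parallel run."""
    return (1.0 - parallel_fraction) + parallel_fraction * workers
```

With a parallel fraction of 0.9 and 8 workers, Amdahl's law gives a speedup of about 4.7, while Gustafson's scaled-workload view gives 7.3, which is why larger inputs (here, more layers) tend to show better speedup.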
Abstract: The paper designs a peripheral maximum gray difference (PMGD) image segmentation method, a connected-component labeling (CCL) algorithm based on dynamic run length (DRL), and a real-time streaming processor implementation for DRL-CCL. Their function and performance are verified in a space target monitoring scenario through an onboard experiment on the Tianzhou-3 cargo spacecraft (TZ-3). The PMGD image segmentation method quickly segments the image into highly discrete and simple point targets, which greatly reduces the generation of equivalences and improves the real-time performance of DRL-CCL. Through parallel pipeline design, the storage of the streaming processor is optimized by 55% with no need for external memory, the logic is optimized by 60%, and the energy efficiency ratio is 12 times that of a graphics processing unit, 62 times that of a digital signal processor, and 147 times that of a personal computer. Analysis of the results for 8756 images processed on-orbit shows a speed of up to 5.88 FPS and a target detection rate of 100%. Our algorithm and implementation method meet the requirements of lightweight design, high real-time performance, strong robustness, and full-time, stable operation in the space irradiation environment.
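To make the terms "run length" and "equivalences" concrete, here is a generic run-based connected-component sketch with union-find. It is not the paper's DRL-CCL algorithm or its hardware design; 8-connectivity, a binary image given as a list of rows, and all function names are assumptions made for illustration.

```python
def find_runs(row):
    """Return (start, end) column ranges, end exclusive, of consecutive foreground pixels."""
    runs, start = [], None
    for col, value in enumerate(row):
        if value and start is None:
            start = col
        elif not value and start is not None:
            runs.append((start, col))
            start = None
    if start is not None:
        runs.append((start, len(row)))
    return runs


def find(parent, label):
    """Union-find lookup with path halving."""
    while parent[label] != label:
        parent[label] = parent[parent[label]]
        label = parent[label]
    return label


def count_components(image):
    """Count 8-connected foreground components, processing the image row by row."""
    parent = []      # parent[i] is the union-find parent of provisional label i
    previous = []    # (start, end, label) runs found in the previous row
    for row in image:
        current = []
        for start, end in find_runs(row):
            label = len(parent)
            parent.append(label)
            for p_start, p_end, p_label in previous:
                # Runs in adjacent rows touch if they overlap or meet diagonally.
                if p_start <= end and start <= p_end:
                    root_a, root_b = find(parent, label), find(parent, p_label)
                    if root_a != root_b:
                        # Record an equivalence between two provisional labels.
                        parent[max(root_a, root_b)] = min(root_a, root_b)
            current.append((start, end, label))
        previous = current
    return len({find(parent, label) for label in range(len(parent))})
```

Only the previous row's runs are kept, which is what makes a streaming, row-at-a-time implementation possible; images made of small isolated point targets produce few equivalences to resolve.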
Abstract: A software process is a framework for the effective and timely delivery of a software system. The framework plays a crucial role in software success. However, the development of large-scale software still faces a crisis of high risks, low quality, high costs, and long cycle time. This paper proposes a three-phase parallel-pipelining software process model for improving speed and productivity and reducing software costs and risks without sacrificing software quality. In this model, two strategies are presented. One strategy, based on subsystem-cost priority, is used to prevent wasted development cost and to reduce software complexity; the other, used for balancing subsystem complexity, is designed to reduce software complexity in the later development stages. Moreover, the proposed function-detailed and workload-simplified subsystem pipelining software process model offers much higher parallelism than the concurrent incremental model. Finally, the component-based product line technology not only ensures software quality and further reduces cycle time, software costs, and software risks, but also makes full and rational use of previous software product resources and enhances the competitiveness of software development organizations.
Abstract: Stream processing is a special form of the dataflow execution model that offers extensive opportunities for optimization and automatic parallelism. A streaming application is represented by a graph of computation stages that communicate with each other via FIFO channels. In a shared-memory environment, a FIFO channel is classically a common, fixed-size synchronized buffer shared between the producer and the consumer. As the number of concurrent stage workers increases, the synchronization overheads, such as contention and waiting times, rise sharply and severely impair application performance. In this paper, we present CHAUS (Channel for Unbounded Streaming), a novel multithreaded model that isolates memory between threads by default and provides a higher-level abstraction for scalable unicast or multicast communication between threads. The CHAUS model hides the underlying synchronization details, but requires the user to declare the producer-consumer relationship of a channel in advance. It is the duty of the runtime system to ensure reliable data transmission at data item granularity as declared. To achieve an unbounded buffer for streaming and reduce the synchronization overheads, we propose a virtual-memory-based solution to implement a scalable CHAUS channel. We evaluate the programmability of CHAUS by successfully porting dedup and ferret from PARSEC as well as implementing a MapReduce library with a Phoenix-like API. The experimental results show that workloads built with CHAUS run faster than those built with Pthreads, and that CHAUS has the best scalability compared with two Pthreads versions. For three workloads, the CHAUS versions take no more than 0.17x the runtime of the Pthreads versions on both 16 and 32 cores.
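The classical baseline named in this abstract, a fixed-size synchronized buffer between a producer and a consumer, can be sketched as follows. This is the conventional design that CHAUS aims to improve on, not CHAUS itself; `BoundedChannel`, `DONE`, and `run_pipeline` are hypothetical names.

```python
import threading
from collections import deque


class BoundedChannel:
    """Fixed-size FIFO channel guarded by one lock and two condition variables."""

    def __init__(self, capacity):
        self._buffer = deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        self._not_full = threading.Condition(self._lock)
        self._not_empty = threading.Condition(self._lock)

    def put(self, item):
        with self._not_full:
            while len(self._buffer) >= self._capacity:
                self._not_full.wait()       # producer waits while the buffer is full
            self._buffer.append(item)
            self._not_empty.notify()

    def get(self):
        with self._not_empty:
            while not self._buffer:
                self._not_empty.wait()      # consumer waits while the buffer is empty
            item = self._buffer.popleft()
            self._not_full.notify()
            return item


DONE = object()  # sentinel marking the end of the stream


def run_pipeline(values, capacity=2):
    """Two-stage pipeline: the producer emits values, the consumer squares them."""
    channel = BoundedChannel(capacity)
    results = []

    def producer():
        for value in values:
            channel.put(value)
        channel.put(DONE)

    def consumer():
        while True:
            item = channel.get()
            if item is DONE:
                break
            results.append(item * item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Every `put` and `get` takes the same lock, so contention and waiting grow as more stage workers share channels, which is the overhead the abstract describes.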
Funding: Supported by the National Key Research and Development Program of China (2017YFB1400200) and the National Natural Science Foundation of China (Grant Nos. 61672075, M1450009 and 61462003).
Abstract: Blockchain (BC), as an emerging distributed database technology with advanced security and reliability, has attracted much attention from experts devoted to e-finance, intellectual property protection, the Internet of Things (IoT), and so forth. However, its inefficient transaction processing speed, which hinders BC's widespread adoption, has not yet been well addressed. In this paper, we propose a novel architecture, called the Dual-Channel Parallel Broadcast model (DCPB), which addresses this problem to a greater extent using three methods: dual communication channels, parallel pipeline processing, and a block broadcast strategy. In the dual-channel model, one channel processes transactions, and the other executes BFT. The parallel pipeline processing allows the system to operate asynchronously. The block generation strategy improves the efficiency and speed of processing. Extensive experiments on BeihangChain, a simplified prototype of a BC system, show that its transaction processing speed can be improved to 16K transactions per second, which can support many real-world scenarios such as a BC-based energy trading system and a micro-film copyright trading system at CCTV.