Funding: This work is supported by the National Key Research and Development Project of China under Grant No. 2018YFB1003304 and the Beijing Academy of Artificial Intelligence (BAAI).
Abstract: As data volumes grow rapidly, distributed computation is widely employed in data centers to provide cheap and efficient methods for processing large-scale parallel datasets. Various computation models have been proposed to improve the abstraction of distributed datasets and hide the details of parallelism. However, most of them follow a single-layer partitioning method, which prevents developers from expressing multi-level partitioning operations succinctly. To overcome this problem, we present the NDD (Nested Distributed Dataset) data model, a more compact and expressive extension of the Spark RDD (Resilient Distributed Dataset) that removes the burden of manually writing the logic for multi-level partitioning. Based on the NDD model, we develop an open-source framework called Bigflow, which serves as an optimization layer over the computation engines of the most widely used processing frameworks. With the help of Bigflow, advanced optimization techniques that previously had to be applied manually by experienced programmers are enabled automatically in a distributed data processing job. Currently, Bigflow processes about 3 PB of data daily in the data centers of Baidu. According to user experience, it significantly reduces code length and improves performance compared with the intuitive programming style.
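To make the single-layer versus multi-level partitioning contrast concrete, the following Python sketch shows the kind of hand-written nested grouping that a flat, RDD-style model forces on the developer. The record fields, the pipeline mentioned in the final comment, and all names are illustrative assumptions for this sketch, not Bigflow's or Spark's actual API.

```python
# Hypothetical illustration (not Bigflow's real API): the two-level
# partitioning that a nested dataset abstraction is meant to express directly.
from collections import defaultdict

# Flat records: (website, hour, url) click events.
records = [
    ("a.com", 0, "/x"), ("a.com", 0, "/y"),
    ("a.com", 1, "/x"), ("b.com", 0, "/z"),
]

# Single-layer partitioning: one grouping key; the second level is hand-rolled.
by_site = defaultdict(list)
for site, hour, url in records:
    by_site[site].append((hour, url))

nested = {}
for site, events in by_site.items():
    by_hour = defaultdict(list)      # second partitioning level, written manually
    for hour, url in events:
        by_hour[hour].append(url)
    nested[site] = dict(by_hour)

# A nested-dataset model aims to let both levels be declared in one pipeline,
# conceptually something like: data.group_by(site).group_by(hour).apply(count).
print(nested)  # {'a.com': {0: ['/x', '/y'], 1: ['/x']}, 'b.com': {0: ['/z']}}
```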
Funding: This work was supported by the National Natural Science Foundation of China under Grant No. 61300005.
Abstract: The key to high performance for the GPU architecture lies in its massive threading capability, which drives a large number of cores and enables execution overlapping among threads. In reality, however, the number of threads that can execute simultaneously is often limited by the size of the register file on GPUs. The traditional SRAM-based register file occupies so much chip area that it cannot scale to meet the increasing demand of GPU applications. Racetrack memory (RM) is a promising technology for building large-capacity register files on GPUs due to its high data storage density. However, without careful deployment of an RM-based register file, the lengthy shift operations of RM may hurt performance. In this paper, we explore RM for designing a high-performance register file for the GPU architecture. The high storage density of RM helps to improve thread-level parallelism (TLP), but if the bits of a register are not aligned to the access ports, shift operations are required to move the bits to the ports before they can be accessed, delaying reads and writes. We develop an optimization framework for RM-based register files on GPUs that employs three optimization techniques at the application, compilation, and architecture levels, respectively: we optimize the TLP at the application level, design a register mapping algorithm at the compilation level, and design a preshifting mechanism at the architecture level. Collectively, these optimizations determine a TLP setting that avoids cache and register file resource contention and reduce the shift operation overhead. Experimental results on a variety of representative workloads demonstrate that our optimization framework achieves up to 29% (21% on average) performance improvement.
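As a concrete illustration of the shift-operation problem and of what a compilation-level register mapping can do about it, the following Python sketch uses a simple distance-based cost model and a greedy placement. The single-port track model, the function name, and the cost metric are assumptions made for illustration, not the paper's actual algorithm.

```python
# Illustrative sketch (assumptions, not the paper's exact algorithm): a greedy
# register-to-track-offset mapping that places hot registers closest to the
# access port, so fewer racetrack shift operations are needed per access.
def map_registers(access_counts, port_offset=0, track_length=32):
    """access_counts: {register_name: number_of_accesses}.
    Returns {register_name: offset} and the total shift cost, where the cost
    of one access is the distance between the register's offset and the port."""
    # Offsets sorted by distance to the access port: nearest slots first.
    offsets = sorted(range(track_length), key=lambda o: abs(o - port_offset))
    # Hottest registers get the nearest offsets.
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    mapping = {reg: offsets[i] for i, reg in enumerate(ranked)}
    total_shifts = sum(access_counts[reg] * abs(mapping[reg] - port_offset)
                       for reg in ranked)
    return mapping, total_shifts

counts = {"r0": 120, "r1": 45, "r2": 7, "r3": 300}
mapping, shifts = map_registers(counts)
print(mapping)  # r3 lands at the port; rarely used r2 sits farthest away
print(shifts)   # total shift operations under this simple cost model
```

Under this toy cost model, placing frequently accessed registers at or near the port minimizes the number of shifts, which is the intuition behind a compilation-level mapping; the preshifting mechanism mentioned in the abstract would further hide part of the remaining shift latency at the architecture level.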
Funding: This work was partly supported by the National Key Research and Development Program of China under Grant No. 2020AAA0130400, the Beijing Municipal Science and Technology Program of China under Grant No. Z201100004220007, the National Natural Science Foundation of China under Grant No. 62090021, the Beijing Academy of Artificial Intelligence (BAAI), and the Alibaba Innovative Research (AIR) Program.
Abstract: Resistive random access memory (RRAM) has been demonstrated to implement multiply-and-accumulate (MAC) operations in a highly parallel analog fashion, which dramatically accelerates convolutional neural networks (CNNs). Since CNNs require a considerable number of converters between the analog crossbars and the digital peripheral circuits, recent studies map binary neural networks (BNNs) onto RRAM and binarize the weights to {+1, -1}. However, the two mainstream representations for BNN weights introduce patterns of redundant 0s and 1s when dealing with negative weights. In this work, we reduce the area occupied by redundant 0s and 1s by proposing a BNN weight representation framework based on a novel pattern representation and a corresponding architecture. First, we split the weight matrix into several small matrices by clustering adjacent columns together. Second, we extract the 1s' patterns, i.e., the submatrices containing only 1s, from each small weight matrix, so that each final output can be represented by the sum of several patterns. Third, we map these patterns onto RRAM crossbars, including pattern computation crossbars (PCCs) and pattern accumulation crossbars (PACs). Finally, we compare the pattern representation with the two mainstream representations and adopt the more area-efficient one. The evaluation results demonstrate that our framework can effectively save over 20% of crossbar area compared with the two mainstream representations.
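The pattern-extraction idea can be illustrated with a small, simplified sketch: adjacent columns of a 0/1 weight matrix are grouped, the rows that are 1 across the whole group form a shared pattern, and each output column is then the sum of the shared pattern and its own residual bits. The grouping width, function name, and minimum-based extraction below are illustrative assumptions, not the exact procedure proposed in the paper.

```python
# Simplified sketch (an illustration, not the paper's exact procedure): split a
# {0,1} weight matrix into groups of adjacent columns, then express each output
# column as a shared "1s pattern" (rows that are 1 across the whole group) plus
# a column-specific residual, so the shared pattern is stored only once.
import numpy as np

def extract_patterns(weights, group_width=2):
    """weights: 2D 0/1 array (rows = inputs, columns = outputs).
    Returns, per column group, the shared all-ones pattern and the residual
    bits each column still needs on top of the shared pattern."""
    groups = []
    for start in range(0, weights.shape[1], group_width):
        block = weights[:, start:start + group_width]
        shared = block.min(axis=1)          # rows that are 1 in every column
        residual = block - shared[:, None]  # column-specific leftovers
        groups.append((shared, residual))
    return groups

w = np.array([[1, 1, 0, 1],
              [1, 1, 1, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 0]])
for shared, residual in extract_patterns(w):
    # Each output column = shared pattern + its residual column.
    print(shared, residual.T)
```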
Acknowledgments: We would like to thank all the authors for their contributions, including those whose manuscripts were not accepted. Our special thanks also go to the reviewers for their valuable time and thorough evaluation of the manuscripts. We appreciate the Editor-in-Chief, Professor Guo-Jie Li, for hosting this special section in the Journal of Computer Science and Technology (JCST). We are also very grateful to the editorial office staff of JCST for their excellent work during the preparation of this special section.
Abstract: The ACM SIGOPS ChinaSys conference is organized twice a year by ChinaSys, an active community for researchers and practitioners of computer systems in China. Since August 2015, ChinaSys has been an ACM SIGOPS chapter. The first ChinaSys conference was held in November 2011 in Shenzhen. It has since become a new leading international forum for academia, industry, and government to present novel research results in the principles and practice of computer systems. All topic areas related to the design and implementation of computer systems are of interest and in scope.