Funding: The Natural Science Foundation of Jiangsu Province (BK2001205).
Abstract: This paper analyzes the main elements of the NS network simulator, gives a detailed view of dataflow management in a link, a node, and an agent, respectively, and introduces the information recorded in its trace file. Based on an analysis of how different packets are transported and processed in NS, a dataflow state machine is proposed together with the events that trigger its state transitions, and a dataflow analyzer is designed and implemented according to it. Driven by the state machine, the analyzer can compute statistics on the total transported flux of a specified dataflow and produce a general fluctuation diagram. Finally, a concrete example is used to test its performance.
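As a rough illustration of such a trace-driven analyzer, the sketch below walks ns-2-style trace lines, treats the event flag as the state-transition trigger, and accumulates the flux of one flow; the field positions and state names are assumptions for illustration, not taken from the paper.

```python
# Sketch: per-flow flux statistics from ns-2-style trace lines.
# Assumed layout: event time src dst type size flags fid ... (fid = field 8).
EVENTS = {"+": "ENQUEUED", "-": "DEQUEUED", "r": "RECEIVED", "d": "DROPPED"}

def analyze(trace_lines, flow_id):
    total_bytes = 0
    samples = []            # (time, cumulative bytes) for a fluctuation diagram
    for line in trace_lines:
        f = line.split()
        if len(f) < 8 or f[7] != flow_id:
            continue
        state = EVENTS.get(f[0])        # trace event triggers a state change
        if state == "RECEIVED":         # count only packets that arrive
            total_bytes += int(f[5])
            samples.append((float(f[1]), total_bytes))
    return total_bytes, samples
```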
Abstract: The Sub Farm Interface is the event builder of the ATLAS (A Toroidal LHC ApparatuS) Dataflow System. It receives event fragments from the Read Out System, builds full events, and sends complete events to the Event Filter for high-level event selection. This paper describes the implementation of the Sub Farm Interface and discusses some issues in SFI (Sub Farm Interface) optimization as well as the monitoring service inside SFI.
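To make the event-building step concrete, here is a hedged sketch of assembling read-out fragments into full events keyed by event ID; the `Fragment` and `EventBuilder` names are invented for this example and are not ATLAS code.

```python
# Sketch: collect fragments per event id; emit the full event once
# every expected read-out source has reported.
from dataclasses import dataclass

@dataclass
class Fragment:
    event_id: int
    source_id: int
    payload: bytes

class EventBuilder:
    def __init__(self, expected_sources):
        self.expected = set(expected_sources)
        self.pending = {}                      # event_id -> {source_id: payload}

    def add(self, frag):
        parts = self.pending.setdefault(frag.event_id, {})
        parts[frag.source_id] = frag.payload
        if set(parts) == self.expected:        # all sources reported
            del self.pending[frag.event_id]
            return b"".join(parts[s] for s in sorted(parts))
        return None                            # event still incomplete
```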
Funding: Supported by NSFC under Grant Nos. 61702493 and 51707191; the Science and Technology Planning Project of Guangdong Province under Grant No. 2018B030338001; Shenzhen S&T Funding under Grant No. KQJSCX20170731163915914; the Basic Research Program under Grant Nos. JCYJ20170818164527303 and JCYJ20180507182619669; and the SIAT Innovation Program for Excellent Young Researchers under Grant No. 2017001.
Abstract: Driven by the continuous scaling of nanoscale semiconductor technologies, the past years have witnessed progressive advances in machine learning techniques and applications. Recently, dedicated machine learning accelerators, especially for neural networks, have attracted the research interest of computer architects and VLSI designers. State-of-the-art accelerators increase performance by deploying a huge number of processing elements, yet they still suffer from degraded resource utilization across hybrid and non-standard algorithmic kernels. In this work, we exploit the properties of important neural network kernels for both perception and control to propose a reconfigurable dataflow processor, which adjusts the patterns of data flow, the functionality of the processing elements, and the on-chip storage according to the network kernel. In contrast to state-of-the-art fine-grained data flowing techniques, the proposed coarse-grained dataflow reconfiguration approach enables extensive sharing of computing and storage resources. Three hybrid networks, for MobileNet, deep reinforcement learning, and sequence classification, are constructed and analyzed with customized instruction sets and a toolchain. A test chip has been designed and fabricated in UMC 65 nm CMOS technology, with a measured power consumption of 7.51 mW at 100 MHz on a die size of 1.8×1.8 mm².
Funding: This work was supported in part by the National Natural Science Foundation of China (No. 61572191), the Natural Science Foundation of Hunan Province (Nos. 2022JJ30398, 2022JJ40277, and 2022JJ40278), and the Scientific Research Fund of Hunan Provincial Education Department (No. 17A130).
Abstract: Edge computing can alleviate the problem of insufficient computational resources on user equipment, improve the network processing environment, and enhance the user experience. Edge computing is widely regarded as a promising method for developing the Internet of Things (IoT). However, with the development of smart terminals, scheduling the high-intensity upstream dataflow from terminals in the edge server takes much more time than scheduling the downstream dataflow. In this paper, we study scheduling strategies for upstream dataflows in edge computing networks and introduce a three-tier edge computing network architecture. We propose a Time-Slicing Self-Adaptive Scheduling (TSAS) algorithm based on a hierarchical queue, which can reduce the queuing delay of the dataflow, improve the timeliness of dataflow processing, and achieve efficient and well-balanced dataflow scheduling. The experimental results show that the TSAS algorithm can reduce latency, minimize energy consumption, and increase system throughput.
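A minimal sketch of time-sliced scheduling over a hierarchical queue, in the spirit of the TSAS description above; the level layout, quantum rule, and adaptation heuristic are assumptions for illustration, not the paper's algorithm.

```python
# Sketch: round-robin time slicing over prioritized dataflow queues.
# The slice size adapts (crudely) to the backlog at each level to mimic
# self-adaptive scheduling; higher levels get larger effective slices.
from collections import deque

def tsas_step(queues, base_slice=10):
    """queues: deques ordered high->low priority; each item is
    (flow_id, remaining_work). Returns (flow_id, work_done) pairs."""
    done = []
    for level, q in enumerate(queues):
        if not q:
            continue
        quantum = base_slice * (1 + len(q) // 4) // (level + 1)
        flow_id, remaining = q.popleft()
        work = min(quantum, remaining)
        done.append((flow_id, work))
        if remaining > work:                    # requeue unfinished flow
            q.append((flow_id, remaining - work))
    return done
```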
Funding: This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2015AA01A301, the National Natural Science Foundation of China under Grant No. 61332009, the National HeGaoJi Project of China under Grant No. 2013ZX0102-8001-001-001, and the Beijing Municipal Science and Technology Commission under Grant Nos. Z15010101009 and Z151100003615006.
Abstract: Dataflow architecture has shown its advantages in many high-performance computing cases. In dataflow computing, large amounts of data are frequently transferred among processing elements through the network-on-chip (NoC); thus the router design has a significant impact on the performance of a dataflow architecture. Common routers are designed for control-flow multi-core architectures, and we find they are not suitable for dataflow architectures. In this work, we analyze and extract the features of data transfers in the NoCs of dataflow architectures: multiple destinations, a high injection rate, and performance that is sensitive to delay. Based on these three features, we propose a novel and efficient NoC router for dataflow architecture. The proposed router supports multi-destination transfers and can therefore deliver data to multiple destinations in a single transfer. Moreover, the router adopts output buffering to maximize throughput and non-flit packets to minimize transfer delay. Experimental results show that the proposed router can improve the performance of a dataflow architecture by 3.6x over a state-of-the-art router.
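To make the multi-destination idea concrete, a hedged sketch follows in which a packet carries a destination bitmask and a router forks it once per distinct next hop; the `next_hop_of` and `deliver` callables are placeholders, not the paper's router design.

```python
# Sketch: one multi-destination packet replaces several unicast ones.
from dataclasses import dataclass

@dataclass
class Packet:
    dest_mask: int       # bit i set => node i is a destination
    payload: object

def route(packet, local_id, next_hop_of, deliver):
    """Consume the local destination bit, then fork one copy of the
    packet per distinct next hop covering the remaining destinations."""
    mask = packet.dest_mask
    if mask & (1 << local_id):
        deliver(local_id, packet.payload)
        mask &= ~(1 << local_id)
    forks = {}
    for d in range(mask.bit_length()):
        if mask & (1 << d):
            hop = next_hop_of(local_id, d)
            forks[hop] = forks.get(hop, 0) | (1 << d)
    return [(hop, Packet(m, packet.payload)) for hop, m in forks.items()]
```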
Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200501, the National Natural Science Foundation of China under Grant Nos. 61332009 and 61521092, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2016A04, and the Beijing Municipal Science and Technology Commission under Grant No. Z15010101009.
Abstract: Double buffering is an effective mechanism for hiding the latency of data transfers between on-chip and off-chip memory. However, in a dataflow architecture, swapping the two buffers during the execution of many tiles degrades performance because the dataflow accelerator is repetitively filled and drained. In this work, we propose a non-stop double buffering mechanism for dataflow architecture. The proposed non-stop mechanism assigns tiles to the processing element array without stopping the execution of the processing elements, by optimizing the control logic of the dataflow architecture. Moreover, we propose a work-flow program to cooperate with the non-stop double buffering mechanism. After optimizing both the control logic and the work-flow program, the filling and draining of the array needs to be done only once across the execution of all tiles belonging to the same dataflow graph. Experimental results show that the proposed double buffering mechanism for dataflow architecture achieves a 16.2% average efficiency improvement over the design without the optimization.
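For reference, a sketch of the baseline ping-pong double buffering that the non-stop mechanism improves on; the loop below is sequential, whereas real hardware overlaps the DMA load with the compute.

```python
# Sketch: classic double buffering over tiles. While buffer i is being
# computed, buffer 1-i is (conceptually) being filled by DMA.
def run_tiles(tiles, load_tile, compute_tile):
    bufs = [None, None]
    bufs[0] = load_tile(tiles[0])                      # prefetch first tile
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            bufs[(i + 1) % 2] = load_tile(tiles[i + 1])  # fill the idle buffer
        compute_tile(bufs[i % 2])                      # drain the ready buffer
```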
Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200501, the National Natural Science Foundation of China under Grant Nos. 61332009 and 61521092, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2016A04, the Beijing Municipal Science and Technology Commission under Grant No. Z15010101009, the Open Project Program of the State Key Laboratory of Computer Architecture under Grant No. CARCH201503, the China Scholarship Council, and the Beijing Advanced Innovation Center for Imaging Technology.
Abstract: With the coming of the exascale supercomputing era, power efficiency has become the most important obstacle to building an exascale system. Dataflow architectures have a native advantage in achieving high power efficiency for scientific applications. However, state-of-the-art dataflow architectures fail to exploit high parallelism for loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes iterations in loops flow through the processing element (PE) array of a dataflow accelerator. This method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, reducing the complexity of the computing kernel and laying a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to solve the iteration dependency problem. Via these two techniques, the average number of instructions ready to execute per cycle is increased to keep the floating-point units busy. Simulation results show that our proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of the two techniques is acceptable.
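A back-of-the-envelope model of why pipelining iterations helps: once the PE pipeline is filled, one iteration completes per cycle instead of one per full pass. The stage counts below are illustrative, not PLO measurements.

```python
# Sketch: cycle counts for a loop with and without iteration pipelining.
def loop_cycles(n_iterations, n_stages):
    pipelined = n_stages + (n_iterations - 1)   # fill once, then 1 per cycle
    sequential = n_stages * n_iterations        # no overlap between iterations
    return pipelined, sequential

print(loop_cycles(100, 5))   # (104, 500): ~4.8x fewer cycles when pipelined
```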
Funding: This work was supported in part by the National Natural Science Foundation of China (No. 61872038).
Abstract: The pervasiveness of the smart Internet of Things (IoT) enables many electric sensors and devices to be connected, generating a large amount of dataflow. Compared with traditional big data, streaming dataflow faces distinctive challenges, such as high speed, strong variability, rough continuity, and demanding timeliness, which pose severe tests for its efficient management. In this paper, we provide an overall review of IoT dataflow management. We first analyze the key challenges of IoT dataflow and give an initial overview of the related techniques in dataflow management, spanning dataflow sensing, mining, control, security, privacy protection, etc. Then, we illustrate and compare representative tools and platforms for IoT dataflow management. In addition, promising application scenarios, such as smart cities, smart transportation, and smart manufacturing, are elaborated, which will provide significant guidance for further research. The management of IoT dataflow is an important area that merits in-depth discussion and further study.
Funding: This work is supported by the National Natural Science Foundation of China under Grant No. 61373025.
Abstract: Inheriting a data-driven rather than location-driven communication pattern, Named Data Networking (NDN) offers better support for network-layer dataflow. However, application developers have to handle complex tasks, such as data segmentation, packet verification, and flow control, due to the lack of proper transport-layer protocols above the network layer. In this study, we design a dataflow-oriented programming interface that provides transport strategies for NDN, which greatly improves the efficiency of developing applications. The interface presents two application data unit (ADU) retrieval strategies for different data publishing patterns, adopting an adaptive ADU pipelining algorithm that controls the dataflow based on the current network status and data generation rate. The interface also offers network measurement strategies to monitor an abundance of critical metrics influencing application performance. We verify the functionality and performance of our interface by implementing a video streaming application spanning 11 time zones over the worldwide NDN testbed. Our experiments show that the interface can efficiently support the development of high-performance, dataflow-driven NDN applications.
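One plausible shape for the adaptive ADU pipelining is an AIMD-style window on outstanding retrievals, sketched below; the class name, constants, and update rules are assumptions, not the paper's algorithm.

```python
# Sketch: adapt the number of in-flight ADU requests (the pipeline
# window) from observed outcomes, AIMD style.
class AduPipeline:
    def __init__(self, min_win=1, max_win=64):
        self.window = min_win
        self.min_win, self.max_win = min_win, max_win

    def on_data(self):                 # ADU arrived in time: grow additively
        self.window = min(self.max_win, self.window + 1)

    def on_timeout(self):              # congestion or slow producer: halve
        self.window = max(self.min_win, self.window // 2)

    def may_send(self, outstanding):   # gate new requests on the window
        return outstanding < self.window
```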
Abstract: The Manchester dataflow computer is a famous dynamic dataflow computer. It is centralized in architecture and simple in organization, and its overhead for communication and scheduling is very small. Its efficiency drops, however, as the number of processing elements in the processing subsystem increases. Several articles have evaluated its performance and presented improved methods. The authors studied its processing subsystem and carried out a simulation. The simulation results show that the efficiency of the processing subsystem drops dramatically when the average number of instruction execution microcycles becomes small and the maximum instruction execution rate is nearly attained. Two improved methods are presented to overcome this disadvantage. The improved processing subsystem, with a cheap distributor made up of a bus and a two-level fixed-priority circuit, possesses almost full efficiency whether the average number of instruction execution microcycles is large or small, even when the maximum instruction execution rate is approached.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 601173006 and 61221062, and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA06010403.
Abstract: The combination of growing transistor counts and a limited power budget within a silicon die leads to the utilization wall problem (a.k.a. "dark silicon"): only a small fraction of a chip can run at full speed during a period of time. Designing accelerators for specific applications or algorithms is considered one of the most promising approaches to improving energy efficiency. However, most current design methods for accelerators are dedicated to certain applications or algorithms, which greatly constrains their applicability. In this paper, we propose a novel general-purpose many-accelerator architecture. Our contributions are two-fold. First, we propose to cluster the dataflow graphs (DFGs) of hotspot basic blocks (BBs) in applications; the DFG clusters are then used for accelerator design. This is because a DFG is the largest program unit that is not specific to a certain application. We analyze 17 benchmarks in SPEC CPU 2006, acquire over 300 DFG hotspots using the LLVM compiler tool, and divide them into 15 clusters based on graph similarity. Second, we introduce a function instruction set architecture (FISC) and illustrate how DFG accelerators can be integrated with a processor core and used by applications. Our results show that the proposed DFG clustering and FISC design can speed up SPEC benchmarks by 6.2X on average.
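To illustrate clustering DFGs by graph similarity, here is a hedged sketch that greedily groups graphs whose edge-label sets exceed a Jaccard-similarity threshold; the metric and threshold are assumptions, as the abstract does not specify them.

```python
# Sketch: greedy clustering of dataflow graphs by Jaccard similarity
# of their operation-edge sets.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_dfgs(dfgs, threshold=0.6):
    """dfgs: list of sets of (op_src, op_dst) edge labels."""
    clusters = []   # each cluster: (representative edge set, member indices)
    for i, g in enumerate(dfgs):
        for rep, members in clusters:
            if jaccard(g, rep) >= threshold:
                members.append(i)
                break
        else:                                # no similar cluster: start one
            clusters.append((g, [i]))
    return clusters
```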
Funding: This work was supported by the Project of the State Grid Corporation of China in 2020, "Integration Technology Research and Prototype Development for High End Controller Chip", under Grant No. 5700-202041264A-0-0-00.
Abstract: The dataflow architecture, which is characterized by a lack of redundant unified control logic, has been shown to have an advantage over the control-flow architecture, as it improves computational performance and power efficiency, especially for applications used in high-performance computing (HPC). Importantly, the high computational efficiency of systems using the dataflow architecture is achieved by allowing program kernels to be activated simultaneously. Therefore, a proper acknowledgment mechanism is required to distinguish data that logically belongs to different contexts. Possible solutions include the tagged-token matching mechanism, in which data is sent before acknowledgments are received but retried after rejection, and a handshake mechanism, in which data is sent only after acknowledgments are received. However, these mechanisms suffer from both inefficient data transfer and increased area cost, and good performance of the dataflow architecture depends on the efficiency of data transfer. To optimize the efficiency of data transfer in existing dataflow architectures with a minimal increase in area and power cost, we propose a Look-Ahead Acknowledgment (LAA) mechanism. LAA accelerates the execution flow by speculatively acknowledging ahead of time, without penalties. Our simulation analysis based on a handshake mechanism shows that LAA increases the average utilization of computational units by 23.9%, reduces the average execution time by 17.4%, and increases the average power efficiency of dataflow processors by 22.4%. Crucially, our approach results in a relatively small increase in the area and power consumption of the on-chip logic of less than 0.9%. In conclusion, the evaluation results suggest that Look-Ahead Acknowledgment is an effective improvement for data transfer in existing dataflow architectures.
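A toy credit model of the difference between a handshake and look-ahead acknowledgment: with LAA, the next credit is granted while the current datum is still in flight, hiding the round trip. The numbers are illustrative only, not the paper's simulation.

```python
# Sketch: transfers completed in a fixed cycle budget under a strict
# handshake (wait a full round trip per datum) vs look-ahead ack
# (the grant overlaps the transfer, so the producer never idles).
def transfers(cycles, rtt, lookahead):
    sent, credit_ready_at = 0, 0
    for t in range(cycles):
        if t >= credit_ready_at:
            sent += 1
            credit_ready_at = t + (1 if lookahead else rtt)
    return sent

print(transfers(1000, rtt=4, lookahead=False))  # 250 transfers
print(transfers(1000, rtt=4, lookahead=True))   # 1000 transfers
```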