Uniform memory multicore neural network accelerators(UNNAs)furnish huge computing power to emerging neural network applications.Meanwhile,with neural network architectures going deeper and wider,the limited memory cap...Uniform memory multicore neural network accelerators(UNNAs)furnish huge computing power to emerging neural network applications.Meanwhile,with neural network architectures going deeper and wider,the limited memory capacity has become a constraint to deploy models on UNNA platforms.Therefore how to efficiently manage memory space and how to reduce workload footprints are urgently significant.In this paper,we propose Tetris:a heuristic static memory management framework for UNNA platforms.Tetris reconstructs execution flows and synchronization relationships among cores to analyze each tensor’s liveness interval.Then the memory management problem is converted to a sequence permutation problem.Tetris uses a genetic algorithm to explore the permutation space to optimize the memory management strategy and reduce memory footprints.We evaluate several typical neural networks and the experimental results demonstrate that Tetris outperforms the state-of-the-art memory allocation methods,and achieves an average memory reduction ratio of 91.9%and 87.9%for a quad-core and a 16-core Cambricon-X platform,respectively.展开更多
Many systems have been built to employ the delta-based iterative execution model to support iterative algorithms on distributed platforms by exploiting the sparse computational dependencies between data items of these...Many systems have been built to employ the delta-based iterative execution model to support iterative algorithms on distributed platforms by exploiting the sparse computational dependencies between data items of these iterative algorithms in a synchronous or asynchronous approach. However, for large-scale iterative algorithms, existing synchronous solutions suffer from slow convergence speed and load imbalance, because of the strict barrier between iterations;while existing asynchronous approaches induce excessive redundant communication and computation cost as a result of being barrier-free. In view of the performance trade-off between these two approaches, this paper designs an efficient execution manager, called Aiter-R, which can be integrated into existing delta-based iterative processing systems to efficiently support the execution of delta-based iterative algorithms, by using our proposed group-based iterative execution approach. It can efficiently and correctly explore the middle ground of the two extremes. A heuristic scheduling algorithm is further proposed to allow an iterative algorithm to adaptively choose its trade-off point so as to achieve the maximum efficiency. Experimental results show that Aiter-R strikes a good balance between the synchronous and asynchronous policies and outperforms state-of-the-art solutions. It reduces the execution time by up to 54.1% and 84.6% in comparison with existing asynchronous and the synchronous models, respectively.展开更多
基金the Beijing Natural Science Foundation under Grant No.JQ18013the National Natural Science Foundation of China under Grant Nos.61925208,61732007,61732002 and 61906179+1 种基金the Strategic Priority Research Program of Chinese Academy of Sciences(CAS)under Grant No.XDB32050200the Youth Innovation Promotion Association CAS,Beijing Academy of Artificial Intelligence(BAAI)and Xplore Prize.
文摘Uniform memory multicore neural network accelerators(UNNAs)furnish huge computing power to emerging neural network applications.Meanwhile,with neural network architectures going deeper and wider,the limited memory capacity has become a constraint to deploy models on UNNA platforms.Therefore how to efficiently manage memory space and how to reduce workload footprints are urgently significant.In this paper,we propose Tetris:a heuristic static memory management framework for UNNA platforms.Tetris reconstructs execution flows and synchronization relationships among cores to analyze each tensor’s liveness interval.Then the memory management problem is converted to a sequence permutation problem.Tetris uses a genetic algorithm to explore the permutation space to optimize the memory management strategy and reduce memory footprints.We evaluate several typical neural networks and the experimental results demonstrate that Tetris outperforms the state-of-the-art memory allocation methods,and achieves an average memory reduction ratio of 91.9%and 87.9%for a quad-core and a 16-core Cambricon-X platform,respectively.
基金This paper is supported by the National Natural Science Foundation of China under Grant Nos.61832006,61825202,62072193,and 61929103.
文摘Many systems have been built to employ the delta-based iterative execution model to support iterative algorithms on distributed platforms by exploiting the sparse computational dependencies between data items of these iterative algorithms in a synchronous or asynchronous approach. However, for large-scale iterative algorithms, existing synchronous solutions suffer from slow convergence speed and load imbalance, because of the strict barrier between iterations;while existing asynchronous approaches induce excessive redundant communication and computation cost as a result of being barrier-free. In view of the performance trade-off between these two approaches, this paper designs an efficient execution manager, called Aiter-R, which can be integrated into existing delta-based iterative processing systems to efficiently support the execution of delta-based iterative algorithms, by using our proposed group-based iterative execution approach. It can efficiently and correctly explore the middle ground of the two extremes. A heuristic scheduling algorithm is further proposed to allow an iterative algorithm to adaptively choose its trade-off point so as to achieve the maximum efficiency. Experimental results show that Aiter-R strikes a good balance between the synchronous and asynchronous policies and outperforms state-of-the-art solutions. It reduces the execution time by up to 54.1% and 84.6% in comparison with existing asynchronous and the synchronous models, respectively.