Funding: Supported in part by the NRF (National Research Foundation of Korea) Grant (No. 2019R1A2C1009275) and by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub).
Abstract: Driven by the recent trend toward software intelligence in the Fourth Industrial Revolution, deep learning has become a mainstream workload for modern computer systems. As the data size of deep learning keeps growing, efficiently managing limited memory capacity for deep learning workloads becomes important. In this paper, we analyze memory accesses in deep learning workloads and identify several characteristics that distinguish them from traditional workloads. First, when comparing instruction and data accesses, data accesses account for 96%–99% of total memory accesses in deep learning workloads, which is quite different from traditional workloads. Second, when comparing read and write accesses, write accesses dominate, accounting for 64%–80% of total memory accesses. Third, although write accesses make up the majority of memory accesses, they show a low access bias, with a Zipf parameter of 0.3. Fourth, in predicting re-access, recency is important for read accesses, but frequency provides more accurate information for write accesses. Based on these observations, we introduce a Non-Volatile Random Access Memory (NVRAM)-accelerated memory architecture for deep learning workloads and present a new memory management policy for this architecture. By considering the memory access characteristics of deep learning workloads, the proposed policy improves memory performance by 64.3% on average compared to the CLOCK policy.
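To give a feel for what a Zipf parameter of 0.3 implies, the sketch below computes the share of accesses that would land on the most popular pages if page popularity followed an idealized Zipf distribution. The function name `zipf_top_share` and the page counts are illustrative assumptions, not taken from the paper; the only point is that a smaller parameter means accesses are spread more evenly, so a small set of hot pages captures fewer of them.

```python
def zipf_top_share(num_pages: int, top_k: int, s: float) -> float:
    """Fraction of accesses hitting the top_k most popular pages when the
    page at popularity rank r is accessed with weight r ** -s (Zipf)."""
    if not 0 < top_k <= num_pages:
        raise ValueError("top_k must be in 1..num_pages")
    weights = [rank ** -s for rank in range(1, num_pages + 1)]
    return sum(weights[:top_k]) / sum(weights)


if __name__ == "__main__":
    # Hypothetical setup: 1000 pages, hottest 10% of them.
    for s in (0.0, 0.3, 1.0):
        share = zipf_top_share(1000, 100, s)
        print(f"s={s}: top 10% of pages receive {share:.1%} of accesses")
```

With `s = 0.0` every page is equally likely, so the top 10% of pages receive exactly 10% of accesses; the share grows as `s` increases.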
Funding: This work was supported in part by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (MSIT) under Grant 2020R1A2C100526513, and in part by the R&D Program for Forest Science Technology (Project No. 2021338C10-2323-CD02) provided by Korea Forest Service (Korea Forestry Promotion Institute).
Abstract: Recently, mobile apps have added more features to improve user convenience. Mobile operating systems load as many apps as possible into memory for faster app launching and execution. Least recently used (LRU)-based termination of cached apps is a widely adopted approach when free space in main memory is running low. However, LRU-based cached app termination does not distinguish between frequently and infrequently used apps, and app launch performance degrades if LRU terminates frequently used apps. Recent studies have suggested using users' app usage patterns to predict the next app launch and address the limitations of the current LRU approach. However, existing methods focus only on predicting the probability of the next launch and do not consider how soon the app will launch again. In this paper, we present a new approach for predicting future app launches by utilizing the relaunch distance. We define the relaunch distance as the interval between two consecutive launches of an app and propose memory management based on app relaunch prediction (M2ARP). M2ARP utilizes past app usage patterns to predict the relaunch distance. It uses the predicted relaunch distance to determine which apps are least likely to be launched soon and terminates them to improve the efficiency of main memory.
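The idea of relaunch distance can be sketched as follows. This is a minimal illustration, not M2ARP itself: it assumes the interval is measured in positions of a launch log, and it uses the mean of past distances as a stand-in predictor because the abstract does not describe the actual prediction model. The function names `relaunch_distances` and `pick_victim` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def relaunch_distances(launch_log):
    """Return (distances, last_seen): for each app, the gaps in log positions
    between consecutive launches, and the position of its latest launch."""
    last_seen = {}
    distances = defaultdict(list)
    for position, app in enumerate(launch_log):
        if app in last_seen:
            distances[app].append(position - last_seen[app])
        last_seen[app] = position
    return dict(distances), last_seen


def pick_victim(launch_log, cached_apps):
    """Choose the cached app whose predicted next launch is furthest away."""
    distances, last_seen = relaunch_distances(launch_log)
    now = len(launch_log)

    def expected_wait(app):
        history = distances.get(app)
        if not history:
            # Never relaunched so far: assume it is least likely to return soon.
            return float("inf")
        predicted_next = last_seen[app] + mean(history)  # stand-in predictor
        return max(predicted_next - now, 0.0)

    return max(cached_apps, key=expected_wait)


if __name__ == "__main__":
    log = ["mail", "chat", "mail", "game", "mail", "chat"]
    print(pick_victim(log, ["mail", "chat"]))
```

In this toy log, "chat" was used most recently, so LRU would terminate "mail"; the distance-based rule instead picks "chat", because "mail" has been relaunched every two positions and is due again.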
Funding: This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (MSIT) under Grant 2020R1A2C1005265.
Abstract: The number of functions aimed at improving user convenience in smartphone applications is increasing. In addition, more mobile applications are being loaded into mobile operating system memory for faster launches, increasing the memory requirements of smartphones. The memory used by applications in mobile operating systems is managed by software; allocated memory is freed either by considering the usage state of the application or by terminating the least recently used (LRU) application. Because LRU-based memory management schemes do not consider application launch frequency in a low-memory situation, current mobile operating systems can terminate a frequently executed application, thereby increasing its relaunch time. This study proposes a memory management system that efficiently utilizes main memory space by analyzing application usage information. The proposed system reduces application launch time by keeping the applications that are most frequently used or most likely to run in main memory for as long as possible. A performance evaluation using actual smartphone usage records showed that the proposed memory management system increases the number of times applications resume from main memory compared with the conventional memory management system, and that the average application execution time is reduced by approximately 17%.
Abstract: Engineering application domains need database management systems that provide a good means of modeling, high data access efficiency, and a language interface with strong functionality. This paper presents a relation-based semantic hypergraph model for expressing many-to-many relations among objects belonging to different semantic classes in engineering applications. A management mechanism expressed by the model and the basic data of engineering databases are managed in main memory. In particular, different objects are linked by different kinds of user-defined semantics; as a result, table swaps, record swaps, and some unnecessary examinations are reduced, and the access efficiency of the engineering data is increased. A C language interface that includes some generic and special functionality is proposed for closer connection with application programs.
Funding: The Beijing Natural Science Foundation under Grant No. JQ18013; the National Natural Science Foundation of China under Grant Nos. 61925208, 61732007, 61732002 and 61906179; the Strategic Priority Research Program of Chinese Academy of Sciences (CAS) under Grant No. XDB32050200; the Youth Innovation Promotion Association CAS, Beijing Academy of Artificial Intelligence (BAAI) and Xplore Prize.
Abstract: Uniform memory multicore neural network accelerators (UNNAs) furnish huge computing power to emerging neural network applications. Meanwhile, as neural network architectures grow deeper and wider, limited memory capacity has become a constraint on deploying models on UNNA platforms. How to manage memory space efficiently and how to reduce workload footprints are therefore pressing questions. In this paper, we propose Tetris, a heuristic static memory management framework for UNNA platforms. Tetris reconstructs execution flows and synchronization relationships among cores to analyze each tensor's liveness interval. The memory management problem is then converted into a sequence permutation problem. Tetris uses a genetic algorithm to explore the permutation space to optimize the memory management strategy and reduce memory footprints. We evaluate several typical neural networks, and the experimental results demonstrate that Tetris outperforms state-of-the-art memory allocation methods and achieves an average memory reduction ratio of 91.9% and 87.9% for a quad-core and a 16-core Cambricon-X platform, respectively.
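Why the order of placement matters can be shown with a small sketch. It assumes each tensor is described by a size and an inclusive liveness interval, places tensors one by one at the lowest free offset, and then searches over orderings. The exhaustive search over `itertools.permutations` is a stand-in for the genetic algorithm mentioned in the abstract and is only feasible for a handful of tensors; the `Tensor` type and the functions `footprint` and `best_order` are illustrative assumptions, not the Tetris implementation.

```python
from itertools import permutations
from typing import NamedTuple


class Tensor(NamedTuple):
    name: str
    size: int
    start: int  # first step at which the tensor is live (inclusive)
    end: int    # last step at which the tensor is live (inclusive)


def footprint(order):
    """Place tensors in the given order, each at the lowest offset that does
    not collide with a tensor live at the same time; return peak memory."""
    placed = []  # list of (offset, tensor)
    for t in order:
        # Address ranges occupied by already placed tensors that overlap t in time.
        busy = sorted((off, off + p.size) for off, p in placed
                      if p.start <= t.end and t.start <= p.end)
        offset = 0
        for lo, hi in busy:
            if offset + t.size <= lo:
                break  # t fits in the gap before this occupied range
            offset = max(offset, hi)
        placed.append((offset, t))
    return max((off + t.size for off, t in placed), default=0)


def best_order(tensors):
    """Exhaustive stand-in for the permutation search (small inputs only)."""
    return min(permutations(tensors), key=footprint)


if __name__ == "__main__":
    a = Tensor("a", 1, 0, 0)
    b = Tensor("b", 1, 0, 1)
    c = Tensor("c", 2, 1, 1)
    print(footprint([a, b, c]), footprint(best_order([a, b, c])))
```

In this example, placing the tensors in the order a, b, c leaves b stranded at offset 1 and pushes c above it, while other orderings pack the same tensors into a smaller region.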
Abstract: Although computer architectures incorporate fast processing hardware resources, high-performance real-time implementation of a complex control algorithm requires efficient design and software coding of the algorithm so as to exploit special features of the hardware and avoid associated architectural shortcomings. This paper investigates analysis and design mechanisms that reduce execution time when implementing real-time control algorithms. The proposed mechanisms are exemplified by means of one algorithm, which demonstrates their applicability to real-time applications. An active vibration control (AVC) algorithm for a flexible beam system simulated using the finite difference (FD) method is considered to demonstrate the effectiveness of the proposed methods. A comparative performance evaluation of the proposed design mechanisms is presented and discussed through a set of experiments.
Abstract: This paper presents techniques and approaches for achieving a real-time JPEG2000 compression system using DSP chips. We propose a three-DSP real-time parallel processing system that uses efficient memory management for the discrete wavelet transform (DWT) and a parallel-pass architecture for embedded block coding with optimized truncation (EBCOT). The system compresses 1392×1040-pixel monochrome images at 10 fps per camera for two digital still cameras and is shown to be a practical and efficient DSP solution.
Abstract: A key-value store offers flexibility in data types because it does not need the data types to be specified in advance and can store any type of data as the value of a key-value pair. Various studies have been conducted to improve the performance of key-value stores while maintaining this flexibility. However, research on storing large-scale values such as multimedia data files (e.g., images or videos) in a key-value store has been limited. In this study, we propose a new key-value store, WR-Store++, that aims to store large-scale values stably. Specifically, it provides a new design that separates data and index by working with the built-in data structure of the Windows operating system and the file system. Using the built-in data structure of the Windows operating system makes the key-value store efficient, and using the file system significantly extends the limited storage space. We also present chunk-based memory management and parallel processing in WR-Store++ to further improve its performance in the GET operation. Through experiments, we show that WR-Store++ can store datasets at least 32.74 times larger than the existing baseline key-value store, WR-Store, which is limited in storing large-scale datasets. Furthermore, in terms of processing efficiency, we show that WR-Store++ outperforms not only WR-Store but also other state-of-the-art key-value stores (LevelDB, RocksDB, and BerkeleyDB) for individual key-value operations and mixed workloads.
Abstract: On virtualization platforms, peak memory demand caused by hotspot applications often triggers page swapping in the guest OS, causing performance degradation inside and outside of the virtual machine (VM). Even when the host holds sufficient memory pages, the guest OS is unable to use free pages in the host directly because of the semantic gap between the virtual machine monitor (VMM) and the guest operating system (OS). Our work aims to use the free memory scattered across multiple hosts in a virtualization environment to improve the performance of guest swapping in a transparent and implicit way. Based on an analysis of the behavioral characteristics of guest swapping, we design and implement HybridSwap, a distributed and scalable framework. It dynamically constructs virtual swap pools using various policies and builds a synthetic swapping mechanism in a peer-to-peer way, which can adaptively choose among different virtual swap pools. We implement a prototype of HybridSwap and evaluate it with several benchmarks in different scenarios. The evaluation results demonstrate that our solution improves guest swapping efficiency and doubles performance in some cases. Even in the worst case, the system overhead introduced by HybridSwap is acceptable.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 60621003 and 60633050.
Abstract: The Stream Register File (SRF) is a large on-chip memory of the stream processor, and its efficient management is essential for good performance. Current stream programming languages expose the management of the SRF to the programmer, placing a heavy burden on the programmer and making it difficult to reuse legacy code. SF95 is the language developed for FT64, the first 64-bit stream processor designed for scientific applications. SF95 conceals the SRF from the programmer and leaves its management to the compiler. In this paper, we present a compiler approach named SRF Coloring to manage the SRF automatically. The novelties of this paper are: first, it is the first use of a graph coloring-based algorithm for SRF management; second, an algorithm framework for SRF Coloring that is well suited to the FT64 architecture is proposed; this framework is based on a well-understood graph coloring algorithm for register allocation, together with some modifications to deal with the unusual aspects of the SRF problem; third, the SRF Coloring algorithm is implemented in SF95Compiler, a compiler designed for FT64 and SF95. The experimental results show that our approach represents a practical and promising solution to SRF allocation.
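The graph coloring idea borrowed from register allocation can be sketched in a few lines. This is a generic greedy coloring, not the SRF Coloring algorithm: it assumes, purely for illustration, that the SRF is divided into a fixed number of equal-size buffers and that each stream has a single inclusive live range. The names `build_interference` and `color` are hypothetical.

```python
def build_interference(live_ranges):
    """live_ranges maps stream -> (first_use, last_use), both inclusive.
    Two streams interfere when their live ranges overlap."""
    graph = {stream: set() for stream in live_ranges}
    names = list(live_ranges)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (a0, a1), (b0, b1) = live_ranges[a], live_ranges[b]
            if a0 <= b1 and b0 <= a1:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def color(graph, num_buffers):
    """Greedy coloring, highest-degree stream first.
    A value of None marks a stream that could not be given a buffer (a spill)."""
    assignment = {}
    for stream in sorted(graph, key=lambda s: len(graph[s]), reverse=True):
        taken = {assignment[n] for n in graph[stream]
                 if assignment.get(n) is not None}
        free = [c for c in range(num_buffers) if c not in taken]
        assignment[stream] = free[0] if free else None
    return assignment


if __name__ == "__main__":
    ranges = {"a": (0, 2), "b": (1, 3), "c": (3, 5), "d": (4, 6)}
    print(color(build_interference(ranges), num_buffers=2))
```

Streams whose live ranges never overlap can share a buffer, which is why four streams fit into two buffers in this example. A real SRF allocator would also have to handle streams of different sizes, which this sketch ignores.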