Abstract: Despite the increasing investment in integrated GPUs and next-generation interconnect research, discrete GPUs connected by PCIe still hold the dominant position in the market, and the management of data communication between the CPU and GPU continues to evolve. Initially, programmers explicitly controlled data transfers between the CPU and GPU. To simplify programming and enable system-wide atomic memory operations, GPU vendors have developed a programming model that provides a single virtual address space for accessing all CPU and GPU memories in the system. The page migration engine in this model automatically migrates pages between the CPU and GPU on demand. To meet the needs of high-performance workloads, page sizes tend to grow larger. Because the interconnect has lower bandwidth and higher latency than GDDR, migrating a larger page takes longer, which may reduce the overlap of computation and transmission, waste time migrating unrequested data, block subsequent requests, and cause serious performance degradation. In this paper, we propose partial page migration, which migrates only the requested part of a page to reduce the migration unit, shorten the migration latency, and avoid the performance degradation of full page migration as pages grow larger. We show that partial page migration can largely hide the performance overheads of full page migration. Compared with programmer-controlled data transfer, when the page size is 2 MB and the PCIe bandwidth is 16 GB/s, full page migration is 72.72× slower, while our partial page migration achieves a 1.29× speedup. When the PCIe bandwidth is raised to 96 GB/s, full page migration is 18.85× slower, while our partial page migration provides a 1.37× speedup. Additionally, we examine the performance impact that PCIe bandwidth and migration unit size have on execution time, enabling designers to make informed decisions.
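To make the trade-off above concrete, the sketch below models the transfer-time arithmetic the abstract rests on: a migration costs a fixed interconnect setup latency plus size divided by bandwidth, so shrinking the migration unit from a full 2 MB page to a small requested chunk shortens the stall. The 4 KB chunk size and 5 µs setup latency are illustrative assumptions, not figures from the paper.

```c
/* Back-of-the-envelope model of full vs. partial page migration cost.
 * The setup latency and partial-migration unit are assumed values. */
#include <stdio.h>

/* Time to move `bytes` over a link with fixed setup latency (seconds)
 * and sustained bandwidth (bytes/second). */
static double transfer_time(double bytes, double latency_s, double bw_bps)
{
    return latency_s + bytes / bw_bps;
}

int main(void)
{
    const double page    = 2.0 * 1024 * 1024; /* 2 MB page             */
    const double chunk   = 4.0 * 1024;        /* assumed 4 KB unit     */
    const double latency = 5e-6;              /* assumed 5 us setup    */
    const double pcie16  = 16e9;              /* 16 GB/s PCIe          */
    const double pcie96  = 96e9;              /* 96 GB/s interconnect  */

    printf("full 2MB page @16GB/s : %.2f us\n",
           transfer_time(page,  latency, pcie16) * 1e6);
    printf("4KB partial   @16GB/s : %.2f us\n",
           transfer_time(chunk, latency, pcie16) * 1e6);
    printf("full 2MB page @96GB/s : %.2f us\n",
           transfer_time(page,  latency, pcie96) * 1e6);
    printf("4KB partial   @96GB/s : %.2f us\n",
           transfer_time(chunk, latency, pcie96) * 1e6);
    return 0;
}
```

Even at 96 GB/s, the full-page transfer term dominates the fixed latency, which matches the abstract's observation that higher bandwidth narrows but does not close the gap between full and partial migration.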
Abstract: Static cache partitioning can reduce inter-application cache interference and improve the composite performance of a cache-polluting application and a cache-sensitive application when they run on cores that share the last-level cache in the same multi-core processor. In a virtualized system, since different applications might run on different virtual machines (VMs) at different times, partitioning the cache statically in advance is inapplicable. This paper proposes a dynamic cache partitioning scheme that uses hot page detection and page migration to improve the composite performance of co-hosted virtual machines dynamically, according to prior knowledge of cache-sensitive applications. Experimental results show that the overhead of our page migration scheme is low, while in most cases the composite performance is an improvement over free composition.
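The abstract does not spell out the hot page detector, but a common realization is threshold-based sampling: count accesses per page over a window and queue pages that cross a threshold for migration into the cache region reserved for the sensitive VM. The sketch below follows that assumed design; the table size, threshold, and `queue_migration` callback are hypothetical, not the paper's actual mechanism.

```c
/* A minimal sketch of threshold-based hot page detection, one plausible
 * realization of the scheme in the abstract. Counter table, threshold,
 * and sampling window are illustrative assumptions. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define NPAGES        4096  /* pages tracked per VM (assumed)          */
#define HOT_THRESHOLD 64    /* accesses per window to count as hot     */

static uint32_t access_count[NPAGES];

/* Called on each sampled memory access (e.g., via page-fault sampling). */
static void record_access(uint64_t pfn)
{
    access_count[pfn % NPAGES]++;
}

/* At the end of a sampling window, emit hot pages as migration
 * candidates and reset the counters for the next window. */
static void end_window(void (*migrate)(uint64_t pfn))
{
    for (uint64_t pfn = 0; pfn < NPAGES; pfn++) {
        if (access_count[pfn] >= HOT_THRESHOLD)
            migrate(pfn);   /* move into the reserved cache partition */
    }
    memset(access_count, 0, sizeof(access_count));
}

/* Hypothetical migration hook: a real system would remap the page to a
 * frame whose cache colors belong to the sensitive VM's partition. */
static void queue_migration(uint64_t pfn)
{
    printf("page %llu is hot -> migrate\n", (unsigned long long)pfn);
}

int main(void)
{
    for (int i = 0; i < 100; i++) record_access(7); /* page 7: hot  */
    record_access(42);                              /* page 42: cold */
    end_window(queue_migration);
    return 0;
}
```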
Abstract: Page migration has long been adopted in hybrid memory systems comprising dynamic random access memory (DRAM) and non-volatile memories (NVMs) to improve system performance and energy efficiency. However, page migration introduces side effects, such as more translation lookaside buffer (TLB) misses, broken memory contiguity, and extra memory accesses due to page table updates. In this paper, we propose a superpage-friendly page table called SuperPT to reduce the performance overhead of serving TLB misses. By leveraging a virtual hashed page table and a hybrid DRAM allocator, SuperPT performs address translations in a flexible and efficient way while still retaining contiguity within the migrated pages.
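SuperPT's exact layout is not given in the abstract, but the hashed page table it leverages resolves a translation by hashing the virtual page number into a bucket and comparing tags, rather than walking a multi-level radix tree. The sketch below shows that general mechanism only; the bucket count, hash function, and linear-probe policy are illustrative assumptions, not SuperPT's actual design.

```c
/* A minimal sketch of a hashed page table lookup, the general mechanism
 * a design like SuperPT builds on. Layout details are assumed. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define HPT_BUCKETS 1024u          /* number of hash buckets (assumed) */

typedef struct {
    uint64_t vpn;                  /* virtual page number (tag)        */
    uint64_t pfn;                  /* physical frame number            */
    bool     valid;
} hpt_entry_t;

static hpt_entry_t hpt[HPT_BUCKETS];

static uint32_t hpt_hash(uint64_t vpn)
{
    return (uint32_t)((vpn * 0x9E3779B97F4A7C15ull) >> 32) % HPT_BUCKETS;
}

/* Insert with linear probing; returns false if the table is full. */
static bool hpt_insert(uint64_t vpn, uint64_t pfn)
{
    for (uint32_t i = 0; i < HPT_BUCKETS; i++) {
        uint32_t slot = (hpt_hash(vpn) + i) % HPT_BUCKETS;
        if (!hpt[slot].valid) {
            hpt[slot] = (hpt_entry_t){ .vpn = vpn, .pfn = pfn,
                                       .valid = true };
            return true;
        }
    }
    return false;
}

/* Translate a VPN; returns false on a miss (fall back to a slow path). */
static bool hpt_lookup(uint64_t vpn, uint64_t *pfn)
{
    for (uint32_t i = 0; i < HPT_BUCKETS; i++) {
        uint32_t slot = (hpt_hash(vpn) + i) % HPT_BUCKETS;
        if (!hpt[slot].valid)
            return false;          /* empty slot ends the probe chain  */
        if (hpt[slot].vpn == vpn) {
            *pfn = hpt[slot].pfn;
            return true;
        }
    }
    return false;
}

int main(void)
{
    uint64_t pfn;
    hpt_insert(0x1234, 0xABCD);    /* map VPN 0x1234 -> PFN 0xABCD     */
    if (hpt_lookup(0x1234, &pfn))
        printf("VPN 0x1234 -> PFN 0x%llx\n", (unsigned long long)pfn);
    return 0;
}
```

A hashed layout is attractive after migration because a single flat table can serve both base pages and superpages without re-walking a radix hierarchy, which is consistent with the contiguity-preserving goal stated in the abstract.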