Funding: Supported by the National High Technology Research and Development 863 Program of China under Grant No. 2009ZX01029-001-002 and the Postdoctoral Science Foundation of China under Grant No. 20110490208.
Abstract: Conventional dynamically scheduled processors often use a fully associative structure, the load/store queue (LSQ), to communicate values between loads and older in-flight stores and to detect store-load order violations. However, this in-flight forwarding accounts for only about 15% of all store-load communications, which makes the CAM-based micro-architecture the major bottleneck to scaling store-load communication further. This paper presents a new micro-architecture named ASW (active store window). It provides a new structure, the speculative active store window, that implements more aggressively speculative store-load forwarding than a conventional LSQ. This structure can forward the data of committed stores to executing loads without accessing the L1 data cache, which is referred to as far forwarding in this paper. At the back end of the pipeline, it uses in-order load re-execution filtered by the tagged SSBF (store sequence Bloom filter) to verify the correctness of the store-load forwarding. The speculative active store window and the tagged store sequence Bloom filter are both set-associative structures, which are more efficient and scalable than fully associative structures. Experiments show that this simpler and faster design outperforms a conventional load/store-queue-based design and the NoSQ design on most benchmarks by 10.22% and 8.71%, respectively.
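The filtering step can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes that every committed store carries an increasing store sequence number (SSN, starting at 1), that each load remembers the SSN of the store whose value it observed, and that a load must re-execute only if a younger store to the same address has committed since. The class name `TaggedSSBF`, its methods, the FIFO replacement and the per-set `evicted_ssn` fallback are all hypothetical choices made for this example.

```python
class TaggedSSBF:
    """Hypothetical set-associative filter mapping addresses to the SSN of their last committed store."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # entries are [tag, ssn], oldest insertion first
        self.evicted_ssn = [0] * num_sets          # youngest SSN ever evicted from each set

    def _locate(self, addr):
        # Low bits select the set, the remaining bits form the tag.
        return addr % self.num_sets, addr // self.num_sets

    def commit_store(self, addr, ssn):
        """Record that the store with sequence number `ssn` wrote `addr`."""
        idx, tag = self._locate(addr)
        entries = self.sets[idx]
        for entry in entries:
            if entry[0] == tag:
                entry[1] = ssn
                return
        if len(entries) == self.ways:
            # FIFO eviction; remember the evicted SSN so later checks stay conservative.
            _, old_ssn = entries.pop(0)
            self.evicted_ssn[idx] = max(self.evicted_ssn[idx], old_ssn)
        entries.append([tag, ssn])

    def needs_reexecution(self, addr, observed_ssn):
        """True if a store younger than `observed_ssn` may have written `addr`."""
        idx, tag = self._locate(addr)
        for entry_tag, ssn in self.sets[idx]:
            if entry_tag == tag:
                return ssn > observed_ssn
        # No matching tag: fall back to the youngest evicted SSN (may be a false positive).
        return self.evicted_ssn[idx] > observed_ssn
```

In this sketch a tag match gives an exact answer, while a tag miss can only produce an unnecessary re-execution and never skips a required one, which is the property a verification filter needs.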
Abstract: Energy efficiency has become the primary design metric in chip development. To pursue higher energy efficiency, processor architects should reduce or eliminate unnecessary energy dissipation. Indirect-branch prediction has become a performance bottleneck, especially for applications written in object-oriented languages. Previous hardware-based indirect-branch predictors are generally inefficient, because they either require significant hardware storage or predict indirect-branch targets slowly. In this paper, we propose an energy-efficient indirect-branch prediction technique called TAP (target address pointer) prediction. Its key idea has two parts: using specific hardware pointers to accelerate the indirect-branch prediction flow, and reusing existing processor components to reduce additional hardware cost and power consumption. When fetching an indirect branch, TAP prediction first obtains specific pointers, called target address pointers, from the conditional branch predictor, and then uses these pointers to generate virtual addresses that index the indirect-branch targets. This technique takes a similar amount of time to dedicated-storage techniques without requiring large amounts of additional storage. Our evaluation shows that TAP prediction combined with representative state-of-the-art branch predictors can improve performance significantly over the baseline processor. Compared with hardware-based indirect-branch predictors, the TAP-Perceptron scheme achieves a performance improvement equivalent to that provided by an 8K-entry TTC predictor, and also outperforms the VPC predictor.
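The two-step lookup described above (fetch a pointer, then form a virtual address that indexes the target) might be sketched as follows. This is an illustration under assumptions rather than the authors' design: the abstract does not give the pointer width, the address-generation function or the update policy, so `pointer_bits`, the XOR-based `_virtual_address` and the allocation loop in `update` are hypothetical, and a plain dictionary stands in for the conditional branch predictor's storage.

```python
class TAPPredictor:
    """Hypothetical model: a small pointer plus the branch PC selects a target slot."""

    def __init__(self, pointer_bits=2, btb_size=64):
        self.pointer_bits = pointer_bits
        self.btb_size = btb_size
        self.pointer_table = {}  # (pc, history) -> pointer; stands in for the conditional predictor
        self.btb = {}            # virtual-address index -> predicted target

    def _virtual_address(self, pc, pointer):
        # Assumed hash: fold the pointer into PC bits 4 and up, then index the BTB.
        return (pc ^ (pointer << 4)) % self.btb_size

    def predict(self, pc, history):
        pointer = self.pointer_table.get((pc, history))
        if pointer is None:
            return None
        return self.btb.get(self._virtual_address(pc, pointer))

    def update(self, pc, history, target):
        """Train on the resolved target: reuse a slot holding it, else take a free slot."""
        chosen = None
        for pointer in range(1 << self.pointer_bits):
            slot = self.btb.get(self._virtual_address(pc, pointer))
            if slot == target:
                chosen = pointer
                break
            if slot is None and chosen is None:
                chosen = pointer
        if chosen is None:
            chosen = 0  # every slot is occupied: overwrite the first one
        self.btb[self._virtual_address(pc, chosen)] = target
        self.pointer_table[(pc, history)] = chosen
```

Because only a few pointer bits are stored per predictor entry, the targets themselves live in a structure the processor already has, which is the storage saving the abstract refers to.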
Funding: Supported by the Key-Area Research and Development Program of Guangdong Province of China under Grant No. 2018B010115002.
Abstract: Speculative execution attacks can leak arbitrary program data under malicious speculation, presenting a severe security threat. Based on two key observations, this paper presents a software-transparent defense mechanism called speculative secret flow tracking (SSFT), which is capable of defending against all cache-based speculative execution attacks with a low performance overhead. First, we observe that the attacker must use array or pointer variables in the victim code to access arbitrary memory data. Therefore, we propose a strict definition of secret data to reduce the amount of data to be protected. Second, if a load is neither data-dependent nor control-dependent on secrets, its speculative execution will not leak any secrets. Thus, this paper introduces the concept of speculative secret flow to analyze how secret data are obtained and propagated during speculative execution. By tracking speculative secret flow in hardware, SSFT can identify all unsafe speculative loads (USLs) that are dependent on secrets. Moreover, SSFT exploits three different methods to constrain USLs' speculative execution and prevent them from leaking secrets into the cache and translation lookaside buffer (TLB) states. This paper evaluates the performance of SSFT on the SPEC CPU 2006 workloads, and the results show that SSFT is effective and its performance overhead is very low. To defend against all speculative execution attack variants, SSFT incurs an average slowdown of only 4.5% (Delay USL-L1Miss) or 3.8% (Invisible USLs) compared to a non-secure processor. Our analysis also shows that SSFT maintains a low hardware overhead.
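The idea of following secret data to the loads that depend on it can be sketched as a simple taint propagation over a speculative instruction sequence. This is a simplified illustration, not SSFT itself: it tracks data dependences only and omits control dependences, and because the abstract does not spell out the paper's definition of secret data, the hypothetical `load_secret` opcode simply marks a load whose result is assumed to be secret. The instruction tuples and the function name are likewise invented for the example.

```python
def find_unsafe_loads(instructions):
    """Return the indices of loads whose address depends on secret data.

    Each instruction is (op, dest, sources). For "load" and "load_secret",
    sources[0] is the address register. "load_secret" is a hypothetical marker
    for a load whose result the secret definition classifies as secret.
    """
    tainted = set()  # registers currently holding secret-dependent values
    unsafe = []
    for index, (op, dest, sources) in enumerate(instructions):
        source_tainted = any(reg in tainted for reg in sources)
        if op in ("load", "load_secret") and source_tainted:
            unsafe.append(index)  # the address would reveal a secret through the cache or TLB
        if source_tainted or op == "load_secret":
            tainted.add(dest)      # the result carries secret data forward
        else:
            tainted.discard(dest)  # overwritten with a non-secret value
    return unsafe


# A Spectre-v1-like sequence, assumed to run under a mispredicted bounds check.
program = [
    ("load_secret", "r1", ["r0"]),  # r1 = array1[x], out-of-bounds read of a secret
    ("alu", "r2", ["r1"]),          # r2 = r1 * 4096
    ("load", "r3", ["r2"]),         # r3 = array2[r2], address depends on the secret
    ("alu", "r4", ["r0"]),          # r4 = x + 8
    ("load", "r5", ["r4"]),         # address does not depend on any secret
]
unsafe_loads = find_unsafe_loads(program)
```

Only the load at index 2 is flagged here, so a defense built on this classification would constrain that load and let the final load execute speculatively as usual.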
Funding: Supported by the "HGJ" National Science and Technology Major Project of China under Grant No. 2009ZX01029-001-002.
Abstract: Predicting indirect-branch targets has become a performance bottleneck for many applications. Previous high-performance indirect-branch predictors usually require significant hardware storage or additional compiler support, which increases the complexity of the processor front end or the compiler. This paper proposes a complexity-effective indirect-branch prediction mechanism called Set-Way Index Pointing (SWIP) prediction. It stores multiple indirect-branch targets in different branch target buffer (BTB) entries, whose set indices and way locations are treated as set-way index pointers. These pointers are stored in the existing branch-direction predictor. SWIP prediction reuses the branch-direction predictor to provide these pointers, and then accesses the BTB entries they point to for the predicted indirect-branch target. Our evaluation shows that SWIP prediction achieves an attractive performance improvement without requiring large dedicated storage or additional compiler support. It improves indirect-branch prediction accuracy by 36.5% compared to that of a commonly used BTB, resulting in an average performance improvement of 18.56%. Its energy consumption is also reduced by 14.34% relative to that of the baseline.
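A minimal sketch of the pointer indirection follows, under assumptions the abstract does not settle: how the predictor is indexed, how a BTB set is chosen for a target, and how ways are replaced are all hypothetical here (an XOR of PC and history, the low bits of the target, and round-robin allocation respectively), and BTB entries are reduced to bare target addresses.

```python
class SWIPPredictor:
    """Hypothetical model: predictor entries hold (set, way) pointers into the BTB."""

    def __init__(self, btb_sets=4, btb_ways=2, table_size=16):
        self.btb = [[None] * btb_ways for _ in range(btb_sets)]
        self.pointers = [None] * table_size  # stands in for the direction predictor's storage
        self.next_way = [0] * btb_sets       # round-robin replacement cursor per set

    def _index(self, pc, history):
        return (pc ^ history) % len(self.pointers)

    def predict(self, pc, history):
        pointer = self.pointers[self._index(pc, history)]
        if pointer is None:
            return None
        set_index, way = pointer
        # A pointer can be stale if its BTB entry was replaced; that is a misprediction.
        return self.btb[set_index][way]

    def update(self, pc, history, target):
        """Store the resolved target in the BTB and record its set-way pointer."""
        set_index = target % len(self.btb)   # assumed placement: low target bits pick the set
        ways = self.btb[set_index]
        if target in ways:
            way = ways.index(target)         # reuse the entry that already holds this target
        else:
            way = self.next_way[set_index]
            ways[way] = target
            self.next_way[set_index] = (way + 1) % len(ways)
        self.pointers[self._index(pc, history)] = (set_index, way)
```

For example, training one branch at PC `0x40` with targets `0x1000` and `0x2004` under two different histories places the targets in two ways of the same BTB set, and each history then recovers its own target through its pointer.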
Funding: Supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant No. 2009ZX01029-001-002-2.
Abstract: The performance loss caused by a cache miss varies from miss to miss in modern systems for two reasons: 1) memory access latency is not uniform, and 2) the ability of processor cores to tolerate latency varies across different misses. Compared with parallel misses and store misses, isolated fetch and load misses are more costly. This variation in cache miss penalty suggests that the cache replacement policy should take it into account. To that end, we first propose the notion of retention benefit. Retention benefits capture not only the increase in processor stall cycles on cache misses, but also the reduction in processor stall cycles due to cache hits. We then propose Retention Benefit Based Replacement (RBR), which aims to maximize the aggregate retention benefit of the blocks kept in the cache. RBR keeps track of the total retention benefit of each block in the cache, and on replacement it preferentially evicts the block with the minimum total retention benefit. The evaluation shows that RBR can improve cache performance significantly in both single-core and multi-core environments while requiring a low storage overhead. It also outperforms other state-of-the-art techniques.
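The replacement rule itself is simple enough to sketch for a single cache set. How the retention benefit of an access is measured is not given in the abstract, so in this hypothetical example the caller supplies it as a number (for instance, an estimate of stall cycles avoided); the class name `RBRSet` and its interface are also assumptions.

```python
class RBRSet:
    """One cache set that evicts the block with the smallest total retention benefit."""

    def __init__(self, ways):
        self.ways = ways
        self.benefit = {}  # block tag -> accumulated retention benefit

    def access(self, tag, benefit):
        """Access block `tag`, crediting it with the caller-supplied `benefit`.

        Returns the tag of the evicted block, or None if nothing was evicted.
        """
        if tag in self.benefit:
            self.benefit[tag] += benefit  # hit: the block earns more benefit
            return None
        victim = None
        if len(self.benefit) >= self.ways:
            # Miss in a full set: evict the block with the minimum total benefit.
            victim = min(self.benefit, key=self.benefit.get)
            del self.benefit[victim]
        self.benefit[tag] = benefit
        return victim
```

With two ways, accessing A (benefit 10), B (1) and A again (5) leaves A at 15 and B at 1, so a subsequent miss on C evicts B even though B was inserted more recently than A, which is where this policy departs from recency-based replacement.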