Tree search is a widely used fundamental algorithm. Modern processors provide tremendous computing power by integrating multiple cores, each with a vector processing unit. This paper reviews some studies on exploiting...Tree search is a widely used fundamental algorithm. Modern processors provide tremendous computing power by integrating multiple cores, each with a vector processing unit. This paper reviews some studies on exploiting single instruction multiple date (SIMD) capacity of processors to improve the performance of tree search, and proposes several improvement methods on reported SIMD tree search algorithms. Based on blocking tree structure, blocking for memory alignment and dynamic blocking prefetch are proposed to optimize the overhead of memory access. Furthermore, as a way of non-linear loop unrolling, the search branch unwinding shows that the number of branches can exceed the data width of SIMD instructions in the SIMD search algorithm. The experiments suggest that blocking optimized SIMD tree search algorithm can achieve 1.6 times response speed faster than the un-optimized algorithm.展开更多
j-lanes tree hashing is a tree mode that splits an input message into?j?slices, computes?j?independent digests of each slice, and outputs the hash value of their concatenation.?j-pointers tree hashing is a similar tre...j-lanes tree hashing is a tree mode that splits an input message into?j?slices, computes?j?independent digests of each slice, and outputs the hash value of their concatenation.?j-pointers tree hashing is a similar tree mode that receives, as input,?j?pointers to?j?messages (or slices of a single message), computes their digests and outputs the hash value of their concatenation. Such modes expose parallelization opportunities in a hashing process that is otherwise serial by nature. As a result, they have a performance advantage on modern processor architectures. This paper provides precise specifications for these hashing modes, proposes appropriate IVs, and demonstrates their performance on the latest processors. Our hope is that it would be useful for standardization of these modes.展开更多
j-lanes hashing is a tree mode that splits an input message to j slices, computes j independent digests of each slice, and outputs the hash value of their concatenation. We demonstrate the performance advantage of j-l...j-lanes hashing is a tree mode that splits an input message to j slices, computes j independent digests of each slice, and outputs the hash value of their concatenation. We demonstrate the performance advantage of j-lanes hashing on SIMD architectures, by coding a 4-lanes-SHA-256 implementation and measuring its performance on the latest 3rd Generation IntelR CoreTM. For messages whose lengths range from 2 KB to 132 KB, we show that the 4-lanes SHA-256 is between 1.5 to 1.97 times faster than the fastest publicly available implementation that we are aware of, and between ~2 to ~2.5 times faster than the OpenSSL 1.0.1c implementation. For long messages, there is no significant performance difference between different choices of j. We show that the 4-lanes SHA-256 is faster than the two SHA3 finalists (BLAKE and Keccak) that have a published tree mode implementation. Finally, we explain why j-lanes hashing will be faster on the coming AVX2 architecture that facilitates using 256 bits registers. These results suggest that standardizing a tree mode for hash functions (SHA-256 in particular) could be useful for performance hungry applications.展开更多
基金Project supported by the Shanghai Leading Academic Discipline Project(Grant No.J50103)the Graduate Student Innovation Foundation of Shanghai University(Grant No.SHUCX112167)
文摘Tree search is a widely used fundamental algorithm. Modern processors provide tremendous computing power by integrating multiple cores, each with a vector processing unit. This paper reviews some studies on exploiting single instruction multiple date (SIMD) capacity of processors to improve the performance of tree search, and proposes several improvement methods on reported SIMD tree search algorithms. Based on blocking tree structure, blocking for memory alignment and dynamic blocking prefetch are proposed to optimize the overhead of memory access. Furthermore, as a way of non-linear loop unrolling, the search branch unwinding shows that the number of branches can exceed the data width of SIMD instructions in the SIMD search algorithm. The experiments suggest that blocking optimized SIMD tree search algorithm can achieve 1.6 times response speed faster than the un-optimized algorithm.
文摘j-lanes tree hashing is a tree mode that splits an input message into?j?slices, computes?j?independent digests of each slice, and outputs the hash value of their concatenation.?j-pointers tree hashing is a similar tree mode that receives, as input,?j?pointers to?j?messages (or slices of a single message), computes their digests and outputs the hash value of their concatenation. Such modes expose parallelization opportunities in a hashing process that is otherwise serial by nature. As a result, they have a performance advantage on modern processor architectures. This paper provides precise specifications for these hashing modes, proposes appropriate IVs, and demonstrates their performance on the latest processors. Our hope is that it would be useful for standardization of these modes.
文摘j-lanes hashing is a tree mode that splits an input message to j slices, computes j independent digests of each slice, and outputs the hash value of their concatenation. We demonstrate the performance advantage of j-lanes hashing on SIMD architectures, by coding a 4-lanes-SHA-256 implementation and measuring its performance on the latest 3rd Generation IntelR CoreTM. For messages whose lengths range from 2 KB to 132 KB, we show that the 4-lanes SHA-256 is between 1.5 to 1.97 times faster than the fastest publicly available implementation that we are aware of, and between ~2 to ~2.5 times faster than the OpenSSL 1.0.1c implementation. For long messages, there is no significant performance difference between different choices of j. We show that the 4-lanes SHA-256 is faster than the two SHA3 finalists (BLAKE and Keccak) that have a published tree mode implementation. Finally, we explain why j-lanes hashing will be faster on the coming AVX2 architecture that facilitates using 256 bits registers. These results suggest that standardizing a tree mode for hash functions (SHA-256 in particular) could be useful for performance hungry applications.