用OpenCL语言标准设计并实现了推荐系统领域的两种经典算法:交替最小二乘法(Alternating Least Squares,ALS)与循环坐标下降法(Cyclic Coordinate Descent,CCD)。将其应用到CPU,GPU,MIC多核与众核平台上,探索了在该平台上影响算法性能...用OpenCL语言标准设计并实现了推荐系统领域的两种经典算法:交替最小二乘法(Alternating Least Squares,ALS)与循环坐标下降法(Cyclic Coordinate Descent,CCD)。将其应用到CPU,GPU,MIC多核与众核平台上,探索了在该平台上影响算法性能的因子:潜在特征维数与线程个数。同时,将OpenCL实现的两种算法与CUDA和OpenMP的实现进行比较,得出了一系列结论。在同等条件下,与ALS算法相比,CCD算法的精度更高,收敛速度更快且更稳定,但所耗时间更长。ALS和CCD算法基于OpenCL的实现性能不亚于CUDA(CCD上加速比为1.03x,ALS上加速比为1.2x)和OpenMP的实现(CCD与ALS上加速比大约为1.6~1.7x),并且两种算法在CPU平台上的性能均比GPU与MIC好。展开更多
Cache performance is a critical design constraint for modern many-core systems.Since the cache often works in a"black-box"manner,it is difficult for the software to reason about the cache behavior to match t...Cache performance is a critical design constraint for modern many-core systems.Since the cache often works in a"black-box"manner,it is difficult for the software to reason about the cache behavior to match the running software to the underlying hardware.To better support code optimization,we need to understand and characterize the cache be-havior.While cache performance characterization is heavily studied on traditional x86 architectures,there is little work for understanding the cache implementations on emerging ARMv8-based many-cores.This paper presents a comprehensive study to evaluate the cache architecture design on three representative ARMv8 multi-cores,Phytium 2000+,ThunderX2,and Kunpeng 920(KP920).To this end,we develop wrBench,a micro-benchmark suite to measure the realized latency and bandwidth of caches at different memory hierarchies when performing core-to-core communication.Our evaluation pro-vides inter-core latency and bandwidth in different cache levels and coherency states for the three ARMv8 many-cores.The quantitative performance data is shown in tables.We mine the characteristics of caches and coherency protocols by analyzing the data for the three processors,Phytium 2000+,ThunderX2,and KP920.Our paper also provides discussions and guidelines for optimizing memory access on ARMv8 many-cores.展开更多
文摘用OpenCL语言标准设计并实现了推荐系统领域的两种经典算法:交替最小二乘法(Alternating Least Squares,ALS)与循环坐标下降法(Cyclic Coordinate Descent,CCD)。将其应用到CPU,GPU,MIC多核与众核平台上,探索了在该平台上影响算法性能的因子:潜在特征维数与线程个数。同时,将OpenCL实现的两种算法与CUDA和OpenMP的实现进行比较,得出了一系列结论。在同等条件下,与ALS算法相比,CCD算法的精度更高,收敛速度更快且更稳定,但所耗时间更长。ALS和CCD算法基于OpenCL的实现性能不亚于CUDA(CCD上加速比为1.03x,ALS上加速比为1.2x)和OpenMP的实现(CCD与ALS上加速比大约为1.6~1.7x),并且两种算法在CPU平台上的性能均比GPU与MIC好。
基金funded by the National Key Research and Development Program of China under Grant No.2018YFB0204301the National Natural Science Foundation of China under Grant Nos.61972408 and 61872294.
文摘Cache performance is a critical design constraint for modern many-core systems.Since the cache often works in a"black-box"manner,it is difficult for the software to reason about the cache behavior to match the running software to the underlying hardware.To better support code optimization,we need to understand and characterize the cache be-havior.While cache performance characterization is heavily studied on traditional x86 architectures,there is little work for understanding the cache implementations on emerging ARMv8-based many-cores.This paper presents a comprehensive study to evaluate the cache architecture design on three representative ARMv8 multi-cores,Phytium 2000+,ThunderX2,and Kunpeng 920(KP920).To this end,we develop wrBench,a micro-benchmark suite to measure the realized latency and bandwidth of caches at different memory hierarchies when performing core-to-core communication.Our evaluation pro-vides inter-core latency and bandwidth in different cache levels and coherency states for the three ARMv8 many-cores.The quantitative performance data is shown in tables.We mine the characteristics of caches and coherency protocols by analyzing the data for the three processors,Phytium 2000+,ThunderX2,and KP920.Our paper also provides discussions and guidelines for optimizing memory access on ARMv8 many-cores.