期刊文献+

布尔矩阵乘的分布式异构并行优化 被引量:1

Distributed heterogeneous parallel Boolean matrix multiplication and its performance optimization
下载PDF
导出
摘要 布尔多项式求解是当今密码代数分析中的关键步骤,F4算法是布尔多项式求解的高效算法。分析了Lachartre为F4矩阵专门设计的高斯消去算法,针对其中布尔矩阵乘这一耗时的计算步骤,设计并实现了分布式异构(CPU+MIC)并行算法。布尔矩阵相对于普通矩阵主要体现在矩阵元素取值区间不一样上,由于布尔矩阵元素(0,1)导致矩阵乘操作的特殊性,普通矩阵乘的优化方法不能很好地满足布尔矩阵乘的需求。分别从布尔矩阵的存储、OpenMP多线程组织、访存、任务划分和调度等方面进行了性能优化,实现了布尔矩阵乘的分布式异构并行算法。通过随机生成布尔矩阵测试,优化后的分布式异构并行程序相较于分布式同构并行程序达到了2.45的加速比,体现了良好的性能提升。 The Boolean polynomial solution is a key step in the analysis of cryptographic algebra, and the F4 algorithm is an efficient algorithm for Boolean polynomial solution. We analyze the Gaussian elimination algorithm designed by Lachartre for F4 matrix, then design and implement the distributed heterogeneous (CPU + MIC) parallel algorithm for the time consumption calculation of Boolean matrix multiplication. The Boolean matrix differs from ordinary matrixes mainly in the valuetaking intervals of matrix elements. The optimization method of the general matrix multiplication cannot satisfy the Boolean matrix multiplication because the Boolean matrix element (0,1) leads to the particularity of the matrix multiplication operation. We realize the distributed heterogeneous parallel algorithms of Boolean matrix multiplication by optimizing its performance respectively on binary domain matrix storage, OpenMP multithreading organization, fetch, task partition and scheduling, etc. By randomly generating the Boolean matrix tests, the optimized distributed heterogeneous parallel program achieves an acceleration ratio of 2.45 compared with the distributed isomorphism parallel program, which shows a good performance improvement.
出处 《计算机工程与科学》 CSCD 北大核心 2017年第4期634-640,共7页 Computer Engineering & Science
基金 国家自然科学基金(61502516 61572515) 国家重点研发计划(2016YFC1401803)
关键词 F4算法 二元域 布尔矩阵乘 分布式异构并行 F4 algorithm binary domain Boolean matrix multiplication distributed heterogeneous parallel
  • 相关文献

参考文献3

二级参考文献38

  • 1李邦河.分解多项式升列为不可约升列的算法(英文)[J].应用泛函分析学报,2005,7(2):97-105. 被引量:4
  • 2Li T,Li H,Liu X,et al.GPU acceleration of interior point methods in large scale SVM training[C]∥Proc of the 12th IEEE International Conference on Trust,Security and Privacy in Computing and Communications(TrustCom),2013:863-870.
  • 3Li T,Wang D,Zhang S,et al.Parallel rank coherence in networks for inferring disease phenotype and gene set associations[C]∥Proc of ACA'14,2014:163-176.
  • 4Kagstrm B,Ling P,Van Loan C.GEMM-based level 3BLAS:High-performance model implementations and performance evaluation benchmark[J].ACM Transactions on Mathematical Software(TOMS),1998,24(3):268-302.
  • 5Anderson E,Bai Z,Bischof C,et al.LAPACK users'guide[K].Philadelphia:Siam,1999.
  • 6Blackford L S,Choi J,Cleary A,et al.ScaLAPACK users'guide[K].Philadelphia:Siam,1997.
  • 7Dongarra J J,Luszczek P,Petitet A.The LINPACK benchmark:Past,present and future[J].Concurrency and Computation:Practice and Experience,2003,15(9):803-820.
  • 8Nath R,Tomov S,Dongarra J.Accelerating GPU kernels for dense linear algebra[C]∥Proc of High Performance Computing for Computational Science-VECPAR 2010,2011:83-92.
  • 9Nakasato N.A fast GEMM implementation on the Cypress GPU[J].ACM SIGMETRICS Performance Evaluation Review,2011,38(4):50-55.
  • 10Du P,Weber R,Luszczek P,et al.From CUDA to OpenCL:Towards a performance-portable solution for multi-platform GPU programming[J].Parallel Computing,2012,38(8):391-407.

共引文献22

同被引文献7

引证文献1

二级引证文献3

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部