A Unified Co-Processor Architecture for Matrix Decomposition (cited by: 1)


Abstract: QR and LU decompositions are among the most important matrix decomposition algorithms. Many studies accelerate these algorithms on FPGAs or ASICs on a case-by-case basis. In this paper, we propose a unified framework for matrix decomposition that combines three QR decomposition algorithms and the LU algorithm with pivoting into a single linear array structure. The QR and LU decomposition algorithms exhibit the same two-level loop structure and the same data dependency. Exploiting these similarities, we derive a unified fine-grained algorithm for all four matrix decomposition algorithms. Furthermore, we present a unified co-processor structure built on a scalable linear array of processing elements (PEs). The four types of PEs share the same memory channel structure and PE connections and differ only in the internal structure of the data path. Our unified co-processor, which uses IEEE 32-bit floating-point arithmetic, is implemented and mapped onto a Xilinx Virtex-5 FPGA chip. Experimental results show that our co-processors achieve speedups of 2.3x to 14.9x over a Pentium Dual CPU running two SSE threads.
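The shared two-level loop structure mentioned in the abstract can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the paper's fine-grained algorithm or PE design: the abstract does not name the three QR variants, so modified Gram-Schmidt is used here as one representative, and all function names (`two_level_decompose`, `lu_column_step`, `lu_update_step`, `make_mgs_steps`) are hypothetical. The matrix is assumed to be square and nonsingular.

```python
import math


def two_level_decompose(a, column_step, update_step):
    """Shared skeleton: outer loop over columns k, inner loop over trailing columns j.

    Column j at step k depends only on the result of column_step for column k,
    which is the data dependency common to both decompositions below.
    """
    n = len(a[0])
    contexts = []
    for k in range(n):                  # level 1: process column k
        ctx = column_step(a, k)
        contexts.append(ctx)
        for j in range(k + 1, n):       # level 2: update each trailing column
            update_step(a, k, j, ctx)
    return contexts


# LU with partial pivoting (in place: L below the diagonal, U on and above it)
def lu_column_step(a, k):
    m = len(a)
    p = max(range(k, m), key=lambda i: abs(a[i][k]))   # pivot row
    for c in range(k + 1):              # swap the finished columns and column k
        a[k][c], a[p][c] = a[p][c], a[k][c]
    for i in range(k + 1, m):           # multipliers (assumes a nonzero pivot)
        a[i][k] /= a[k][k]
    return p


def lu_update_step(a, k, j, p):
    a[k][j], a[p][j] = a[p][j], a[k][j]  # apply the row swap to column j
    for i in range(k + 1, len(a)):
        a[i][j] -= a[i][k] * a[k][j]


# QR by modified Gram-Schmidt (assumed representative QR variant)
def make_mgs_steps(n):
    r = [[0.0] * n for _ in range(n)]    # R is stored separately; a becomes Q

    def column_step(a, k):
        norm = math.sqrt(sum(row[k] ** 2 for row in a))
        r[k][k] = norm
        for row in a:                    # normalize column k into q_k
            row[k] /= norm
        return None

    def update_step(a, k, j, _ctx):
        dot = sum(row[k] * row[j] for row in a)
        r[k][j] = dot
        for row in a:                    # remove the q_k component from column j
            row[j] -= dot * row[k]

    return r, column_step, update_step
```

Calling `two_level_decompose(a, lu_column_step, lu_update_step)` overwrites `a` with the packed L and U factors and returns the pivot row chosen at each step. Calling it with the two functions returned by `make_mgs_steps(n)` overwrites `a` with Q and fills the returned `r` with R. Only the two step functions differ between the decompositions, which loosely mirrors the abstract's point that the PEs differ only in their internal data path.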

Authors: 窦勇 (Dou Yong), 周杰 (Zhou Jie), 邬贵明 (Wu Guiming), 姜晶菲 (Jiang Jingfei), 雷元武 (Lei Yuanwu), 倪时策 (Ni Shice)

Affiliation: National Laboratory for Parallel & Distributed Processing, National University of Defense Technology

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, and CSCD; 2010, No. 4, pp. 874-885 (12 pages)

Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 60633050, 60833004, and 60903057, and the National High-Technology Research and Development 863 Program of China under Grant No. 2009AA01Z101

Keywords: co-processor, matrix decomposition, fine-grained parallel, FPGA

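Classification: TP332 [Automation and Computer Technology: Computer System Architecture]; TN911.7 [Electronics and Telecommunications: Communication and Information Systems]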



