摘要
简洁非交互式零知识证明能够生成长度固定的证明并快速进行验证,极大地推动了零知识证明在数字签名、区块链及分布式存储等领域的应用。但其证明的生成过程极其耗时且需要被频繁调用,其中数论变换是证明生成过程的主要运算之一。然而现有的通用数论变换硬件加速方法难以满足其在简洁非交互式零知识证明中大规模、高位宽的要求。针对该问题,提出一种数论变换多级流水硬件计算架构。针对高位宽计算需求对高位模运算进行优化,设计了低时延蒙哥马利模乘单元;为了加速大规模计算,通过二维子任务划分将大规模数论变换任务划分为小规模独立子任务,并通过消除数据依赖实现了子任务间计算流水;在子任务多轮蝶形运算之间采用数据重排机制,有效缓解了访存需求并实现了不同步长蝶形运算间的计算流水。所提出的数论变换计算架构可以根据现场可编程门阵列(FPGA)片上资源灵活扩展,方便部署在不同规模的FPGA上以获得最大加速效果。所提出的硬件架构使用高层次综合(HLS)开发并基于OpenCL框架在AMD Xilinx Alveo U50实现了整套异构加速系统。实验结果表明,相比于PipeZK中的数论变换加速模块,该方法获得了1.95倍的加速比;在运行当前主流的简洁非交互式零知识证明开源项目bellman时,相比于AMD Ryzen 95900X单核及12核分别获得了27.98倍和1.74倍的加速比,并分别获得了6.9倍、6倍的能效提升。
The proof in zk-SNARK has a fixed length and can be verified quickly,promoting the application of zero-knowledge proof in areas such as digital signature,blockchain,distributed storage,and outsourced computing.How-ever,the generation of proofs is time-consuming and frequently used.As a result,NTT(number theoretic trans-form),one of the most time-consuming parts in proof-generation,needs to be accelerated significantly.However,the existing general NTT hardware acceleration methods cannot meet the requirements of large-bitwidth and large-scale in zk-SNARK.To address this issue,this paper proposes a highly pipelined architecture for NTT.First of all,large-bitwidth modular arithmetic is optimized and low-latency Montgomery modular multiplication hardware unit is de-signed.And then,the large-scale NTT tasks are divided into smaller sub-tasks through two-dimensional partitioning,which improves the parallelism of NTT computation and eliminates the data dependence among sub-tasks,thus reali-zing the pipeline among sub-tasks.Finally,the“data reordering”technique is introduced among multiple rounds of butterfly operations in a sub-task,which effectively alleviates the memory access requirements,thus realizing the bottom-level pipeline in each sub-task,among butterfly operations with different step sizes.This architecture can be flexibly scaled to different scales of FPGAs.The accelerator is prototyped on the AMD-Xilinx Alveo U50 card(UltraScale+XCU50 FPGA).To balance computing efficiency and flexibility,the OpenCL equipped with high-level synthesis(HLS)is used to implement the system.The evaluation results show that the NTT module performs 1.95 times faster than the one in PipeZK and the accelerator achieves 27.98 and 1.74 times speedup,6.9 and 6 times ener-gy efficiency improvement than AMD Ryzen 95900X respectively,when it is integrated into the well-known ZKP open-source project,bellman.
作者
赵海旭
柴志雷
花鹏程
王锋
丁冬
ZHAO Haixu;CHAI Zhilei;HUA Pengcheng;WANG Feng;DING Dong(School of Artificial Intelligence and Computer Science,Jiangnan University,Wuxi,Jiangsu 214122,China;School of Internet of Things Engineering,Jiangnan University,Wuxi,Jiangsu 214122,China;Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence,Wuxi,Jiangsu 214122,China)
出处
《计算机科学与探索》
CSCD
北大核心
2024年第2期538-552,共15页
Journal of Frontiers of Computer Science and Technology
基金
国家自然科学基金(61972180)
江苏省模式识别与计算智能工程实验室项目。
关键词
现场可编程门阵列(FPGA)
简洁非交互式零知识证明(zk-SNARK)
模乘
数论变换
硬件加速
field programmable gate array(FPGA)
zero-knowledge succinct non-interactive arguments of knowl-edge(zk-SNARK)
modular multiplication
number theoretic transform
hardware acceleration