摘要
量子计算机的发展对现有公钥体系的影响是实质性的.在众多后量子密码流派中,格基密码方案因其安全性、高效性等优良特点而成为了主流技术路线.密码算法在各实现平台的运行效率是后量子算法评估进程中的重要指标.FPGA(Field Programmable Gate Array)具有并行性架构,是密码体系实际部署时的重要硬件平台.近年来,后量子算法的硬件优化实现研究吸引了越来越多的关注.针对格基密码算法中计算复杂度最高、耗时最长的操作——环上多项式乘法计算,本文对主流的加速技术——数论变换技术(Number-Theoretic Transform,NTT)进行了系统研究,根据参数不同将其分为标准NTT(Standard-NTT,SNTT)、删减NTT(Truncated-NTT,TNTT)、混合NTT(Hybrid-NTT,HNTT).为了提高适用性,设计了一种统一型、可常数时间执行、支持多参数的硬件电路,支持三种NTT架构,结合Karatsuba技巧,将对应系数相乘中乘法次数减少20%.针对NTT的核心操作——蝴蝶变换,设计了一种紧凑型、低时延的电路结构,可实现正向NTT、对应系数相乘、逆向NTT功能,同时支持CooleyTukey和Gentleman-Sande两种结构;提出一种新的系数访存模式,采用“交叉存储型”结构,使用双口BRAM将计算后的系数“对角交叉”变换位置后存储在原地址,利用双Bank存储模式满足蝴蝶变换单元的数据吞吐量;对Barrett约减算法进行修改,提出一种适用于多模值的模约减硬件单元,用加法和移位代替其中的乘法操作,减少乘法器资源的消耗.本方案使用Verilog HDL语言在Xilinx公司Artix-7系列XC7A200TFBG484-2型号FPGA芯片上进行了纯硬件优化实现.与当前研究相比,本设计在面积上减少12%-62.6%,时间上提升5.5%-49.8%.此外,我们还对高吞吐量的并行设计进行了优化实现,优化后的二并行、四并行架构时间上分别加速两倍、四倍,面积消耗仅为单结构的1.45倍、2.58倍.
The development of quantum computers has a substantial impact on the existing public key systems.The security of traditional public key encryption schemes,such as RSA and ECC algorithm,is based on the most difficult mathematical problems such as large integer factoring and discrete logarithm problems.In 1994,Shor proposed that these mathematical problems can be solved in polynomial time on a quantum computer.IBM and Microsoft engineers expect largescale quantum computers to emerge in the next 15-20 years.It is urgent to research and design quantum-resistant public key cryptographic algorithms and protocols.Post-quantum cryptography(PQC)refers to cryptography that can resist quantum computer attacks.Among many postquantum cryptographic schemes,lattice-based cryptography has attracted extensive attention due to its security,efficiency and other advantages.In July 2022,the National Institute of Standards and Technology(NIST)announced four algorithms that will serve as the future post-quantum cryptography standard.Three of them are lattice-based cryptographic schemes.Among the winning algorithms in the Chinese national cryptographic algorithm competition,lattice-based algorithms account for the majority,such as Aigis,AKCN-MLWE,etc.FPGA(Field Programmable Gate Array)is an important hardware platform for practical deployment,with a parallel architecture.There has been widespread interest in researching and designing dedicated hardware unit for post-quantum cryptographic algorithms on FPGAs.The most complex and timeconsuming operation in lattice-based cryptography is polynomial multiplication over the rings.Normally,most schemes use number theoretic transform(NTT)technology to reduce computational complexity and accelerate polynomial multiplication.In this paper,we studied NTT technologies used in the mainstream lattice-based cryptographic schemes and typed them as Standard-NTT(SNTT),Truncated-NTT(TNTT)and Hybrid-NTT(HNTT)according to different parameters.In order to improve its applicability,a unified,constant-time executable,multi-parameter supported hardware architecture was designed,which supports all three NTT architectures.Combining the Karatsuba technique,the number of multiplication times during point-wise multiplication was reduced by 20%.A compact and low latency unit has been designed for the core operation of NTT,butterfly transform,which can perform forward NTT,point-wise multiplication,and reverse NTT functions.It supports both Cooley Tukey and Gentleman Sande structures.To avoid data conflicts,we designed a unique cross-storage memory access mode,using two banks to store polynomial coefficients.After each layer of NTT,polynomial coefficients are cross-swapped before being stored in BRAM.The coefficients in the NTT calculation process are carried out in the specific domain,where the intermediate results require modular reduction.There are two popular algorithms in the cryptographic engineering field,Montgomery reduction and Barrett reduction.Both of them require two multiplications which are expensive for hardware design.For that,we have designed a dedicated reduction unit that replaces multiplications with addition/shift operations and is suitable for multiple moduli.The proposed hardware design code is written in Verilog HDL,synthesized,placed,and routed using Xilinx Vivado tools on the Xilinx Artix-7 FPGA XC7A200TFBG484-2 chip.Compared to the state-of-art,this design reduces area consumption by 12%-62.6%,and speeds up by 5.5%-49.8%.We also designed a high-performance parallel framework to increase data throughput.Compared to the single structure,the optimized two-parallel and four-parallel architectures speed up twice and four times respectively,with only 1.45 and 2.58 times area consumption.
作者
赵旭阳
梁志闯
胡跃
耿合详
赵运磊
ZHAO Xu-Yang;LIANG Zhi-Chuang;HU Yue;GENG He-Xiang;ZHAO Yun-Lei(School of Computer Science,Fudan University,Shanghai 200433;State Key Laboratory of Cryptology,Beijing 100000)
出处
《计算机学报》
EI
CAS
CSCD
北大核心
2023年第12期2670-2686,共17页
Chinese Journal of Computers
基金
国家重点研发计划(2022YFB2701600,2017YFB0802000)
国家自然科学基金(61972094)资助。
关键词
后量子密码
格基密码算法
多项式乘法
数论变换
FPGA
硬件实现
post-quantum cryptography
lattice-based algorithm
polynomial multiplication
number theoretic transform
FPGA
hardware design