摘要
正则表达式匹配技术在数据清洗、解析提取等数据处理任务方面发挥重大作用。然而,由于匹配过程中存在数据强依赖关系和内存访问不可预测等问题,造成匹配性能较低。针对此问题,提出一种基于向量指令集的高性能正则表达式数据处理算法,称之为ßFA:通过向量指令一次性从内存读出若干连续字符,并与最常被访问状态对应的非信任字符集进行向量匹配,利用内置函数定位首个非信任字符的位置,获得可直接跳过的字符数,从而实现匹配性能的加速。实验结果表明,ßFA算法的吞吐率优于原始DFA算法和αFA算法,是原始DFA算法的4.67~60倍以及ɑFA算法的4.37~7.82倍。
Regular expression matching technology plays a significant role in data processing tasks such as data cleaning,pars‐ing,and extraction.However,due to issues such as strong data dependency and unpredictable memory access in the matching pro‐cess,the matching performance is relatively low.In response to this problem,this paper proposes a high-performance regular ex‐pression data processing algorithm based on vector instruction set,which is calledßFA.By using vector instructions to read a se‐quence of consecutive characters at once,and performing vector matching with the non-trusted character set corresponding to the most frequently accessed state,built-in functions can be utilized to find the position of the first non-trusted character,thus obtain‐ing the number of characters that can be skipped directly,thereby accelerating the matching performance.Experimental results show that the throughput of theßFA algorithm is superior to the original DFA algorithm and theαFA algorithm,being 4.67~60 times faster than the original DFA algorithm and 4.37~7.82 times faster than theαFA algorithm.
作者
杨嘉佳
关健
李正
于增明
姚旺君
Yang Jiajia;Guan Jian;Li Zheng;Yu Zengming;Yao Wangjun(The Sixth Research Institute of China Electronics Corporation,Beijing 100083,China)
出处
《电子技术应用》
2024年第11期85-88,共4页
Application of Electronic Technique
关键词
正则表达式匹配
向量指令集
高性能数据处理
regular expression matching
vector instruction set
high-performance data processing