Abstract
Binary code similarity detection has been widely applied in recent years to vulnerability search, malware detection, advanced program analysis, and other fields. Because program code resembles natural language to some degree, researchers have begun to use pre-training and other natural language processing techniques to improve detection accuracy. To address the accuracy bottleneck caused by existing methods' neglect of instruction probability features, a binary code similarity detection method based on pre-trained assembly instruction representation is proposed. A tokenization method for multi-architecture assembly instructions is designed, and pre-training tasks are built on control-flow and data-flow relations while also modeling the probability of instructions occurring in sequence and the usage frequency of individual instruction tokens, so as to obtain better vectorized representations of instructions. The downstream binary code similarity detection task is then improved by combining it with the pre-trained representation: embedding vectors replace statistical features as the representations of instructions and basic blocks, raising detection accuracy. Experimental results show that, compared with existing methods, the proposed method improves instruction representation ability by up to 23.7%, improves basic-block search accuracy by up to 33.97%, and increases the number of detections in binary code similarity detection by up to 4 times.
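The abstract's tokenization step for multi-architecture assembly can be illustrated with a minimal sketch. This is an illustrative assumption, not the paper's actual tokenizer: it splits an instruction into an opcode token and operand tokens, and normalizes immediates and memory operands into placeholder tokens so that rare literal values do not inflate the vocabulary.

```python
# Hypothetical sketch of assembly-instruction tokenization (names and
# normalization rules are illustrative assumptions, not the paper's method).
import re

def tokenize(instr: str) -> list[str]:
    """Split one assembly instruction into normalized tokens."""
    opcode, _, operands = instr.strip().partition(" ")
    tokens = [opcode]
    for op in filter(None, (o.strip() for o in operands.split(","))):
        if re.fullmatch(r"#?-?(0x[0-9a-fA-F]+|\d+)", op):
            tokens.append("<imm>")   # immediate value placeholder
        elif re.fullmatch(r"\[.*\]", op):
            tokens.append("<mem>")   # memory operand placeholder
        else:
            tokens.append(op)        # registers / symbols kept as-is
    return tokens

print(tokenize("mov eax, 0x10"))   # ['mov', 'eax', '<imm>']
print(tokenize("ldr r0, [r1]"))    # ['ldr', 'r0', '<mem>']
```

Normalizing operands this way is a common trick in instruction-embedding work: it keeps the token vocabulary small and architecture patterns comparable across x86 and ARM syntax.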
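The downstream change described above, replacing statistical features with embedding vectors as basic-block representations, can be sketched as follows. All names here are assumptions for illustration: instruction embeddings are mean-pooled into a block vector, and two blocks are compared by cosine similarity.

```python
# Hypothetical sketch: pool pre-trained instruction embeddings into a
# basic-block vector and compare blocks by cosine similarity.
# (Pooling choice and function names are illustrative assumptions.)
import math

def block_vector(instr_vecs: list[list[float]]) -> list[float]:
    """Mean-pool instruction embeddings into one basic-block vector."""
    n = len(instr_vecs)
    return [sum(col) / n for col in zip(*instr_vecs)]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two block vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

b1 = block_vector([[1.0, 0.0], [0.0, 1.0]])   # -> [0.5, 0.5]
b2 = block_vector([[2.0, 0.0], [0.0, 2.0]])   # -> [1.0, 1.0]
print(round(cosine(b1, b2), 3))               # 1.0 (same direction)
```

In practice the paper's similarity pipeline would operate on learned embeddings rather than these toy vectors, but the block-level comparison step reduces to vector pooling plus a similarity metric of this kind.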
Authors
WANG Taiyan; PAN Zulie; YU Lu; SONG Jingbin (College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China; Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China; PLA 31401, Changchun 130022, China)
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2023, No. 4, pp. 288-297 (10 pages)
Funding
National Key R&D Program of China (2021YFB3100500).
Keywords
Binary code
Similarity detection
Instruction representation
Tokenization
Pre-training task