Abstract
To address a security risk that code obfuscation research tends to overlook, namely the stealth of obfuscation techniques, this paper proposes a model that recognizes code obfuscation techniques from opcode n-gram features. Opcodes are extracted from decompiled binaries, n-gram features are generated and filtered, and the selected features are used to train machine learning classifiers. This yields a binary classification model that identifies obfuscated programs and a multi-class model that identifies the obfuscation techniques applied. The model is validated on multi-source third-party datasets obfuscated with two state-of-the-art obfuscation tools. With 10 features, the average accuracy in recognizing programs obfuscated by the two tools is 100% and 99.6%, respectively; with 30 features, the average accuracy in recognizing combinations of more than 5 layers of obfuscation techniques is 98.8%. The results indicate that the proposed model is more accurate than other recognition models and generalizes to some extent across obfuscation tools, revealing the stealth risk of current mainstream code obfuscation techniques.
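The feature pipeline described in the abstract (opcode sequence, n-gram counts, feature selection, fixed-length vectors for a classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy opcode traces, and the selection criterion (difference in mean n-gram frequency between classes) are hypothetical stand-ins, since the abstract does not state which selection method or classifier the authors used.

```python
from collections import Counter

def opcode_ngrams(opcodes, n=2):
    """Count every run of n consecutive opcodes in one program."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

def relative_freqs(opcodes, n=2):
    """Normalise n-gram counts so programs of different length are comparable."""
    counts = opcode_ngrams(opcodes, n)
    total = sum(counts.values()) or 1
    return {gram: c / total for gram, c in counts.items()}

def select_features(samples, labels, k=10):
    """Keep the k n-grams whose mean frequency differs most between classes.

    Stand-in criterion for illustration only, not necessarily the paper's.
    """
    vocab = set().union(*samples)

    def mean(gram, label):
        rows = [s.get(gram, 0.0) for s, y in zip(samples, labels) if y == label]
        return sum(rows) / max(len(rows), 1)

    score = {g: abs(mean(g, 1) - mean(g, 0)) for g in vocab}
    # Sort by descending score; break ties by n-gram for a stable order.
    return sorted(vocab, key=lambda g: (-score[g], g))[:k]

def vectorize(sample, features):
    """Turn one program's n-gram frequencies into a fixed-length vector."""
    return [sample.get(g, 0.0) for g in features]

# Toy opcode traces (hypothetical): label 1 = obfuscated, 0 = not obfuscated.
plain = ["mov", "add", "mov", "ret"]
obfuscated = ["mov", "jmp", "xor", "jmp", "xor", "jmp", "ret"]
samples = [relative_freqs(plain), relative_freqs(obfuscated)]
labels = [0, 1]
features = select_features(samples, labels, k=3)
X = [vectorize(s, features) for s in samples]
```

The resulting rows of `X` are what a binary or multi-class classifier would be trained on; any standard classifier could be substituted at that step.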
Authors
Xiao Yuqiang (肖玉强)
Guo Yunfei (郭云飞)
Wang Yawen (王亚文)
XIAO Yuqiang; GUO Yunfei; WANG Yawen (Information Engineering University, Zhengzhou 450001, China)
Source
Journal of Information Engineering University (《信息工程大学学报》)
2023, No. 1, pp. 72-80 (9 pages in total)
Funding
National Natural Science Foundation of China (62072467, 62002383)
National Key Research and Development Program of China (2021YFB1006200, 2021YFB1006201).