Abstract
Code visualization spread rapidly through Android malware research after it was first proposed. To address the limited representational power of a code image converted from a single DEX file (classes.dex), an Android malware family classification method based on code image synthesis was proposed. First, the DEX file, the XML file (AndroidManifest.xml), and the decompiled JAR file (classes.jar) in the Android application package were each converted to a gray-scale image, and the bilinear interpolation algorithm was used to rescale the gray-scale images of different sizes. The three gray-scale images were then combined into a single three-channel Red-Green-Blue (RGB) image for training and classification. For the classification model, Soft Threshold (ST) Block + ResNeSt (STResNeSt) was proposed by combining a soft-threshold denoising block with the Split-Attention based ResNeSt. The proposed model has strong noise resistance and pays more attention to the important features of the code image. To handle the long-tail distribution of the training data, Class Balance Loss (CB Loss) was introduced on top of data augmentation, which provided a feasible solution to the over-fitting caused by sample imbalance. On the Drebin dataset, the accuracy of the synthesized code image is 2.93 percentage points higher than that of the DEX gray-scale image, the accuracy of STResNeSt is 1.1 percentage points higher than that of the Residual Neural Network (ResNet), and data augmentation combined with CB Loss improves the F1 score by up to 2.4 percentage points. Experimental results show that the average classification accuracy of the proposed method reaches 98.97%, so the method can effectively classify Android malware families.
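The image synthesis step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the row width of 256 bytes, the 64×64 target size, the channel order (DEX to R, XML to G, JAR to B), and all function names are assumptions made for illustration, and the bilinear resize is written out by hand in pure Python rather than taken from an imaging library.

```python
import math


def bytes_to_gray(data, width=256):
    """Map each byte to one pixel (0-255), row by row.

    The row width is an assumed value; the last row is zero-padded.
    """
    if not data:
        raise ValueError("empty input")
    height = math.ceil(len(data) / width)
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def resize_bilinear(img, out_h, out_w):
    """Rescale a 2-D gray-scale image with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Map the output pixel centre back into input coordinates, clamped.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(round(top * (1 - dy) + bottom * dy))
        out.append(row)
    return out


def synthesize_rgb(dex, xml, jar, size=64):
    """Stack three resized gray-scale images into one RGB image.

    Channel order (DEX, XML, JAR) -> (R, G, B) is an assumption.
    """
    r, g, b = (resize_bilinear(bytes_to_gray(raw), size, size)
               for raw in (dex, xml, jar))
    return [[(r[i][j], g[i][j], b[i][j]) for j in range(size)]
            for i in range(size)]
```

In practice the three byte strings would be read from the unpacked APK (for example `open("classes.dex", "rb").read()`), and a library resize with a bilinear filter would normally replace the hand-written loop.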
Authors
LI Mo (李默)
LU Tianliang (芦天亮)
XIE Ziheng (谢子恒)
School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2022, Issue 5, pp. 1490-1499 (10 pages)
Funding
2021 Open Project of the Public Security Behavioral Science Laboratory (2020SYS06).