Abstract
Source code classification is one of the key foundations for protecting code assets in source code sharing, data leakage prevention, and data security governance. Statistical analysis, machine learning, and deep learning methods are widely applied to source code classification and have improved its accuracy. However, these techniques usually require complete source code files or pure code snippets as input; when the input is mixed with non-code content such as JSON, XML, garbled characters, or Chinese and English sentences, classification accuracy drops markedly. To address this, the paper proposes a code feature extraction method based on natural language processing and extends the model output to report the proportion of each class of code and of non-code content. With a tuned logistic regression model, the method reaches 98.8% accuracy in classifying 20 classes of programming and non-code languages when the input is code, non-code, or a mixture of both, which addresses the low accuracy of code classification under mixed input.
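The abstract names three ingredients: token-level feature extraction, a logistic regression classifier, and an output that reports the share of each code class and of non-code content. The sketch below illustrates that general idea only and is not the authors' implementation. The tokenizer, the line-by-line voting scheme, the three toy classes, the training sentences, and all function names (`tokenize`, `train_softmax`, `classify_mixed`) are assumptions made for illustration; the paper's actual features, tuning, and 20-class data set are not given in this record.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: identifiers/words, digit runs, or any other single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

# Toy training data (hypothetical), far smaller than any real corpus.
TRAIN = [
    ("def add(a, b):", "python"),
    ("return a + b", "python"),
    ("import os", "python"),
    ("SELECT name FROM users WHERE id = 1;", "sql"),
    ("DELETE FROM users WHERE id = 2;", "sql"),
    ("UPDATE users SET age = 30;", "sql"),
    ("Please review the attached report.", "non-code"),
    ("The meeting starts at noon.", "non-code"),
]


def tokenize(line):
    """Split one line into a list of tokens."""
    return TOKEN_RE.findall(line)


def softmax_scores(weights, feats):
    """Class probabilities of a multinomial logistic regression (no bias term)."""
    scores = {c: sum(w.get(t, 0.0) * n for t, n in feats.items())
              for c, w in weights.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def train_softmax(train, epochs=200, lr=0.1):
    """Fit bag-of-tokens weights with plain full-batch gradient descent."""
    classes = sorted({label for _, label in train})
    feats = [(Counter(tokenize(text)), label) for text, label in train]
    weights = {c: {} for c in classes}
    for _ in range(epochs):
        grads = {c: defaultdict(float) for c in classes}
        for x, label in feats:
            probs = softmax_scores(weights, x)
            for c in classes:
                err = probs[c] - (1.0 if c == label else 0.0)
                for tok, n in x.items():
                    grads[c][tok] += err * n
        for c in classes:
            for tok, g in grads[c].items():
                weights[c][tok] = weights[c].get(tok, 0.0) - lr * g / len(feats)
    return weights


def classify_mixed(weights, text):
    """Label each non-blank line, then report the share of lines per class."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    votes = Counter()
    for ln in lines:
        probs = softmax_scores(weights, Counter(tokenize(ln)))
        votes[max(probs, key=probs.get)] += 1
    return {label: n / len(lines) for label, n in votes.items()}


model = train_softmax(TRAIN)
print(classify_mixed(model, "import os\nSELECT name FROM users;\nThe meeting starts at noon."))
```

Voting per line is only one way to obtain proportion information. Weighting by token count or character count would be an equally plausible reading of the abstract.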
Authors
刘赟
张位
郑周荣
王梦
LIU Yun; ZHANG Wei; ZHENG Zhourong; WANG Meng (Cybersecurity Innovation Center of Science and Technology Industry for National Defense, Chengdu, Sichuan 610041, China; Cyberspace Security Technology Laboratory of CETC, Chengdu, Sichuan 610041, China; No. 30 Institute of CETC, Chengdu, Sichuan 610041, China)
Source
《通信技术》
2024, No. 7, pp. 725-730 (6 pages)
Communications Technology
Funding
National Key Research and Development Program of China (2021YFB3302105).
Keywords
source code classification
natural language processing
feature extraction
logistic regression