Source Code Classification Techniques for Mixed Input (针对混合输入的源代码分类技术)


Abstract: Source code classification is one of the key foundations for protecting code assets in source code sharing, data leakage prevention, and data security governance. Statistical analysis, machine learning, and deep learning methods are widely applied to source code classification and have improved its accuracy. However, these techniques usually require complete source code files or pure code snippets as input; when the input is mixed with non-code content such as JSON, XML, garbled characters, or Chinese and English sentences, classification accuracy drops markedly. To address this, the paper proposes a code feature extraction method based on natural language processing and extends the model output to report the proportion of each code and non-code category. With a tuned logistic regression model, the method reaches 98.8% accuracy in classifying 20 categories of programming languages and non-code languages on code, non-code, and mixed inputs, which addresses the low accuracy of code classification under mixed input.
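The abstract gives no implementation details, so the following is only a minimal sketch of the general idea, under stated assumptions: lines are turned into binary bag-of-token features, a small multinomial logistic regression is trained by gradient descent, and a mixed input is reported as per-class shares of its lines. The toy corpus `TRAIN`, the helper names (`tokenize`, `featurize`, `train`, `classify_mixed`), the three labels, and the line-level voting scheme are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus of labelled lines; the paper's real data covers
# 20 categories of programming languages and non-code languages.
TRAIN = [
    ("def add(a, b):", "python"),
    ("import os", "python"),
    ("for x in range(3):", "python"),
    ("print(x)", "python"),
    ("int main() {", "c"),
    ("#include <stdio.h>", "c"),
    ('printf("hi");', "c"),
    ("The meeting is moved to Friday.", "text"),
    ("Please send the report tomorrow.", "text"),
]

def tokenize(line):
    # Words/identifiers and single punctuation marks both count as tokens.
    return re.findall(r"[A-Za-z_]+|[^\sA-Za-z_0-9]", line)

LABELS = sorted({label for _, label in TRAIN})
VOCAB = sorted({tok for line, _ in TRAIN for tok in tokenize(line)})
INDEX = {tok: i for i, tok in enumerate(VOCAB)}

def featurize(line):
    # Binary bag-of-tokens vector with a trailing bias term.
    vec = [0.0] * (len(VOCAB) + 1)
    for tok in set(tokenize(line)):
        if tok in INDEX:
            vec[INDEX[tok]] = 1.0
    vec[-1] = 1.0
    return vec

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_probs(x, weights):
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def train(epochs=500, lr=0.1):
    # Full-batch gradient descent on multinomial logistic regression.
    dim = len(VOCAB) + 1
    weights = [[0.0] * dim for _ in LABELS]
    data = [(featurize(line), LABELS.index(label)) for line, label in TRAIN]
    for _ in range(epochs):
        grads = [[0.0] * dim for _ in LABELS]
        for x, y in data:
            probs = predict_probs(x, weights)
            for k in range(len(LABELS)):
                err = probs[k] - (1.0 if k == y else 0.0)
                for j, v in enumerate(x):
                    if v:
                        grads[k][j] += err * v
        for k in range(len(LABELS)):
            for j in range(dim):
                weights[k][j] -= lr * grads[k][j] / len(data)
    return weights

def classify_mixed(text, weights):
    # Assumption: classify each non-blank line, then report per-class line shares.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    votes = Counter()
    for ln in lines:
        probs = predict_probs(featurize(ln), weights)
        votes[LABELS[probs.index(max(probs))]] += 1
    return {label: votes[label] / len(lines) for label in LABELS}

WEIGHTS = train()
MIXED = "def add(a, b):\nprint(x)\n\n#include <stdio.h>\nPlease send the report tomorrow."
SHARES = classify_mixed(MIXED, WEIGHTS)
print(SHARES)
```

`SHARES` maps every label to the fraction of non-blank lines assigned to it, so the values sum to 1. A production system would need a far larger corpus, richer features, and probably a library implementation of logistic regression; this sketch only illustrates how a proportion-style output can sit on top of a per-unit classifier.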

Authors: LIU Yun (刘赟), ZHANG Wei (张位), ZHENG Zhourong (郑周荣), WANG Meng (王梦)

Affiliations: Cybersecurity Innovation Center of Science and Technology Industry for National Defense, Chengdu, Sichuan 610041, China; Cyberspace Security Technology Laboratory of CETC, Chengdu, Sichuan 610041, China; No.30 Institute of CETC, Chengdu, Sichuan 610041, China

Source: Communications Technology (《通信技术》), 2024, No. 7, pp. 725-730 (6 pages)

Funding: National Key Research and Development Program of China (2021YFB3302105).

Keywords: source code classification; natural language processing; feature extraction; logistic regression

Classification code: TP393.4 [Automation and Computer Technology: Computer Application Technology]

