
Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning (基于多模态对比学习的代码表征增强预训练方法)
Abstract: Code representation aims to fuse the features of source code to obtain its semantic embedding, and it plays a crucial role in deep-learning-based code intelligence. Traditional handcrafted code representation relies on annotation by domain experts, which is tedious and time-consuming, and the resulting representations cannot be flexibly reused for specific downstream tasks, which runs counter to the green and low-carbon development concept. Consequently, many large-scale self-supervised pre-trained models for programming languages (such as CodeBERT) have emerged in recent years, providing an effective way to obtain universal code representations. These models learn general code representations through pre-training and are then fine-tuned on specific tasks, achieving remarkable results. However, accurately representing the semantics of code requires fusing features at all levels of abstraction: the text level, semantic level, functional level, and structural level. Existing models treat programming languages merely as ordinary text sequences resembling natural language and overlook their functional-level and structural-level features, which leads to inferior performance. To further improve the accuracy of code representation, this study proposes REcomp (representation enhanced contrastive multimodal pre-training), a pre-training model based on multimodal contrastive learning. REcomp designs a novel semantic-level and structural-level feature fusion algorithm for serializing abstract syntax trees, and through multimodal contrastive learning it integrates this composite feature with the text-level and functional-level features of programming languages, enabling more precise semantic modeling. Finally, extensive experiments on three real-world public datasets validate the effectiveness of REcomp in improving the accuracy of code representation.
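To make the serialization step concrete: below is a minimal sketch of flattening an abstract syntax tree into a token sequence that interleaves structural tokens (node types) with semantic tokens (identifier names). It uses Python's built-in ast module; the function name serialize_ast and the traversal order are illustrative assumptions, not the paper's actual fusion algorithm.

```python
# Illustrative sketch only: flatten an AST into a token sequence that
# interleaves node types (structural level) with identifier names
# (semantic level). The paper's fusion algorithm is not reproduced here.
import ast


def serialize_ast(source: str) -> list:
    """Serialize Python source into a flat list of AST-derived tokens."""
    tokens = []
    for node in ast.walk(ast.parse(source)):   # breadth-first traversal
        tokens.append(type(node).__name__)     # structural token, e.g. 'FunctionDef'
        name = getattr(node, "name", None) or getattr(node, "id", None)
        if isinstance(name, str):
            tokens.append(name)                # semantic token, e.g. 'add'
    return tokens


print(serialize_ast("def add(a, b):\n    return a + b"))
# e.g. ['Module', 'FunctionDef', 'add', 'arguments', 'Return', ...]
```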
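The multimodal contrastive objective can likewise be sketched. Assuming paired in-batch embeddings of two views of the same function (e.g., its text-level encoding and its serialized-AST encoding), a symmetric InfoNCE-style loss pulls matching pairs together and pushes apart all other in-batch combinations. The function name, temperature value, and symmetric formulation are assumptions for illustration; the paper's exact objective may differ.

```python
# Hedged sketch of a symmetric InfoNCE contrastive loss between two code
# modalities; encoder outputs are assumed to be paired row-by-row.
import torch
import torch.nn.functional as F


def multimodal_contrastive_loss(text_emb, ast_emb, temperature=0.07):
    """Matching (text, AST) pairs are positives; other in-batch pairs are negatives."""
    text_emb = F.normalize(text_emb, dim=-1)
    ast_emb = F.normalize(ast_emb, dim=-1)
    logits = text_emb @ ast_emb.t() / temperature          # [B, B] similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast in both directions: text -> AST and AST -> text.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2


# Usage with random stand-ins for encoder outputs (batch 8, dim 256):
loss = multimodal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```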
Authors: YANG Hong-Yu; MA Jian-Hui; HOU Min; SHEN Shuang-Hong; CHEN En-Hong (Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; School of Data Science, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China)
Source: Journal of Software (《软件学报》), 2024, No. 4, pp. 1601-1617 (17 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Keywords: code representation; pre-trained model; multimodal; contrastive learning