
Pre-training Method for Enhanced Code Representation Based on Multimodal Contrastive Learning (基于多模态对比学习的代码表征增强预训练方法)
Abstract: Code representation aims to fuse the features of source code to obtain its semantic embedding, and it plays a crucial role in deep-learning-based code intelligence. Traditional handcrafted code representation relies on annotation by domain experts, which is tedious and time-consuming, and the resulting representations cannot be flexibly reused for specific downstream tasks, which runs counter to the green and low-carbon development concept. Consequently, many large-scale self-supervised pre-trained models for programming languages (such as CodeBERT) have emerged in recent years, providing an effective way to obtain universal code representations. These models learn general code representations through pre-training and are then fine-tuned on specific tasks, achieving remarkable results. However, accurately representing the semantics of code requires fusing features at all levels of abstraction: the text level, semantic level, functional level, and structural level. Existing models treat programming languages merely as ordinary text sequences resembling natural language and overlook their functional-level and structural-level features, which leads to inferior performance. To further improve the accuracy of code representation, this study proposes REcomp (representation enhanced contrastive multimodal pre-training), a pre-training model based on multimodal contrastive learning. REcomp designs a novel semantic-level and structural-level feature fusion algorithm for serializing abstract syntax trees, and through multimodal contrastive learning it integrates this composite feature with the text-level and functional-level features of programming languages, enabling more precise semantic modeling. Finally, extensive experiments on three real-world public datasets validate the effectiveness of REcomp in improving the accuracy of code representation.
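To make the serialization step concrete: below is a minimal sketch of flattening an abstract syntax tree into a token sequence that interleaves structural tokens (node types) with semantic tokens (identifier names). It uses Python's built-in ast module; the function name serialize_ast and the traversal order are illustrative assumptions, not the paper's actual fusion algorithm.

```python
# Illustrative sketch only: flatten an AST into a token sequence that
# interleaves node types (structural level) with identifier names
# (semantic level). The paper's fusion algorithm is not reproduced here.
import ast


def serialize_ast(source: str) -> list:
    """Serialize Python source into a flat list of AST-derived tokens."""
    tokens = []
    for node in ast.walk(ast.parse(source)):   # breadth-first traversal
        tokens.append(type(node).__name__)     # structural token, e.g. 'FunctionDef'
        name = getattr(node, "name", None) or getattr(node, "id", None)
        if isinstance(name, str):
            tokens.append(name)                # semantic token, e.g. 'add'
    return tokens


print(serialize_ast("def add(a, b):\n    return a + b"))
# e.g. ['Module', 'FunctionDef', 'add', 'arguments', 'Return', ...]
```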
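The multimodal contrastive objective can likewise be sketched. Assuming paired in-batch embeddings of two views of the same function (e.g., its text-level encoding and its serialized-AST encoding), a symmetric InfoNCE-style loss pulls matching pairs together and pushes apart all other in-batch combinations. The function name, temperature value, and symmetric formulation are assumptions for illustration; the paper's exact objective may differ.

```python
# Hedged sketch of a symmetric InfoNCE contrastive loss between two code
# modalities; encoder outputs are assumed to be paired row-by-row.
import torch
import torch.nn.functional as F


def multimodal_contrastive_loss(text_emb, ast_emb, temperature=0.07):
    """Matching (text, AST) pairs are positives; other in-batch pairs are negatives."""
    text_emb = F.normalize(text_emb, dim=-1)
    ast_emb = F.normalize(ast_emb, dim=-1)
    logits = text_emb @ ast_emb.t() / temperature          # [B, B] similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast in both directions: text -> AST and AST -> text.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2


# Usage with random stand-ins for encoder outputs (batch 8, dim 256):
loss = multimodal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```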
Authors: YANG Hong-Yu; MA Jian-Hui; HOU Min; SHEN Shuang-Hong; CHEN En-Hong (Anhui Province Key Laboratory of Big Data Analysis and Application (University of Science and Technology of China), Hefei 230027, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; School of Data Science, University of Science and Technology of China, Hefei 230027, China; State Key Laboratory of Cognitive Intelligence, Hefei 230088, China)
Source: Journal of Software (《软件学报》), 2024, No. 4, pp. 1601-1617 (17 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Keywords: code representation; pre-trained model; multimodal; contrastive learning