CodeScore-R: An Automated Robustness Metric for Assessing the Functional Correctness of Code Synthesis


Abstract: Evaluation metrics are crucial in the field of code synthesis. Commonly used code evaluation metrics fall into three types: match-based, semantic-based, and execution-based. Among them, the execution-based Pass@k metric accurately judges the functional correctness of predicted code by executing test cases. However, computing this metric incurs substantial overhead, so there is a pressing need for an automated evaluation metric that can assess the functional correctness of predicted code without test cases. In addition, a good evaluation metric should be robust: it should remain accurate when the predicted code undergoes minor changes. To address these challenges, we propose CodeScore-R, an automated robust metric based on UniXcoder and contrastive learning, for evaluating the functional correctness of code synthesis. CodeScore-R employs sketch-based processing, syntactic-equivalent transformations, and mutation testing to mitigate the interference of identifiers, syntactic structures, and operators on evaluation results. Experimental results show that, on code generation and code migration tasks in Java and Python, CodeScore-R outperforms other evaluation metrics that do not require test cases, aligns more closely with the Pass@k metric, and exhibits stronger robustness.
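For context on the overhead mentioned in the abstract: Pass@k is commonly estimated by sampling n candidate programs per problem, running the test suite on each, counting the c candidates that pass every test, and computing 1 - C(n-c, k) / C(n, k). The abstract itself does not state a formula, so the sketch below assumes this widely used unbiased estimator; the function name `pass_at_k` and its parameter names are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled programs, c of which pass all tests.

    Assumption: this is the commonly cited unbiased estimator,
    1 - C(n - c, k) / C(n, k); the abstract does not give the formula.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the probability that a random size-k subset
    # of the n samples contains no passing program. math.comb returns 0
    # when n - c < k, in which case the estimate is exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 10 samples for one problem, 3 of them pass.
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.4f}")
```

Obtaining c requires executing every sampled program against its test cases, which is the cost that a test-free metric such as CodeScore-R aims to avoid.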

Authors: Yang Guang (杨光); Zhou Yu (周宇); Chen Xiang (陈翔); Zhang Xiangyu (张翔宇)

Affiliations: College of Computer Science and Technology/College of Artificial Intelligence/College of Software, Nanjing University of Aeronautics and Astronautics, Nanjing 211106; School of Information Science and Technology, Nantong University, Nantong 226019

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2024, No. 2, pp. 291-306 (16 pages)

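Funding: National Natural Science Foundation of China (61972197, 62372232); Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX23_0396); Fundamental Research Funds for the Central Universities (NG2023005).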

Keywords: code synthesis evaluation metric; functional correctness; robustness; code synthesis; neural network

CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]

