Abstract
Medical machine translation is of great value for applications such as cross-border healthcare and the translation of medical literature. Chinese-to-English neural machine translation has made great progress thanks to the powerful modeling capacity of deep learning and large-scale bilingual parallel data. Neural machine translation typically relies on large numbers of parallel sentence pairs to train the translation model. At present, Chinese-English translation data come mainly from domains such as news and policy; the lack of parallel data in the medical domain leads to poor Chinese-to-English machine translation quality in that domain. To address this shortage of training data for medical machine translation, this paper proposes augmenting Chinese-English medical translation data with paraphrase generation, enlarging the parallel corpus available for training. Experimental results on several mainstream neural machine translation models show that augmenting the data through paraphrase generation effectively improves translation performance, yielding gains of more than 6 BLEU points on models such as RNNSearch and Transformer, which verifies the effectiveness of paraphrase-based augmentation for domain machine translation. Moreover, translation performance can be further improved by building on large-scale pre-trained language models such as MT5.
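The augmentation scheme the abstract describes can be sketched in a few lines: paraphrase the source side of each parallel pair and pair every paraphrase with the original target sentence. The paper uses a trained paraphrase generation model; in this illustrative sketch a toy synonym-substitution table stands in for it, and the synonym entries and example sentence are assumptions, not data from the paper.

```python
# Toy stand-in for a learned paraphrase generator: swap a few
# domain synonyms (illustrative entries only, not from the paper).
SYNONYMS = {"病人": "患者", "医生": "大夫"}

def toy_paraphrase(sentence):
    """Return paraphrase variants produced by synonym substitution."""
    variants = []
    for word, synonym in SYNONYMS.items():
        if word in sentence:
            variants.append(sentence.replace(word, synonym))
    return variants

def augment(parallel_pairs):
    """Pair each source-side paraphrase with the original target sentence."""
    augmented = list(parallel_pairs)
    for zh, en in parallel_pairs:
        for paraphrase in toy_paraphrase(zh):
            augmented.append((paraphrase, en))
    return augmented

pairs = [("病人需要立即就医", "The patient needs immediate medical attention.")]
print(augment(pairs))  # original pair plus one paraphrased pair
```

Because only the source side is rewritten, each synthetic pair reuses a human-written target sentence, which is what lets the enlarged corpus stay usable for training a translation model.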
Authors
AN Bo; LONG Congjun (Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing 100081, China)
Source
《电子与信息学报》 (Journal of Electronics & Information Technology)
Indexed in: EI; CSCD; PKU Core Journals (北大核心)
2022, No. 1, pp. 118-126 (9 pages)
Funding
National Natural Science Foundation of China (62076233)
Major Innovation Project of the Chinese Academy of Social Sciences (2020YZDZX01-2)
Keywords
Neural machine translation
Chinese-to-English translation
Paraphrase generation
Data augmentation
Large-scale pre-trained language model