Abstract
Mongolian-Chinese translation is a low-resource translation task that suffers from a scarcity of parallel corpora. To alleviate the low translation accuracy caused by scarce parallel data and a limited vocabulary, this paper applies the dynamic pre-training method ELMo (Embeddings from Language Models) and combines it with a Transformer translation architecture that shares domain information across multiple tasks. First, ELMo, a deep contextualized word representation, is used to pre-train on the monolingual corpus. Second, the FastText word embedding algorithm is used to pre-train on the large-scale, context-dependent text in the Mongolian-Chinese parallel corpus. Then, following the principle that sharing parameters across tasks enables domain information sharing, a one-to-many encoder-decoder model is constructed for Mongolian-Chinese neural machine translation. Experimental results show that, compared with the Transformer baseline, the proposed method effectively improves translation quality on long input sentences.
Authors
张振
苏依拉
牛向华
高芬
赵亚平
仁庆道尔吉
ZHANG Zhen; SU Yila; NIU Xianghua; GAO Fen; ZHAO Yaping; Ren Qing Daoer Ji (School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010000, China)
Source
《计算机工程与应用》
CSCD
Peking University Core Journal
2020, Issue 10, pp. 106-114 (9 pages)
Computer Engineering and Applications
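Funding
National Natural Science Foundation of China (No. 61363052)
Natural Science Foundation of Inner Mongolia Autonomous Region (No. 2016MS0605)
Fund of the Ethnic Affairs Commission of Inner Mongolia Autonomous Region (No. MW-2017-MGYWXXH-03).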