Abstract
The performance of conventional machine translation models is limited by the size of the available bilingual parallel corpus, while unsupervised machine translation methods that use only monolingual data struggle to deliver stable model performance. To address this problem, this paper proposes an automatic corpus expansion method combined with the EM algorithm. Generated monolingual data are combined with the original dataset to build a parallel corpus, on which the models are trained iteratively. First, two unidirectional Transformer models are initialized and pre-trained on part of the bilingual corpus. The models are then optimized jointly with the EM algorithm: the machine translation models for the two opposite translation directions are updated iteratively by gradually reducing the translation loss on the training data. Experimental results show that this EM iterative training method, based on a mix of monolingual and bilingual corpora, outperforms both a supervised machine translation method using fully bilingual data and an unsupervised machine translation method using only monolingual data on Chinese-English machine translation tasks.
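The iterative scheme outlined in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the Transformer models are replaced by a hypothetical word-by-word lookup table (`ToyTranslator`), the function name `em_style_training` and its parameters are assumptions made for illustration, and no translation loss is computed. The sketch only shows the alternation between generating pseudo-parallel pairs from monolingual data and retraining the two opposite-direction models on the expanded corpus.

```python
from collections import Counter, defaultdict


class ToyTranslator:
    """Word-by-word lookup table; a hypothetical stand-in for a Transformer."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, pairs):
        # Re-estimate the table from scratch, aligning words by position
        # (assumes each sentence pair has the same number of words).
        self.counts = defaultdict(Counter)
        for src, tgt in pairs:
            for s, t in zip(src.split(), tgt.split()):
                self.counts[s][t] += 1
        return self

    def translate(self, sentence):
        out = []
        for word in sentence.split():
            if word in self.counts:
                out.append(self.counts[word].most_common(1)[0][0])
            else:
                out.append(word)  # copy unknown words unchanged
        return " ".join(out)


def em_style_training(seed_pairs, mono_src, mono_tgt, rounds=3):
    """Alternate between corpus expansion and retraining (illustrative only)."""
    reversed_seed = [(t, s) for s, t in seed_pairs]
    # Initialise both directions on the small bilingual seed corpus.
    fwd = ToyTranslator().fit(seed_pairs)
    bwd = ToyTranslator().fit(reversed_seed)
    for _ in range(rounds):
        # E-step-like: translate monolingual data to build pseudo-parallel pairs.
        pseudo_for_fwd = [(bwd.translate(t), t) for t in mono_tgt]
        pseudo_for_bwd = [(fwd.translate(s), s) for s in mono_src]
        # M-step-like: retrain each direction on seed plus pseudo pairs.
        fwd.fit(seed_pairs + pseudo_for_fwd)
        bwd.fit(reversed_seed + pseudo_for_bwd)
    return fwd, bwd


if __name__ == "__main__":
    fwd, bwd = em_style_training(
        seed_pairs=[("wo ai ni", "i love you")],
        mono_src=["ni ai wo"],
        mono_tgt=["you love i"],
    )
    print(fwd.translate("ni ai wo"))
    print(bwd.translate("i love you"))
```

In a real system each `fit` call would be a gradient-based update of a Transformer, and the quality of the pseudo pairs would depend on how well the opposite-direction model has been trained so far.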
Authors
Yang Yun (杨云)
Wang Quan (王全)
Institute of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an 710021, Shaanxi, China
Source
Computer Applications and Software (《计算机应用与软件》)
PKU Core Journal (北大核心)
2020, No. 8, pp. 250-255 (6 pages)
Funding
National Natural Science Foundation of China (61601271).