A Chinese-Vietnamese Parallel Corpus Expansion Method Based on Back Translation and Proportional Extraction Siamese Network Screening

Cited by: 4
Abstract: Back translation is an important data augmentation method in machine translation and has attracted growing research attention. Its basic idea is to first train a base translation model on a parallel corpus, then use that model to translate a monolingual corpus into the target language, and finally combine the results into a new corpus for model training. In the Chinese-Vietnamese low-resource scenario, however, the base translation model performs poorly, so the parallel corpus produced by back translation contains considerable noise and is difficult to use in downstream tasks. To address this problem, a siamese network screening model based on proportional extraction is constructed. Through training, the model learns to distinguish parallel sentence pairs from pseudo-parallel sentence pairs, and it screens and denoises the back-translated pseudo-parallel corpus in a shared semantic space, yielding a better parallel corpus. Experimental results on a Chinese-Vietnamese dataset show that models trained with the proposed method perform significantly better than the baseline model.
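The pipeline the abstract describes (back-translate, then screen the pseudo-parallel pairs in a shared semantic space) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' model: the abstract does not specify the siamese network architecture or how proportional extraction works, so the sketch substitutes a plain cosine-similarity threshold. The names `translate`, `encode_src`, `encode_tgt`, and `threshold` are hypothetical placeholders for a base translation model, two sentence encoders that map into the same vector space, and a cutoff chosen by the user.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def back_translate(monolingual: List[str],
                   translate: Callable[[str], str]) -> List[Pair]:
    """Pair each monolingual sentence with its machine translation.

    `translate` is a hypothetical stand-in for the base translation model.
    """
    return [(sentence, translate(sentence)) for sentence in monolingual]


def filter_pairs(pairs: List[Pair],
                 encode_src: Callable[[str], Vector],
                 encode_tgt: Callable[[str], Vector],
                 threshold: float) -> List[Pair]:
    """Keep only pairs whose embeddings are close in the shared space.

    Assumption: a fixed similarity threshold replaces the paper's trained
    siamese screening model, whose details the abstract does not give.
    """
    kept = []
    for src, tgt in pairs:
        if cosine(encode_src(src), encode_tgt(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept
```

In practice the two encoders would share weights, as in a siamese network, and the pairs that survive screening would be added to the original parallel corpus before retraining the translation model.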
Authors: WANG Ke-chao (王可超); GUO Jun-jun (郭军军); ZHANG Ya-fei (张亚飞); GAO Sheng-xiang (高盛祥); YU Zheng-tao (余正涛) (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China)
Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2022, No. 10, pp. 1861-1868 (8 pages)
Funding: National Natural Science Foundation of China (61732005, 61761026, 61866020, 61672271, 61762056, 61972186); National Key Research and Development Program of China (2019QY1801, 2019QY1802, 2019QY1800).
Keywords: Chinese-Vietnamese parallel corpus expansion; back translation; data augmentation; proportional extraction; siamese network