A Chinese-Vietnamese Parallel Corpus Expansion Method Based on Back Translation and Proportional Extraction Siamese Network Screening

Original title: 基于回译和比例抽取孪生网络筛选的汉越平行语料扩充方法

Cited by: 4

Abstract: Back translation is an important data augmentation method in machine translation and has attracted growing attention from researchers. The basic idea is to first train a base translation model on a parallel corpus, then use that model to translate a monolingual corpus into the target language, and finally combine the results into a new corpus for model training. In the Chinese-Vietnamese low-resource scenario, however, the trained base translation model performs poorly, so the parallel corpus produced by back translation contains considerable noise and is difficult to use for downstream tasks. To address this problem, a siamese network screening model based on proportional extraction is constructed. Through training, the model learns to distinguish parallel sentence pairs from pseudo-parallel sentence pairs, and it filters and denoises the back-translated pseudo-parallel corpus in a shared semantic space, yielding a better parallel corpus. Experimental results on a Chinese-Vietnamese dataset show that models trained with the proposed method perform significantly better than the baseline model.
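The pipeline described in the abstract (back-translate, then screen the pseudo-parallel pairs in a shared semantic space) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `back_translate`, `filter_pairs`, the `translate` and `encode` callables, the toy vectors, and the 0.8 threshold are all hypothetical. A plain cosine score stands in for the trained siamese classifier, and the proportional extraction step is not modeled because the abstract does not specify it.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def back_translate(monolingual: List[str],
                   translate: Callable[[str], str]) -> List[Pair]:
    """Pair each monolingual sentence with its machine translation.

    `translate` is a placeholder for the base translation model.
    """
    return [(sentence, translate(sentence)) for sentence in monolingual]


def filter_pairs(pairs: List[Pair],
                 encode: Callable[[str], Vector],
                 threshold: float) -> List[Pair]:
    """Keep pairs whose two sides are close in the shared space.

    The same `encode` function is applied to both sides of a pair,
    which mirrors the shared-encoder idea of a siamese network.
    """
    kept = []
    for src, tgt in pairs:
        if cosine(encode(src), encode(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept


# Toy data: sentence IDs mapped to made-up 2-d vectors (assumption).
toy_vectors = {
    "zh_1": [1.0, 0.0], "vi_1": [0.9, 0.1],   # close: likely parallel
    "zh_2": [1.0, 0.0], "vi_2": [0.0, 1.0],   # orthogonal: likely noise
}
toy_translations = {"zh_1": "vi_1", "zh_2": "vi_2"}

pseudo_pairs = back_translate(["zh_1", "zh_2"], toy_translations.__getitem__)
clean_pairs = filter_pairs(pseudo_pairs, toy_vectors.__getitem__, 0.8)
```

In practice the encoder would be a trained neural sentence encoder and the keep/drop decision would come from the trained screening model rather than a fixed cosine threshold; the sketch only shows where the filtering step sits relative to back translation.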

Authors: WANG Ke-chao (王可超); GUO Jun-jun (郭军军); ZHANG Ya-fei (张亚飞); GAO Sheng-xiang (高盛祥); YU Zheng-tao (余正涛). Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China.

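Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology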

Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2022, No. 10, pp. 1861-1868 (8 pages)

Funding: National Natural Science Foundation of China (61732005, 61761026, 61866020, 61672271, 61762056, 61972186); National Key Research and Development Program of China (2019QY1801, 2019QY1802, 2019QY1800).

Keywords: Chinese-Vietnamese parallel corpus expansion; back translation; data augmentation; proportional extraction; siamese network

Classification code: H085 [Language and Writing / Linguistics]


