Abstract
The accuracy of a machine translation dataset plays a decisive role in the performance of translation models. Because of the particularities of character encoding in traditional Mongolian, spelling errors are very common: the character-encoding accuracy of open resources on the Internet is below 20%, which poses a major obstacle to intelligent processing of Mongolian text. In this paper, we take the Mongolian-Chinese bilingual public evaluation dataset of the 17th China Conference on Machine Translation (CCMT 2021) as the original corpus, automatically correct its Mongolian text, and construct a high-quality corrected Mongolian-Chinese sentence-pair dataset for machine translation. Experimental results on the CWMT2017 test set show that the Mongolian-Chinese parallel sentence pairs with corrected Mongolian text yield better translation quality than the original evaluation data in both the Mongolian-to-Chinese and Chinese-to-Mongolian directions, which confirms that corrected Mongolian text is effective and practical for improving the performance of downstream natural language processing tasks.
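As an illustration of why encoding-level spelling errors are hard to spot in traditional Mongolian, the sketch below flags two words that differ in their Unicode code points but would likely look alike on screen. It is not the correction method of the paper. The `CONFUSABLE` table and the function names are hypothetical and cover only two letter pairs (O/U and OE/UE) that can render with the same or near-identical glyphs in many positions; a real system would need a much fuller, context-sensitive model.

```python
# Minimal sketch (assumption, not the authors' method): detect words that are
# encoded differently but collapse to the same "shape key".

# Hypothetical confusable classes: each letter maps to a class representative.
CONFUSABLE = {
    "\u1823": "\u1823",  # MONGOLIAN LETTER O
    "\u1824": "\u1823",  # MONGOLIAN LETTER U  -> same class as O
    "\u1825": "\u1825",  # MONGOLIAN LETTER OE
    "\u1826": "\u1825",  # MONGOLIAN LETTER UE -> same class as OE
}


def shape_key(word: str) -> str:
    """Replace every letter by its class representative (others unchanged)."""
    return "".join(CONFUSABLE.get(ch, ch) for ch in word)


def same_shape_different_encoding(a: str, b: str) -> bool:
    """True if the code points differ but the shape keys are identical."""
    return a != b and shape_key(a) == shape_key(b)


if __name__ == "__main__":
    # Two illustrative letter sequences differing only in O vs. U.
    w1 = "\u182E\u1823\u1837"  # MA + O + RA
    w2 = "\u182E\u1824\u1837"  # MA + U + RA
    print(same_shape_different_encoding(w1, w2))
```

A check of this kind can only surface candidate pairs; deciding which encoding is the correct spelling still requires a lexicon or a trained correction model.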
Authors
申影利
包乌格德勒
赵小兵
SHEN Yingli; BAO Wugedele; ZHAO Xiaobing (School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing 100081, P.R. China; Hohhot Minzu College, Hohhot 010051, P.R. China; School of Information Engineering, Minzu University of China, Beijing 100081, P.R. China; National Language Resource Monitoring & Research Center of Minority Languages, Beijing 100081, P.R. China)
Funding
Key Project of the State Language Commission (ZDI135-118)
Graduate Research and Practice Project of Minzu University of China (BZKY2021062)
Keywords
machine translation
traditional Mongolian
text correction
dataset