Abstract
Because public datasets are scarce, speech translation for minority languages has received little research attention. To address this, we construct and publicly release NMLR-Mon2Chs ST, a dataset for translating Mongolian speech into Chinese text. The dataset contains Mongolian speech recorded on mobile phones by 36 Mongolian speakers aged 20 to 25, together with Mongolian and Chinese texts annotated by professionals. To ensure data quality, we preprocessed the recordings by removing empty audio files, resampling, and normalizing. The result is 25 hours of high-quality data, with an average audio duration of 4.2 seconds. The dataset provides a data foundation for research on speech translation from minority languages to other languages.
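The three preprocessing steps named in the abstract (dropping empty recordings, resampling, normalization) can be illustrated with a minimal sketch. This is not the authors' pipeline: the 16 kHz target rate, the silence threshold, the 0.95 peak level, the use of linear interpolation for resampling, and all function names are assumptions chosen for illustration only.

```python
import numpy as np


def is_empty(samples, silence_threshold=1e-4):
    """True for a clip with no samples or a negligible peak amplitude.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return samples.size == 0 or float(np.max(np.abs(samples))) < silence_threshold


def resample_linear(samples, source_rate, target_rate):
    """Resample by linear interpolation.

    A simple stand-in: a production pipeline would normally use a
    band-limited resampler to avoid aliasing when downsampling.
    """
    if source_rate == target_rate:
        return samples.astype(np.float64)
    target_length = int(round(samples.size / source_rate * target_rate))
    source_times = np.arange(samples.size) / source_rate
    target_times = np.arange(target_length) / target_rate
    return np.interp(target_times, source_times, samples)


def peak_normalize(samples, peak=0.95):
    """Scale the clip so that its largest absolute sample equals `peak`."""
    return samples * (peak / np.max(np.abs(samples)))


def preprocess(clips, target_rate=16000):
    """Clean a list of (samples, sample_rate) pairs.

    Empty or silent clips are dropped; the rest are resampled to
    `target_rate` (an assumed value) and peak-normalized.
    """
    cleaned = []
    for samples, rate in clips:
        samples = np.asarray(samples, dtype=np.float64)
        if is_empty(samples):
            continue
        resampled = resample_linear(samples, rate, target_rate)
        if is_empty(resampled):  # guard against clips that vanish after resampling
            continue
        cleaned.append(peak_normalize(resampled))
    return cleaned
```

Calling `preprocess` on a list that mixes a real recording with an empty array and an all-zero array would return only the real recording, resampled and rescaled.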
Authors
戚肖克
特尼格尔
孙媛
赵小兵
QI Xiaoke; BORJIGIN BTeniger; SUN Yuan; ZHAO Xiaobing (China University of Political Science and Law, Beijing 102249, P.R. China; National Language Resource Monitoring & Research Center of Minority Languages, Beijing 100081, P.R. China; School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing 100081, P.R. China)
Funding
Key Project of the State Language Commission of China (ZDI135-118)
Keywords
speech translation
Mongolian-Chinese
minority languages
low resource
dataset