Abstract
With the rapid development of neural networks, speech translation research has begun to explore end-to-end approaches. Training a well-performing speech translation model usually requires a speech corpus of sufficient size and quality, and Russian-Chinese speech translation is no exception. Because speech translation research started relatively late, it often faces a shortage of publicly available, high-quality speech corpora, so building such corpora independently to meet the training needs of neural networks is especially important. After weighing the cost and quality of corpus construction, this paper manually selects 70 hours of Russian-Chinese film and television works from publicly accessible subtitle websites and, through three stages (formulating specifications, processing, and manual evaluation), builds a small-scale Russian-Chinese speech corpus. The result demonstrates the feasibility of this method and provides a data foundation for end-to-end speech translation research.
Authors
幸梦阳
马延周
杨政
XING Mengyang; MA Yanzhou; YANG Zheng (Strategic Support Force Information Engineering University, Luoyang Campus, Luoyang, Henan 471003)
Source
《软件》
2022, No. 5, pp. 85-87 (3 pages)
Software
Keywords
corpus
speech translation
film and television works