Abstract
Speech synthesis is an active research area in human-computer interaction. With the rise of deep learning, the research focus has shifted from inefficient traditional methods to end-to-end speech synthesis based on neural networks. However, building a mature speech synthesis system remains difficult in low-resource settings, where corpora for less widely spoken languages, training speech from a target speaker, or large emotional speech datasets are hard to collect. This paper therefore introduces the classic speech synthesis models by category and systematically reviews domestic and international research on the low-resource problem. From the perspectives of system architecture and model training, it describes the mainstream techniques used in recent years to improve the overall performance of speech synthesis models, and it summarizes open-source speech datasets covering multiple languages, emotions, and speakers that suit different speech synthesis tasks. It then surveys low-resource speech synthesis methods that apply deep learning and machine learning techniques such as transfer learning, meta learning, and data augmentation, comparing their advantages and disadvantages, and briefly introduces speaker adaptation, voice cloning, and voice conversion in few-shot scenarios. Finally, it discusses feasible research directions for alleviating the low-resource speech synthesis problem.
Authors
ZHANG Jialin (张佳琳)
Mairidan Wushouer (买日旦·吾守尔)
Gulanbaier Tuerhong (古兰拜尔·吐尔洪)
School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
PKU Core Journal (北大核心)
2023, Issue 15, pp. 1-16 (16 pages)
Funding
National Natural Science Foundation of China (2020680012)
Natural Science Foundation of Xinjiang Uygur Autonomous Region (202104120016)
Keywords
speech synthesis
low resource
data augmentation
transfer learning
meta learning
fine-tuning