
低资源条件下的语音合成方法综述

Review of Speech Synthesis Methods Under Low-Resource Conditions
Abstract  Speech synthesis is a popular research direction in the field of human-computer interaction. Since the advent of deep learning, its research focus has shifted from inefficient traditional methods to end-to-end speech synthesis based on neural networks. However, under low-resource conditions, where minority-language corpora, target-speaker training recordings, or large emotional speech datasets are difficult to collect, building a mature speech synthesis system remains a research challenge. This paper therefore introduces the classic speech synthesis models by category and systematically reviews the domestic and international research status of the low-resource problem. From the perspectives of system architecture and model training, it describes the mainstream techniques that have improved the overall performance of speech synthesis models in recent years, and summarizes open-source speech datasets, covering multiple languages, emotions, and speakers, that are applicable to different speech synthesis tasks. It surveys, analyzes, and compares the advantages and disadvantages of low-resource speech synthesis methods based on deep learning and machine learning, such as transfer learning, meta-learning, and data augmentation, and briefly introduces speaker adaptation, voice cloning, and voice conversion in few-shot scenarios. Finally, feasible research directions for alleviating the low-resource speech synthesis problem are discussed and prospected.
Authors  张佳琳 (ZHANG Jialin); 买日旦·吾守尔 (Mairidan Wushouer); 古兰拜尔·吐尔洪 (Gulanbaier Tuerhong), School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Source  Computer Engineering and Applications (《计算机工程与应用》), CSCD / Peking University Core Journal, 2023, No. 15, pp. 1-16
Funding  National Natural Science Foundation of China (2020680012); Natural Science Foundation of Xinjiang Uygur Autonomous Region (202104120016)
Keywords  speech synthesis; low resource; data augmentation; transfer learning; meta-learning; fine-tuning
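Of the low-resource remedies named in the abstract, data augmentation is the simplest to illustrate concretely. Below is a minimal sketch of waveform-level speed perturbation, a common augmentation recipe for enlarging small speech corpora; the function name, interpolation approach, and perturbation factors are illustrative choices, not taken from the paper itself.

```python
import numpy as np

def speed_perturb(wav: np.ndarray, factor: float) -> np.ndarray:
    """Change playback speed by resampling with linear interpolation.

    factor > 1.0 speeds the utterance up (shorter output);
    factor < 1.0 slows it down (longer output).
    """
    n_out = int(round(len(wav) / factor))
    # Fractional positions in the original signal for each output sample.
    src = np.linspace(0, len(wav) - 1, num=n_out)
    return np.interp(src, np.arange(len(wav)), wav)

# A small corpus can be tripled by keeping the original utterance and
# adding copies perturbed with factors such as 0.9 and 1.1.
```

In practice, such perturbed copies are treated as additional training utterances with the same transcript, which can partially compensate for scarce target-speaker or minority-language recordings.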