Abstract
Personalized speech synthesis aims to generate speech that carries a specific speaker's timbre. When synthesizing speech for out-of-domain speakers unseen during training, conventional methods show a noticeable timbre gap from the real voice, and disentangling speaker-specific features remains difficult. This paper proposes a multi-level disentangled personalized speech synthesis method for adapting to out-of-domain speakers: by fusing features at different granularities, it effectively improves synthesis performance for unseen speakers under zero-resource conditions. The method employs fast Fourier convolution to extract global speaker features, which strengthens generalization to out-of-domain speakers and achieves sentence-level speaker disentanglement; it further uses a speech recognition model to disentangle speaker information at the phoneme level and an attention mechanism to capture phoneme-level timbre features, achieving phoneme-level speaker disentanglement. Experiments on the public AISHELL3 dataset show that the proposed method reaches a speaker-embedding cosine similarity of 0.697 for out-of-domain speakers, a 6.25% improvement over the baseline model, demonstrating stronger modeling of out-of-domain speakers' timbre characteristics.
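The abstract describes two disentanglement levels: a sentence-level speaker encoder built on fast Fourier convolution, and phoneme-level timbre capture via attention over a reference utterance. The sketch below is a minimal, illustrative PyTorch rendering of those two ideas only; all module names, shapes, and hyper-parameters (SpectralConv1d, GlobalSpeakerEncoder, PhonemeTimbreAttention, channels=256, etc.) are assumptions for illustration and do not correspond to the authors' released implementation.

```python
# Illustrative sketch of the two disentanglement levels named in the abstract.
# Names, shapes, and hyper-parameters are assumptions, not the paper's code.
import torch
import torch.nn as nn


class SpectralConv1d(nn.Module):
    """Global branch of a fast-Fourier-convolution-style block: convolve in
    the frequency domain so every output frame sees the whole utterance."""
    def __init__(self, channels):
        super().__init__()
        # real and imaginary parts are stacked along the channel axis
        self.freq_conv = nn.Conv1d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, T)
        spec = torch.fft.rfft(x, dim=-1)       # complex, (B, C, T//2 + 1)
        spec = torch.cat([spec.real, spec.imag], dim=1)
        spec = self.freq_conv(spec)
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft(torch.complex(real, imag), n=x.size(-1), dim=-1)


class GlobalSpeakerEncoder(nn.Module):
    """Sentence-level speaker embedding from a reference mel-spectrogram,
    fusing a local convolution branch with the spectral (global) branch."""
    def __init__(self, n_mels=80, channels=256, embed_dim=256):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, channels, kernel_size=3, padding=1)
        self.local = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.spectral = SpectralConv1d(channels)
        self.proj = nn.Linear(channels, embed_dim)

    def forward(self, mel):                    # mel: (B, n_mels, T)
        h = torch.relu(self.pre(mel))
        h = torch.relu(self.local(h) + self.spectral(h))  # FFC-style fusion
        return self.proj(h.mean(dim=-1))       # (B, embed_dim)


class PhonemeTimbreAttention(nn.Module):
    """Phoneme-level timbre: each phoneme (content) vector attends over the
    reference frames to pick up fine-grained speaker characteristics."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, phoneme_feats, ref_frames):
        # phoneme_feats: (B, L, dim), content features (e.g. from an ASR model)
        # ref_frames:    (B, T, dim), frame-level reference features
        timbre, _ = self.attn(phoneme_feats, ref_frames, ref_frames)
        return phoneme_feats + timbre          # fuse content with timbre
```

In a full system, both outputs would condition the acoustic model (for example, a FastSpeech2-style decoder): the global embedding supplies the coarse speaker identity, while the attention output injects phoneme-level timbre.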
Authors
GAO Shengxiang; YANG Yuanzhang; WANG Linqin; MO Shangbin; YU Zhengtao; DONG Ling (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China; Yunnan Key Laboratory of Media Convergence (Yunnan Daily Press Group), Kunming, Yunnan 650228, China)
Source
Journal of Guangxi Normal University (Natural Science Edition)
CAS
Peking University Core Journal
2024, No. 4, pp. 11-21 (11 pages)
Funding
National Natural Science Foundation of China (62376111, U23A20388, 61972186, U21B2027)
Yunnan High-Tech Industry Development Project (201606)
Yunnan Provincial Basic Research Program (202001AS070014)
Yunnan Provincial Science and Technology Talent and Platform Program (202105AC160018)
Open Project of the Yunnan Key Laboratory of Media Convergence (220225702)
Yunnan Provincial Key Research and Development Program (202303AP140008, 202103AA080015)
Keywords
speech synthesis
zero-shot
speaker representation
out-of-domain speaker
feature disentanglement