Latent information is difficult to get from the text in speech synthesis.Studies show that features from speech can get more information to help text encoding.In the field of speech encoding,a lot of work has been con...Latent information is difficult to get from the text in speech synthesis.Studies show that features from speech can get more information to help text encoding.In the field of speech encoding,a lot of work has been conducted on two aspects.The first aspect is to encode speech frame by frame.The second aspect is to encode the whole speech to a vector.But the scale in these aspects is fixed.So,encoding speech with an adjustable scale for more latent information is worthy of investigation.But current alignment approaches only support frame-by-frame encoding and speech-to-vector encoding.It remains a challenge to propose a new alignment approach to support adjustable scale speech encoding.This paper presents the dynamic speech encoder with a new alignment approach in conjunction with frame-by-frame encoding and speech-to-vector encoding.The speech feature fromourmodel achieves three functions.First,the speech feature can reconstruct the origin speech while the length of the speech feature is equal to the text length.Second,our model can get text embedding fromspeech,and the encoded speech feature is similar to the text embedding result.Finally,it can transfer the style of synthesis speech and make it more similar to the given reference speech.展开更多
基金supported by National Key R&D Program of China (2020AAA0107901).
文摘Latent information is difficult to get from the text in speech synthesis.Studies show that features from speech can get more information to help text encoding.In the field of speech encoding,a lot of work has been conducted on two aspects.The first aspect is to encode speech frame by frame.The second aspect is to encode the whole speech to a vector.But the scale in these aspects is fixed.So,encoding speech with an adjustable scale for more latent information is worthy of investigation.But current alignment approaches only support frame-by-frame encoding and speech-to-vector encoding.It remains a challenge to propose a new alignment approach to support adjustable scale speech encoding.This paper presents the dynamic speech encoder with a new alignment approach in conjunction with frame-by-frame encoding and speech-to-vector encoding.The speech feature fromourmodel achieves three functions.First,the speech feature can reconstruct the origin speech while the length of the speech feature is equal to the text length.Second,our model can get text embedding fromspeech,and the encoded speech feature is similar to the text embedding result.Finally,it can transfer the style of synthesis speech and make it more similar to the given reference speech.