
Chinese Prosodic Structure Prediction Based on a Pretrained Language Representation Model (基于预训练语言表示模型的汉语韵律结构预测)

Cited by: 2
Abstract: Prosodic structure prediction is a key step in a text-to-speech system, and its results directly affect the naturalness and intelligibility of synthesized speech. This paper proposes a prosodic structure prediction method based on a pretrained language representation model. With the character as the modeling unit, a separate output layer is set on top of the pretrained language model for each prosodic level, and the pretrained model is fine-tuned with prosody-labeled data. A word segmentation task is additionally introduced, and multitask learning is used to model the relationships among the prosodic levels and between prosody and lexical words, so that the prosodic boundaries at all levels of the input text are predicted simultaneously. The experiments first demonstrate the rationality of the multi-output structure and the effectiveness of using a pretrained model, and verify that adding the word segmentation task further improves model performance. Comparing the best result with two baselines, the F1 scores for prosodic word and prosodic phrase prediction show absolute improvements of 2.48% and 4.50%, respectively, over the conditional random field model, and of 6.2% and 5.4%, respectively, over the bidirectional long short-term memory model. Finally, the experiments show that the proposed method reduces the amount of training data required while maintaining prediction performance.
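To make the multi-output setup in the abstract concrete, the sketch below shows one way character-level targets could be laid out: one binary boundary label per character for each task (word segmentation, prosodic word, prosodic phrase), with a boundary-level F1 computed per task. The example sentence, the label values, the task names, and the helper `boundary_f1` are all hypothetical and are not taken from the paper, whose actual label scheme and scoring procedure may differ.

```python
from typing import Dict, List


def boundary_f1(gold: List[int], pred: List[int]) -> float:
    """F1 over positions labeled 1 (a boundary follows that character)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 6-character input; 1 means a boundary after that character.
chars = list("今天天气很好")

# Assumed gold labels: each prosodic level's boundaries are a subset of the
# boundaries of the level below it (phrase ⊆ prosodic word ⊆ word).
gold: Dict[str, List[int]] = {
    "word_seg": [0, 1, 0, 1, 1, 1],
    "prosodic_word": [0, 1, 0, 1, 0, 1],
    "prosodic_phrase": [0, 0, 0, 1, 0, 1],
}

# Assumed predictions, standing in for one output layer per task.
pred: Dict[str, List[int]] = {
    "word_seg": [0, 1, 0, 1, 1, 1],
    "prosodic_word": [0, 1, 0, 1, 1, 1],
    "prosodic_phrase": [0, 0, 0, 0, 0, 1],
}

# One score per task, mirroring the per-level F1 values the abstract reports.
scores = {task: boundary_f1(gold[task], pred[task]) for task in gold}
for task, score in scores.items():
    print(f"{task}: F1 = {score:.4f}")
```

An absolute improvement such as the 2.48% quoted in the abstract is a difference between two such F1 values expressed in percentage points, not a relative change.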
Authors: Zhang Pengyuan (张鹏远); Lu Chunhui (卢春晖); Wang Ruimin (王睿敏). Affiliations: Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China.
Source: Journal of Tianjin University: Science and Technology (《天津大学学报(自然科学与工程技术版)》), 2020, No. 3, pp. 265-271 (7 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (11590773, 11590770); Joint Military Information System Equipment Pre-research Project (JZX2017-0994/Y306).
Keywords: prosodic structure prediction; pretrained language representation model; multitask learning; speech synthesis