Abstract
Long-tailed data distributions are a common problem in NLP practice. Taking the polyphone disambiguation task in the text-to-speech (TTS) front end as an example, the extreme imbalance of polyphone data and the scarcity of tail data degrade the practical performance of industrial TTS systems. Observing that Chinese polyphones are long-tail distributed along both the “character” and “pronunciation” dimensions, this paper proposes a double-weighted (DW) algorithm, which can be combined with either of two existing long-tail algorithms, MARC and Decouple-cRT, to further improve model performance. On both open-source and industrial data, DW improves accuracy to varying degrees over the baseline model and the two original algorithms, offering a solution and a point of reference for multi-dimensional long-tail problems.
Authors
GAO Yu (高羽)
XIONG Yijin (熊一瑾)
YE Jiancheng (叶建成)
AI Innovation Center, Midea Group (Shanghai) Co., Ltd., Shanghai 201702, China
Source
Journal of Chinese Information Processing (《中文信息学报》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2022, Issue 11, pp. 169-176 (8 pages)
Keywords
polyphone disambiguation
long-tail distribution
re-weighting
decoupling representation and classifier