Journal article

The self-adaptation of the acoustic encoder in end-to-end automatic speech recognition under diverse acoustic scenes (cited by: 1)
Abstract: A scene-adaptive acoustic encoder (SAE) design method is proposed for diverse acoustic scenes. By learning the differences among the acoustic features contained in speech from different acoustic scenes, the method adaptively designs a suitable acoustic encoder for the end-to-end speech recognition task. Neural architecture search is introduced to improve the effectiveness of the encoder design and, in turn, the performance of the downstream recognition task. Experiments on three commonly used Chinese and English datasets, Aishell-1, HKUST and SWBD, show that the acoustic encoders obtained with the proposed scene-adaptive design method achieve an average relative character error rate reduction of more than 5% over the best existing human-designed encoders. The proposed method is an effective way to analyze the speech features of a specific scene in depth and to design a high-performance acoustic encoder accordingly.
Authors: 刘育坤 (LIU Yukun), 郑霖 (ZHENG Lin), 黎塔 (LI Ta), 张鹏远 (ZHANG Pengyuan) (Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049)
Source: Acta Acustica (《声学学报》), 2023, No. 6, pp. 1260-1268 (9 pages). Indexed in EI, CAS, CSCD; Peking University core journal.
Funding: National Key R&D Program of China (2020AAA0108002); Institute of Acoustics, Chinese Academy of Sciences, independently deployed "goal-oriented" project (MBDX202106).
Keywords: Automatic speech recognition; Acoustic encoder; Self-adaptation; Neural architecture search
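The abstract's core idea, using neural architecture search to pick encoder operations per acoustic scene, can be illustrated with the continuous-relaxation trick popularized by DARTS: each candidate operation in a cell is weighted by a softmax over learnable architecture parameters, and after search the top-weighted operation is kept. The candidate operations and weights below are purely illustrative toys, not the paper's actual search space:

```python
import numpy as np

# Hypothetical candidate operations for one encoder cell; the real search
# space (e.g. convolution / self-attention variants) is assumed, not known.
def identity(x):
    return x

def smooth(x):            # 3-tap moving average, a toy stand-in for a conv op
    return np.convolve(x, np.ones(3) / 3, mode="same")

def zero(x):              # "no connection" candidate, as in DARTS
    return np.zeros_like(x)

CANDIDATES = [identity, smooth, zero]

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op(x, alpha):
    """Continuous relaxation: the cell output is a softmax(alpha)-weighted
    sum of all candidate operations, so alpha can be trained by gradients."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, CANDIDATES))

# During search, alpha would be updated by gradient descent on a validation
# loss; here we just use toy "learned" values to show discretization.
alpha = np.array([0.1, 2.0, -1.0])
x = np.random.randn(8)
y = mixed_op(x, alpha)            # relaxed (search-time) cell output

# After search, the discretized cell keeps only the top-weighted operation.
chosen = CANDIDATES[int(np.argmax(softmax(alpha)))]
print(chosen.__name__)
```

A scene-adaptive variant in the spirit of the abstract would run this search separately on data from each acoustic scene, so that different scenes can discretize to different operations.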
