Abstract
The end-to-end framework is a probabilistic model based on deep neural networks that directly maps between speech signals and target-language characters. From raw data input to final output, the intermediate processing is integrated into the neural network itself, so features are extracted directly, free of human subjective bias. This exploits the information in the data more fully and simplifies the task pipeline. In recent years, the introduction of attention mechanisms has helped end-to-end architectures realize mutual mapping between modalities, further improving overall performance. A survey of recent end-to-end techniques and applications in intelligent speech shows that the end-to-end architecture offers new ideas and methods for speech modeling algorithms, but problems remain: hybrid frameworks cannot effectively balance and accommodate the characteristics of each individual technique, the complex internal logic of the models makes manual debugging difficult, and customization and extensibility are reduced. End-to-end integrated models are expected to develop further in speech applications along two lines. The first is end-to-end integration of the front-end and back-end modules, which drops the multiple input assumptions involved in front-end speech enhancement and back-end speech recognition and unifies speech enhancement with acoustic modeling. The second is end-to-end processing of the interaction information carrier, which focuses on extracting and processing information from the speech signal itself, bringing human-computer interaction closer to the way humans actually communicate through language.
Authors
李荪
曹峰
LI Sun; CAO Feng (China Academy of Information and Communications Technology, Beijing 100191, China)
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2022, Issue S01, pp. 331-336 (6 pages)
Computer Science
Keywords
End-to-end model
Intelligent voice
Hybrid framework
Human-computer interaction