In air traffic control communications (ATCC), misunderstandings between pilots and controllers can result in fatal aviation accidents. Fortunately, advanced automatic speech recognition technology has emerged as a promising means of preventing miscommunications and enhancing aviation safety. However, most existing speech recognition methods merely incorporate external language models on the decoder side, leading to insufficient semantic alignment between the speech and text modalities during the encoding phase. Furthermore, because speech sequences are much longer than their text transcripts, modeling long-range acoustic context dependencies is challenging, especially for the extended utterances in ATCC data. To address these issues, we propose a speech-text multimodal dual-tower architecture for speech recognition. It employs cross-modal interactions to achieve close semantic alignment during the encoding stage and to strengthen the model's ability to capture long-range acoustic context. In addition, a two-stage training strategy is carefully devised to derive semantics-aware acoustic representations effectively. The first stage pre-trains the speech-text multimodal encoding module to enhance inter-modal semantic alignment and long-range acoustic context modeling. The second stage fine-tunes the entire network to bridge the input-modality gap between the training and inference phases and to boost generalization performance. Extensive experiments demonstrate the effectiveness of the proposed speech-text multimodal speech recognition method on the ATCC and AISHELL-1 datasets. It reduces the character error rate to 6.54% and 8.73%, respectively, and exhibits substantial relative gains of 28.76% and 23.82% over the best baseline model. Case studies indicate that the resulting semantics-aware acoustic representations aid in accurately recognizing terms with similar pronunciations but distinct semantics. This research provides a novel modeling paradigm for semantics-aware speech recognition in air traffic control communications, which could contribute to the advancement of intelligent and efficient aviation safety management.
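The abstract does not include the authors' implementation, so the following is only a minimal PyTorch sketch of the dual-tower idea it describes: two encoders, one per modality, coupled layer by layer through cross-attention so that acoustic representations are aligned with text semantics during encoding. All module names (CrossModalBlock, DualTowerEncoder), dimensions, and layer counts are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of a speech-text dual-tower encoder with
# cross-modal interaction; NOT the paper's code.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One tower layer: self-attention within a modality, then
    cross-attention that queries the other modality's states."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, context):
        h, _ = self.self_attn(x, x, x)
        x = self.norm1(x + h)
        h, _ = self.cross_attn(x, context, context)  # cross-modal alignment step
        x = self.norm2(x + h)
        return self.norm3(x + self.ffn(x))

class DualTowerEncoder(nn.Module):
    def __init__(self, vocab_size=5000, feat_dim=80, d_model=256, n_layers=2):
        super().__init__()
        self.speech_proj = nn.Linear(feat_dim, d_model)  # e.g. log-Mel frames
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.speech_tower = nn.ModuleList([CrossModalBlock(d_model) for _ in range(n_layers)])
        self.text_tower = nn.ModuleList([CrossModalBlock(d_model) for _ in range(n_layers)])

    def forward(self, speech_feats, text_ids):
        s = self.speech_proj(speech_feats)
        t = self.text_embed(text_ids)
        for s_blk, t_blk in zip(self.speech_tower, self.text_tower):
            s, t = s_blk(s, t), t_blk(t, s)  # bidirectional cross-modal interaction
        return s, t

# Toy forward pass: 2 utterances, 120 speech frames, 20 text tokens.
enc = DualTowerEncoder()
s_repr, t_repr = enc(torch.randn(2, 120, 80), torch.randint(0, 5000, (2, 20)))
print(s_repr.shape, t_repr.shape)  # torch.Size([2, 120, 256]) torch.Size([2, 20, 256])
```

In this reading, the paper's two-stage strategy would pre-train such an encoder with paired speech-text inputs, then fine-tune the full recognizer with speech-only inputs to close the train/inference modality gap.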
A dual-tower recommendation model built on the squeeze-and-excitation network (SENET) is applied to the problem of cigarette consumption recommendation. Based on a long-term dataset containing user, product, and historical interaction information, the SENET dual-tower model is used to model the data and predict users' cigarette consumption behavior. Experimental results show that the SENET dual-tower model has an advantage in capturing the interaction information between consumers and products, and that it achieves better recommendation performance than traditional recommendation algorithms.
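The abstract gives no architectural details beyond "SENET dual-tower", so the sketch below is one plausible reading, loosely following how squeeze-and-excitation is commonly applied to feature-field embeddings in recommenders: each tower reweights its input fields with an SE layer before an MLP, and the user and item vectors are matched by dot product. All field counts, sizes, and names are placeholder assumptions.

```python
# Illustrative SENET two-tower recommender sketch; NOT the paper's code.
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-excitation over feature fields: pool each field's
    embedding to a scalar, pass through a bottleneck MLP, and rescale
    the fields with the resulting weights."""
    def __init__(self, n_fields, reduction=2):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(n_fields, n_fields // reduction), nn.ReLU(),
            nn.Linear(n_fields // reduction, n_fields), nn.Sigmoid(),
        )

    def forward(self, fields):            # fields: (batch, n_fields, dim)
        z = fields.mean(dim=-1)           # squeeze to (batch, n_fields)
        w = self.excite(z).unsqueeze(-1)  # excitation weights per field
        return fields * w                 # reweighted fields

class Tower(nn.Module):
    def __init__(self, n_fields, dim=16, out_dim=32):
        super().__init__()
        self.se = SELayer(n_fields)
        self.mlp = nn.Sequential(nn.Linear(n_fields * dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))

    def forward(self, fields):
        return self.mlp(self.se(fields).flatten(1))

class SENetTwoTower(nn.Module):
    def __init__(self, user_fields=4, item_fields=3, dim=16):
        super().__init__()
        self.user_tower = Tower(user_fields, dim)
        self.item_tower = Tower(item_fields, dim)

    def forward(self, user_fields, item_fields):
        u = self.user_tower(user_fields)
        v = self.item_tower(item_fields)
        return torch.sigmoid((u * v).sum(-1))  # interaction score in (0, 1)

# Toy scoring: pre-embedded user/item feature fields for a batch of 8.
model = SENetTwoTower()
score = model(torch.randn(8, 4, 16), torch.randn(8, 3, 16))
print(score.shape)  # torch.Size([8])
```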
Funding: This research was funded by the Shenzhen Science and Technology Program (Grant No. RCBS20221008093121051), the General Higher Education Project of Guangdong Provincial Education Department (Grant No. 2020ZDZX3085), the China Postdoctoral Science Foundation (Grant No. 2021M703371), and the Post-Doctoral Foundation Project of Shenzhen Polytechnic (Grant No. 6021330002K).