Audio-Text Multimodal Speech Recognition via Dual-Tower Architecture for Mandarin Air Traffic Control Communications


Abstract: In air traffic control communications (ATCC), misunderstandings between pilots and controllers can result in fatal aviation accidents. Advanced automatic speech recognition technology has emerged as a promising means of preventing miscommunication and enhancing aviation safety. However, most existing speech recognition methods merely incorporate external language models on the decoder side, leading to insufficient semantic alignment between the speech and text modalities during the encoding phase. Furthermore, because speech sequences are longer than their text counterparts, long-distance acoustic context dependencies are difficult to model, especially for the extended utterances in ATCC data. To address these issues, we propose a speech-text multimodal dual-tower architecture for speech recognition. It employs cross-modal interactions to achieve close semantic alignment during the encoding stage and to strengthen the modeling of long-distance acoustic context dependencies. In addition, a two-stage training strategy is devised to derive semantics-aware acoustic representations effectively. The first stage pre-trains the speech-text multimodal encoding module to enhance inter-modal semantic alignment and long-distance acoustic context modeling. The second stage fine-tunes the entire network to bridge the input-modality gap between the training and inference phases and to boost generalization performance. Extensive experiments on the ATCC and AISHELL-1 datasets demonstrate the effectiveness of the proposed speech-text multimodal speech recognition method. It reduces the character error rate to 6.54% and 8.73%, respectively, corresponding to performance gains of 28.76% and 23.82% over the best baseline model. Case studies indicate that the obtained semantics-aware acoustic representations aid in accurately recognizing terms with similar pronunciations but distinct meanings. The research provides a novel modeling paradigm for semantics-aware speech recognition in air traffic control communications, which could contribute to the advancement of intelligent and efficient aviation safety management.
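The character error rate reported above is conventionally defined as the character-level edit distance between the reference and the recognized transcript, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names `edit_distance` and `character_error_rate` and the example transcript pair are hypothetical and chosen only for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the first i-1 reference
    # characters and the first j hypothesis characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference char missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra hypothesis char)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = character edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: one homophone substitution (塔 vs. 他) in four characters.
print(character_error_rate("联系塔台", "联系他台"))
```

The made-up pair differs in a single character that sounds the same but means something different, which is the kind of error the abstract says semantics-aware representations help to avoid.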

Authors: Shuting Ge, Jin Ren, Yihua Shi, Yujun Zhang, Shunzhi Yang, Jinfeng Yang

Affiliations: School of Computer Science and Software Engineering; Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong-Macao Greater Bay Area; Shenzhen Institutes of Advanced Technology; Industrial Training Centre

Source: Computers, Materials & Continua (SCIE, EI), 2024, Issue 3, pp. 3215-3245 (31 pages)

Funding: This research was funded by the Shenzhen Science and Technology Program (Grant No. RCBS20221008093121051), the General Higher Education Project of Guangdong Provincial Education Department (Grant No. 2020ZDZX3085), the China Postdoctoral Science Foundation (Grant No. 2021M703371), and the Post-Doctoral Foundation Project of Shenzhen Polytechnic (Grant No. 6021330002K).

Keywords: Speech-text multimodal; automatic speech recognition; semantic alignment; air traffic control communications; dual-tower architecture

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]


