Funding: This research was funded by the Shenzhen Science and Technology Program (Grant No. RCBS20221008093121051), the General Higher Education Project of Guangdong Provincial Education Department (Grant No. 2020ZDZX3085), the China Postdoctoral Science Foundation (Grant No. 2021M703371), and the Post-Doctoral Foundation Project of Shenzhen Polytechnic (Grant No. 6021330002K).
Abstract: In air traffic control communications (ATCC), misunderstandings between pilots and controllers can result in fatal aviation accidents. Advanced automatic speech recognition technology has therefore emerged as a promising means of preventing miscommunication and enhancing aviation safety. However, most existing speech recognition methods merely incorporate external language models on the decoder side, leading to insufficient semantic alignment between the speech and text modalities during the encoding phase. Furthermore, because speech sequences are much longer than their text counterparts, it is challenging to model long-distance acoustic context dependencies, especially for extended ATCC data. To address these issues, we propose a speech-text multimodal dual-tower architecture for speech recognition. It employs cross-modal interactions to achieve close semantic alignment during the encoding stage and to strengthen its capability to model long-distance acoustic context dependencies. In addition, a two-stage training strategy is carefully devised to derive semantics-aware acoustic representations effectively. The first stage pre-trains the speech-text multimodal encoding module to enhance inter-modal semantic alignment and long-distance acoustic context modeling. The second stage fine-tunes the entire network to bridge the input-modality gap between the training and inference phases and to boost generalization performance. Extensive experiments demonstrate the effectiveness of the proposed speech-text multimodal speech recognition method on the ATCC and AISHELL-1 datasets. It reduces the character error rate to 6.54% and 8.73%, respectively, with substantial performance gains of 28.76% and 23.82% over the best baseline model. Case studies indicate that the obtained semantics-aware acoustic representations aid in accurately recognizing terms with similar pronunciations but distinct semantics. This research provides a novel modeling paradigm for semantics-aware speech recognition in air traffic control communications, which could contribute to more intelligent and efficient aviation safety management.
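The cross-modal interaction underlying such a dual-tower encoder can be illustrated with a minimal NumPy sketch of scaled dot-product cross-attention, in which acoustic frames (queries) attend to text tokens (keys/values). All names, shapes, and dimensions here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(speech, text, d_k):
    """Speech frames (queries) attend to text tokens (keys/values).

    speech: (T_speech, d) acoustic frame embeddings
    text:   (T_text, d)   token embeddings
    Returns text-informed speech representations of shape (T_speech, d).
    """
    scores = speech @ text.T / np.sqrt(d_k)  # (T_speech, T_text) similarities
    weights = softmax(scores, axis=-1)       # each frame's weights sum to 1 over tokens
    return weights @ text                    # aggregate token semantics per frame

rng = np.random.default_rng(0)
speech = rng.normal(size=(200, 64))  # long acoustic sequence
text = rng.normal(size=(12, 64))     # much shorter token sequence
out = cross_modal_attention(speech, text, d_k=64)
print(out.shape)  # (200, 64)
```

Note how the attention map lets every acoustic frame pool semantic evidence from the entire (short) text sequence in one step, which is exactly the kind of long-distance linkage that is hard to obtain from purely acoustic self-attention over a 200-frame input.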
Funding: This work was supported by the National Natural Science Foundation of China (No. 61806208).
Abstract: Detecting prohibited items with convolutional neural networks (CNNs) is of great significance for ensuring public safety. However, since the natural occurrence of such prohibited items is a small-probability event, collecting datasets large enough to support CNN training is a major challenge. In this paper, we propose a new method, based on Generative Adversarial Networks (GANs), for synthesizing X-ray security images containing multiple prohibited items from semantic label images. In principle, it can synthesize as many X-ray images as needed. A new generator architecture built on Res2Net is presented, which is more effective at learning multi-scale features of different prohibited items. We extend the method by establishing a semantic label library of 14,000 images, from which we synthesize 14,000 X-ray security images in total. The experimental results show strong generative performance (a Fréchet Inception Distance (FID) score of 30.55), and we achieve a mean average precision (mAP) of 0.825 with the Single Shot MultiBox Detector (SSD) for object detection, demonstrating the effectiveness of our approach.
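The FID score reported above compares the Gaussian statistics (mean and covariance) of feature activations from real versus synthesized images. A self-contained NumPy sketch of the standard computation follows; it is illustrative only, not the authors' evaluation code, and in practice the features would come from an Inception network rather than random arrays:

```python
import numpy as np

def sqrtm_psd(a):
    # matrix square root of a symmetric positive semi-definite matrix
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(feats_real, feats_fake):
    """FID between two feature sets, each of shape (n_samples, dim)."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    # Tr((C1 C2)^{1/2}) computed via the symmetric form (C2^{1/2} C1 C2^{1/2})^{1/2}
    s2 = sqrtm_psd(c2)
    covmean = sqrtm_psd(s2 @ c1 @ s2)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(c1) + np.trace(c2) - 2.0 * np.trace(covmean))

rng = np.random.default_rng(1)
real = rng.normal(size=(500, 8))
fake = rng.normal(loc=0.5, size=(500, 8))
print(fid(real, real))  # near zero: identical sets
print(fid(real, fake))  # clearly positive: shifted distribution
```

Lower is better: a score of 0 means the two Gaussian fits coincide, so the 30.55 reported above indicates the synthesized X-ray images sit close to the real ones in feature space.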
Abstract: Deep learning (DL) based semantic segmentation methods can extract object information including category, location, and shape. In this paper, the identification of prohibited items is regarded as a semantic segmentation task, and a universal model for the automatic identification of prohibited items is proposed. The model makes two improvements over a general semantic segmentation network. First, an N-type encoding structure is applied to enlarge the receptive field of the network, aiming to reduce misclassification. Second, considering the lack of surface texture in X-ray security images, and inspired by feature reuse in DenseNet, shallow semantic information is reused to improve segmentation accuracy. With input images of size 512×512, this model achieves a mean intersection over union (mIoU) of 0.783 on a seven-class object recognition problem.
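The mIoU metric used above averages per-class intersection-over-union, IoU = TP / (TP + FP + FN), computed from a pixel-level confusion matrix. A minimal NumPy sketch of this standard computation (not the authors' evaluation code; the toy labels are made up for illustration):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union from flat per-pixel label arrays."""
    # confusion[i, j] counts pixels of true class i predicted as class j
    confusion = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(confusion, (target, pred), 1)
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp       # predicted as class c but wrong
    fn = confusion.sum(axis=1) - tp       # true class c but missed
    denom = tp + fp + fn
    # classes absent from both pred and target are excluded from the mean
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))

target = np.array([0, 0, 1, 1, 2, 2])
pred   = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, target, num_classes=3))  # (1/3 + 2/3 + 1/2) / 3 = 0.5
```

Averaging over classes rather than pixels keeps rare prohibited-item classes from being swamped by the background class, which is why mIoU is the conventional choice for this kind of seven-class problem.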