Funding: Supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No. 10063424, "Development of distant speech recognition and multi-task dialog processing technologies for in-door conversational robots").
Abstract: Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) have driven tremendous improvements over acoustic models based on Gaussian Mixture Models (GMMs). However, these hybrid models require a force-aligned Hidden Markov Model (HMM) state sequence obtained from a GMM-based acoustic model, so training takes a long time because both the GMM-based acoustic model and the deep learning-based acoustic model must be trained. To solve this problem, an acoustic model using the Connectionist Temporal Classification (CTC) algorithm is proposed. The CTC algorithm does not require a GMM-based acoustic model because it does not use the force-aligned HMM state sequence. However, previous work on LSTM RNN-based acoustic models using CTC used small-scale training corpora. In this paper, an LSTM RNN-based acoustic model using CTC is trained on a large-scale training corpus and its performance is evaluated. The implemented acoustic model achieves Word Error Rates (WERs) of 6.18% and 15.01% for clean and noisy speech, respectively, which is comparable to the performance of the acoustic model trained with the hybrid method.
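The key property that lets CTC dispense with forced alignment is its many-to-one mapping from framewise output paths to label sequences: any path that collapses to the target labels counts as a valid alignment. A minimal sketch of that collapse rule (merge repeats, then drop blanks), assuming blank index 0, looks like this; the details of the abstract's actual model are not reproduced here.

```python
BLANK = 0  # assumed blank-symbol index

def ctc_collapse(path):
    """Collapse a framewise symbol path to its label sequence:
    merge consecutive repeats, then remove blanks."""
    labels = []
    prev = None
    for sym in path:
        # A symbol is emitted only when it differs from the previous
        # frame and is not the blank, so e.g. [1,1,0,1] yields [1,1].
        if sym != prev and sym != BLANK:
            labels.append(sym)
        prev = sym
    return labels

# Two different paths collapse to the same labels, which is why CTC
# training sums over all such paths instead of fixing one alignment.
print(ctc_collapse([1, 1, 0, 1, 2, 2, 0]))  # → [1, 1, 2]
```

During training, the CTC loss marginalizes over every path whose collapse equals the transcript, removing the need for a GMM/HMM system to supply one fixed state alignment per frame.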
Funding: This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2021-0-0268, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)].
Abstract: Acoustic scene classification (ASC) is the task of recognizing and classifying acoustic environments from their audio signals. Various deep learning-based ASC approaches have been developed, with convolutional neural networks (CNNs) proving the most reliable and most commonly used in ASC systems because of their suitability for building lightweight models. When deploying ASC systems in the real world, model complexity and device robustness are essential considerations. In this paper, we propose a two-pass mobile network for low-complexity acoustic scene classification, named TP-MobNet. TP-MobNet is based on MobileNetV2 with its inverted residuals and linear bottlenecks, and coordinate attention and two-pass fusion are applied after the mobile blocks. Coordinate attention lets the network learn long-range dependencies and precise positional information in feature maps, while two-pass fusion improves generalization by capturing more diverse feature resolutions at the end of the network. The model size is further reduced by applying weight quantization to the trained model. All experiments used the TAU Urban Acoustic Scenes 2020 Mobile development set. The proposed model, with a size of 219.6 kB, achieves an accuracy of 73.94%.
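The size-reduction step mentioned above, post-training weight quantization, can be sketched as follows. This is symmetric per-tensor linear int8 quantization, one common scheme; the abstract does not specify which scheme TP-MobNet uses, so the functions below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 with a single per-tensor scale,
    shrinking storage roughly 4x (4 bytes -> 1 byte per weight)."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights for inference."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.0, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
# q.nbytes is one quarter of w.nbytes; dequantize(q, s) is within
# one quantization step (s) of the original weights.
```

Quantizing only the weights (not activations) in this way reduces the stored model size while leaving the network architecture unchanged, which is consistent with the reported 219.6 kB footprint being achieved after training.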
Funding: This work was supported by the ICT R&D program of the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) [Project Number: 2020-0-00113, Project Name: Development of data augmentation technology using heterogeneous information and data fusion].
Abstract: A differentiable neural computer (DNC) is analogous to a von Neumann machine: a neural network controller interacts with an external memory through an attention mechanism. DNCs offer a generalized alternative to task-specific deep learning models and have demonstrated reliability on reasoning problems. In this study, we apply a DNC to the language model (LM) task, which is a reasoning problem in that the next word must be predicted from the previous word sequence. However, memory deallocation is a problem in DNCs: information unrelated to the input sequence is not deallocated and remains in the external memory, which degrades performance. We therefore propose a forget gate-based memory deallocation (FMD) method, which searches for the minimum-valued element of a forget gate-based retention vector; this vector indicates the degree to which the information stored at each external memory address should be retained. In experiments, we applied the proposed architecture to LM tasks as a task-specific example and to rescoring for speech recognition as a general-purpose example. For the LM tasks, we evaluated the DNC on the Penn Treebank and enwik8 corpora. Although it does not yield state-of-the-art results on the LM tasks, the FMD method improves on the baseline DNC in terms of bits per character. For the speech recognition rescoring task, FMD again showed a relative improvement on LibriSpeech data in terms of word error rate.
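The core of the FMD idea described above, selecting the memory address with the minimum retention value and freeing it, can be sketched as follows. This is an illustrative simplification under assumed data structures (a memory matrix, a retention vector in [0, 1], and a usage vector), not the authors' exact formulation, which involves the controller's forget gates.

```python
import numpy as np

def fmd_deallocate(memory, retention, usage):
    """Free the external-memory address with the smallest retention
    value, i.e. the one judged least relevant to the input sequence.

    memory    : (N, W) array, N addresses of width-W word vectors
    retention : (N,) array in [0, 1], forget gate-based retention degree
    usage     : (N,) array, how "occupied" each address is
    Returns the index of the deallocated address.
    """
    idx = int(np.argmin(retention))  # least-retained address
    usage[idx] = 0.0                 # mark the slot as free for writing
    memory[idx] = 0.0                # clear the stale content
    return idx

mem = np.ones((3, 4))
ret = np.array([0.9, 0.1, 0.5])  # address 1 is least worth retaining
use = np.ones(3)
freed = fmd_deallocate(mem, ret, use)  # → 1
```

Deallocating the least-retained address before the next write keeps stale, input-unrelated content from lingering in external memory, which is the degradation the abstract attributes to standard DNC deallocation.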