Abstract
Environmental sound recognition (ESR) plays an important role in fields such as context awareness and assistive technology. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are two of the most representative feature extraction methods and have achieved remarkable results in speech and music signal processing. Each has a weakness, however: CNNs cannot extract temporal features effectively, and RNNs are at a clear disadvantage in extracting spatial features. To extract and use both temporal and spatial features, a new model (CNN+BiLSTM+attention mechanism) is proposed. A time-distributed CNN extracts urban environmental sound features from Mel spectrograms, a bidirectional long short-term memory network (BiLSTM) then obtains temporal information from the CNN output, and finally an attention mechanism is applied to the BiLSTM output sequence so that the model focuses on the features most relevant to the urban environmental sound before making its classification decision. The attention mechanism both improves classification accuracy and makes the model more interpretable. Experimental results show that the model achieves an average classification accuracy of 80.2% on the UrbanSound8K dataset, which is better than previously reported results on the same dataset.
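The last stage of the pipeline, attention over the BiLSTM output sequence, can be illustrated with a minimal sketch. The abstract does not specify the scoring function, so the dot-product scoring against a single learned query vector used here is an assumption, as are the names `attention_pool`, `frames`, and `query` and the toy numbers. It shows only how per-time-step weights are formed and used to pool a sequence into one vector. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(sequence, context):
    """Collapse a sequence of T feature vectors into one vector.

    sequence: list of T vectors (stand-ins for per-step BiLSTM outputs)
    context:  hypothetical learned query vector of the same dimension
    Returns (weights, pooled); weights sum to 1 over the T steps.
    """
    # Assumed scoring rule: dot product of each step with the query.
    scores = [sum(h_d * c_d for h_d, c_d in zip(h, context)) for h in sequence]
    weights = softmax(scores)
    dim = len(sequence[0])
    # Weighted sum of the time steps, one output value per feature.
    pooled = [sum(w * h[d] for w, h in zip(weights, sequence)) for d in range(dim)]
    return weights, pooled


# Toy input: 3 time steps with 2 features each (made-up values).
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]
weights, pooled = attention_pool(frames, query)
```

Inspecting `weights` shows which time steps dominate the pooled vector. This is the sense in which the abstract links attention to interpretability.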
Authors
杨磊
赵红东
YANG Lei; ZHAO Hong-dong (School of Electronic and Information Engineering, Hebei University of Technology, Tianjin 300300, China)
Source
《科学技术与工程》
Peking University Core Journal
2020, No. 33, pp. 13757-13761 (5 pages)
Science Technology and Engineering
Funding
Fund of the Key Laboratory of Optoelectronic Information Control and Security Technology (614210701041705).
Keywords
convolutional neural network
bidirectional long short-term memory network
attention mechanism