Abstract
To address the problems that frequency-domain music and voice separation methods discard phase information, while time-domain end-to-end methods cannot fully exploit the acoustic information contained in time-frequency representations, a music and voice separation method based on ensemble learning was proposed. First, a convolutional block attention module (CBAM) was inserted between the encoding blocks and decoding blocks of a frequency-domain U-shaped convolutional neural network (U-Net), so that feature weights were adjusted along both the channel and spatial dimensions and the feature extraction ability of the model was enhanced. Second, a time-domain end-to-end separation model, ST-Demucs (soft threshold-Demucs), was proposed, which adds a fully connected subnetwork and a soft-thresholding layer to the encoding layers in order to extract features selectively and suppress redundant noise. Finally, the separation results of the two models were fused with a soft-voting strategy, which compensates for the phase loss of the frequency-domain model and yields target-source waveforms closer to the clean audio. Experimental results on the MUSDB18 dataset show that the improved frequency-domain model raises the signal-to-distortion ratio (SDR) by 0.33 dB and the improved time-domain model raises it by 0.31 dB; after ensembling, the SDR is improved further, and the proposed ensemble-learning-based separation method outperforms each individual model in separation performance.
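The channel-and-spatial re-weighting performed by CBAM can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the structure follows the generic CBAM design (a shared MLP over average- and max-pooled channel descriptors, followed by a convolution over pooled spatial maps), and all class names, the reduction ratio, and the kernel size are illustrative assumptions.

# Minimal sketch of CBAM-style attention for spectrogram features; names and
# hyperparameters are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Shared MLP applied to average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) spectrogram features.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        weight = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * weight  # re-weight channels


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pool along the channel axis, then learn a 2-D attention map.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        weight = torch.sigmoid(self.conv(pooled))
        return x * weight  # re-weight time-frequency positions


class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial_att(self.channel_att(x))  # channel first, then spatial

Per the abstract, such a block sits between the encoding and decoding blocks of the frequency-domain U-Net; the exact channel counts and placement along the skip connections are not specified there and are left open in this sketch.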
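The "fully connected subnetwork plus soft-thresholding layer" added to the ST-Demucs encoding layers can likewise be sketched as a generic learned-shrinkage block in the spirit of deep residual shrinkage networks; the exact placement and dimensions in ST-Demucs are assumptions, and the names below are hypothetical.

# Generic learned soft-thresholding block for time-domain encoder features.
import torch
import torch.nn as nn


class SoftThresholdGate(nn.Module):
    """A small fully connected subnetwork predicts a per-channel threshold from
    the average activation magnitude; activations whose magnitude falls below
    that threshold are shrunk towards zero (soft thresholding)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),  # scaling factor in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) features produced by an encoder layer.
        abs_mean = x.abs().mean(dim=2)                       # (batch, channels)
        tau = (abs_mean * self.fc(abs_mean)).unsqueeze(-1)   # per-channel threshold
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

The intent, as described in the abstract, is that small, noise-like activations are suppressed while salient features pass through; how the threshold is parameterized inside ST-Demucs itself is not stated and may differ from this sketch.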
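Finally, a minimal sketch of the soft-voting fusion of the two models' waveform estimates, together with a simplified signal-to-distortion ratio shown only to make the reported metric concrete. The fusion weight w is a hypothetical hyperparameter, and MUSDB18 results are normally computed with the standard BSS Eval / museval tooling, which uses a more elaborate SDR definition than the one below.

import torch


def soft_vote(est_freq: torch.Tensor, est_time: torch.Tensor, w: float = 0.5) -> torch.Tensor:
    # Weighted average of the waveform estimated by the frequency-domain model
    # and the one estimated by the time-domain model; w would be tuned on
    # validation data (hypothetical hyperparameter).
    return w * est_freq + (1.0 - w) * est_time


def sdr_db(reference: torch.Tensor, estimate: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Plain SDR in dB between a clean reference source and its estimate,
    # for illustration only (not the museval implementation).
    noise = reference - estimate
    return 10.0 * torch.log10(reference.pow(2).sum() / (noise.pow(2).sum() + eps))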
Authors
MENG Jingjing, XU Yabin (Big Data Security Technology Research Institute, Beijing Information Science & Technology University, Beijing 100101, China; Computer School, Beijing Information Science & Technology University, Beijing 100101, China)
Source
Journal of Beijing Information Science and Technology University (Natural Science Edition)
2023, No. 3, pp. 27-34 (8 pages)
Funding
National Natural Science Foundation of China (61672101)
Open Project of the Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (ICCD XN004)
Open Project of the Key Laboratory of Information Network Security, Ministry of Public Security (C18601)
Keywords
music and voice separation
convolutional block attention module
soft thresholding
ensemble learning