Abstract
To address the limited separation accuracy of existing singing voice separation algorithms, a singing voice separation algorithm based on a high-resolution network and a self-attention mechanism is proposed. The algorithm builds a deep neural network on a frequency-domain model, uses a high-resolution network as the backbone to preserve separation accuracy, and integrates a self-attention mechanism into the network to capture repeated melodies in a song. Separation proceeds in three steps. First, the short-time Fourier transform (STFT) converts the music signal to the time-frequency domain, yielding a magnitude spectrogram. Second, the neural network separates the song's magnitude spectrogram into magnitude spectrograms of the singing voice and the accompaniment. Finally, these are combined with the phase spectrogram of the original song, and the inverse STFT recovers the time-domain signals of the singing voice and the accompaniment. On the MUSDB18 dataset, the separated singing voice and accompaniment reach signal-to-distortion ratios (SDR) of 7.68 dB and 12.85 dB, respectively, which are 21.52% and 1.26% higher than the baseline model. The results indicate that the algorithm strengthens the feature representation ability of the neural network and effectively improves singing voice separation.
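The three-step frequency-domain pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the window choice (Hann), and the `n_fft`/`hop` values are assumptions, and `estimate_magnitudes` is a hypothetical stand-in that simply halves the mixture magnitude where the paper would apply its high-resolution network with self-attention.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Short-time Fourier transform; returns an array of shape (frames, bins)."""
    window = np.hanning(n_fft + 1)[:-1]          # periodic Hann window
    padded = np.pad(x, (n_fft, n_fft))           # zero-pad so the edges are fully covered
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack([padded[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length, n_fft=1024, hop=256):
    """Inverse STFT by weighted overlap-add; returns `length` samples."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    total = n_fft + hop * (spec.shape[0] - 1)
    out = np.zeros(total)
    norm = np.zeros(total)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)                # undo the squared-window weighting
    return out[n_fft:n_fft + length]             # drop the padding added in stft()

def estimate_magnitudes(mix_mag):
    """Hypothetical placeholder for the separation network (assumption):
    splits the mixture magnitude evenly between the two sources."""
    return 0.5 * mix_mag, 0.5 * mix_mag

def separate(mix, n_fft=1024, hop=256):
    """Step 1: STFT -> magnitude; step 2: estimate source magnitudes;
    step 3: reuse the mixture phase and invert to the time domain."""
    spec = stft(mix, n_fft, hop)
    mix_mag, mix_phase = np.abs(spec), np.angle(spec)
    vocal_mag, accomp_mag = estimate_magnitudes(mix_mag)
    vocals = istft(vocal_mag * np.exp(1j * mix_phase), len(mix), n_fft, hop)
    accomp = istft(accomp_mag * np.exp(1j * mix_phase), len(mix), n_fft, hop)
    return vocals, accomp
```

Because the phase is taken from the mixture rather than estimated, the quality of the result in such a pipeline depends entirely on how well the magnitude spectrograms are separated, which is the part the proposed network addresses.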
Authors
NI Xin (倪欣); REN Jia (任佳)
Faculty of Mechanical Engineering & Automation, Zhejiang Sci-Tech University, Hangzhou 310018
Source
Journal of Zhejiang Sci-Tech University (Natural Sciences), 2022, No. 3, pp. 405-412 (8 pages)
Funding
Zhejiang Provincial Public Welfare Technology Research Project (LGG20F030007).
Keywords
singing voice separation
high-resolution network
self-attention mechanism
deep neural network
frequency-domain model