Abstract
Acoustic features can be roughly categorized into three classes: handcrafted, parameterized, and learnable features. Among them, learnable features, which are trained jointly with the separation network in an end-to-end fashion, as in the convolutional time domain audio separation network (Conv-Tasnet), have become a new trend in speech separation research. Very recent studies, however, show that handcrafted and parameterized features can also yield competitive results. A systematic comparison across the three kinds of acoustic features has not yet been conducted. In this paper, we compare them within the Conv-Tasnet framework by setting its encoder and decoder to different acoustic features. We also generalize the handcrafted multi-phase gammatone filterbank (MPGTF) to a new parameterized multi-phase gammatone filterbank (ParaMPGTF). Experimental results on the WSJ0-2mix corpus show that (i) if the decoder is learnable, setting the encoder to STFT, MPGTF, ParaMPGTF, or learnable features leads to similar performance; and (ii) when the pseudo-inverse transforms of STFT, MPGTF, and ParaMPGTF are used as the decoders, the proposed ParaMPGTF performs better than the other two handcrafted features.
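The encoder/decoder pairing described in the abstract can be illustrated with a minimal NumPy sketch of a fixed STFT-style encoder whose decoder is the pseudo-inverse of the same filterbank. This is not the authors' implementation: the function names, the Hann window, and the window length of 16 samples with a hop of 8 are assumptions chosen for illustration, and the separation network that would sit between encoder and decoder is omitted (equivalent to an all-ones mask).

```python
import numpy as np


def stft_filterbank(win_len):
    """Real-valued STFT analysis filters: windowed cosines stacked on windowed sines.

    Hypothetical helper; returns an array of shape (2 * (win_len // 2 + 1), win_len).
    """
    n = np.arange(win_len)
    k = np.arange(win_len // 2 + 1)[:, None]
    # Hann window without its zero endpoints, so the filterbank keeps full column rank.
    window = np.hanning(win_len + 2)[1:-1]
    angle = 2 * np.pi * k * n / win_len
    return np.vstack([np.cos(angle) * window, -np.sin(angle) * window])


def encode(x, filters, hop):
    """Frame the waveform and project each frame onto the fixed filters."""
    win_len = filters.shape[1]
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[t * hop : t * hop + win_len] for t in range(n_frames)])
    return frames @ filters.T  # shape (n_frames, n_filters)


def decode(features, filters, hop):
    """Map features back to frames with the pseudo-inverse, then overlap-add."""
    win_len = filters.shape[1]
    frames = features @ np.linalg.pinv(filters).T  # shape (n_frames, win_len)
    n_frames = frames.shape[0]
    out = np.zeros(hop * (n_frames - 1) + win_len)
    weight = np.zeros_like(out)
    for t in range(n_frames):
        out[t * hop : t * hop + win_len] += frames[t]
        weight[t * hop : t * hop + win_len] += 1.0
    return out / weight  # average the overlapping frame estimates


# Illustrative round trip on random data (assumed sizes, no separation network).
rng = np.random.default_rng(0)
x = rng.standard_normal(408)        # 50 frames of length 16 with hop 8
W = stft_filterbank(16)
feats = encode(x, W, hop=8)         # a separation network would mask these features
x_hat = decode(feats, W, hop=8)
```

Because the filterbank has full column rank, the pseudo-inverse recovers each frame when the features are left unmasked, so any reconstruction error in a real system comes from the estimated masks rather than from the fixed transform. Swapping `stft_filterbank` for a gammatone-style or trainable filter matrix would give the other encoder types compared in the paper, under the same framing assumptions.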
Authors
ZHU Wenbo
WANG Mou
ZHANG Xiaolei
Susanto Rahardja
(CIAIC, School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an 710072, China)
Source
Journal of Communication University of China: Science and Technology
2021, Issue 3, pp. 52-57 (6 pages)