Abstract
Existing synthetic speech detection systems suffer severe performance degradation in real-world scenarios. This paper proposes a data augmentation method for cepstral features based on frequency masking. First, the linear filter bank (LFB) features of the input signal are masked along the frequency channels to introduce speech distortion consistent with real-world conditions. Then, the linear frequency cepstral coefficients (LFCC) of the masked features are computed to reduce the feature dimensionality and improve detection performance. Three detection systems, built on a light convolutional neural network (LCNN), a deep residual network (ResNet), and a one-dimensional convolutional Transformer (OCT), are used to verify the effectiveness of the proposed method. Experiments on real-scene datasets show that the proposed method reduces the equal error rate (EER) of the different synthetic speech detection systems by 6.39% to 25.95% compared with the baseline without augmentation. When combined with codec-based augmentation, the proposed method reduces the EER of the different systems by 31.71% to 42.47% compared with the baseline, which further improves the systems' generalization to real-world scenarios and outperforms existing data augmentation methods.
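The two steps named in the abstract (masking filter-bank channels, then computing cepstral coefficients from the masked features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single-band mask, the fill value of 0.0, and the unnormalized DCT-II are all hypothetical choices, and the input is assumed to already hold log filter-bank energies arranged as frames by channels.

```python
import math
import random


def frequency_mask(lfb, max_width, rng, fill=0.0):
    """Mask one random band of consecutive filter-bank channels.

    lfb is a list of frames, each a list of log filter-bank energies.
    The same band is masked in every frame (assumption: one band,
    constant fill value). The input is not modified.
    """
    n_channels = len(lfb[0])
    width = rng.randint(0, max_width)            # band width, inclusive bounds
    start = rng.randint(0, n_channels - width)   # first masked channel
    masked = [
        frame[:start] + [fill] * width + frame[start + width:]
        for frame in lfb
    ]
    return masked, (start, width)


def dct2(x, n_coef):
    """Unnormalized DCT-II of one frame, keeping the first n_coef terms."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i in range(n))
        for k in range(n_coef)
    ]


def lfcc_from_lfb(lfb, n_coef):
    """Cepstral coefficients per frame: DCT over the channel axis.

    Keeping n_coef smaller than the number of channels is what
    reduces the feature dimensionality.
    """
    return [dct2(frame, n_coef) for frame in lfb]


# Toy input: 5 frames x 20 channels of made-up log energies.
rng = random.Random(0)
lfb = [[math.log(1.0 + t + c) for c in range(20)] for t in range(5)]

masked, (start, width) = frequency_mask(lfb, max_width=6, rng=rng)
lfcc = lfcc_from_lfb(masked, n_coef=8)
```

In a training pipeline, a function like `frequency_mask` would typically be applied on the fly with a fresh random band per utterance, so that the detector sees a different distortion each epoch; the mask width limit and coefficient count used here are placeholders rather than values from the paper.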
Authors
万伊
李春国
杨飞然
杨军
WAN Yi; LI Chunguo; YANG Feiran; YANG Jun (Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049; School of Information Science and Engineering, Southeast University, Nanjing 210096)
Source
《高技术通讯》
CAS
Peking University Core Journals
2024, No. 10, pp. 1013-1023 (11 pages)
Chinese High Technology Letters
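Funding
National Natural Science Foundation of China (62171438)
Beijing Natural Science Foundation (4242013)
Institute of Acoustics, Chinese Academy of Sciences, self-deployed "Frontier Exploration" project (QYTS202111)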
Keywords
synthetic speech detection
data augmentation
real scenes
frequency masking
generalization ability