Abstract
Continual advances in speech synthesis and voice conversion technology pose a serious threat to voiceprint recognition systems. Existing spoofed speech detection methods adapt poorly to multiple spoofing types and are weak at detecting unknown spoofing attacks. To address this, a spoofed speech detection model combining a Convolutional Neural Network (CNN) with a Transformer is proposed. First, a position-aware feature sequence extraction network is designed, based on SE-ResNet18 with embedded Coordinate Attention (CA); it maps local time-frequency representations of the speech signal to high-dimensional feature sequences and introduces two-Dimensional Position Encoding (2D-PE) to preserve the relative positional relationships among features. Second, a multi-scale self-attention mechanism is proposed to model long-term dependencies among feature sequences at multiple scales, addressing the Transformer's difficulty in capturing local dependencies. Third, Sequence Pooling (SeqPool) is introduced to extract utterance-level features while retaining the correlation information among the frame-level feature sequences output by the Transformer layers. Experimental results on the official Logical Access (LA) dataset of the ASVspoof2019 challenge show that, compared with current state-of-the-art spoofed speech detection systems, the proposed method reduces the Equal Error Rate (EER) by 12.83% on average and the tandem Detection Cost Function (t-DCF) by 7.81% on average.
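As a rough illustration of the sequence pooling step mentioned in the abstract, the sketch below pools frame-level features into one utterance-level vector using softmax attention weights over time. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `seq_pool`, the scoring projection `weights`, and `bias` are hypothetical, and the formulation (a linear score per frame, softmax over frames, weighted sum) is the common SeqPool form rather than one confirmed by the source.

```python
import math


def seq_pool(frames, weights, bias=0.0):
    """Attention-style sequence pooling over frame-level features.

    frames  : list of T feature vectors, each a list of D floats
    weights : list of D floats (hypothetical learned projection D -> 1)
    bias    : scalar bias of that projection (hypothetical)
    Returns (pooled, alphas): the pooled vector of length D and the
    T attention weights that produced it.
    """
    # Score each frame with a linear projection to a single scalar.
    scores = [sum(w * x for w, x in zip(weights, f)) + bias for f in frames]
    # Softmax over the time axis, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # The weighted sum of all frames is the utterance-level feature.
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas
```

Because every frame contributes in proportion to its learned score, the pooled vector depends on the whole output sequence rather than on a single class token. With an all-zero projection the weights are uniform and the result reduces to plain average pooling.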
Authors
徐童心
黄俊
XU Tongxin; HUANG Jun (School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China)
Source
《无线电工程》
2024, No. 5, pp. 1091-1098 (8 pages)
Radio Engineering
Funding
National Natural Science Foundation of China (61771085).