Spoofed Speech Detection Based on CNN-Transformer

Abstract: Continual advances in speech synthesis and voice conversion pose a serious threat to voiceprint recognition systems. Existing spoofed speech detection methods adapt poorly to multiple spoofing types and detect unknown spoofing attacks unreliably. To address this, a spoofed speech detection model combining a Convolutional Neural Network (CNN) with a Transformer is proposed. First, a position-aware feature sequence extraction network is built on SE-ResNet18 with embedded Coordinate Attention (CA); it maps local time-frequency representations of the speech signal to high-dimensional feature sequences and applies two-Dimensional Position Encoding (2D-PE) to preserve the relative positions of the features. Second, a multi-scale self-attention mechanism models long-term dependencies within the feature sequence at several scales, addressing the Transformer's difficulty in capturing local dependencies. Third, Sequence Pooling (SeqPool) extracts utterance-level features while retaining the correlation information among the frame-level feature sequences output by the Transformer layers. Experiments on the official Logical Access (LA) dataset of the ASVspoof2019 challenge show that, compared with current advanced spoofed speech detection systems, the proposed method reduces the Equal Error Rate (EER) by 12.83% on average and the tandem Detection Cost Function (t-DCF) by 7.81% on average.
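The abstract names SeqPool but does not give its formula. A minimal sketch of one common formulation of sequence pooling is shown below, assuming a single learned linear scoring layer followed by a softmax over frames; the function name `seq_pool` and the parameters `w` and `b` are hypothetical and are not taken from the paper.

```python
import numpy as np


def seq_pool(frames: np.ndarray, w: np.ndarray, b: float = 0.0) -> np.ndarray:
    """Pool a (T, D) frame-level sequence into one (D,) utterance-level vector.

    frames: Transformer-layer outputs, one row per frame (T frames, D dims).
    w, b:   hypothetical learned scoring parameters; w has shape (D,).
    """
    scores = frames @ w + b              # (T,) one importance score per frame
    scores = scores - scores.max()       # shift for a numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()    # (T,) non-negative, sums to 1
    return weights @ frames              # (D,) attention-weighted average of frames


# Usage with random stand-in data (assumed sizes: 50 frames, 8 dimensions).
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))
w = rng.normal(size=8)
utterance_vector = seq_pool(frames, w)
print(utterance_vector.shape)  # (8,)
```

Unlike plain mean pooling, the weights here depend on the frame contents, so frames that score higher contribute more to the utterance-level vector. With `w` set to zeros the function reduces to an ordinary mean over frames.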

Authors: XU Tongxin; HUANG Jun (School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China)

Affiliation: School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications

Source: Radio Engineering (无线电工程), 2024, No. 5, pp. 1091-1098 (8 pages)

Funding: National Natural Science Foundation of China (61771085).

Keywords: spoofed speech detection; position-aware sequence; Transformer; feature sequence pooling (SeqPool)

Classification: TP391.4 [Automation and Computer Technology: Computer Application Technology]
