Abstract
Speech deepfake technology uses deep learning methods to synthesize or generate speech. The rapid iteration and optimization of artificial intelligence generated content technologies have driven notable improvements in the naturalness, fidelity, and diversity of forged speech, while also posing great challenges to speech deepfake detection. This paper presents a comprehensive review of research progress on speech deepfake generation and its detection. First, it introduces forgery techniques represented by speech synthesis (SS) and voice conversion (VC). It then surveys the datasets and evaluation metrics commonly used in speech deepfake detection. On this basis, existing detection techniques are categorized and analyzed in depth along the processing pipeline: data augmentation, feature extraction and optimization, and learning mechanisms. Specifically, from the data augmentation perspective, the effects of noise addition, mask augmentation, channel augmentation, and compression augmentation on detection performance are analyzed; from the feature extraction and optimization perspective, the advantages and disadvantages of detection methods based on handcrafted features, hybrid features, end-to-end learning, and feature fusion are compared; and from the learning mechanism perspective, training strategies such as self-supervised learning, adversarial training, and multi-task learning are discussed. Finally, the remaining challenges of speech deepfake detection are summarized and future research directions are outlined. The datasets and code compiled in this paper are available at https://github.com/media-sec-lab/Audio-Deepfake-Detection.
Speech deepfake technology, which employs deep learning methods to synthesize or generate speech, has emerged as a critical research hotspot in multimedia information security. The rapid iteration and optimization of artificial intelligence-generated content technologies have significantly advanced speech deepfake techniques. These advancements have greatly enhanced the naturalness, fidelity, and diversity of synthesized speech. However, they have also presented great challenges for speech deepfake detection technology. To address these challenges, this study comprehensively reviews recent research progress on speech deepfake generation and its detection techniques. Based on an extensive literature survey, this study first introduces the research background of speech forgery and its detection and compares and analyzes previously published reviews in this field. Second, this study provides a concise overview of speech deepfake generation, especially speech synthesis (SS) and voice conversion (VC). SS, commonly known as text-to-speech (TTS), analyzes text and generates speech that aligns with the provided input by applying linguistic rules for text description. Various deep models are employed in TTS, including sequence-to-sequence models, flow models, generative adversarial network models, variational auto-encoder models, and diffusion models. VC involves modifying acoustic features, such as emotion, accent, pronunciation, and speaker identity, to produce speech resembling natural human speech. Depending on the number of target speakers, VC algorithms can be categorized into single-target, multi-target, and arbitrary-target speech conversion. Third, this study briefly introduces commonly used datasets in speech deepfake detection and provides access links to open-source datasets, along with two commonly used evaluation metrics: the equal error rate and the tandem detection cost function. This study then analyzes and categorizes existing speech deepfake detection techniques in detail. The pros and cons of different detection techniques are studied and compared in depth, focusing primarily on data processing, feature extraction and optimization, and learning mechanisms. Notably, this study summarizes the experimental results of existing detection techniques on the ASVspoof 2019 and 2021 datasets in tabular form. Within this context, the primary focus of this study is to investigate the generality of current detection techniques in the field of speech deepfake detection rather than specific forgery attack methods. Data augmentation applies a series of transformations to the original speech data, including noise addition, mask enhancement, channel enhancement, and compression enhancement, each aiming to simulate complex real-world acoustic environments more effectively. Among them, one of the most common data processing methods is noise addition, which perturbs the speech signal with added noise to simulate the complex acoustic environment of a real scenario as closely as possible. Mask enhancement applies masking operations in the time or frequency domain of speech to suppress noise and enhance the speech signal, thereby improving the accuracy and robustness of detection techniques. Transmission channel enhancement focuses on the problems of signal attenuation, data loss, and noise interference caused by changes in the codec and transmission channel of speech data. Compression enhancement techniques address the degradation of speech quality during data compression; the main compression formats are MP3, M4A, and OGG. From the perspective of feature extraction and optimization, speech deepfake detection methods can be divided into handcrafted feature-based, hybrid feature-based, end-to-end, and feature fusion-based methods. Handcrafted features refer to speech features extracted with the help of certain prior knowledge, mainly including the constant-Q transform, linear frequency cepstral coefficients, and the Mel spectrogram. By contrast, hybrid feature-based forgery detection methods utilize the domain knowledge provided by handcrafted features to mine richer speech representations through deep learning networks. End-to-end forgery detection methods directly learn feature representations and classification models from raw speech signals, which eliminates the need for handcrafted feature extraction and allows the model to discover discriminative features from the input data automatically. Moreover, these detection techniques can be trained using a single feature. Alternatively, feature-level fusion can be employed to combine multiple features, whether identical or different, using techniques such as weighted aggregation and feature concatenation. By fusing these features, detection techniques can capture richer speech information and thus improve performance. Regarding learning mechanisms, this study explores the impact of different training methods on forgery detection techniques, especially self-supervised learning, adversarial training, and multi-task learning. Self-supervised learning plays an important role in forgery detection by automatically generating auxiliary targets or labels from speech data to train models; fine-tuning a self-supervised pretrained model can effectively distinguish between real and forged speech. Adversarial training-based forgery detection enhances the robustness and generalization of the model by adding adversarial samples to the training data. In contrast to plain binary classification, forgery detection based on multi-task learning captures more comprehensive and useful speech feature information from different speech-related tasks by sharing the underlying feature representations, which improves detection performance while effectively utilizing speech training data. Although speech deepfake detection techniques have achieved excellent performance on some datasets, their performance is less satisfactory when tested on speech data from natural scenarios. Analysis of existing research shows that the main future research directions are to establish diversified speech deepfake datasets, study adversarial samples and data enhancement methods for improving the robustness of detection techniques, develop generalized speech deepfake detection techniques, and explore interpretable speech deepfake detection techniques. The relevant datasets and code mentioned can be accessed at https://github.com/media-sec-lab/Audio-Deepfake-Detection.
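The abstract names the equal error rate (EER) as one of the two standard evaluation metrics for speech deepfake detection. The following is a minimal illustrative sketch, not taken from the paper, of how an EER can be computed from detector scores; it assumes higher scores indicate bona fide speech, and all function and variable names are hypothetical.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: the operating point where the false acceptance rate
    (spoofed speech accepted) equals the false rejection rate (bona fide
    speech rejected). Assumes higher score = more likely bona fide."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])    # false acceptance rate
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # false rejection rate
    idx = np.argmin(np.abs(far - frr))                                  # point where the two rates cross
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Illustrative usage with synthetic detector scores.
rng = np.random.default_rng(0)
eer, threshold = compute_eer(rng.normal(1.0, 1.0, 1000),    # scores for real speech
                             rng.normal(-1.0, 1.0, 1000))   # scores for spoofed speech
print(f"EER = {eer:.3f} at threshold {threshold:.3f}")
```

The tandem detection cost function (t-DCF) additionally weights these error rates by the costs and priors of a downstream speaker verification system, so it is normally computed with the official ASVspoof evaluation scripts rather than re-implemented from scratch.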
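The data augmentation operations surveyed above, noise addition and time/frequency masking, can be illustrated with a short sketch. This is an assumed, generic implementation rather than any specific method from the reviewed literature; parameter values and function names are arbitrary choices.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix background noise into a waveform at a target signal-to-noise ratio (dB)."""
    if len(noise) < len(speech):  # tile the noise if it is shorter than the speech
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def mask_augment(spec, num_masks=2, max_freq_width=8, max_time_width=20, seed=None):
    """SpecAugment-style masking: zero out random frequency bands and
    time spans of a (freq, time) spectrogram."""
    rng = np.random.default_rng(seed)
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(num_masks):
        f = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - f)))
        spec[f0:f0 + f, :] = 0.0          # frequency mask
        t = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - t)))
        spec[:, t0:t0 + t] = 0.0          # time mask
    return spec
```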
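The handcrafted front ends mentioned above (constant-Q transform, linear frequency cepstral coefficients, and the Mel spectrogram) are standard signal-processing features. A rough sketch of how a log-Mel spectrogram and LFCCs might be computed is given below; it relies on librosa and SciPy, and the frame parameters (n_fft, hop_length, filter counts) are illustrative assumptions, not values prescribed by the surveyed methods.

```python
import numpy as np
import librosa
from scipy.fft import dct

def log_mel_spectrogram(y, sr, n_mels=80):
    """Log-Mel spectrogram front end (librosa.cqt similarly yields the constant-Q transform)."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=n_mels)
    return librosa.power_to_db(mel)

def lfcc(y, sr, n_coeff=20, n_filters=40):
    """Linear frequency cepstral coefficients: like MFCCs, but the triangular
    filter bank is spaced linearly rather than on the Mel scale."""
    power_spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=512)
    edges = np.linspace(0, sr / 2, n_filters + 2)        # linearly spaced band edges
    fbank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo + 1e-9)
        falling = (hi - freqs) / (hi - mid + 1e-9)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    log_energy = np.log(fbank @ power_spec + 1e-10)
    return dct(log_energy, axis=0, norm="ortho")[:n_coeff]  # keep the first n_coeff cepstra
```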
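Feature-level fusion by concatenation or weighted aggregation, as described in the abstract, can be sketched as a small PyTorch module. This is a hypothetical illustration: the branch dimensions, the learnable scalar weight, and the two-class head are assumptions, not details of any specific detector in the survey.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuse two speech embeddings (e.g., a spectral branch and a raw-waveform
    branch) by concatenation or a learnable weighted sum, then classify."""
    def __init__(self, dim_a, dim_b, mode="concat"):
        super().__init__()
        self.mode = mode
        if mode == "concat":
            in_dim = dim_a + dim_b                        # feature concatenation
        else:
            assert dim_a == dim_b, "weighted fusion assumes equal dimensions"
            self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight
            in_dim = dim_a
        self.classifier = nn.Linear(in_dim, 2)            # bona fide vs. spoof logits

    def forward(self, feat_a, feat_b):
        if self.mode == "concat":
            fused = torch.cat([feat_a, feat_b], dim=-1)
        else:
            fused = self.alpha * feat_a + (1.0 - self.alpha) * feat_b
        return self.classifier(fused)

# Illustrative usage with dummy embeddings from two hypothetical branches.
head = FusionHead(dim_a=128, dim_b=128, mode="weighted")
logits = head(torch.randn(4, 128), torch.randn(4, 128))
```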
Authors
许裕雄
李斌
谭舜泉
黄继武
Xu Yuxiong; Li Bin; Tan Shunquan; Huang Jiwu (Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen 518060, China; Shenzhen Key Laboratory of Media Security, Shenzhen 518060, China; College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, China; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journals (北大核心)
2024, No. 8, pp. 2236-2268 (33 pages)
Journal of Image and Graphics
Funding
National Natural Science Foundation of China (U23B2022, U22B2047, 62272314)
Guangdong Basic and Applied Basic Research Foundation (2019B151502001)
Shenzhen Key Basic Research Project (JCYJ20200109105008228)
Amazon Web Services, 2022 Ministry of Education Employment Education Program (20221128).