Abstract
Speech deepfake technology uses deep learning methods to synthesize or generate speech. The rapid iteration and optimization of artificial intelligence generated content technologies have driven notable improvements in the naturalness, fidelity, and diversity of forged speech, while also posing great challenges to speech deepfake detection. This paper presents a comprehensive review of research progress on speech deepfake generation and its detection. First, it introduces forgery techniques represented by speech synthesis (SS) and voice conversion (VC). It then surveys the datasets and evaluation metrics commonly used in speech deepfake detection. On this basis, existing detection techniques are categorized and analyzed in depth along the processing pipeline: data augmentation, feature extraction and optimization, and learning mechanisms. Specifically, from the data augmentation perspective, the effects of noise addition, mask augmentation, channel augmentation, and compression augmentation on detection performance are analyzed; from the feature extraction and optimization perspective, the advantages and disadvantages of detection methods based on handcrafted features, hybrid features, end-to-end learning, and feature fusion are compared; and from the learning mechanism perspective, training strategies such as self-supervised learning, adversarial training, and multi-task learning are discussed. Finally, the remaining challenges of speech deepfake detection are summarized and future research directions are outlined. The datasets and code compiled in this paper are available at https://github.com/media-sec-lab/Audio-Deepfake-Detection.
Speech deepfake technology, which employs deep learning methods to synthesize or generate speech, has emerged as a critical research hotspot in multimedia information security. The rapid iteration and optimization of artificial intelligence-generated content technologies have significantly advanced speech deepfake techniques. These advancements have greatly enhanced the naturalness, fidelity, and diversity of synthesized speech. However, they have also presented great challenges for speech deepfake detection technology. To address these challenges, this study comprehensively reviews recent research progress on speech deepfake generation and its detection techniques. Based on an extensive literature survey, this study first introduces the research background of speech forgery and its detection and compares and analyzes previously published reviews in this field. Second, this study provides a concise overview of speech deepfake generation, especially speech synthesis (SS) and voice conversion (VC). SS, commonly known as text-to-speech (TTS), analyzes text and generates speech that aligns with the provided input by applying linguistic rules for text description. Various deep models are employed in TTS, including sequence-to-sequence models, flow models, generative adversarial network models, variational auto-encoder models, and diffusion models. VC involves modifying acoustic features, such as emotion, accent, pronunciation, and speaker identity, to produce speech resembling natural human speech. Depending on the number of target speakers, VC algorithms can be categorized into single-target, multi-target, and arbitrary-target speech conversion. Third, this study briefly introduces commonly used datasets in speech deepfake detection and provides access links to open-source datasets, along with two commonly used evaluation metrics: the equal error rate and the tandem detection cost function. This study then analyzes and categorizes existing speech deepfake detection techniques in detail. The pros and cons of different detection techniques are studied and compared in depth, focusing primarily on data processing, feature extraction and optimization, and learning mechanisms. Notably, this study summarizes the experimental results of existing detection techniques on the ASVspoof 2019 and 2021 datasets in tabular form. Within this context, the primary focus of this study is to investigate the generality of current detection techniques in the field of speech deepfake detection rather than specific forgery attack methods. Data augmentation applies a series of transformations to the original speech data, including noise addition, mask enhancement, channel enhancement, and compression enhancement, each aiming to simulate complex real-world acoustic environments more effectively. Among them, one of the most common data processing methods is noise addition, which perturbs the speech signal with added noise to simulate the complex acoustic environment of a real scenario as closely as possible. Mask enhancement applies masking operations in the time or frequency domain of speech to suppress noise and enhance the speech signal, thereby improving the accuracy and robustness of detection techniques. Transmission channel enhancement focuses on the problems of signal attenuation, data loss, and noise interference caused by changes in the codec and transmission channel of speech data. Compression enhancement techniques address the degradation of speech quality during data compression; the main compression formats are MP3, M4A, and OGG. From the perspective of feature extraction and optimization, speech deepfake detection methods can be divided into handcrafted feature-based, hybrid feature-based, end-to-end, and feature fusion-based methods. Handcrafted features refer to speech features extracted with the help of certain prior knowledge, mainly including the constant-Q transform, linear frequency cepstral coefficients, and the Mel spectrogram. By contrast, hybrid feature-based forgery detection methods utilize the domain knowledge provided by handcrafted features to mine richer speech representations through deep learning networks. End-to-end forgery detection methods directly learn feature representations and classification models from raw speech signals, which eliminates the need for handcrafted feature extraction and allows the model to discover discriminative features from the input data automatically. Moreover, these detection techniques can be trained using a single feature. Alternatively, feature-level fusion can be employed to combine multiple features, whether identical or different, using techniques such as weighted aggregation and feature concatenation. By fusing these features, detection techniques can capture richer speech information and thus improve performance. Regarding learning mechanisms, this study explores the impact of different training methods on forgery detection techniques, especially self-supervised learning, adversarial training, and multi-task learning. Self-supervised learning plays an important role in forgery detection by automatically generating auxiliary targets or labels from speech data to train models; fine-tuning a self-supervised pretrained model can effectively distinguish between real and forged speech. Adversarial training-based forgery detection enhances the robustness and generalization of the model by adding adversarial samples to the training data. In contrast to plain binary classification, forgery detection based on multi-task learning captures more comprehensive and useful speech feature information from different speech-related tasks by sharing the underlying feature representations, which improves detection performance while effectively utilizing speech training data. Although speech deepfake detection techniques have achieved excellent performance on some datasets, their performance is less satisfactory when tested on speech data from natural scenarios. Analysis of existing research shows that the main future research directions are to establish diversified speech deepfake datasets, study adversarial samples and data enhancement methods for improving the robustness of detection techniques, develop generalized speech deepfake detection techniques, and explore interpretable speech deepfake detection techniques. The relevant datasets and code mentioned can be accessed at https://github.com/media-sec-lab/Audio-Deepfake-Detection.
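The abstract names the equal error rate (EER) as one of the two standard evaluation metrics for speech deepfake detection. The following is a minimal illustrative sketch, not taken from the paper, of how an EER can be computed from detector scores; it assumes higher scores indicate bona fide speech, and all function and variable names are hypothetical.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: the operating point where the false acceptance rate
    (spoofed speech accepted) equals the false rejection rate (bona fide
    speech rejected). Assumes higher score = more likely bona fide."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])    # false acceptance rate
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # false rejection rate
    idx = np.argmin(np.abs(far - frr))                                  # point where the two rates cross
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Illustrative usage with synthetic detector scores.
rng = np.random.default_rng(0)
eer, threshold = compute_eer(rng.normal(1.0, 1.0, 1000),    # scores for real speech
                             rng.normal(-1.0, 1.0, 1000))   # scores for spoofed speech
print(f"EER = {eer:.3f} at threshold {threshold:.3f}")
```

The tandem detection cost function (t-DCF) additionally weights these error rates by the costs and priors of a downstream speaker verification system, so it is normally computed with the official ASVspoof evaluation scripts rather than re-implemented from scratch.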
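The data augmentation operations surveyed above, noise addition and time/frequency masking, can be illustrated with a short sketch. This is an assumed, generic implementation rather than any specific method from the reviewed literature; parameter values and function names are arbitrary choices.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix background noise into a waveform at a target signal-to-noise ratio (dB)."""
    if len(noise) < len(speech):  # tile the noise if it is shorter than the speech
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def mask_augment(spec, num_masks=2, max_freq_width=8, max_time_width=20, seed=None):
    """SpecAugment-style masking: zero out random frequency bands and
    time spans of a (freq, time) spectrogram."""
    rng = np.random.default_rng(seed)
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(num_masks):
        f = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - f)))
        spec[f0:f0 + f, :] = 0.0          # frequency mask
        t = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - t)))
        spec[:, t0:t0 + t] = 0.0          # time mask
    return spec
```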
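The handcrafted front ends mentioned above (constant-Q transform, linear frequency cepstral coefficients, and the Mel spectrogram) are standard signal-processing features. A rough sketch of how a log-Mel spectrogram and LFCCs might be computed is given below; it relies on librosa and SciPy, and the frame parameters (n_fft, hop_length, filter counts) are illustrative assumptions, not values prescribed by the surveyed methods.

```python
import numpy as np
import librosa
from scipy.fft import dct

def log_mel_spectrogram(y, sr, n_mels=80):
    """Log-Mel spectrogram front end (librosa.cqt similarly yields the constant-Q transform)."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=n_mels)
    return librosa.power_to_db(mel)

def lfcc(y, sr, n_coeff=20, n_filters=40):
    """Linear frequency cepstral coefficients: like MFCCs, but the triangular
    filter bank is spaced linearly rather than on the Mel scale."""
    power_spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=512)
    edges = np.linspace(0, sr / 2, n_filters + 2)        # linearly spaced band edges
    fbank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo + 1e-9)
        falling = (hi - freqs) / (hi - mid + 1e-9)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    log_energy = np.log(fbank @ power_spec + 1e-10)
    return dct(log_energy, axis=0, norm="ortho")[:n_coeff]  # keep the first n_coeff cepstra
```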
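Feature-level fusion by concatenation or weighted aggregation, as described in the abstract, can be sketched as a small PyTorch module. This is a hypothetical illustration: the branch dimensions, the learnable scalar weight, and the two-class head are assumptions, not details of any specific detector in the survey.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fuse two speech embeddings (e.g., a spectral branch and a raw-waveform
    branch) by concatenation or a learnable weighted sum, then classify."""
    def __init__(self, dim_a, dim_b, mode="concat"):
        super().__init__()
        self.mode = mode
        if mode == "concat":
            in_dim = dim_a + dim_b                        # feature concatenation
        else:
            assert dim_a == dim_b, "weighted fusion assumes equal dimensions"
            self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight
            in_dim = dim_a
        self.classifier = nn.Linear(in_dim, 2)            # bona fide vs. spoof logits

    def forward(self, feat_a, feat_b):
        if self.mode == "concat":
            fused = torch.cat([feat_a, feat_b], dim=-1)
        else:
            fused = self.alpha * feat_a + (1.0 - self.alpha) * feat_b
        return self.classifier(fused)

# Illustrative usage with dummy embeddings from two hypothetical branches.
head = FusionHead(dim_a=128, dim_b=128, mode="weighted")
logits = head(torch.randn(4, 128), torch.randn(4, 128))
```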
Authors
许裕雄
李斌
谭舜泉
黄继武
Xu Yuxiong; Li Bin; Tan Shunquan; Huang Jiwu (Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen 518060, China; Shenzhen Key Laboratory of Media Security, Shenzhen 518060, China; College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, China; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journals (北大核心)
2024, No. 8, pp. 2236-2268 (33 pages)
Journal of Image and Graphics
Funding
National Natural Science Foundation of China (U23B2022, U22B2047, 62272314)
Guangdong Basic and Applied Basic Research Foundation (2019B151502001)
Shenzhen Key Basic Research Project (JCYJ20200109105008228)
Amazon Web Services, 2022 Ministry of Education Employment Education Program (20221128).