Funding: The Major Key Project of PCL, Grant/Award Number: PCL2022A03; National Natural Science Foundation of China, Grant/Award Numbers: 61976064, 62372137; Zhejiang Provincial Natural Science Foundation of China, Grant/Award Number: LZ22F020007.
Abstract: Adversarial attacks have been posing significant security concerns to intelligent systems, such as speaker recognition systems (SRSs). Most attacks assume the neural networks in the systems are known beforehand, whereas black-box attacks are proposed to cover the practical situation in which such information is unavailable. Existing black-box attacks improve transferability by integrating multiple models or training on multiple datasets, but these methods are costly. Motivated by the optimisation strategy that exploits spatial information on the perturbed paths and samples, we propose a Dual Spatial Momentum Iterative Fast Gradient Sign Method (DS-MI-FGSM) to improve the transferability of black-box attacks against SRSs. Specifically, DS-MI-FGSM needs only a single data sample and one model as input; by extending to the neighbouring spaces of the data and the model, it generates adversarial examples comparable to those crafted against an ensemble of models. To reduce the risk of overfitting, DS-MI-FGSM also introduces gradient masking to improve transferability. The authors conduct extensive experiments on the speaker recognition task, and the results demonstrate the effectiveness of the method, which achieves up to a 92% attack success rate on the victim model in black-box scenarios with only one known model.
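The abstract does not give the DS-MI-FGSM update rule itself, so the following PyTorch sketch only illustrates the momentum-iterative FGSM family the method builds on, using gradient averaging over a sampled input neighbourhood plus a random gradient mask as stand-ins for the dual-spatial and masking ideas described above; every name and hyperparameter here (model, loss_fn, eps, mask_p, and so on) is an illustrative assumption rather than the paper's setting.

```python
import torch

def mi_fgsm_neighbourhood(model, x, y, loss_fn, eps=0.05, steps=10,
                          mu=1.0, n_neighbours=4, sigma=0.01, mask_p=0.1):
    """Momentum-iterative FGSM sketch with input-neighbourhood gradient
    averaging and random gradient masking (illustrative, not the paper's
    exact DS-MI-FGSM update)."""
    alpha = eps / steps                      # per-step perturbation budget
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)                  # accumulated momentum

    for _ in range(steps):
        grad_sum = torch.zeros_like(x)
        for _ in range(n_neighbours):
            # sample a point near the current adversarial example
            x_nb = (x_adv + sigma * torch.randn_like(x_adv)).detach().requires_grad_(True)
            loss = loss_fn(model(x_nb), y)
            grad_sum += torch.autograd.grad(loss, x_nb)[0]
        grad = grad_sum / n_neighbours

        # randomly mask part of the gradient to reduce overfitting to one model
        grad = grad * (torch.rand_like(grad) > mask_p).float()

        # momentum accumulation and signed update (MI-FGSM core)
        g = mu * g + grad / grad.abs().mean().clamp_min(1e-12)
        x_adv = x_adv + alpha * g.sign()
        # project back into the eps-ball around the original input
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).detach()
    return x_adv
```

Averaging gradients over several nearby inputs makes the signed update less specific to the one known white-box model, which is the intuition behind the transferability gain the abstract reports.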
Funding: The authors are grateful to the Taif University Researchers Supporting Project Number (TURSP-2020/36), Taif University, Taif, Saudi Arabia.
Abstract: Automatic Speaker Identification (ASI) involves distinguishing among the utterances of numerous speakers within an audio stream. Common factors, such as framework differences, the overlapping of different sound events, and the presence of multiple sound sources during recording, make the ASI task considerably more complicated. This research proposes a deep learning model to improve the accuracy of the ASI system and reduce model training time under limited computational resources, and investigates the performance of the transformer model. Seven audio features, chromagram, Mel-spectrogram, tonnetz, Mel-Frequency Cepstral Coefficients (MFCCs), delta MFCCs, delta-delta MFCCs, and spectral contrast, are extracted from the ELSDSR, CSTR VCTK, and Ar-DAD datasets. The evaluation of the experiments demonstrates that the best performance was achieved by the proposed transformer model using all seven audio features on all datasets. For ELSDSR, CSTR VCTK, and Ar-DAD, the highest attained accuracies are 0.99, 0.97, and 0.99, respectively. The experimental results show that the proposed technique achieves the best performance for ASI problems.
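The abstract lists the seven features but not the extraction parameters; a minimal librosa-based sketch of that front end might look like the following, where the sampling rate, MFCC order, and frame alignment are assumptions rather than the paper's settings.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=13):
    """Extract the seven features named in the abstract and stack them
    frame-wise (parameter choices here are assumptions, not the paper's)."""
    y, sr = librosa.load(path, sr=sr)
    chroma   = librosa.feature.chroma_stft(y=y, sr=sr)
    mel      = librosa.feature.melspectrogram(y=y, sr=sr)
    tonnetz  = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
    mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d_mfcc   = librosa.feature.delta(mfcc)
    dd_mfcc  = librosa.feature.delta(mfcc, order=2)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    # trim to a common frame count, log-compress the mel spectrogram, and stack
    n = min(f.shape[1] for f in (chroma, mel, tonnetz, mfcc, d_mfcc, dd_mfcc, contrast))
    feats = np.vstack([chroma[:, :n], librosa.power_to_db(mel[:, :n]), tonnetz[:, :n],
                       mfcc[:, :n], d_mfcc[:, :n], dd_mfcc[:, :n], contrast[:, :n]])
    return feats.T   # (frames, feature_dim), ready to feed a transformer encoder
```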
Abstract: Most current security and authentication systems are based on personal biometrics, and security is a major issue in this field because the original biometrics are stored in databases: if these databases are attacked, the biometrics are lost forever. Protecting privacy is the most important goal of cancelable biometrics. To protect privacy, cancelable biometric templates should therefore be non-invertible, such that no information can be recovered from the templates stored in personal identification/verification databases. One methodology for achieving non-invertibility is the use of non-invertible transforms. This work proposes an encryption process for cancelable speaker identification using a hybrid encryption system that combines the 3D Jigsaw transform and the Fractional Fourier Transform (FrFT). The proposed scheme is compared with the optical Double Random Phase Encoding (DRPE) encryption process. Simulation results show that the proposed algorithm is secure, reliable, and feasible, with good encryption and cancelability effects, and that it provides the security and robustness levels recommended for efficient cancelable biometric systems.
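The abstract names the 3D Jigsaw transform and the FrFT but gives no parameters, so the NumPy sketch below only illustrates the key-driven jigsaw idea on a 2D feature map such as a spectrogram; the block size and key handling are assumptions, and the FrFT stage of the hybrid system is omitted.

```python
import numpy as np

def jigsaw_scramble(feature_map, block=8, key=1234):
    """Key-driven jigsaw permutation of non-overlapping blocks.
    Illustrative sketch only; the paper combines this with an FrFT stage."""
    h, w = feature_map.shape
    h, w = h - h % block, w - w % block          # crop to a multiple of the block size
    fm = feature_map[:h, :w]
    # split into blocks and flatten the block grid
    blocks = (fm.reshape(h // block, block, w // block, block)
                .transpose(0, 2, 1, 3)
                .reshape(-1, block, block))
    perm = np.random.default_rng(key).permutation(len(blocks))   # key-dependent order
    scrambled = blocks[perm]
    # reassemble the permuted blocks into an array of the cropped shape
    out = (scrambled.reshape(h // block, w // block, block, block)
                    .transpose(0, 2, 1, 3)
                    .reshape(h, w))
    return out
```

Because the permutation is driven by a key, a compromised template can be revoked simply by issuing a new key and re-enrolling, which is the cancelability property the abstract targets.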
Abstract: The use of voice for biometric authentication is an important technological development because it is a non-invasive identification method and does not require special hardware, so it is less likely to provoke user resistance. This study applies voice recognition technology to a speech-driven interactive voice response questionnaire system, aiming to upgrade the traditional speech system to an intelligent voice response questionnaire network so that the new device may offer enterprises more precise data for customer relationship management (CRM). The intelligent voice response device is becoming a new mobile channel, with questionnaire functions built in to collect information on local preferences that can be used for localized promotion and publicity. The authors propose a framework using voice recognition and intelligent analysis models to identify target customers through voice messages gathered in the voice response questionnaire system, that is, transforming the traditional speech system into an intelligent voice complex. The speaker recognition system discussed here employs volume as the acoustic feature in endpoint detection because the computational load of this method is usually low. To correct two types of errors found in endpoint detection due to ambient noise, this study suggests two improvements. First, to reach high accuracy, the study adopts a dynamic time warping (DTW) based method for speaker identification. Second, it avoids errors in endpoint detection by filtering noise from the voice signals before recognition and by deleting test utterances that might negatively affect the recognition results, thereby improving the recognition rate. According to the experimental results, the proposed method has a high recognition rate on both personal-level and industrial-level computers and meets the standard for practical application; the voice management system in this research can therefore serve as a virtual customer service agent.
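The abstract outlines the pipeline but not its thresholds or frame-level features; a minimal NumPy sketch of the two stages, volume-based endpoint detection followed by DTW matching against per-speaker reference templates, could look like this, where the frame sizes, the threshold ratio, and the Euclidean frame cost are assumptions.

```python
import numpy as np

def endpoint_detect(signal, sr, frame_ms=25, hop_ms=10, ratio=0.1):
    """Volume-based endpoint detection sketch: keep the span of frames whose
    RMS volume exceeds a fraction of the peak volume (threshold is an assumption)."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = max(1 + (len(signal) - frame) // hop, 0)
    vol = np.array([np.sqrt(np.mean(signal[i * hop:i * hop + frame] ** 2))
                    for i in range(n_frames)])
    if vol.size == 0:
        return signal
    active = np.where(vol > ratio * vol.max())[0]
    if active.size == 0:
        return signal                              # nothing above threshold
    start, end = active[0] * hop, active[-1] * hop + frame
    return signal[start:end]

def dtw_distance(a, b):
    """Plain DTW between two feature sequences (frames x dims); the speaker whose
    reference template yields the smallest distance is the identification result."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```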
Abstract: Previous studies have investigated the efficiency of teaching listener and speaker repertoires in children diagnosed with autism spectrum disorder (ASD). Some investigations focused on listener responding by function, feature, and class (LRFFC) and intraverbal responding by function, feature, and class (FFC). For some children, teaching intraverbal FFC was more efficient because it resulted in a better emergence effect of a related untaught repertoire (LRFFC). For other children, teaching LRFFC along with tacting pictures was more efficient, resulting in a better emergence effect of a related untaught repertoire (intraverbal FFC). In those cases, it is not clear whether the tact increased the efficiency of LRFFC training, because no comparison with a condition in which tacts were not required was conducted. This investigation was a replication with two children diagnosed with ASD. Three instructional sequences were compared: teaching LRFFC and probing intraverbals; teaching LRFFC plus tacts and probing intraverbals; and teaching intraverbals and probing LRFFC. For one child, all sequences were equally efficient because all related untaught repertoires emerged without errors; however, the acquisition of intraverbals during training occurred with variability. For the second child, the most efficient sequence consisted of teaching intraverbals, which resulted in the emergence of LRFFC without errors. In both LRFFC-teaching conditions, the emergence of related intraverbals was partial and acquisition of the trained repertoires occurred with variability, with the condition that did not require tact responses being slightly more efficient. The data are discussed with regard to the best instructional sequence possibly varying from learner to learner.
Funding: The National Natural Science Foundation of China (No. 60872073, 60975017, 51075068); the Natural Science Foundation of Guangdong Province (No. 10252800001000001); the Natural Science Foundation of Jiangsu Province (No. BK2010546).
Abstract: A novel emotional speaker recognition system (ESRS) is proposed to compensate for emotion variability. First, emotion recognition is adopted as a pre-processing step to classify neutral and emotional speech. Then, the recognized emotional speech is adjusted by prosody modification: different methods, including Gaussian normalization, the Gaussian mixture model (GMM), and support vector regression (SVR), are adopted to define the mapping rules of F0 between emotional and neutral speech, and the average linear ratio is used for duration modification. Finally, the modified emotional speech is employed for speaker recognition. The experimental results show that the proposed ESRS significantly improves the performance of emotional speaker recognition, and its identification rate (IR) is higher than that of the traditional recognition system. The emotional speech with F0 and duration modifications is closer to the neutral one.
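The abstract names Gaussian normalization as one of the F0 mapping rules and an average linear ratio for duration, without giving formulas; a small sketch of those two pieces under their usual definitions could look like the following, where the per-speaker neutral F0 statistics are assumed to be available from enrollment data.

```python
import numpy as np

def gaussian_normalize_f0(f0_emotional, neutral_mean, neutral_std):
    """Map an emotional F0 contour toward neutral-speech statistics
    (one of the three mapping rules named in the abstract)."""
    voiced = f0_emotional > 0                      # treat zero-F0 frames as unvoiced
    f0_out = np.zeros_like(f0_emotional, dtype=float)
    emo_mean = f0_emotional[voiced].mean()
    emo_std = max(f0_emotional[voiced].std(), 1e-6)
    # z-normalise under the emotional statistics, re-scale under the neutral ones
    f0_out[voiced] = (f0_emotional[voiced] - emo_mean) / emo_std * neutral_std + neutral_mean
    return f0_out

def average_linear_ratio(neutral_durations, emotional_durations):
    """Average linear ratio used for the duration modification in the abstract."""
    return np.mean(np.asarray(neutral_durations) / np.asarray(emotional_durations))
```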