The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed in the last two decades for attacking this problem. We focus our discussion on the speech separation problem, given its central role in the cocktail party environment, and describe the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF), and generative models; the conventional multi-channel techniques such as beamforming and multi-channel blind source separation; and the newly developed deep learning-based techniques such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting information in the microphone array, the acoustic training set, and the language itself using a more powerful model, together with better optimization objectives and techniques, will be the approach to solving the cocktail party problem.
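The idea behind permutation invariant training (PIT) named in the abstract above can be illustrated with a small sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the function name `pit_mse`, the use of mean squared error as the per-assignment loss, and the exhaustive search over speaker orderings (practical only for a small number of speakers).

```python
from itertools import permutations

import numpy as np


def pit_mse(estimates, references):
    """Return the lowest MSE over all speaker orderings, and that ordering.

    Both inputs have shape (num_speakers, num_samples). The name and the
    choice of MSE are illustrative assumptions, not taken from the paper.
    """
    estimates = np.asarray(estimates, dtype=float)
    references = np.asarray(references, dtype=float)
    num_speakers = references.shape[0]

    best_loss, best_perm = np.inf, None
    for perm in permutations(range(num_speakers)):
        # Reorder the estimated streams and compare them with the references.
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy check: the estimates are the references in swapped speaker order.
references = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
estimates = references[::-1]
loss, perm = pit_mse(estimates, references)
```

Because the loss is taken over the best assignment, a separation model is not penalized for emitting the speakers in a different order from the labels, which is the label ambiguity that PIT is meant to remove.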
Audio signal separation is an open and challenging issue in the classical "Cocktail Party Problem". In a reverberant environment in particular, mixed signals are more difficult to separate because of the influence of reverberation and echo. To solve this problem, we propose a determined reverberant blind source separation algorithm. The main innovation of the algorithm lies in the estimation of the mixing matrix. A new cost function, which measures the gap between the prediction and the actual data, is built to obtain an accurate demixing matrix. The update rule of the demixing matrix is then derived using the Newton gradient descent method. The identity matrix is employed as the initial demixing matrix to avoid the local optima problem. Frequency-domain sources are obtained through real-time iterative updates of the demixing matrix, and time-domain sources are then recovered using an inverse short-time Fourier transform. Experimental results on a series of separation tasks with mixed speech and music signals demonstrate that the proposed algorithm achieves better separation performance than state-of-the-art methods. Its advantage is most pronounced in highly reverberant environments.
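The general pattern described in this abstract, starting from an identity demixing matrix and refining it iteratively, can be sketched as follows. This is not the paper's algorithm: its cost function and Newton-type update are not given in the abstract, so the sketch swaps in a standard natural-gradient Infomax ICA update with a `tanh` score function, and it uses a real-valued instantaneous mixture rather than a reverberant mixture in the frequency domain. The function name, step size, iteration count, and toy mixing matrix are all assumptions.

```python
import numpy as np


def natural_gradient_ica(mixed, n_iter=1000, lr=0.1):
    """Estimate a demixing matrix for an instantaneous, determined mixture.

    Stand-in technique (natural-gradient Infomax ICA), NOT the cost
    function or Newton update of the paper summarized above.
    mixed has shape (num_channels, num_samples).
    """
    num_channels, num_samples = mixed.shape
    demixing = np.eye(num_channels)  # identity initialization, as in the abstract
    for _ in range(n_iter):
        separated = demixing @ mixed
        # tanh is a common score function for super-Gaussian sources.
        score_corr = np.tanh(separated) @ separated.T / num_samples
        demixing += lr * (np.eye(num_channels) - score_corr) @ demixing
    return demixing


# Toy data (assumed): two Laplacian sources and a fixed 2x2 mixing matrix.
rng = np.random.default_rng(0)
sources = rng.laplace(size=(2, 5000))
mixing = np.array([[1.0, 0.6], [0.4, 1.0]])
mixed = mixing @ sources

demixing = natural_gradient_ica(mixed)
# If separation worked, demixing @ mixing is close to a scaled permutation.
global_response = demixing @ mixing
```

In a frequency-domain method such as the one the abstract describes, an update of this kind would be applied to complex-valued short-time Fourier transform frames, and the separated spectra would be converted back to the time domain with an inverse short-time Fourier transform.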
Speech perception is essential for daily communication. Background noise or concurrent talkers, however, can make it challenging for listeners to track the target speech (i.e., the cocktail party problem). The present study reviews and compares existing findings on speech perception and unmasking in cocktail party listening environments in English and Mandarin Chinese. The review starts with an introduction, followed by related concepts of auditory masking. The next two sections review factors that release speech perception from masking in English and Mandarin Chinese, respectively. The last section presents an overall summary of the findings, with comparisons between the two languages. Future research directions arising from the differences between the literature on the two languages are also discussed.
Funding (ASR cocktail party overview): Supported by the Tencent and Shanghai Jiao Tong University Joint Project.
Funding (reverberant blind source separation study): This research was partially supported by the National Natural Science Foundation of China under Grant 52105268; the Natural Science Foundation of Guangdong Province under Grant 2022A1515011409; the Key Platforms and Major Scientific Research Projects of Universities in Guangdong under Grants 2019KTSCX161 and 2019KTSCX165; the Key Projects of Natural Science Research Projects of Shaoguan University under Grants SZ2020KJ02 and SZ2021KJ04; and the Science and Technology Program of Shaoguan City of China under Grants 2019sn056, 200811094530423, 200811094530805, and 200811094530811.