Video data are composed of multimodal information streams, including visual, auditory, and textual streams. This paper therefore describes an approach to story segmentation for news video based on multimodal analysis. The proposed approach detects topic-caption frames and integrates them with silence-clip detection results and shot segmentation results to locate news story boundaries. Integrating audio-visual features with text information overcomes the weakness of approaches that rely on image analysis alone. On test data of 135 400 frames, news story boundary detection achieves an accuracy rate of 85.8% and a recall rate of 97.5%. The experimental results show that the approach is valid and robust.
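The abstract above does not state how the three cues are combined, so the following is only a minimal sketch under an assumed rule: a shot cut is kept as a story boundary when it falls inside a silence clip and a topic-caption frame appears shortly afterwards. The function name, the `window` parameter, and the frame numbers are all hypothetical.

```python
def locate_story_boundaries(shot_cuts, silence_spans, caption_frames, window=50):
    """Return shot cuts that look like news story boundaries.

    Assumed rule (not taken from the paper): a cut qualifies when it lies
    inside a silence span AND a topic-caption frame starts within `window`
    frames after the cut.
    """
    boundaries = []
    for cut in shot_cuts:
        # Cue 1: the cut coincides with a detected silence clip.
        in_silence = any(start <= cut <= end for start, end in silence_spans)
        # Cue 2: a topic caption appears shortly after the cut.
        has_caption = any(0 <= frame - cut <= window for frame in caption_frames)
        if in_silence and has_caption:
            boundaries.append(cut)
    return boundaries


# Hypothetical frame indices, for illustration only.
cuts = [100, 400, 900]
silences = [(95, 110), (890, 905)]
captions = [120, 420, 930]
story_boundaries = locate_story_boundaries(cuts, silences, captions)
```

Requiring both cues favours precision; accepting a cut when either cue fires would favour recall instead.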
This paper presents a thorough review of audio-visual translation studies in China and abroad. Reviewing foreign achievements in this field of translation studies can shed light on audio-visual translation practice and research in China. The review of Chinese scholars' audio-visual translation studies aims to indicate potential directions and guidelines for future work, as well as aspects that have been neglected. Based on a summary of the relevant studies, possible topics for further study are proposed.
Emotion recognition has become an important task in modern human-computer interaction. This paper presents a multilayer boosted HMM (MBHMM) classifier for automatic audio-visual emotion recognition. A modified Baum-Welch algorithm is proposed for component HMM learning, and adaptive boosting (AdaBoost) is used to train ensemble classifiers for the different layers (cues). Except for the first layer, the initial weights of the training samples in the current layer are determined by the recognition results of the ensemble classifier in the upper layer. The training procedure for the current cue can therefore focus more on the samples that were difficult for the previous cue. The MBHMM classifier combines these ensemble classifiers and takes advantage of the complementary information from multiple cues and modalities. Experimental results on audio-visual emotion data collected in Wizard of Oz scenarios and labeled under two types of emotion category sets demonstrate that the approach is effective and promising.
On February 10, 2019 (US Central Time), the China National Peking Opera Company (CNPOC) and the Hubei Chime Bells National Chinese Orchestra presented a fantastic audio-visual performance of Chinese Peking Opera and Chinese chime bells for an American audience at the world's top-level Buntrock Hall at Symphony Center.
Mongolian audio-visual works are an important vehicle for exploring the true significance of this national culture. This paper argues that, through the influence of film, television, and music, the Mongolian people of Inner Mongolia continually strengthen their individual sense of identification with the ethnic group as a whole, and on this basis continually develop a new culture suited to modern and contemporary life, further strengthening their sense of belonging to the ethnic nation.
Based on the current situation of college audio-visual English teaching in China, this article points out that avoidance in class is a serious phenomenon in college audio-visual English teaching. After further analysis, and taking into account the characteristics of college audio-visual English teaching in China, it proposes applying the task-based teaching method to college audio-visual English teaching and sets out the steps involved, in an attempt to alleviate students' avoidance through task-based teaching.
The object-based scalable coding in MPEG-4 is investigated, and a prioritized transmission scheme for MPEG-4 audio-visual objects (AVOs) over a DiffServ network with QoS guarantees is proposed. MPEG-4 AVOs are extracted and classified into different groups according to their priority values and scalable layers (visual importance). These priority values are mapped to the IP DiffServ per-hop behaviors (PHBs). The scheme can selectively discard packets of low importance in order to avoid network congestion. Simulation results show that the quality of the received video adapts gracefully to the network state, compared with the 'best-effort' manner. In addition, by allowing the content provider to define the priority of each audio-visual object, the adaptive transmission of object-based scalable video can be customized to the content.
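As a rough illustration of the priority-to-PHB mapping described above, the sketch below assigns each packet a DiffServ class from its object priority and scalable layer, then drops the least important packets when capacity is short. The DSCP values are the standard code points for EF, AF41 to AF43, and best effort; the mapping rule, the `AvoPacket` type, and the `admit` function are assumptions made for illustration and are not the scheme from the paper.

```python
from dataclasses import dataclass

# Standard DSCP code points: EF = 46, AF4y = 32 + 2*y, best effort = 0.
DSCP = {"EF": 46, "AF41": 34, "AF42": 36, "AF43": 38, "BE": 0}


@dataclass
class AvoPacket:
    object_id: str
    priority: int  # hypothetical scale: 0 = most important object
    layer: int     # hypothetical scale: 0 = base layer, higher = enhancement


def select_phb(pkt: AvoPacket) -> str:
    """Map (priority, layer) to a PHB name. The rule itself is an assumption."""
    if pkt.priority == 0 and pkt.layer == 0:
        return "EF"  # base layer of the most important object
    if pkt.priority <= 1:
        # Higher layers get a higher drop precedence within AF class 4.
        return ("AF41", "AF42", "AF43")[min(pkt.layer, 2)]
    return "BE"


def admit(packets, capacity):
    """Keep the `capacity` most important packets, preserving arrival order."""
    ranked = sorted(range(len(packets)),
                    key=lambda i: (packets[i].priority, packets[i].layer))
    keep = set(ranked[:capacity])
    return [p for i, p in enumerate(packets) if i in keep]


packets = [
    AvoPacket("background", 2, 1),
    AvoPacket("speaker", 0, 0),
    AvoPacket("speaker", 0, 1),
    AvoPacket("logo", 1, 0),
]
phbs = [select_phb(p) for p in packets]
survivors = admit(packets, capacity=2)
```

In a real DiffServ network the discard decision is made by routers according to each packet's drop precedence; `admit` only mimics that effect at a single point.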
Existing pre-trained models like Distil HuBERT excel at uncovering hidden patterns and facilitating accurate recognition across diverse data types, such as audio and visual information. We harnessed this capability to develop a deep learning model that uses Distil HuBERT to jointly learn these combined features for speech emotion recognition (SER). Our experiments highlight its distinct advantages: it significantly outperforms Wav2vec 2.0 in both offline and real-time accuracy on the RAVDESS and BAVED datasets. Although it slightly trails HuBERT's offline accuracy, Distil HuBERT delivers comparable performance at a fraction of the model size, making it well suited to resource-constrained environments such as mobile devices. The smaller size comes with a slight trade-off: in offline evaluation, Distil HuBERT achieved 96.33% accuracy on the BAVED database and 87.01% on the RAVDESS database, whereas in real-time evaluation the accuracy decreased to 79.3% on BAVED and 77.87% on RAVDESS. This decrease is likely a result of the challenges associated with real-time processing, including latency and noise, but the results still demonstrate strong performance in practical scenarios. Distil HuBERT therefore emerges as a compelling choice for SER, especially when prioritizing accuracy over real-time processing. Its compact size further enhances its potential for resource-limited settings, making it a versatile tool for a wide range of applications.
In recent years, computing art has developed rapidly through in-depth cross-disciplinary study of artificial intelligence generated content (AIGC) and the main features of artworks. Audio-visual content generation has gradually been applied to various practical tasks, including video and game scoring, assisting artists in creation, and art education, which demonstrates broad application prospects. In this paper, we introduce innovative achievements in audio-visual content generation from the perspectives of visual art generation and auditory art generation based on artificial intelligence (AI). We outline the development trends of image and music datasets, visual and auditory content modelling, and related automatic generation systems. The objective and subjective evaluation of generated samples plays an important role in measuring algorithm performance. We provide a cogeneration mechanism for audio-visual content in multimodal tasks from image to music and show the construction of specific stylized datasets. Many new opportunities and challenges remain in the field of audio-visual synesthesia generation, and we provide a comprehensive discussion of them.
In response to the evolving challenges posed by small unmanned aerial vehicles (UAVs), which have the potential to transport harmful payloads or cause significant damage, we present AV-FDTI, an innovative Audio-Visual Fusion system designed for Drone Threat Identification. AV-FDTI fuses audio and omnidirectional camera feature inputs, providing a comprehensive solution that enhances the precision and resilience of drone classification and 3D localization. Specifically, AV-FDTI employs a CRNN network to capture vital temporal dynamics within the audio domain and uses a pretrained ResNet50 model for image feature extraction. Furthermore, we adopt a mechanism based on visual information entropy and cross-attention to enhance the fusion of visual and audio data. Notably, our system is trained on automated Leica tracking annotations, which provide ground truth data with millimeter-level accuracy. Comprehensive comparative evaluations demonstrate the superiority of our solution over existing systems. In our commitment to advancing this field, we will release this work as open-source code together with a wearable AV-FDTI design, contributing valuable resources to the research community.
Language is considered a tool of communication in the world. Spoken English is very important in English learning and teaching. As English teachers, we should speak English more and foster students' speaking ability. With more practice, students can speak fluent English and express themselves freely.
In a multimedia environment, many teachers try new means and methods of teaching listening. Among these, English movies, with their considerable advantages, are becoming an increasingly popular listening material that is readily accepted by students in college English classes.
Aim: The aim of this study was to explore patients' preferences for forms of patient education material, including leaflets, podcasts, and videos; that is, to determine which forms of information, besides that provided verbally by healthcare personnel, patients prefer following visits to hospital. Methods: The study was a mixed-methods study, using a survey design with primarily quantitative items but with a qualitative component. A survey was distributed to patients over 18 years of age between May and July 2020, and 480 patients chose to respond. Results: Text-based patient education material (leaflets) is the form that patients have the most experience with and was preferred by 86.46% of respondents; however, 50.21% and 31.67% of respondents would also like to receive patient education material in video and podcast formats, respectively. Furthermore, several respondents wrote about the need for different forms of patient education material, depending on the subject of the supplementary information. Conclusion: This study provides an overview of patient preferences regarding forms of patient education material. The results show that the majority of respondents prefer to use combinations of written, audio, and video material, thus applying and co-constructing a multimodal communication system, from which they select and apply different modes of communication from different sources simultaneously.
Background: Difficulty in hearing can occur for numerous reasons across a variety of ages in humans. To overcome this, humans can employ a number of techniques to help improve their understanding of sound in other ways. One is to use vision and attempt to lip-read in order to understand someone else in a face-to-face conversation. Audio-visual integration has a long history in perception (e.g., the McGurk Effect), and researchers have shown that older adults will look at the mouth region for additional information in noisy situations. However, this concept has not been explored in the context of social media. A common way to communicate virtually that simulates a live conversation is video chatting or conferencing. It is used for a variety of reasons, including work and maintaining social interactions, and has started to be used in clinical settings. However, video chat session quality is often sub-optimal, and may contain degraded audio and/or decoupled audio and video. The goal of this study is to determine whether humans use the same visual compensation mechanism, lip reading, in a digital setting as they would in a face-to-face conversation. Methods: The participants (n = 116, age 18 to 41) answered a demographics questionnaire including questions about their use of video chatting software. Then, the participants viewed two videos of a video call: one with synchronized audio and video, and the other dyssynchronous (1 second delay). The order of the videos was randomized across participants. Binocular eye movements were monitored at 60 Hz using a Mirametrix S2 eye tracker connected to Ogama 5.0 (http://www.ogama.net/). After each video, the participants answered questions about the call quality and the content of the video. Results: There was no significant difference in the total dwell time at the eyes and the mouth of the speaker, t(116) = −1.574, P = 0.059, d = −0.147, BF10 = 0.643. However, using the heat maps generated by Ogama, we observed that when viewing the poor-quality video, the participants looked more towards the mouth than the eyes of the speaker. As call quality decreased, the number of fixations increased from n = 79.87 in the synchronous condition to n = 113.4 in the asynchronous condition, and the median duration of each fixation decreased from 218.3 ms in the synchronous condition to 205 ms in the asynchronous condition. Conclusions: The above results may indicate that humans employ similar compensation mechanisms in response to a decrease in auditory comprehension, given the tendency of participants to look more towards the mouth of the speaker. However, more study is needed because of the inconsistency in the results.
With the development of society and the economy, more and more capable people are badly needed in the world. Under the influence of the traditional English teaching mode, most English learners can only read and write; they are often called "deaf-mutes". The traditional English teaching mode therefore no longer satisfies teachers and faces great challenges. Applying multimedia to English teaching can create a more authentic language environment for learners, which enables them to communicate in English in real-life situations. At present, the multimedia approach is the most popular language teaching method in the world. The most effective way to develop teaching is to combine multimedia with traditional methods. This is of special significance to English teaching and helps it achieve the best effect.
Many adults, especially business people, need to learn English for their work. Yet many of them have problems with different language skills. For example, across the U.S.A., business English teachers encounter Chinese-speaking students who have problems writing proper English business messages (Beamer, 1994). Although many educators have been trying creative approaches to teaching children, adult classrooms remain relatively traditional. This paper aims to review some prospective problems and to share with practitioners some approaches to language instruction.
With the development of science and technology, especially digital technology, mankind has entered the age of multimedia, and the modes of human life and communication have undergone profound changes. Language as a single communicative mode has gradually been replaced by a complex communicative mode composed of language, image, and sound. Multimodal discourse analysis provides a new perspective on discourse composed of a variety of symbols, which can help readers understand how symbols such as images and music work together to form meaning. Films are often analyzed from the perspectives of psychology, aesthetics, and other macro aspects, but seldom from the perspective of linguistics. This paper analyzes how the theory of multimodal discourse analysis affects film translation by discussing the interaction between film translation and multimodal modes in the film Pride and Prejudice.
Audio-visual learning, which aims to exploit the relationship between the audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend to leverage these two modalities to improve performance on tasks previously treated as single-modality, or to address new and challenging problems. In this paper, we provide a comprehensive survey of recent developments in audio-visual learning. We divide current audio-visual learning tasks into four subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are further discussed. Finally, we summarize the commonly used datasets and challenges.
In order to detect cross-sectional age characteristics of the cognitive neural mechanisms of audio-visual modal interference inhibition, event-related potentials (ERPs) of fourteen 10-year-old children were recorded while they performed a word interference task. In incongruent conditions, the participants were required to inhibit audio interference words of the same category. The findings provide preliminary evidence of the brain mechanism underlying the development of inhibition in this specific stage of childhood.
In this paper we address the problem of audio-visual speech recognition in the framework of the multi-stream hidden Markov model. Stream weight training based on the minimum classification error criterion is discussed for use in large vocabulary continuous speech recognition (LVCSR). We present lattice rescoring and Viterbi approaches for calculating the loss function of continuous speech. The experimental results show that, in the case of clean audio, state-based stream weights trained by a Viterbi approach yield a 36.1% relative reduction in word error rate compared with an audio-only speech recognition system. Further experimental results demonstrate that our audio-visual LVCSR system provides a significant enhancement of robustness in noisy environments.
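For readers unfamiliar with stream weights, the sketch below shows the common two-stream form, in which a state's combined log-likelihood is a weighted sum of the audio and visual log-likelihoods, together with the usual definition of relative word error rate reduction. The constraint that the two weights sum to one is an assumption, the function names are hypothetical, and the numbers are made-up illustrative values rather than results from the paper. The minimum classification error training of the weights is not shown.

```python
def combined_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Weighted sum of per-stream log-likelihoods for one HMM state.

    Assumes the visual weight is 1 - audio_weight, a common convention
    that the abstract itself does not confirm.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_visual


def relative_wer_reduction(wer_baseline, wer_system):
    """Relative reduction: (baseline - system) / baseline."""
    if wer_baseline <= 0.0:
        raise ValueError("wer_baseline must be positive")
    return (wer_baseline - wer_system) / wer_baseline


# Illustrative values only.
score = combined_log_likelihood(-2.0, -4.0, audio_weight=0.75)
reduction = relative_wer_reduction(0.20, 0.15)
```

A state-based scheme, as mentioned in the abstract, would store one `audio_weight` per HMM state instead of a single global value.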
Funding: Supported by the National Natural Science Foundation of China (60905006) and the NSFC-Guangdong Joint Fund (U1035004).
Funding: This paper is an interim research result of the Basic Research Project of Beijing Institute of Graphic Communication: Research on the Artistic, Modern Communication and Publishing of Dian-shi Zhai Pictorial (1884-1898) (Serial Number Eb202008).
Funding: This work was supported by the National Natural Science Foundation of China (No. 62176006) and the National Key Research and Development Program of China (No. 2022YFF0902302).
Funding: National Research Foundation, Singapore, under its Medium-Sized Center for Advanced Robotics Technology Innovation (CARTIN), under project WP5 within the Delta-NTU Corporate Lab, with funding support from A*STAR under its IAF-ICP program (Grant no: I2201E0013) and Delta Electronics Inc.
文摘In response to the evolving challenges posed by small unmanned aerial vehicles(UAVs),which have the potential to transport harmful payloads or cause significant damage,we present AV-FDTI,an innovative Audio-Visual Fusion system designed for Drone Threat Identification.AV-FDTI leverages the fusion of audio and omnidirectional camera feature inputs,providing a comprehensive solution to enhance the precision and resilience of drone classification and 3D localization.Specifically,AV-FDTI employs a CRNN network to capture vital temporal dynamics within the audio domain and utilizes a pretrained ResNet50 model for image feature extraction.Furthermore,we adopt a visual information entropy and cross-attention-based mechanism to enhance the fusion of visual and audio data.Notably,our system is trained based on automated Leica tracking annotations,offering accurate ground truth data with millimeter-level accuracy.Comprehensive comparative evaluations demonstrate the superiority of our solution over the existing systems.In our commitment to advancing this field,we will release this work as open-source code and wearable AV-FDTI design,contributing valuable resources to the research community.
文摘Language is considered as a tool of communication in the world. Spoken English is very important in English learning and teaching. As an English teacher, we should speak English more and foster students' ability of speaking. By more practice, the students can speak fluent English and express themselves freely.
文摘In multimedia environment, many teachers try to use new means and methods to teach listening and of which English movies with great advantages become more and more popular listening material easily accepted by students in college English class.
文摘<strong>Aim:</strong> The aim of this study was to explore patients’ preferences for forms of patient education material, including leaflets, podcasts, and videos;that is, to determine what forms of information, besides that provided verbally by healthcare personnel, do patients prefer following visits to hospital? <strong>Methods: </strong>The study was a mixed-methods study, using a survey design with primarily quantitative items but with a qualitative component. A survey was distributed to patients over 18 years between May and July 2020 and 480 patients chose to respond.<strong> Results:</strong> Text-based patient education materials (leaflets), is the form that patients have the most experience with and was preferred by 86.46% of respondents;however, 50.21% and 31.67% of respondents would also like to receive patient education material in video and podcast formats, respectively. Furthermore, several respondents wrote about the need for different forms of patient education material, depending on the subject of the supplementary information. <strong>Conclusion: </strong>This study provides an overview of patient preferences regarding forms of patient education material. The results show that the majority of respondents prefer to use combinations of written, audio, and video material, thus applying and co-constructing a multimodal communication system, from which they select and apply different modes of communication from different sources simultaneously.
Abstract: Background: Difficulty in hearing can occur for numerous reasons across a variety of ages in humans. To overcome this, humans can employ a number of techniques to help improve their understanding of sound in other ways. One is to use vision and attempt to lip-read in order to understand someone else in a face-to-face conversation. Audio-visual integration has a long history in perception (e.g., the McGurk Effect), and researchers have shown that older adults will look at the mouth region for additional information in noisy situations. However, this concept has not been explored in the context of social media. A common way to communicate virtually that simulates a live conversation is video chatting or conferencing. It is used for a variety of reasons, including work and maintaining social interactions, and has started to be used in clinical settings. However, video chat session quality is often sub-optimal and may contain degraded audio and/or decoupled audio and video. The goal of this study is to determine whether humans use the same visual compensation mechanism, lip reading, in a digital setting as they would in a face-to-face conversation. Methods: The participants (n = 116, age 18 to 41) answered a demographics questionnaire including questions about their use of video chatting software. Then, the participants viewed two videos of a video call: one with synchronized audio and video, and the other asynchronous (1 second delay). The order of the videos was randomized across participants. Binocular eye movements were monitored at 60 Hz using a Mirametrix S2 eye tracker connected to Ogama 5.0 (http://www.ogama.net/). After each video, the participants answered questions about the call quality and the content of the video. Results: There was no significant difference in the total dwell time at the eyes and the mouth of the speaker, t(116) = −1.574, P = 0.059, d = −0.147, BF10 = 0.643. However, using the heat maps generated by Ogama, we observed that when viewing the poor-quality video, the participants looked more towards the mouth than the eyes of the speaker. As call quality decreased, the number of fixations increased from n = 79.87 in the synchronous condition to n = 113.4 in the asynchronous condition, and the median duration of each fixation decreased from 218.3 ms in the synchronous condition to 205 ms in the asynchronous condition. Conclusions: The above results may indicate that humans employ similar compensation mechanisms in response to a decrease in auditory comprehension, given the tendency of participants to look towards the mouth of the speaker more. However, more study is needed because of the inconsistency in the results.
Abstract: With the development of society and the economy, more and more capable people are badly needed in the world. Under the influence of the traditional English teaching mode, most English learners can only read and write. They are usually called "deaf-mutes". Therefore, the traditional English teaching mode no longer satisfies teachers and faces great challenges. Applying multimedia to English teaching can create a more authentic language environment for learners, which enables them to communicate in English in real-life situations. At present, the multimedia approach is the most popular language teaching method in the world. The most effective way to develop teaching is to combine multimedia with traditional methods. This is of special significance to English teaching and helps it achieve the best effect.
Abstract: Many adults, especially business people, need to learn English for their work. Yet many of them have problems with different language skills. For example, across the U.S.A., business English teachers encounter Chinese-speaking students who have problems writing proper English business messages (Beamer, 1994). Although many educators have been trying creative approaches to teaching children, adult classrooms are relatively more traditional. This paper aims to review some prospective problems and to share with practitioners some approaches to language instruction.
Abstract: With the development of science and technology, especially digital technology, mankind has entered the age of multimedia, and the modes of human life and communication have undergone profound changes. Language as a single communicative mode has gradually been replaced by a complex communicative mode composed of language, image and sound. Multimodal discourse analysis provides a new perspective for analyzing discourse composed of a variety of symbols, and can help readers understand how symbols such as images and music work together to form meaning. Film is often analyzed from the perspective of psychology, aesthetics and other macro aspects, but seldom from the perspective of linguistics. The paper analyzes how the theory of multimodal discourse analysis affects film translation by discussing the interaction between film translation and multimodal modes in the film Pride and Prejudice.
Funding: Supported by the National Key Research and Development Program of China (No. 2016YFB1001001), the Beijing Natural Science Foundation (No. JQ18017), and the National Natural Science Foundation of China (No. 61976002).
Abstract: Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend to leverage these two modalities to improve the performance of previously considered single-modality tasks or address new challenging problems. In this paper, we provide a comprehensive survey of recent audio-visual learning development. We divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are further discussed. Finally, we summarize the commonly used datasets and challenges.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 30807780 and 30700238).
Abstract: To detect cross-sectional age characteristics of the cognitive neural mechanisms underlying audio-visual modal interference inhibition, event-related potentials (ERP) of fourteen 10-year-old children were recorded while they performed a word interference task. In incongruent conditions, the participants were required to inhibit audio interference words of the same category. The present findings provide preliminary evidence of the brain mechanism for the development of inhibition in children at this specific childhood stage.
Funding: Supported by the National High-Tech Research and Development (863) Program of China (No. 863-306-ZD03-01-2).
Abstract: In this paper we address the problem of audio-visual speech recognition in the framework of the multi-stream hidden Markov model. Stream weight training based on the minimum classification error criterion is discussed for use in large vocabulary continuous speech recognition (LVCSR). We present the lattice rescoring and Viterbi approaches for calculating the loss function of continuous speech. The experimental results show that in the case of clean audio, the system performance can be improved by 36.1% in relative word error rate reduction when using state-based stream weights trained by a Viterbi approach, compared to an audio-only speech recognition system. Further experimental results demonstrate that our audio-visual LVCSR system provides significant enhancement of robustness in noisy environments.
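As a rough illustration of two quantities this abstract mentions, the sketch below shows how a two-stream HMM state score is commonly formed (a weighted sum of per-stream log-likelihoods) and how a relative word error rate reduction is computed. It is not the paper's implementation: the function names, the convention that the two stream weights sum to 1, and all numbers are assumptions made for illustration only.

```python
def combined_log_likelihood(log_b_audio, log_b_video, audio_weight):
    """Score one HMM state from two streams.

    In the usual multi-stream formulation the state emission likelihood is
    the product of per-stream likelihoods raised to their stream weights,
    which becomes a weighted sum in the log domain. Assumption: the video
    weight is 1 - audio_weight (a common convention, not taken from the paper).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    video_weight = 1.0 - audio_weight
    return audio_weight * log_b_audio + video_weight * log_b_video


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative word error rate reduction of a new system over a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only; none of these come from the paper.
score = combined_log_likelihood(-2.0, -4.0, audio_weight=0.75)
reduction = relative_wer_reduction(baseline_wer=20.0, new_wer=15.0)
```

With state-based stream weights, as described in the abstract, `audio_weight` would be looked up per HMM state rather than shared globally; how those weights are trained (minimum classification error) is outside this sketch.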