Funding: This research was funded by the Science and Technology Innovation Project for Laoshan Laboratory (No. LSKJ202204303), the National Natural Science Foundation of China (No. 42030406), the Fundamental Research Funds for the Central Universities (No. 202261006), and the ESA-NRSCC Scientific Cooperation Project on Earth Observation Science and Applications: Dragon 5 (No. 58393).
Abstract: In this paper, we propose a novel approach to visualizing global geographical information: a panoramic sphere in an immersive environment. Because the viewpoint of the panoramic sphere lies inside the sphere, the whole geographical surface can be observed simply by rotating the head. We compared three approaches to rendering geographical information in a virtual reality environment. On tasks involving terrestrial and marine geographical information, we compare the visualization effects of a) a globe, b) a flat map, and c) a panoramic sphere. The terrestrial tasks include area comparison and direction determination; the marine tasks involve visualizing sea surface temperature and sea surface currents. For the terrestrial tasks, the experimental results show that the panoramic sphere outperforms the globe and the flat map, with higher average accuracy and shorter completion time. On the marine tasks, the panoramic sphere is likewise superior to the flat map and the globe in an immersive environment, for both sea surface temperature data and sea surface current fields. In all three visualization experiments, the panoramic sphere was the most preferred by participants, particularly for global, transcontinental, and transoceanic needs.
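To make the viewing geometry concrete, here is a minimal sketch (ours, not the authors' implementation) of why a camera fixed at the sphere's center can reach the whole surface by head rotation alone: each head orientation maps to exactly one texel of an equirectangular world map rendered on the inside of the sphere. The function name and angle conventions are our assumptions.

```python
import numpy as np

def view_dir_to_equirect_uv(yaw, pitch):
    """Map a head orientation (radians) to texture coordinates on an
    equirectangular world map rendered on the inside of a sphere.
    With the camera at the sphere's center, every (yaw, pitch) pair
    addresses exactly one point on the globe, so the full surface is
    reachable by head rotation alone."""
    u = (yaw % (2 * np.pi)) / (2 * np.pi)   # longitude -> [0, 1)
    v = (pitch + np.pi / 2) / np.pi         # latitude  -> [0, 1]
    return u, v

# Example: looking 90 degrees to the right, level with the horizon
print(view_dir_to_equirect_uv(np.pi / 2, 0.0))  # (0.25, 0.5)
```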
Funding: Supported by the National Key R&D Program of China (No. 2020AAA0108904) and the Science and Technology Plan of Shenzhen (No. JCYJ20200109140410340).
Abstract: Audio-visual wake word spotting is a challenging multi-modal task that exploits visual information from lip motion patterns to supplement acoustic speech and improve overall detection performance. However, most audio-visual wake word spotting models are only suitable for simple single-speaker scenarios and have high computational complexity. Further development is hindered by complex multi-person scenarios and by computational limitations in mobile environments. In this paper, a novel audio-visual model is proposed for on-device multi-person wake word spotting. First, an attention-based audio-visual voice activity detection module is presented, which generates an attention score matrix over audio and visual representations to derive an active-speaker representation. Second, knowledge distillation is introduced to transfer knowledge from a large model to the on-device model and thereby control the model's size. Moreover, a new audio-visual dataset, PKU-KWS, is collected for sentence-level multi-person wake word spotting. Experimental results on the PKU-KWS dataset show that this approach outperforms previous state-of-the-art methods.
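As a rough illustration of the attention-based voice activity idea, the sketch below scores each candidate speaker's lip-feature track against the audio with scaled dot-product attention and pools the visual streams into one active-speaker representation. The tensor shapes, pooling scheme, and function names are our assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def active_speaker_representation(audio, visual):
    """audio: (T, d) acoustic frames; visual: (S, T, d) lip features for
    S candidate speakers. Score each speaker's lip track against the
    audio via scaled dot-product attention, then pool the visual streams
    into a single active-speaker representation of shape (T, d)."""
    d = audio.shape[-1]
    scores = np.einsum('td,std->st', audio, visual) / np.sqrt(d)  # (S, T)
    speaker_weights = softmax(scores.mean(axis=1))                # (S,)
    return np.einsum('s,std->td', speaker_weights, visual)        # (T, d)
```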
Funding: Taif University Researchers Supporting Project No. TURSP-2020/77, Taif University, Taif, Saudi Arabia.
Abstract: With the increasing need to transmit sensitive or secret data over public networks, security approaches based on cryptography and steganography have become an active research area in recent years. The two techniques can be merged to provide the stronger security that is now widely required. The proposed system offers a novel method of information security that combines audio steganography with visual cryptography. In this system, a secret image is divided into several incomprehensible sub-images using visual cryptography. Each sub-image is then hidden within an individual cover audio file using audio steganographic techniques. The cover audios are sent to the required destinations, where reverse steganography is applied to recover the incomprehensible component images. Finally, all the sub-images are superimposed to reconstruct the actual secret image. The method is very secure because it uses a two-step security mechanism: interception is unlikely to succeed, since an attacker must obtain every correct sub-image to regenerate the secret image, and no meaningful image can be formed without superimposing all of them. Audio files are composed of densely packed bits, and this high data density makes it hard for a listener to detect the manipulation introduced by the proposed time-domain audio steganographic method.
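A minimal sketch of the two stages, under simplifying assumptions: the share-splitting below uses an XOR construction (recombination by XOR, rather than the optical OR-stacking with pixel expansion of classical visual cryptography), and the embedding is plain LSB substitution in 16-bit PCM samples as one example of a time-domain method. All function names are ours.

```python
import numpy as np

def split_into_shares(secret_bits, n_shares=2):
    """XOR-style secret split: XOR of all shares reconstructs the
    secret; any smaller subset of shares is indistinguishable from
    random noise."""
    shares = [np.random.randint(0, 2, secret_bits.shape, dtype=np.uint8)
              for _ in range(n_shares - 1)]
    last = secret_bits.copy()
    for s in shares:
        last ^= s
    return shares + [last]

def embed_lsb(audio_samples, bits):
    """Time-domain LSB steganography: hide one bit in the least
    significant bit of each 16-bit PCM sample."""
    stego = audio_samples.copy()
    stego[:len(bits)] = (stego[:len(bits)] & ~1) | bits
    return stego

def extract_lsb(stego_samples, n_bits):
    return (stego_samples[:n_bits] & 1).astype(np.uint8)

if __name__ == "__main__":
    secret = np.random.randint(0, 2, 64, dtype=np.uint8)
    audio = np.random.randint(-2**15, 2**15, 1024, dtype=np.int16)
    s0, s1 = split_into_shares(secret)
    stegos = [embed_lsb(audio, s0), embed_lsb(audio, s1)]
    recovered = extract_lsb(stegos[0], 64) ^ extract_lsb(stegos[1], 64)
    assert np.array_equal(recovered, secret)  # round trip succeeds
```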
Abstract: Video data are composed of multimodal information streams, including visual, auditory, and textual streams, so this paper describes an approach to story segmentation for news video using multimodal analysis. The proposed approach detects topic-caption frames and integrates them with silence clip detection results and shot segmentation results to locate news story boundaries. The integration of audio-visual features and text information overcomes the weakness of approaches that use only image analysis techniques. On test data with 135,400 frames, the detected story boundaries achieve an accuracy of 85.8% and a recall of 97.5%. The experimental results show that the approach is valid and robust.
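The abstract does not spell out the fusion rule, so the following is only a plausible sketch of how the three cues might be combined: promote a shot cut to a story boundary when it falls inside a silence clip and near a topic-caption change. The threshold and the rule itself are our assumptions.

```python
def locate_story_boundaries(shot_cuts, silence_clips, caption_frames, tol=25):
    """Fuse the three cues: a shot cut becomes a story boundary when it
    lies inside a silence clip and within tol frames of a topic-caption
    change (all positions in frames; tol ~ 1 s at 25 fps)."""
    boundaries = []
    for cut in shot_cuts:
        in_silence = any(start <= cut <= end for start, end in silence_clips)
        near_caption = any(abs(cut - c) <= tol for c in caption_frames)
        if in_silence and near_caption:
            boundaries.append(cut)
    return boundaries
```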
Funding: Supported by the National Natural Science Foundation of China (No. 60905006) and the NSFC-Guangdong Joint Fund (No. U1035004).
Abstract: Emotion recognition has become an important task in modern human-computer interaction. A multilayer boosted HMM (MBHMM) classifier for automatic audio-visual emotion recognition is presented in this paper. A modified Baum-Welch algorithm is proposed for component HMM learning, and adaptive boosting (AdaBoost) is used to train ensemble classifiers for the different layers (cues). Except for the first layer, the initial weights of the training samples in the current layer are determined by the recognition results of the ensemble classifier in the layer above, so that training on the current cue can focus on the samples that were difficult for the previous cue, as sketched below. The MBHMM classifier combines these ensemble classifiers and takes advantage of the complementary information from multiple cues and modalities. Experimental results on audio-visual emotion data collected in Wizard-of-Oz scenarios and labeled under two types of emotion category sets demonstrate that the approach is effective and promising.
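One plausible reading of the cross-layer weighting, shown as a sketch (the paper's exact update may differ): the AdaBoost round for the next cue starts from sample weights derived from the previous layer's ensemble errors, so samples the previous cue got wrong are up-weighted.

```python
import numpy as np

def next_layer_weights(prev_predictions, labels):
    """Initialize sample weights for the next cue's boosting round from
    the previous layer's ensemble results: misclassified samples are
    up-weighted (standard AdaBoost reweighting) so the new cue focuses
    on the hard cases."""
    n = len(labels)
    w = np.ones(n) / n
    wrong = prev_predictions != labels
    err = max(w[wrong].sum(), 1e-12)              # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)          # ensemble confidence
    w *= np.exp(np.where(wrong, alpha, -alpha))    # boost the mistakes
    return w / w.sum()
```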
Abstract: Experimental single-case studies on the automatic processing of emotion were carried out on a sample of people with an anxiety disorder. Participants took three Audio Visual Entrainment (AVE) sessions to test for the anxiety reduction claimed by some academic research. Explicit reports were measured, as well as pre-attentive bias toward stressful information, using affective priming studies before and after the AVE intervention. Group analysis shows that AVE program applications do reduce anxiety, producing significant changes in explicit reports of anxiety levels and in the automatic processing bias of emotion. However, case-by-case analysis of six anxious participants shows that even though all participants report emotional improvement after the intervention, not all of them reduce or eliminate their dysfunctional bias toward stressful information. Rather, they show a variety of processing styles in response to the intervention, and some show no change at all. Implications of this differential effect for clinical settings are discussed.
Abstract: The object-based scalable coding in MPEG-4 is investigated, and a prioritized transmission scheme for MPEG-4 audio-visual objects (AVOs) over a DiffServ network with QoS guarantees is proposed. MPEG-4 AVOs are extracted and classified into different groups according to their priority values and scalable layers (visual importance). These priority values are mapped to IP DiffServ per-hop behaviors (PHBs), as in the sketch below. The scheme can selectively discard packets of low importance in order to avoid network congestion. Simulation results show that the quality of the received video gracefully adapts to the network state, compared with the best-effort manner. Also, by allowing the content provider to define the prioritization of each audio-visual object, the adaptive transmission of object-based scalable video can be customized based on the content.
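The mapping from object importance to PHBs might look like the sketch below. The DSCP values are the standard codepoints for the EF and AF4x classes; the specific priority-to-PHB tiers are our illustrative assumption, not the paper's table.

```python
# Standard DiffServ codepoints (RFC 3246 / RFC 2597); tiers below are
# a hypothetical mapping, not the paper's exact classification.
DSCP = {'EF': 0b101110, 'AF41': 0b100010, 'AF42': 0b100100, 'BE': 0b000000}

def phb_for_object(priority, layer):
    """Base layers of high-priority objects ride EF; enhancement layers
    degrade through AF drop precedences down to best effort, so routers
    shed the least important packets first under congestion."""
    if priority == 0 and layer == 0:
        return DSCP['EF']      # highest-priority base layer: protect
    if layer == 0:
        return DSCP['AF41']    # other base layers: low drop precedence
    if layer == 1:
        return DSCP['AF42']    # first enhancement layer
    return DSCP['BE']          # higher enhancement layers: best effort
```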
Funding: Supported by the Research Grants Council of the Hong Kong Special Administrative Region (Project No. CityU 11203322) and the Shenzhen Science and Technology Innovation Commission (Shenzhen-Hong Kong-Macao Science and Technology Project (Category C) No. SGDX20210823104002020), China.
Abstract: Analytics and visualization of multi-dimensional and complex geo-data, such as three-dimensional (3D) subsurface ground models, are critical for the development of underground space and for the design and construction of underground structures (e.g., tunnels, dams, and slopes) in engineering practice. Although complicated 3D subsurface ground models can now be developed from site investigation data (e.g., boreholes), which are often sparse in practice, it remains a great challenge to visualize a 3D subsurface ground model with sophisticated stratigraphic variations using conventional two-dimensional (2D) geological cross-sections. Virtual reality (VR) technology, which has the attractive capability of constructing a virtual environment linked to the physical world, has recently developed rapidly and been applied to visualization in various disciplines. Leveraging this rapid development, this study proposes a framework for immersive visualization of 3D subsurface ground models in geo-applications using VR technology. The 3D subsurface model is first developed from limited borehole data in a data-driven manner. Then, a VR system is built using software and hardware currently available on the market for immersive visualization of, and interaction with, the developed 3D subsurface ground model. The results demonstrate that VR visualization of the 3D subsurface ground model in an immersive environment has great potential to move geo-practice from 2D cross-sections to a 3D immersive virtual environment in the digital era, particularly for emerging digital twins.
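As a toy illustration of the borehole-to-model step (the paper's data-driven method is certainly more sophisticated), the sketch below interpolates a stratum interface depth at an arbitrary point by inverse-distance weighting of sparse borehole observations; stacking such interpolated surfaces yields a simple 3D ground model that a VR scene can then mesh and render.

```python
import numpy as np

def idw_layer_depth(xy, borehole_xy, borehole_depths, power=2.0):
    """Inverse-distance-weighted estimate of a stratum interface depth
    at query point xy (shape (2,)) from boreholes at borehole_xy
    (shape (N, 2)) with observed interface depths borehole_depths
    (shape (N,))."""
    d = np.linalg.norm(borehole_xy - xy, axis=1)
    if np.any(d < 1e-9):                  # query sits on a borehole
        return float(borehole_depths[np.argmin(d)])
    w = 1.0 / d ** power                  # nearer boreholes dominate
    return float(np.sum(w * borehole_depths) / np.sum(w))
```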
Abstract: To improve speech separation, visual signals can be used as auxiliary information in addition to the mixed speech signal. This multimodal modeling approach, which fuses visual and audio signals, has been shown to effectively improve speech separation performance and opens new possibilities for the task. To better capture long-range dependencies in the visual and audio features and to strengthen the network's understanding of contextual information in the input, this paper proposes a time-domain audio-visual speech separation model based on one-dimensional dilated convolutions and a Transformer. It brings traditional frequency-domain audio-visual speech separation methods into the time domain, avoiding the information loss and phase reconstruction problems introduced by time-frequency transforms. The proposed architecture contains four modules: a visual feature extraction network that extracts lip embedding features from video frames; an audio encoder that converts the mixed speech into a feature representation; a multimodal separation network, composed mainly of an audio subnetwork, a video subnetwork, and a Transformer network, that performs the separation using the visual and audio features; and an audio decoder that reconstructs the separated features into clean speech. Experiments use a two-speaker mixture dataset generated from the LRS2 dataset. The results show that the proposed network achieves 14.0 dB in scale-invariant signal-to-noise ratio improvement (SI-SNRi) and 14.3 dB in signal-to-distortion ratio improvement (SDRi), a clear performance gain over both an audio-only separation model and a generic audio-visual separation model.
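For reference, the SI-SNR metric underlying the reported SI-SNRi figure can be computed as below; this is the standard definition used in time-domain separation work, not code from the paper (SI-SNRi is si_snr(estimate, reference) minus si_snr(mixture, reference)).

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB. The reference is
    rescaled by the optimal projection of the estimate onto it, so the
    metric ignores gain differences."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))
```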