Funding: National Natural Science Foundation of China (No. 81860170).
Abstract: AIM: To assess the performance of macular ganglion cell-inner plexiform layer thickness (mGCIPLT) and 10-2 visual field (VF) parameters in detecting early glaucoma and evaluating the severity of advanced glaucoma. METHODS: A total of 127 eyes from 89 participants (36 eyes of 19 healthy participants, 45 eyes of 31 early glaucoma patients, and 46 eyes of 39 advanced glaucoma patients) were included. The relationships between the optical coherence tomography (OCT)-derived parameters and VF sensitivity were determined. Patients with early glaucoma were divided into eyes with or without damage in the central 10° of the VF (CVFDs), and the diagnostic performance of the OCT-derived parameters was assessed. RESULTS: In early glaucoma, the mGCIPLT was significantly correlated with the 10-2 VF pattern standard deviation (PSD; with average mGCIPLT: β=-0.046, 95%CI, -0.067 to -0.024, P<0.001). In advanced glaucoma, the mGCIPLT was related to the 24-2 VF mean deviation (MD; with average mGCIPLT: β=0.397, 95%CI, 0.199 to 0.595, P<0.001), the 10-2 VF MD (with average mGCIPLT: β=0.762, 95%CI, 0.485 to 1.038, P<0.001), and the 24-2 VF PSD (with average mGCIPLT: β=0.244, 95%CI, 0.124 to 0.364, P<0.001). Except for the minimum and superotemporal mGCIPLT, the decrease in mGCIPLT was more severe in early glaucomatous eyes with CVFDs than in those without. The area under the curve (AUC) of the average mGCIPLT (AUC=0.949, 95%CI, 0.868 to 0.982) was greater than that of the average circumpapillary retinal nerve fiber layer thickness (cpRNFLT; AUC=0.827, 95%CI, 0.674 to 0.918) and the rim area (AUC=0.799, 95%CI, 0.610 to 0.907) in distinguishing early glaucomatous eyes with CVFDs from normal eyes. CONCLUSION: The 10-2 VF and mGCIPLT parameters are complementary to the 24-2 VF, cpRNFLT, and optic nerve head (ONH) parameters, especially in detecting early glaucoma with CVFDs and evaluating the severity of advanced glaucoma at the group level.
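The diagnostic figures above are areas under the ROC curve. As a minimal sketch of what an AUC like 0.949 means, the rank-statistic form below computes the probability that a randomly chosen glaucomatous eye has a thinner mGCIPLT than a randomly chosen healthy eye; the micrometer values are hypothetical, for illustration only.

```python
def auc_from_scores(diseased, healthy):
    """AUC = probability that a random diseased eye has a lower
    mGCIPLT than a random healthy eye, counting ties as 1/2."""
    wins = 0.0
    for d in diseased:
        for h in healthy:
            if d < h:        # thinner layer -> classified glaucomatous
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased) * len(healthy))

glaucoma_um = [58, 61, 63, 66, 74]   # hypothetical early-glaucoma eyes
healthy_um = [72, 75, 78, 80, 84]    # hypothetical normal eyes
print(auc_from_scores(glaucoma_um, healthy_um))  # 0.96
```

An AUC of 1.0 would mean the two groups' thickness distributions do not overlap at all; 0.5 would mean the measurement carries no diagnostic information.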
Abstract: Video data comprise multimodal information streams, including visual, auditory, and textual streams, so this paper describes a story-segmentation approach for news video based on multimodal analysis. The proposed approach detects topic-caption frames and integrates them with silence-clip detection results and shot-segmentation results to locate the news story boundaries. The integration of audio-visual features and text information overcomes the weakness of approaches that use only image-analysis techniques. On test data with 135,400 frames, the detected boundaries between news stories achieve an accuracy rate of 85.8% and a recall rate of 97.5%. The experimental results show the approach is valid and robust.
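A sketch of the boundary-evaluation step behind figures like those above: a detected story boundary counts as correct if it falls within a small frame tolerance of a ground-truth boundary, and precision/recall follow from the match counts. The frame numbers and the tolerance below are illustrative assumptions, not the paper's protocol.

```python
def evaluate_boundaries(detected, truth, tol=25):
    """Match each detected boundary to at most one true boundary
    within +/- tol frames, then compute precision and recall."""
    matched_truth = set()
    correct = 0
    for d in detected:
        for i, t in enumerate(truth):
            if i not in matched_truth and abs(d - t) <= tol:
                matched_truth.add(i)
                correct += 1
                break
    precision = correct / len(detected)
    recall = correct / len(truth)
    return precision, recall

detected = [100, 520, 1190, 2400, 3010]   # hypothetical detected frames
truth = [110, 515, 1200, 3000]            # hypothetical true boundaries
print(evaluate_boundaries(detected, truth))  # (0.8, 1.0)
```

Here one false alarm (frame 2400) lowers precision while every true boundary is still found, mirroring how an approach can trade a high recall against a lower accuracy rate.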
Funding: Supported by the National Key R&D Program of China (No. 2020AAA0108904) and the Science and Technology Plan of Shenzhen (No. JCYJ20200109140410340).
Abstract: Audio-visual wake word spotting is a challenging multi-modal task that exploits visual information from lip-motion patterns to supplement acoustic speech and improve overall detection performance. However, most audio-visual wake word spotting models are suitable only for simple single-speaker scenarios and have high computational complexity. Further development is hindered by complex multi-person scenarios and the computational limitations of mobile environments. In this paper, a novel audio-visual model is proposed for on-device multi-person wake word spotting. Firstly, an attention-based audio-visual voice activity detection module is presented, which generates an attention score matrix of the audio and visual representations to derive the active speaker representation. Secondly, knowledge distillation is introduced to transfer knowledge from a large model to the on-device model to control the model size. Moreover, a new audio-visual dataset, PKU-KWS, is collected for sentence-level multi-person wake word spotting. Experimental results on the PKU-KWS dataset show that this approach outperforms previous state-of-the-art methods.
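The core of the attention-based module is scoring each candidate speaker's lip stream against the audio and weighting the visual streams accordingly. The sketch below illustrates that idea with random tensors; the shapes, the dot-product scoring, and the softmax pooling are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_speakers = 20, 32, 3       # frames, feature dim, visible faces

audio = rng.normal(size=(T, d))                # audio representation
visual = rng.normal(size=(n_speakers, T, d))   # one lip stream per face

# attention scores: similarity of each visual stream to the audio stream
scores = np.einsum('td,ntd->n', audio, visual) / (T * np.sqrt(d))
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over speakers

# active-speaker representation: attention-weighted sum of visual streams
active = np.einsum('n,ntd->td', weights, visual)
print(active.shape)  # (20, 32)
```

The wake-word detector then consumes `active` alongside the audio features, so silent faces contribute little to the fused representation.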
Funding: Taif University Researchers Supporting Project No. TURSP-2020/77, Taif University, Taif, Saudi Arabia.
Abstract: With the increasing need to transmit sensitive or secret data over public networks, security schemes combining cryptography and steganography have become an active research area in recent years. These two techniques can be merged to provide the stronger security that is now widely required. The proposed system offers a novel method of information security that combines audio steganography with visual cryptography. In this system, a secret image is divided into several incomprehensible sub-images using visual cryptography. Each sub-image is then hidden within an individual cover audio file using audio steganographic techniques. The cover audios are sent to the required destinations, where reverse steganography is applied to recover the incomprehensible component images. Finally, all the sub-images are superimposed to reconstruct the actual secret image. This method is very secure because it uses a two-step security mechanism to maintain secrecy. The possibility of interception is low because an attacker must obtain every correct sub-image to regenerate the actual secret image; without superimposing all of the sub-images, no meaningful secret image can be formed. Audio files are composed of densely packed bits, and this high data density makes it hard for a listener to detect the manipulation introduced by the proposed time-domain audio steganographic method.
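A minimal sketch of the two-step idea: XOR-based secret sharing (a byte-level stand-in for visual cryptography, which really operates on image pixels) plus least-significant-bit time-domain audio steganography. The secret string, the integer "audio samples", and the single-share embedding are all simplifying assumptions.

```python
import os

secret = b"PIN:4711"

# Step 1: split the secret into two incomprehensible shares;
# either share alone is indistinguishable from random bytes.
share1 = os.urandom(len(secret))
share2 = bytes(a ^ b for a, b in zip(secret, share1))

def embed_lsb(samples, payload):
    """Hide payload bits in the least-significant bit of each sample."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    return [(s & ~1) | b for s, b in zip(samples, bits)] + samples[len(bits):]

def extract_lsb(samples, n_bytes):
    bits = [s & 1 for s in samples[:n_bytes * 8]]
    return bytes(sum(bits[i * 8 + j] << j for j in range(8))
                 for i in range(n_bytes))

cover = list(range(1000, 1000 + 8 * len(secret)))  # stand-in audio samples
stego = embed_lsb(cover, share2)

# Step 2: recover the hidden share from the audio, then superimpose.
recovered = extract_lsb(stego, len(secret))
print(bytes(a ^ b for a, b in zip(share1, recovered)))  # b'PIN:4711'
```

Flipping only the LSB changes each 16-bit sample by at most one quantization step, which is why such time-domain embedding is hard to hear.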
Funding: Supported by the National Natural Science Foundation of China (60905006) and the NSFC-Guangdong Joint Fund (U1035004).
Abstract: Emotion recognition has become an important task in modern human-computer interaction. A multilayer boosted HMM (MBHMM) classifier for automatic audio-visual emotion recognition is presented in this paper. A modified Baum-Welch algorithm is proposed for component-HMM learning, and adaptive boosting (AdaBoost) is used to train ensemble classifiers for the different layers (cues). Except for the first layer, the initial weights of the training samples in the current layer are determined by the recognition results of the ensemble classifier in the layer above, so the training procedure for the current cue can focus on the samples the previous cue found difficult. The MBHMM classifier combines these ensemble classifiers and takes advantage of the complementary information from multiple cues and modalities. Experimental results on audio-visual emotion data collected in Wizard-of-Oz scenarios and labeled under two types of emotion category sets demonstrate that the approach is effective and promising.
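The layer-to-layer weighting can be sketched with the standard AdaBoost update: samples the previous cue's ensemble misclassified receive larger weights before the next cue is trained. The labels and predictions below are toy values; only the update formula is the standard one.

```python
import math

labels = [1, 1, -1, -1, 1]        # true emotion labels (toy, binary)
predictions = [1, -1, -1, 1, 1]   # previous layer's ensemble output (toy)

# uniform initial weights, as in the first layer
weights = [1.0 / len(labels)] * len(labels)
err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
alpha = 0.5 * math.log((1 - err) / err)   # ensemble confidence

# standard AdaBoost update: up-weight mistakes, down-weight correct samples
weights = [w * math.exp(-alpha * y * p)
           for w, y, p in zip(weights, labels, predictions)]
total = sum(weights)
weights = [w / total for w in weights]
print([round(w, 2) for w in weights])  # [0.17, 0.25, 0.17, 0.25, 0.17]
```

The two misclassified samples (indices 1 and 3) end up with the largest weights, so the next cue's training concentrates on exactly the cases the previous cue handled poorly.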
Abstract: The object-based scalable coding in MPEG-4 is investigated, and a prioritized transmission scheme for MPEG-4 audio-visual objects (AVOs) over a DiffServ network with QoS guarantees is proposed. MPEG-4 AVOs are extracted and classified into different groups according to their priority values and scalable layers (visual importance). These priority values are mapped to IP DiffServ per-hop behaviors (PHBs). The scheme can selectively discard packets of low importance in order to avoid network congestion. Simulation results show that the quality of the received video adapts gracefully to the network state, compared with the best-effort manner. Also, by allowing the content provider to define the prioritization of each audio-visual object, the adaptive transmission of object-based scalable video can be customized based on the content.
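The priority-to-PHB mapping can be sketched as a simple threshold table. The specific thresholds, object names, and the choice of EF/AF/best-effort codepoints below are illustrative assumptions, not the paper's exact mapping; only the DSCP values themselves come from the DiffServ standard.

```python
# Standard DSCP codepoints (RFC 2474 / RFC 2597 / RFC 3246)
DSCP = {"EF": 0b101110, "AF11": 0b001010, "BE": 0b000000}

def phb_for_object(priority):
    """Map an AVO priority value to a DiffServ per-hop behavior.
    Thresholds are hypothetical, chosen so base layers survive congestion."""
    if priority >= 8:
        return "EF"      # expedited forwarding for base-layer objects
    if priority >= 4:
        return "AF11"    # assured forwarding for enhancement layers
    return "BE"          # best-effort: first to be dropped under congestion

objects = {"base_video": 9, "enh_layer1": 5, "background_audio": 2}
print({name: phb_for_object(p) for name, p in objects.items()})
```

Under congestion, routers drop best-effort packets first, so the base-layer objects that determine minimum acceptable quality are the last to be affected.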
Abstract: Experimental single-case studies on the automatic processing of emotion were carried out on a sample of people with an anxiety disorder. Participants took three Audio Visual Entrainment (AVE) sessions to test for the anxiety reduction claimed by some academic research. Explicit reports were measured, as well as pre-attentive bias toward stressing information, using affective priming studies before and after the AVE intervention. Group analysis shows that AVE program applications do reduce anxiety, producing significant changes in explicit reports of anxiety levels and in the automatic processing bias of emotion. However, case-by-case analysis of six anxious participants shows that even though all participants report emotional improvement after the intervention, not all of them reduce or eliminate the dysfunctional bias toward stressing information. Rather, they show a variety of processing styles in response to the intervention, and some show no change at all. Implications of this differential effect for clinical settings are discussed.
Abstract: Rapid urbanization compels the construction of livable cities in which humans and nature coexist in harmony in order to reduce future risks, and understanding how people perceive and benefit from nature is the basis for effective risk reduction. Although previous studies have examined the perceived benefits of natural environments, research on multisensory interactive benefits is still at an exploratory stage, lacking a summary of relevant methods and theoretical generalization, which makes it difficult to truly realize harmonious human-nature environments in practice. Therefore, using the CiteSpace bibliometric analysis platform and drawing on the Web of Science core database and the CNKI database, this study systematically analyzes the progress and trends of domestic and international research on audio-visual-olfactory perceptual interaction with natural environments and resident well-being. The results show that: 1) publication output is strongly phased, with explosive growth in the past five years, concentrated mainly in Asia and Europe; the disciplines and journals involved are markedly interdisciplinary, and research has shifted from single-dimension to multi-dimension perceptual benefits; 2) the study sites involve three levels: natural elements, natural landscape composition, and land-cover type; the study subjects cover different age groups, with university students predominating; 3) the research methods applied fall into three categories: on-site field studies, indoor simulation, and analysis of crowdsourced social-media data; the number of perceptual dimensions is nonlinearly and positively correlated with the perceived-benefit effect, and natural elements, environmental congruence, and nature identity are the key factors affecting perceived benefits; 4) the field draws on multiple theoretical pathways, including environmental psychology and landscape restorativeness, from which a Stimulus-Organism-Response (SOR) conceptual framework is proposed, in order to provide a theoretical basis and practical methods for future well-being-oriented urban green-space planning and management.
Abstract: With the largest population in the world, the Asia-Pacific area is in great need of fundamental research in the visual sciences, the protection of vision, and the prevention and treatment of visual diseases. The Symposium will open a new era of academic exchange in the field of visual sciences in this area. It will also enhance the academic exchange of visual sciences worldwide.
Abstract: To improve speech separation, visual signals can be exploited as auxiliary information in addition to the mixed speech signal. This multimodal modeling approach, which fuses visual and audio signals, has been shown to effectively improve speech-separation performance and opens new possibilities for the task. To better capture long-term dependencies in the visual and audio features and to strengthen the network's understanding of the input context, this paper proposes a time-domain audio-visual speech-separation model based on one-dimensional dilated convolutions and a Transformer. Applying the traditional frequency-domain audio-visual separation approach in the time domain avoids the information loss and phase-reconstruction problems introduced by time-frequency transforms. The proposed architecture contains four modules: a visual feature extraction network that extracts lip-embedding features from video frames; an audio encoder that converts the mixed speech into a feature representation; a multimodal separation network, composed mainly of an audio subnetwork, a video subnetwork, and a Transformer network, that performs speech separation using the visual and audio features; and an audio decoder that restores the separated features to clean speech. The experiments use a two-speaker mixture dataset generated from the LRS2 dataset. Experimental results show that the proposed network achieves 14.0 dB of Scale-Invariant Signal-to-Noise Ratio improvement (SI-SNRi) and 14.3 dB of Signal-to-Distortion Ratio improvement (SDRi), a clear performance gain over audio-only separation models and general-purpose audio-visual separation models.
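The SI-SNR metric reported above can be sketched directly from its definition: project the estimate onto the target to remove any scale difference, then take the ratio of target power to residual power. The toy signals below are an assumption for illustration; the "improvement" figures in the abstract come from comparing the separated output's SI-SNR against the unprocessed mixture's.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (zero-mean convention)."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # project the estimate onto the target so gain differences don't matter
    s_target = (estimate @ target) / (target @ target + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

rng = np.random.default_rng(1)
clean = rng.normal(size=16000)            # 1 s of "clean speech" at 16 kHz
noisy = clean + 0.1 * rng.normal(size=16000)
print(round(float(si_snr(noisy, clean)), 1))  # roughly 20 dB for 10% noise
```

Because of the projection step, multiplying the estimate by any nonzero constant leaves SI-SNR unchanged, which is why it is preferred over plain SNR for separation models whose output gain is arbitrary.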