
Attention-Aware and Semantic-Aware Network for RGB-D Indoor Semantic Segmentation (Cited by: 16)
Abstract: In recent years, fully convolutional networks have substantially improved the accuracy of semantic segmentation. However, owing to the complexity of indoor environments, semantic segmentation of indoor scenes remains a challenging problem. With the advent of depth sensors, researchers have begun to exploit depth information to improve segmentation results. Most previous studies simply fuse RGB features and depth features with equal-weight concatenation or summation, and thus fail to make full use of the complementary information between the two modalities. This paper proposes an Attention-aware and Semantic-aware Network (ASNet). By introducing an attention-aware multi-modal fusion block and a semantic-aware multi-modal fusion block, the network effectively fuses multi-level RGB and depth features. In the attention-aware multi-modal fusion block, a cross-modal attention mechanism is designed in which RGB features and depth features guide and refine each other through their complementary information, producing feature representations rich in spatial location information. The semantic-aware multi-modal fusion block models the semantic dependencies between multi-modal features by integrating semantically related RGB and depth feature channels, extracting more precise semantic feature representations. The two fusion blocks are integrated into a two-branch encoder-decoder network with skip connections. During training, a deep supervision strategy is adopted, with supervised learning applied at multiple decoding layers. Experimental results on public datasets show that the proposed algorithm outperforms existing RGB-D semantic segmentation algorithms, improving mean accuracy and mean IoU by 1.9% and 1.2%, respectively, over recent methods.

Semantic segmentation is a research hotspot in computer vision: it assigns every pixel in an image to a semantic class. As a fundamental problem in scene understanding, semantic segmentation is widely used in various intelligent tasks. In recent years, with the success of convolutional neural networks (CNNs) in many computer vision applications, fully convolutional networks (FCNs) have shown great potential for RGB semantic segmentation. However, semantic segmentation remains challenging due to the complexity of scene types, severe object occlusions, and varying illumination. With the availability of consumer RGB-D sensors such as the RealSense 3D Camera and Microsoft Kinect, RGB images and depth information can be captured at the same time. Depth information describes 3D geometric structure that may be missing from RGB-only images; it can significantly reduce classification errors and improve segmentation accuracy. To make effective use of RGB and depth information, it is crucial to find an efficient multi-modal information fusion method. According to the fusion stage, current RGB-D feature fusion methods can be divided into three types: early fusion, late fusion, and middle fusion. However, most previous studies fail to exploit the complementary information between RGB and depth. They simply fuse RGB features and depth features with equal-weight concatenation or summation, which fails to extract complementary information between the two modalities and suppresses modality-specific information. In addition, the semantic information carried in high-level features of the two modalities is not taken into account, even though it is very important for fine-grained semantic segmentation. To address these problems, this paper presents a novel Attention-aware and Semantic-aware Multi-modal Fusion Network (ASNet) for RGB-D semantic segmentation. The network effectively fuses multi-level RGB-D features through Attention-aware Multi-modal Fusion (AMF) blocks and Semantic-aware Multi-modal Fusion (SMF) blocks. Specifically, in the AMF blocks, a cross-modal attention mechanism is designed so that RGB features and depth features guide and optimize each other through their complementary characteristics, yielding feature representations with rich spatial location information. In addition, the SMF blocks model the semantic interdependencies between multi-modal features by integrating semantically associated feature channels of the RGB and depth features, extracting more precise semantic feature representations. The two blocks are integrated into a two-branch encoder-decoder architecture, which gradually restores image resolution through consecutive up-sampling operations and combines low-level and high-level features through skip connections to achieve high-resolution prediction. To optimize the training process, deeply supervised learning is applied over multi-level decoding features. The network effectively learns the complementary characteristics of the two modalities and models the semantic context interdependencies between RGB features and depth features. Experimental results on two challenging public RGB-D indoor semantic segmentation datasets, SUN RGB-D and NYU Depth v2, show that the network outperforms existing RGB-D semantic segmentation methods, improving mean accuracy and mean IoU by 1.9% and 1.2%, respectively.
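The cross-modal fusion idea described in the abstract can be illustrated with a minimal PyTorch-style sketch. This is an illustrative assumption rather than the paper's published implementation: the class names, the sigmoid spatial gates in the attention-aware (AMF) block, and the squeeze-and-excitation style channel weighting in the semantic-aware (SMF) block are choices made here only to make the mechanism concrete.

```python
import torch
import torch.nn as nn


class AttentionAwareFusion(nn.Module):
    # Sketch of an attention-aware multi-modal fusion (AMF) block: each
    # modality produces a spatial gate that re-weights the other modality
    # (cross-modal attention), and the two refined streams are summed.
    def __init__(self, channels: int):
        super().__init__()
        self.rgb_gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        rgb_refined = rgb * self.depth_gate(depth)      # depth guides RGB
        depth_refined = depth * self.rgb_gate(rgb)      # RGB guides depth
        return rgb_refined + depth_refined


class SemanticAwareFusion(nn.Module):
    # Sketch of a semantic-aware multi-modal fusion (SMF) block: a
    # squeeze-and-excitation style channel weighting over the concatenated
    # RGB and depth features models cross-modal channel dependencies.
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, depth], dim=1)              # (B, 2C, H, W)
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return self.project(x * w)                      # channel-reweighted fusion


if __name__ == "__main__":
    rgb_feat = torch.randn(2, 64, 60, 80)    # dummy RGB encoder features
    depth_feat = torch.randn(2, 64, 60, 80)  # dummy depth encoder features
    print(AttentionAwareFusion(64)(rgb_feat, depth_feat).shape)   # torch.Size([2, 64, 60, 80])
    print(SemanticAwareFusion(64)(rgb_feat, depth_feat).shape)    # torch.Size([2, 64, 60, 80])
```

In this sketch the AMF block lets each modality spatially re-weight the other, and the SMF block re-weights the concatenated feature channels globally, one plausible way to model the semantic channel dependencies the abstract describes; the paper's actual layer configuration may differ.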
Authors: 段立娟, 孙启超, 乔元华, 陈军成, 崔国勤 (DUAN Li-Juan, SUN Qi-Chao, QIAO Yuan-Hua, CHEN Jun-Cheng, CUI Guo-Qin). Affiliations: Faculty of Information Technology, Beijing University of Technology, Beijing 100124; Beijing Key Laboratory of Trusted Computing, Beijing 100124; National Engineering Laboratory for Key Technologies of Information Security Level Protection, Beijing 100124; Advanced Institute of Information Technology, Peking University, Hangzhou 311200; College of Applied Sciences, Beijing University of Technology, Beijing 100124; State Key Laboratory of Digital Multimedia Chip Technology, Vimicro Corporation, Beijing 100191
Source: Chinese Journal of Computers (《计算机学报》, EI, CSCD, Peking University Core), 2021, No. 2, pp. 275-291 (17 pages)
Funding: National Key R&D Program of China (2017YFC0803705); Beijing Natural Science Foundation-Beijing Municipal Education Commission Joint Funding Project (KZ201910005008); Hangzhou Major Science and Technology Innovation Project (20182014B09).
Keywords: RGB-D semantic segmentation; convolutional neural network; multi-modal fusion; attention model; deep learning