
Human similar action recognition by fusing saliency image semantic features (cited by: 1)
Abstract

Objective: Human action recognition is a valuable research area in computer vision, with a wide range of applications such as security surveillance, intelligent monitoring, human-computer interaction, and virtual reality. Skeleton-based action recognition methods first extract the position coordinates of the major body joints from video or images, by hardware or software means, and then use the skeleton information for action recognition. In recent years, skeleton-based action recognition has received increasing attention because of its robustness to dynamic environments, complex backgrounds, and occlusion. Early methods usually relied on hand-crafted features for action recognition modeling, but such features generalize poorly because they lack diversity. Deep learning has since become the mainstream approach because of its powerful automatic feature extraction. Traditional deep learning methods arrange the skeleton data as joint coordinate vectors or pseudo-images, which are
directly input into recurrent neural networks (RNNs) or convolutional neural networks (CNNs) for action classification. However, RNN- and CNN-based methods lose the spatial structure of the skeleton data because they force it into a Euclidean data structure, and they cannot exploit the natural correlations among human joints, so distinguishing the subtle differences between similar actions becomes difficult. Human joints are naturally structured as graphs in non-Euclidean space, and several works have successfully adopted graph convolutional networks (GCNs) to achieve state-of-the-art performance in skeleton-based action recognition. In these methods, however, the subtle differences between joints, which are crucial for recognizing similar actions, are not explicitly learned. Moreover, the skeleton data extracted from video discard the objects that interact with the person and retain only the primary joint coordinates; the lack of image semantics and the reliance on joint sequences alone make recognizing similar actions markedly harder.

Method: Given the above factors, this work proposes the saliency image feature enhancement based center-connected graph convolutional network (SIFE-CGCN) for skeleton-based similar action recognition. The model builds on GCNs, which can fully exploit the spatial and temporal dependencies among human joints. First, the CGCN is proposed for skeleton-based similar action recognition. In the spatial dimension, a center-connection skeleton topology establishes connections between every human joint and the skeleton center to capture the small differences in joint movement between similar actions. In the temporal dimension, each frame is associated with the previous and subsequent frames in the sequence, so the number of temporally adjacent nodes is fixed at 2 and regular 1D convolution serves as the temporal graph convolution. A
basic graph convolution unit comprises a spatial graph convolution, a temporal graph convolution, and a dropout layer. For training stability, a residual connection is added to each unit. The network is formed by stacking nine such units; a batch normalization (BN) layer before the network standardizes the input data, and a global average pooling layer at the end unifies the feature dimensions. A dual-stream architecture uses the joint and bone information of the skeleton data simultaneously to extract features from multiple angles, and, because each joint plays a different role in different actions, an attention map focuses the network on the main motion joints. Second, the saliency images in the video are selected with Gaussian mixture background modeling: each frame is compared against a background model updated in real time, segmenting the image regions with considerable change and eliminating background interference. Effectively extracting semantic feature maps from the saliency images is the key to distinguishing similar actions. The Visual Geometry Group network (VGG-Net) can effectively extract the spatial structure features of objects from images; in this work, feature maps are extracted with a pre-trained VGG-Net, and a fully connected layer performs feature matching. Finally, the feature-map matching result is used to strengthen and revise the recognition result of the CGCN, improving its ability to recognize similar actions. In addition, a similarity calculation method for skeleton sequences is proposed, and a similar action dataset is established.

Result: The proposed model is compared with state-of-the-art models on the proposed similar action dataset and the Nanyang Technological University RGB+D (NTU RGB+D) 60/120 datasets; the compared methods include CNN-based, RNN-based, and GCN-based models. On the cross-subject (X-Sub) and
cross-view (X-View) benchmarks of the proposed similar action dataset, the recognition accuracy of the proposed model reaches 80.3% and 92.1%, respectively, 4.6% and 6.0% higher than that of the suboptimal algorithm. On the X-Sub and X-View benchmarks of the NTU RGB+D 60 dataset, accuracy reaches 91.7% and 96.9%, improvements of 1.4% and 0.6% over the suboptimal algorithm. Compared with the suboptimal model, the feedback graph convolutional network (FGCN), the proposed model improves recognition accuracy by 1.7% and 1.1% on the X-Sub and cross-setup (X-Set) benchmarks of the NTU RGB+D 120 dataset, respectively. In addition, a series of comparative experiments clearly demonstrates the effectiveness of the proposed CGCN, the saliency image extraction method, and the fusion algorithm.

Conclusion: This study proposes SIFE-CGCN to resolve the recognition confusion that arises with similar actions because of ambiguous skeleton features and missing image semantic information. The experimental results show that the proposed method can effectively recognize similar actions and that the overall recognition performance and robustness of the model are improved.
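The Gaussian mixture background modeling step described in the Method section can be sketched as follows. This is a simplified illustration, not the paper's implementation: a single running Gaussian per pixel stands in for the full mixture, and the frame size, learning rate, and deviation threshold are illustrative assumptions.

```python
import numpy as np

def update_background(frame, mean, var, alpha=0.05, k=2.5):
    """One step of a running per-pixel Gaussian background model.

    Pixels deviating from the background by more than k standard
    deviations are flagged as foreground (the salient region); the
    background statistics are updated online with learning rate alpha
    only where the pixel still matches the background.
    """
    diff = np.abs(frame - mean)
    foreground = diff > k * np.sqrt(var)
    bg = ~foreground
    mean[bg] = (1 - alpha) * mean[bg] + alpha * frame[bg]
    var[bg] = (1 - alpha) * var[bg] + alpha * (frame[bg] - mean[bg]) ** 2
    return foreground, mean, var

# Toy usage: a static noisy background, then a bright moving patch enters.
np.random.seed(0)
h, w = 32, 32
mean = np.zeros((h, w))
var = np.full((h, w), 4.0)
for _ in range(20):                       # let the model settle on the background
    frame = np.random.normal(0.0, 1.0, (h, w))
    _, mean, var = update_background(frame, mean, var)
frame = np.random.normal(0.0, 1.0, (h, w))
frame[8:16, 8:16] += 50.0                 # the "actor" enters this region
fg, mean, var = update_background(frame, mean, var)
print(fg[8:16, 8:16].mean())              # → 1.0 (every pixel in the patch is flagged salient)
```

In the paper's pipeline, the foreground mask produced this way selects the saliency image region that is then passed to the pre-trained VGG-Net for semantic feature matching.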
Authors: Bai Zhongyu, Ding Qichuan, Xu Hongli, Wu Chengdong (Faculty of Robot Science and Engineering, Northeastern University, Shenyang 110819, China)
Source: Journal of Image and Graphics (《中国图象图形学报》), CSCD, Peking University core journal, 2023, No. 9, pp. 2872-2886 (15 pages)
Funding: National Natural Science Foundation of China (61973065, 61973063); Liaoning Provincial Department of Science and Technology Joint Open Fund, State Key Laboratory of Robotics open fund project (2020-KF-12-02); Fundamental Research Funds for the Central Universities (N2226002).
Keywords: action recognition; skeleton sequence; similar action; graph convolutional network (GCN); image saliency features
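As an illustration of the center-connection topology described in the abstract, the sketch below builds a normalized adjacency matrix for a 25-joint NTU RGB+D skeleton in which every joint is additionally linked to a center joint. The bone list and the choice of the spine-middle joint as center are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

NUM_JOINTS = 25   # NTU RGB+D skeleton (assumed joint layout)
CENTER = 1        # 0-based index of joint 2, "middle of the spine" (assumed center)

# Physical bone pairs of the NTU RGB+D skeleton, 1-based as commonly listed.
BONES = [(1, 2), (2, 21), (3, 21), (4, 3), (5, 21), (6, 5), (7, 6), (8, 7),
         (9, 21), (10, 9), (11, 10), (12, 11), (13, 1), (14, 13), (15, 14),
         (16, 15), (17, 1), (18, 17), (19, 18), (20, 19), (22, 23), (23, 8),
         (24, 25), (25, 12)]

def center_connected_adjacency():
    A = np.eye(NUM_JOINTS)                       # self-loops
    for i, j in BONES:
        A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0  # physical bones
    for j in range(NUM_JOINTS):
        A[j, CENTER] = A[CENTER, j] = 1.0        # extra joint-to-center links
    # symmetric degree normalization D^{-1/2} A D^{-1/2}, standard in GCNs
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

A_hat = center_connected_adjacency()
print(A_hat.shape)                   # → (25, 25)
print(np.allclose(A_hat, A_hat.T))   # → True: the graph is undirected
```

An adjacency matrix of this form would replace the standard skeleton adjacency in each spatial graph-convolution layer, letting the network compare every joint against a common reference point, which is what allows the subtle inter-joint differences between similar actions to be captured.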
