Industrial box-packing action recognition based on multi-view adaptive 3D skeleton network (基于多视图自适应3D骨架网络的工业装箱动作识别)


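Abstract Objective Action recognition is becoming increasingly important in industrial manufacturing. In a complex production workshop, however, action recognition is disturbed by environmental occlusion, viewpoint changes, and the difficulty of distinguishing similar actions. To address this, a box-packing behavior recognition method that combines a dual-view skeleton multi-stream network is proposed. Method Stacked difference images (residual frames, RF) are used as the model input, and a multi-view module is used to handle occlusion of the human body. In the view conversion module, the human skeleton obtained from the difference images is rotated to the best virtual observation angle, the converted skeleton data are passed into a three-layer stacked long short-term memory (LSTM) network, and the classification scores from the different views are fused to obtain the recognition result. To recognize subtle actions, a local positioning image convolution network combined with an attention mechanism is adopted, and its output is passed to a convolutional neural network for recognition. The skeleton and local image recognition results are fused to predict the worker's action. Result Experiments in a box-packing scenario in a real production environment gave a packing behavior recognition accuracy of 92.31%, well ahead of existing mainstream behavior recognition methods. The method was also evaluated on the public NTU (Nanyang Technological University) RGB+D dataset, where it reached 85.52% under the CS (cross-subject) protocol and 93.64% under the CV (cross-view) protocol, outperforming other networks and further validating its effectiveness and accuracy. Conclusion The proposed human behavior recognition method makes full use of the human behavior information in multiple views and combines a skeleton network with a convolutional neural network model, which effectively improves the accuracy of behavior recognition.

Objective Action recognition has become increasingly important in industrial manufacturing. Production efficiency and quality can be improved by recognizing worker actions and postures in complex production environments. In recent years, action recognition based on skeletal data has received widespread attention, and methods based mainly on graph convolutional networks (GCN) or long short-term memory (LSTM) networks have shown excellent recognition performance in experiments. However, these methods do not consider occlusion, viewpoint changes, and similar subtle actions in the factory environment, all of which may significantly affect action recognition. Therefore, this study proposes a packing behavior recognition method that combines a dual-view skeleton multi-stream network. Method The network model consists of a main network and a sub-network. The main network takes as input two RGB videos that record the same worker performing the same action at the same time from different perspectives. The image difference method is then used to convert the input video data into difference images. The 3D skeleton information of the worker is extracted from the depth map with a 3D pose estimation algorithm and passed to the subsequent view conversion module.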
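The image difference step mentioned above can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists and takes the absolute difference between consecutive frames; the abstract does not give the exact formulation used in the paper, and `residual_frames` is a hypothetical name chosen for illustration.

```python
def residual_frames(frames):
    """Absolute per-pixel difference between each pair of consecutive frames.

    `frames` is a list of grayscale images, each a list of rows of intensities.
    Assumption: plain absolute differencing; the paper may use another variant.
    """
    return [
        [
            [abs(cur - prev) for prev, cur in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(prev_frame, cur_frame)
        ]
        for prev_frame, cur_frame in zip(frames, frames[1:])
    ]


# Three tiny 2x2 frames produce a stack of two difference images.
video = [
    [[0, 10], [20, 30]],
    [[5, 10], [20, 25]],
    [[5, 15], [0, 25]],
]
stack = residual_frames(video)
```

Static background pixels cancel out in the difference, which is one plausible reason such an input suppresses background clutter. A real pipeline would normally use an array library rather than nested lists.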
In the view conversion module, the skeleton data are rotated to find the best viewing angle, and the converted skeleton data are passed into a three-layer stacked LSTM network. The classification scores of the different views are fused by weighting to obtain the recognition result of the main network. In addition, for similar behaviors and non-compliant “fake actions”, we use a local positioning image convolution network combined with an attention mechanism and pass its output to the ResNeXt network for recognition. We also introduce a spatio-temporal attention mechanism that focuses on the key frames of the skeleton sequence. The recognition scores of the main network and the sub-network are fused in proportion to obtain the final recognition result and predict the behavior of the person. Result First, convolutional neural network (CNN)-based methods usually perform better than recurrent neural network (RNN)-based ones, whereas GCN-based methods have middling performance. Methods that combine CNN and RNN structures improve accuracy and recall by exploiting more of the spatiotemporal information in skeletons. Even so, the proposed method achieves a packing behavior recognition accuracy of 92.31% and a recall of 89.72%, which are 3.96% and 3.81% higher, respectively, than those of the combined methods. The proposed method is significantly ahead of other existing mainstream behavior recognition methods. Second, the method based on difference images combined with a skeleton extraction algorithm achieves 87.6% accuracy, which is better than using the original RGB images as input; the frame rate drops to 55.3 frames per second, which is still within the acceptable range.
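The skeleton rotation performed by the view conversion module can be pictured with a small sketch. It assumes joints given as (x, y, z) tuples and a rotation about the vertical y axis by an angle supplied by the caller; in the described network the viewing angle is found by the module itself, and the abstract does not state which axes are involved. `rotate_skeleton_y` is a hypothetical helper.

```python
import math


def rotate_skeleton_y(joints, angle_rad):
    """Rotate 3D joints about the vertical (y) axis by `angle_rad` radians.

    Assumption: a single-axis rotation with a given angle, used only to
    illustrate re-observing a skeleton from a virtual viewpoint.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints]


# A side-on joint at x = 1 moves onto the z axis after a quarter turn.
side_view = [(1.0, 2.0, 0.0)]
virtual_view = rotate_skeleton_y(side_view, math.pi / 2)
```

The rotation keeps every joint's height and its distance from the vertical axis, so bone lengths are preserved and only the apparent viewpoint changes.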
Third, considering the influence of the adaptive transformation module and the multi-view module, we find that the recognition rate of the single-stream network with the adaptive transformation module is greatly improved, while the frame rate (fps) decreases slightly. The experiments show that the module learns to prefer observing the action from the front, because a frontal view spreads the skeleton out as much as possible compared with a side view. The observation effect is worst when the mutual occlusion among bones is highest. For the dual view, simply fusing the outputs of the two single streams improves performance, and the weighted average method works best: its accuracy is 3.83% and 3.03% higher than that of the single streams S1 and S2, respectively. Some actions suffer from object occlusion and human self-occlusion at a given shooting angle. Two complementary views solve this occlusion problem, because an action occluded in one view can be recognized well in the other. In addition, evaluations on the public NTU RGB+D dataset showed that the method outperformed other networks. This result further validates the effectiveness and accuracy of the proposed method. Conclusion This method uses a two-stream network model. The main network is an adaptive multi-view RNN network. Two depth cameras with complementary perspectives collect data from the same station, and the incoming RGB images are converted into difference images for extracting skeleton information. The skeleton data are then passed into the adaptive view transformation module to obtain the best skeleton observation points, and the three-layer stacked LSTM network produces the recognition results. Finally, the features of the two views are fused by weighting, so that the main network addresses the influence of occlusion and background clutter.
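The weighted average fusion of the two single-stream outputs can be sketched as follows. The sketch assumes each stream yields a list of per-class scores of equal length and that one scalar weight balances the two views; the weights used in the paper are not given in the abstract, so the default of 0.5 and the name `fuse_view_scores` are illustrative assumptions.

```python
def fuse_view_scores(scores_s1, scores_s2, weight_s1=0.5):
    """Weighted average of two per-class score lists and the winning class index.

    Assumption: a single scalar weight in [0, 1]; the paper's weights are unknown.
    """
    if len(scores_s1) != len(scores_s2):
        raise ValueError("both views must score the same set of classes")
    if not 0.0 <= weight_s1 <= 1.0:
        raise ValueError("weight_s1 must lie in [0, 1]")
    fused = [
        weight_s1 * a + (1.0 - weight_s1) * b
        for a, b in zip(scores_s1, scores_s2)
    ]
    best_class = max(range(len(fused)), key=fused.__getitem__)
    return fused, best_class


# View S1 is partly occluded and unsure; view S2 sees the action clearly.
fused, predicted = fuse_view_scores([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```

With equal weights the clearer view decides the prediction here, which mirrors the idea that an action occluded in one view can still be recognized from the other.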
The sub-network adds hand image recognition based on skeleton positioning: the cropped local positioning image is sent to the ResNeXt network for recognition, which compensates for the insufficient accuracy in recognizing “fake actions” and similar actions. Finally, the recognition results of the main network and the sub-network are fused. The human behavior recognition method proposed in this study effectively utilizes human behavior information from multiple views and combines skeleton network and CNN models to significantly improve the accuracy of behavior recognition.
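The skeleton-positioned hand crop used by the sub-network can be illustrated with a short sketch. It assumes the hand joint is already projected to pixel coordinates and that a fixed-size square window around it is cut out and clipped to the image; the window size and the helper name `hand_crop_box` are hypothetical, since the abstract does not describe how the local image is sized.

```python
def hand_crop_box(hand_xy, image_size, half_size=32):
    """Return (left, top, right, bottom) of a square window around a hand joint.

    Assumption: a fixed half-width in pixels, clipped to the image borders.
    """
    x, y = hand_xy
    width, height = image_size
    left = max(0, x - half_size)
    top = max(0, y - half_size)
    right = min(width, x + half_size)
    bottom = min(height, y + half_size)
    return left, top, right, bottom


# A hand near the image centre keeps the full 64x64 window.
box = hand_crop_box((100, 50), (640, 480))
# A hand near the bottom-left corner is clipped to the image.
edge_box = hand_crop_box((10, 470), (640, 480))
```

The resulting box could then be used to slice the frame before passing the patch to an image classifier.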

Authors: Zhang Xueqi; Hu Haiyang; Pan Kailai; Li Zhongjin (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China)

Affiliation: School of Computer Science and Technology, Hangzhou Dianzi University

Source: Journal of Image and Graphics (中国图象图形学报), CSCD, Peking University Core Journal, 2024, No. 5, pp. 1392-1407 (16 pages)

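Funding: National Natural Science Foundation of China (61572162, 61802095); Zhejiang Provincial Key Research and Development Program "Leading Goose" Project (2023C01145); Zhejiang Provincial Natural Science Foundation (LQ17F020003).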

Keywords: action recognition; long short-term memory (LSTM); dual-view; adaptive view transformation; attention mechanism

Classification: TP399 [Automation and Computer Technology: Computer Application Technology]

