Gesture recognition technology enables machines to read human gestures and has significant application prospects in the fields of human-computer interaction and sign language translation.Existing researches usually us...Gesture recognition technology enables machines to read human gestures and has significant application prospects in the fields of human-computer interaction and sign language translation.Existing researches usually use convolutional neural networks to extract features directly from raw gesture data for gesture recognition,but the networks are affected by much interference information in the input data and thus fit to some unimportant features.In this paper,we proposed a novel method for encoding spatio-temporal information,which can enhance the key features required for gesture recognition,such as shape,structure,contour,position and hand motion of gestures,thereby improving the accuracy of gesture recognition.This encoding method can encode arbitrarily multiple frames of gesture data into a single frame of the spatio-temporal feature map and use the spatio-temporal feature map as the input to the neural network.This can guide the model to fit important features while avoiding the use of complex recurrent network structures to extract temporal features.In addition,we designed two sub-networks and trained the model using a sub-network pre-training strategy that trains the sub-networks first and then the entire network,so as to avoid the subnetworks focusing too much on the information of a single category feature and being overly influenced by each other’s features.Experimental results on two public gesture datasets show that the proposed spatio-temporal information encoding method achieves advanced accuracy.展开更多
The use of hand gestures can be the most intuitive human-machine interaction medium.The early approaches for hand gesture recognition used device-based methods.These methods use mechanical or optical sensors attached ...The use of hand gestures can be the most intuitive human-machine interaction medium.The early approaches for hand gesture recognition used device-based methods.These methods use mechanical or optical sensors attached to a glove or markers,which hinder the natural human-machine communication.On the other hand,vision-based methods are less restrictive and allow for a more spontaneous communication without the need of an intermediary between human and machine.Therefore,vision gesture recognition has been a popular area of research for the past thirty years.Hand gesture recognition finds its application in many areas,particularly the automotive industry where advanced automotive human-machine interface(HMI)designers are using gesture recognition to improve driver and vehicle safety.However,technology advances go beyond active/passive safety and into convenience and comfort.In this context,one of America’s big three automakers has partnered with the Centre of Pattern Analysis and Machine Intelligence(CPAMI)at the University of Waterloo to investigate expanding their product segment through machine learning to provide an increased driver convenience and comfort with the particular application of hand gesture recognition for autonomous car parking.The present paper leverages the state-of-the-art deep learning and optimization techniques to develop a vision-based multiview dynamic hand gesture recognizer for a self-parking system.We propose a 3D-CNN gesture model architecture that we train on a publicly available hand gesture database.We apply transfer learning methods to fine-tune the pre-trained gesture model on custom-made data,which significantly improves the proposed system performance in a real world environment.We adapt the architecture of end-to-end solution to expand the state-of-the-art video classifier from a single image as input(fed by monocular camera)to a Multiview 360 feed,offered by a six cameras module.Finally,we optimize the proposed solution to work on a limited resource embedded platform(Nvidia Jetson TX2)that is used by automakers for vehicle-based features,without sacrificing the accuracy robustness and real time functionality of the system.展开更多
基金This work was supported,in part,by the National Nature Science Foundation of China under grant numbers 62272236in part,by the Natural Science Foundation of Jiangsu Province under grant numbers BK20201136,BK20191401in part,by the Priority Academic Program Development of Jiangsu Higher Education Institutions(PAPD)fund.
文摘Gesture recognition technology enables machines to read human gestures and has significant application prospects in the fields of human-computer interaction and sign language translation.Existing researches usually use convolutional neural networks to extract features directly from raw gesture data for gesture recognition,but the networks are affected by much interference information in the input data and thus fit to some unimportant features.In this paper,we proposed a novel method for encoding spatio-temporal information,which can enhance the key features required for gesture recognition,such as shape,structure,contour,position and hand motion of gestures,thereby improving the accuracy of gesture recognition.This encoding method can encode arbitrarily multiple frames of gesture data into a single frame of the spatio-temporal feature map and use the spatio-temporal feature map as the input to the neural network.This can guide the model to fit important features while avoiding the use of complex recurrent network structures to extract temporal features.In addition,we designed two sub-networks and trained the model using a sub-network pre-training strategy that trains the sub-networks first and then the entire network,so as to avoid the subnetworks focusing too much on the information of a single category feature and being overly influenced by each other’s features.Experimental results on two public gesture datasets show that the proposed spatio-temporal information encoding method achieves advanced accuracy.
文摘The use of hand gestures can be the most intuitive human-machine interaction medium.The early approaches for hand gesture recognition used device-based methods.These methods use mechanical or optical sensors attached to a glove or markers,which hinder the natural human-machine communication.On the other hand,vision-based methods are less restrictive and allow for a more spontaneous communication without the need of an intermediary between human and machine.Therefore,vision gesture recognition has been a popular area of research for the past thirty years.Hand gesture recognition finds its application in many areas,particularly the automotive industry where advanced automotive human-machine interface(HMI)designers are using gesture recognition to improve driver and vehicle safety.However,technology advances go beyond active/passive safety and into convenience and comfort.In this context,one of America’s big three automakers has partnered with the Centre of Pattern Analysis and Machine Intelligence(CPAMI)at the University of Waterloo to investigate expanding their product segment through machine learning to provide an increased driver convenience and comfort with the particular application of hand gesture recognition for autonomous car parking.The present paper leverages the state-of-the-art deep learning and optimization techniques to develop a vision-based multiview dynamic hand gesture recognizer for a self-parking system.We propose a 3D-CNN gesture model architecture that we train on a publicly available hand gesture database.We apply transfer learning methods to fine-tune the pre-trained gesture model on custom-made data,which significantly improves the proposed system performance in a real world environment.We adapt the architecture of end-to-end solution to expand the state-of-the-art video classifier from a single image as input(fed by monocular camera)to a Multiview 360 feed,offered by a six cameras module.Finally,we optimize the proposed solution to work on a limited resource embedded platform(Nvidia Jetson TX2)that is used by automakers for vehicle-based features,without sacrificing the accuracy robustness and real time functionality of the system.