Grasp detection plays a critical role for robot manipulation.Mainstream pixel-wise grasp detection networks with encoder-decoder structure receive much attention due to good accuracy and efficiency.However,they usuall...Grasp detection plays a critical role for robot manipulation.Mainstream pixel-wise grasp detection networks with encoder-decoder structure receive much attention due to good accuracy and efficiency.However,they usually transmit the high-level feature in the encoder to the decoder,and low-level features are neglected.It is noted that low-level features contain abundant detail information,and how to fully exploit low-level features remains unsolved.Meanwhile,the channel information in high-level feature is also not well mined.Inevitably,the performance of grasp detection is degraded.To solve these problems,we propose a grasp detection network with hierarchical multi-scale feature fusion and inverted shuffle residual.Both low-level and high-level features in the encoder are firstly fused by the designed skip connections with attention module,and the fused information is then propagated to corresponding layers of the decoder for in-depth feature fusion.Such a hierarchical fusion guarantees the quality of grasp prediction.Furthermore,an inverted shuffle residual module is created,where the high-level feature from encoder is split in channel and the resultant split features are processed in their respective branches.By such differentiation processing,more high-dimensional channel information is kept,which enhances the representation ability of the network.Besides,an information enhancement module is added before the encoder to reinforce input information.The proposed method attains 98.9%and 97.8%in image-wise and object-wise accuracy on the Cornell grasping dataset,respectively,and the experimental results verify the effectiveness of the method.展开更多
In order to make artificial intelligence smarter by detecting user emotions, this project analyzes and determines the current type of human emotions through computer vision, semantic recognition and audio feature clas...In order to make artificial intelligence smarter by detecting user emotions, this project analyzes and determines the current type of human emotions through computer vision, semantic recognition and audio feature classification. In facial expression recognition, for the problems of large number of parameters and poor real-time performance of expression recognition methods based on deep learning, Wang Weimin and Tang Yang Z. et al. proposed a face expression recognition method based on multilayer feature fusion with light-weight convolutional networks, which uses an improved inverted residual network as the basic unit to build a lightweight convolutional network model. Based on this method, this experiment optimizes the traditional CNN MobileNet model and finally constructs a new model framework ms_model_M, which has about 5% of the number of parameters of the traditional CNN MobileNet model. ms_model_M is tested on two commonly used real expression datasets, FER-2013 and AffectNet, the accuracy of ms_model_M is 74.35% and 56.67%, respectively, and the accuracy of the traditional MovbliNet model is 74.11% and 56.48% in the tests of these two datasets. This network structure well balances the recognition accuracy and recognition speed of the model. For semantic emotion detection and audio emotion detection, the existing models and APIs are used in this experiment.展开更多
基金This work was supported by the National Natural Science Foundation of China(Nos.62073322 and 61633020)the CIE-Tencent Robotics X Rhino-Bird Focused Research Program(No.2022-07)the Beijing Natural Science Foundation(No.2022MQ05).
文摘Grasp detection plays a critical role for robot manipulation.Mainstream pixel-wise grasp detection networks with encoder-decoder structure receive much attention due to good accuracy and efficiency.However,they usually transmit the high-level feature in the encoder to the decoder,and low-level features are neglected.It is noted that low-level features contain abundant detail information,and how to fully exploit low-level features remains unsolved.Meanwhile,the channel information in high-level feature is also not well mined.Inevitably,the performance of grasp detection is degraded.To solve these problems,we propose a grasp detection network with hierarchical multi-scale feature fusion and inverted shuffle residual.Both low-level and high-level features in the encoder are firstly fused by the designed skip connections with attention module,and the fused information is then propagated to corresponding layers of the decoder for in-depth feature fusion.Such a hierarchical fusion guarantees the quality of grasp prediction.Furthermore,an inverted shuffle residual module is created,where the high-level feature from encoder is split in channel and the resultant split features are processed in their respective branches.By such differentiation processing,more high-dimensional channel information is kept,which enhances the representation ability of the network.Besides,an information enhancement module is added before the encoder to reinforce input information.The proposed method attains 98.9%and 97.8%in image-wise and object-wise accuracy on the Cornell grasping dataset,respectively,and the experimental results verify the effectiveness of the method.
文摘In order to make artificial intelligence smarter by detecting user emotions, this project analyzes and determines the current type of human emotions through computer vision, semantic recognition and audio feature classification. In facial expression recognition, for the problems of large number of parameters and poor real-time performance of expression recognition methods based on deep learning, Wang Weimin and Tang Yang Z. et al. proposed a face expression recognition method based on multilayer feature fusion with light-weight convolutional networks, which uses an improved inverted residual network as the basic unit to build a lightweight convolutional network model. Based on this method, this experiment optimizes the traditional CNN MobileNet model and finally constructs a new model framework ms_model_M, which has about 5% of the number of parameters of the traditional CNN MobileNet model. ms_model_M is tested on two commonly used real expression datasets, FER-2013 and AffectNet, the accuracy of ms_model_M is 74.35% and 56.67%, respectively, and the accuracy of the traditional MovbliNet model is 74.11% and 56.48% in the tests of these two datasets. This network structure well balances the recognition accuracy and recognition speed of the model. For semantic emotion detection and audio emotion detection, the existing models and APIs are used in this experiment.