In this study, we propose a feature-fusion network for pose estimation directly from RGB images, without any depth information. First, we introduce a two-stream architecture consisting of segmentation and regression streams. The segmentation stream processes the spatial embedding features and obtains the corresponding image crop. These features are further coupled with the image crop in the fusion network. Second, we use the efficient perspective-n-point (E-PnP) algorithm in the regression stream to extract robust spatial features between 3D and 2D keypoints. Finally, we perform iterative refinement with an end-to-end mechanism to improve the estimation performance. We conduct experiments on two public datasets, YCB-Video and the challenging Occluded-LineMOD. The results show that our method outperforms state-of-the-art approaches in both speed and accuracy.
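For concreteness, the following minimal sketch (ours, not the paper's implementation; the keypoints, intrinsics, and pose values are illustrative assumptions) shows how an EPnP solver of the kind used in the regression stream recovers a 6-DoF pose from 3D-2D keypoint correspondences, here via OpenCV's solvePnP with the SOLVEPNP_EPNP flag:

import numpy as np
import cv2

# Hypothetical model keypoints (3D, object frame, metres) and camera intrinsics.
object_points = np.array([[0.0, 0.0, 0.0],
                          [0.1, 0.0, 0.0],
                          [0.0, 0.1, 0.0],
                          [0.0, 0.0, 0.1],
                          [0.1, 0.1, 0.0],
                          [0.1, 0.0, 0.1]], dtype=np.float64)
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]], dtype=np.float64)

# Simulate the 2D keypoints a network would predict, by projecting the
# model points under a known ground-truth pose.
rvec_gt = np.array([[0.1], [-0.2], [0.3]])
tvec_gt = np.array([[0.05], [-0.03], [0.6]])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)

# EPnP recovers the rotation (as a Rodrigues vector) and the translation.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix of the estimated pose
print(ok, rvec.ravel(), tvec.ravel())  # close to rvec_gt / tvec_gt

In the paper's pipeline, the 2D keypoints would come from the network predictions rather than a synthetic projection, and the resulting pose would feed the end-to-end iterative refinement step.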
Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2021YFB1715900, the National Natural Science Foundation of China under Grant Nos. 12022117 and 61802406, the Beijing Natural Science Foundation under Grant No. Z190004, the Beijing Advanced Discipline Fund under Grant No. 115200S001, and by Alibaba Group through the Alibaba Innovative Research Program.