Funding: Supported by the Australian Indian Strategic Research Fund (Project AISRF53820).
Abstract:
Background: The lack of depth perception in medical imaging systems is a long-standing technological limitation of minimally invasive surgery. The ability to visualize anatomical structures in 3D can improve conventional arthroscopic surgery, as a full 3D semantic representation of the surgical site can directly improve surgeons’ ability. It also opens the possibility of intraoperative image registration with preoperative clinical records for the development of semi-autonomous and fully autonomous platforms. This study aimed to present a novel monocular depth prediction model that infers depth maps from a single color arthroscopic video frame.
Methods: We applied a novel technique that combines supervised and self-supervised loss terms and thus eliminates the drawbacks of each. It enabled the estimation of edge-preserving depth maps from a single untextured arthroscopic frame. The proposed image acquisition technique projected artificial textures onto the surface to improve the quality of the disparity maps obtained from stereo images. Moreover, by integrating attention-aware multi-scale feature extraction with global scene contextual constraints and multi-scale depth fusion, the model could predict reliable and accurate tissue depth at the surgical site that complies with the scene geometry.
Results: A total of 4,128 stereo frames from a knee phantom were used to train the network; during this pre-training stage, the network learned disparity maps from the stereo images. The fine-tuning phase used 12,695 knee arthroscopic stereo frames from cadaver experiments, along with their corresponding coarse disparity maps obtained by stereo matching. In a supervised fashion, the network learns the transformation from the left image to the disparity map, whereas the self-supervised loss term refines the coarse depth map by minimizing the reprojection, gradient, and structural dissimilarity losses. Together, our method produces high-quality 3D maps with minimal reprojection losses: 0.0004132 (structural similarity index), 0.00036120156 (L1 error distance), and 6.591908 × 10^(−5) (L1 gradient error distance).
Conclusion: Machine learning techniques for monocular depth prediction were studied to infer accurate depth maps from a single color arthroscopic video frame. Moreover, the study integrates a segmentation model; hence, 3D segmented maps are inferred that provide extended perception ability and tissue awareness.
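The abstract names the loss terms but does not give their formulas or weights, so the following is only a minimal sketch of how a supervised term (L1 distance to the coarse disparity map) might be combined with self-supervised reprojection, gradient, and structural dissimilarity terms. The function names, the equal weights `w_sup` and `w_self`, the single-window SSIM, and the rectified-stereo warping rule `left(x) = right(x - d)` are all assumptions made for illustration, not the authors' implementation. Plain Python lists are used to keep the example self-contained.

```python
import math

def warp_right_to_left(right, disparity):
    """Rebuild the left view by sampling right(y, x - d) with linear interpolation."""
    warped = []
    for row, disp_row in zip(right, disparity):
        w = len(row)
        out = []
        for x, d in enumerate(disp_row):
            src = min(max(x - d, 0.0), w - 1.0)  # clamp the source column to the image
            x0 = int(math.floor(src))
            x1 = min(x0 + 1, w - 1)
            frac = src - x0
            out.append((1.0 - frac) * row[x0] + frac * row[x1])
        warped.append(out)
    return warped

def mean_abs_diff(a, b):
    """L1 distance averaged over all pixels of two equally sized 2D lists."""
    n = sum(len(r) for r in a)
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n

def gradients(img):
    """Forward differences along x and y."""
    dx = [[r[x + 1] - r[x] for x in range(len(r) - 1)] for r in img]
    dy = [[img[y + 1][x] - img[y][x] for x in range(len(img[0]))]
          for y in range(len(img) - 1)]
    return dx, dy

def gradient_l1(a, b):
    """L1 distance between the image gradients of a and b."""
    ax, ay = gradients(a)
    bx, by = gradients(b)
    return mean_abs_diff(ax, bx) + mean_abs_diff(ay, by)

def ssim_global(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM over one window covering the whole image (assumed simplification)."""
    fa = [p for r in a for p in r]
    fb = [p for r in b for p in r]
    n = len(fa)
    mu_a, mu_b = sum(fa) / n, sum(fb) / n
    var_a = sum((p - mu_a) * (p - mu_a) for p in fa) / n
    var_b = sum((q - mu_b) * (q - mu_b) for q in fb) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(fa, fb)) / n
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def combined_loss(left, right, pred_disp, coarse_disp, w_sup=1.0, w_self=1.0):
    """Supervised L1 to the coarse disparity plus self-supervised image terms."""
    supervised = mean_abs_diff(pred_disp, coarse_disp)
    recon = warp_right_to_left(right, pred_disp)
    reprojection = mean_abs_diff(recon, left)
    grad = gradient_l1(recon, left)
    dssim = (1.0 - ssim_global(recon, left)) / 2.0  # structural dissimilarity
    return w_sup * supervised + w_self * (reprojection + grad + dssim)

if __name__ == "__main__":
    # Synthetic stereo pair with a constant disparity of 3 pixels.
    right = [[((7 * x + 3 * y) % 11) / 10.0 for x in range(16)] for y in range(8)]
    disp = [[3.0] * 16 for _ in range(8)]
    left = warp_right_to_left(right, disp)
    print(combined_loss(left, right, disp, disp))
```

In this toy setting the supervised term anchors the prediction to the stereo-matching output, while the image-based terms penalize any disparity that fails to reproduce the left view from the right one, which is the sense in which the self-supervised part can refine a coarse map.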
Funding: Supported by the National Natural Science Foundation of China (No. 61775048) and the Fundamental Research Funds for the Central Universities, China (No. ZDXMPY20180103).
Abstract: Crowd density estimation is, in general, a challenging task because head sizes vary widely within a crowd. Existing methods commonly use a multi-column convolutional neural network (MCNN) to adapt to this variation, which averages the response across areas of different density and introduces considerable noise into the density map. To address this problem, we propose a new method called the segmentation-aware prior network (SAPNet), which generates a high-quality, noise-free density map based on a coarse head-segmentation map. SAPNet is composed of two networks: a foreground-segmentation convolutional neural network (FS-CNN) as the front end and a crowd-regression convolutional neural network (CR-CNN) as the back end. Using only single-dot annotations, we generate ground-truth segmentation masks for the heads. Then, based on this ground truth, FS-CNN outputs a coarse head-segmentation map, which helps eliminate noise in regions of the density map that contain no people. Taking the head-segmentation map generated by the front end as input, CR-CNN performs accurate crowd counting and generates a high-quality density map. We demonstrate SAPNet on four datasets (ShanghaiTech, UCF-CC-50, WorldExpo’10, and UCSD) and show state-of-the-art performance on the ShanghaiTech Part B and UCF-CC-50 datasets.
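The abstract does not say how the head masks are derived from the dot annotations or how the segmentation map is fed to the regression network. As a rough illustration of the underlying idea only, the sketch below builds a density map from dots with unit-mass Gaussians (a common convention in crowd counting), builds a mask from fixed-radius discs around each dot, and multiplies a noisy density map by that mask to suppress background responses. The disc rule, `radius`, `sigma`, the multiplicative gating, and all function names are hypothetical and stand in for the learned FS-CNN and CR-CNN.

```python
import math

def head_mask_from_dots(dots, height, width, radius=6):
    """Binary mask with a disc around each (row, col) dot (assumed rule)."""
    mask = [[0] * width for _ in range(height)]
    for r, c in dots:
        for y in range(max(0, r - radius), min(height, r + radius + 1)):
            for x in range(max(0, c - radius), min(width, c + radius + 1)):
                if (y - r) ** 2 + (x - c) ** 2 <= radius ** 2:
                    mask[y][x] = 1
    return mask

def density_from_dots(dots, height, width, sigma=1.5):
    """Sum of Gaussians, each normalised on the grid to contribute a count of 1."""
    density = [[0.0] * width for _ in range(height)]
    for r, c in dots:
        kernel = [[math.exp(-((y - r) ** 2 + (x - c) ** 2) / (2.0 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        total = sum(map(sum, kernel))
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density

def apply_segmentation_prior(density, mask):
    """Zero the density wherever the mask marks background."""
    return [[d * m for d, m in zip(d_row, m_row)]
            for d_row, m_row in zip(density, mask)]

def crowd_count(density):
    """The estimated count is the integral (sum) of the density map."""
    return sum(map(sum, density))

if __name__ == "__main__":
    dots = [(10, 10), (20, 30)]  # two annotated heads
    height, width = 40, 50
    clean = density_from_dots(dots, height, width)
    # Simulate a regressor that leaks a small response onto every pixel.
    noisy = [[v + 0.001 for v in row] for row in clean]
    mask = head_mask_from_dots(dots, height, width)
    print(crowd_count(noisy), crowd_count(apply_segmentation_prior(noisy, mask)))
```

Because the count is the sum over the whole map, even a small background response accumulates over many empty pixels; restricting the map to head regions removes most of that contribution, which is the motivation the abstract gives for the segmentation prior.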