Abstract: Video description generates natural language sentences that describe the subject, verb, and objects of a target video. It has been used to help visually impaired people understand video content, and it plays an essential role in developing human-robot interaction. Dense video description is more difficult than simple video captioning because objects interact and events overlap. Deep learning is reshaping computer vision (CV) and natural language processing (NLP), and hundreds of deep learning models, datasets, and evaluation methods are available that could help close gaps in current research. This article addresses that need by evaluating state-of-the-art approaches, focusing on deep learning and machine learning for video captioning in dense environments. It reviews classic machine learning techniques, then presents deep learning models and details of benchmark datasets with their respective domains. It also reviews evaluation metrics, including Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), Word Mover's Distance (WMD), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE), with their pros and cons. Finally, the article lists future directions and proposes work on context enhancement using key scene extraction with object detection in a particular frame, in particular how to improve the context of a video description by detecting key frames through morphological image analysis. The paper additionally discusses a novel approach involving sentence reconstruction and context improvement through key frame object detection, which incorporates large language models to refine results. The final results come from enhancing the text generated by the proposed model: the predicted text is improved and objects are isolated using various keyframes, which identify the dense events occurring in the video sequence.
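As an illustration of BLEU, one of the metrics named in the abstract above, the following is a minimal sentence-level sketch. It assumes a single reference caption, whitespace tokenization, uniform n-gram weights, and no smoothing; the function names are hypothetical and not taken from the article, and published evaluations normally use a corpus-level implementation from an established toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU with uniform weights and a brevity penalty.
    Assumption: no smoothing, so any zero n-gram precision yields 0."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    # Penalize candidates that are shorter than the reference.
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "a man is riding a horse".split()
candidate = "a man rides a horse".split()
print(round(bleu(candidate, reference), 4))
```

The clipping step is what stops a caption such as "the the the" from scoring well against a reference that contains "the" only once.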
Funding: Supported by the National Natural Science Foundation of China (61671397).
Abstract: In the shape analysis community, decomposing a 3D shape into meaningful parts has become a topic of interest. 3D model segmentation is widely used in tasks such as shape deformation, partial shape matching, skeleton extraction, shape correspondence, shape annotation, and texture mapping. Numerous approaches have attempted to provide better segmentation solutions; however, most previous techniques use handcrafted features, which usually focus on a particular attribute of 3D objects and are therefore difficult to generalize. In this paper, we propose a three-stage approach that uses a multi-view recurrent neural network to automatically segment a 3D shape into visually meaningful sub-meshes. The first stage normalizes and scales a 3D model to fit within the unit sphere and renders the object from different views. Different viewpoints, however, are not necessarily associated with one another, and a 3D region can produce entirely different outcomes depending on the viewpoint. To address this, we run each view through a CNN (with shared weights) and a Bolster block to create a probability boundary map. The Bolster block models the area relationships between different views, which helps to improve and refine the data. In stage two, the feature maps generated in the previous step are correlated using a recurrent neural network to obtain compatible fine-detail responses for each view. Finally, a fully connected layer returns coherent edges, which are then back-projected onto the 3D object to produce the final segmentation. Experiments on the Princeton Segmentation Benchmark dataset show that the proposed method is effective for mesh segmentation tasks.
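The normalization in the first stage can be sketched as follows. This is an assumption-laden illustration rather than the authors' procedure: the abstract does not say how the model is centered, so the sketch centers on the vertex centroid (a bounding-box center would be an equally plausible choice), and the function name is hypothetical.

```python
import math

def normalize_to_unit_sphere(vertices):
    """Translate vertices so their centroid sits at the origin, then scale
    uniformly so the farthest vertex lies on the unit sphere.
    `vertices` is a non-empty list of (x, y, z) tuples."""
    n = len(vertices)
    # Assumption: center on the mean of the vertices.
    centroid = tuple(sum(v[i] for v in vertices) / n for i in range(3))
    centered = [tuple(v[i] - centroid[i] for i in range(3)) for v in vertices]
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    if radius == 0.0:
        return centered  # degenerate case: all vertices coincide
    return [tuple(c / radius for c in p) for p in centered]

# Corners of an axis-aligned cube with side length 2.
cube = [(x, y, z) for x in (0, 2) for y in (0, 2) for z in (0, 2)]
unit_cube = normalize_to_unit_sphere(cube)
print(max(math.sqrt(sum(c * c for c in p)) for p in unit_cube))
```

Uniform scaling keeps the proportions of the shape intact, so every rendered view sees the model at a comparable size regardless of its original units.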
Abstract: To investigate the robustness of face recognition algorithms under complicated variations of illumination, facial expression, and posture, the advantages and disadvantages of seven typical algorithms for extracting global and local features are studied through experiments on the Olivetti Research Laboratory database and on three other databases (subsets for illumination, expression, and posture, constructed by selecting images from several existing face databases). Based on these experimental results, this paper proposes two face recognition schemes that rely on the decision fusion of two-dimensional linear discriminant analysis (2DLDA) and local binary pattern (LBP) to raise recognition rates. In addition, the face is partitioned nonuniformly when computing its LBP histograms to improve performance. The experimental results show that the two kinds of features, 2DLDA and LBP, are complementary and verify the effectiveness of the proposed fusion algorithms.
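The LBP histogram feature mentioned in the abstract can be illustrated with the sketch below. It assumes the basic 3x3 operator with 8 neighbours and 256 bins, and it models the non-uniform partition as horizontal bands of unequal height; the abstract does not specify the LBP variant or the partition actually used, so both choices and all names here are hypothetical.

```python
# Offsets of the 8 neighbours, clockwise from the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(image):
    """Basic 3x3 LBP code for every interior pixel of a grayscale image
    given as a list of rows. A neighbour sets its bit when its value is
    greater than or equal to the centre pixel."""
    h, w = len(image), len(image[0])
    codes = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            centre = image[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(NEIGHBOURS):
                if image[y + dy][x + dx] >= centre:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes

def region_histograms(codes, row_bounds):
    """Concatenate one 256-bin histogram per horizontal band.
    row_bounds lists the band edges, e.g. [0, 2, 3] gives the bands
    [0, 2) and [2, 3); unequal heights give a non-uniform partition."""
    feature = []
    for top, bottom in zip(row_bounds, row_bounds[1:]):
        hist = [0] * 256
        for row in codes[top:bottom]:
            for code in row:
                hist[code] += 1
        feature.extend(hist)
    return feature

flat = [[5] * 5 for _ in range(5)]          # uniform 5x5 image
codes = lbp_codes(flat)                     # 3x3 grid of codes
feature = region_histograms(codes, [0, 2, 3])
print(len(feature), feature[255], feature[256 + 255])
```

Keeping a separate histogram per band preserves coarse spatial layout, which a single histogram over the whole face would discard.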