In recent years, scene understanding has gained popularity and significance due to fast-paced progress in computer vision techniques and technologies. The primary focus of computer-vision-based scene understanding is to label every pixel in an image with the category of the object it belongs to, which requires combining segmentation and detection in a single framework. Many successful computer vision methods have recently been developed to aid scene understanding for a variety of real-world applications. Scene understanding systems typically involve detection and segmentation of different natural and man-made entities. Much of the recent research has focused on "things" (well-defined objects with a characteristic shape, orientation, and size), with less attention given to "stuff" classes (amorphous regions that lack a clear shape, size, or other defining characteristics). Stuff regions describe many aspects of a scene, such as its type, situation, and environment, and can therefore be very helpful for scene understanding. Existing methods still face challenges of computational time, accuracy, and robustness across varying levels of scene complexity; a robust scene understanding method has to deal effectively with imbalanced class distributions, overlapping objects, fuzzy object boundaries, and poorly localized objects. The proposed method performs panoptic segmentation on the Cityscapes dataset. MobileNet-V2, pre-trained on ImageNet, is used as the backbone for feature extraction and is combined with the state-of-the-art encoder-decoder architecture of DeepLabV3+, with some customization and optimization. Atrous convolution and atrous spatial pyramid pooling are also utilized to make the method more accurate and robust. Very promising results have been achieved, indicating the potential of the proposed method for fast and reliable scene understanding.
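To make the described architecture concrete, the following is a minimal sketch of a MobileNet-V2 backbone feeding a DeepLabV3+-style head with atrous spatial pyramid pooling (ASPP). The layer sizes, dilation rates, and class count are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: ImageNet-pretrained MobileNet-V2 features + ASPP head + per-pixel classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated 3x3 convolutions."""
    def __init__(self, in_ch, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class MobileNetV2DeepLab(nn.Module):
    def __init__(self, num_classes=19):                 # 19 Cityscapes classes (assumed)
        super().__init__()
        self.backbone = mobilenet_v2(weights="IMAGENET1K_V1").features
        self.aspp = ASPP(in_ch=1280)                     # MobileNet-V2 output channels
        self.classifier = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        size = x.shape[-2:]
        feats = self.backbone(x)                         # low-resolution features
        out = self.classifier(self.aspp(feats))
        # Upsample logits back to input resolution for per-pixel prediction.
        return F.interpolate(out, size=size, mode="bilinear", align_corners=False)

model = MobileNetV2DeepLab()
logits = model(torch.randn(1, 3, 512, 1024))             # per-pixel class scores
```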
Realizing autonomy has been a hot research topic for autonomous vehicles in recent years. For a long time, most efforts toward this goal have concentrated on understanding the scenes surrounding the ego-vehicle (the autonomous vehicle itself). By completing low-level vision tasks, such as detection, tracking, and segmentation of the surrounding traffic participants, e.g., pedestrians, cyclists, and vehicles, the scenes can be interpreted. However, for an autonomous vehicle, low-level vision tasks are largely insufficient for comprehensive scene understanding. What have the scene participants done, what are they doing now, and what will they do next? Answering this deeper question is what steers vehicles toward truly full automation, just like human beings. With this in mind, this paper investigates the interpretation of traffic scenes in autonomous driving from an event-reasoning view. To reach this goal, we review the most relevant literature and the state of the art on scene representation, event detection, and intention prediction in autonomous driving. In addition, we discuss the open challenges and problems in this field and endeavor to provide possible solutions.
In this paper, we propose a Structure-Aware Fusion Network (SAFNet) for 3D scene understanding. Since 2D images provide more detailed appearance information while 3D point clouds convey more geometric information, fusing these two complementary modalities can improve the discriminative ability of a model. Fusion is a challenging task because 2D and 3D data are essentially different and have different formats. Existing methods first extract 2D multi-view image features, then aggregate them onto sparse 3D point clouds, and achieve superior performance; however, they ignore the structural relations between pixels and points and directly fuse the two modalities without adaptation. To address this, we propose a structural deep metric learning method on pixels and points to explore these relations and further utilize them to adaptively map the images and point clouds into a common canonical space for prediction. Extensive experiments on the widely used ScanNetV2 and S3DIS datasets verify the performance of the proposed SAFNet.
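The following is a minimal sketch of the general idea of projecting 2D pixel features and 3D point features into a shared embedding space and pulling matched pixel-point pairs together. The feature dimensions and the InfoNCE-style loss are assumptions for illustration, not SAFNet's exact formulation.

```python
# Sketch: project both modalities into a common space, then apply a metric learning loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceProjector(nn.Module):
    def __init__(self, pixel_dim=64, point_dim=32, embed_dim=128):
        super().__init__()
        self.pixel_proj = nn.Linear(pixel_dim, embed_dim)
        self.point_proj = nn.Linear(point_dim, embed_dim)

    def forward(self, pixel_feats, point_feats):
        # Normalize so similarities below are cosine similarities.
        z_pix = F.normalize(self.pixel_proj(pixel_feats), dim=-1)
        z_pts = F.normalize(self.point_proj(point_feats), dim=-1)
        return z_pix, z_pts

def structural_metric_loss(z_pix, z_pts, temperature=0.07):
    # InfoNCE-style objective: the i-th pixel embedding should match the i-th point.
    logits = z_pix @ z_pts.t() / temperature
    targets = torch.arange(z_pix.size(0))
    return F.cross_entropy(logits, targets)

proj = CommonSpaceProjector()
z_pix, z_pts = proj(torch.randn(1024, 64), torch.randn(1024, 32))
loss = structural_metric_loss(z_pix, z_pts)
```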
The analysis of overcrowded areas is essential for flow monitoring, assembly control, and security. The primary goal of crowd counting is to estimate the population in a given region, which requires real-time analysis of congested scenes for prompt reactionary actions. Crowds are inherently unpredictable, and the available benchmark datasets exhibit a lot of variation, which limits the performance of trained models on unseen test data. In this paper, we propose an end-to-end deep neural network that takes an input image and generates a density map of the crowd scene. The proposed model consists of encoder and decoder networks that use batch-free normalization layers known as evolving normalization (EvoNorm). This allows the network to generalize to unseen data because EvoNorm does not rely on statistics computed from the training samples. The decoder network uses dilated 2D convolutional layers to provide large receptive fields with fewer parameters, which enables real-time processing and mitigates the density drift problem. Five benchmark datasets are used in this study to assess the proposed model, and the results show that it outperforms conventional models.
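As a concrete illustration of the decoder idea, here is a minimal sketch of a dilated-convolution decoder that turns encoder features into a single-channel density map; the predicted crowd count is the sum over that map. Channel sizes and dilation rates are assumed for illustration, not the paper's exact architecture.

```python
# Sketch: dilated-conv decoder -> density map -> count = sum of the map.
import torch
import torch.nn as nn

class DilatedDecoder(nn.Module):
    def __init__(self, in_ch=512):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),                      # 1-channel crowd density map
        )

    def forward(self, feats):
        return self.layers(feats)

decoder = DilatedDecoder()
density = decoder(torch.randn(1, 512, 48, 64))         # assumed encoder output
predicted_count = density.sum(dim=(1, 2, 3))            # estimated people per image
```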
Since the fully convolutional network achieved great success in semantic segmentation, many works have been proposed to extract discriminative pixel representations. However, the authors observe that existing methods still suffer from two typical challenges: (i) the intra-class feature variation between different scenes may be large, making it difficult to maintain consistency between same-class pixels from different scenes; (ii) the inter-class feature distinction within the same scene can be small, limiting the ability to distinguish different classes in each scene. The authors first rethink semantic segmentation from the perspective of similarity between pixels and class centers. Each weight vector of the segmentation head represents its corresponding semantic class over the whole dataset and can be regarded as the embedding of that class center; pixel-wise classification therefore amounts to computing the similarity in the final feature space between pixels and the class centers. Under this novel view, the authors propose a Class Center Similarity (CCS) layer to address the above challenges by generating adaptive class centers conditioned on each scene and supervising the similarities between class centers. The CCS layer utilises an Adaptive Class Center Module to generate class centers conditioned on each scene, which adapts to the large intra-class variation between different scenes. A specially designed Class Distance Loss (CD Loss) is introduced to control both inter-class and intra-class distances based on the predicted center-to-center and pixel-to-center similarities. Finally, the CCS layer outputs the processed pixel-to-center similarity as the segmentation prediction. Extensive experiments demonstrate that the model performs favourably against state-of-the-art methods.
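The "classification as pixel-to-center similarity" view can be sketched in a few lines: class centers are vectors in the feature space, and per-pixel logits are similarities between pixel embeddings and those centers. Using cosine similarity with a fixed scale is an assumption here, not necessarily the CCS layer's exact choice.

```python
# Sketch: per-pixel logits = similarity between pixel features and class-center embeddings.
import torch
import torch.nn.functional as F

def pixel_to_center_similarity(pixel_feats, class_centers, scale=10.0):
    """
    pixel_feats:   (B, C, H, W) final feature map
    class_centers: (K, C) one embedding per semantic class
    returns:       (B, K, H, W) similarity logits used as the prediction
    """
    B, C, H, W = pixel_feats.shape
    px = F.normalize(pixel_feats.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    ctr = F.normalize(class_centers, dim=-1)                          # (K, C)
    sim = scale * px @ ctr.t()                                        # (B, HW, K)
    return sim.transpose(1, 2).reshape(B, -1, H, W)

logits = pixel_to_center_similarity(torch.randn(2, 256, 64, 128),
                                    torch.randn(19, 256))
pred = logits.argmax(dim=1)                                           # per-pixel class
```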
Background In this study, we propose a novel 3D scene graph prediction approach for scene understanding from point clouds. Methods The approach automatically organizes the entities of a scene in a graph, where objects are nodes and their relationships are modeled as edges. More specifically, we employ DGCNN to capture the features of objects and their relationships in the scene. A Graph Attention Network (GAT) is introduced to exploit latent features obtained from the initial estimation and further refine the object arrangement in the graph structure. A loss function modified from cross-entropy with a variable weight is proposed to handle the multi-category problem in predicting objects and predicates. Results Experiments reveal that the proposed approach performs favorably against state-of-the-art methods in terms of predicate classification and relationship prediction, and achieves comparable performance on object classification. Conclusions The 3D scene graph prediction approach can form an abstract description of the scene space from point clouds.
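A minimal sketch of the kind of variably weighted cross-entropy the abstract alludes to is shown below; the inverse-frequency weighting scheme and the predicate count are assumed examples, not necessarily the paper's exact formulation.

```python
# Sketch: cross-entropy whose per-class weights vary with class frequency in the batch.
import torch
import torch.nn.functional as F

def variable_weight_cross_entropy(logits, targets, num_classes):
    # Weight each class inversely to its frequency in the current batch.
    counts = torch.bincount(targets, minlength=num_classes).float()
    weights = 1.0 / counts.clamp(min=1.0)
    weights = weights / weights.sum() * num_classes        # keep weights on a sane scale
    return F.cross_entropy(logits, targets, weight=weights)

logits = torch.randn(32, 27)            # e.g., 27 predicate categories (assumed)
targets = torch.randint(0, 27, (32,))
loss = variable_weight_cross_entropy(logits, targets, num_classes=27)
```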
Video captioning aims at automatically generating a natural language caption to describe the content of a video. However, most existing video captioning methods ignore the relationships between objects in the video and the correlation between multimodal features, as well as the effect of caption length on the task. This study proposes a novel video captioning framework (ORMF) based on an object relation graph and multimodal feature fusion. ORMF uses the similarity and spatio-temporal relationships of objects in the video to construct an object relation graph and introduces a graph convolutional network (GCN) to encode the object relations. At the same time, ORMF constructs a multimodal feature fusion network to learn the relationships between different modalities and fuse their features. Furthermore, the proposed model computes a caption length loss, encouraging captions that carry richer information. Experimental results on two public datasets (Microsoft video description corpus [MSVD] and Microsoft research video-to-text [MSR-VTT]) demonstrate the effectiveness of the method.
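The object-relation encoding step can be illustrated with a single graph convolution layer: nodes are per-object features and the adjacency matrix holds pairwise similarity weights. The dimensions and the similarity-based adjacency below are illustrative assumptions, not ORMF's exact design.

```python
# Sketch: one GCN layer over an object relation graph built from pairwise similarity.
import torch
import torch.nn as nn

class RelationGCNLayer(nn.Module):
    def __init__(self, in_dim=512, out_dim=512):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # Row-normalize the adjacency so each node averages over its neighbors.
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.linear(adj @ node_feats))

objects = torch.randn(10, 512)                              # 10 detected objects per clip
similarity = torch.softmax(objects @ objects.t(), dim=-1)   # relation weights as adjacency
encoded = RelationGCNLayer()(objects, similarity)           # relation-aware object features
```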
The recent development in autonomous driving involves high-level computer vision and detailed road scene understanding. Today, most autonomous vehicles employ expensive, high-quality sensor sets such as light detection and ranging (LIDAR) and HD maps with high-level annotations. In this paper, we propose a scalable and affordable data collection and annotation framework, image-to-map annotation proximity (I2MAP), for affordance learning in autonomous driving applications. We provide a new driving dataset built with the proposed framework for driving-scene affordance learning by calibrating the data samples with available tags from online databases such as OpenStreetMap (OSM). Our benchmark consists of 40,000 images with more than 40 affordance labels under various times of day and weather conditions, including very challenging heavy snow. We implement sample advanced driver-assistance system (ADAS) functions by training neural networks (NN) on our data and cross-validating the results on benchmarks such as KITTI and BDD100K, which indicates the effectiveness of our framework and trained models.
Rich semantic information in natural language increases team efficiency in human collaboration, reduces dependence on high-precision data, and improves adaptability to dynamic environments. We propose a semantic-centered cloud control framework for a cooperative multi-unmanned ground vehicle (UGV) system. First, semantic modeling of tasks and the environment is implemented with an ontology to build a unified conceptual architecture. Second, a scene semantic information extraction method combining deep learning and semantic web rule language (SWRL) rules is used to realize scene understanding and task-level cloud cooperation. Finally, simulation results show that the framework is a feasible way to enable autonomous unmanned systems to conduct cooperative tasks.
Taking Digital Logic Design, a professional foundation course for undergraduates in the School of Computer Science at Harbin Institute of Technology, as an example, we propose a new teaching model of scenario comprehension and practical, progressive teaching. It responds to several difficult problems in undergraduate teaching: the teaching target has shifted to first-year students with no prior foundation and a low starting point, class time has been compressed, and yet the quality of the course and of student training must be improved at the same time. With the help of MOOCs to implement blended teaching, effective means such as lowering the threshold, raising interest, building foundations, and progressive improvement are adopted to help freshmen challenge themselves and move to a higher starting point. This paper is a useful exploration of a new model of high-quality teaching in hardware courses for junior undergraduates.
This paper tackles the high computational/space complexity associated with multi-head self-attention (MHSA) in vanilla vision transformers. To this end, we propose hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches, as commonly done, and each patch is viewed as a token. The proposed H-MHSA then learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies for the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representational capacity. Since attention is calculated for only a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of hierarchical-attention-based transformer networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
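The two-step idea can be sketched as follows: attention is first computed inside small local windows of tokens, the windows are then pooled into a few coarse tokens, and a second attention models global dependencies before the two are aggregated. Tensor shapes, window size, and the average-pooling merge are assumptions for illustration, not the HAT-Net implementation.

```python
# Sketch: local window attention, merge windows, global attention, aggregate.
import torch
import torch.nn as nn

embed_dim, num_heads, window = 64, 4, 16
local_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
global_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

tokens = torch.randn(1, 1024, embed_dim)             # 32x32 patch tokens (assumed)

# 1) Local attention: split into windows so each attention is only over 16 tokens.
windows = tokens.reshape(-1, window, embed_dim)       # (64 windows, 16, C)
local, _ = local_attn(windows, windows, windows)

# 2) Global attention: merge each window into one coarse token by average pooling.
merged = local.mean(dim=1).unsqueeze(0)               # (1, 64, C)
global_out, _ = global_attn(merged, merged, merged)

# 3) Aggregate: broadcast the coarse global context back onto the local tokens.
out = local.reshape(1, 1024, embed_dim) + global_out.repeat_interleave(window, dim=1)
```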
The long-term goal of artificial intelligence (AI) is to make machines learn and think like human beings. Due to the high levels of uncertainty and vulnerability in human life and the open-ended nature of the problems humans face, no matter how intelligent machines become, they cannot completely replace humans. Therefore, it is necessary to introduce human cognitive capabilities or human-like cognitive models into AI systems to develop a new form of AI, that is, hybrid-augmented intelligence. This form of AI or machine intelligence is a feasible and important development model. Hybrid-augmented intelligence can be divided into two basic models: one is human-in-the-loop augmented intelligence with human-computer collaboration, and the other is cognitive-computing-based augmented intelligence, in which a cognitive model is embedded in the machine learning system. This survey describes a basic framework for human-computer collaborative hybrid-augmented intelligence and the basic elements of hybrid-augmented intelligence based on cognitive computing. These elements include intuitive reasoning, causal models, and the evolution of memory and knowledge, especially the role and basic principles of intuitive reasoning for complex problem solving, and the cognitive learning framework for visual scene understanding based on memory and reasoning. Several typical applications of hybrid-augmented intelligence in related fields are given.
The computer graphics and computer vision communities have been working closely together in recent years, and a variety of algorithms and applications have been developed to analyze and manipulate the visual media around us. There are three major driving forces behind this phenomenon: 1) the availability of big data from the Internet has created a demand for dealing with an ever-increasing, vast amount of resources; 2) powerful processing tools, such as deep neural networks, provide effective ways of learning how to deal with heterogeneous visual data; 3) new data capture devices, such as the Kinect, bridge the gap between algorithms for 2D image understanding and 3D model analysis. These driving forces have emerged only recently, and we believe that the computer graphics and computer vision communities are still at the beginning of their honeymoon phase. In this work we survey recent research on how computer vision techniques benefit computer graphics techniques and vice versa, covering research on analysis, manipulation, synthesis, and interaction. We also discuss existing problems and suggest possible further research directions.
Relation contexts have proved useful for many challenging vision tasks. In the field of 3D object detection, previous methods have taken advantage of context encoding, graph embedding, or explicit relation reasoning to extract relation contexts. However, redundant relation contexts inevitably exist due to noisy or low-quality proposals. In fact, invalid relation contexts usually indicate underlying scene misunderstanding and ambiguity, which may, on the contrary, reduce performance in complex scenes. Inspired by recent attention mechanisms such as the Transformer, we propose a novel 3D attention-based relation module (ARM3D). It encompasses object-aware relation reasoning to extract pair-wise relation contexts among qualified proposals and an attention module to distribute attention weights over different relation contexts. In this way, ARM3D can take full advantage of the useful relation contexts and filter out those that are less relevant or even confusing, which mitigates ambiguity in detection. We have evaluated the effectiveness of ARM3D by plugging it into several state-of-the-art 3D object detectors, showing more accurate and robust detection results. Extensive experiments show the capability and generalization of ARM3D on 3D object detection. Our source code is available at https://github.com/lanlan96/ARM3D.
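A minimal sketch of attention-weighted pair-wise relation contexts is given below: each proposal attends over relation features formed with every other proposal, so low-quality relations can receive small attention weights. The shapes and the simple MLP/softmax design are illustrative assumptions; ARM3D's actual module is more involved.

```python
# Sketch: pair-wise relation features + attention weights -> per-proposal relation context.
import torch
import torch.nn as nn

class SimpleRelationAttention(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.rel_mlp = nn.Linear(2 * feat_dim, feat_dim)   # encodes each proposal pair
        self.score = nn.Linear(feat_dim, 1)                 # attention logit per pair

    def forward(self, proposals):                            # (N, feat_dim)
        N, C = proposals.shape
        pairs = torch.cat([proposals.unsqueeze(1).expand(N, N, C),
                           proposals.unsqueeze(0).expand(N, N, C)], dim=-1)
        rel = torch.relu(self.rel_mlp(pairs))                # (N, N, C) relation contexts
        attn = torch.softmax(self.score(rel).squeeze(-1), dim=-1)  # (N, N) weights
        context = attn.unsqueeze(-1).mul(rel).sum(dim=1)     # (N, C) weighted context
        return proposals + context                           # relation-enriched proposals

enriched = SimpleRelationAttention()(torch.randn(64, 128))
```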
Human group activity recognition (GAR) has attracted significant attention from computer vision researchers due to its wide practical applications in security surveillance, social role understanding, and sports video analysis. In this paper, we give a comprehensive overview of the advances in group activity recognition in videos over the past 20 years. First, we provide a summary and comparison of 11 GAR video datasets in this field. Second, we survey group activity recognition methods, including those based on handcrafted features and those based on deep learning networks. For a better understanding of the pros and cons of these methods, we compare various models from the past to the present. Finally, we outline several challenging issues and possible directions for future research. From this comprehensive literature review, readers can obtain an overview of progress in group activity recognition for future studies.
The objective of this research is the rapid reconstruction of ancient buildings of historical importance from a single image. The key idea of our approach is to reduce the infinite solutions that might otherwise arise when recovering 3D geometry from 2D photographs. The main outcome of our research shows that the proposed methodology can be used to reconstruct ancient monuments for use as proxies for digital effects in applications such as tourism, games, and entertainment, which do not require very accurate modeling. In this article, we consider the reconstruction of ancient Mughal architecture, including the Taj Mahal. We propose a modeling pipeline that makes reconstruction easy using a single photograph taken from a single view, without the need to create complex point clouds from multiple images or to use laser scanners. First, an initial model is automatically reconstructed using locally fitted planar primitives along with their boundary polygons and the adjacency relations among parts of the polygons. This approach is faster and more accurate than creating a model from scratch because the initial reconstruction phase provides a set of structural information together with the adjacency relations, which makes it possible to estimate the approximate depth of the entire monument. Next, we use manual extrapolation and editing techniques with modeling software to assemble and adjust the different 3D components of the model. This research thus opens up the opportunity for the present generation to experience remote sites of architectural and cultural importance through virtual worlds and real-time mobile applications. Variations of a recreated 3D monument to represent an amalgam of various cultures are targeted for future work.
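As a generic illustration of the low-level step implied by "locally fitted planar primitives", the following sketch fits a single plane to 3D points with RANSAC. It is not the paper's pipeline; the thresholds and synthetic data are assumptions.

```python
# Sketch: RANSAC fit of one planar primitive to a 3D point set.
import numpy as np

def ransac_plane(points, iters=200, thresh=0.01, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = None, None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue                                    # skip degenerate (collinear) samples
        normal = normal / norm
        d = -normal.dot(sample[0])
        dist = np.abs(points @ normal + d)              # point-to-plane distances
        inliers = dist < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane, best_inliers

pts = np.random.rand(500, 3) * [1, 1, 0.005]            # roughly planar synthetic cloud
(plane_normal, plane_d), inlier_mask = ransac_plane(pts)
```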