Funding: Project (50808025) supported by the National Natural Science Foundation of China; Project (2013GK3012) supported by the Science and Technology Project of Hunan Province, China.
Abstract: A novel case-based reasoning method was proposed for suspicious behavior recognition. The method comprises three parts: human behavior decomposition, human behavior case representation, and case-based reasoning. The new approach decomposes a behavior into sub-behaviors that are easier to recognize, using a saliency-based visual attention model. A new representation of behavior was introduced, in which each sub-behavior and its associated time characteristic are used to represent a behavior case. In the case-based reasoning process, beyond the similarity of the basic sub-behaviors, an order factor was proposed to measure the similarity of the temporal order among sub-behaviors, and a span factor was used to measure the similarity of the duration of each sub-behavior, which makes the similarity calculation more rational and comprehensive. Experimental results show the effectiveness of the proposed method in comparison with related works, and it can run in real time for the recognition of suspicious behaviors.
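The abstract describes a similarity measure combining a basic sub-behavior match with an order factor and a span factor. The following is a minimal illustrative sketch of such a combined measure; the function name, the weights, and the specific formulas for each factor are hypothetical, since the paper's exact definitions are not given here. Sub-behavior labels are assumed unique within a case.

```python
import numpy as np

def case_similarity(query, case, w_base=0.5, w_order=0.25, w_span=0.25):
    """Hypothetical similarity between two behavior cases.

    Each case is a list of (sub_behavior_label, duration) pairs in
    temporal order. Labels are assumed unique within a case.
    """
    q_labels = [s for s, _ in query]
    c_labels = [s for s, _ in case]
    shared = set(q_labels) & set(c_labels)
    if not shared:
        return 0.0

    # Basic sub-behavior similarity: Jaccard overlap of the label sets.
    base = len(shared) / len(set(q_labels) | set(c_labels))

    # Order factor: fraction of shared-label pairs appearing in the
    # same relative temporal order in both cases.
    seq_q = [s for s in q_labels if s in shared]
    seq_c = [s for s in c_labels if s in shared]
    pairs = [(a, b) for i, a in enumerate(seq_q) for b in seq_q[i + 1:]]
    if pairs:
        same = sum(1 for a, b in pairs if seq_c.index(a) < seq_c.index(b))
        order = same / len(pairs)
    else:
        order = 1.0  # only one shared sub-behavior: no order evidence

    # Span factor: ratio-based similarity of durations for shared
    # sub-behaviors (1.0 when durations match exactly).
    q_dur, c_dur = dict(query), dict(case)
    span = float(np.mean([min(q_dur[s], c_dur[s]) / max(q_dur[s], c_dur[s])
                          for s in shared]))

    return w_base * base + w_order * order + w_span * span
```

With identical cases all three factors evaluate to 1, so the weighted sum is 1; fully disjoint cases score 0.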
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61876130 and 61932009).
Abstract: To boost research into cognition-level visual understanding, i.e., making accurate inferences based on a thorough understanding of visual details, visual commonsense reasoning (VCR) has been proposed. Compared with traditional visual question answering, which requires models to select correct answers, VCR requires models to select not only the correct answers but also the correct rationales. Recent research into human cognition has indicated that brain function, or cognition, can be considered a global and dynamic integration of local neuron connectivity, which is helpful in solving specific cognition tasks. Inspired by this idea, we propose a directional connective network to achieve VCR by dynamically reorganizing visual neuron connectivity that is contextualized using the meaning of questions and answers, and by leveraging directional information to enhance reasoning ability. Specifically, we first develop a GraphVLAD module to capture visual neuron connectivity and fully model visual content correlations. Then, a contextualization process is proposed to fuse sentence representations with visual neuron representations. Finally, based on the output of the contextualized connectivity, we propose directional connectivity, which includes a ReasonVLAD module, to infer answers and rationales. Experimental results on the VCR dataset and visualization analysis demonstrate the effectiveness of our method.
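The GraphVLAD and ReasonVLAD module names suggest VLAD-style aggregation of local descriptors. As a rough, hedged sketch of that general idea only (the actual GraphVLAD/ReasonVLAD designs are more involved and not specified in the abstract), a minimal soft-assignment VLAD pooling in NumPy looks like this; the function name and the softmax assignment are assumptions:

```python
import numpy as np

def vlad_pool(features, centers):
    """Minimal VLAD-style pooling sketch (illustrative only).

    features: (N, D) local visual descriptors.
    centers:  (K, D) cluster centers.
    Returns a flat (K * D,) L2-normalized descriptor.
    """
    # Soft-assign each descriptor to clusters via a softmax over
    # negative squared distances.
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    a = np.exp(-d2)
    a /= a.sum(axis=1, keepdims=True)

    # Aggregate assignment-weighted residuals per cluster.
    resid = features[:, None, :] - centers[None, :, :]   # (N, K, D)
    vlad = (a[:, :, None] * resid).sum(axis=0)           # (K, D)

    # Intra-normalize each cluster row, then L2-normalize the whole vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    out = vlad.ravel()
    return out / (np.linalg.norm(out) + 1e-12)
```

The intra-normalization step keeps any single cluster from dominating the pooled descriptor, a standard design choice in VLAD variants.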
Abstract: Automatically reading text in images has become an attractive research topic in computer vision. Specifically, end-to-end spotting of scene text has attracted significant research attention, and reasonably high accuracy has been achieved on several datasets. However, most existing works overlook the semantic connections between scene text instances and have limitations in situations such as occlusion, blurring, and unseen characters, which results in some semantic information being lost in the text regions. Relevance between texts generally exists within scene images: from the perspective of cognitive psychology, humans often combine nearby easy-to-recognize texts to infer an unidentifiable text. In this paper, we propose a novel graph-based method for enhancing intermediate semantic features, called Text Relation Networks. Specifically, we model the co-occurrence relationships of scene texts as a graph. The nodes of the graph represent the text instances in a scene image, and the corresponding semantic features serve as the node representations. The relative positions between text instances determine the weights of the edges in the established graph. A convolution operation is then performed on the graph to aggregate semantic information and enhance the intermediate features corresponding to text instances. We evaluate the proposed method through comprehensive experiments on several mainstream benchmarks and achieve highly competitive results. For example, on SCUT-CTW1500, our method surpasses the previous top works by 2.1% on the word spotting task.
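The graph described above has text instances as nodes, semantic features as node representations, and edge weights derived from relative positions, followed by a graph convolution. A minimal sketch of one such aggregation step is shown below; the Gaussian distance kernel, the row normalization, and the function name are assumptions, not the paper's actual Text Relation Networks formulation:

```python
import numpy as np

def relation_enhance(feats, centers, sigma=1.0):
    """Sketch of graph-based enhancement of text-instance features.

    feats:   (N, D) semantic features, one row per text instance.
    centers: (N, 2) instance center coordinates.
    Edge weights decay with pairwise distance; one propagation step
    mixes each node's features with those of its neighbors.
    """
    # Edge weights from pairwise center distances (Gaussian kernel;
    # w[i, i] = 1, so each node keeps part of its own feature).
    diff = centers[:, None, :] - centers[None, :, :]
    dist2 = (diff ** 2).sum(-1)
    w = np.exp(-dist2 / (2 * sigma ** 2))

    # Row-normalize so each node aggregates a weighted average.
    w /= w.sum(axis=1, keepdims=True)

    # One graph-convolution-like step: aggregate neighbor features.
    return w @ feats
```

Nearby instances therefore share semantic evidence, which is the mechanism the abstract credits for recovering occluded or blurred text from its easier-to-recognize neighbors.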