Real-time indoor camera localization is a significant problem in indoor robot navigation and surveillance systems. The scene can change during the image sequence, and such changes play a vital role in the localization performance of robotic applications in terms of accuracy and speed. This research proposes a real-time indoor camera localization system based on a recurrent neural network that detects scene changes during the image sequence. The proposed system is trained on an annotated image dataset and predicts the camera pose in real time. The system improves the localization performance of indoor cameras mainly by predicting the camera pose more accurately. It also recognizes scene changes during the sequence and evaluates their effects. The system achieves high accuracy and real-time performance. Scene change detection is performed using visual rhythm together with the proposed recurrent deep architecture, which carries out both camera pose prediction and scene change impact evaluation. Overall, this study proposes a novel real-time localization system for indoor cameras that detects scene changes and shows how they affect localization performance.
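The abstract does not give the network details; as a rough illustration of the recurrent pose-prediction idea, the following is a minimal PyTorch sketch in which a small CNN encoder feeds an LSTM that regresses a 7-DoF pose (3-D translation plus quaternion) per frame. All layer choices and sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RecurrentPoseNet(nn.Module):
    """Illustrative CNN+LSTM pose regressor (layer choices are assumptions)."""
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        # Lightweight per-frame encoder; the paper's backbone is not specified.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, feat_dim), nn.ReLU(),
        )
        # Recurrence over the image sequence.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # 3 translation + 4 quaternion components per frame.
        self.pose_head = nn.Linear(hidden_dim, 7)

    def forward(self, frames):            # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.lstm(feats)
        return self.pose_head(hidden)     # (B, T, 7) poses

poses = RecurrentPoseNet()(torch.randn(2, 8, 3, 128, 128))
print(poses.shape)  # torch.Size([2, 8, 7])
```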
Scene text detection is an important task in computer vision. In this paper, we present YOLOv5 Scene Text (YOLOv5ST), an optimized architecture based on YOLOv5 v6.0 tailored for fast scene text detection. Our primary goal is to enhance inference speed without sacrificing significant detection accuracy, thereby enabling robust performance on resource-constrained devices like drones, closed-circuit television cameras, and other embedded systems. To achieve this, we propose key modifications to the network architecture that lighten the original backbone and improve feature aggregation, including replacing standard convolution with depth-wise convolution, adopting the C2 sequence module in place of C3, employing Spatial Pyramid Pooling Global (SPPG) instead of Spatial Pyramid Pooling Fast (SPPF), and integrating a Bi-directional Feature Pyramid Network (BiFPN) into the neck. Experimental results demonstrate a remarkable 26% improvement in inference speed compared to the baseline, with only marginal reductions of 1.6% and 4.2% in mean average precision (mAP) at intersection over union (IoU) thresholds of 0.5 and 0.5:0.95, respectively. Our work represents a significant advancement in scene text detection, striking a balance between speed and accuracy that makes it well-suited for performance-constrained environments.
Funding: the National Natural Science Foundation of China (42075130) and Nari Technology Co., Ltd. (4561655965).
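The abstract's main speed lever is swapping standard convolutions for depth-wise ones. Below is a minimal PyTorch sketch of that substitution using the standard depth-wise + point-wise pair; the paper's exact block layout is an assumption, but the parameter saving it illustrates is the general mechanism.

```python
import torch
import torch.nn as nn

class DWConv(nn.Module):
    """Depth-wise separable replacement for a standard k x k convolution."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        # Depth-wise: one filter per input channel (groups=c_in).
        self.dw = nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False)
        # Point-wise: 1x1 convolution to mix channels.
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # YOLOv5's default activation

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

# Parameter comparison against a standard convolution of the same shape.
std = nn.Conv2d(128, 256, 3, padding=1, bias=False)
dws = DWConv(128, 256)
print(sum(p.numel() for p in std.parameters()))  # 294912
print(sum(p.numel() for p in dws.parameters()))  # 34432, roughly 8.6x fewer
```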
In recent years, images have played an increasingly important role in our daily life and social communication. To some extent, the textual information contained in pictures is an important factor in understanding the content of the scenes themselves: the more accurate the text detection in natural scenes, the more accurate our semantic understanding of the images. Thus, scene text detection has become a hot spot in the domain of computer vision. In this paper, we present a modified text detection network based on further research into, and improvement of, the Connectionist Text Proposal Network (CTPN) proposed by previous researchers. To extract deeper features that are less affected by differences between images, we use a Residual Network (ResNet) to replace the Visual Geometry Group Network (VGGNet) used in the original network. Meanwhile, to enhance the robustness of the model across multiple languages, we train on the multi-lingual scene text detection and script identification datasets (MLT) of the 2017 International Conference on Document Analysis and Recognition (ICDAR2017). In addition, an attention mechanism is used to obtain a more reasonable weight distribution. The proposed model achieves a 0.91 F1-score on the ICDAR2011 test set, outperforming CTPN trained on the same datasets by about 5%.
Funding: supported by the National Natural Science Foundation of China (Nos. U1536121, 61370195).
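The core architectural change here is replacing CTPN's VGG16 feature extractor with a ResNet. A minimal torchvision sketch of such a backbone swap, truncating ResNet-50 at its last residual stage to obtain a convolutional feature map as a VGG conv-stack stand-in; everything beyond the swap itself (weights, downstream CTPN head) is omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_backbone(name="resnet50"):
    """Return a feature extractor: VGG16 conv layers or ResNet without head."""
    if name == "vgg16":
        return models.vgg16(weights=None).features           # 512-channel output
    resnet = models.resnet50(weights=None)
    # Drop the average-pool and classifier; keep the convolutional stages.
    return nn.Sequential(*list(resnet.children())[:-2])      # 2048-channel output

x = torch.randn(1, 3, 224, 224)
print(build_backbone("vgg16")(x).shape)     # torch.Size([1, 512, 7, 7])
print(build_backbone("resnet50")(x).shape)  # torch.Size([1, 2048, 7, 7])
```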
Segmentation-based scene text detection has drawn a great deal of attention, as it can describe text instances of arbitrary shapes through pixel-level prediction. However, most segmentation-based methods suffer from complex post-processing to separate text instances that are close to each other, resulting in considerable time consumption during inference. In this paper, a label enhancement method is proposed to construct two kinds of training labels for segmentation-based scene text detection. The label distribution learning (LDL) method is used to overcome the problem that pure shrunk text labels may result in suboptimal detection performance. Experimental results on three benchmarks demonstrate that the proposed method consistently improves performance without sacrificing inference speed.
Funding: supported by ZTE Industry-University-Institute Cooperation Funds under Grant No. HC-CN-20200717012.
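The abstract contrasts hard shrunk-text labels with label distributions. A minimal NumPy sketch of the general idea follows: turning a binary text mask into a soft label map whose values decay with distance from the text region. The Gaussian decay form and its scale are assumptions, not the paper's exact label construction.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def soft_text_labels(mask, sigma=4.0):
    """Convert a binary text mask into a soft label distribution.

    Pixels inside the text keep label 1; outside, the label decays with
    Euclidean distance to the nearest text pixel, so pixels near the
    ambiguous text boundary get intermediate values instead of hard 0/1.
    """
    dist_outside = distance_transform_edt(mask == 0)   # distance to nearest text pixel
    soft = np.exp(-dist_outside**2 / (2 * sigma**2))
    return np.where(mask > 0, 1.0, soft)

mask = np.zeros((8, 8), dtype=np.uint8)
mask[3:5, 2:6] = 1                        # a small text region
print(soft_text_labels(mask).round(2))    # values fade out from 1.0 at the text
```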
This paper presents a novel system for violent scene detection, based on machine learning over visual and audio features. Multiple Kernel Learning (MKL) is applied so that the multimodality of videos can be fully exploited. The key feature of our system is a proposed mid-level concept clustering algorithm that learns mid-level concepts implicitly; with this algorithm, our system does not need manually tagged annotations. The whole system is trained on the dataset from the MediaEval 2013 Affect Task and evaluated with its official metric. The obtained results outperform the task's best score.
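MKL combines per-modality kernels with learned mixing weights. scikit-learn ships no MKL solver, so the sketch below is a simplified fixed-weight approximation rather than true MKL: precompute one RBF kernel per modality, mix them, and train an SVM on the combined kernel. The random stand-in features and the 0.5/0.5 weighting are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_vis = rng.normal(size=(100, 64))    # stand-in visual features
X_aud = rng.normal(size=(100, 32))    # stand-in audio features
y = rng.integers(0, 2, size=100)      # violent / non-violent labels

# One kernel per modality; MKL would learn the mixing weights,
# here they are fixed to 0.5/0.5 for simplicity.
K = 0.5 * rbf_kernel(X_vis) + 0.5 * rbf_kernel(X_aud)

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))  # training accuracy on the combined kernel
```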
Scene text detection is an important step in a scene text reading system. Two problems remain in existing text detection methods: (1) the small receptive field of shallow convolutional layers is not sufficiently sensitive to the target area in the image; (2) deep convolutional layers, with their large receptive fields, lose much of the spatial feature information. Therefore, detecting scene text remains a challenging issue. In this work, we design an effective text detector named Adaptive Multi-Scale HyperNet (AMSHN) to improve text detection performance. Specifically, AMSHN enhances the sensitivity to target semantics in shallow features with a new attention mechanism that strengthens the regions of interest in the image and weakens the regions of no interest. In addition, it reduces the loss of spatial features by fusing features along multiple paths, which significantly improves text detection performance. Experimental results on the Robust Reading Challenge on Reading Chinese Text on Signboard (ReCTS) dataset show that the proposed method achieves state-of-the-art results, which proves the ability of our detector in both specialized and general applications.
Funding: This work is supported by the National Natural Science Foundation of China (61872231, 61701297).
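The abstract describes an attention mechanism that re-weights shallow features to strengthen regions of interest. A minimal PyTorch sketch of one common way to do this is shown below: a sigmoid spatial-attention gate. The paper's actual mechanism is not specified, so this gate is an assumption standing in for it.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sigmoid gate that re-weights a feature map spatially."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv produces a single-channel saliency map.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        attn = torch.sigmoid(self.score(x))   # (B, 1, H, W), values in [0, 1]
        # Strengthen likely text regions, suppress everything else.
        return x * attn

feats = torch.randn(2, 64, 32, 32)            # shallow backbone features
out = SpatialAttention(64)(feats)
print(out.shape)                              # torch.Size([2, 64, 32, 32])
```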
Video understanding and content boundary detection are vital stages in video recommendation. However, previous content boundary detection methods require collecting information including location, cast, action, and audio, and if any of these elements is missing, the results may be adversely affected. To address this issue and effectively detect transitions in video content, in this paper we introduce a video classification and boundary detection method named JudPriNet. The focus of this paper is on objects in videos along with their labels, enabling automatic scene detection in video clips and establishing semantic connections among local objects in the images. As a significant contribution, JudPriNet presents a framework that maps labels to a "Continuous Bag of Visual Words" model to cluster labels and generates new standardized labels as video-type tags, which facilitates automatic classification of video clips. Furthermore, JudPriNet employs a Monte Carlo sampling method to classify video clips, treating the features of video clips as elements within the framework. The proposed method seamlessly integrates video and textual components without compromising training or inference speed. Through experimentation, we demonstrate that JudPriNet, with its semantic connections, is able to effectively classify videos alongside textual content. Our results indicate that, compared with several other detection approaches, JudPriNet excels in high-level content detection without disrupting the integrity of the video content, outperforming existing methods.
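The label-to-tag step amounts to clustering object-label embeddings into video-type tags. A minimal sketch under that reading follows; the toy 2-D embeddings and k-means are illustrative assumptions, since the paper's "Continuous Bag of Visual Words" mapping is not specified here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D embeddings for object labels seen in clips; a real
# system would use CBOW-style vectors learned from label co-occurrence.
labels = ["car", "truck", "road", "ball", "goal", "player"]
vectors = np.array([
    [0.9, 0.1], [1.0, 0.2], [0.8, 0.0],   # traffic-like labels
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],   # sports-like labels
])

# Cluster the labels into standardized video-type tags.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for tag in range(2):
    members = [l for l, c in zip(labels, km.labels_) if c == tag]
    print(f"video-type tag {tag}: {members}")
```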
Recently, segmentation-based scene text detection has drawn wide research interest due to its flexibility in describing scene text instances of arbitrary shapes, such as curved texts. However, existing methods usually need complex post-processing stages to handle ambiguous labels, i.e., the labels of pixels near the text boundary, which may belong to either the text or the background. In this paper, we present a framework for segmentation-based scene text detection that learns from ambiguous labels. We use the label distribution learning method to handle the label ambiguity of text annotations, which achieves good performance without an additional post-processing stage. Experiments on benchmark datasets demonstrate that our method produces better results than state-of-the-art methods for segmentation-based scene text detection.
Funding: supported by the National Key R&D Program of China (2018AAA0100104, 2018AAA0100100), the National Natural Science Foundation of China (Grant No. 61702095), and the Natural Science Foundation of Jiangsu Province (BK20211164).
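Learning from label distributions rather than hard masks typically means minimizing a divergence between predicted and target per-pixel distributions. A minimal PyTorch sketch of that training signal follows: KL divergence between a predicted text-probability map and a soft target map. The loss choice is an assumption, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def ldl_loss(logits, soft_target, eps=1e-6):
    """KL divergence between per-pixel (text, background) distributions.

    logits: (B, 1, H, W) raw scores; soft_target: (B, 1, H, W) in [0, 1],
    where boundary pixels carry intermediate values instead of hard 0/1.
    """
    p = torch.sigmoid(logits).clamp(eps, 1 - eps)
    q = soft_target.clamp(eps, 1 - eps)
    # KL(q || p) summed over the two-class distribution at each pixel.
    kl = q * (q / p).log() + (1 - q) * ((1 - q) / (1 - p)).log()
    return kl.mean()

logits = torch.randn(2, 1, 16, 16, requires_grad=True)
target = torch.rand(2, 1, 16, 16)    # soft labels near the text boundary
loss = ldl_loss(logits, target)
loss.backward()
print(float(loss))
```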
Scene text detection plays a significant role in various applications, such as object recognition, document management, and visual navigation. Instance segmentation based methods have been mostly used in existing research due to their advantages in dealing with multi-oriented texts. However, a large number of non-text pixels exist in the labels during model training, leading to text mis-segmentation. In this paper, we propose a novel multi-oriented scene text detection framework that includes two main modules: character instance segmentation (one instance corresponds to one character) and character flow construction (one character flow corresponds to one word). We use a feature pyramid network (FPN) to predict character and non-character instances with arbitrary directions. A joint network of FPN and bidirectional long short-term memory (BLSTM) is developed to explore the context information among isolated characters, which are finally grouped into character flows. Extensive experiments are conducted on the ICDAR2013, ICDAR2015, MSRA-TD500 and MLT datasets to demonstrate the effectiveness of our approach; the F-measures are 92.62%, 88.02%, 83.69% and 77.81%, respectively.
Funding: supported by the National Natural Science Foundation of China under Grant No. 61902435, the National Science and Technology Major Project of China under Grant No. 2018AAA0102102, the 111 Project under Grant No. B18059, and the Hunan Provincial Natural Science Foundation of China under Grant No. 2019JJ50808.
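Grouping isolated characters into word-level "flows" with a BLSTM amounts to running a bidirectional LSTM over a sequence of per-character feature vectors and scoring adjacent pairs for linkage. A minimal PyTorch sketch of that context-aggregation step follows; the feature sizes and the pairwise link-scoring head are assumptions.

```python
import torch
import torch.nn as nn

class CharFlowGrouper(nn.Module):
    """BLSTM over per-character features; scores whether adjacent
    characters belong to the same word-level character flow."""
    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        # Score a link from each pair of adjacent character states.
        self.link = nn.Linear(4 * hidden_dim, 1)

    def forward(self, chars):                 # chars: (B, N, feat_dim)
        ctx, _ = self.blstm(chars)            # (B, N, 2*hidden_dim)
        pairs = torch.cat([ctx[:, :-1], ctx[:, 1:]], dim=-1)
        return torch.sigmoid(self.link(pairs)).squeeze(-1)  # (B, N-1)

probs = CharFlowGrouper()(torch.randn(1, 6, 128))
print(probs)   # probability that character i links to character i+1
```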
Natural scene text recognition has important significance and value in the fields of image retrieval, autonomous navigation, human-computer interaction, and industrial automation. Two difficulties stand out: first, non-text content takes up a relatively high proportion of natural scene images; second, natural scene images have cluttered backgrounds and complex lighting conditions, angles, fonts, and colors. Therefore, how to efficiently extract text extremal regions from complex and varied natural scene images plays an important role in natural scene text recognition. In this paper, a Text extremum region Extraction algorithm based on Joint-Channels (TEJC) is proposed. On the one hand, it addresses the problem that the maximally stable extremal region (MSER) algorithm is only suitable for grayscale images and has difficulty processing color images. On the other hand, it addresses the problem that the MSER algorithm has high complexity and low accuracy when extracting the most stable extremal regions. The proposed algorithm is tested and evaluated on the ICDAR dataset; the experimental results demonstrate its superiority.
Funding: This work is supported by State Grid Shandong Electric Power Company Science and Technology Project Funding under Grant Nos. 520613180002 and 62061318C002, the Fundamental Research Funds for the Central Universities (Grant No. HIT.NSRIF.201714), the Weihai Science and Technology Development Program (2016DXGJMS15), and the Key Research and Development Program of Shandong Province (2017GGX90103).
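The joint-channels idea contrasts with classic MSER, which operates on a single gray channel. A minimal OpenCV sketch of the baseline it extends: running MSER per color channel and pooling the detected region boxes. This illustrates the multi-channel motivation only, not the TEJC algorithm itself.

```python
import cv2
import numpy as np

def multi_channel_mser(img_bgr):
    """Run MSER on each color channel and pool the region bounding boxes.

    Classic MSER works on one gray image; applying it per channel can
    recover regions that are stable in color but not in luminance.
    """
    mser = cv2.MSER_create()
    boxes = []
    for ch in cv2.split(img_bgr):             # B, G, R channels
        _, bboxes = mser.detectRegions(ch)
        boxes.extend(bboxes)                   # each box: (x, y, w, h)
    return boxes

# Synthetic test image with black text on a white background.
img = np.full((120, 240, 3), 255, np.uint8)
cv2.putText(img, "TEXT", (20, 80), cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 0, 0), 5)
print(len(multi_channel_mser(img)), "candidate regions")
```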