Funding: The researchers would like to thank the Deanship of Scientific Research, Qassim University, for funding the publication of this project.
Abstract: A tremendous number of vendor invoices are generated in the corporate sector. To automate the manual data entry for payable documents, highly accurate Optical Character Recognition (OCR) is required. This paper proposes an end-to-end OCR system that performs both localization and recognition and serves as a single unit to automate the processing of payable documents such as cheques and cash disbursements. For text localization, maximally stable extremal regions (MSER) are used to extract word or digit chunks from an invoice. Each chunk is then passed to a deep learning model that performs text recognition. The deep learning model combines convolutional neural networks (CNNs) and long short-term memory (LSTM) networks: the convolutional layers extract features, which are fed to the LSTM. The model integrates feature extraction, sequence modeling, and transcription into a unified network. It handles sequences of unconstrained length, independent of character segmentation or horizontal scale normalization. Furthermore, it applies to both lexicon-free and lexicon-based text recognition, and it yields a comparatively small model that can be deployed in practical applications. The overall superior performance in the experimental evaluation demonstrates the usefulness of the proposed model, which is generic and can be used in other, similar recognition scenarios.
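To make the described pipeline concrete, the following is a minimal sketch of the two-stage design: MSER-based chunk localization followed by a CRNN-style recognizer (CNN features fed to a bidirectional LSTM with a per-timestep classifier suitable for CTC training). It uses OpenCV and PyTorch; the layer widths, character set, and MSER settings are illustrative assumptions, not the paper's exact configuration.

```python
# A sketch of the two-stage pipeline: MSER localization + CRNN-style recognition.
# Requires: opencv-python, torch. All hyper-parameters are assumptions.
import cv2
import torch
import torch.nn as nn

def localize_chunks(gray_image):
    """Extract candidate word/digit chunks with MSER (default OpenCV settings)."""
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray_image)        # boxes: (x, y, w, h) per region
    return [gray_image[y:y + h, x:x + w] for (x, y, w, h) in boxes]

class CRNN(nn.Module):
    """CNN feature extractor + bidirectional LSTM + per-timestep classifier."""
    def __init__(self, num_classes, img_h=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                     # halve height only, keep width steps
        )
        self.rnn = nn.LSTM(256 * (img_h // 8), 256,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)         # num_classes includes a CTC blank

    def forward(self, x):                             # x: (B, 1, H, W) grayscale chunks
        f = self.cnn(x)                               # (B, C, H', W')
        f = f.permute(0, 3, 1, 2).flatten(2)          # one timestep per image column
        out, _ = self.rnn(f)
        return self.fc(out)                           # (B, W', num_classes) for CTC loss

model = CRNN(num_classes=37)                          # assumed charset: 0-9, a-z, plus blank
logits = model(torch.randn(2, 1, 32, 128))            # two dummy 32x128 chunks
print(logits.shape)                                   # torch.Size([2, 32, 37])
```

In this sketch, each localized chunk would be resized to a fixed height (32 pixels here) and passed through the network, with the per-column outputs decoded by a CTC decoder in either lexicon-free or lexicon-based mode.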
Funding: Supported by the National Natural Science Foundation of China (No. 61902027).
Abstract: End-to-end text spotting is a vital computer vision task that aims to integrate scene text detection and recognition into a unified framework. Typical methods rely heavily on region-of-interest (RoI) operations to extract local features and on complex post-processing steps to produce final predictions. To address these limitations, we propose TextFormer, a query-based end-to-end text spotter with a transformer architecture. Specifically, using one query embedding per text instance, TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multitask modeling. It allows mutual training and optimization of the classification, segmentation, and recognition branches, resulting in deeper feature sharing without sacrificing flexibility or simplicity. Additionally, we design an adaptive global aggregation (AGG) module to transfer global features into sequential features for reading arbitrarily shaped texts, which overcomes the sub-optimization problem of RoI operations. Furthermore, potential corpus information is utilized, from weak annotations to full labels, through mixed supervision, further improving text detection and end-to-end text spotting results. Extensive experiments on various bilingual (i.e., English and Chinese) benchmarks demonstrate the superiority of our method. In particular, on the TDA-ReCTS dataset, TextFormer surpasses the state-of-the-art method by 13.2% in terms of 1-NED.
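The abstract does not give architectural details, so the following is only an illustrative sketch of the query-based idea it describes: a fixed set of learned text queries attends to encoded image features through a transformer decoder, and the shared query outputs feed classification, segmentation, and recognition heads. All module sizes and head designs are assumptions, not TextFormer's actual implementation; the AGG module and mixed supervision are omitted.

```python
# Illustrative sketch of a query-based text spotter (not the authors' code).
# One learned query per potential text instance; shared decoder outputs feed
# classification, segmentation, and recognition heads. Sizes are assumptions.
import torch
import torch.nn as nn

class QueryTextSpotter(nn.Module):
    def __init__(self, d_model=256, num_queries=100,
                 num_chars=97, max_text_len=25):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)    # one query per instance
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.cls_head = nn.Linear(d_model, 2)                 # text / no-text
        self.mask_head = nn.Linear(d_model, d_model)          # mask embedding per query
        self.rec_head = nn.Linear(d_model, max_text_len * num_chars)
        self.max_text_len, self.num_chars = max_text_len, num_chars

    def forward(self, memory):               # memory: (B, H*W, d) encoder features
        B = memory.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        hs = self.decoder(q, memory)          # (B, num_queries, d) shared features
        masks = torch.einsum('bqd,bpd->bqp', self.mask_head(hs), memory)
        chars = self.rec_head(hs).view(B, -1, self.max_text_len, self.num_chars)
        return self.cls_head(hs), masks, chars

spotter = QueryTextSpotter()
cls, masks, chars = spotter(torch.randn(2, 40 * 40, 256))    # dummy 40x40 feature map
print(cls.shape, masks.shape, chars.shape)
```

The point of such a design is that detection and recognition read from the same per-query features, so the branches can be trained jointly without cropping RoIs or running separate post-processing per stage.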
Abstract: Reading text in images automatically has become an attractive research topic in computer vision. Specifically, end-to-end spotting of scene text has attracted significant research attention, and relatively ideal accuracy has been achieved on several datasets. However, most existing works overlook the semantic connections between scene text instances and have limitations in situations such as occlusion, blurring, and unseen characters, which result in the loss of semantic information in text regions. Relevance between texts generally exists within a scene image. From the perspective of cognitive psychology, humans often combine nearby easy-to-recognize texts to infer an unidentifiable text. In this paper, we propose a novel graph-based method for intermediate semantic feature enhancement, called Text Relation Networks. Specifically, we model the co-occurrence relationships of scene texts as a graph. The nodes of the graph represent the text instances in a scene image, and their semantic features are defined as the node representations. The relative positions between text instances determine the weights of the edges in the established graph. A convolution operation is then performed on the graph to aggregate semantic information and enhance the intermediate features corresponding to text instances. We evaluate the proposed method through comprehensive experiments on several mainstream benchmarks and obtain highly competitive results. For example, on SCUT-CTW1500, our method surpasses the previous top works by 2.1% on the word spotting task.
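The following is a minimal sketch of the graph-based enhancement idea, under stated assumptions: text instances are nodes, edge weights are derived from relative positions between instance centers, and a single graph-convolution step aggregates neighboring semantic features into a residual enhancement. The distance-to-weight mapping and feature dimension are illustrative, not the paper's exact formulation.

```python
# Sketch of graph-based enhancement of per-instance text features.
# Edge weights from pairwise distances; one graph-convolution step. Assumed settings.
import torch
import torch.nn as nn

def position_weights(centers, sigma=100.0):
    """Edge weights from pairwise distances between instance centers (x, y)."""
    dists = torch.cdist(centers, centers)          # (N, N) Euclidean distances
    adj = torch.exp(-dists / sigma)                # closer instances -> larger weight
    return adj / adj.sum(dim=1, keepdim=True)      # row-normalize the adjacency

class TextRelationLayer(nn.Module):
    """One graph-convolution step over per-instance semantic features."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats, adj):                 # feats: (N, dim), adj: (N, N)
        aggregated = adj @ self.proj(feats)        # weighted neighbour aggregation
        return torch.relu(feats + aggregated)      # residual enhancement of features

# Toy usage: 5 text instances with 256-d intermediate features (dummy values).
centers = torch.tensor([[10., 20.], [15., 22.], [200., 180.],
                        [210., 185.], [400., 60.]])
feats = torch.randn(5, 256)
enhanced = TextRelationLayer()(feats, position_weights(centers))
print(enhanced.shape)  # torch.Size([5, 256])
```

In this sketch, nearby instances contribute more to each node's enhanced representation, mirroring the intuition that easy-to-recognize neighbouring texts can help disambiguate an occluded or blurred one before the recognition head reads it.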