Abstract
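Document analysis and recognition (document recognition for short) converts unstructured document data (images, online handwriting) into structured data that computers can process and understand, and it has a very broad range of applications. Since the 1960s, research on document recognition methods and their applications has attracted wide attention and made great progress. Thanks to the development and application of deep learning, the performance of document recognition has improved rapidly, and the technology is now widely used in document digitization, bill and form processing, handwriting input, intelligent transportation, and document retrieval and information extraction. This article first introduces the background and technical scope of document recognition and reviews the history of the field. It then surveys the research carried out since the rise of deep learning, analyzes the shortcomings of current technology, and suggests research directions that deserve attention. The survey of the state of the art is organized by the main technical stages of document analysis and recognition (document image pre-processing, layout analysis, scene text detection, text recognition, structured symbol and graphics recognition, and document retrieval and information extraction). For each stage it briefly describes representative work on traditional methods and focuses on new progress in deep learning methods. Overall, the objects of current research are expanding in depth and breadth, processing methods have shifted comprehensively to deep neural network models and deep learning, recognition performance has improved greatly, and application scenarios keep expanding. Based on this analysis, the article points out that current technology still has clear shortcomings in recognition accuracy and reliability, interpretability, and learning ability and adaptability. Finally, it proposes research directions from three angles: improving performance, extending applications, and improving learning ability. For performance, the research issues include the reliability of text recognition, interpretability, omni-element recognition, the long-tail problem, multilingual documents, complex layout segmentation and understanding, and the analysis and recognition of distorted documents. Application extension covers both new applications (such as robotic process automation (RPA), text transcription, and archaeology) and new technical problems (semantic information extraction, cross-modal fusion, and application-oriented reasoning and decision-making). For learning ability, the relevant problems include small-sample learning, transfer learning, multi-task learning, domain adaptation, structured prediction, weakly supervised learning, self-supervised learning, open-set learning, and cross-modal learning.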
Document analysis and recognition (document recognition for short) aims to convert unstructured documents (typically document images and online handwriting) into structured text that computers can process and understand. It is needed in a wide range of applications because documents are so pervasive in communication and daily use. The field has attracted intensive attention and produced enormous progress in research and applications since the 1960s. In particular, the recent development of deep learning has boosted the performance of document recognition remarkably compared with traditional methods, and the technology has been applied successfully to document digitization, form processing, handwriting input, intelligent transportation, and document retrieval and information extraction. In this article, we first introduce the background and the techniques involved in document recognition and give an overview of the history of research (divided into four periods according to the objects of research, the methods, and the applications). We then review the main research progress, with emphasis on the deep learning based methods developed in recent years. After identifying the insufficiencies of current technology, we finally suggest some important issues for future research. The review of recent progress is divided into sections corresponding to the main processing steps, namely image pre-processing, layout analysis, scene text detection, text recognition, structured symbol and graphics recognition, and document retrieval and information extraction.
1) Due to the popularity of camera-captured document images, the main current task in image pre-processing is the rectification of distorted images, while binarization remains a concern. Recent methods are mostly end-to-end deep learning based transformation methods.
2) Layout analysis is divided into physical layout analysis (page segmentation) and logical layout analysis (semantic region segmentation and reading order prediction). Recent page segmentation methods based on fully convolutional networks (FCN) or graph neural networks (GNN) have shown promise. Logical layout analysis has been addressed by deep neural networks that fuse multi-modal information. Table structure analysis is a special task of layout analysis and has been studied intensively in recent years.
3) Scene text detection is a hot topic in both document analysis and computer vision. Deep learning based methods for text detection can be divided into regression-based, segmentation-based, and hybrid methods. FCN is widely used to extract visual features, on which models are built to predict text regions.
4) Text recognition is the core task in document analysis. We review recent work on handwritten text recognition and scene text recognition, which share some common strategies but also show different preferences. There are two main streams of methods: segmentation-based methods and sequence-to-sequence learning methods. The convolutional recurrent neural network (CRNN) model has received much attention in recent years and is being extended with respect to encoding, decoding, and learning strategies, while segmentation-based methods combined with deep learning still perform competitively. A noteworthy trend is the extension of text line recognition to page-level recognition. Following text recognition, we also review work on end-to-end scene text recognition (also called text spotting), in which text detection and recognition models are learned jointly.
5) Among the symbols and graphics in documents, mathematical expressions and flowcharts have received increasing attention. Recent methods for mathematical expression recognition are mostly image-to-markup generation methods using encoder-decoder models, while graph-based methods show promise in generating both recognition and segmentation results. Flowchart recognition is addressed with structured prediction models such as GNN.
6) Document retrieval focused mainly on keyword spotting in the pre-deep-learning era, while recent work focuses on information extraction (spotting semantic entities) by fusing layout and language information. Pre-trained layout and multi-modal language models are showing promise, although visual information is not yet considered adequately.
Overall, the recent progress shows that the objects of recognition have expanded in breadth and depth, the methods have shifted comprehensively to deep neural network models and deep learning, recognition performance has improved steadily, and the technology is being applied to an ever wider range of scenarios. The review also reveals the insufficiencies of current technology in accuracy and reliability on various tasks, in interpretability, and in learning ability and adaptability. Future work is suggested in three respects: performance improvement, application extension, and improved learning. Issues of performance improvement include the reliability of recognition, interpretability, omni-element recognition, long-tailed recognition, multilingual documents, complex layout analysis and understanding, and the recognition of distorted documents. Issues related to applications include new applications (such as robotic process automation (RPA), text transcription in natural scenes, and archaeology) and the new technical problems they involve (such as semantic information extraction, cross-modal fusion, and reasoning and decision-making tied to application scenarios). To improve automatic system design, learning ability, and adaptability, the learning problems and methods involved include small-sample learning, transfer learning, multi-task learning, domain adaptation, structured prediction, weakly supervised learning, self-supervised learning, open-set learning, and cross-modal learning.
Authors
刘成林
金连文
白翔
李晓辉
殷飞
Liu Chenglin; Jin Lianwen; Bai Xiang; Li Xiaohui; Yin Fei (State Key Laboratory of Multi-Modal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China; School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journal (北大核心)
2023, No. 8, pp. 2223-2252 (30 pages)
Journal of Image and Graphics
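Funding
National Natural Science Foundation of China (61936003, 61733007, 61721004)
Ministry of Science and Technology "Innovation 2030" Major Project on New Generation Artificial Intelligence (2020AAA0109702)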
Keywords
document analysis and recognition
document intelligence
layout analysis
text detection
text recognition
graphics and symbol recognition
document information extraction