Research progress in table recognition technology (cited by: 15)

A survey on table recognition technology

Abstract: Tables appear widely in scientific literature, financial statements, newspapers, magazines, and other documents. They store and present data compactly and contain a large amount of useful information. Table recognition is the basis for reusing table information, has important application value, and has long been one of the research hotspots in pattern recognition. With the development of deep learning, new studies and methods for table recognition have emerged in quick succession. However, because tables are used in a wide range of scenarios, come in many styles, and vary in image quality, many problems in table recognition remain to be solved. To better summarize previous work and support subsequent research, this paper focuses on three table recognition subtasks (table area detection, structure recognition, and content recognition) and reviews the history and latest progress of the field in China and abroad, covering both traditional methods and deep learning methods. It surveys the datasets and evaluation standards for table recognition and, based on mainstream datasets and standards, compares the performance of typical methods for table area detection, structure recognition, and table information extraction. It then compares the progress and level of research in China with that abroad. Finally, in light of the main difficulties and challenges the field currently faces, it offers an outlook on future research trends and technical development goals.

Extended abstract: Efficient data access and information extraction from massive data have become essential technologies. Tables are an efficient structure for organizing, displaying, and analyzing data, and they are widely used on the Internet and in vertical domains because of their simplicity and intuitiveness. When tables are carried by images or portable document format (PDF) files, structural information is lost, and it is difficult to recover the original tables. Manual entry is inefficient and error-prone. Research over the past two decades has therefore focused on automatically recognizing tables in documents or PDF files. Table recognition aims to automatically detect tables in images, PDFs, and other electronic files in order to obtain their structure and content and to extract specific information. It consists of three tasks: table area detection, table structure recognition, and table content recognition. Existing table recognition methods commonly fall into two types. One uses optical character recognition (OCR) to recognize the characters in the table directly and then analyzes the positions of the characters. The other uses digital image processing to obtain the key intersections and the positions of the frame lines of the table, and from these analyzes the relationships between cells. However, most of these methods apply only to a single domain and generalize poorly. They are also constrained by experience-based threshold design. Thanks to the development of deep learning, semantic segmentation, object detection, text sequence generation, pre-trained models, and related technologies have helped solve technical problems in table recognition. Most deep learning algorithms have been adapted to the characteristics of tables, which improves recognition results. Object detection algorithms are used for the table detection task. Object detection and text sequence generation algorithms are mainly used for table structure recognition. Pre-trained models have worked well for table content recognition. However, many table structure recognition algorithms still cannot handle borderless tables and tables with few ruling lines well. For table images captured in natural scenes, the relevant algorithms still struggle in practice because of the influence of brightness and inclination. A large number of datasets now provide sufficient data for training table recognition models and improve model performance. However, these datasets differ in annotation formats and evaluation metrics, which poses challenges. For table structure recognition, some datasets provide only the hyper text markup language (HTML) code of the structure, while others provide the locations of cells in the table and the corresponding row and column attributes. Some evaluation metrics are based on the positions or the content of cells. Others are based on the adjacency relationships between cells or on the edit distance between HTML codes. This paper critically reviews research on the three subtasks of table detection, structure recognition, and content recognition, and attempts to predict future research directions.
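The evaluation based on adjacency relationships between cells, mentioned in the abstract, can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `adjacency_relations` and `adjacency_f1` are hypothetical, and the sketch assumes a simple grid of cell texts with no spanning or blank cells, which real benchmark scorers handle with additional rules.

```python
from collections import Counter


def adjacency_relations(grid):
    """Collect (cell_text, neighbour_text, direction) relations from a grid.

    `grid` is a list of rows, each row a list of cell strings.
    Assumption: no row or column spans, so the neighbour of a cell is
    simply the next cell to the right or the cell directly below.
    """
    relations = Counter()
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            if c + 1 < len(row):
                relations[(text, row[c + 1], "horizontal")] += 1
            if r + 1 < len(grid) and c < len(grid[r + 1]):
                relations[(text, grid[r + 1][c], "vertical")] += 1
    return relations


def adjacency_f1(predicted, truth):
    """Return (precision, recall, F1) over matched adjacency relations."""
    pred = adjacency_relations(predicted)
    gold = adjacency_relations(truth)
    matched = sum((pred & gold).values())  # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction wrongly merges the second row.
truth = [["Name", "Qty"], ["Apple", "3"]]
predicted = [["Name", "Qty"], ["Apple 3"]]
print(adjacency_f1(predicted, truth))
```

In this toy case only the header relation ("Name" next to "Qty") survives the merge error, so both precision and recall drop. This is the sense in which such a metric scores structure rather than text alone.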

Authors: Gao Liangcai; Li Yibo; Du Lin; Zhang Xinpeng; Zhu Ziyi; Lu Ning; Jin Lianwen; Huang Yongshuai; Tang Zhi (Wangxuan Computer Institute, Peking University, Beijing 100871, China; Huawei AI Application Research Center, Huawei Technology Co., Ltd., Beijing 100085, China; School of Electronics and Information Engineering, South China University of Technology, Guangzhou 510640, China)

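Affiliations: Wangxuan Computer Institute, Peking University; Huawei AI Application Research Center, Huawei Technology Co., Ltd.; School of Electronics and Information Engineering, South China University of Technology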

Source: Journal of Image and Graphics (《中国图象图形学报》), CSCD, Peking University Core Journal, 2022, Issue 6, pp. 1898-1917 (20 pages)

Funding: Supported by the National Key Research and Development Program of China (2019YFB1406303).

Keywords: table area detection; table structure recognition; table content recognition; deep learning; table cell recognition; table information extraction

Classification number: TP391.4 [Automation and computer technology: computer application technology]
