Abstract
Mainstream automatic chart recognition algorithms are poorly suited to scanned images of large interlocking tables. To address this, this paper designs a connected-domain-based automatic chart recognition algorithm for scanned interlocking tables. Using image processing and a neural network, the algorithm recognizes the tables and text in a scan, reproduces them in full in a spreadsheet, and highlights characters suspected of being misrecognized, together with their cells, to ease manual review. The algorithm has three parts: preprocessing, positioning, and recognition. In the preprocessing part, the DN-OTSU and DNG-OTSU binarization algorithms are proposed; they binarize unevenly lit scans using mean filtering with different kernels, image division, and linear normalization. A tilt correction algorithm based on the Progressive Probabilistic Hough Transform (PPHT) is also proposed, which detects the tilt angle quickly and accurately. In the positioning part, a connected-domain-based algorithm locates table boxes and text regions, and an RS box check-and-fill algorithm based on chart features is proposed to ensure that the table is complete and that each cell is located accurately. In the recognition part, characters extracted from the original images are used to build a training set, on which a convolutional recurrent neural network (CRNN) is trained to a high accuracy. In the experiments, interlocking tables provided by several design institutes were tested. The results show a cell recognition accuracy of 92.8% and a character recognition accuracy of 98.74%, with the time per image from recognition to spreadsheet output within 5 seconds. The proposed algorithm offers high accuracy, good robustness, and fast recognition, and provides an effective technical route for converting scanned paper interlocking tables into electronic form for secondary development.
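The preprocessing idea named in the abstract (mean filtering, image division, linear normalization, then Otsu thresholding) can be illustrated with a minimal pure-Python sketch. This is not the paper's DN-OTSU or DNG-OTSU implementation, whose details the abstract does not give. The function names, the single kernel size `k=9`, and the synthetic test image are all assumptions made for illustration.

```python
def box_mean(img, k):
    """Mean filter with a k x k window; the window is clipped at the borders."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            total = sum(img[yy][xx] for yy in ys for xx in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


def divide_and_normalize(img, k=9):
    """Divide by the mean-filtered background, then rescale linearly to 0..255."""
    bg = box_mean(img, k)
    ratio = [[p / max(b, 1.0) for p, b in zip(prow, brow)]
             for prow, brow in zip(img, bg)]
    lo = min(min(row) for row in ratio)
    hi = max(max(row) for row in ratio)
    span = (hi - lo) or 1.0  # guard against a perfectly flat image
    return [[int(round(255 * (v - lo) / span)) for v in row] for row in ratio]


def otsu_threshold(img):
    """Return the gray level t that maximizes the between-class variance."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img, k=9):
    """Return a boolean mask that is True where a pixel is classified as ink."""
    norm = divide_and_normalize(img, k)
    t = otsu_threshold(norm)
    return [[v <= t for v in row] for row in norm]


def make_demo_image(h=20, w=40):
    """Synthetic scan: brightness rises left to right, dark strokes every 10 columns."""
    img = []
    for _ in range(h):
        row = []
        for x in range(w):
            bg = 80 + 3 * x  # uneven illumination
            row.append(int(0.3 * bg) if x % 10 == 5 else bg)
        img.append(row)
    return img
```

Calling `binarize(make_demo_image())` yields a mask that marks the stroke columns. Dividing by the local background is what makes a single global threshold usable despite the brightness gradient.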
Source
Computer Science and Application (《计算机科学与应用》)
2020, No. 10, pp. 1804-1819 (16 pages)
Keywords
Connected Domains
Interlocking Tables
Convolutional Recurrent Neural Network
Automatic Chart Recognition Algorithm