Abstract
Existing run-length-based algorithms for detecting table frame lines are fast, but they perform poorly on complex tables and can produce a large number of errors. This paper proposes a frame line detection algorithm based on hierarchical clustering of run-lengths. First, run-lengths that may belong to the same horizontal or vertical line are placed in one run-length group, and a similarity between two frame lines is defined. Then, taking the run-lengths in a group as the initial atomic classes, hierarchical clustering iteratively merges the two horizontal or vertical lines with the highest similarity into a single frame line. The iteration stops when the highest similarity between any two lines falls below a preset threshold or when only one line remains. To handle lines formed by text such as captions and explanatory paragraphs, the concept of a relative frame line is introduced, and segments that do not have two relative frame lines are deleted; finally, a second extraction pass is applied to the extracted frame lines. To speed up the algorithm, the run-length groups are clustered in parallel. Experimental results show that, compared with existing algorithms, the proposed algorithm improves the frame line recognition rate on some complex tables by more than 50%.
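The merge loop described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the similarity definition, so the `similarity` function, the `Segment` representation, the gap limits, and the threshold value below are all hypothetical. The thread pool only shows the per-group structure of the parallel step; in CPython a process pool would likely be needed for a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """A horizontal line candidate (hypothetical representation).

    Rows and columns are inclusive pixel coordinates.
    """
    row_min: int
    row_max: int
    col_start: int
    col_end: int


def similarity(a, b, max_row_gap=2, max_col_gap=5):
    """Assumed similarity in [0, 1]; NOT the definition used in the paper.

    Two segments are more similar the smaller the gap between their
    row ranges and between their column ranges.
    """
    row_gap = max(0, max(a.row_min, b.row_min) - min(a.row_max, b.row_max))
    col_gap = max(0, max(a.col_start, b.col_start) - min(a.col_end, b.col_end))
    if row_gap > max_row_gap or col_gap > max_col_gap:
        return 0.0
    return (1 - row_gap / (max_row_gap + 1)) * (1 - col_gap / (max_col_gap + 1))


def merge(a, b):
    """Merge two segments into their bounding segment."""
    return Segment(min(a.row_min, b.row_min), max(a.row_max, b.row_max),
                   min(a.col_start, b.col_start), max(a.col_end, b.col_end))


def cluster_group(runs, threshold=0.3):
    """Agglomerative clustering of one run-length group.

    Repeatedly merges the most similar pair; stops when the best
    similarity drops below `threshold` or only one line is left.
    """
    lines = list(runs)
    while len(lines) > 1:
        i, j = max(combinations(range(len(lines)), 2),
                   key=lambda p: similarity(lines[p[0]], lines[p[1]]))
        if similarity(lines[i], lines[j]) < threshold:
            break
        merged = merge(lines[i], lines[j])
        lines = [s for k, s in enumerate(lines) if k not in (i, j)] + [merged]
    return lines


def cluster_all(groups, threshold=0.3):
    """Cluster every run-length group independently (one task per group)."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda g: cluster_group(g, threshold), groups))
```

For example, calling `cluster_group` on three broken pieces of one line near row 10 plus a separate line at row 40 merges the three pieces into a single segment and leaves the distant line untouched, because its similarity to everything else is 0 and therefore below the threshold.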
Authors
白伟
崔喆
BAI Wei; CUI Zhe (Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan 610041, China; University of Chinese Academy of Sciences, Beijing 100049, China)
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2018, Issue A01, pp. 179-182 (4 pages)
Funding
Sichuan Province Science and Technology Support Program (2015GZ0088)
"West Light" Joint Scholars Project
Keywords
table recognition
frame line detection
run-length of table line
hierarchical clustering