Abstract
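The most important part of document image understanding is the extraction of logical structure. Current research focuses mainly on page layout analysis, and the few studies of document logical structure address only single-page documents or multi-page documents whose page relationships are simple. Construction tender documents are distinctive in that their hierarchical logical structure is not marked by explicit index information. This paper proposes a method that obtains a document's logical structure from the citation relationships between pages. The method represents the logical structure as a modified tree; building the logical tree is itself the process of obtaining the logical structure, and the representation also supports higher-level semantic processing and reconstructed output. The method has been implemented in an automatic tender processing system, where it ensures the system's flexibility and efficiency.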
The most important part of document image understanding is extracting the logical structure of the document. Currently, most research focuses on layout analysis, and the little work on logical structure addresses only single-page documents or multi-page documents with simple logical structure. The notable characteristic of construction tender documents is that their hierarchical architecture is not stated explicitly but is implied in citing information. In this paper, a new method for extracting document logical structure that makes use of this citing information is presented. The hierarchy of the tender documents themselves guides the extraction of their logical structures, which are displayed as a modified tree structure. The creation of the logical tree corresponds to the procedure of logical structure extraction. Such a data structure is useful for higher-level semantic processing and reconstruction. This method, which ensures the efficiency and flexibility of the whole system, has been successfully implemented in VHTender, an automatic tender processing system.
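The idea of building a logical tree from citing information can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify how the tree is "modified" or how citations are detected, so the `LogicalNode` class, the `citations` mapping (page id to the page ids it cites), and the rule that a page is attached under the first page that cites it are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LogicalNode:
    """One page (or logical unit) in the hypothetical logical tree."""
    page_id: str
    children: list[LogicalNode] = field(default_factory=list)


def build_logical_tree(citations: dict[str, list[str]], root_id: str) -> LogicalNode:
    """Build a tree by following citations depth-first from the root page.

    Assumption: a page cited by several pages is attached only under the
    first page that reaches it, so the result stays a tree.
    """
    visited: set[str] = set()

    def visit(page_id: str) -> LogicalNode:
        visited.add(page_id)
        node = LogicalNode(page_id)
        for cited in citations.get(page_id, []):
            if cited not in visited:
                node.children.append(visit(cited))
        return node

    return visit(root_id)


def outline(node: LogicalNode, depth: int = 0) -> list[str]:
    """Flatten the tree into indented lines, one per page."""
    lines = ["  " * depth + node.page_id]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical tender document: page ids and citations are invented.
example_citations = {
    "cover": ["summary", "drawings"],
    "summary": ["bill_1", "bill_2"],
    "drawings": ["bill_2"],
}
example_tree = build_logical_tree(example_citations, "cover")
```

Calling `outline(example_tree)` lists the pages with indentation that reflects their depth in the tree, which is the kind of hierarchical view a later semantic-processing or reconstruction step could consume.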
Source
《计算机应用与软件》
CSCD
Peking University Core Journal
2002, No. 4, pp. 33-37 (5 pages)
Computer Applications and Software