A Knowledge-Based Approach to Logical Structure Analysis and Understanding for Multi-Page Documents


Abstract: Extracting the logical structure of a document is the most important part of document image understanding. Current research concentrates mainly on page layout analysis; the few studies of logical structure address only single-page documents or multi-page documents whose pages are related in simple ways. Construction tender documents are distinctive in that their hierarchical logical structure is not marked by explicit index information but is implied by citing information. This paper presents a method that extracts the logical structure of a document from the citing relationships between its pages. The method represents the logical structure as a modified tree; building the logical tree is itself the process of extracting the logical structure, and the resulting data structure supports higher-level semantic processing and reconstruction of the output. The method has been implemented in VHTender, an automatic tender processing system, where it ensures the system's flexibility and efficiency.
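The abstract does not spell out how the logical tree is built, so the following is only a minimal sketch of the general idea of turning cross-page citing relationships into a tree. Everything in it is an assumption made for illustration: the `LogicalNode` class, the `build_logical_tree` and `outline` functions, the page names, and the rule that a page cited from several places is attached to the first citing page reached. It is not the authors' modified tree structure or the VHTender implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LogicalNode:
    """One node of the illustrative logical tree; a node stands for one page."""
    page_id: str
    children: list[LogicalNode] = field(default_factory=list)


def build_logical_tree(citations: dict[str, list[str]]) -> LogicalNode:
    """Build a tree from a hypothetical mapping: page id -> ids of pages it cites."""
    cited = {target for targets in citations.values() for target in targets}
    roots = [page for page in citations if page not in cited]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one uncited root page, found {roots}")

    visited: set[str] = set()

    def attach(page_id: str) -> LogicalNode:
        visited.add(page_id)
        node = LogicalNode(page_id)
        for target in citations.get(page_id, []):
            # Assumption: a page keeps the first citing page reached
            # (in depth-first order) as its only parent.
            if target not in visited:
                node.children.append(attach(target))
        return node

    return attach(roots[0])


def outline(node: LogicalNode, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one line per page."""
    lines = ["  " * depth + node.page_id]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical tender document: a summary page cites two bill pages,
# and each bill page cites its own item pages.
example = {
    "summary": ["bill_A", "bill_B"],
    "bill_A": ["item_A1", "item_A2"],
    "bill_B": ["item_B1"],
    "item_A1": [],
    "item_A2": [],
    "item_B1": [],
}
tree = build_logical_tree(example)
print("\n".join(outline(tree)))
```

In this sketch the hierarchy is never read from an explicit index; it emerges only from which page cites which, which is the property of tender documents that the abstract highlights.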

Authors: Wang Shuhua (王姝华), Li Zuo (李佐), Cai Shijie (蔡士杰), Cao Yang (曹阳)

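Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Department of Building and Real Estate, The Hong Kong Polytechnic University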

Source: Computer Applications and Software (《计算机应用与软件》), indexed in CSCD and the Peking University Core Journals list, 2002, No. 4, pp. 33-37 (5 pages)

Keywords: document understanding; document processing; layout analysis; physical structure; multi-page documents; logical structure; knowledge base; office automation

Classification: TP317.1 [Automation and Computer Technology: Computer Software and Theory]; C931.4 [Economics and Management: Management Science]



