Abstract
Information extraction from PDF documents is a necessary step in information processing. Using XML as the information representation model and XSLT as the language for information extraction rules, this paper designs and implements an information extraction system for PDF documents of scientific and technical papers. The source PDF document is first converted into an intermediate XML document; XSLT rules are then applied to this intermediate document, drawing on its text features, position features, and display features. Test results show that the system extracts information well and is highly extensible.
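The second stage of this pipeline, rule-based extraction over the intermediate XML, can be illustrated with a minimal sketch. The abstract does not give the intermediate schema or the actual XSLT rules, so everything below is an assumption made for illustration: the `<text>` element, its `x`, `y`, `size`, and `bold` attributes, the sample content, and the function name `extract_metadata` are all hypothetical. The rules are also written as plain Python predicates rather than XSLT templates, since the Python standard library has no XSLT processor. Each rule uses one of the three feature types named above.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate XML: one <text> element per line of the PDF page,
# carrying position (x, y) and display (size, bold) attributes. The schema
# actually used by the paper is not described in the abstract.
SAMPLE = """\
<document>
  <page number="1">
    <text x="120" y="90" size="18" bold="yes">PDF Information Extraction Based on XSLT</text>
    <text x="200" y="130" size="11" bold="no">A. Author, B. Author</text>
    <text x="72" y="180" size="10" bold="yes">Abstract</text>
    <text x="72" y="196" size="10" bold="no">This paper proposes an extraction system.</text>
    <text x="72" y="260" size="10" bold="yes">Key words: PDF; XML; XSLT</text>
  </page>
</document>
"""


def extract_metadata(xml_text):
    """Apply three illustrative rules to the first page of the document."""
    root = ET.fromstring(xml_text)
    first_page = root.find("page")
    # Order the lines top to bottom by their y coordinate.
    lines = sorted(first_page.findall("text"), key=lambda t: float(t.get("y")))
    result = {"title": None, "authors": None, "keywords": []}
    if not lines:
        return result

    # Display feature: take the line with the largest font size as the title.
    title = max(lines, key=lambda t: float(t.get("size")))
    result["title"] = (title.text or "").strip()

    # Position feature: take the line directly below the title as the authors.
    title_index = lines.index(title)
    if title_index + 1 < len(lines):
        result["authors"] = (lines[title_index + 1].text or "").strip()

    # Text feature: a line starting with "Key words:" holds the keyword list.
    for line in lines:
        content = (line.text or "").strip()
        if content.lower().startswith("key words:"):
            raw = content.split(":", 1)[1]
            result["keywords"] = [k.strip() for k in raw.split(";") if k.strip()]
            break

    return result


if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

In an XSLT-based system such as the one described, each of these predicates would presumably correspond to a template or XPath expression over the intermediate document, which is what would make new rules easy to add without changing the converter.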
Source
《计算机与数字工程》
2008, No. 5, pp. 156-159 (4 pages)
Computer & Digital Engineering
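Funding
Supported by the Fujian Provincial Higher Education Science and Technology Project "Research on Digital Library Resource Integration and Classification Technology" (No. JA04164)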