Footnote Identification within a PDF Document (original Chinese title: PDF文档中的脚注识别研究)

Abstract: To address footnote recognition in PDF documents, this paper proposes a robust method that automatically identifies footnotes and their in-text references and links each footnote to its reference. The method first extracts a set of footnote features from the PDF document, including page layout, font information, and lexical and linguistic (semantic) features. Then, relying on the stylistic consistency of document components, it applies clustering to features that differ across documents but remain stable within a single document, so that identification adapts to different document types. In addition, the results of matching footnotes to references are fed back into the identification step, which further improves accuracy. Experiments on a test set of real documents show that the method achieves high precision and recall for footnote identification in PDF documents.
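
The pipeline outlined in the abstract (per-document style features, then footnote-to-reference matching used as feedback) can be illustrated with a minimal sketch. This is not the authors' algorithm: a length-weighted vote for the dominant font size stands in for the clustering step, and the `Line` record, the `MARKER` pattern, the `bottom` threshold, and all function names are hypothetical choices made for illustration. Lines are assumed to have been extracted from the PDF already, with `y` given as a fraction of page height (0 = top, 1 = bottom).

```python
import re
from collections import Counter
from typing import NamedTuple


class Line(NamedTuple):
    """One extracted text line (hypothetical record, not from the paper)."""
    page: int
    text: str
    font_size: float
    y: float  # vertical position as a fraction of page height, 0 = top


# Leading footnote marker: a 1-3 digit number (not part of a longer number)
# or a common footnote symbol. This pattern is an assumption.
MARKER = re.compile(r"^\s*(\d{1,3}|[*†‡§])(?!\d)[\s.)]*")


def body_font_size(lines):
    """Estimate the body font size of this document as the size that
    covers the most characters (a simple stand-in for clustering)."""
    counts = Counter()
    for ln in lines:
        counts[round(ln.font_size, 1)] += len(ln.text)
    return counts.most_common(1)[0][0]


def footnote_candidates(lines, bottom=0.75):
    """Lines near the page bottom, smaller than body text, starting with
    a marker. Returns (line, marker) pairs."""
    body = body_font_size(lines)
    out = []
    for ln in lines:
        m = MARKER.match(ln.text)
        if m and ln.font_size < body and ln.y >= bottom:
            out.append((ln, m.group(1)))
    return out


def match_footnotes(lines, ref_markers):
    """Keep only candidates whose marker is referenced on the same page.

    ref_markers is a set of (page, marker) pairs assumed to have been
    found as superscripts in the body text. Rejecting unmatched
    candidates is the feedback from matching to identification.
    """
    matched = []
    for ln, marker in footnote_candidates(lines):
        if (ln.page, marker) in ref_markers:
            matched.append((marker, ln.text))
    return matched
```

Because the body font size is estimated per document rather than fixed in advance, the same code applies to documents with different typography, which is the role the abstract assigns to clustering. A candidate that looks like a footnote but has no reference on its page (for example a page footer beginning with a number) is discarded by the matching step.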

Authors: Li Sida (黎斯达), Gao Liangcai (高良才), Tang Zhi (汤帜), Yu Yinyan (俞银燕)

Affiliation: Institute of Computer Science and Technology, Peking University

Source: Acta Scientiarum Naturalium Universitatis Pekinensis (《北京大学学报（自然科学版）》), 2015, No. 6, pp. 1017-1021 (5 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.

Funding: Supported by the National Natural Science Foundation of China (61202232) and the Beijing Natural Science Foundation (4132033).

Keywords: footnote; PDF documents; document analysis and understanding

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
