摘要
现有的从PDF文档抽取文本内容的方法(如PDFBox类库采用的方法)处理速度较低,无法满足高速网络中内容分析的需求,也不能对网络中部分到达的PDF数据包进行流式的处理。为此,提出了基于自动机理论的PDF文本内容抽取方法。该方法通过建立具有层次的关键字自动机,可以快速地抽取完整PDF文档和不完整PDF文档中的文本内容。在中文和英文PDF文档数据集下的实验结果表明,基于自动机理论的PDF文本内容抽取方法耗时仅为PDFBox方法的17%~37%。
The existing methods of extracting text content from a PDF file, such as the one adopted by the PDFBox library, are not efficient enough to handle the high-speed network traffic. Moreover, these methods cannot extract the contents streamingly from partial PDF packets in transfer. This paper proposed a new method based on automaton theory. The method adopted a hierarchical keyword Deterministic Finite Automaton (DFA) to extract information from complete or incomplete PDF files. The experimental results show that the response time of the proposed method is about 17% - 37% of the algorithm used by PDFBox when processing PDF files in Chinese or English.
出处
《计算机应用》
CSCD
北大核心
2012年第9期2491-2495,共5页
journal of Computer Applications
基金
国家自然科学基金资助项目(61070026)
国家863计划项目(2011AA010705)
关键词
文本内容抽取
自动机
确定的有穷自动机
不完整文档
text content extraction
automaton
Deterministic Finite Automation (DFA)
incomplete document