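Abstract
To let a full-text retrieval system search files in multiple formats, the text of each file must first be extracted and converted into plain text that is easy to index. To address multi-format text extraction, this paper designs a plugin-based text extraction system that supports multiple formats. The system identifies file types automatically by combining file extensions with magic numbers, calls existing single-format extraction plugins through a uniform interface, and converts the encoding of the resulting plain text so that the final output uses a single encoding. For directory input, the system also adds multi-process parallelism to exploit multi-core CPUs and uses a greedy algorithm to optimize task allocation so that the total running time is as short as possible. The system is easy to extend and has a simple programming interface. Experimental results show that the system correctly extracts text content and metadata, and that its extraction efficiency is higher than that of open source text extraction systems such as Apache's Tika.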
This paper designs a text extraction system that converts files in multiple formats to plain text; such a system plays a key role in full-text retrieval tasks. The system is plugin-based and supports a variety of file formats. It detects file types using a combination of file extensions and magic numbers, calls existing single-format plugins through a uniform interface, and unifies the encoding of the output plain text. Two novel features of the system are a greedy scheduling algorithm that aims to keep the overall running time as short as possible and a multi-process implementation that takes full advantage of multiple cores. The system is easy to extend and has simple APIs. Experimental results show that the system can extract the text content and metadata of supported file formats and that it outperforms Apache's Tika, an existing open source system.
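The abstract does not say which greedy rule the system uses for task allocation. As a minimal sketch of one common choice, the following assumes the "largest task first, least-loaded worker next" heuristic and uses an integer cost per file (for example, its size in bytes) as a stand-in for extraction time. The function name `greedy_assign`, the cost model, and the sample file names are hypothetical and are not taken from the paper.

```python
import heapq
from typing import Dict, List, Tuple


def greedy_assign(costs: Dict[str, int], n_workers: int) -> List[List[str]]:
    """Split files among workers with a largest-first greedy rule.

    `costs` maps a file name to an estimated cost (assumed here to be
    something like file size; the paper's actual cost model is not known).
    Returns one list of file names per worker.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")

    # Min-heap of (current total load, worker index).
    heap: List[Tuple[int, int]] = [(0, w) for w in range(n_workers)]
    buckets: List[List[str]] = [[] for _ in range(n_workers)]

    # Visit files from most to least expensive; ties are broken by name
    # so that the result is deterministic.
    for name, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)   # least-loaded worker so far
        buckets[worker].append(name)
        heapq.heappush(heap, (load + cost, worker))

    return buckets


if __name__ == "__main__":
    sample = {"a.pdf": 7, "b.doc": 5, "c.txt": 4, "d.html": 3, "e.odt": 1}
    for worker, files in enumerate(greedy_assign(sample, 2)):
        print(worker, files, sum(sample[f] for f in files))
```

Each returned list could then be handed to one worker process, for instance through `multiprocessing.Pool`. A greedy rule of this kind does not guarantee the optimal split, but it tends to keep the per-worker loads close, which matches the stated goal of making the total running time as short as possible.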
Source
《电子技术(上海)》 (Electronic Technology (Shanghai))
2014, No. 8, pp. 32-36 (5 pages)
Electronic Technology
Keywords
text extraction
multi-format
plugins
file type identification
character encoding conversion
multi-process
scheduling algorithm