Design and Implementation of a Text Extraction System Based on Plugins
Abstract For a full-text retrieval system to support searching across multiple file formats, the files to be searched must first undergo text extraction, which converts them into plain text that is easy to index. To address multi-format text extraction, this paper designs a plugin-based text extraction system that supports multiple formats. The system identifies file types automatically by combining file extensions with magic numbers, calls existing single-format extraction plugins through a uniform interface, and converts the extracted plain text so that the final output uses a single encoding. For directory input, the system runs extraction in multiple processes in parallel to exploit multi-core CPUs, and uses a greedy algorithm to allocate tasks so that the total running time is as short as possible. The system is easy to extend and has a simple programming interface. Experimental results show that the system correctly extracts text content and metadata, and that its extraction efficiency is higher than that of open-source text extraction systems such as Apache's Tika.
Source Electronic Technology (Shanghai) (《电子技术(上海)》), 2014, No. 8, pp. 32-36, 5 pages
Keywords text extraction; multi-format; plugins; file type identification; character encoding conversion; multi-process; scheduling algorithm
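
The abstract names two mechanisms that are easy to picture in code: file type identification from extension plus magic number, and greedy allocation of files to worker processes. The following is a minimal sketch under stated assumptions, not the authors' implementation. The signature table, the extension table, the rule that the magic number wins on conflict, and the use of a per-file cost (for example, file size) are all hypothetical. The greedy rule shown is the common longest-task-first heuristic, which the abstract does not confirm as the paper's choice.

```python
import heapq
from pathlib import Path

# Hypothetical signature table: (offset, magic bytes, type label).
MAGIC_SIGNATURES = [
    (0, b"%PDF-", "pdf"),
    (0, b"PK\x03\x04", "zip"),  # ZIP container, also used by docx
    (0, b"ID3", "mp3"),
]

# Hypothetical extension table.
EXTENSION_TYPES = {".pdf": "pdf", ".docx": "docx", ".mp3": "mp3", ".txt": "text"}

# Container types whose concrete format is refined by the extension (assumed).
CONTAINER_MEMBERS = {"zip": {"docx"}}


def detect_type(name, head):
    """Combine extension and magic number; the magic number wins on conflict."""
    by_ext = EXTENSION_TYPES.get(Path(name).suffix.lower())
    by_magic = None
    for offset, magic, label in MAGIC_SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            by_magic = label
            break
    if by_magic is None:
        # No known signature: fall back to the extension.
        return by_ext or "unknown"
    if by_ext in CONTAINER_MEMBERS.get(by_magic, ()):
        # The signature only identifies the container; the extension refines it.
        return by_ext
    return by_magic


def greedy_assign(costs, workers):
    """Longest-task-first greedy: each file goes to the least-loaded worker."""
    heap = [(0, w) for w in range(workers)]  # (current load, worker index)
    buckets = [[] for _ in range(workers)]
    # Sort by descending cost; ties are broken by name for determinism.
    for name, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, w = heapq.heappop(heap)
        buckets[w].append(name)
        heapq.heappush(heap, (load + cost, w))
    return buckets
```

In a multi-process setting, each returned bucket would be handed to one worker process, so the slowest bucket bounds the total running time. Balancing the estimated loads is what keeps that bound low.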
