Abstract
This paper describes the design principles and implementation of an intelligent information retrieval system for multi-format documents at an electric power company. PDF, HTML, XLS, and DOC files are converted to ".txt" files by calling COM components from PHP and jar packages from Java. The converted text is then segmented into words, and a summary of each ".txt" file is generated with an automatic summarization method based on sentence features. The retrieval module uses full-text retrieval based on a word index and adopts the vector space model as its information retrieval model, returning the summary together with the sentences most relevant to the query.
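The retrieval step named in the abstract (a word index plus a vector space model that ranks sentences by relevance) can be illustrated with a minimal sketch. The abstract does not state the term weighting or similarity measure, so TF-IDF weights and cosine similarity are assumptions here, whitespace splitting stands in for Chinese word segmentation, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Stand-in for Chinese word segmentation: lowercase and split on whitespace.
    return text.lower().split()


def build_index(sentences):
    """Return (inverted word index, per-sentence term counts)."""
    index = defaultdict(set)
    counts = []
    for i, sentence in enumerate(sentences):
        tf = Counter(tokenize(sentence))
        counts.append(tf)
        for term in tf:
            index[term].add(i)
    return index, counts


def tfidf(tf, index, n_docs):
    # Assumed weighting: term frequency times a smoothed inverse document frequency.
    return {
        t: c * math.log(1 + n_docs / len(index[t]))
        for t, c in tf.items()
        if t in index
    }


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, sentences, top_k=3):
    """Rank sentences against the query; returns (sentence, score) pairs."""
    index, counts = build_index(sentences)
    n = len(sentences)
    q_vec = tfidf(Counter(tokenize(query)), index, n)
    # Use the word index to score only sentences sharing a term with the query.
    candidates = set()
    for term in q_vec:
        candidates |= index[term]
    scored = [(cosine(q_vec, tfidf(counts[i], index, n)), i) for i in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(sentences[i], score) for score, i in scored[:top_k]]
```

Calling `search("transformer fault", sentences)` on a small list of sentences returns the matching sentences ordered by similarity. In a real system the `tokenize` function would be replaced by a proper Chinese word segmenter.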
Source
《重庆科技学院学报(自然科学版)》
CAS
2014, No. 4, pp. 154-157, 168 (5 pages in total)
Journal of Chongqing University of Science and Technology: Natural Sciences Edition
Funding
National Natural Science Foundation of China project (60705015)
Anhui Provincial Natural Science Foundation project (KJ2013B095)
Keywords
information retrieval system
format conversion
automatic abstract
full-text retrieval
vector space model