Abstract
The efficiency of full-text search depends on one data structure, the inverted index, and storing an inverted index requires a large amount of disk space. This paper proposes a new compression algorithm, intended mainly for compressing the document identifiers in an inverted index. For a given document collection, the information retrieval tool Terrier was used to build inverted index files, with the document identifiers compressed by different algorithms, including state-of-the-art techniques; the sizes of the resulting index files were then compared. The experimental results show that the new algorithm reduces the storage space needed for the inverted index file and achieves a better compression ratio than the techniques it was compared against.
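The abstract does not describe the proposed algorithm itself, so the following is only a background sketch of a common baseline for document-identifier compression: storing the gaps between sorted document identifiers (d-gaps) and encoding each gap with variable-byte coding. It is not the paper's method and not Terrier's API. All function names (`to_gaps`, `vbyte_encode`, `compress_postings`, and so on) are hypothetical, and the byte layout (low-order 7 bits first, high bit marking the last byte of a number) is one assumed convention among several in use.

```python
from typing import List


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Turn a strictly increasing docID list into d-gaps (first ID kept as-is)."""
    gaps = []
    prev = 0
    for i, doc_id in enumerate(doc_ids):
        if doc_id < 0 or (i > 0 and doc_id <= prev):
            raise ValueError("docIDs must be non-negative and strictly increasing")
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps: List[int]) -> List[int]:
    """Inverse of to_gaps: a running sum restores the original docIDs."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers: List[int]) -> bytes:
    """Variable-byte encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        # Emit low-order 7-bit groups first; these bytes have the high bit clear.
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        # The final byte of each number has the high bit set (assumed convention).
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Decode a byte string produced by vbyte_encode."""
    numbers = []
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:  # high bit set: this byte ends the current number
            numbers.append(value)
            value = 0
            shift = 0
        else:
            shift += 7
    return numbers


def compress_postings(doc_ids: List[int]) -> bytes:
    """Hypothetical helper: d-gap transform followed by variable-byte coding."""
    return vbyte_encode(to_gaps(doc_ids))


def decompress_postings(data: bytes) -> List[int]:
    """Hypothetical helper: inverse of compress_postings."""
    return from_gaps(vbyte_decode(data))


if __name__ == "__main__":
    postings = [3, 7, 154, 159, 100000]
    packed = compress_postings(postings)
    assert decompress_postings(packed) == postings
    print(len(packed), "bytes instead of", 4 * len(postings), "with 32-bit integers")
```

Small gaps fit in a single byte, which is why the gap transform matters: the compressed size depends on the differences between neighbouring identifiers, not on their absolute values. Comparing index file sizes across encoders, as the experiments in the abstract do, measures exactly this effect.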
Source
Journal of Shandong University (Natural Science)
CAS
CSCD
PKU Core Journals
2014, No. 12, pp. 30-35 (6 pages)
Funding
Fundamental Research Funds for the Central Universities (2011JBM231)
Keywords
inverted index
integer compression
index compression