期刊文献+

基于指令级并行的倒排索引压缩算法 被引量:7

SIMD-Based Inverted Index Compression Algorithms
下载PDF
导出
摘要 文本信息数量的快速增长给传统的信息检索技术带来了新的挑战.搜索引擎通常使用倒排索引来高效地处理查询.为了减少存储开销和加快访问速度,倒排索引通常被压缩存储.因此,如何选择一个高性能的压缩算法对高效查询处理是非常有必要的.在已有倒排链压缩算法PackedBinary和PForDelta的基础上,利用CPU的超标量特性和SIMD向量指令集,将其压缩和解压缩中的关键步骤并行化,提出了2种指令级并行压缩算法SIMD-PB和SIMD-PFD.基于GOV2和ClueWeb09B两个公开数据集的实验表明,SIMD-PB和SIMD-PFD算法在压缩率不变的情况下,压缩和解压缩速度比现有的压缩算法均有非常明显的提升.其中解压缩速度比起目前最好的倒排链压缩算法,最高能提升17%.此外,实验表明算法在较长的倒排链、较大的压缩块单位上有更好的解压缩性能. The rapid growth of text information has brought about new challenges to traditional information retrieval. In large search engines, indexing is required to help users acquire important data they need, and techniques of inverted index have great influence on the efficiency of query processing in such systems. The data in inverted index is stored in the form of arrays of integers, and techniques of compression are required to reduce the cost of storing such data in disks and memory, as well as to boost the hit rate of CPU cache and speed up transferring data. Therefore, it is necessary to choose a highly efficient compression algorithm to process query effectively. In this paper, we propose two instruction-level-parallelized algorithms, i.e. SIMD-PB and SIMD-PFD, which improve two competitive compression algorithms respectively, i.e. PackedBinary and PForDelta, and exploit SIMD instructions to accelerate the Pack and Unpack procedure in the algorithms. Experiments based on public datasets of GOV2 and ClueWeb09B show that our novel algorithms have good performance on encoding and decoding speed without impairing the compression ratio, and outperform the former fastest inverted list compression algorithms by at most 17%, with respect to decompression speed. Furthermore, experiments indicate that our novel algorithms have better performance on longer posting list and larger block size w.r.t. decoding speed.
出处 《计算机研究与发展》 EI CSCD 北大核心 2015年第5期995-1004,共10页 Journal of Computer Research and Development
基金 国家"九七三"重点基础研究发展计划基金项目(2014CB340400) 国家自然科学基金项目(61272340) 江苏未来网络创新研究院项目-云服务数字资源搜索(BY2013095-4-02)
关键词 单指令多数据流 倒排索引 压缩 整数编码 信息检索 single instruction multiple data(SIMD) inverted index compression integer encoding information retrieval
  • 相关文献

参考文献19

  • 1Witten I H, Moffat A, Bell T C. Managing Gigabytes: Compressing and Indexing Documents and Images [M]. 2nd ed. San Francisco: Morgan Kaufmann, 1999.
  • 2Anh V N, Moffat A. Index compression using 64 bit words [J]. Software: Practice and Experience, 2010, 40(2): 131- 147.
  • 3Heman S. SupeFscalar database compression between RAM and CPU cache [D]. Amsterdam, Holland: University of Amsterdam, 2005.
  • 4Zukowski M, Heman S, Nes N, et al. Super scalar RAM- CPU cache compression [C] //Proc of the 22nd Int Conf on Data Engineering. Los Alamitos, CA: IEEE Computer Society, 2006: 59-71.
  • 5Elias P. Universal codeword sets and representations of the integers [J]. IEEE Trans on Information Theory, 1975, 21 (2) : 194-203.
  • 6Rice R, Plaunt J. Adaptive variable length coding for efficient compression of spacecraft television data [J]. IEEE Transon Communication Technology, 1971, 19(6): 889- 897.
  • 7Scholer F, Williams H, Yiannis J, et al. Compression of inverted indexes for fast query evaluation [C] //Proc of the 25th Annual Int ACM SIGIR Conf on Research and Development in Information Retrieval. New York: ACM, 2902 : 222-229.
  • 8Dean J. Challenges in building large-scale information retrieval systems.. Invited talk [C]//Proc of the 2nd ACM Int Conf on Web Search and Data Mining. New York: ACM, 2009.
  • 9Anh V N, Moffat A. Inverted index compression using word- aligned binary codes [J]. Information Retrieval, 2005, 8 (1) : 151-166.
  • 10Anh V N, Moffat A. Improved word aligned binary compression for text indexing [J]. Knowledge and Data Engineering, 2006, 18(6): 857-861.

二级参考文献15

  • 1纪蕾,陈英.基于文档重排的索引压缩技术[J].清华大学学报(自然科学版),2005,45(S1):1828-1832. 被引量:1
  • 2朱虹,吴林.倒排索引压缩及在RDBMS全文检索中的实现[J].华中科技大学学报(自然科学版),2005,33(4):7-9. 被引量:3
  • 3王虎,王潜平.对几种倒排文件压缩技术的研究与分析[J].计算机工程与应用,2006,42(7):169-173. 被引量:2
  • 4Witten I H, Moffat A, Bell T C. Managing gigabytes: compressing and indexing documents and images[M]. New York: Van Nostrand Reinhold, 1994.
  • 5Navarro G, Moura E, Neubert M. Adding compression to block addressing inverted indexes[J]. Information Retrieval, 2000, 3(1): 49-77.
  • 6Moffat A, Zobel J. Self-indexing inverted files for fast text retrieval[J]. ACM Transactions on Information Systems, 1996(10): 349-379.
  • 7Scholer F, Wiliams H, Yiannis J. Compression of inverted indexes for fast query evaluation[J]. ACM Transactions on Information Systems, 2002(8): 222-229.
  • 8Williams H, Zobel J. Compressing integers for fast file access[J]. Computer Journal, 1999, 42(3): 193-201.
  • 9Zobel J,Moffat J.Adding compression to a full-text retrieval system[J].Software Practice and Experience,1995 ;25 (8):891~903
  • 10Williams HE,Zobel J.Compressing integers for fast file access[J].The Computer Journal,1999; 42 (3):193~201

共引文献6

同被引文献24

引证文献7

二级引证文献14

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部