期刊文献+

倒排索引压缩算法研究综述 被引量:3

Survey on Inverted Index Compressions
下载PDF
导出
摘要 不断增长的互联网网页信息和成千上万的用户查询请求给搜索引擎的索引更新和查询访问带来了前所未有的实时性挑战.高效的索引压缩算法能够降低索引数据的存储和传输开销,加快处理器对索引数据的处理速度,因此能直接影响搜索引擎系统的查询性能.首先,概述了倒排索引中倒排链表所包含的d-gap和freq整数序列的存储结构,并依据压缩码字的对齐方式对倒排索引压缩算法进行分类;其次,详细阐述了当前流行的字对齐压缩算法,并总结了Simple、Frame of Reference(FOR)、Optimized Chunk Splitting(OCS)等几类典型的倒排索引压缩算法;之后,综述了倒排索引压缩算法的SIM D并行化研究,即采用SIMD指令集中Shuffle数据置换和垂直布局存储来加速算法对d-gap整数序列的并行处理性能.然后,针对压缩倒排索引的随机访问问题,综述了通常采用的自索引技术和原始序列压缩算法两种策略.最后,对倒排索引压缩算法在搜索引擎系统中的应用进行了分析和总结并对未来倒排索引压缩算法可能的研究方向进行了探讨和展望. The ever-growing Internet web pages and tens of thousands of user queries have brought about huge challenges to the frequent index updates and real-time query processing in search engines.Efficient index compression algorithms can be used to reduce the costs of index storage and data transfer,as well as to accelerate the speed of CPU processing.In this paper,we present a survey on the techniques and algorithms on inverted index compression in decade,and expect to provide further research directions by classifying and summarizing the existing methods.Firstly,we outline the inverted index structure storing d-gap and freq integer sequences,and categorize the compression algorithms of inverted index according to the alignment of the compressed codewords.Secondly,we detail the most popular word-aligned compression algorithm,and summarize several state-of-the-arts,such as Simple,Frame of Reference(FOR)and Optimized Chunk Splitting(OCS).After that,the random access of the compressed inverted index is reviewed,and two commonused strategies are summarized,which are the self-indexing technique and the original sequence compression algorithms.Then,we reviewa recent research direction of the inverted index compression,i.e.,adopting the SIM D shuffle permutation instruction and vertical layout to parallel the common inverted index compression algorithms.Finally,we summarize the applications of the inverted index compression algorithms in the search engine systems,and discuss the possible future research directions of the inverted index compression.
作者 姜琨 朱磊 宋省身 杨岳湘 JIANG Kun;ZHU Lei;SONG Xing-shen;YANG Yue-xiang(School of Computer Science and Engineering,Xi'an University of Technology,Xi'an 710048,China;College of Computer,National University of Defense Technology,Changsha 410073,China)
出处 《小型微型计算机系统》 CSCD 北大核心 2020年第4期715-723,共9页 Journal of Chinese Computer Systems
基金 国家自然科学基金项目(61602374)资助。
关键词 搜索引擎 倒排索引压缩算法 字对齐压缩 SIMD指令集 自索引结构 search engine inverted index compression word-aligned compression SIMD instructions self-indexes
  • 相关文献

参考文献3

二级参考文献39

  • 1刘小珠,孙莎,曾承,彭智勇.基于缓存的倒排索引机制研究[J].计算机研究与发展,2007,44(z3):153-158. 被引量:8
  • 2朱虹,吴林.倒排索引压缩及在RDBMS全文检索中的实现[J].华中科技大学学报(自然科学版),2005,33(4):7-9. 被引量:3
  • 3王虎,王潜平.对几种倒排文件压缩技术的研究与分析[J].计算机工程与应用,2006,42(7):169-173. 被引量:2
  • 4Navarro G, Makinen V. Compressed full-text indexes. ACM Computing Surveys, 2007, 39(1) : 1-61.
  • 5Zobel J, Moffat A. Inverted files for text search engines. ACM Computing Surveys, 2006, 38(2): 1-56.
  • 6Seoa C, Leeb S W, Kima H J. An efficient inverted index technique for XML documents using RDBMS. Information and Software Technology, 2003, 45(1) : 11-22.
  • 7Wang Chao-Kun, Li Jian Zhong, Shi Sheng-Fei. N-gram inverted index structures on music data for theme mining and content based information retrieval. Pattern Recognition Letters, 2006, 27(9): 492-503.
  • 8Gupta A, Hon W K, Shah R, Vitter J S. Compressed dictionaries: Space measures, data sets, and experiments. Lecture Notes in Computer Science, 2006, 4007:158-169.
  • 9Buttcher S, Clarke C L A. Index compression is good, especially for random access//Proceedings of the 16th ACM Conference on Information and Knowledge Management. Lisboa, Portugal, 2007:761-770.
  • 10Moffat A, Stuiver L. Binary interpolative coding for effective index compression. Information Retrieval, 2000, 3(1): 25 -47.

共引文献20

同被引文献11

引证文献3

二级引证文献1

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部