Ginix: Generalized Inverted Index for Keyword Search


Abstract: Keyword search has become a ubiquitous method for users to access text data in the face of information explosion. Inverted lists are usually used to index the underlying documents so that documents matching a set of keywords can be retrieved efficiently. Since inverted lists are usually large, many compression techniques have been proposed to reduce storage space and disk I/O time. However, these techniques usually perform decompression on the fly, which increases CPU time.
This paper presents a more efficient index structure, the Generalized INverted IndeX (Ginix), which merges consecutive IDs in inverted lists into intervals to save storage space. With this index structure, more efficient algorithms can be devised for the basic keyword search operations, i.e., union and intersection, by taking advantage of intervals. Specifically, these algorithms do not require conversion from interval lists back to ID lists. As a result, keyword search using Ginix can be more efficient than keyword search using traditional inverted indexes. The performance of Ginix is further improved by reordering the documents in datasets using two scalable algorithms. Experiments on the performance and scalability of Ginix on real datasets show that, compared with traditional inverted indexes, Ginix not only requires less storage space but also improves keyword search performance.
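The interval idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`to_intervals`, `intersect_intervals`, `union_intervals`), the closed-interval tuple representation, and the simple two-pointer merges are all assumptions made for illustration. It shows only that consecutive document IDs can be merged into intervals and that union and intersection can then be computed directly on the interval lists, without expanding them back into ID lists.

```python
from typing import List, Tuple

# Hypothetical representation: a closed interval [lo, hi] of document IDs.
Interval = Tuple[int, int]


def to_intervals(ids: List[int]) -> List[Interval]:
    """Merge runs of consecutive IDs from a sorted, duplicate-free ID list."""
    intervals: List[Interval] = []
    for doc_id in ids:
        if intervals and doc_id == intervals[-1][1] + 1:
            # Extend the current run by one ID.
            intervals[-1] = (intervals[-1][0], doc_id)
        else:
            # Start a new run.
            intervals.append((doc_id, doc_id))
    return intervals


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted interval lists without expanding them to IDs."""
    result: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            result.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def union_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Union two sorted interval lists, merging overlapping or adjacent runs."""
    result: List[Interval] = []
    i = j = 0
    while i < len(a) or j < len(b):
        # Take the interval with the smaller lower bound next.
        if j == len(b) or (i < len(a) and a[i][0] <= b[j][0]):
            lo, hi = a[i]
            i += 1
        else:
            lo, hi = b[j]
            j += 1
        if result and lo <= result[-1][1] + 1:
            # Overlaps or touches the previous interval: merge them.
            result[-1] = (result[-1][0], max(result[-1][1], hi))
        else:
            result.append((lo, hi))
    return result


# Two toy inverted lists for hypothetical keywords.
list_a = to_intervals([1, 2, 3, 5, 8, 9])
list_b = to_intervals([2, 3, 4, 5, 9, 10])
common = intersect_intervals(list_a, list_b)
either = union_intervals(list_a, list_b)
```

In this sketch both operations take time proportional to the number of intervals rather than the number of IDs, which is the intuition behind the efficiency claim; how much is gained in practice depends on how many consecutive runs the document ordering produces.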

Authors: Hao Wu, Guoliang Li, Lizhu Zhou

Affiliation: the Department of Computer Science and Technology

Source: Tsinghua Science and Technology (SCIE, EI, CAS), 2013, No. 1, pp. 77-87 (11 pages). Journal of Tsinghua University (Natural Science Edition, English Edition)

Funding: Supported by the National Natural Science Foundation of China (No. 60833003)

Keywords: keyword search; index compression; document reordering

Classification code: TP391.3 [Automation and Computer Technology: Computer Application Technology]

