
AS-Index: A Structure for String Search Using n-Grams and Algebraic Signatures (cited by: 1)

Abstract: We present the AS-Index, a new index structure for exact string search in disk-resident databases. The AS-Index relies on a classical inverted file structure; its main innovation is a probabilistic search based on the properties of algebraic signatures, which are used for both n-gram hashing and pattern search. Specifically, the properties of our signatures allow a search to be carried out by inspecting only two of the posting lists. The algorithm thus has the unique feature of requiring a constant number of disk accesses, independent of both the pattern size and the database size. We conduct extensive experiments on large datasets to evaluate the behavior of our index. They confirm that it steadily provides a search performance proportional to the two disk accesses necessary to obtain the posting lists. This makes our structure a choice of interest for the class of applications that require very fast lookups in large textual databases. We describe the index structure, our use of algebraic signatures, and the search algorithm. We discuss the operational trade-offs based on the parameters that affect the behavior of our structure, and present the theoretical and experimental performance analysis. We then compare the AS-Index with state-of-the-art alternatives and show that 1) its construction time matches that of its competitors, owing to the similarity of the structures, and 2) its search time consistently outperforms the standard approach, thanks to the economical access to data complemented by signature calculations, which is at the core of our search method.
Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE, EI, CSCD; 2016, Issue 1, pp. 147-166 (20 pages)
Keywords: full text indexing, large-scale indexing, algebraic signature
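
The abstract's central idea, a probabilistic match that consults only two posting lists, can be illustrated with a toy in-memory sketch. This is not the authors' implementation: the abstract does not give the signature definition or the on-disk layout, so the sketch assumes a polynomial signature over a prime field (`Q`, `ALPHA`) as a stand-in for algebraic signatures, and stores with each n-gram occurrence the cumulative signature of the text preceding it. The names `signature`, `build_index`, and `search` are hypothetical.

```python
from collections import defaultdict

# Assumed parameters for illustration only; the paper's signatures are
# algebraic signatures, not this prime-field polynomial hash.
Q = (1 << 61) - 1   # prime modulus
ALPHA = 257         # base of the polynomial signature


def signature(s: bytes) -> int:
    """Polynomial signature: sum of s[k] * ALPHA**k, modulo Q."""
    sig, power = 0, 1
    for byte in s:
        sig = (sig + byte * power) % Q
        power = (power * ALPHA) % Q
    return sig


def build_index(text: bytes, n: int):
    """Map each n-gram to {position: cumulative signature of text[:position]}."""
    index = defaultdict(dict)
    cum, power = 0, 1
    for pos in range(len(text) - n + 1):
        index[text[pos:pos + n]][pos] = cum
        cum = (cum + text[pos] * power) % Q
        power = (power * ALPHA) % Q
    return index


def search(index, pattern: bytes, n: int):
    """Return candidate match positions using only two posting lists."""
    m = len(pattern)
    if m < n:
        raise ValueError("pattern must be at least n bytes long")
    first = index.get(pattern[:n], {})       # posting list of the first n-gram
    last = index.get(pattern[m - n:], {})    # posting list of the last n-gram
    target = signature(pattern[:m - n])      # signature of the part before the last n-gram
    hits = []
    for pos, cum_start in first.items():
        cum_end = last.get(pos + m - n)
        if cum_end is None:
            continue  # the last n-gram does not occur at the required offset
        # cum_end - cum_start equals ALPHA**pos * signature(text[pos:pos+m-n]),
        # so the in-between text is checked without reading it. A collision
        # can yield a false positive, hence the search is probabilistic.
        if (cum_end - cum_start) % Q == (target * pow(ALPHA, pos, Q)) % Q:
            hits.append(pos)
    return sorted(hits)
```

For example, with `index = build_index(b"abracadabra abracadabra", 3)`, calling `search(index, b"abra", 3)` reads the posting lists of `abr` and `bra` only, regardless of how long the pattern is. In this sketch the cost of a lookup is two dictionary accesses plus signature arithmetic, which mirrors the constant number of disk accesses claimed in the abstract.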

