期刊文献+

基于前缀剪枝的大规模向量空间相似检索框架

A Large-Scale Vector Space Similarity Retrieval Framework Based on Prefix Pruning
下载PDF
导出
摘要 针对大规模文本集合下基于权重的相似性查询问题,提出一种支持前缀剪枝的高效检索框架。首先给出向量空间模型下相似性及其带权前缀定义,理论证明了带权前缀剪枝的正确性;其次,面向大规模文本查询,提出一种新的倒排索引结构,利用索引叶节点维护记录的前缀权重,并基于该索引构建高效的相似检索算法;最后,在TF/IDF权重策略下证明该方法能够有效支持大规模带权相似检索。结果表明,其查询效率较Lucene的归并验证策略提升了5倍以上。 Aiming at the problem of weight-based similarity query under large-scale text collection,an efficient retrieval framework supporting prefix pruning is proposed.Firstly,we give the definition of similarity and its weighted prefix under the vector space model,and theoretically prove the correctness of weighted prefix pruning;then,for large-scale text query,we propose a new inverted index structure,use the index leaf nodes to maintain the prefix weights of the records,and construct efficient similarity retrieval algorithms based on the index;finally,we prove that the meth-od can effectively support large-scale similar retrieval with weights,and the results show that its query efficiency is more than 5 times higher than that of Lucene's subsumption verification strategy.
作者 刘健博 邓凌风 李文海 田野 LIU Jianbo;DENG Lingfeng;LI Wenhai;TIAN Ye(Wuhan DNect Technology ltd.,Wuhan 430205,China;School of Computer Science,Wuhan University,Wuhan 430072,China;School of Software Engineering,Hubei Open University,Wuhan 430074,China)
出处 《软件导刊》 2024年第6期92-97,共6页 Software Guide
基金 武汉市重点研发计划项目(2023010402040006)。
关键词 前缀剪枝 TF/IDF 向量空间 倒排索引 信息检索 数据库 prefix-based pruning TF/IDF vector space model inverted index information retrieval database
  • 相关文献

参考文献5

二级参考文献23

  • 1Behm A, Li Chen, et al. Answering approximate string queries on large data sets using external memory [C] //Proc of IEEE ICDE'll. Los Alamitos, CA: IEEE Computer Society, 2011:888-899.
  • 2Wagner R, Fischer M. The string-to-string correction problem [J]. Journal of the ACM, 1974, 21(1): 168-173.
  • 3Zobel J, Moffat A. Inverted files for text search engines [J]. ACM Computer Survey, 2006, 38(2): 6-20.
  • 4Gravano L, Ipeirotis P, et al. Approximate string join in a database (almost) for free [C] //Proc of VLDB'01. San Francisco: Morgan Kaufmann, 2001:491-500.
  • 5Chaudhuri S, Ganti V, et al. A primitive operator for similarity joins in data cleaning [C] //Proc of IEEE ICDE'06. Los Alamitos, CA: IEEE Computer Society, 2006:5-15.
  • 6Xiao Chuan, Wang Wei, et al. Ed-join: An efficient algorithm for similarity joins with edit distance constrains [J]. Proceedings of the VLDB Endowment. 2008, 1 (1): 933-944.
  • 7Li Chen, l.u Jiaheng, et al. Efficient merging and filtering algorithms for approximate string searches [C] //Proe of IEEE ICDE'08. Los Alamitos, CA: IEEE Computer Society, 2008:257-266.
  • 8Sarawagi S, Kirpal A. Efficient set joins on similarity predicates [C] //Proc of ACM SIGMOD'04. New York: ACM, 2004:743-754.
  • 9Behm A, Ji Sbengyue, et aI. Space-constrained gram-based indexing for efficient approximate string search [C] //Proc of IEEE ICDE'09. Los Alamitos, CA: IEEE Computer Society, 2009:204-215.
  • 10Hadijieleftheriou M, Koudas N, et al. lnereamental maintenance of length normalized indexes for approximate string matching [C] //Proc of ACM SIGMOD'10. New York: ACM, 2011:429-440.

共引文献36

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部