A Compression Method for Full-Text Indexes (一种全文索引的压缩方法)

A Compressed Suffix Automaton for Full Text Indexing

Abstract: Full-text indexes are widely used in databases, data compression, pattern matching, and bioinformatics. This paper studies the suffix automaton as a full-text index structure and, to address its large space consumption, proposes an edge compression method. The method simulates the automaton's transition edges by means of the suffix link function, so that some of the transition edges can be deleted. In the resulting compressed structure the number of transition edges is about equal to the number of states, whereas in the original suffix automaton it is twice the number of states. The paper proves that, for problems such as deciding membership in the factor set of a given word, the compressed suffix automaton has the same time complexity as the original suffix automaton: membership can be decided in linear time.
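For orientation, the uncompressed structure that the abstract starts from can be sketched as follows. This is a minimal Python sketch of the classical online suffix automaton construction with suffix links, plus the factor membership query; it is not the paper's compressed structure, whose edge-deletion rule the abstract does not spell out. The class and method names (`SuffixAutomaton`, `is_factor`, `num_states`, `num_edges`) are illustrative assumptions, not taken from the paper.

```python
class SuffixAutomaton:
    """Classical (uncompressed) suffix automaton, built online.

    Illustrative sketch only: names are assumptions, and the paper's
    edge compression is NOT implemented here.
    """

    def __init__(self, text: str) -> None:
        self.next = [{}]    # next[s][c] = target of the transition edge from s on c
        self.link = [-1]    # suffix link of each state (-1 for the initial state)
        self.length = [0]   # length of the longest factor that reaches each state
        self.last = 0       # state that recognizes the whole text read so far
        for ch in text:
            self._extend(ch)

    def _new_state(self, length: int, edges=None, link: int = -1) -> int:
        self.next.append(dict(edges or {}))  # copy, so clones own their edges
        self.link.append(link)
        self.length.append(length)
        return len(self.length) - 1

    def _extend(self, ch: str) -> None:
        cur = self._new_state(self.length[self.last] + 1)
        p = self.last
        # Walk the suffix links, adding the missing transition edges on ch.
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps q's edges but a shorter length.
                clone = self._new_state(self.length[p] + 1, self.next[q], self.link[q])
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def is_factor(self, pattern: str) -> bool:
        """Decide whether pattern is a factor (substring) of the text.

        Follows one transition edge per pattern character.
        """
        state = 0
        for ch in pattern:
            state = self.next[state].get(ch)
            if state is None:
                return False
        return True

    def num_states(self) -> int:
        return len(self.length)

    def num_edges(self) -> int:
        return sum(len(edges) for edges in self.next)


if __name__ == "__main__":
    sa = SuffixAutomaton("abcbc")
    print(sa.is_factor("cbc"))  # True
    print(sa.is_factor("cc"))   # False
    print(sa.num_states(), sa.num_edges())
```

In this sketch every transition edge is stored explicitly in `next`, which is the space cost the paper targets. The suffix links in `link` are the function that, according to the abstract, the compressed structure uses to simulate deleted edges.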

Authors: Yang Weihong (杨炜鸿), Zhang Meng (张猛)

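Affiliations: College of Computer Science and Technology, Jilin University; School of Information Engineering, Jilin Business and Technology College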

Source: Information Science (《情报科学》), CSSCI, PKU Core Journal, 2010, No. 11, pp. 1710-1713 (4 pages)

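Funding: National Natural Science Foundation of China (60873235); Fundamental Research Funds for the Central Universities, Ministry of Education (200903186); Natural Science Foundation of the Jilin Provincial Department of Science and Technology (20101522); Jilin Provincial Department of Education projects (2009599, 2010400)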

Keywords: full-text index; suffix automaton; compression

Classification code: G350 [Cultural Sciences: Information Science]

References (16)

1. P. Weiner. Linear pattern matching algorithms [C]. USA: In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, IEEE, 1973.
2. E. Ukkonen. On-line construction of suffix trees [J]. Algorithmica, Springer-Verlag, Germany, 1995, (14):249-260.
3. D. Gusfield. Algorithms on Strings, Trees and Sequences [M]. New York: Cambridge University Press, 1997:87-208.
4. A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M.T. Chen, and J. Seiferas. The smallest automaton recognizing the subwords of a text [J]. Theoretical Computer Science, Elsevier, Holland, 1985, (40):31-55.
5. A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis [J]. Journal of the ACM, ACM, USA, 1987, 34(3):578-595.
6. M. Crochemore. Transducers and repetitions [J]. Theoretical Computer Science, Elsevier, Holland, 1986, (45):63-86.
7. U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches [J]. SIAM Journal on Computing, SIAM, USA, 1993, (22):935-948.
8. R. Grossi and J. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching [C]. USA: In Proceedings of the 32nd ACM Symposium on Theory of Computing, ACM, 2000.
9. M. Crochemore and R. Vérin. Direct construction of compact directed acyclic word graphs [C]. In Combinatorial Pattern Matching, Germany: Springer, 1997.
10. J. Kärkkäinen. Suffix cactus: a cross between suffix tree and suffix array [C]. Combinatorial Pattern Matching, 1995.
