摘要
后缀数组、BWT、LCP数组是进行全文索引和文本压缩的重要数据结构,BWT和LCP数组通常由构造完成的后缀数组计算而来。基于诱导排序的SAIS算法是最快的后缀数组构造算法之一,本文对SAIS进行改进后提出了藏文后缀数组算法ITSBL,在诱导产生后缀数组的同时计算BWT而无须在内存中保存完整的后缀数组,结合藏文的音节结构特点对计算出的后缀数组进行处理,得到以藏文音节字为单位的藏文后缀数组和LCP数组,结果更符合藏文的使用习惯。相比单独计算后缀数组、BWT、LCP数组,ITSBL算法在较大文本下性能提升约10%,较小文本下提升约30%,具有一定的应用价值。
Suffix array,BWT array and LCP array are important data structures for full-text indexing and text compression.BWT array and LCP array are usually computed from the constructed suffix array.SAIS algorithm based on induced sorting is one of the fastest suffix array construction algorithms.This paper improves SAIS and proposes Tibetan suffix array algorithm:ITSBL algorithm,while inducing the generation of suffix array,computes BWT without storing a complete suffix array in memory,and processes the computed suffix array in combination with the characteristics of Tibetan syllable structure to obtain Tibetan suffix array and LCP array in unit of Tibetan syllable word,and the results are more in line with the usage habits of Tibetan.Compared with the separate calculation of suffix array,BWT,LCP array,the performance is improved by about 10%under large text and about 30%under small text,which has certain application value.
作者
张学通
彭展
ZHANG Xuetongg;PENG Zhan(College of Information Engineering,Xizang Minzu University,Xianyang 712082,China;Xizang Key laboratory of Optical Information Processing And Visualization Technology,Xianyang 712082,China;Xizang Cyberspace Governance Research Base,Xianyang712082,China)
出处
《中央民族大学学报(自然科学版)》
2024年第2期32-39,共8页
Journal of Minzu University of China(Natural Sciences Edition)
基金
西藏自治区自然科学基金(XZ202101ZR0089G)。
关键词
诱导排序
藏文
后缀数组
induced sort
Tibetan
suffix array