Pattern Matching with Wildcards Based on Suffix Tree

Original title: 基于后缀树的带有通配符的模式匹配研究. Cited by: 7

Abstract: Pattern matching with wildcards has long been an active research topic because of its applications in biological sequence analysis, text indexing, network intrusion detection, and related fields. Existing work places strong restrictions on the wildcards and their length constraints. This paper studies pattern matching with flexible wildcards, in which wildcards may appear between any two substrings of the pattern and may carry flexible length constraints. Using a nonlinear data structure, the suffix tree, the authors design PAST, a complete algorithm that finds all solutions of a pattern. In the preprocessing phase, an online incremental algorithm builds a suffix tree that encodes prior knowledge of the text. In the search phase, the characters of the pattern are matched one by one using the idea of dynamic programming, which finally yields the complete set of solutions. Experiments on gene sequences show that PAST has better time performance than the other algorithms compared.
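The problem statement in the abstract can be illustrated with a small sketch. The code below is not PAST and uses no suffix tree: it is a naive dynamic-programming counter written only to show what "wildcards with flexible length constraints between pattern characters" means. The function name `count_matches`, the `(lo, hi)` gap representation, and the choice to count every distinct position sequence as one occurrence are all assumptions made for illustration, not details taken from the paper.

```python
def count_matches(text, chars, gaps):
    """Count occurrences of chars[0] g0 chars[1] g1 ... chars[m-1] in text.

    gaps[j] = (lo, hi) is the allowed number of wildcard positions between
    chars[j] and chars[j + 1] (hypothetical representation, lo <= hi).
    Each distinct sequence of text positions counts as one occurrence.
    """
    if not chars or len(gaps) != len(chars) - 1:
        raise ValueError("need one gap constraint between each pair of characters")

    n = len(text)
    # prev[i] = number of partial matches of chars[0..j-1] ending at text[i]
    prev = [1 if c == chars[0] else 0 for c in text]

    for j in range(1, len(chars)):
        lo, hi = gaps[j - 1]
        cur = [0] * n
        for i in range(n):
            if text[i] != chars[j]:
                continue
            # The previous pattern character sits g wildcards to the left.
            for g in range(lo, hi + 1):
                k = i - g - 1
                if k < 0:
                    break
                cur[i] += prev[k]
        prev = cur

    return sum(prev)


if __name__ == "__main__":
    # Pattern "A" followed by 0 to 3 wildcards, then "G".
    print(count_matches("ACGTAG", "AG", [(0, 3)]))
```

This sketch runs in time proportional to the text length times the pattern length times the widest gap, and it scans the raw text each time. The abstract indicates that PAST instead preprocesses the text into a suffix tree before searching.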

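Authors: Hou Baojian (侯宝剑), Xie Fei (谢飞), Hu Xuegang (胡学钢), Liu Yingling (刘应玲), Wang Haiping (王海平)

Affiliations: School of Computer and Information, Hefei University of Technology; Department of Computer Science and Technology, Hefei Normal University; School of Physical Sciences, University of Science and Technology of China

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2012, No. 12, pp. 177-180, 194 (5 pages)

Funding: National "863" Program project (2012AA011005); China Postdoctoral Science Foundation (2012M511403); Natural Science Foundation of Anhui Province (11040606M134); Fundamental Research Funds for the Central Universities (2010HGXJ0714)

Keywords: pattern matching; wildcards; suffix tree

Classification: TP309 [Automation and Computer Technology: Computer System Architecture]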


References (17 in total; first 10 listed)

[1] Altschul S F, Gish W, Miller W, et al. Basic local alignment search tool [J]. Journal of Molecular Biology, 1990, 215(3): 403-410.
[2] Fischer M J, Paterson M S. String matching and other products [J]. Complexity of Computation, Massachusetts Institute of Technology, 1974, 7: 113-125.
[3] Don A, Zheleva E, Gregory M, et al. Discovering interesting usage patterns in text collections: integrating text mining with visualization [A] // Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM'07) [C]. New York: ACM, 2007: 213-222.
[4] Yang Jiong, Wang Wei, Yu P S. Mining asynchronous periodic patterns in time series data [J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(3): 613-628.
[5] Tanbeer S K, Ahmed C F, Jeong B-S, et al. Efficient frequent pattern mining over data streams [A] // Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM'08) [C]. California: ACM, 2008: 1447-1448.
[6] Cole R, Gottlieb L A, Lewenstein M. Dictionary matching and indexing with errors and don't cares [A] // Proceedings of the 36th ACM Symposium on the Theory of Computing [C]. New York: ACM, 2004: 91-100.
[7] Manber U, Baeza-Yates R. An algorithm for string matching with a sequence of don't cares [J]. Information Processing Letters, 1991, 37(3): 133-136.
[8] Min Fan, Wu Xindong, Lu Zhenyu. Pattern Matching with Independent Wildcard Gaps [A] // 2009 Eighth IEEE International Conference on Dependable [C]. 2009: 194-199.
[9] Huo Hongwei, Wang Xiaowu. Repeat identification algorithm based on adaptive suffix trees in DNA sequences (DNA序列中基于适应性后缀树的重复体识别算法) [J]. Chinese Journal of Computers, 2010, 33(4): 747-754. Cited by: 4
[10] Ukkonen E. On-line Construction of Suffix Trees [J]. Algorithmica, 1995, 14(3): 249-260.



