Due to the huge size of patterns to be searched,multiple pattern searching remains a challenge to several newly-arising applications like network intrusion detection.In this paper,we present an attempt to design effic...Due to the huge size of patterns to be searched,multiple pattern searching remains a challenge to several newly-arising applications like network intrusion detection.In this paper,we present an attempt to design efficient multiple pattern searching algorithms on multi-core architectures.We observe an important feature which indicates that the multiple pattern matching time mainly depends on the number and minimal length of patterns.The multi-core algorithm proposed in this paper leverages this feature to decompose pattern set so that the parallel execution time is minimized.We formulate the problem as an optimal decomposition and scheduling of a pattern set,then propose a heuristic algorithm,which takes advantage of dynamic programming and greedy algorithmic techniques,to solve the optimization problem.Experimental results suggest that our decomposition approach can increase the searching speed by more than 200% on a 4-core AMD Barcelona system.展开更多
The traditional multiple pattern matching algorithm, deterministic finite state automata, is implemented by tree structure. A new algorithm is proposed by substituting sequential binary tree for traditional tree. It i...The traditional multiple pattern matching algorithm, deterministic finite state automata, is implemented by tree structure. A new algorithm is proposed by substituting sequential binary tree for traditional tree. It is proved by experiment that the algorithm has three features, its construction process is quick, its cost of memory is small. At the same time, its searching process is as quick as the traditional algorithm. The algorithm is suitable for the application which requires preprocessing the patterns dynamically.展开更多
Objectives:Genomic signatures like k-mers have become one of the most prominent approaches to describe genomic data.As a result,myriad real-world applications,such as the construction of de Bruijn graphs in genome ass...Objectives:Genomic signatures like k-mers have become one of the most prominent approaches to describe genomic data.As a result,myriad real-world applications,such as the construction of de Bruijn graphs in genome assembly,have been benefited by recognizing genomic signatures.In other words,an efficient approachof genomic signatureprofiling is an essential need for tackling high-throughput sequencing reads.However,most of the existing approaches only recognize fixed-size k-merswhile many research studies have shown the importance of considering variable-length k-mers.Methods:In this paper,we present a novel genomic signature profiling approach,TahcoRoll,by extending the Aho–Corasick algorithm(AC)for the task of profiling variable-length k-mers.We first group nucleotides into two clusters and represent each cluster with a bit.The rolling hash technique is further utilized to encode signatures and read patterns for efficient matching.Results:In extensive experiments,TahcoRoll significantly outperforms the most state-of-the-art k-mer counters and has the capability of processing reads across different sequencing platforms on a budget desktop computer.Conclusions:The single-thread version of TahcoRoll is as efficient as the eight-thread version of the state-of-the-art,JellyFish,while the eight-thread TahcoRoll outperforms the eight-thread JellyFish by at least four times.展开更多
基金supported by the National Natural Science Foundation of China under Grant Nos.60803030,60925009,60921002the National Basic Research 973 Program of China under Grant No.2011CB302502
文摘Due to the huge size of patterns to be searched,multiple pattern searching remains a challenge to several newly-arising applications like network intrusion detection.In this paper,we present an attempt to design efficient multiple pattern searching algorithms on multi-core architectures.We observe an important feature which indicates that the multiple pattern matching time mainly depends on the number and minimal length of patterns.The multi-core algorithm proposed in this paper leverages this feature to decompose pattern set so that the parallel execution time is minimized.We formulate the problem as an optimal decomposition and scheduling of a pattern set,then propose a heuristic algorithm,which takes advantage of dynamic programming and greedy algorithmic techniques,to solve the optimization problem.Experimental results suggest that our decomposition approach can increase the searching speed by more than 200% on a 4-core AMD Barcelona system.
基金This project was supported by the National "863" High Technology Research and Development Program of China(2003AA142160) and the National Natural Science Foundation of China (60402019)
文摘The traditional multiple pattern matching algorithm, deterministic finite state automata, is implemented by tree structure. A new algorithm is proposed by substituting sequential binary tree for traditional tree. It is proved by experiment that the algorithm has three features, its construction process is quick, its cost of memory is small. At the same time, its searching process is as quick as the traditional algorithm. The algorithm is suitable for the application which requires preprocessing the patterns dynamically.
文摘Objectives:Genomic signatures like k-mers have become one of the most prominent approaches to describe genomic data.As a result,myriad real-world applications,such as the construction of de Bruijn graphs in genome assembly,have been benefited by recognizing genomic signatures.In other words,an efficient approachof genomic signatureprofiling is an essential need for tackling high-throughput sequencing reads.However,most of the existing approaches only recognize fixed-size k-merswhile many research studies have shown the importance of considering variable-length k-mers.Methods:In this paper,we present a novel genomic signature profiling approach,TahcoRoll,by extending the Aho–Corasick algorithm(AC)for the task of profiling variable-length k-mers.We first group nucleotides into two clusters and represent each cluster with a bit.The rolling hash technique is further utilized to encode signatures and read patterns for efficient matching.Results:In extensive experiments,TahcoRoll significantly outperforms the most state-of-the-art k-mer counters and has the capability of processing reads across different sequencing platforms on a budget desktop computer.Conclusions:The single-thread version of TahcoRoll is as efficient as the eight-thread version of the state-of-the-art,JellyFish,while the eight-thread TahcoRoll outperforms the eight-thread JellyFish by at least four times.