Data structures
Comparison of Three Pattern Matching Algorithms using DNA Sequences.pdf
www.ijsetr.com
ISSN 2319-8885
Vol.03,Issue.35
November-2014,
Pages:6951-6955
Copyright @ 2014 IJSETR. All rights reserved.
Comparison of Three Pattern Matching Algorithms using DNA Sequences NYO ME TUN
1 , THIN MYA MYA SWE
2
1 Dept of Research and Development, Computer University, Mandalay, Myanmar, E-mail: nyomizin@gmail.com.
2 Dept of Research and Development, Computer University, Mandalay, Myanmar.
Abstract: Pattern matching is commonly used in computer science and information processing such as text editor in computing,
database queries, bioinformatics, language syntax checker, music content retrieval, DNA sequences matching, search engines
and many more applications. There are many forms of pattern matching. Among them, the most well-known form is
bioinformatics application, DNA sequences analysis of various diseases which are stored in database for retrieval and
comparison. To get high quality results in a short time is to use pattern matching algorithms because of DNA database is very
complex and huge and not to retrieve easily. This paper presents three pattern matching algorithms, Knuth-Morris-Pratt (KMP),
Brute-Force and Boyer-Moore. The main process of this system is to find the matched DNA sequence in DNA database that find
the matched DNA sequence in DNA database using these pattern matching algorithms. In this system, at least two DNA
sequences are required to analyze. This system compares similarity values with threshold value and stores particular result which
is diseased or not. Finally, this system can identify optimal result according to each result using voting. This system is
implemented with Java programming language and the comparison results are demonstrated with bar graph with respect to
similarity and processing time of each algorithm.
Keywords: Pattern Matching, DNA Sequences Analysis, Medical Diagnosis.
I. INTRODUCTION
Exact sequence matching is a vital component of
bioinformatics which is the application of computer
technology to management and analysis of biological data.
Computer technologies are applied to gather, store, analyze
and merge biological data. Therefore, bioinformatics is an
interface between biological and computational sciences [2].
DNA Pattern matching, the problem of finding subsequences
within a long DNA sequence has many applications in
computational biology. As the sequences can be long,
matching can be an expensive operation, especially as
approximate matching is allowed [6]. There are many
biological sequence data in different databases. The
biological and bibliographical sequence data are growing in
these databases at an exponential rate. The computational
demands are to find matched sequence and to analyze these
sequence data contained in these databases. Exponential
growth of gene databases enforces the need for efficient
information extraction methods and algorithms for DNA
sequencing exploiting existing large amount of DNA
information available in DNA database. Manually it is very
tedious and time consuming [1].
Deoxyribonucleic acid (DNA) is an acronym for the
molecule deoxyribonucleic acid and it is also called a double
helix because it is a double-stranded molecule. It is mainly
composed of nucleotides of four types. The four bases in
DNA are Adenine (A), Cytosine (C), Guanine (G), and
Thymine (T). A DNA sequence is a representation of a string
of nucleotides contained in a strand of DNA. For example:
ATTCGTAACTAGTAAGTTA. The DNA sequencing
techniques have allowed the vast amount of data to be
analyzed in a short span of time. So, pattern matching
techniques plays a vital role in computational biology for
data analysis related to biological data such as DNA
sequences[1][2][6]. This paper presents and analyzes three
pattern matching algorithms to find the matched DNA
sequence in DNA database. Pattern-matching algorithm
matches the pattern exactly or approximately within the text.
The organization of this paper is as follows. SectionII
presents the related work of pattern matching process. In
sectionIII, theoretical background for DNA sequence
matching is given. In sectionIV, proposed system is
demonstrated. In sectionV, system implementation is
described. In sectionVI, the experimental results are shown.
Finally, in sectionVII, the conclusions and references are
given.
II. RELATED WORK
The pattern matching algorithms can be applied for
detecting the unusual patterns present in the gene database. It
can show how the disease can be transformed from parents to
their children and can identify the presence of the disease on
hereditary basis and its impact. Moreover, KMP algorithm
known as DNA sequencing algorithm is used to detect
unusual patterns in DNA sequences and to analyze these
sequences in the gene database [1][2]. Approximate String
Matching Algorithm, fuzzy string searching technique is used
NYO ME TUN, THIN MYA MYA SWE
International Journal of Scientific Engineering and Technology Research
Volume.03, IssueNo.35, November-2014, Pages: 6951-6955
to find strings that match a pattern approximately. It is
applied in finding approximate substring matches inside a
given string and finding dictionary strings that match the
pattern approximately. Another application of approximate
string matching can be effectively used to retrieve fast video
as it uses the content based video retrieval in contrast with
the traditional video retrieval which was slow and time
consuming. String based video retrieval method first converts
the unstructured video into a curve and marks the feature
string of it. Approximate string matching is then used to
retrieve video quickly [3][4].
III. THEORETICAL BACKGROUND
A. Pattern Matching
DNA pattern matching is a fundamental and upcoming
area in computational molecular biology. Pattern matching is
an important task of the pattern discovery process in today's
world for finding the structural and functional behavior in
proteins and genes. Although pattern matching is commonly
used in computer science and information processing, it can
be found in everyday tasks. The string matching can be
described as: given a specific strings P generally called
pattern searching in a large sequence/text T to locate P in T.
if P is in T, the matching is found and indicates the position
of P in T, else pattern does not occurs in the given text. As
the size of the data grows it becomes more difficult for users
to retrieve necessary information from the sequences. Hence
more efficient and robust methods are needed for fast pattern
matching techniques. It is one of the most important areas
which have been studied in bioinformatics. Pattern matching
techniques has two categories and is generally divides into:
Single pattern matching
Multiple pattern matching techniques.
Single pattern matching is to find all occurrences of the
pattern in the given input text. Suppose, if more than one
pattern are matched against the given input text
simultaneously, then it is known as, multiple pattern
matching. Multiple pattern matching can search multiple
patterns in a text at the same time. It has a high performance
and good practicability, and is more useful than the single
pattern matching algorithms [5][6][7]. Let P = {p1, p2,
p3,...,pm} be a set of patterns of m characters and
T={t1,t2,t3…,tn} in a text of n character which are strings of
nucleotide sequence characters from a fixed alphabet set
called Σ={A C, G, T}[7]. There are various pattern matching
algorithms. These efficient algorithms are used to the
sequence of DNA in the DNA database. The present day
pattern matching algorithms match the pattern exactly or
approximately within the text [1][2][7]. Among them, this
paper applies three pattern matching algorithms. They are:
Knuth-Morris-Pratt
Brute-Force and
Boyer-Moore
B. Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm is linear sequence matching
algorithm and plays an important role in bioinformatics,
DNA sequence matching, disease detection etc. It scans the
sequence from left to right. This algorithm preprocess the
pattern string P using failure function. There is a match,
increase the current indices. If not, consult the failure
function the new index in P here needs to continue checking
P against T. The algorithm repeat this process until find a
match of P (pattern) in T(Text) or index for T reach n, the
length of T. The main part of the KMP algorithm is the
while-loop which performs comparison between a character
in T and a character in P for each iteration. This algorithm
doesn’t need backtracking and shift more than one position.
Therefore, KMP is called as an intelligent search algorithm
[1] [2].
TABLE I: Knuth-Morris-Pratt (KMP) Pattern Matching
Algorithm
C. Brute-Force Algorithm
The brute force algorithm is a powerful technique that is
used to search the pattern. This algorithm is simplest string
method which is used to match each character of pattern to
the corresponding character in text until all characters are
found to match successful search; or mismatch is detected.
This algorithm scans the sequence from left to right direction.
This algorithm doesn’t need preprocessing phase. It consists
of two nested loops, with the outer loop indexing through all
possible starting indices of the pattern in the text, and the
inner loop indexing through each character of the pattern,
comparing it to its potentially corresponding character in the
text [2] [6].
Comparison of Three Pattern Matching Algorithms using DNA Sequences
International Journal of Scientific Engineering and Technology Research
Volume.03, IssueNo.35, November-2014, Pages: 6951-6955
TABLE II: Brute-Force Pattern Matching Algorithm
D. Boyer-Moore Algorithm
Boyer-Moore algorithm is an efficient string searching
algorithm that can determine whether or not a match of a
particular string exists within another string. So, this
algorithm is used in bioinformatics mostly, for disease
detection. This algorithm scans the matching of the two
sequences from right to left direction. Therefore, this
algorithm takes backward approach. If no mismatch occurs,
then the pattern has been found. Otherwise, the algorithm
computes a shift; that is, an amount by which the pattern is
moved to the right before a new matching attempt is
undertaken. In case of a mismatch, the knowledge gained
from the last occurrence function helps in calculating the new
index to start the next search. This algorithm preprocesses the
pattern P using the last-occurrence function L mapping ∑,
where L(c) is defined as: the largest index i such that P[i] = c
or -1 if no such index exists [1] [2].
E. DNA (Deoxyribonuclei Acid)
Deoxyribonucleic Acid (DNA) is a nucleic acid that
contains genetic instructions. DNA is a double helix
structure. It contains three components: a five-carbon sugar
(Deoxyribose), a series of phosphate groups, and four
nitrogenous bases. The four bases in DNA are Adenine (A),
Thymine (T), Guanine (G), and Cytosine (C). Thymine and
adenine always come in pairs. Likewise, guanine and
cytosine bases come together too. Every human has his/her
unique genes. Genes are made up of DNA; therefore the
DNA sequence of each human is unique. The sequence of
DNA constitutes the heritable genetic information in nuclei,
plasmids, mitochondria, and chloroplasts that forms the basis
for the developmental programs of all living organisms
[1][2][6][7].
TABLE III: Boyer-Moore Pattern Matching Algorithm
The process of determining the DNA sequence is used to
map out the sequence of the four nucleotides that comprise a
strand of DNA. Studying the DNA sequence is necessary in
basic research studying fundamental biological processes, as
well as in applied fields such as diagnostic and forensic
research. The four nucleotides (A, T, G, C) are grouped to
form words and these words make sentences which are called
genes. The occurrence of unwanted or mutated words causes
the disease. Every disease will have its own words or
sequences on the occurrence of the mutated sequence in the
DNA [1][2][6]. This paper matches the DNA sequences of
three diseases called HIV, Lung cancer and Leukaemenia.
IV. PROPOSED SYSTEM
This paper presents three pattern matching algorithms for
DNA sequence matching. The main process of this paper is
finding the matched DNA sequence in DNA database using
pattern matching algorithms. When entering human DNA
sequence (suspected DNA sequence) and the type of disease
to be checked as input, the matched DNA sequence are
searched using three pattern matching algorithms: Knuth-
Morris-Pratt (KMP), Brute-Force and Boyer-Moore. The
total matched indices are got and similarity values with
respect to entire sequence are computed. Then, the similarity
values are compared with threshold value (user defined
value) and stores the particular results which is diseased or
not. Finally, the optimal result is sent back according to each
stored result using voting method. Fig.1 is the proposed
system architecture for the DNA sequence matching.
NYO ME TUN, THIN MYA MYA SWE
International Journal of Scientific Engineering and Technology Research
Volume.03, IssueNo.35, November-2014, Pages: 6951-6955
Fig.1. System Architecture Design.
V. SYSTEM IMPLEMENTATION This system is implemented using Java programming
language and tested using different DNA sequence with
different file size. Pattern matching techniques are used to
search the matched DNA sequences in mostly bioinformatics.
The main process of the system is finding the matched DNA
sequence in a set of DNA database. The pattern matching
algorithms are used to find the matched sequence and the
total matched indices are used to compute similarity value
with respect to entire sequence. Since DNA sequences are
very large and complex, it is impossible to analyze the vast
amount of data in a short of span. The pattern matching
algorithms are efficient to trace the sequence of DNA in the
DNA database. These efficient pattern matching techniques
can give optimal result for particular diseased DNA
sequence. All the three algorithms require at least two DNA
sequences. One of these sequences is generally a suspected
DNA sequence and other is a diseased sequence. Moreover,
pattern matching techniques is used to optimize the time and
to analyze the vast amount of data in a short span of time.
VI. EXPERIMENTAL RESULTS
The results of the proposed system are described in this
section. Fig.3 describes the particular result of checked
disease for particular algorithm. From this Fig.3, we can
decide the optimal result for particular disease. Fig.2 (a) is
the input DNA, Figure 2 (b) presents the types of DNA
available in this system. Fig.3 describes the result of the
system. Finally, the similarity and processing time are shown
as Fig.4 which is two comparison results of three pattern
matching algorithms. Threshold value used in this system is
0.75 (user defined value).
(a)
(b)
Fig.2. (a) Input DNA, (b). DNA Types available in this
system.
Fig.3. Result of the system.
Fig.4. Two Comparison Results of Three Pattern
Matching Algorithms.
VII. CONCLUSION
This paper provides a path in diagnosing the disease by the
identification of presence of diseased DNA sequence in DNA
database. The pattern matching algorithms are effectively
used in matching DNA sequences because of DNA database
is very complex and huge and not to retrieve easily. The
Comparison of Three Pattern Matching Algorithms using DNA Sequences
International Journal of Scientific Engineering and Technology Research
Volume.03, IssueNo.35, November-2014, Pages: 6951-6955
pattern matching process is needed to keep pace with ever
growing demands in sequence comparison. The required
sequences are suspected sequence and diseased sequence.
These algorithms in this system are easy to implement and
effective in multiple detections of DNA sequences. This
paper identifies the particular disease on DNA sequence
using three pattern matching algorithms. Finally, this paper
gives optimal result according to each result using voting.
VIII. ACKNOWLEDGEMENTS My Sincere thanks to my supervisor Dr.Thin Mya Mya
Swe, Lecture, Research and Development Department,
Computer University (Mandalay), Myanmar for providing
me an opportunity to do my research work. I express my
thanks to my Institution namely Computer University
(Mandalay) for providing me with a good environment and
facilities like internet books, computers and all that as my
source to complete this research. My heart-felt thanks to my
family, friends and colleagues who have helped me for the
completion of this work.
IX. REFERENCES
[1] Kuhu Shukla, Samarjeet Borah, Sunil Pathak, “An
analysis of influential DNA sequencing Algorithms”.
[2] S.Rajesh, S. Prathima, Dr. L.S.S.reddy, “Unusual pattern
detection in DNA database using KMP algorithm”.
[3] Nimisha Scingla, Deepark Garg, “String Matching
algorithms and their Applicability in various Applications”.
[4] Vidya Saikrishna Prof. Akhtar Rasool and Dr. Nilay,
Khare Department, “String Matching and its Application in
Diversified Fields”.
[5] “String and pattern matching Algorithms”.
[6] Rajib Paul, Pooja Roy, “Improved Pattern Matching To
Find DNA Patterns”.
[7] Raju Bhukya, DVLN Somayajulu “Exact Multiple Pattern
Matching Algorithm using DNA Sequence and Pattern Pair”.
[8]http://www.broadinstitute.org/cgibin/cancer/datasets.cgi.
[9]http://www.pombase.org/downloads/dna datasets.
A Novel Pattern Matching Algorithm in Genome Sequence Analysis.pdf
A Novel Pattern Matching Algorithm in Genome Sequence Analysis
Ashish Prosad Gope #1, Rabi Narayan Behera#2
#1M.Tech Student (4th Semester) Dept. of Information Technology, Institute of Engineering & Management,
Salt Lake, Kolkata, India #2Asst. Professor, Dept. of Information Technology, Institute of Engineering & Management,
Salt Lake, Kolkata, India
Abstract— DNA sequences has been for years a great concern for many research papers in Bio-Informatics. DNA sequence is a long string of characters specifying the nucleotides presented in the DNA. In bioinformatics the most well-known application is DNA sequence detection. Stored DNA sequence of various disease are retrieved and compared in order to check for the existence of a disease. To search for the pattern a well-established pattern matching algorithm is needed in order to get the result at the cost of sufficient amount of time. We’ve specifically referred the DNA sequences instead of any text strings and implemented the algorithms upon it. This paper evaluates four pattern matching algorithms’ performance and then proposes a new algorithm based upon Rabin Karp algorithm which ensures that character comparisons can be eliminated from Rabin Karp algorithm. These algorithms look for the specified pattern in a huge strand of DNA sequence.
I. INTRODUCTION Bioinformatics is an interdisciplinary research area that is the interface between the biological and computational sciences. The advent of electronic computers has arguably been the most revolutionary development in the history of Science and technology. The Ultimate goal of bioinformatics is to uncover the wealth of Biological information hidden in the mass of data and obtains a clearer insight into the fundamental biology of organisms. This new knowledge could have profound impacts on fields as varied as human health. Agriculture, the environment, energy and biotechnology. There are many other applications of bioinformatics, including predicting entire protein strands, learning how genes express themselves in various species, and building complex models of entire cells. As computing power increases and our databases of genetic and molecular information expand, the realm of bioinformatics is sure to grow and change drastically, allowing us to build models of incredible complexity and utility. When we know a particular sequence is the cause for a disease, the trace of the sequence in the DNA and the number of occurrences of the sequence defines the intensity of the disease. As the DNA is a large database, I propose String and Pattern matching algorithms to find out a particular sequence in the given DNA. This paper entirely focuses on a novel approach for detecting the patterns present in the gene database. Pattern matching is a mechanism to find out the exact location of a specified pattern, iff the pattern exists in the text. Before moving forward let us convey you about the structure of our paper. We’ve discussed the preliminaries needed to move forward in section 2, after that in section 3, disease caused by genetic factors has been revisited. In section 4 we discussed detection of disease using pattern matching and in section 5 the
central ideas of this paper i.e. the pattern matching problem has been discussed. In subsequent sections i.e. in section 6, 7, 8 and 9, the Brute Force, Knuth-Morris-Pratt algorithm, Boyer Moore algorithm and Rabin Karp algorithm respectively has been described. In section 10 we’ve described our idea to improve the Rabin Karp algorithm and in section 11 the references used in this paper has been given.
II. PRELIMINARIES Every human has his/her unique genes. Genes are made up of DNA. DNA is contained in each living cell of an organism, and it is the carrier of that organism’s genetic code. The genetic code is a set of Sequences which define what proteins to build within the organism. DNA consists of two strands, each being a string of four nitrogenous bases i.e. Adenine, Cytosine, Guanine and Thymine. In a computer we represent each nitrogen base with a single character: A for Adenine, G for Guanine and C for Cytosine and T for Thymine. Thymine (T) & Adenine (A) always come in pairs. Likewise, Guanine (G) & Cytosine (C) bases come together too. Using these codes an entire DNA can be coded based upon their nucleotides contained in a strand. For example: ATGCGATATGCATGCATGCATAT. The term DNA sequencing comprehends biochemical methods for determining the order of the nucleotide bases, adenine, guanine, cytosine, and thymine, in a DNA oligonucleotide [10]. Determining the DNA sequence is therefore useful in basic research studying fundamental biological processes, as well as in applied fields such as diagnostic or forensic research.
The power and ease of using sequence information has however, made it the method of choice in modern bioinformatics analysis.[11]
III. DESEASE CAUSED BY GENETIC FACTORS
An unhealthy symptoms or a specific illness in the body is termed as a disease. Disease refers to any unnatural condition of an organism that affects normal functions. Diseasemay be referred to disabilities, disorders, syndromes, symptoms[9].Genes are the basic building blocks of heredity. They get passed from parent to child. They hold DNA, the instructions for making proteins. A genetic disease is any disease that is caused by an abnormality in an individual's genome. Some of the genetic disorders are inherited from the parents, while other genetic diseases are caused by mutations in a pre-existing gene or group of genes.
IV. DETECTION OF DISEASE USING PATTERN MATCHING Over the last decade, genetic studies have identified numerous associations between chromosomal alleles in the human genome and important human diseases. Unfortunately, these extending
Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457
www.ijcsit.com 5450
findings of casual variants in the region of DNA is not a straight forward task [8]. Causal variant identification typically involves searching through sizable regions of genomic DNA in the locality of disease-associated SNPs (single nucleotide Polymorphism).When we know a particular sequence is the cause for a disease, the trace of the sequence in the DNA and the number of occurrences of the sequence defines the intensity of the disease. As the DNA is a large database we need to go for efficient algorithms to find out a particular sequence in the given DNA.
V. THE PATTERN MATCHING PROBLEM In pattern-matching problem on strings, we are given a text string T of length n and a pattern string P of length m, and want to find whether P is a substring of T. The meaning of a “match” is that there is a substring of text T starting at some index i that matches pattern P, so that T[i]=P[0], T[i+1]=P[1] ... T[i+m-1]=P[m-1] i.e. P= T[i..i+m-1]. Thus, the output from a pattern-matching algorithm is either an indication that the pattern P does not exist in T or the starting index in T of a substring matching P.[12]
T =” abacaabaccabacabaabb “
And the pattern string:
P = "abacab".
Then P is a substring of T. Namely, P = T [10...15]. There are various pattern-matching algorithms. Here we are to review four pattern matching algorithms and present an algorithm which is based upon Rabin-Karp algorithm but modified. These efficient algorithms can be used to trace the sequence of DNA in a huge gene database. Following are the four algorithms which are described below.
• Brute-Force • Knuth-Morris –Pratt • Boyer-Moore • Rabin-Karp Algorithm
VI. BRUTE FORCE ALGORITHM It is also known as Naive String Matching algorithm. It has no pre- processing phase, needs constant extra space. It always shifts the window by exactly one position to the right. It requires 2n expected text characters comparisons. It finds all valid shifts using a loop that checks the condition P[1....m]=T[s+1...s+m] for each of the n-m+1 possible values of s. The algorithm is as following: BRUTE_FORCE(T, P) n ← length[T ] m ← length[P] for s ← 0 to n − m
do if P[1 . .m] = T [s + 1 . . s + m] then print “Pattern occurs with shift” s
The Brute force string-matching procedure can be presented as shifting the pattern over the text, observing for which shifts all of the characters of the pattern equal the corresponding characters in the text, as illustrated in the following example.
T=ANPANMAN P=MAN
VI.I. Complexity Procedure BRUTE_FORCE takes time O(m) in best case i.e. when the pattern is found with in first m characters of text. And in the worst case the pattern will be matched total (m (n-m+1)). For
example, consider the text string “AN” (a string of n a’s) and the pattern “AM”. For each of the (n−m+1) possible values of the shift s, the loop on line 4 to compare corresponding characters must execute m times to validate the shift. The worst-case running time is thus O(mn). The running time of BRUTE_FORCE is equal to its matching time, since there is no preprocessing. VI.II. Drawbacks Of This Approach In O(mn) approach. if ‘m’ is the length of pattern ‘p’ and ‘n’ is the length of string ‘S’. Suppose S=ATGATAATGAAG and p=AATA.
Figure 1: Brute Force comparison process
j= 0 1 2 3 4 5 6 7 8 9 10 S= A T G A T A A T G A G p= A T A A A T A A A T A A A T A A
In table 1 we’ve shown when mismatch is detected for the first time in comparison of p[3] with S[3], pattern ‘p’ would be moved one position to the right and matching procedure resumes from here. Here the first comparison that would take place would be between p[0]=‘A’ and S[1]=‘T’. It should be noted here that S[1] had been previously involved in a comparison in 2nd iteration of the loop in this algorithm. This is a repetitive use of S[1] in another comparison. It is these repetitive comparisons that lead to the runtime of O(mn), which made it very slow.
VII. KMP ALGORITHM We now present a linear-time string-matching algorithm due to
Knuth, Morris, and Pratt. The basic idea behind the algorithm discovered by Knuth, Morris, and Pratt is this: when a mismatch is detected, our false start (which is the main drawback of Brute Force algorithm) consists of characters that we know in advance (since they’re in the pattern). Somehow we should be able to take advantage of this information instead of backing up the pointer over all those known characters
VII.I The Prefix Function For A Pattern
Fully skipping past the pattern on detecting a mismatch as described in the previous paragraph won’t work when the pattern could match itself at the point of the mismatch. To calculate the positions for the pattern as to how much a pattern need to shift itself so that the corresponding characters of text match with it. The table is called as next table or sometimes failure function (figure 2) for the pattern to be searched [14]. Consider another example of this next table. This next[j] be the character position in the pattern which should be checked next after such a mismatch, so that we can slide the pattern (j - next[j]) places relative to the text [6].
Figure 2: Next table j 1 2 3 4 5 6 7 8 9 10
pattern A T G A T G A G A T
next -1 0 0 -1 0 0 -1 4 -1 0
Here next[j]= 0 means that we are to slide the pattern all the way past the current text character. Now we shall discuss how to pre- compute this table; fortunately, the calculations are quite simple, and we will see that they require only O(m) steps. Now we
Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457
www.ijcsit.com 5451
represent following the algorithm to calculate the next function or prefix function:
next(p) //p signifies pattern int i=0, j=-1; next[i]=j; for(i=0;i<m;i++) { if(i==0) next[i]=j; else if(p[i]==p[j]) { next[i]=next[j]; } else { next[i]=j; } while (j>=0 && p[i]!=p[j]) j=next[j]; j++; }
This program takes O(m) units of time, as next[t] in the innermost loop always shifts the upper copy of the pattern to the right, so it is performed a total of m times at most. A slightly different way to prove that the running time is bounded by a constant times m is to observe that the variable starts at 0 and it is increased, m- 1 times, by 1; furthermore its value remains nonnegative. Therefore the operation next[j], which always decreases j, can be performed at most m-1 times [6].
VII.II. The Pattern Matching Algorithm
The Knuth-Morris-Pratt matching algorithm is given in pseudo code below as the procedure KMP-MATCHER. KMP- MATCHER calls the auxiliary procedure next() to compute next table. Below T & P signifies text & pattern respectively.
KMP-MATCHER(T, P) n ← length[T] m ← length[P] next=next(P) //array consisting of prefix values j ← 0 //Number of characters matched. for k ← 1 to n //Scan the text from left to right. do while j > 0 and P[j + 1] ≠ T [k] do j ← next[j] //Next character does not match. if P[j + 1] = T [k ] then j ← j + 1 //Next character matches. if j = m //Is all of P matched? then print “Pattern occurs with shift” k– m j ← next[j] // Look for the next match. For convenience, let us assume that the input text is present in an array text T[ 1…n ], and that the pattern appears in pattern P[1…m]. We shall also assume that m > 0, i.e., that the pattern is nonempty. Let k and j be integer variables such that text T[k] denotes the current text character and pattern P[j] denotes the corresponding pattern character; thus, the pattern is essentially aligned with positions p + 1 through p + m of the text, where k =p +j [15]. VII.III. Complexity The KMP algorithm works by turning the patterns given into a machine, and then running the machine. It takes O(m) space and time complexity in pre-processing phase, and O(n+m) time complexity in searching phase (independent of the alphabet size). KMP is a linear time string matching algorithm. [6]
VIII. BOYER-MOORE ALGORITHM A significantly faster string searching method can be developed by scanning the pattern from right to left when trying to match it against the text. The Boyer-Moore algorithm (BM) was developed by R.S.Boyer and J.C.Moore in 1977 [7]. The Boyer Moore algorithm scans the characters of the pattern from right to left beginning with the rightmost one and performs the comparisons from right to left. VIII.I Bad Character Rule To convey the idea of the bad character rule, let us suppose that the last (rightmost) character of pattern P is y and the character in text T it aligns with is x, x ≠ y. When mismatch occurs, we can safely shift P to the right so that the rightmost x in P is below the mismatched x in T, and this is possible if the rightmost position of character x exists in pattern P. This observation is formalized below [16]. For a particular alignment of pattern P against text T, the rightmost (n-i) characters of pattern P match their counterparts in text T, but the next character to the left, P(i), doesn’t matches with its counterpart, say in position k of T. The bad character rule says that P should be shifted right by Max[1,i - R(T(k))] places. The point of this shift rule is to shift P by more than one character when possible. In the below example, T(5) = t mismatches with P(3) and R(t) = 1 so P can be shifted right by two positions. After the shift, the comparison of P and T begins again at the right end of P.
Figure 3: Compare from right 1 2
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
T A C T C T T G A T G C T C T T A C
P A G A T G A T
VIII.II. Extended Bad Shift Rule When a mismatch occurs at position i of pattern P and the mismatched character in text T is x, then shift P to the right so that the closest x to the left of position i in P is below the mismatched x in T.
VIII.III The Good Suffix Rule Now we introduce another rule called the good suffix rule. Suppose for a given pattern P and text T, a substring t of text T matches a suffix of pattern P, but a mismatch occurs at the next comparison to the left. Then find, if it exists, the rightmost copy t’ of t in P such that t0 is not a suffix of P and the character to the left of t’ in P differs from the character to the left of t in P. Shift P to the right so that substring t0 in P is below substring t in T (see Figure 4). If t’ does not exist, then shift the left end of P. past the left end of t in T by the least amount so that a prefix of the shifted pattern matches a suffix of t in T. If no such shift is possible, then shift P n places to the right. If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T. If no such shift is possible, then shift P by n places, i.e., shifting P past t in T.
Figure 4: case when good suffix rule applies 0 1
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
T p r s t a b c t u b a b v q x r s t
^
P q c a b d a B d a b
1 2 3 4 5 6 7 8 9 0
Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457
www.ijcsit.com 5452
Good suffix shift rule, where character x of T mismatches with character y of P. Characters y and z of P are guaranteed to be distinct by the good suffix rule, so z has a chance of matching x. When the mismatch occurs at position 8 of P and position 10 of T, t = ab and t0 occurs in P starting at position 3. Hence P is shifted right by six places resulting in the following alignment.
Figure 5: Shifting using good suffix rule
0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 T p r s t a b c t u b a b v q x r s t ^ P q c a b d a b a b 1 2 3 4 5 6 7 8 9 Now in cases where we have matched the final m characters of pattern P before failing, we clearly wish to shift our attention down string by 1+m. So, L(i) is the largest index j less than n such that Nj (P)≥|P[i..n]| (which is n - i + 1). L’(i) is the largest index j less than n such that Nj(P) = |P[i..n]| = (n - i + 1).Now The pre- processing stage must also prepare for the case when L’(i) = 0 or when an occurrence of P is found. l’(i) equals the largest j ≤ |P[i..n]|, which is n-i+1, such that Nj (P) = j. So we can say that the required shift will be max (L(i),L’(i)).
VIII.IV. The complete Boyer-Moore algorithm: Given the pattern P, //pre-processing stage Compute L’(i) and l(i) for each position i of P, and compute R(x) for each character x ∈ ∑ //Search stage k := n; while k ≤ m do begin i := n; h := k; while i > 0 and P(i) = T(h) do begin i := i - 1; h := h - 1; end; if i = 0 then begin report an occurrence of P in T ending at position k. k := k + n – l’(2); end else shift P (increase k) by the maximum amount determined by the (extended) bad character rule and the good suffix rule. end
Note that although we have always talked about shifting P", and given rules to determine by how much P should be “shifted", there is no shifting in the actual implementation. Rather, the index k is increased to the point where the right end of P would be shifted". Hence, each act of shifting P takes constant time [17]. The good suffix rule in Boyer-Moore method has a worst-case running time of O(m) provided that the pattern does not appear in the text. This was first proved by Knuth, Morris and Pratt [6]. VIII.V. Algorithm Complexity The BM algorithm is successful at achieving a sub linear running time in the average case, and if some special conditions occurred then also was capable of O(n+m) in the worst case.
IX. RABIN-KARP ALGORITHM Previous three algorithms which we’ve seen is based upon string matching to see whether the pattern is matched with the text portion or not. RABIN KARP matcher is one of the most effective string matching algorithms. To find a numeric pattern ‘P’ from a given text ‘T’. It first divides the pattern with a predefined prime number ‘q’ to calculate the modular of the pattern P. Then it tests the first m characters (m=|P|) from text T to compute remainder of m characters from text T. If the remainder of the Pattern and the remainder of the text T are equal only then we compare the characters of the text portion with the pattern otherwise there is no need for the comparison [1]. We’ve to repeat the process for next set of characters from text for all the possible shifts which are from s=0 to nm (where n denotes the length of text and m denotes the length of P). So according to this two numbers n1 and n2 can only be equal if
REM (n1/q) = REM (n2/q) [1] After division we will be having three cases:-
• Case 1: Successful hit: - In this case if REM (n1) = REM(n2) and also characters of n1 matches with characters of n2.
• Case 2: Spurious hit: - In this case REM (n1) = REM (n2) but characters of n1 are not equal to characters of n2.
• Case 3: If REM (n1) is not equal to REM (n2), then no need to compare n1 and n2.
For a given text T, pattern P and prime number q T=234567899797797976534356678886756456890975545343434 24545475655454 P=667888 q=11 So to find out this pattern from the given text T we will take equal number of characters from text as in pattern and divide the value of these characters with predefined number q and also divide the pattern with the same predefined number q. Now compare their remainders to decide whether to compare the text with pattern or not. Rem (Text) =234567/11=3 Rem (Pattern) =667888/11=1 As both the remainders are not equal so there is no need to compare text with pattern. Now move on to set of characters of same length next from text and repeat the procedure. The Boyer Moore Algorithm goes as follows: Rabin_Karp_Matcher (T,P,d,q) { n =Length (T) m= Length (P) t0=0 p=0 h=dm-1mod q for i=1 to m { p = (d * p + P[i]) mod q t0 =(d * t0 + T[i] ) mod q } for s =0 to n-m { if ts=p {
//comparison for spurious hits if P[1….m] = T[s+1…….s+m]
then print pattern matches at shift ‘s’ } if s<= n-m
Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457
www.ijcsit.com 5453
ts+1= (d(ts-h*T[s+1]) + T[s+1+m] ) mod q }
} So the entire process can be written as follows: where Say P has length L and S has length n. One way to search for P in S:
1. Hash P to get h(P). 2. Iterate through all length L substrings of S, hashing those substrings and comparing to h(P). 3. If a substring hash value does match h(P), do a string comparison on that substring and P, stopping if they do match and continuing if they do not.
IX.I Numerical Example: Let’s step back from strings for a second. Say we have P and S be two integer arrays: P = [5; 0; 3; 3; 0] S = [4; 8; 5; 0; 3; 3; 0; 8] The length 5 substrings of S will be denoted as such: S0 = [4; 8; 5; 0; 3] S1 = [8; 5; 0; 3; 3] S2 = [5; 0; 3; 3; 0] And so on … We want to see if P ever appears in S using the three steps in the method above. Our hash function will be: h(k)= (k[0] * 104 + k[1] * 103 + k[2] * 102 + k[3] * 101 + k[4] * 100)mod m Or in other words, we will take the length 5 array of integers and concatenate the integers into a 5 digit number, then take the number mod m. h(P) = 50330 mod m, h(S0) = 48503 mod m, and h(S1) = 85033 mod m. Note that with this hash function, we can use h(S0) to help calculate h(S1). We start with 48503, chop off the first digit to get 8503, multiply by 10 to get 85033, and then add the next digit to get 85033. More formally: h(Si+1) = [(h(Si) - (10
5 * first digit of Si)) * 10 + next digit after Si] mod m We can imagine a window sliding over all the substrings in S. Calculating the hash value of the next substring. In this numerical example, we looked at single digit integers and set our base b = 10 so that we can interpret the arithmetic easier. To generalize for other base b and other substring length L, our hash function is h(k) = (k[0]bL-1 + k[1]bL-2 + k[2]bL-3.... k[L - 1]b0) mod m And calculating the next hash value can be done by: h(Si+1) = b(h(Si) – b
L-1S[i]) + S[i + L] mod m Following is the example taken from [15]:
Figure 6:
The above figure[15] illustrates (a) A text string. A window of length 5 is shown shaded. The numerical value of the shaded
number is computed modulo 13, yielding the value 7. (b) The same text string with values computed modulo 13 for each possible position of a length-5 window. Assuming the pattern P = 31415, we look for windows whose value modulo 13 is 7, since 31415 ≡ 7 (mod 13). The first, beginning at text position 7, is indeed an occurrence of the pattern, while the second, beginning at text position 13, is a spurious hit. (c) Computing the value for a window in constant time, given the value for the previous window. The first window has value 31415. Dropping the high-order digit 3, shifting left (multiplying by 10), and then adding in the low-order digit 2 gives us the new value 14152.
X. IMPROVED IDEA: Theory As we can see, spurious hit is an extra burden on algorithm which increases its time complexity when we have to compare the text with pattern and won’t be able to get the pattern at that shift. So to avoid this extra matching, we’ve improved the Rabin Karp algorithm slightly, called IRK algorithm which says that along with remainders compare the quotients also. That is IRK checks whether, REM(n1/q)=REM(n2/q) and QUOTIENT (n1/q) = QUOTIENT (n2/q), where n1= pattern & n2=Text & q is the prime number. So, according to this technique along with the calculation of remainder, we will also find out the quotient and if both remainder and quotient of text matches with pattern then it is successful hit otherwise it is an unsuccessful hit or spurious hit and then we can remove the possibility of comparing the spurious hits. That means there is no extra computation of spurious hits if remainder and quotient are same then pattern found else pattern not found. Basically the algorithm is same as the original rabin karp algorithm, but with little modifications, which are shown in bold italic font. The algorithm goes as follows: IRK( T, P, d, q ) n ← length (T ) //text length m ← length ( P ) //pattern length h ← dm-1 mod q p ← 0 t0 ← 0 q_p ← 0 //quotient post hash calculation for pattern //quotient post hash calculation for portions of text of size m q_t ← 0 for i ← 1 to m //Preprocessing
do temp_p ← ( d*p + P[ i ] ) q_p ← temp_p / q p ← temp_p mod q temp_t ← ( d*t0 + T[ i ] ) q_t ← temp mod q t0 ← temp mod q
for s ← 0 to n – m // Matching //comparison only if quotient matches, removal of spurious hit do if p = ts && q_p = q_t then print “Pattern occurs with shift” s if s < n – m //quotient, post hash calculation of next m characters in text.
temp_t ← ( d * ( ts – T[ s + 1 ] * h ) + T[ s + m + 1 ] ) / q
q_t ← temp_t /q //subtracting LSB, Shifting and adding MSB then ts+1 ← ( d * ( ts – T[ s + 1 ] * h ) + T[ s + m + 1 ] )mod q ts= ts+1
X.I. Comparison using Graphs: The results of our experiments are depicted in the graphs below. In the first graphs we have represented the performance of the
Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457
www.ijcsit.com 5454
algorithms with a fixed text file size of 1MB. Y axis represents time in microseconds and X-axis represent the corresponding algorithms. Figure 7: comparison of algorithms with respect to 1 MB text file
Now we compare only between Rabin Karp and IRK algorithms with the same text file size of 1 MB, in figure 8. Figure 8: Comparison of Rabin Karp and IRK algorithms using file size 1MB.
Rabin Karp scores the running time of 100750 microseconds and IRK adjusted the running time within 95500 microseconds, both upon same 1 MB text file. Below is the graph which depicts the comparison between Rk and IRK algorithm using a 2MB file size. Also we’ve compared the algorithm upon 2 MB text file size, whose readings are as follows 260750 for Rabin Karp and 175250 for IRK algorithm.
Figure 9: depicts the comparison of Rabin Karp and IRK algorithms using file size 2MB.
X.II Example of IRK algorithm: T= ABBCABCA //text P= BCA //pattern q=13(say) d=256 (for character) Hash(P)= (66 * 2562 + 67 * 2561 + 65) mod q p = 0 // hash value for pattern q_p = 334045 //quotient
A B B C A B C A
hash(ABB) = 0 // same hash q_t0 = 328965 //but quotient different
A B B C A B C A
hash(BBC) = 1 q_t1 = 334026
A B B C A B C A
hash(BCA) = 0 q_t2 = 334045
A B B C A B C A
hash(CAB) = 7 q_t3 = 339047 // both matched
A B B C A B C A
hash(ABC) = 11 q_t1 = 328984
A B B C A B C A
hash(BCA) = 0 // hash matched q_t2 = 334045 // quotient matched
Since the hash =0 and quotient = 334045 both matched. Only the pattern BCA is matched. And hash(ABB) = 0 and quotient = 328965, which has not matched, ABB is not compared. X.3 Time Complexity In Best case doesn’t differ much from the original Rabin Karp algorithm, but the in average case complexity can be improved significantly. Due to imposing of constraint of matching the quotient post hashing as well as the hash value of the text portion of size m , reduction in comparison has been seen. Which reduces the time complexity during worst case from O((n-m+1)m) to O(nm+1). This time complexity is hugely depends on the selected prime number, q. So selecting the right prime number gives this algorithm a satisfiable optimization in terms of worst case time complexity.
XI. CONCLUSION AND FUTURE SCOPES This version of Rabin Karp algorithm can be used with Genetic Algorithm in order to search for a pattern in to huge text files of size >500MB. Implementation using GA can produce an improved version of this algorithm for more sophisticated use and can make the search even faster by using the genetic operators such as selection, mutation, crossover etc. Our Future scope lies among this thinking that it could be possible for us to implement this IRK algorithm using GA for optimize the pattern analysis. Further analysis and improvement of this algorithm is welcome from any scholars.
REFERENCES 1. Richard M. Karp, Michael O. Rabin, Efficient Randomized pattern-
matching algorithms, International Business Machine, 1987 2. Roberto Ierusalimschy, A Text Pattern-Matching Tool based on
Parsing Expression Grammars, 2008 3. Rajesh S., Prathima S., Reddy L.S.S., Unusual Pattern Detection in
DNA Database Using KMP Algorithm, International Journal of Computer Applications (0975 - 8887), Volume 1 – No. 22, 2010
4. Jiyeon Choi, Myka R. Ababon, Mai Soliman, Yong Lin, Linda M. Brzustowicz, Paul G. Matteson, James H. Millonig, Autism Associated Gene, ENGRAILED2, and Flanking Gene Levels Are Altered in Post-Mortem Cerebellum- PLOS ONE, 2014
5. Gupta, A.R., and State, M.W. (2007) Recent Advances in the Genetics of Autism. Biological Psychiatry 61, 429-437.
6. Donald E. Knuth, James H. Morris, Jr And Vaughan R. Pratt, FAST PATTERN MATCHING IN STRINGS, Vol. 6, No. 2, June 1977, SIAM J. COMPUT.
7. R. Boyer and J. Moore, A fast string searching algorithm”, CACM, 20, 10, 1977, pp.262-272.
Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457
www.ijcsit.com 5455
8. Christopher B. Kingsley, Identification of Causal Sequence Variants of Disease in the Next Generation Sequencing Era, Methods in Molecular Biology, Volume 700, 2011, pp 37-46.
9. Melissa Conrad Stoppler MD (2014, Jan 15). Genetic Diseases Overview [Online]. Available: http://www.medicinenet.com/genetic_disease/article.htm.
10. DNA Sequencing, Wikipedia, http://en.wikipedia.org/ wiki/Genetic_analysis#DNA_Sequencing
11 .Biological Databases, http://biotech.fyicenter.com /resource/Biological_databases.html
12. Michael T. Goodrich; Roberto Tamassia; David M. Mount, 2011. Data Structures and Algorithms in C++, Second Edition
13. Akhtar Rasool Amrita Tiwari et al, (IJCSIT) Vol. 3 (2) , 2012,3394 – 3397, International Journal of Computer Science and Information Technologies.
14. Sedgewick, Robert, 1984-Algorithms., ADDISON-WESLEY PUBLISHING COMPANY
15. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein et al. 2009, 3RD edition, Introduction to Algorithms, MIT Press.
16. Dan Gusfield. COMPUTER SCIENCE AND COMPUTATIONALBIOLOGY, University of California, Davis, 2007
17. Boyer-Moore Tutorial, The University of California, Davis, http://www.cs.ucdavis.edu/~gusfield/ cs224f11/ bnotes.pdf, 2007
Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457
www.ijcsit.com 5456
Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457
www.ijcsit.com 5457
A Pattern Matching Algorithm for Codon Optimization and CpG MotifEngineering in DNA Expression Vectors.pdf
A Pattern Matching Algorithm for Codon Optimization and CpG Motif-
Engineering in DNA Expression Vectors
Ravi Vijaya Satya and Amar Mukherjee
School of Engineering and Computer Science
University of Central Florida
Orlando, FL 32816
rvijaya, amar@cs.ucf.edu
Udaykumar Ranga
Jawaharlal Nehru Center for Advanced
Scientific Research
Jakkur, Bangalore, India
udaykumar@jncasr.ac.in
Abstract
Codon optimization enhances the efficiency of DNA
expression vectors used in DNA vaccination and gene
therapy by increasing protein expression. Additionally,
certain nucleotide motifs have experimentally been
shown to be immuno-stimulatory while certain others
immuno-suppressive. In this paper, we present
algorithms to locate a given set of immuno-modulatory
motifs in the DNA expression vectors corresponding to a
given amino acid sequence and maximize or minimize
the number and the context of the immuno-modulatory
motifs in the DNA expression vectors. The main
contribution is to use multiple pattern matching
algorithms to synthesize a DNA sequence for a given
amino acid sequence and a graph theoretic approach for
finding the longest weighted path in a directed graph
that will maximize or minimize certain motifs. This is
achieved using O(n 2 ) time, where n is the length of the
amino acid sequence. Based on this, we develop a
software tool.
Key Words: Codon optimization, immuno-modulatory
motifs, multiple pattern matching, longest weighted path.
1. Introduction
DNA vaccines have revolutionized the field of vaccine
technology by demonstrating the ability to induce humoral
and cellular immune responses in experimental animals
and humans [9]. Immunization of animals with plasmid
DNA encoding a protein antigen was an accidental
observation that eventually led the way to a novel strategy
of immunization. DNA vaccines, also known as "naked
DNA" or "nucleic acid" vaccines, encode the antigens of
pathogenic organisms including viruses, bacteria, fungi
and parasites [40]. The protein antigen is processed
within the cell and presented by the MHC-I and –II
pathways thereby eliciting specific immune responses
essential for controlling pathogenic infections [2].
Although DNA vaccines have been successful in
generating strong immune responses in smaller animal
model such as mouse, they have not been as efficient in
larger species such as primates and humans [23, 3, 25, 4].
Stimulating both the arms of the immune system is often
desirable for efficient control of infectious diseases
especially in the larger animals. In the case of
recombinant protein vaccines, immune-enhancers
technically known as adjuvants, such as Freund’s
adjuvants and Alum, are in use to enhance antigen
specific immune responses. However, no such adjuvants
are available for use in the context of DNA vaccines. The
lack of suitable adjuvants for DNA vaccines is one
important reason for the poor performance of the DNA
vaccines in larger animals.
Nucleotide sequence encoding a foreign protein is
directly placed under the control of a mammalian
promoter to construct a DNA vaccine. Several amino
acids are encoded by more than one triplet codon and
different organisms have variable requirement for codon
preference [13]. Cloning of a wild type gene from a
parasite into a DNA vaccine often leads to insufficient
levels of protein synthesis in the host cell as a result of
codon bias between the species. Successful immunization
with DNA vaccines requires high expression of cloned
genes to synthesize large quantities of the foreign
protein. For instance, the overall genetic content of
Human Immunodeficiency Virus-1 is AT-rich, while that
of the human beings is CG-rich. The codon frequency of
the pathogenic DNA embedded into the mammalian
expression vector may not be optimal for adequate
protein expression in the host resulting in low level
protein expression. A potential solution for the codon
bias is to optimize the codon sequences of a gene to suit
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
the requirements of the host without altering the original
amino acid sequence of the protein [41, 16]. This
approach has been successful in eliciting strong immune
responses in several species of experimental animals [8,
28]. While immunization with synthetic genes, codon-
optimized for mammalian expression stimulated strong
immune response, immunization performed in parallel
with wild type genes generated low or moderate levels of
immune response [34, 36].
In addition to codon optimization of the synthetic
genes, a range of molecular approaches is being
evaluated to up-regulate immune responses generated by
DNA vaccines. Co-expression of cytokine genes [20], co-
stimulatory receptors [12, 35] or other
immunemodulators [33], synthetic assembly of T-helper
or CTL epitopes [6] and formulation with a variety of
chemical adjuvants [11, 24, 33, 37, 39] have been some
of the approaches reported. However, most of these
approaches may not be suitable for human application
due to toxic manifestations of the adjuvants. An ideal
agent used as an adjuvant for DNA vaccine must enhance
the immunogenicity without apparent cytotoxicity to the
host. Engineering CpG islands into DNA vaccines has
been one promising approach that showed enhanced
immune responses [19].
The well-established immune-enhancing property of
bacterial DNA has been mapped to sequence motifs
consisting of un-methylated CpG dinucleotides flanked
by base pairs in a specific context [18]. Two important
differences between the bacterial and mammalian DNA
enable the mammalian innate immune system to
recognize the former as a foreign component. CpG motifs
in bacteria are found at the expected 1:16 frequency;
however, their frequency in mammals is 4 times less than
expected. Bacterial CpG are non-methylated while those
of mammals are mostly methylated. The mammalian
immune system takes advantage of these two chemical
differences between the bacterial and mammalian DNA
to identify a bacterial infection and wage strong and
rapid anti-bacterial immune responses [29].
CpG-mediated activation of the mammalian innate
immune system has been extensively exploited in the
vaccination technology. Co-injection of CpG containing
oligonucleotides or empty vectors with protein antigens
elicited potent immune response to the antigen.
Methylation of the CpG motifs, on the other hand,
abrogated immune response to the antigens suggesting
that the CpG motifs possess adjuvant properties only
when unmethylated or hypomethylated [[5, 17]. Since the
DNA expression vectors used in genetic immunizations
are usually grown in bacteria, several CpG motifs on the
plasmids are not methylated possibly activating the
innate component of the mammalian immune system.
Presence of hypo-methylated CpG motifs is essential for
induction of immune responses to the antigens encoded
by the vectors.
While certain CpG motifs are immuno-stimulatory
(CpG-S), enhancing immune responses when the host is
vaccinated, others CpG motifs in a different nucleotide
context are immuno-inhibitory or -neutralizing (CpG-N)
abrogating antigen-specific immune response [18].
Engineering the nature and the frequency of the CpG
motifs may be critical for the design of DNA expression
vectors. DNA expression vectors are primarily used for
two different applications, expression of a foreign gene in
a host for vaccination or for correcting a genetic defect of
the host [26].
Although the basic design of these two types of vectors
is identical in several respects, their requirement for the
presence and nature of CpG motifs is diagonally
opposite. While genetic vaccines require CpG-S motifs
for efficient stimulation of the host immune system, such
a strong response must be avoided for long-term survival
of the DNA expression vector intended for gene therapy.
An ideal strategy for designing DNA vaccines must
recruit as many CpG-S motifs as possible, concomitantly
eliminating as many CpG-N motifs as possible without
altering the original protein sequence of the genes. In
contrast, design of a DNA expression vector intended for
gene therapy must eliminate as many CpG-S motifs as
possible and recruit as many CpG-N motifs as possible
for the best result. Thus, depending on the application,
some CpG motifs may be desirable, while certain others
may be undesirable.
The software tool we report here is designed to help
the researcher to engineer the composition of a gene with
respect to codon usage and CpG motifs without
modifying the original amino acid sequence of the
protein. Codon optimization is advantageous for protein
expression in a heterologous expression system such as
E.coli, Picchia, Saccharomyces, Baculo virus,
mammalian cells etc. The software could also engineer
the content and nature of the CpG motifs recruited into a
DNA expression vector in addition to optimizing the
codon usage and resolve potential conflicts arising
between these two requirements and find an optimal
nucleotide sequence.
2. Prior Work and Summary of
Contributions
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
Objective of this algorithm is to (1) identify the best
triplet codon for codon optimization and (2) engineer the
nature and content of the CpG motifs of a gene expressed
from a genetic vector. Input for the software is the amino
acid sequence of a gene of interest or the nucleotide
sequence that needs genetic modification. The input
amino acid sequence is reverse-translated to generate
nucleotide sequence that is optimized for the codon and
CpG content. During the process of reverse-translation,
there is a one-to-many relationship between the amino
acid sequence and the DNA sequences. A given amino
acid sequence could correspond to an exponential
number of DNA sequences with respect to the length of
the given amino acid sequence. Our task here is to choose
the particular sequence that has the desirable properties
(maximum number of the desirable motifs, and the
minimum number of the undesirable motifs) from this
large number of possible DNA sequences. A brute force
search on all the possible DNA sequences will take
exponential time. The motifs (or the patterns) that need
optimization are small DNA sequences, generally not
exceeding eight nucleotides in length. If different motifs
have to be optimized simultaneously, it is possible to
have multiple motif occurrences (both desirable and
undesirable) sharing the same position in the amino acid
sequence. In such cases it may not be possible to have all
the motif occurrences in the same DNA sequence. This
problem will be explained in detail in Section 4.
McInerney[27] presented a program to perform the
codon usage analysis on a sequence or a database of
sequences. Other work in this area of bioinformatics is
mostly on finding CpG islands within the genes and
transcriptional elements that regulate gene expression.
Ponger et al [30] developed a software package to locate
CpG islands associated with the transcription start sites
of the genes. Lin et al [22] presented software for
locating non-overlapping maximum average segments in
a given sequence. These publications analyzed DNA
sequences and searched for nucleotide motifs with high
CG content.
Our emphasis in this paper is to synthesize a DNA
sequence that codes for a functional protein, with certain
characteristics and optimization parameters. The main
contribution is to use multiple pattern-matching
algorithms to synthesize a DNA sequence for a given
amino acid sequence and a graph theoretic approach for
finding the longest weighted path in a directed graph that
will maximize or minimize certain motifs as well as
guarantee certain fitness factors of codon frequency usage
for a particular species. This is achieved using O(n 2 ) time
and storage resources compared to the brute force
algorithm that might take exponential amount of
resources, where n is the length of the amino acid
sequence. The software tools developed for the purpose
will find applications in the rapid development of
vaccines and gene therapy.
In Section 3, we introduce the basic terminology. In
Section 4, we give a precise combinatorial formulation of
the problem in terms of pattern matching operations and
present the main algorithm. Section 5 gives the
complexity analysis. Section 6 gives a description of the
software and in Section 7 we present the results.
3. Terms and definitions
The alphabet of a DNA sequence is denoted as ΣN =
{A,C,G,T} 1 where A,C,G,T are the four DNA molecules
adenine, cytosine, guanine, and thymine, respectively.
The alphabet for the amino acid sequence is ΣA =
(A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,stop}, each
letter indicating one of the 20 amino acids, and stop
indicating the stop codon. For example, ‘A’ indicates
Alanine, ‘R’ indicates Arginine, ‘N’ indicates
Asparagine, etc. A codon c is a triplet of the DNA
symbols in ΣN. As the size of DNA alphabet is 4, there
are 64 possible codons. In the context of amino acid
translation from mRNA, each element of ΣA gets mapped
to a maximum of 6 codons 2 . This mapping is also
referred to as the genetic code. Symbolically, the
mapping of an amino acid σ ∈ ΣA can be denoted as a set
C(σ) of n σ codons, where 1 ≤ nσ ≤ 6. The inverse-
mapping associates an amino acid σ with a codon c,
denoted by C -1
(c) = σ. The frequency of occurrence of
members of C(σ) is different for each species. This
relative difference in the frequency of occurrence for each
codon is called codon bias. The codon usage table for a
species gives the codon bias information for that species 3 .
For an amino acid σ, the codon bias (or the frequency
information) is given by the set of fractions F(σ),
corresponding to C(σ). For each fi ∈ F(σ), 1 ≤ i≤ nσ, 0 ≤
fi≤ 1, and 1 1
=∑ =
σ n
i
i f . When designing a DNA vaccine
for a species, it is often desirable to have those codons
that occur more frequently in that species. Therefore, for
each amino acid we generally choose the most frequent
1 The protein synthesis takes place from mRNA. The actual RNA
alphabet is {A,C,G,U}, in which thymine is replaced by uracil. For
simplicity of notation, we use T instead of U in this paper.
2 See http://molbio.info.nih.gov/molbio/gcode.html
3 See http://www.kazusa.or.jp/codon/ for codon usage tables for different
species.
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
codon, henceforth referred to as the best codon. For an
amino acid σ, the index of the best codon, b σ , is defined
as b σ = {i | fi = MAX(F(σ))}. If there are more than one
values for b σ ,, we take one arbitrarily. It is also useful to
have a measure of how desirable a codon is. This
quantity is given by the fitness of the codon. The fitness
of the ith codon, ci of an amino acid σ is given by:
fitness(ci) = σ b
i
f f
, the ratio of the frequency of the ith
codon of σ to the frequency of the best codon of σ. Given
a list of codons, the average fitness of the list of codons
is defined as the arithmetic mean of the fitness values of
all codons in the list.
A motif is a short DNA sequence of length l, denoted
as M = m1m2m3……ml, mi ∈ ΣN,1≤ i ≤ l. We will
consider immuno stimulatory and immune neutralizing
(CPG-S and CPG-N) sequences as possible motifs. A
motif occurs as part of a long DNA sequence and as such
will correspond to a sequence of codons. Since the
beginning of the codon boundary could be aligned with
any one of the first, second, or third positions in the
motif, this gives rise to three possible codon encodings
for the motif. In each codon encoding, the codons that lie
completely within the motif can be translated to an amino
acid sequence. These amino acid sequences will be
referred to as amino acid search patterns (henceforth
referred to as aasp). The parts of the motif (less than
three nucleotide beginning or the end of the motif that do
not completely correspond to a codon are referred to as
the head or tail, respectively, of the motif. When l < 5,
all or some of M 1 , M
2 , M
3 may be null. The
corresponding amino acid search patterns, aasp(M 1 ),
aasp(M 2 ), aasp(M
3 ) will also be empty strings, but their
head and/or tail portions will be non-null. This implies
that, for such an aasp, every position in the amino acid
sequence is a possible occurrence of the motif. The three
possible codon encodings of a motif M and their
corresponding three amino acid search patterns are
shown in Table 1.
As an example, if M = ATCGAT, M 1 = ATCGAT,
aasp(M 1 ) = ID, head(M
1 ) = null, tail(M
1 ) = null; M
2 =
TCG, aasp(M 2 ) = S, head(M
2 ) = A, tail(M
2 ) = AT ; and
M 3 = CGA, aasp(M
3 ) = R, head(M
3 ) = AT, tail(M
3 ) =
AT. For example, searching for the motif M = ATCGAT
in the amino acid sequence …LSI…, we find S =
assp(M 2 ). head(M
2 ) = A and tail(M
2 ) = AT. L, the amino
acid that precedes S, has two codons that end with A
(head(M 2 )), and I, the amino acid that follows S, has
three codons that begin with AT (tail(M 2 )), as shown
below.
L S I
CTA
TTA
TCG ATA
ATC
ATT
Selecting any codon from the first column and any
codon from the third column will result in an occurrence
of ATCGAT. Therefore, there are 2*3 = 6 unique
combination of codons that result in the occurrence of
ATCGAT at the location in the amino acid sequence
beginning with L and ending with I. These will be
referred to as the possible occurrences of the motif. The
codons corresponding to an occurrence will be referred to
Table 1. Head and tail of amino acid search patterns
)()(
3*3/321
1
l mmmmM LL=
aasp(M 1 ) = a1a2…… 3/la ,
where a1 = C -1
(m1m2m3),etc.
nullMhead =)( 1
nullMtail =)( 1
if 03mod =l
l mMtail =)(
1 if 13mod =l
ll mmMtail
1
1 )(
−
= if 23mod =l
)()(
13*3/)1(432
2
+− =
l mmmmM LL
aasp(M 2 ) = a1a2…… 3/)1( −la ,
where a1 = C -1
(m1m2m3), etc.
1
2 )( mMhead = nullMtail =)(
2 if 13mod =l
l mMtail =)(
2 if 23mod =l
ll mmMtail
1
2 )(
−
= if 03mod =l
)()(
23*3/)2(543
3
+− =
l mmmmM LL
aasp(M 3 ) = a1a2…… 3/)2( −la ,
where a1 = C -1
(m3m4m5),etc.
21
3 )( mmMhead = nullMtail =)(
3 if 23mod =l
l mMtail =)(
3 if 03mod =l
ll mmMtail
1
3 )(
−
= if 13mod =l
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
as constituent codons for that occurrence. We call the
collection of all possible occurrences of all motifs as the
potential occurrences, as all of these occurrences might
not be retainable in the final solution. As noted in
Section 1, some of these occurrences may be desirable
while others may be undesirable; we assign a desirability
value of +1 and –1, respectively , for these occurrences.
In general, these values could be arbitrary.
An additional step of pre-processing is required for
undesirable motifs. We want to avoid occurrence of an
undesirable motif. However, we need to make sure that
the undesirable motifs do not re-appear in the final
solution as a result of selecting the best codons for amino
acids that are not part of a desirable motif. We achieve
this by considering all alternate codon combinations of
each potential occurrence of an undesirable motif, and
treating them as potential occurrences with a desirability
value of zero. We call these alternate potential
occurrences as neutral occurrences. In the above
example, if ATCGAT were an undesirable motif, out of a
total of 6*6*3 = 108 possible encodings for the
subsequence LSI, 108-6 = 102 codon combinations do not
result in the occurrence of ATCGAT. Each one of these
102 combinations will be registered as a neutral
occurrence.
The weight of an occurrence is the sum of the
average fitness of the constituent codons and the
desirability for that motif. The total weight of a DNA
sequence is the sum of the individual weights of all the
occurrences from the given set of motifs that are part of
the DNA sequence.
4. The Algorithm
4.1 Problem Definition
Given an amino acid sequence SA = a1a2a3…………aJ,
and a list of motifs M = {M1,M2,M3, ……,Mk}, with their
corresponding desirability values, D = {D1, D2, D3, ...
..., Dk}, the problem is to find the codon sequence, SC =
c1c2c3…………cJ, where each ci ∈ C(ai), 1≤ i ≤ J, and
the DNA sequence, SD = d1d2d3…………d3*J,
corresponding one-to-one to SC, where ci = d3i-2d3i-1d3i,
such that: 1)SD has the maximum total weight possible;
and 2)Every codon ci in SC that is not part of an
occurrence (desirable, undesirable or neutral) of a motif
in M is the best codon for the corresponding amino acid
ai. In other words, we need to find the exact mapping of
the amino acid sequence into a DNA sequence that has
the maximum weighted occurrences of the motifs in M.
We find all possible occurrences of all the motifs,
including neutral occurrences for the undesirable motifs.
Out of these, if two occurrences (of the same motif or of
different motifs) extend over the same position in SA, and
require different codons for that position, then the two
occurrences have a conflict and cannot co-exist. We have
to select the largest weighted subset of these occurrences
that are mutually non-conflicting.
4.2 The Algorithm Steps
Step1: Pre-processing:
A)Form a list Lp of search patterns consisting of all
three search patterns aasp(Mi 1 ), aasp(Mi
2 ) and aasp(Mi
3 )
for each motif Mi ∈ M, 1 ≤ i ≤ k, in the list of motifs.
Also, enumerate the head and tail for all three
alignments of each motif. Each entry in Lp is an element
of a data structure having the following members:
• codon_list, the list of codons that lie entirely
within the motif.
• search_pattern, the aasp corresponding to the
codon_list.
• length, the size of the codon_list. (note: length
may be 0 for short motifs).
• head, the corresponding head – a DNA sequence
which is either null, or 1 or 2 characters in length.
• tail, the corresponding tail – a DNA sequence
which is either null or 1 or 2 characters in length.
• desirability, the desirability value for the
corresponding motif, +1 or –1.
B) Build an Aho-Corasick keyword tree [1] ACT for
all the amino acid search patterns.
• A node in the keyword tree might correspond to
multiple search patterns with different heads and
tails, henceforth referred to as the
list_of_matches, denoted by the indices into Lp.
• When the length of a motif is less than five
nucleotides, some amino acid search patterns
might be empty strings. i.e., they will have non-
empty head and/or tail, but empty search pattern.
In the keyword tree, the root itself will correspond
to such empty amino acid search patterns.
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
Step2: Finding all potential occurrences: In this
step, we build the list of potential occurrences, P. Each
entry in P will have the following attributes:
• pBegin is the position in SA at which the match
starts.
• pEnd, the position in SA at which the match ends.
• A list of codons Lc of size pEnd-pBegin+1. The
codon at position ‘r’ in Lc indicates rth codon in
the match.
• The weight of the occurrence. We will generate
the list P in sorted order in ascending values of
pBegin; if two occurrences have same pBegin
value, they are sorted in ascending value of pEnd.
The following algorithm traverses the tree
according to the given amino acid sequence.
Algorithm 4.2.1 find_potential_occurrences
Inputs: The Aho-Corasick keyword tree ACT, the amino
acid sequence SA, list of search patterns Lp
Output: The list of potential occurrences, P, sorted.
Initialize P ← null, cur_node ← ACT.root
for q=1, until q ≤ J, traverse ACT:
if cur_node.child[aq] != null
cur_node ← cur_node.child[aq]
else
cur_node ← cur_node.failure_link
if cur_node.list_of_matches is not empty,
/*Corresponding aasp has occurred. Check if the motif
corresponding to the aasp can actually occur. The head
and tail are compared with all the codons of the amino
acids on either side aasp*/
for each entry i in the list, do the following:
a. Enumerate position p, the starting position of
Lp(i).search_pattern in SA.
b. pBegin ← p, pEnd ← q
c. if (Lp(i).head ≠ null) /* head is not null */
i. pBegin ← pBegin-1
ii. form a set of codons Chead such that
Lp(i).head is a suffix of each chead ∈
Chead, Chead ⊆ C(ap-1)
d. if (Lp(i).tail ≠ null) /* tail is not null */
i. pEnd ← pEnd+1
ii. form a set of codons Ctail such that
Lp(i).tail is a suffix of each ctail ∈ Ctail,
Chead ⊆ C(aq+1)
e. for each unique combination of chead ∈ Chead and
ctail ∈ Ctail, insert an entry to the list of potential
occurrences, P, with the entries pBegin, pEnd,
Lp(i).codon_list, and the corresponding weight
of the occurrence. Note: Chead and/or Ctail will be
empty if Lp(i).head and/or Lp(i).tail are null.
if a non-zero number of occurrences were added in
the above step, and if Lp(i).desirability < 0, insert an
entry in P for each unique combination of codons
selecting one codon each from the sets
C(SA(pBegin)), ……, C(SA(pEnd)). i.e, insert all
neutral occurrences corresponding to the current
undesirable occurrence into P.
Example: Let us assume we are trying to maximize the
number of occurrences of the motif ATCGAT in the
amino acid sequence MEPRVID. The three amino acid
search patterns, aasp(M 1 ), aasp(M
2 ) and aasp(M
3 ) are
ID, S, and R, respectively. Searching for them, we find R
at position 4, and ID at positions 6-7. R is aasp(M 3 ),
head(M 3 ) is AT, tail(M
3 ) is T, therefore we have to see if
any codons ending with AT can occur at position 3 in
MEPRVID, and if any codons beginning with T can occur
at position 5. The amino acid at position 3 is P, which
has the codons [AGA, AGG, CGA, CGC, CGG and
CGT], none of which end with AT, therefore it is not
possible for ATCGAT to occur from positions 3-5.
Coming to the next match, ID at positions 6-7, ID
=aasp(M 1 ). head(M
1 ) = null, tail(M
1 ) is null, therefore
ID at positions 6-7 can be added to the list of matches.
The corresponding entry in P will be: pBegin ← 6, pEnd
← 7, list of codons, LC ← ATC,GAT.
Step3: Mapping to a directed acyclic graph: In this
step, we find a non-conflicting subset of the occurrences
in P that has the maximum over-all weight. This can be
done by mapping the problem to that of finding the
critical path in a directed acyclic graph and finding the
nodes in the critical path, as follows:
1. Each occurrence in P is represented by a node v
in a weighted directed acyclic graph G = (V,E), where V
and E denote the vertex and the directed edge set of G.
The node weight wi of a node vi, is given by the weight of
the corresponding occurrence, P(i). i.e, wi ← P(i).weight,
1 ≤ i ≤ |P|, wi ∈ W, where W is the weight vector
corresponding to P.
2. Start with an empty edge list E. For each i, 1≤ i
≤ |P|, and j, i < j ≤ |P|, if P(i).pEnd < P(j).pBegin or if
P(i) and P(j) have the same codon encodings for all
positions r, P(j).pBegin ≤ r ≤ min(P(i).pEnd, P(j).pEnd),
then vi, vj will have an edge between them. i.e., E ← E∪
(vi,vj).
Now, we can solve for the longest weighted path
(critical path) in G, and insert the nodes in the longest
path in a list Lv. The longest weighted path can be found
by the critical path algorithm [7].
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
4.3 Translation of the amino acid sequence to a
DNA sequence
Once we have a set of occurrences of motifs that do
not conflict with each other, the next step is to do the
actual mapping from the amino acid sequence to a codon
(or DNA) sequence. At each position in SA that is not part
of any occurrence, we select the best codon for the
corresponding amino acid. At positions in SA that are
part of an occurrence, we select the codon that is required
for the occurrence.
Let Lv be the final list of occurrences.
1. for each ai ∈ SA, 1 ≤ i ≤ J,
ci ← best codon (C(ai))
2. for all i, 1 ≤ i ≤ |Lv|,
for all j, Lv(i).pBegin ≤ j ≤ Lv(i).pEnd, /* j is
between pBegin and pEnd for the occurrence Lv(i) */
cj ← Lv(i). c
L [j-pBegin] /*the codon sequence
for the occurrence of Lv(i) */
At the end of the second step, we will have the output
codon sequence, SC = c1c2c3……cJ. This can be converted
into the output DNA sequence SD by replacing each
codon with the corresponding DNA triplet.
5. Complexity Analysis
Step1-Pre-processing: Enumerating the amino acid
search patterns takes constant time for each motif
(assuming that the upper bound for the length of each
motif is a constant). There are k motifs, therefore, this
step takes O(k) time. Building the search tree is linear,
therefore takes O(k) time.
Step2- Enumeration of all potential occurrences:
Finding the potential occurrences O(L+3k+|P|),
according to the Aho-Corasick Algorithm. The actual
number of matches considered is greater than |P|, but still
in O(|P|). As the length of each motif is bound by a
constant, so is the number of all possible codon
combinations for each motif. Therefore, |P| is in O(L).
Inserting an entry into P takes log|P| time, as P is a
sorted list. Therefore, the overall complexity for this step
is |P|log|P|.
Step3-Mapping P to a directed acyclic graph (DAG):
a. Constructing the graph – involves checking the
upper triangle of the adjacency matrix of the graph -
takes O(|P| .(|P| -1)/2) ~ O(|P| 2 ) time.
b. Finding the critical path takes O(|V| + |E|) time.
Here, |V| ← |P|. The graph is expected to be dense in
general, therefore, |E| ← |P| 2 . Therefore, this part
takes O(|P| 2 ) time.
As the number of potential occurrences found, |P|, is
in O(J), the overall complexity of the algorithm is O(J 2 ).
6. Software
The software is available both as a console program,
and as a windows application with a GUI. The program
requires four input files: a file containing the input amino
acid sequence, a file containing the codon bias
information, and a file each for the desirable and
undesirable motifs. The desirability values can be
supplied in the input files. If desirability values are not
supplied in the input files, a default value of +1 is
assigned for desirable motifs, and –1 for undesirable
motifs. Additionally, it is made sure that each
occurrence of an undesirable motif has a weight less than
zero, by setting it to a small –ve value if it is zero. The
amino acid sequence file should be in the FASTA format,
containing a single sequence. The output is in ASCII text
format, listing different solutions with varying degrees of
codon optimization. The GUI writes the formatted results
to an html file. The sample out put of the program is
shown in figure-3.
7. Results
We tested the validity of the program on different
proteins. The results are shown in Table 2. These test
cases were run with list of 33 desirable motifs and 5
undesirable motifs. Each motif was any where between 3
to 8 nucleotides long. In the sequences that were tested,
all the occurrences of undesirable motifs could be
eliminated. However, in some cases, it might be
impossible to eliminate all occurrences of undesirable
motifs.
Table 2. Results - CpG-S and CpG-N motifs
Amino
acid
sequence
(length)
CpG-S
motfs
found
CpG-S
motfs
retained
CpG-N
motifs
found
CpG-N
motifs
eliminate
d
Tat
(72)
80 7 66 66
C3d
(302)
224 30 273 273
M23809
(226)
164 24 209 209
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
Table 3 shows the average codon fitness, which is a
measure of the degree of codon optimization. It can be
observed that the average fitness could be improved by
approximately 30% for all amino acid sequences tested.
Table 3. Results – average fitness
Amino
acid
sequence
(length)
Wild
type
Maximizin
g CpG-S
motifs
Minimizin
g CpG-N
motifs
Both
max.
and
min.
Tat
(72)
0.671 0.948 0.981 0.930
C3d
(302)
0.672 0.935 0.985 0.922
M23809
(226)
0.654 0.950 0.980 0.937
The analysis of one of these proteins, a 72 amino acid
regulatory protein, Tat, of the Human Immunodeficiency
Virus type-1 (HIV-1), is presented here. The Tat protein
of HIV-1 plays critical role in regulating viral gene
expression [15]. Additionally, Tat also modulated several
functions of the host and a wide range of pathogenic
properties has been ascribed to this viral protein.
Considering the pathogenic significance of Tat, this viral
antigen is being used as an important component of the
viral vaccines, to elicit cell-mediated and humoral
immune responses [31].
One important limitation of using the wild type Tat
gene directly as a vaccine component without any
modification is the sub-optimal codon content of the viral
gene for mammalian immunization. Selective use of
specific codons for protein translation is a characteristic
feature of several species, a phenomenon called codon
bias. Recent approaches of enhancing the immune
responses against antigens have laid emphasis on protein
translation efficiency, which is a function of the base
composition of the pathogen. Codon usage of the Tat
protein of HIV-1 is strikingly different from that of the
human genes. The AT content of HIV-1 Tat is
approximately 54% while that of the host human genome
is about 35%, relatively low in AT content. Sub-optimal
codon use could result in compromised protein
translation efficiency in a mammalian context and may
reflect in the generation of a poor immune response
against the AT-rich gene.
The characteristic features of codon use of the Tat
gene in the virus are depicted in figure-2. The 72 amino
acids encoding Tat exon-1 consisting of 219 bp represent
the consensus sequence of the HIV-1 subtype-C strains.
We used the 72 amino acid sequence (and the stop
codon) as an input for optimizing the nucleotide
sequence for mammalian expression. Alternatively, it is
also possible to use the wild type nucleotide sequence of
the gene directly obtained from the virus as the input.
The results (the occurrences of desirable and undesirable
motifs) were manually verified.
Using the software, we identified a total of 37 codons
that could be replaced by the corresponding best codons
to optimize the codon content of this gene to match that
of the mammalian genome. Of note, optimization of the
codons is fairly a simple task, provided the objective is
only to replace the sub-optimal codons with the best ones,
as is the case with several of the substitutions. For
instance, serine at the position 16 in the viral gene is
encoded by the codon AGT that has a fitness value of
only 0.62, the lowest of all the 6 codons available for this
amino acid (Figure 2). Substituting AGT at this position
with AGC (fitness value 1.0) is likely to improve the
average fitness of the viral gene for mammalian protein
expression.
The algorithm identified seven potential motifs where
CpG-S motifs could be engineered into the gene. One of
these motifs, comprising of the amino acid residues S57
and A58 (AGC GCT), is naturally present in the viral
gene (Figure 2). However, a possible conflict was
identified between the possible codon optimization of
these amino acid residues and not disrupting the CpG-S
motif in this sequence. The codon GCT (fitness 0.67)
encoding for alanine could have been substituted by the
best codon GCC (fitness 1.0) for improved average
fitness of the gene. Importantly, use of GCT in the gene
design allowed recruiting a potential CpG-S island that
could have far more beneficial effect on immunization
potential of the DNA vaccine. The average fitness of this
codon combination (AGC GCT) 0.835 ((1.0 + 0.67) / 2),
as opposed to that of best codon combination 1.0, may
not significantly modulate the protein translation
efficiency.
The algorithm also identified a second site (L35 and
V36) in the viral gene where substituting the original
viral codons CTA GTT (average fitness 0.275) with the
best codons CTG GTG (average fitness 1.0) would
improve the overall fitness of the gene. In contrast,
replacing the original codons with others, CTC GTA
(fitness 0.36), would permit recruiting a putative CpG-S
motif into the gene, though at the expense of average
fitness of the gene (Figure 2). At the moment, it is not
clear how to resolve a conflict that may arise between
optimizing the codon content of a gene and recruiting
CpG-S motifs into the sequence. Most often, the codons
useful in CpG motif design are not the ones with the
highest fitness value thus leading to a potential
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
disagreement between the diagonally juxtaposed
requirements.
Optimization of Tat sequence finally resulted in a
nucleotide sequence that was modified in a total of 37
codons (Figure 2). The AT content of the gene changed
from 54% of the wildtype gene to 43% in the codon-
optimized version thus resembling closely that of the
mammalian context. The overall fitness value of the
codon-optimized gene also improved from 0.67 to 0.93
suggesting a possible improvement at the level of protein
translation.
Using the algorithm, we designed several variants of
the Tat gene that differed from one another in the
number of CpG islands engineered and the extent of
codon content optimized. Experiments are presently
underway to evaluate the antigenic potential of these Tat
expression vectors in different mouse strains.
Acknowledgements
U. R. acknowledges financial support from the
Department of Biotechnology, Government of India.
References
[1] Aho A. and Corasick M: Efficient string matching: an aid
to bibliographic search. Comm. CAN, 18:333-40,1975.
[2] Bagarazzi ML, Boyer JD, Ayyavoo V, Weiner DB: Nucleic
acid-based vaccines as an approach to immunization against
human immunodeficiency virus type-1. Curr Top Microbiol
Immunol 226:107-143 (1998).
[3] Boyer JD, Cohen AD, Vogt S, Schumann K, Nath B, Ahn
L, Lacy K, Bagarazzi ML, Higgins TJ, Baine Y, Ciccarelli RB,
Ginsberg RS, MacGregor RR, Weiner DB: Vaccination of
Seronegative Volunteers with a Human Immunodeficiency
Virus Type 1 env/rev DNA Vaccine Induces Antigen-Specific
Proliferation and Lymphocyte Production of beta-Chemokines.
J Infect Dis 181:476-483 (2000).
[4] Calarota S, Bratt G, Nordlund S, Hinkula J, Leandersson
AC, Sandstrom E, Wahren B: Cellular cytotoxic response
induced by DNA vaccination in HIV-1-infected patients. Lancet
351:1320-1325 (1998).
[5] Chen Y, Lenert P, Weeratna R, McCluskie M, Wu T,
Davis HL, Krieg AM: Identification of methylated CpG motifs
as inhibitors of the immune stimulatory CpG motifs. Gene Ther
8:1024-1032 (2001).
[6] Ciernik IF, Berzofsky JA, Carbone DP: Induction of
cytotoxic T lymphocytes and antitumor immunity with DNA
vaccines expressing single T cell epitopes. J Immunol
156:2369-2375 (1996).
[7] Dolan A., Aldous J.: Networks and algorithms, John Wiley
and Sons, New York, NY, 1993. pp. 366-386.
[8] Egan MA, Charini WA, Kuroda MJ, Schmitz JE, Racz P,
Tenner-Racz K, Manson K, Wyand M, Lifton MA, Nickerson
CE, Fu T, Shiver JW, Letvin NL: Simian immunodeficiency
virus (SIV) gag DNA-vaccinated rhesus monkeys develop
secondary cytotoxic T-lymphocyte responses and control viral
replication after pathogenic SIV infection. J Virol 74:7485-
7495 (2000).
[9] Gurunathan S, Klinman DM, Seder RA: DNA vaccines:
immunology, application, and optimization*. Annu Rev
Immunol 18:927-974 (2000).
[10] Gusfield D: Algorithms on Strings Trees and Sequences.
Cambridge Univeristy Press, 1997.
[11] Hamajima K, Sasaki S, Fukushima J, Kaneko T, Xin KQ,
Kudoh I, Okuda K: Intranasal administration of HIV-DNA
vaccine formulated with a polymer, carboxymethylcellulose,
augments mucosal antibody production and cell- mediated
immune response. Clin Immunol Immunopathol 88:205-210
(1998).
[12] Ihata A, Watabe S, Sasaki S, Shirai A, Fukushima J,
Hamajima K, Inoue J, Okuda K: Immunomodulatory effect of a
plasmid expressing CD40 ligand on DNA vaccination against
human immunodeficiency virus type-1 [In Process Citation].
Immunology 98:436-442 (1999).
[13] Ikemura T: Codon usage and tRNA content in unicellular
and multicellular organisms. Mol Biol Evol 2:13-34 (1985).
[14] Jareborg N, Durbin R, ‘Alfresco--a workbench for
comparative genomic sequence analysis’, Genome Res 2000
Aug;10(8):1148-57.
[15] Jeang,K.T., H.Xiao and E.A.Rich: Multifaceted activities
of the HIV-1 transactivator of transcription, Tat. J Biol Chem
Dev. 274, 28837-28840 (1999).
[16] Kim CH, Oh Y, Lee TH: Codon optimization for high-
level expression of human erythropoietin (EPO) in mammalian
cells. Gene 199:293-301 (1997).
[17] Klinman DM, Yamshchikov G, Ishigatsubo Y:
Contribution of CpG motifs to the immunogenicity of DNA
vaccines. J Immunol 158:3635-3639 (1997).
[18] Krieg AM: CpG motifs in bacterial DNA and their
immune effects. Annu Rev Immunol 20:709-760 (2002).
[19] Krieg AM, Yi AK, Schorr J, Davis HL: The role of CpG
dinucleotides in DNA vaccines. Trends Microbiol 6:23-27
(1998).
[20] Lee AH, Suh YS, Sung YC: DNA inoculations with HIV-1
recombinant genomes that express cytokine genes enhance
HIV-1 specific immune responses [In Process Citation].
Vaccine 17:473-479 (1999).
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
[21] Lee MK, Lynch ED, King MC: SeqHelp: a program to
analyze molecular sequences utilizing common computational
resources. Genome Res 1998 Mar;8(3):306-12.
[22] Lin YL, Huang X, Jiang T, Chao KM: MAVG: locating
non-overlapping maximum average segments in a given
sequence. Bioinformatics 2003 Jan;19(1):151-152.
[23] Liu MA, McClements W, Ulmer JB, Shiver J, Donnelly J:
Immunization of non-human primates with DNA vaccines.
Vaccine 15:909-912 (1997).
[24] Lodmell DL, Ray NB, Ulrich JT, Ewalt LC: DNA
vaccination of mice against rabies virus: effects of the route of
vaccination and the adjuvant monophosphoryl lipid A
(MPL(R)). Vaccine 18:1059-1066 (2000).
[25] MacGregor RR, Boyer JD, Ugen KE, Lacy KE, Gluckman
SJ, Bagarazzi ML, Chattergoon MA, Baine Y, Higgins TJ,
Ciccarelli RB, Coney LR, Ginsberg RS, Weiner DB: First
human trial of a DNA-based vaccine for treatment of human
immunodeficiency virus type 1 infection: safety and host
response. J Infect Dis 178:92-100 (1998).
[26] Manthorpe M, Cornefert-Jensen F, Hartikka J, Felgner J,
Rundell A, Margalith M, Dwarki V: Gene therapy by
intramuscular injection of plasmid DNA: studies on firefly
luciferase gene expression in mice. Hum Gene Ther 4:419-431
(1993).
[27] McInerney JO: GCUA: general codon usage analysis.
Bioinformatics 1998;14(4):372-3.
[28] Nagata T, Uchijima M, Yoshida A, Kawashima M, Koide
Y: Codon optimization effect on translational efficiency of
DNA vaccine in mammalian cells: analysis of plasmid DNA
encoding a CTL epitope derived from microorganisms.
Biochem Biophys Res Commun 261:445-451 (1999).
[29] Ozinsky A, Underhill DM, Fontenot JD, Hajjar AM,
Smith KD, Wilson CB, Schroeder L, Aderem A: The repertoire
for pattern recognition of pathogens by the innate immune
system is defined by cooperation between Toll-like receptors.
Proc Natl Acad Sci U S A 97:13766-13771 (2000).
[30] Ponger L, Mouchiroud D: CpGProD: identifying CpG
islands associated with transcription start sites in large genomic
mammalian sequences. Bioinformatics 2002 Apr;18(4):631-3.
[31] Rusnati,M. and M.Presta: HIV-1 Tat protein: a target for
the development of anti-AIDS therapies. Drugs Fut. Dev. 27,
481-493 (2002).
[32] Sasaki S, Fukushima J, Hamajima K, Ishii N, Tsuji T, Xin
KQ, Mohri H, Okuda K: Adjuvant effect of Ubenimex on a
DNA vaccine for HIV-1 [published erratum appears in Clin
Exp Immunol 1998 May;112(2):354]. Clin Exp Immunol
111:30-35 (1998a).
[33] Sasaki S, Sumino K, Hamajima K, Fukushima J, Ishii N,
Kawamoto S, Mohri H, Kensil CR, Okuda K: Induction of
systemic and mucosal immune responses to human
immunodeficiency virus type 1 by a DNA vaccine formulated
with QS-21 saponin adjuvant via intramuscular and intranasal
routes. J Virol 72:4931-4939 (1998b).
[34] Stratford R, Douce G, Zhang-Barber L, Fairweather N,
Eskola J, Dougan G: Influence of codon usage on the
immunogenicity of a DNA vaccine against tetanus. Vaccine
19:810-815 (2000).
[35] Tsuji T, Hamajima K, Ishii N, Aoki I, Fukushima J, Xin
KQ, Kawamoto S, Sasaki S, Matsunaga K, Ishigatsubo Y, Tani
K, Okubo T, Okuda K: Immunomodulatory effects of a plasmid
expressing B7-2 on human immunodeficiency virus-1-specific
cell-mediated immunity induced by a plasmid encoding the
viral antigen. Eur J Immunol 27:782-787 (1997).
[36] Uchijima M, Yoshida A, Nagata T, Koide Y: Optimization
of codon usage of plasmid DNA vaccine is required for the
effective MHC class I-restricted T cell responses against an
intracellular bacterium. J Immunol 161:5594-5599 (1998).
[37] Verschoor EJ, Mooij P, Oostermeijer H, van der Kolk M,
ten Haaft P, Verstrepen B, Sun Y, Morein B, kerblom L, Fuller
DH, Barnett SW, Heeney JL: Comparison of immunity
generated by nucleic acid-, MF59-, and ISCOM- formulated
human immunodeficiency virus type 1 vaccines in rhesus
macaques: evidence for viral clearance [In Process Citation]. J
Virol 73:3292-3300 (1999).
[38] Wang S, Liu X, Fisher K, Smith JG, Chen F, Tobery TW,
Ulmer JB, Evans RK, Caulfield MJ: Enhanced type I immune
response to a hepatitis B DNA vaccine by formulation with
calcium- or aluminum phosphate [In Process Citation]. Vaccine
18:1227-1235 (2000).
[39] Wang TT, Cheng WC, Lee BH: A simple program to
calculate codon bias index. Mol Biotechnol 1998
Oct;10(2):103-6.
[40] Wolff JA, Malone RW, Williams P, Chong W, Acsadi G,
Jani A, Felgner PL: Direct gene transfer into mouse muscle in
vivo. Science 247:1465-1468 (1990).
[41] zur MJ, Chen MC, Doe B, Schaefer M, Greer CE, Selby
M, Otten GR, Barnett SW: Increased expression and
immunogenicity of sequence-modified human
immunodeficiency virus type 1 gag gene. J Virol 74:2628-2635
(2000).
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
M 1
E 2
P 3
V 4
D 5
P 6
N 7
L 8
E 9
P 10
W 11
N 12
wt ATG GAG CCA GTA GAT CCT AAC CTA GAG CCC TGG CAT
co - - - - - - - - - - - - - - - - - - - - - - - G - - - - - - - - - - - -
H 13
P 14
G 15
S 16
Q 17
P 18
K 719
T 10
A 21
C 22
N 23
N 24
wt CAT CCA GGA AGT CAG CCT AAA ACT GCT TGC AAC AAC
co - - C - - C - - -C - - -C - - - - - C - - G - - C - - C - - - - - - - - -
C 25
Y 26
C 27
K 28
H 29
C 30
S 31
Y 32
H 33
C 34
L 35
V 36
wt TGT TAT TGT AAA CAC TGT AGC TAC CAT TGT CTA GTT
co - - C - - C - - C - - G - - - - - C - - - - - - - - C - - C - - G - - G
C 37
F 38
Q 39
T 40
K 41
G 42
L 43
G 44
I 45
S 46
Y 47
G 48
wt TGC TTT CAG ACA AAA GGC TTA GGC ATT TCC TAT GGC
co - - - - - C - - - - - C - - G - - - C-G - - - - - C AG - - - C - - -
R 49
K 50
K 51
R 52
R 53
Q 54
R 55
R 56
S 57
A 58
P 59
P 60
wt AGG AAG AAG CGG AGA CAG CGA CGA AGC GCT CCT CCA
co - - A - - - - - - A-A - - - - - - A - - A - - - - - - - - - - - - - -
S 61
S 62
E 63
D 64
H 65
Q 66
N 67
L 68
I 69
S 70
K 71
Q 72
wt AGT AGT GAG GAT CAT CAA AAT CTT ATA TCG AAG CAA
co - - C - - C - - - - - C - - C - - - - - - - - - - - - - - S - - - - - G
Figure 1. The wild type Tat amino acid sequence
A
B
C
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
A B C
Codon H 13
P 14
G 15
S 16
L 35
V 36
S 57
A 58
Wt CAT
(0.70)
CCA
(0.83)
GGA
(0.72)
AGT
(0.62)
CTA
(0.17)
GTT
(0.38)
AGC
(1.0)
GCT
(0.67)
Best CAC
(1.0)
CCC
(1.0)
GGC
(1.0)
AGC
(1.0)
CTG
(1.0)
GTG
(1.0)
AGC
(1.0)
GCC
(1.0)
CpG ----- ----- ----- ----- CTC
(0.48)
GTA
(0.24)
AGC
(1.0)
GCT
(0.67)
Figure 2. Codon optimization at three different places in the Tat amino acid sequence
Maximizing the desirable Motifs:
AT content: 50% Avg Fitness: 0.967748%
CAC TTG ATT GTT ACC CCA TCG GGC TGC GGC GAA CAA AAT ATG ATT GGC ATG ACC CCA ACC
GTT ATT GCA GTT CAC TAT CTC GAC GAA ACC GAA CAA TGG GAA AAA TTC GGC TTG GAA AAA
CGC CAA GGG GCG TTG GAA TTG ATT AAA AAA GGC TAT ACC CAA CAA TTG GCG TTC CGC CAA
CCA TCG TCG GCG TTC GCA GCG TTC
Minimizing the Undesirable Motifs:
AT content: 55% Avg Fitness: 0.96046%
CAC TTG ATT GTT ACC CCA TCA GGC TGC GGC GAA CAA AAT ATG ATT GGC ATG ACC CCA ACT
GTT ATT GCA GTT CAC TAT TTG GAC GAA ACT GAA CAA TGG GAA AAA TTT GGC TTG GAA AAA
CGC CAA GGC GCA TTG GAA TTG ATT AAA AAA GGC TAT ACC CAA CAA TTG GCA TTC CGT CAA
CCA TCG AGC GCA TTC GCA GCA TTC
Maximizing Desirable and Minimizing Undesirable Motifs:
AT content: 53% Avg Fitness: 0.94195%
CAC TTG ATT GTT ACC CCA TCA GGC TGC GGC GAA CAA AAT ATG ATT GGC ATG ACC CCA ACT
GTT ATT GCA GTT CAC TAT CTC GAC GAA ACT GAA CAA TGG GAA AAA TTT GGC TTG GAA AAA
CGC CAA GGC GCA TTG GAA TTG ATT AAA AAA GGC TAT ACC CAA CAA TTG GCG TTC CGT CAA
CCA TCG TCG GCG TTC GCA GCG TTC
Figure 3. Sample output of the program for a part of C3d sequence (desirable occurrences are shown in bold and
neutral occurrences are shown in italics)
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE
An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern.pdf
An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern
Mahmoud Moh'd Mhashi
Computers and Information Technology
Tabuk University
Email: mmhashi@ut.edu.sa
Abstract
The objective of string matching is to search and find all the occurrences of the text pattern Pat with size m
characters in the text Text with size n characters, where n >> m. The string matching algorithm is so
fundamental that most computer programs use it, and it continues to receive great attention due to various
applications in text manipulation and other applications, especially in biological sequence analysis and
comparison. The main goal of this research is to develop a new enhanced skipping algorithm with intelligent
issues that outperforms the algorithms introduced in the literature. The string matching algorithms consist of
two main operations: checking and skipping. Traditionally, authors try to reduce the number of comparisons
through the checking operation. The new algorithm in this paper decreases the number of comparisons by
decreasing the number of checking operations through an increase in the shift distance. This enhances the
searching response time. The proposed algorithm is an efficient algorithm that can be used to search for exact
occurrences of long patterns in DNA patterns.
1. Introduction
Exact string searching is one of the most important problems that has been investigated; studies range
from finding the shortest string in DNA sequencing [1-4] to searching for occurrences of a pattern. In general,
string searching algorithms deal with searching the occurrences of a string (the pattern) of size m in a string (the
text) of size n (where n � m) [5-7]. In the literature, many exact string searching and pattern matching
algorithms were introduced and their performances were investigated against classical exact string searching
algorithms such as the Naïve (brute force) algorithm and Boyer-Moore-Horsepool(BMH) algorithms. Some of
these algorithms preprocess both the text and the pattern [8] while others preprocess the pattern [9-12].
The exact string searching problem consists of two major steps: checking and skipping, the checking step itself
consists of two phases:
1) A search along the text for a reasonable candidate string
2) A detailed comparison of the candidate against the pattern to verify the potential match.
Some characters of the candidate string must be selected carefully in order to avoid the problem of
repeated examination of each character of text when patterns are partially matched; the fewer the number of
character comparisons in the checking step the better the algorithm is. There are different algorithms that can be
used to check if the characters in the text match with the corresponding characters in the pattern [13-21]. After
the checking step, the skipping step shifts the pattern to the right to determine the next position in the text where
the substring text can possibly match with the pattern.
IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 13
1857-7202/1202002
This research will focus on searching for exact occurrences of long patterns in the DNA alphabet. Searching
for DNA sequence is studied by many authors in the literature [22-26]. Molecular biologists often search for the
important information from the databases in different directions of different uses [27]. A DNA sequence can be
abstracted as a string over an alphabet of 4 basic characters, (A, C, G, and T). DNA sequences can be very long.
For example, the total size of a human genome (DNA) is around 3Gbp. Researchers tend to search patterns in
DNA data frequently [28]. So, an efficient searching tool which is enough to allow users to locate patterns in
large DNA data is desired. Developing a fast and efficient algorithm to find all occurrences of DNA patterns is
important to the different applications using exact matching. The main goal is to develop a new enhanced
checking and skipping algorithm where intelligent issues outperform the algorithms that are introduced in the
literature.
2. The Naïve or (Brute Force) Algorithm
Let us assume that the target sequence is an array Text[n] of n characters and the pattern sequence is the array
Pat[m] of m characters, then a Naïve approach to the problem would be:
������� � ��� ����������������� ��� ��� ����� ��������� ��� ��� ���
���
������������������ ����������� ��������� ��� ��� � ����� ��� ��� �����!!���
�����������
��������������������������������
����������������" �# ��� ��$� �����!������%�������$�����%���
�������������������
��������������������������������������� ��� � &���
�������������������������
�����������������������'�����()��*��'� �� ����(����� ��������(����(����� �����!���������
���������������������+� �,���
����������������������-��
���������������������� #. �������!!���
����������������-��
������������-��
������ �'�����
���-��
In the outer loop, Text is searched for occurrences of the first character in Pat. In the inner loop, a
detailed comparison of the candidate string is made against Pat to verify the potential match. The algorithm has
a worst case time of O(nm), where O(nm) is the number of comparisons performed by the algorithm to find all
the occurrences of Pat with size m characters in the Text with size n characters. The worst case in this algorithm
14 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern
occurs when we get a match on each of the n Text characters and at each position we may need to perform m
comparisons.
3. Multiple Reference Character Algorithms (MRCA4m)
A string matching algorithm is a succession of checking and skipping, where the aim of a good
algorithm is to minimize the work done during each checking and to maximize the length distance during the
skipping. An algorithm can skip to the next position in the text without missing any pattern occurrence. Most of
the string matching algorithms preprocess the pattern before the search phase to help the algorithm to maximize
the length of the skips. The preprocessing phase in this new MRCA4m algorithm helps in increasing the
performance of maximizing the length of the skips.
In order to test the role of skipping distance on the number of checking operations, the MRCA
algorithm in [29] is enhanced. MRCA algorithm has five reference characters, including three static and two
dynamic references. Two new versions of MRCA algorithms are developed. The first version of MRCA (let us
call it MRCA2m) has three reference characters, including two static characters and one dynamic character.
The second version of MRCA (let us call it MRCA4m) has seven reference characters, including four static
reference characters and three dynamic characters. So, the number of reference characters in MRCA2m,
MRCA, and MRCA4m is three, five, and seven reference characters, respectively. The shift distance ranges
from 2m+1 characters in MRCA2m, 3m+1 characters in MRCA, and 4m+1 characters in MRCA4m. Of course,
increasing the shift distance in the skip operation decreases the number of checking operations. This is, in turn,
reduces the number of comparisons required to find all the occurrences of patterns in the text and decrease the
elapsed time. A case study will be conducted to test the above issues and will be explained in the next section.
A brief description of MRCA follows.
As mentioned earlier, CSA has five reference characters. The Text pointer TextIx always points to the
character, which is next to the character corresponding to the last character in Pat and the reference character ref
always points to the character that corresponds to the last character in Pat (i.e. ref = TextIx – 1). Now let ref1 =
TextIx, then the reference character ref2 can be calculated from ref or ref1, where ref2 can be found as “ref2 =
TextIx + m - 1” or “ref2 = TextIx + m” depending on the existence of ref or ref1 in Pat, where ref2 = TextIx + m
- 1 during the checking step if the character at ref doesn’t exist in Pat, or ref2 = TextIx + m after the checking
step if the character at ref1 doesn’t exist in Pat. The above formulas may be illustrated in the following
example:
Text: K K C D E F G K
Pat: E F G
In the above, Text and Pat assume that the pointers TextIx and ref1 point to the character ‘D’, the
pointer ref points to the character ‘C’, and m = 3, in this case since the character ‘C’ at ref doesn’t exist in Pat,
then ref2 = TextIx + m - 1 = 3 + 3 - 1 = 5. So, start counting from zero, ref2 points to the character ‘F’ in Text.
The occurrence of character ‘D’ at ref1 must be examined whether Pat occurs in Text or not, and since ‘D’
doesn’t occur in Pat then ref2 = TextIx + m = 3 + 3 = 6, hence, ref2 points to the character ‘G’ in Text.
IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 15
1857-7202/1202002
In addition to that, MRCA pre-processes the pattern to produce two different arrays, namely skip and
pos, each array has a length equals to the alphabet size. The skip array is used when the reference character ref1
exists in Pat, it expresses how much the pattern is to be shifted forward after the checking step. While the pos
array defines where each one of the different reference characters ref1, ref2, ref_ref1, or ref_ref2 is located in
Pat, if any one of them exists in Pat, where the two dynamic pointers ref_ref1 and ref_ref2 can be calculated
from the two static pointers ref1 and ref2 respectively. In particular, the dynamic pointer ref_ref1 which is
calculated at the skipping step if ref1 occurs in Pat can be found as ref_ref1 = ref1 + m - pt, where m is the Pat
length and pt is the location of ref1 in Pat (i.e., pt = pos[Text[ref1]]), and the dynamic pointer ref_ref2 which is
calculated and used only during checking step if ref doesn’t occur in Pat or after the checking step if ref1
doesn’t occur in Pat can be found as ref_ref2 = ref2 + m - pt1, where pt1 determines where ref2 is located in
Pat. The above formulas can be illustrated in the following examples:
Example 2: Calculating ref_ref1
0 1 2 3 4 5 6 7
Text: K K G E K F G H
Pat: E F G
Assuming the checking step is performed on the above Text and Pat, then the reference character will be
character ‘E’ at location 3 in Text, and since ‘E’ at ref1 occurs at location 1 in Pat, then pt = pos[Text[ref1]] =
pos[‘E’] = 1 (note that for the case of counting position the counting starts from 1 not zero), and ref_ref1 may be
found as ref_ref1 = ref1 + m - pt = 3 + 3 - 1 = 5 (i.e. ref_ref1 points to the character ‘F’ in Text at location 5).
Example 3: Calculating ref_ref2 when the character at ref doesn’t occur in Pat
0 1 2 3 4 5 6 7 8 9 10
Text: A B C D E F G H F E G
Pat: F E G
Since ref doesn’t occur in Pat, then ref2=5 (calculated above in Example 1). Consequently, ref2 + m -
pt1, where m = 3 and pt1 = pos[Text[ref2]]) = pos[Text[‘F’]]) = 1. Hence, ref_ref2 may be found as ref_ref2 = 5
+ 3 - 1 = 7. In such a case, the alignment will be with ‘H’ at location 7 in Text, however, the letter ‘H” doesn’t
occur in Pat, so, TextIx will move forward 7 locations to point to the character ‘G’ at position 10.
Example 4: Calculating ref_ref2 after the checking step if ref1 doesn’t exist in Pat
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Text: C D E F G H I J K L E D C M
Pat: E D C
Let Text and Pat occur as shown above. Since the character at ref occurs in Pat and there is a mismatch then
ref1 will be considered as a basis for calculating ref2, and since ref1 doesn’t occur in Pat then ref2 = TextIx + m
= 6, and pt1 = pos[Text[ref2]]) = pos[Text[‘I’]]) = 0. Hence, ref_ref2 may be found as ref_ref2 = ref2 + m - pt1 =
6 + 3 - 0 = 9. In addition, the TextIx pointer will be shifted forward 10 positions to align with the letter
16 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern
Text[ref_ref2] = Text[‘L’] = 3m+1 positions, which is the maximum shift distance that this algorithm can skip
with only two–character-access. As a result, the pointer TextIx will point to the letter ‘M’ at position Text[13],
and the result will be as follows:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Text: C D E F G H I J K L E D C M
Pat: E D C
Based on that, The MRCA algorithm that reflects the above issues is as follows:
&��������� ���� ..����� ����������������� ��� �������/�.�������.,�/���
������
0���� �������
1���2��3�##���+# .�"�� �������#���#' .��2��
4�����������5������5�67�89��5!!�����
:�������/�.$5%����.,�/$5%���0����� ��� ���
������-��
;���2��<�=/'� �. ������.���� �����/�.���������� ����� �.��������2��
>��������5����5����� ��� ��5!!�����
?��������������$5%��/�.$�%��5�!&��.,�/$�%���0������� ��� � �5� &���
������-��
����-��
@���������9<76�� ����������������� ��� ��� ����� ��������� ��� ��� �������/�.�������.,�/���
���������
&��������� �������������#�.�A=�.=��� ��B���
&&�������/���/�&��� ���� �&��� �A� �&��� �0��� �A� �0���
&0������������$67�89%�����-���
&1���2��C/��� ���������+# ��������������� ����.��� ����� ����������2��
&4��������$���$�%%���&��#�.�A=�.=��� ������ ����������� ��� ���
&:���22�7�����7 ��� �����/ �������
&;���" �# �� ������� ��� ��� !&���
����������22�< �,����.� /D�< �,����.��� ����'�� �� ����� �� ����� ������ �/� ���'.�=�.=��� E��
&>����������� ��$� ����� ����� ��� �!�#�.�A=�.=��� %�������$#�.�A=�.=��� %���
IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 17
1857-7202/1202002
�������������
&?������22�< �,���"�� �� ����� ������ ���� ������� ./���.�� ����.��� ���� �����������
&@��������������$� ��$� ����� ����� ��� %%���
0�����������22�< �,�� ����'�� �� ������������ ������=���� �����# ��� ��#'��������.��� ����� ���
0&�����������������B������������������ ��� � �&�������������� ����
00���������������� ��$� ����� �!!B%�F�����$�����%���
�������������������
01����������������#�.�A=�.=��� �����������
04���������������������� �����
������������������-��
0:���������������'���()�6�����'�� �� ����#��������(��� ���� ���� ��� ���(����(��� �����&�� ��#���
���������������-��
�����������-�
0;�����22�7������ �.,�//����/�����
0>������� ��D��
0?�������� �&���� ���������������/����/�.$� ��$� �&%%��
0@�����������F/����
�������������
1�������������� �0���� �&!����� ��� ��/�&���/�.$� ��$� �0%%���
1&�� �������F/�&����
10���������������������� �A� �0���� �0�!����� ��� � �/�&���
10���������������������� �����!��1������� ��� � �/�&��!�&� �/�.$� ��$� �A� �0%%���
������������������-��
11������������� #. ��
���������������������
14������������������ �����!��0������� ��� � �/�&�!�&��
� ������-�
� -�
�1:������ #. �
� ��
��1;������������� �A� �&���� �&�!����� ��� � �/���
��1>������������ �����!��.,�/$� ��$� �A� �&%%� �/��!�&��
18 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern
�������������-�
���1?������� �'�����
���������-�
The first new version of MRCA is MRCA2m. As mentioned earlier, MRCA2m has skipping distance 2m+1.
The algorithm MRCA2m is as the same as the algorithm MRCA except the segment of code starting at line 26.
The new version from line 26 follows:
0;����������22�7������ �.,�//����/����
0>�������������� ��D�
0?������������� �&���� �������
0@������������������F/�.$� ��$� �&%%���
������������������
1������������������ �0���� ��!����� ��� ���
1&����������������/�&���/�.$� ��$� �0%%���
10���������������� �����!��0������� ��� � �/�&��
������������������-�
11��������������� #. �
���������������������
14�������������������� �A� �&���� �&�!����� ��� � �/���
1:������������������� �����!��.,�/$� ��$� �A� �&%%� �/��!�&��
�������������������-��
��������������-����
���1;������ �'����
����������-�
The second new version of MRCA is MRCA4m. As mentioned earlier, MRCA4m has skipping
distance ranges from one character to 4m+1 characters. The segment of code of MRCA4m starting from line 26
in MRCA follows:
���0;������22�7������ �.,�//����/����
���0>������ ��D�
���0?������� �&���� ������
���0@������/����/�.$� ��$� �&%%���
���1������������F/���
��������������
IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 19
1857-7202/1202002
���1&�������������� �0���� �&�!����� ��� ��
���10�������������/�&���/�.$� ��$� �0%%��
���11�����������������F/�&����
���14������������������� �1���� �0�!����� ��� ��
���1:������������������/�0���/�.$� ��$� �1%%��
���1;�������������������� �A� �1���� �1�!����� ��� � �/�0��
���1>������������������� �����!��4������� ��� �!�&� �/�0� �/�.$� ��$� �A� �1%%��
������������������������-
���1?����������������� #. ��
��������������������������
���1@����������������������� �����!��0������� ��� � �/�&�!�&��22� /�.$� ��$� �A� �0%%�
���������������������������-
�����������������-
���4���������� #. �
������������������
���4&�������������� �A� �&���� �&�!����� ��� � �/���
���40�������������� �����!��.,�/$� ��$� �A� �&%%� �/��!�&��
������������������-�
��������������-���������
����41������ �'����
�����������-�
4. Illustration and Discussion
The three algorithms MRCA2m, ECSA, and MRCA4m were implemented and compared on DNA sequence
of 4 basic characters, (A, C, G, T), with a size of more than 1.8 mega characters. The algorithms executed using
Intel(R) Pentium(R) 4 PC with CPU speed 2.40GHz, 246MB RAM, and Windows XP professional operating
system, and a program was designed in C++ to select randomly 6 groups. Each group of patterns consists of
3000 patterns. The number of characters in patterns ranges from 6 to 50 characters. The average number of
occurrences ranges from 0 to 2346. The cost of the searching process to find all the occurrences of the different
patterns in each group in Text is measured by finding 1) the search clock time, where the total clock time
includes the preprocessing clock time of patterns, and 2) the average number of checking operations (i.e., the
first main operation in string searching algorithms) required to find each group of patterns. The second
20 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern
measurement is important to give us an indication about the percentage of reduction in the number of
comparisons. Table (1) shows the clock time required to find each group of patterns.
Table 1: The clock time (seconds) required by MRCA2m, MRCA, and MRCA4m to find all the occurrences of
each group of patterns
Grou
p #
3000
Patterns
with
Length
In
character
Number of
Occurr-
ences
Clock time in seconds
Improvements of latest
version MRCA4m
vs.
MRCA2 ECSA Naive
MRCA2m
2m+1
ECSA
3m+1
MRCA4m
4m+1
1 6 0 22.1 7.75 6.61 5.39 15% 31%
2 6 0 22.4 8.44 7.25 6.19 14% 27%
3 6 2346 22.5 9.94 8.72 7.84 12% 21%
4 25 0 22.3 2.78 2.13 1.44 24% 48%
5 25 0 22.1 3.06 2.88 2.56 6% 16%
6 25 928 24.3 15.33 13.47 13.19 12% 14%
7 50 0 22.3 1.52 1.03 0.81 32% 47%
8 50 0 22.4 2.31 1.89 1.70 18% 26%
9 50 127 31.8 13.52 12.50 12.20 8% 10%
One can notice from Table (1) that the number of occurrences is zero. This is because the performance
of these versions of the algorithm ECSA needs to be tested. The characters in patterns in groups 1, 4, and 7 are
completely different from the characters in Text. About one third of the characters in the patterns in groups 2,
5, and 8 are exists in Text. Finally, the characters in groups 3, 6, and 9 are 100% exist in Text. Each group
consists of 3000 patterns. The time required to find all the occurrences of each group ranges from 1.515
seconds (group 7) to 15.38 seconds (group 6), 1.031 seconds to 13.468 seconds, and 0.812 seconds to 13.187
seconds by the algorithms MRCA2m, ECSA, and MRCA4m, respectively. Furthermore, one can see from
Table (1) that the algorithm MRCA4m (4m+1) reduces the clock time required by MRCA2m and ECSA
algorithms by 6.11% to 31.95% and 9.71 (group 9) to 48.33% (group 4). In order to show the significant of the
time required by these versions of MRCA, the naive algorithm is included in Table (1). The time required to
find all the occurrences of each group of patterns by the naïve algorithm ranges from 22.1 seconds to 31.8
seconds. Figure (1) show the time required to find each group of patterns by the different algorithms.
IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 21
1857-7202/1202002
0
5
10
15
20
1 2 3 4 5 6 7 8 9
ECSA_1 2m +1 ECSA 3m +1 ECSA_2 4m +1
Fig.1: The clock time required by MRCA2m, ECSA, and MRCA4m algorithms to find all the occurrences of patterns in each group.
Table 2: The number of checking operation required by MRCA2m, ECSA, and MRCA4m to find all the occurrences of each group of patterns
Grou
p #
3000
Patterns
with
Length
In
character
Number of
Occurr-
ences
Number of Checking operations
Improvements of latest
version MRCA4m
vs.
MRCA2m ECSA
MRCA2m
2m+1
ECSA
3m+1
MRCA4m
4m+1
1 6 0 138708 86283 65577 53% 24%
2 6 0 152076 99498 78764 48% 21 %
3 6 6517 163794 118503 100550 39% 15%
4 25 0 32028 20755 15493 52% 25%
5 25 0 40160 34588 30629 24% 12%
6 25 4354 224542 210939 207656 8% 2%
7 50 0 16367 10673 7936 52% 26%
8 50 0 23683 22343 21938 7% 2%
9 50 3126 214086 201129 193386 10% 4%
Table (2) presents the three algorithms with the number of checking operations required to find all the
occurrences of Pat in Text for all the different groups. The information of number of groups, pattern length, and
22 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern
the number of occurrences in this table are the same as in Table (1). The number of checking operations ranges
from 16367 (group 7) to 224542 (group 6), 10673 to 210939, and 7936 to 207656 by the algorithms MRCA2m,
ECSA, and MRCA4m, respectively. Furthermore, from Table (2) the algorithm MRCA4m (4m+1) reduces the
percentage of number of checking operations required by MRCA2m and ECSA algorithms by the ranges from
7% (group 8) to 53% (group 1) and from 2% (group 8) to 26% (group 7).
From the analysis of the discussed algorithms, it is clear that the new enhanced algorithm MRCA4m
outperforms the other algorithms, especially when the Pattern does not occur in Text (53%). This is due to the
long shift distance (4m+1). This decreases the number of checking operation, and in turn reduces the number of
basic comparison operations. However, when the characters in Pat exist in Text, the performance of the new
algorithm is not high as that in the previous case, especially when the pattern is long a little bit (group 6 and
group 9). Most of the words, sentences, and patterns have, prefix, suffix, infix, or may be a combination of two
of them. In such a case a lot of not necessary comparisons performed till the algorithm decides that there is no
mismatch. This decreases the performance of the new algorithm.
6. Conclusions
A new exact string searching algorithm MRCA4m was developed, and its performance was compared with
two algorithms, namely, MRCA2m, and MRCA. The new algorithm increases the performance of the skipping
phases needed for the string searching process. In the skipping phase MRCA4m focuses on increasing the shift
distance. The search clock time and counting the number of checking phase operation criteria were used in a
case study to compare the performance of MRCA4m against MRCA2m and MRCA. The results showed that:
1) Using MRCA4m improved the clock time required by MRCA2m and MRCA by 6% to 32% and 10% to 48%
respectively.
2) Using MRCA4m improved the number of checking operation required by MRCA2m and MRCA by 7% to
53% and 2% to 26% respectively.
3) Decreasing the pattern length decreases the system performance.
4) The performance of the new algorithm is positively distinguished in case that the percentage of existence of
the characters of patterns in Text is low. In such a case, if there is some applications required that situation
(i.e., rare existence characters of Pat in Text), it is possible to increase the shift distance more and more, and
in turn, increasing the performance.
5) Increasing the shift distance has a major effect on reducing the number of checking operations, and in turn
decreases the number of comparisons and increasing the system performance.
References
1] Ukkonen E., "A linear–time algorithm for finding approximate shortest common superstring", Algorithmica,
1990; 5:313–323.
2] Stephen G., "String Searching Algorithms", World Scientific, Singapore, 1994.
3] Apostolico, "A, Galil Z. Pattern Matching Algorithms", Oxford University Press, 1997.
IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 23
1857-7202/1202002
4] C. Gibas. And P. Jambeck., Developing Bioinformatics Computer Skills. O’Reilly, Sebastopol First
Edition April 2001 pp. 1, 446.
5] Gusfield D., "Algorithms on strings, trees and sequences", Cambridge University Press, Cambridge, 1997.
6] Fenwick P., "Fast string matching for multiple searches", Software–Practice and Experience, 2001,
31(9):815–833.
7] Liu Z, Du X,and Ishii N., "An improved adaptive string searching algorithm", Software Practice and
Experience, 1988, 28(2):191–198.
8] Sunday D., "A very fast substring search algorithm", Communications of the ACM, 1990, 33(8):132–142.
9] Raita T., "Tuning the Boyer-Moore-Horspool String Searching Algorithm", Software Practice and
Experience, 1992, 22 (10):879-844.
10] Bruce W., Watson, E., "A Boyer-Moore-style Algorithm for Regular Expression Pattern Matching", Science
of Computer Programming, 2003, 48: 99-117.
11] Ager M. S., Danvy O., and Rohde H. K., "Fast partial evaluation of pattern matching in strings",
ACM/SIGPLAN Workshop Partial Evaluation and Semantic-Based Program Manipulation, San Diego,
California, USA, pp. 3 – 9, 2003.W.-K. Chen, Linear Networks and Systems (Book style)., Belmont, CA:
Wadsworth, 1993, pp. 123–135.
12] Fredriksson and Grabowski S., “Practical and Optimal String Matching”, Proceedings of SPIRE'2005,
Lecture Notes in Computer Science 3772, 2005, pp. 374-385, Springer Verlag.
13] Hernandez, and Rosenblueth D., “Disjunctive partial deduction of a right-to-left string-matching algorithm”,
Information Processing Letters, 2003, 87: 235–241.
14] Apostolico A., and R.Giancarlo R., “The Boyer-Moore-Galil string searching strategies revisited”, SIAM J.
Comput., 1986, 15(1): 98-105.
15] Crochemore, “Transducers and repetitions”, Theoret. Comput. Sci., 1986; 45: 63-86.
16] Crochemore M., and Perrin D., “Two-way string-matching”, J. ACM, 1991, 38: 651-675.
17] Galil Z., and Giancarlo R., “On the exact complexity of string matching: upper bounds”, SIAM J. Comput.,
1992, 21: 407-437.
18] Smith P. D., “Experiments with a very fast substring search algorithm”, SP&E, 1991, 21(10): 1065-1074.
19] Smith P., "On Tuning the Boyer-Moore-Horspool String Searching Algorithm", Short Communication,
Software Practice and Experience, 1994, 24(4):435-436.
20] Mhashi M., "A Fast String Matching Algorithm using Double-Length Skip Distances", Dirasat Journal,
University of Jordan, Jordan, 2003, 30(1):84-92.
21] Mhashi, M., "The Performance of the Character-Access On the Checking Phase in String Searching
Algorithms", Transactions on Enformatica, Systems Sciences and Engineering, 2005, 9: pp. 38 –43.
24 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern
22] W. Kim, S. Park, J. Won. S. Kim, and J. Yoon., "An efficient DNA sequence searching method using
position specific weighting scheme", Journal of Information Science, 2006, 32(2).
23] A. F. Klaib and H. Osborne. “Searching Protein Sequence Database Using BRBMH Matching Algorithm,”
International Journal of Computer Science and Network Security (IJCSNS), 8(12), pp. 410-414, 2008.
24] Y. Huang, X. Pan, Y. Gao, and G. Cai, "A Fast Pattern Matching Algorithm for Biological Sequences,"
IEEE, pp. 608 – 611, 2008.
25]R. Thathoo, A. Virmani, S. S. Lakshmi, N. Balakrishnan and K. Sekar, “TVSBS: A fast exact pattern
matching algorithm for biological sequences,” Current Science, 2006, 91(1): 47–53.
26] A. L. Delcher, A. Phillippy, J. Carlton, and S. L. Salzberg, "Fast algorithms for large-scale genome
alignment and comparison," Nucleic Acids Res, 29002, 30, pp. 2478-2483.
27] Raju Bhukya., and DVLN Somayajulu., " Exact Multiple Pattern Matching Algorithm using DNA Sequence
and Pattern Pair", International Journal of Computer Applications 17(8):32-38.
28] L. Cheng, D. Cheung, and S. Yiu., Approximate String Matching in DNA Sequences, Proceedings of the
Eighth International Conference on Database Systems for Advanced Applications, 2003
29] Mhashi M., "The Effectof Multiple Reference Characters on Detecting Matches in String Searching
Algorithms", Software Practice and Experience, 2005, 35(13): 1299 - 1315.
�
�
�
�
�
�
�
�
�
IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 25
1857-7202/1202002
Analyzing large-scale DNA Sequences on multi-core Architecture.pdf
Analyzing large-scale DNA Sequences on Multi-core Architectures
(CSE-2015, ©IEEE)
Suejb Memeti and Sabri Pllana Department of Computer Science, Linnaeus University
351 95 Växjö, Sweden {suejb.memeti, sabri.pllana}@lnu.se
Abstract—Rapid analysis of DNA sequences is important in preventing the evolution of different viruses and bacteria during an early phase, early diagnosis of genetic predispositions to certain diseases (cancer, cardiovascular diseases), and in DNA forensics. However, real-world DNA sequences may comprise several Gigabytes and the process of DNA analysis demands adequate computational resources to be completed within a reasonable time. In this paper we present a scalable approach for parallel DNA analysis that is based on Finite Automata, and which is suitable for analyzing very large DNA segments. We evaluate our approach for real-world DNA segments of mouse (2.7GB), cat (2.4GB), dog (2.4GB), chicken (1GB), human (3.2GB) and turkey (0.2GB). Experimental results on a dual-socket shared-memory system with 24 physical cores show speedups of up to 17.6×. Our approach is up to 3× faster than a pattern- based parallel approach that uses the RE2 library.
Index Terms—parallel DNA analysis, multi-core architectures, finite automata
I. INTRODUCTION The need for high performance computational biology has
emerged as a result of fast growth in biological information, the complexity of interactions that underlie many processes in biology, as well as the diversity and the interconnectedness of organisms at the molecular level [5]. These biological information are accumulated via different techniques, however they require adequate analysis and processing to extract useful information that make the results evident.
According to Benson et al. [9] the number of Deoxyribonu- cleic Acid (DNA) sequences and nucleotide bases in these sequences is growing exponentially, doubling every 18 months. As these data are collected, motif search and DNA sequencing are just some examples among many for analytics of Next Gen Sequencing Analysis.
A DNA sequence contains specific genetic instructions, which make the living organisms function properly. In a DNA strand there are four bases of nucleotides: A-adenine, C- cytosine, G-guanine and T-thymine. DNA analysis is important for discovery of differences and similarities of organisms and exploration of the evolutionary relationship between them. This process often requires comparisons of the corresponding DNA sequences, for example, checking whether one sequence is a subsequence of another, or comparing the occurrences of specific k-mers in the corresponding DNA sequences. In computational biology k-mers refer to all the possible sub- strings (sub-sequences) of length k of a DNA sequence. They
have an important role during sequence assembly and can be used in sequence alignment as well.
Analyzing DNA sequences within a reasonable time is important for domain scientists to study various phenomenons, such as the evolution of viruses and bacteria during an early phase [20], or diagnosis of genetic predispositions to certain diseases.
Modern parallel computing systems promise to provide the capabilities to cope with the DNA analysis processing requirements. Existing approaches use both hardware and software to accelerate regular expression matching. The hard- ware based approaches (such as [7], [27]) are faster, but less flexible and more expensive, whereas software based acceleration techniques are flexible in terms of updating or adding new patterns [30]. Recently different software based DNA analysis techniques designed for multi-core systems have been proposed [6], [11], [14], [16], [19], [23].
In this paper, we will first explore and discuss the paral- lelization opportunities of DNA analysis, and thereafter we introduce a parallel algorithm for DNA analysis that is based on Finite Automata. We use a domain decomposition approach for parallelization; in our approach the DNA sequence is split into several chunks, and each chunk is assigned to a thread to perform pattern matching. Our algorithm is optimized to do efficient speculations of the possible initial states for each chunk. Only one regular expression matching (REM) for a chunk is required to be completely performed; the remaining REMs stop when the converging point is reached. A converging point is a state where two or more REM starting from different states meet after the same number of symbols is read. Furthermore, we use a memory efficient data structure that saves the necessary information to count and highlights the k-mers. Experiments with real-world DNA segments (for human and various animals) on a dual socket shared-memory system with 48 threads show significant speedups compared to the sequential version (up to 17.6x). The implementation of our algorithm is up to 3x faster than a pattern-based algorithm implemented using the RE2 library [3]. Major contributions of this paper include: • a parallel algorithm for DNA analysis that is based on
Finite Automata; • empirical evaluation of our algorithm with real-world
DNA segments of mouse (2.7GB), cat (2.4GB), dog
ar X
iv :1
50 9.
01 50
6v 1
[ cs
.D C
] 4
S ep
2 01
5
(2.4GB), chicken (1GB), human (3.2GB) and turkey (0.2GB);
• a comparison of our algorithm with a pattern-based algorithm implementation that uses RE2 library.
The rest of the paper is organized as follows. Section II provides background information on pattern matching, whereas Section III presents our algorithm for counting and extracting k-mers in a DNA sequence. Section IV presents the experimental setup and discusses the experimental results. The work described in this paper is compared and contrasted to the related work in Section V. Section VI provides a summary of our work.
II. REGULAR EXPRESSION MATCHING (REM) WITH FINITE AUTOMATA (FA)
Regular expression matching verifies whether a pattern is present in a string. REM is commonly used for determining the locations of a pattern within a sequence of tokens, in search and replace functions, or to highlight important information out of a huge data set. In the context of computational biology, pattern matching is used for analyzing and processing biological information in order to extract the useful parts of the data and make them evident. The formal definition of the REM is as follows: the input text is an array T [1..n] where n is the length of the input, and pattern P [1..m] where the length of the pattern m ≤ n. The alphabet
∑ defines the possible
characters of the input string. A Finite Automaton (FA) is a machine for processing
information by scanning the input text T in order to find the occurrences of the pattern P . A formal definition of the FA is as follows: FA is a quintuple of (Q,
∑ ,δ,q0,F ), where Q
is the finite set of states, ∑
is the finite alphabet, δ is the transition function Q×
∑ −→ Q, q0 is the start state and F
is the distinguished set of final states. A well known algorithm for multiple pattern matching is the
Aho-Corasick algorithm. It is able to match any occurrences (including the overlapped ones) of multiple patterns linearly to the size of the input string. It examines each character of the input string only once. It builds an automaton by creating states and transitions corresponding to these states. It adds failure transitions when there is no regular transition leaving from the current state on a particular character, which makes it possible to match multiple and overlapping occurrences of the patterns. Furthermore, this algorithm is capable of delivering input-independent performance if implemented efficiently in parallel systems, which is a reason why we use this algorithm as basis of our work.
III. DESIGN AND IMPLEMENTATION OF A DNA ANALYSIS ALGORITHM
In this section we first provide the details about the outline of our algorithm. Thereafter we discuss the most important implementation aspects to achieve a scalable algorithm for counting and extracting specific k-mers from a large DNA sequence.
A. Our algorithm for counting and extracting the location of k-mers (k-mers CoEx)
Figure 1 depicts two possible ways for parallel execution of regular expression matching for bio-computing applications: (a) input-based approach that splits the input string into smaller chunks and processes them in separate threads and (b) pattern-based approach that splits the patterns in sub- patterns, creating separate state machines for each of them and processing the same input string with each different machine [16].
(a) Input-based approach
(b) Pattern-based approach
Fig. 1. Load balancing using Input and Pattern partitioning approach.
Our algorithm uses the input-based approach. The challenge of this approach is determining the initial state for each chunk. Finding the correct starting state for each chunk is important for finding the occurrences of the patterns that appear in the crossing border. Other researchers use different ways of finding the initial states, for instance Luchaup et al. [18] use speculation to find the initial state based on the most visited states, Devi and Rajagopalan [12] use an index based technique, Chacon et al. [11] use Suffix-Arrays, Villa et al. [31], uses the pattern length overlapping approach.
Fig. 2. K-mers CoEx finite state automaton for matching the patterns: ”acg”, ”cat”, ”cta” and ”tta”
Our way of determining the possible initial states is as follows: (1) find the set of source states (L) for the first element of the sub-input mapped to the running thread (Tn); (2) find the set of destination states (S) for the last character of the sub-input mapped to the previous thread (Tn−1); (3) find the intersection of S and L (S ∩ L), which is the set of possible initial states [21]. The first thread (T0) always starts from the initial state q0. Each thread is responsible for finding the set of possible initial states, and for each state of this set a regular expression matching is performed. When all threads have finished their job, the results are joined by a binary reduction, which connects the last visited state of Tn to the first visited state of Tn+1.
This method provides very good results for sparse transition tables and good performance for dense matrices for DFAs with relatively small number of states. However, in a DFA with large number of states this method seems to be less efficient. This happens because one thread may be responsible to perform multiple REM for the same input, due to multiple possible initial states. To reduce the operations required for each thread to perform the REM starting from different states, further optimizations are needed.
While investigating the REM using the modified Aho- Corasick DFA representation, we noticed that the result con- verges after several symbols are read (in our experiments, 10 is the max number of steps required to find the converging point). A converging point is a state where two or more REM starting from different states meet after the same number of symbols are examined. This insight allows us to significantly minimize the execution cost required to perform the REM starting from each possible initial state. The details about the process of the convergence are given in the next Section.
B. Implementation Aspects
In this section we will explain the implementation details of our algorithm, including the process of building the DFA,
splitting the input among the available threads, finding the set of possible initial states for each chunk assigned to a thread, running the REM for the first state in this set and the process of finding the converging point. A reference of each process to the corresponding lines of codes in Algorithm 1 will be provided.
Furthermore, we define qi as the state of the automaton where i is the state id, Ei as a sub-pattern of the selected patterns, Ri as the process of performing an REM starting from the item at index i of the set of possible initial states. The values used in the switch-case (Line 25 - 30) (ex. 122, 127, 128...) determine that a specific sub-pattern (Ei) has been matched.
1) Building the Deterministic Finite Automaton: The AC algorithm with failure transitions has a drawback due to the non-deterministic transitions for a single input character. Fig- ure 2 illustrates our solution to eliminate the failure transitions by adding the right transition (indicated by dashed lines) for each state. Having a valid transition for each possible character to another state in the automaton, guarantees that for each symbol the same amount of operations will be performed. The example automaton shown on Figure 2 is able to match the following patterns: ”acg”, ”cat”, ”cta” and ”tta”. For example, if we read string ”ac” we reach state q2, and when ”a”, ”c” or ”t” is read we know exactly that state q6, q5 or q10 is next, respectively.
2) Splitting the input and finding the possible starting states (PSS): The process of splitting the input among the available threads is depicted in Table I.a. This is a straight forward step, where the input length is divided by the number of available threads (Line 5-6). The chunks are assigned to the threads consecutively based on the thread IDs.
The pseudo-code of the process of finding the PSS (See Section III-A) is shown on Algorithm 1 Line 10. When the PSS are determined the thread performs an REM starting from each item of PSS (Line 14). Table I.b,c depicts the REM process starting from each item of PSS; each table corresponds to a thread. For example, the first thread (see Table I.b) initiates the REM starting from the following PSS: q3, q10, q13, and q139.
For every Ri (REM starting from the PSS at index i) a CR structure is created and stored in the results. The CR structure stores the initial state, last state and the total number of occurrences for each of the sub-patterns (Line 61 - 65).
3) Determining the Converging Point: We investigated that while performing an REM of the same input string starting from different states, the REM converges after a certain num- ber of steps. An example of how the convergence happens is depicted in Table I.b,c. For example in Table I.c, the matching of CCCTATACGA... (see Table I.a) starting from q3 leads to the following transitions: q2 → q2 → q2 → q10 → q11 → q130 → q40 → q60 → q15 → q1.... Matching the same input starting from q10 leads to the following transitions: q19 → q2; no further transitions are needed, because the state at position two for both REMs (R0 and R1) is q2, which indicates the point of convergence.
Algorithm 1 k-mers CoEx Input: transition table dfa; set of final states F ; input string I Output: list of CR results
1: procedure KCOEX(dfa, F, I) 2: steps = 10 . The max number of steps required to converge 3: result = list < list < CR >> . Stores the final states for each thread 4: for T0...Tn do in parallel . T - Thread, n - total number of threads 5: start position = t i ∗ (I.length/n) . t i = thread id 6: SI = substring(start position, I.length/n) . SI - sub input 7: if ti ! = 0 then 8: P SS = get possible starting states(I[start position],
I[start position − 1]) 9: else
10: P SS = [0] 11: end if 12: fr list = list < F R > 13: psi i = 0 14: for int cs in P SS do 15: CR cr . stores the init state, last state, and total number of
occurrences for each subexpression 16: char i = 0 17: for char c in SI do 18: if char i == 0 then 19: cr.init state = cs 20: else if char i == SI.length − 1 then 21: cr.last state = dfa[cs][c] 22: end if 23: cs = dfa[cs][c] 24: if cs in F then 25: switch cs do 26: case 117 27: cr.final states[0] + + . agggtaaa | tttaccct is
found 28: case 122 or 128 29: cr.final states[1] + + .
(c|g|t)gggtaaa | tttaccc(a|c|g) is found 30: ... 31: end if 32: if psi i == 0 and char i ≤ steps then 33: F R fr 34: fr.current state = cs 35: fr.final states[0] = cr.final states[0] 36: ... 37: fr.final states[8] = cr.final states[8] 38: fr list.add(fr) 39: else if psi i > 0 and fr list[char i].current state == cs
then . check for convergence 40: cr.final states[0] + = results[t i][0].final states[0]−
fr list[char i].final states[0] 41: ... 42: break 43: end if 44: end for 45: results[t i].add(cr) 46: end for 47: end for 48: end procedure
Input: transition table dfa; the first character of the input mapped to Tn (current thread) first char; the last character of the input mapped to Tn−1 (previous thread) last char Output: list of states
49: procedure GET POSSIBLE STARTING STATES(dfa, first char, last char) 50: S = L = list < q > . q - state of the DFA 51: for q0...qn do 52: if dfa[qi][first char] ∈ Q then . Q-list of states 53: Si = qi 54: end if 55: if dfa[qi][last char] ∈ Q then 56: Li = dfa[qi][last char] 57: end if 58: end for 59: return S ∩ L 60: end procedure
61: struct CR{ 62: int init state 63: int last state 64: int final states[9] . Stores the number of occurrences for each
sub-expression (E1...E9) 65: }
66: struct F R{ 67: int current state 68: int final states[9] 69: }
TABLE I THE PROCESS OF SPLITTING THE INPUT, PERFORMING THE REM
STARTING FROM EACH POSSIBLE INITIAL STATE, AND THE PROCESS OF CONVERGENCE
a) Splitting the input string into chunks
0 1 2 3 4 5 6 7 8 9
T1 → G T G A G C C G A G . . . chunk 1 T2 → C C C T A T A C G A . . . chunk 2
b) REM for chunk 1
In it
ia l
st at
e 1 → 7 21 31 1 7 19 2 9 1 7 . . . 6 → 16 21 . . . 18 → 32 48 13 1 . . . 43 → 63 21 . . .
c) REM for chunk 2
In it
ia l
st at
e 3 → 2 2 2 10 11 130 40 60 15 1 . . . 10 → 19 2 . . . 44 → 67 90 112 129 1 8 11 5 15 . . . 63 → 84 105 2 . . .
TABLE II A TABULAR REPRESENTATION OF THE fr list (LINE 12)
Step CS E1 E2 E3 E4 E5 E6 E7 E8 E9
1 67 0 0 0 0 0 0 0 0 0 ...
4 129 0 0 0 0 0 1 0 0 0
In order to properly count the number of occurrences of k-mers when a converging point is met, we need to store the number of occurrences of k-mers for the first n steps while performing R0. For each of the first n examined characters a FR structure is created and stored in the fr list (Line 32- 38). The FR structure is shown on Algorithm 1 Line 66 - 69, which stores the current state of the automaton, and the current number of occurrences for each of the sub-patterns. An example of this process is depicted in Table II, where each row represents a FR structure. In this example, we assume that q44 is the first state in the PSS, which means that the R0 starts from q44. After four characters (CCCT ) are examined, a final
TABLE III THE TABULAR REPRESENTATION OF THE results (LINE 3)
T Q0 Qn E1 E2 E3 E4 E5 E6 E7 E8 E9
1 1 130 0 0 0 1 0 0 0 0 0 96 130 0 0 0 1 0 0 0 0 0
2 3 130 0 0 1 0 1 0 2 0 1 44 130 0 0 1 0 1 1 2 0 1
T - thread index Q0 - start state, Qn - end state E1 − E9 - sub-expressions CS - current state
state (q129 - representing E6, see Table I.e row 3) is reached, therefore the number of current occurrences of k-mers for E6 becomes 1 (see Table II).
The REM starting from the remaining states of PSS will be performed until the converging point is reached. For instance (see Table I.b), R1,R2 and R3 need only 2, 4, 2 characters to be examined, respectively. When the converging point is reached, the total number of final states (CR.final states) is calculated by adding the total number of states found for R0 (results[t i][0].final states) to the Ri and subtracting the final states (fr list[char i].final states) found of R0 at the converging point (Line 41).
IV. EXPERIMENTAL EVALUATION In this section we describe the experimentation environment
used for the evaluation of our proposed algorithm and we discuss the obtained performance results.
A. Experimentation Environment We have performed experiments on a shared-memory sys-
tem with two 12-core Intel Xeon processors of the type E5- 2695 v2 and 16GB Memory. In total the system has 24 physical cores and each physical core supports two threads (also known as logical cores). We have implemented our algorithm using C++11 programming language and OpenMP. For compilation we used the Intel Compiler icc 15.0.0. In order to address the variability in performance measurements we have repeated each experiment 20 times.
For our experimental evaluation we have selected data-sets of genomes from the GenBank National Center for Biotech- nology Information sequence database [2]: mouse (2.7GB), cat (2.4GB), dog (2.4GB), chicken (1GB), human (3.2GB) and turkey (0.2GB). The information about the data-sets is provided in Table IV.
TABLE IV DNA DATA-SETS
Genome Reference Size (MB)
Mouse GRCm38.p2 2830 Cat Felis catus-6.2 2490 Dog CanFam3.1 2440 Chicken Gallus gullus-4.0 1060 Human GRCh38 3250 Turkey Meleagris gallopavo 193
In our experiments we used the same patterns as in the regex-dna benchmark, listed in Table V [1], which are used to extract and match DNA 8-mers and substitute nucleotides according to standards of International Union of Biochemistry (IUB). In Section IV-B we compare the performance of regex- dna benchmark with our k-mers CoEx algorithm.
The DFA for the given regular expression (Table V) was generated using our PaREM tool [21]. Figure 3 depicts the DFA of 137 states, which is able to find the occurrences (including the overlapping ones) of the selected patterns. For simplicity the failure links are omitted from the DFA graph.
TABLE V PATTERNS OF THE regex-dna BENCHMARK. THE SYMBOL ”|” DETERMINES
THE ”OR” REGEX OPERATOR
E1 agggtaaa | tttaccct E2 (c|g|t)gggtaaa | tttaccc(a|c|g) E3 a(a|c|t)ggtaaa | tttacc(a|g|t)t E4 ag(a|c|t)gtaaa | tttac(a|g|t)ct E5 agg(a|c|t)taaa | ttta(a|g|t)cct E6 aggg(a|c|g)aaa | ttt(c|g|t)ccct E7 agggt(c|g|t)aa | tt(a|c|g)accct E8 agggta(c|g|t)a | t(a|c|g)taccct E9 agggtaa(c|g|t) | (a|c|g)ttaccct
Fig. 5. Speedup of our k-mers CoEx algorithm implementation.
B. Results
We first present the performance results of our k-mers CoEx algorithm for various problem and machine sizes, and thereafter we compare our algorithm with a pattern-based algorithm implementation that uses RE2 library (known as regex-dna benchmark). Figure 4 depicts the execution time in logarithmic scale for each of our selected data-sets and for various numbers of threads {1,2,6,12,24,48}. We observe a good scalability of our algorithm as we increase the number of threads or the input size. For example, the analysis of the human’s DNA sequence using one thread takes 27 seconds, and by increasing the number of threads to 2, 6, 12, 24 and 48 the execution time reduces to 14.3s, 5.3s, 3s, 2s and 1.5s respectively.
Figure 5 depicts the obtained speedup of our algorithm com- pared to a sequential version of the Aho-Corasick algorithm for DNA analysis. We may observe that the k-mers CoEx algorithm scales gracefully with respect to the size of data-sets and the number of threads. The maximal speedup of 17.65× is achieved for the largest data-set (that is the human DNA segment) using 48 threads.
Figure 6 compares the performance of our k-mers CoEx algorithm with the regex-dna benchmark [1], which is im- plemented in C++ using the RE2 library [3] and OpenMP. The RE2 implementation is based on splitting the pattern in smaller patterns, and matching the input string in parallel for each sub-pattern. Since the regex-dna benchmark does not
Fig. 3. The DFA automaton that matches (counts and extracts the location of k-mers) 8-mers from a given DNA sequence. The corresponding regular expression is shown on Table V. For simplicity the failure links are omitted from the figure.
Fig. 4. Performance results of our k-mers CoEx algorithm for various numbers of threads and data-sets. As input are used six DNA sequences of various lengths: mouse (2.7GB), cat (2.4GB), dog (2.4GB), chicken (1GB), human (3.2GB), and turkey (0.2GB). The experiments are performed by varying the number of threads (2, 6, 12, 24, and 48). The performance measurements for each experiment have been repeated 20 times.
support larger data-sets, we have compared the two algorithms for the two smallest data-sets: chicken (1060MB) and turkey (193MB). Our k-mers CoEx algorithm outperforms the regex- dna benchmark for both data-sets. We may observe that the k-mers CoEx algorithm running on one thread takes the same amount of time as the regex-dna benchmark running on the total amount of threads. This happens because one thread has to perform at least one sequential REM for a specific sub- pattern. In this class of algorithms where the execution time is mainly dependent on the length of the input, balancing the work among the available threads should be done by splitting the input string, instead of the pattern length. One could benefit from partitioning a long pattern into smaller one, in cases when the input string is either relatively short or can not be split
(Real-time Network Intrusion Detection).
V. RELATED WORK
In this section we discuss the state-of-the-art in pattern matching and DNA sequence analysis techniques for multi- core architectures.
Existing approaches use both hardware and software to accelerate the process of regular expression matching. In com- parison to hardware based state machines, which are faster, less flexible and more expensive, software based acceleration techniques are flexible in terms of updating or adding new patterns [30].
Herath et al. presented in [16] an implementation of the Aho-Corasick string matching algorithm using POSIX threads,
Fig. 6. Performance comparison between k-mers CoEx and the regex-dna benchmark (RE2)
which is based on the pattern partitioning approach. A repli- cation of the Herath’s study with the intention to improve the software implementation of the Aho-Corasick algorithm was conducted by Arudchutha et al. [6].
Marçais and Kingsford [19] present the Jellyfish tool, which is based on the lock-free hash table that is optimized for counting k-mers of length up to 31 bases. Rizk et al. [28] present a similar approach to Jellyfish [19], so called DSK, which is designed for small-memory servers. The k-mers are counted by traversing the hash tables. Using hash tables for the internal representation resulted to be memory inefficient [14]. As described by Drews et al. [14] a sequence corresponding to a human chromosome with 24-230MB of input data would require gigabytes of memory to store the k-mers information.
Drews et al. [14] achieved significant speedup by partition- ing the input string among the threads in such a way that each thread processes only sequences starting with a specified prefix used to divide the radix tree among the threads. They achieved up to 6.9× speedup on a shared memory system with 8 cores.
The n-step FM-index approach presented by Chacón et al. [11] achieved speedups from 1.4× to 2.4× with respect to their original FM-index search algorithm.
An approach based on the Aho-Corasick string matching algorithm designed for the Cray XMT architecture is proposed by Villa et al. [31]. They split the input among the available threads, and overlap the input by the pattern length. Their approach is applicable for multiple patterns as long as they are of the same length, otherwise, the occurrences of the shortest patterns occurring on the crossing border may be counted by both threads.
A method for searching arbitrary regular expressions using speculation is proposed by Luchaup et al. [18]. The drawback is that if an REM performed by a thread does not converge on its sub-input, then the next thread has to start from a new state that breaks the serialization and limits the scalability.
Our k-mers CoEx algorithm is tailored for large-scale DNA
analysis. In our approach, the DNA segment is split into several chunks, and efficient speculations of the possible initial states for each chunk are performed. Furthermore, our algorithm optimizes the REM using a converging point.
VI. SUMMARY
We have described a parallel algorithm based on Finite Automata for counting and extracting k-mers in a DNA segment. In a series of experiments with real world data- sets we have observed that the algorithm scales well with respect to various problem and machine sizes. We achieved the maximal speedup of 17.65× for the largest data-set (that is the human DNA segment) using 48 threads on a dual-socket shared-memory system with 24 physical cores. In comparison to the regex-dna benchmark our algorithm was up to three times faster.
In this paper we have studied the performance of our approach for DNA sequence analysis on a shared-memory system with two 12-core Intel Xeon processors. It may be useful to compare the performance that we achieved using all available cores of the host Intel Xeon processors with the performance achieved when all available cores of the Intel Xeon Phi coprocessor are used [22]. Furthermore, software technologies, such as [13], enable the use of all cores of homogeneous processors of the host and all available cores of the coprocessor.
Future work could address generalization of our approach for DNA sequence analysis for various types of accelerated systems using techniques that ensure performance portability [8], [17], [24], [29]. The use of modeling and simulation techniques [10], [15], [25], [26] could help to reason about the performance on extreme-scale computing architectures [4].
REFERENCES
[1] The computer language benchmarks game. http://benchmarksgame. alioth.debian.org/. Accessed: Jan. 2015.
[2] National center for biotechnology information u.s. national library of medicine. http://www.ncbi.nlm.nih.gov/genbank. Accessed: Jan. 2015.
[3] Re2: An efficient, principled regular expression library. https://code. google.com/p/re2/. Accessed: Jan. 2015.
[4] E. Abraham, C. Bekas, I. Brandic, S. Genaim, E. B. Johnsen, I. Kondov, S. Pllana, and A. Streit. Preparing hpc applications for exascale: Challenges and recommendations. In 2015 International Conference on Network-Based Information Systems (NBiS). IEEE, 2015.
[5] S. Aluru, N. M. Amato, and D. A. Bader. Editorial: Special section on high-performance computational biology. IEEE Transactions on Parallel and Distributed Systems, 17(8):737–739, 2006.
[6] S. Arudchutha, T. Nishanthy, and R. G. Ragel. String matching with multicore cpus: Performing better with the aho-corasick algorithm. arXiv preprint arXiv:1403.1305, 2014.
[7] Y. Benenson, T. Paz-Elizur, R. Adar, E. Keinan, Z. Livneh, and E. Shapiro. Programmable and autonomous computing machine made of biomolecules. Nature, 414(6862):430–434, 11 2001.
[8] S. Benkner, S. Pllana, J. Traff, P. Tsigas, U. Dolinsky, C. Augonnet, B. Bachmayer, C. Kessler, D. Moloney, and V. Osipov. PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems. Micro, IEEE, 31(5):28–41, Sept 2011.
[9] D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic acids research, 41(D1):D36–D42, 2013.
[10] I. Brandic, S. Pllana, and S. Benkner. An approach for the high-level specification of qos-aware grid workflows considering location affinity. Scientific Programming, 14(3–4):231–250, 2006.
[11] A. Chacn, J. C. Moure, A. Espinosa, and P. Hernndez. n-step fm- index for faster pattern matching. In V. N. Alexandrov, M. Lees, V. V. Krzhizhanovskaya, J. Dongarra, and P. M. A. Sloot, editors, ICCS, volume 18 of Procedia Computer Science, pages 70–79. Elsevier, 2013.
[12] S. N. Devi and S. P. Rajagopalan. Article: An index based pattern matching using multithreading. International Journal of Computer Applications, 50(6):13–17, July 2012.
[13] J. Dokulil, E. Bajrovic, S. Benkner, S. Pllana, M. Sandrieser, and B. Bachmayer. High-level support for hybrid parallel execution of c++ applications targeting intel xeon phi coprocessors. Procedia Computer Science, 18(0):2508 – 2511, 2013. 2013 International Conference on Computational Science.
[14] F. Drews, J. Lichtenberg, and L. R. Welch. Scalable parallel word search in multicore/multiprocessor systems. The Journal of Supercomputing, 51(1):58–75, 2010.
[15] T. Fahringer, S. Pllana, and J. Testori. Teuta: Tool support for performance modeling of distributed and parallel applications. In Computational Science - ICCS 2004, volume 3038 of Lecture Notes in Computer Science, pages 456–463. Springer Berlin Heidelberg, 2004.
[16] D. Herath, C. Lakmali, and R. Ragel. Accelerating string matching for bio-computing applications on multi-core cpus. In Industrial and Information Systems (ICIIS), 2012 7th IEEE International Conference on, pages 1–6, Aug 2012.
[17] C. Kessler, U. Dastgeer, S. Thibault, R. Namyst, A. Richards, U. Dolin- sky, S. Benkner, J. Traff, and S. Pllana. Programmability and perfor- mance portability aspects of heterogeneous multi-/manycore systems. In Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pages 1403–1408, March 2012.
[18] D. Luchaup, R. Smith, C. Estan, and S. Jha. Speculative parallel pattern matching. IEEE Transactions on Information Forensics and Security, 6(2):438–451, 2011.
[19] G. Marais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764– 770, 2011.
[20] A. Mellmann and et al. Prospective genomic characterization of the german enterohemorrhagic escherichia coli o104:h4 outbreak by rapid next generation sequencing technology. PLoS ONE, 6(7):e22751, 07 2011.
[21] S. Memeti and S. Pllana. Parem: A novel approach for parallel regular expression matching. In Computational Science and Engineering (CSE),
2014 IEEE 17th International Conference on, pages 690–697, Dec 2014. [22] S. Memeti and S. Pllana. Accelerating dna sequence analysis using
intel xeon phi. In 2015 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). IEEE, 2015.
[23] K. J. Nordström, M. C. Albani, G. V. James, C. Gutjahr, B. Hartwig, F. Turck, U. Paszkowski, G. Coupland, and K. Schneeberger. Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nature biotech- nology, 31(4):325–330, 2013.
[24] S. Pllana, S. Benkner, E. Mehofer, L. Natvig, and F. Xhafa. Towards an intelligent environment for programming multi-core computing systems. In Euro-Par 2008 Workshops - Parallel Processing, volume 5415 of Lecture Notes in Computer Science, pages 141–151. Springer Berlin Heidelberg, 2009.
[25] S. Pllana, S. Benkner, F. Xhafa, and L. Barolli. Hybrid performance modeling and prediction of large-scale computing systems. In Complex, Intelligent and Software Intensive Systems, 2008. CISIS 2008. Interna- tional Conference on, pages 132–138, March 2008.
[26] S. Pllana, I. Brandic, and S. Benkner. A Survey of the State of the Art in Performance Modeling and Prediction of Parallel and Distributed Com- puting Systems. International Journal of Computational Intelligence Research (IJCIR), 4(1):17–26, January 2008.
[27] J. Reif and S. Sahu. Autonomous programmable nanorobotic devices using dnazymes. In M. Garzon and H. Yan, editors, DNA Computing, volume 4848 of Lecture Notes in Computer Science, pages 66–78. Springer Berlin Heidelberg, 2008.
[28] G. Rizk, D. Lavenier, and R. Chikhi. Dsk: k-mer counting with very low memory usage. Bioinformatics, 29(5):652–653, 2013.
[29] M. Sandrieser, S. Benkner, and S. Pllana. Using Explicit Platform Descriptions to Support Programming of Heterogeneous Many-Core Systems. Parallel Computing, 38(1-2):52–56, January 2012.
[30] B. Soewito and N. Weng. Methodology for evaluating DNA pattern searching algorithms on multiprocessor. In Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE
2007, October 14-17, 2007, Harvard Medical School, Boston, MA, USA, pages 570–577, 2007.
[31] O. Villa, D. G. Chavarra-Miranda, and K. J. Maschhoff. Input- independent, scalable and fast string matching on the cray xmt. In IPDPS. IEEE, 2009.
- I Introduction
- II Regular Expression Matching (REM) with Finite Automata (FA)
- III Design and Implementation of a DNA Analysis algorithm
- III-A Our algorithm for counting and extracting the location of k-mers (k-mers CoEx)
- III-B Implementation Aspects
- III-B1 Building the Deterministic Finite Automaton
- III-B2 Splitting the input and finding the possible starting states (PSS)
- III-B3 Determining the Converging Point
- IV Experimental Evaluation
- IV-A Experimentation Environment
- IV-B Results
- V Related Work
- VI Summary
- References