Data structures

naren28
ProjectAssignment.zip

Comparison of Three Pattern Matching Algorithms using DNA Sequences.pdf

www.ijsetr.com

ISSN 2319-8885

Vol.03,Issue.35

November-2014,

Pages:6951-6955

Copyright @ 2014 IJSETR. All rights reserved.

Comparison of Three Pattern Matching Algorithms using DNA Sequences NYO ME TUN

1 , THIN MYA MYA SWE

2

1 Dept of Research and Development, Computer University, Mandalay, Myanmar, E-mail: nyomizin@gmail.com.

2 Dept of Research and Development, Computer University, Mandalay, Myanmar.

Abstract: Pattern matching is commonly used in computer science and information processing such as text editor in computing,

database queries, bioinformatics, language syntax checker, music content retrieval, DNA sequences matching, search engines

and many more applications. There are many forms of pattern matching. Among them, the most well-known form is

bioinformatics application, DNA sequences analysis of various diseases which are stored in database for retrieval and

comparison. To get high quality results in a short time is to use pattern matching algorithms because of DNA database is very

complex and huge and not to retrieve easily. This paper presents three pattern matching algorithms, Knuth-Morris-Pratt (KMP),

Brute-Force and Boyer-Moore. The main process of this system is to find the matched DNA sequence in DNA database that find

the matched DNA sequence in DNA database using these pattern matching algorithms. In this system, at least two DNA

sequences are required to analyze. This system compares similarity values with threshold value and stores particular result which

is diseased or not. Finally, this system can identify optimal result according to each result using voting. This system is

implemented with Java programming language and the comparison results are demonstrated with bar graph with respect to

similarity and processing time of each algorithm.

Keywords: Pattern Matching, DNA Sequences Analysis, Medical Diagnosis.

I. INTRODUCTION

Exact sequence matching is a vital component of

bioinformatics which is the application of computer

technology to management and analysis of biological data.

Computer technologies are applied to gather, store, analyze

and merge biological data. Therefore, bioinformatics is an

interface between biological and computational sciences [2].

DNA Pattern matching, the problem of finding subsequences

within a long DNA sequence has many applications in

computational biology. As the sequences can be long,

matching can be an expensive operation, especially as

approximate matching is allowed [6]. There are many

biological sequence data in different databases. The

biological and bibliographical sequence data are growing in

these databases at an exponential rate. The computational

demands are to find matched sequence and to analyze these

sequence data contained in these databases. Exponential

growth of gene databases enforces the need for efficient

information extraction methods and algorithms for DNA

sequencing exploiting existing large amount of DNA

information available in DNA database. Manually it is very

tedious and time consuming [1].

Deoxyribonucleic acid (DNA) is an acronym for the

molecule deoxyribonucleic acid and it is also called a double

helix because it is a double-stranded molecule. It is mainly

composed of nucleotides of four types. The four bases in

DNA are Adenine (A), Cytosine (C), Guanine (G), and

Thymine (T). A DNA sequence is a representation of a string

of nucleotides contained in a strand of DNA. For example:

ATTCGTAACTAGTAAGTTA. The DNA sequencing

techniques have allowed the vast amount of data to be

analyzed in a short span of time. So, pattern matching

techniques plays a vital role in computational biology for

data analysis related to biological data such as DNA

sequences[1][2][6]. This paper presents and analyzes three

pattern matching algorithms to find the matched DNA

sequence in DNA database. Pattern-matching algorithm

matches the pattern exactly or approximately within the text.

The organization of this paper is as follows. SectionII

presents the related work of pattern matching process. In

sectionIII, theoretical background for DNA sequence

matching is given. In sectionIV, proposed system is

demonstrated. In sectionV, system implementation is

described. In sectionVI, the experimental results are shown.

Finally, in sectionVII, the conclusions and references are

given.

II. RELATED WORK

The pattern matching algorithms can be applied for

detecting the unusual patterns present in the gene database. It

can show how the disease can be transformed from parents to

their children and can identify the presence of the disease on

hereditary basis and its impact. Moreover, KMP algorithm

known as DNA sequencing algorithm is used to detect

unusual patterns in DNA sequences and to analyze these

sequences in the gene database [1][2]. Approximate String

Matching Algorithm, fuzzy string searching technique is used

NYO ME TUN, THIN MYA MYA SWE

International Journal of Scientific Engineering and Technology Research

Volume.03, IssueNo.35, November-2014, Pages: 6951-6955

to find strings that match a pattern approximately. It is

applied in finding approximate substring matches inside a

given string and finding dictionary strings that match the

pattern approximately. Another application of approximate

string matching can be effectively used to retrieve fast video

as it uses the content based video retrieval in contrast with

the traditional video retrieval which was slow and time

consuming. String based video retrieval method first converts

the unstructured video into a curve and marks the feature

string of it. Approximate string matching is then used to

retrieve video quickly [3][4].

III. THEORETICAL BACKGROUND

A. Pattern Matching

DNA pattern matching is a fundamental and upcoming

area in computational molecular biology. Pattern matching is

an important task of the pattern discovery process in today's

world for finding the structural and functional behavior in

proteins and genes. Although pattern matching is commonly

used in computer science and information processing, it can

be found in everyday tasks. The string matching can be

described as: given a specific strings P generally called

pattern searching in a large sequence/text T to locate P in T.

if P is in T, the matching is found and indicates the position

of P in T, else pattern does not occurs in the given text. As

the size of the data grows it becomes more difficult for users

to retrieve necessary information from the sequences. Hence

more efficient and robust methods are needed for fast pattern

matching techniques. It is one of the most important areas

which have been studied in bioinformatics. Pattern matching

techniques has two categories and is generally divides into:

 Single pattern matching

 Multiple pattern matching techniques.

Single pattern matching is to find all occurrences of the

pattern in the given input text. Suppose, if more than one

pattern are matched against the given input text

simultaneously, then it is known as, multiple pattern

matching. Multiple pattern matching can search multiple

patterns in a text at the same time. It has a high performance

and good practicability, and is more useful than the single

pattern matching algorithms [5][6][7]. Let P = {p1, p2,

p3,...,pm} be a set of patterns of m characters and

T={t1,t2,t3…,tn} in a text of n character which are strings of

nucleotide sequence characters from a fixed alphabet set

called Σ={A C, G, T}[7]. There are various pattern matching

algorithms. These efficient algorithms are used to the

sequence of DNA in the DNA database. The present day

pattern matching algorithms match the pattern exactly or

approximately within the text [1][2][7]. Among them, this

paper applies three pattern matching algorithms. They are:

 Knuth-Morris-Pratt

 Brute-Force and

 Boyer-Moore

B. Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm is linear sequence matching

algorithm and plays an important role in bioinformatics,

DNA sequence matching, disease detection etc. It scans the

sequence from left to right. This algorithm preprocess the

pattern string P using failure function. There is a match,

increase the current indices. If not, consult the failure

function the new index in P here needs to continue checking

P against T. The algorithm repeat this process until find a

match of P (pattern) in T(Text) or index for T reach n, the

length of T. The main part of the KMP algorithm is the

while-loop which performs comparison between a character

in T and a character in P for each iteration. This algorithm

doesn’t need backtracking and shift more than one position.

Therefore, KMP is called as an intelligent search algorithm

[1] [2].

TABLE I: Knuth-Morris-Pratt (KMP) Pattern Matching

Algorithm

C. Brute-Force Algorithm

The brute force algorithm is a powerful technique that is

used to search the pattern. This algorithm is simplest string

method which is used to match each character of pattern to

the corresponding character in text until all characters are

found to match successful search; or mismatch is detected.

This algorithm scans the sequence from left to right direction.

This algorithm doesn’t need preprocessing phase. It consists

of two nested loops, with the outer loop indexing through all

possible starting indices of the pattern in the text, and the

inner loop indexing through each character of the pattern,

comparing it to its potentially corresponding character in the

text [2] [6].

Comparison of Three Pattern Matching Algorithms using DNA Sequences

International Journal of Scientific Engineering and Technology Research

Volume.03, IssueNo.35, November-2014, Pages: 6951-6955

TABLE II: Brute-Force Pattern Matching Algorithm

D. Boyer-Moore Algorithm

Boyer-Moore algorithm is an efficient string searching

algorithm that can determine whether or not a match of a

particular string exists within another string. So, this

algorithm is used in bioinformatics mostly, for disease

detection. This algorithm scans the matching of the two

sequences from right to left direction. Therefore, this

algorithm takes backward approach. If no mismatch occurs,

then the pattern has been found. Otherwise, the algorithm

computes a shift; that is, an amount by which the pattern is

moved to the right before a new matching attempt is

undertaken. In case of a mismatch, the knowledge gained

from the last occurrence function helps in calculating the new

index to start the next search. This algorithm preprocesses the

pattern P using the last-occurrence function L mapping ∑,

where L(c) is defined as: the largest index i such that P[i] = c

or -1 if no such index exists [1] [2].

E. DNA (Deoxyribonuclei Acid)

Deoxyribonucleic Acid (DNA) is a nucleic acid that

contains genetic instructions. DNA is a double helix

structure. It contains three components: a five-carbon sugar

(Deoxyribose), a series of phosphate groups, and four

nitrogenous bases. The four bases in DNA are Adenine (A),

Thymine (T), Guanine (G), and Cytosine (C). Thymine and

adenine always come in pairs. Likewise, guanine and

cytosine bases come together too. Every human has his/her

unique genes. Genes are made up of DNA; therefore the

DNA sequence of each human is unique. The sequence of

DNA constitutes the heritable genetic information in nuclei,

plasmids, mitochondria, and chloroplasts that forms the basis

for the developmental programs of all living organisms

[1][2][6][7].

TABLE III: Boyer-Moore Pattern Matching Algorithm

The process of determining the DNA sequence is used to

map out the sequence of the four nucleotides that comprise a

strand of DNA. Studying the DNA sequence is necessary in

basic research studying fundamental biological processes, as

well as in applied fields such as diagnostic and forensic

research. The four nucleotides (A, T, G, C) are grouped to

form words and these words make sentences which are called

genes. The occurrence of unwanted or mutated words causes

the disease. Every disease will have its own words or

sequences on the occurrence of the mutated sequence in the

DNA [1][2][6]. This paper matches the DNA sequences of

three diseases called HIV, Lung cancer and Leukaemenia.

IV. PROPOSED SYSTEM

This paper presents three pattern matching algorithms for

DNA sequence matching. The main process of this paper is

finding the matched DNA sequence in DNA database using

pattern matching algorithms. When entering human DNA

sequence (suspected DNA sequence) and the type of disease

to be checked as input, the matched DNA sequence are

searched using three pattern matching algorithms: Knuth-

Morris-Pratt (KMP), Brute-Force and Boyer-Moore. The

total matched indices are got and similarity values with

respect to entire sequence are computed. Then, the similarity

values are compared with threshold value (user defined

value) and stores the particular results which is diseased or

not. Finally, the optimal result is sent back according to each

stored result using voting method. Fig.1 is the proposed

system architecture for the DNA sequence matching.

NYO ME TUN, THIN MYA MYA SWE

International Journal of Scientific Engineering and Technology Research

Volume.03, IssueNo.35, November-2014, Pages: 6951-6955

Fig.1. System Architecture Design.

V. SYSTEM IMPLEMENTATION This system is implemented using Java programming

language and tested using different DNA sequence with

different file size. Pattern matching techniques are used to

search the matched DNA sequences in mostly bioinformatics.

The main process of the system is finding the matched DNA

sequence in a set of DNA database. The pattern matching

algorithms are used to find the matched sequence and the

total matched indices are used to compute similarity value

with respect to entire sequence. Since DNA sequences are

very large and complex, it is impossible to analyze the vast

amount of data in a short of span. The pattern matching

algorithms are efficient to trace the sequence of DNA in the

DNA database. These efficient pattern matching techniques

can give optimal result for particular diseased DNA

sequence. All the three algorithms require at least two DNA

sequences. One of these sequences is generally a suspected

DNA sequence and other is a diseased sequence. Moreover,

pattern matching techniques is used to optimize the time and

to analyze the vast amount of data in a short span of time.

VI. EXPERIMENTAL RESULTS

The results of the proposed system are described in this

section. Fig.3 describes the particular result of checked

disease for particular algorithm. From this Fig.3, we can

decide the optimal result for particular disease. Fig.2 (a) is

the input DNA, Figure 2 (b) presents the types of DNA

available in this system. Fig.3 describes the result of the

system. Finally, the similarity and processing time are shown

as Fig.4 which is two comparison results of three pattern

matching algorithms. Threshold value used in this system is

0.75 (user defined value).

(a)

(b)

Fig.2. (a) Input DNA, (b). DNA Types available in this

system.

Fig.3. Result of the system.

Fig.4. Two Comparison Results of Three Pattern

Matching Algorithms.

VII. CONCLUSION

This paper provides a path in diagnosing the disease by the

identification of presence of diseased DNA sequence in DNA

database. The pattern matching algorithms are effectively

used in matching DNA sequences because of DNA database

is very complex and huge and not to retrieve easily. The

Comparison of Three Pattern Matching Algorithms using DNA Sequences

International Journal of Scientific Engineering and Technology Research

Volume.03, IssueNo.35, November-2014, Pages: 6951-6955

pattern matching process is needed to keep pace with ever

growing demands in sequence comparison. The required

sequences are suspected sequence and diseased sequence.

These algorithms in this system are easy to implement and

effective in multiple detections of DNA sequences. This

paper identifies the particular disease on DNA sequence

using three pattern matching algorithms. Finally, this paper

gives optimal result according to each result using voting.

VIII. ACKNOWLEDGEMENTS My Sincere thanks to my supervisor Dr.Thin Mya Mya

Swe, Lecture, Research and Development Department,

Computer University (Mandalay), Myanmar for providing

me an opportunity to do my research work. I express my

thanks to my Institution namely Computer University

(Mandalay) for providing me with a good environment and

facilities like internet books, computers and all that as my

source to complete this research. My heart-felt thanks to my

family, friends and colleagues who have helped me for the

completion of this work.

IX. REFERENCES

[1] Kuhu Shukla, Samarjeet Borah, Sunil Pathak, “An

analysis of influential DNA sequencing Algorithms”.

[2] S.Rajesh, S. Prathima, Dr. L.S.S.reddy, “Unusual pattern

detection in DNA database using KMP algorithm”.

[3] Nimisha Scingla, Deepark Garg, “String Matching

algorithms and their Applicability in various Applications”.

[4] Vidya Saikrishna Prof. Akhtar Rasool and Dr. Nilay,

Khare Department, “String Matching and its Application in

Diversified Fields”.

[5] “String and pattern matching Algorithms”.

[6] Rajib Paul, Pooja Roy, “Improved Pattern Matching To

Find DNA Patterns”.

[7] Raju Bhukya, DVLN Somayajulu “Exact Multiple Pattern

Matching Algorithm using DNA Sequence and Pattern Pair”.

[8]http://www.broadinstitute.org/cgibin/cancer/datasets.cgi.

[9]http://www.pombase.org/downloads/dna datasets.

A Novel Pattern Matching Algorithm in Genome Sequence Analysis.pdf

A Novel Pattern Matching Algorithm in Genome Sequence Analysis

Ashish Prosad Gope #1, Rabi Narayan Behera#2

#1M.Tech Student (4th Semester) Dept. of Information Technology, Institute of Engineering & Management,

Salt Lake, Kolkata, India #2Asst. Professor, Dept. of Information Technology, Institute of Engineering & Management,

Salt Lake, Kolkata, India

Abstract— DNA sequences has been for years a great concern for many research papers in Bio-Informatics. DNA sequence is a long string of characters specifying the nucleotides presented in the DNA. In bioinformatics the most well-known application is DNA sequence detection. Stored DNA sequence of various disease are retrieved and compared in order to check for the existence of a disease. To search for the pattern a well-established pattern matching algorithm is needed in order to get the result at the cost of sufficient amount of time. We’ve specifically referred the DNA sequences instead of any text strings and implemented the algorithms upon it. This paper evaluates four pattern matching algorithms’ performance and then proposes a new algorithm based upon Rabin Karp algorithm which ensures that character comparisons can be eliminated from Rabin Karp algorithm. These algorithms look for the specified pattern in a huge strand of DNA sequence.

I. INTRODUCTION Bioinformatics is an interdisciplinary research area that is the interface between the biological and computational sciences. The advent of electronic computers has arguably been the most revolutionary development in the history of Science and technology. The Ultimate goal of bioinformatics is to uncover the wealth of Biological information hidden in the mass of data and obtains a clearer insight into the fundamental biology of organisms. This new knowledge could have profound impacts on fields as varied as human health. Agriculture, the environment, energy and biotechnology. There are many other applications of bioinformatics, including predicting entire protein strands, learning how genes express themselves in various species, and building complex models of entire cells. As computing power increases and our databases of genetic and molecular information expand, the realm of bioinformatics is sure to grow and change drastically, allowing us to build models of incredible complexity and utility. When we know a particular sequence is the cause for a disease, the trace of the sequence in the DNA and the number of occurrences of the sequence defines the intensity of the disease. As the DNA is a large database, I propose String and Pattern matching algorithms to find out a particular sequence in the given DNA. This paper entirely focuses on a novel approach for detecting the patterns present in the gene database. Pattern matching is a mechanism to find out the exact location of a specified pattern, iff the pattern exists in the text. Before moving forward let us convey you about the structure of our paper. We’ve discussed the preliminaries needed to move forward in section 2, after that in section 3, disease caused by genetic factors has been revisited. In section 4 we discussed detection of disease using pattern matching and in section 5 the

central ideas of this paper i.e. the pattern matching problem has been discussed. In subsequent sections i.e. in section 6, 7, 8 and 9, the Brute Force, Knuth-Morris-Pratt algorithm, Boyer Moore algorithm and Rabin Karp algorithm respectively has been described. In section 10 we’ve described our idea to improve the Rabin Karp algorithm and in section 11 the references used in this paper has been given.

II. PRELIMINARIES Every human has his/her unique genes. Genes are made up of DNA. DNA is contained in each living cell of an organism, and it is the carrier of that organism’s genetic code. The genetic code is a set of Sequences which define what proteins to build within the organism. DNA consists of two strands, each being a string of four nitrogenous bases i.e. Adenine, Cytosine, Guanine and Thymine. In a computer we represent each nitrogen base with a single character: A for Adenine, G for Guanine and C for Cytosine and T for Thymine. Thymine (T) & Adenine (A) always come in pairs. Likewise, Guanine (G) & Cytosine (C) bases come together too. Using these codes an entire DNA can be coded based upon their nucleotides contained in a strand. For example: ATGCGATATGCATGCATGCATAT. The term DNA sequencing comprehends biochemical methods for determining the order of the nucleotide bases, adenine, guanine, cytosine, and thymine, in a DNA oligonucleotide [10]. Determining the DNA sequence is therefore useful in basic research studying fundamental biological processes, as well as in applied fields such as diagnostic or forensic research.

The power and ease of using sequence information has however, made it the method of choice in modern bioinformatics analysis.[11]

III. DESEASE CAUSED BY GENETIC FACTORS

An unhealthy symptoms or a specific illness in the body is termed as a disease. Disease refers to any unnatural condition of an organism that affects normal functions. Diseasemay be referred to disabilities, disorders, syndromes, symptoms[9].Genes are the basic building blocks of heredity. They get passed from parent to child. They hold DNA, the instructions for making proteins. A genetic disease is any disease that is caused by an abnormality in an individual's genome. Some of the genetic disorders are inherited from the parents, while other genetic diseases are caused by mutations in a pre-existing gene or group of genes.

IV. DETECTION OF DISEASE USING PATTERN MATCHING Over the last decade, genetic studies have identified numerous associations between chromosomal alleles in the human genome and important human diseases. Unfortunately, these extending

Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457

www.ijcsit.com 5450

findings of casual variants in the region of DNA is not a straight forward task [8]. Causal variant identification typically involves searching through sizable regions of genomic DNA in the locality of disease-associated SNPs (single nucleotide Polymorphism).When we know a particular sequence is the cause for a disease, the trace of the sequence in the DNA and the number of occurrences of the sequence defines the intensity of the disease. As the DNA is a large database we need to go for efficient algorithms to find out a particular sequence in the given DNA.

V. THE PATTERN MATCHING PROBLEM In pattern-matching problem on strings, we are given a text string T of length n and a pattern string P of length m, and want to find whether P is a substring of T. The meaning of a “match” is that there is a substring of text T starting at some index i that matches pattern P, so that T[i]=P[0], T[i+1]=P[1] ... T[i+m-1]=P[m-1] i.e. P= T[i..i+m-1]. Thus, the output from a pattern-matching algorithm is either an indication that the pattern P does not exist in T or the starting index in T of a substring matching P.[12]

T =” abacaabaccabacabaabb “

And the pattern string:

P = "abacab".

Then P is a substring of T. Namely, P = T [10...15]. There are various pattern-matching algorithms. Here we are to review four pattern matching algorithms and present an algorithm which is based upon Rabin-Karp algorithm but modified. These efficient algorithms can be used to trace the sequence of DNA in a huge gene database. Following are the four algorithms which are described below.

• Brute-Force • Knuth-Morris –Pratt • Boyer-Moore • Rabin-Karp Algorithm

VI. BRUTE FORCE ALGORITHM It is also known as Naive String Matching algorithm. It has no pre- processing phase, needs constant extra space. It always shifts the window by exactly one position to the right. It requires 2n expected text characters comparisons. It finds all valid shifts using a loop that checks the condition P[1....m]=T[s+1...s+m] for each of the n-m+1 possible values of s. The algorithm is as following: BRUTE_FORCE(T, P) n ← length[T ] m ← length[P] for s ← 0 to n − m

do if P[1 . .m] = T [s + 1 . . s + m] then print “Pattern occurs with shift” s

The Brute force string-matching procedure can be presented as shifting the pattern over the text, observing for which shifts all of the characters of the pattern equal the corresponding characters in the text, as illustrated in the following example.

T=ANPANMAN P=MAN

VI.I. Complexity Procedure BRUTE_FORCE takes time O(m) in best case i.e. when the pattern is found with in first m characters of text. And in the worst case the pattern will be matched total (m (n-m+1)). For

example, consider the text string “AN” (a string of n a’s) and the pattern “AM”. For each of the (n−m+1) possible values of the shift s, the loop on line 4 to compare corresponding characters must execute m times to validate the shift. The worst-case running time is thus O(mn). The running time of BRUTE_FORCE is equal to its matching time, since there is no preprocessing. VI.II. Drawbacks Of This Approach In O(mn) approach. if ‘m’ is the length of pattern ‘p’ and ‘n’ is the length of string ‘S’. Suppose S=ATGATAATGAAG and p=AATA.

Figure 1: Brute Force comparison process

j= 0 1 2 3 4 5 6 7 8 9 10 S= A T G A T A A T G A G p= A T A A A T A A A T A A A T A A

In table 1 we’ve shown when mismatch is detected for the first time in comparison of p[3] with S[3], pattern ‘p’ would be moved one position to the right and matching procedure resumes from here. Here the first comparison that would take place would be between p[0]=‘A’ and S[1]=‘T’. It should be noted here that S[1] had been previously involved in a comparison in 2nd iteration of the loop in this algorithm. This is a repetitive use of S[1] in another comparison. It is these repetitive comparisons that lead to the runtime of O(mn), which made it very slow.

VII. KMP ALGORITHM We now present a linear-time string-matching algorithm due to

Knuth, Morris, and Pratt. The basic idea behind the algorithm discovered by Knuth, Morris, and Pratt is this: when a mismatch is detected, our false start (which is the main drawback of Brute Force algorithm) consists of characters that we know in advance (since they’re in the pattern). Somehow we should be able to take advantage of this information instead of backing up the pointer over all those known characters

VII.I The Prefix Function For A Pattern

Fully skipping past the pattern on detecting a mismatch as described in the previous paragraph won’t work when the pattern could match itself at the point of the mismatch. To calculate the positions for the pattern as to how much a pattern need to shift itself so that the corresponding characters of text match with it. The table is called as next table or sometimes failure function (figure 2) for the pattern to be searched [14]. Consider another example of this next table. This next[j] be the character position in the pattern which should be checked next after such a mismatch, so that we can slide the pattern (j - next[j]) places relative to the text [6].

Figure 2: Next table j 1 2 3 4 5 6 7 8 9 10

pattern A T G A T G A G A T

next -1 0 0 -1 0 0 -1 4 -1 0

Here next[j]= 0 means that we are to slide the pattern all the way past the current text character. Now we shall discuss how to pre- compute this table; fortunately, the calculations are quite simple, and we will see that they require only O(m) steps. Now we

Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457

www.ijcsit.com 5451

represent following the algorithm to calculate the next function or prefix function:

next(p) //p signifies pattern int i=0, j=-1; next[i]=j; for(i=0;i<m;i++) { if(i==0) next[i]=j; else if(p[i]==p[j]) { next[i]=next[j]; } else { next[i]=j; } while (j>=0 && p[i]!=p[j]) j=next[j]; j++; }

This program takes O(m) units of time, as next[t] in the innermost loop always shifts the upper copy of the pattern to the right, so it is performed a total of m times at most. A slightly different way to prove that the running time is bounded by a constant times m is to observe that the variable starts at 0 and it is increased, m- 1 times, by 1; furthermore its value remains nonnegative. Therefore the operation next[j], which always decreases j, can be performed at most m-1 times [6].

VII.II. The Pattern Matching Algorithm

The Knuth-Morris-Pratt matching algorithm is given in pseudo code below as the procedure KMP-MATCHER. KMP- MATCHER calls the auxiliary procedure next() to compute next table. Below T & P signifies text & pattern respectively.

KMP-MATCHER(T, P) n ← length[T] m ← length[P] next=next(P) //array consisting of prefix values j ← 0 //Number of characters matched. for k ← 1 to n //Scan the text from left to right. do while j > 0 and P[j + 1] ≠ T [k] do j ← next[j] //Next character does not match. if P[j + 1] = T [k ] then j ← j + 1 //Next character matches. if j = m //Is all of P matched? then print “Pattern occurs with shift” k– m j ← next[j] // Look for the next match. For convenience, let us assume that the input text is present in an array text T[ 1…n ], and that the pattern appears in pattern P[1…m]. We shall also assume that m > 0, i.e., that the pattern is nonempty. Let k and j be integer variables such that text T[k] denotes the current text character and pattern P[j] denotes the corresponding pattern character; thus, the pattern is essentially aligned with positions p + 1 through p + m of the text, where k =p +j [15]. VII.III. Complexity The KMP algorithm works by turning the patterns given into a machine, and then running the machine. It takes O(m) space and time complexity in pre-processing phase, and O(n+m) time complexity in searching phase (independent of the alphabet size). KMP is a linear time string matching algorithm. [6]

VIII. BOYER-MOORE ALGORITHM A significantly faster string searching method can be developed by scanning the pattern from right to left when trying to match it against the text. The Boyer-Moore algorithm (BM) was developed by R.S.Boyer and J.C.Moore in 1977 [7]. The Boyer Moore algorithm scans the characters of the pattern from right to left beginning with the rightmost one and performs the comparisons from right to left. VIII.I Bad Character Rule To convey the idea of the bad character rule, let us suppose that the last (rightmost) character of pattern P is y and the character in text T it aligns with is x, x ≠ y. When mismatch occurs, we can safely shift P to the right so that the rightmost x in P is below the mismatched x in T, and this is possible if the rightmost position of character x exists in pattern P. This observation is formalized below [16]. For a particular alignment of pattern P against text T, the rightmost (n-i) characters of pattern P match their counterparts in text T, but the next character to the left, P(i), doesn’t matches with its counterpart, say in position k of T. The bad character rule says that P should be shifted right by Max[1,i - R(T(k))] places. The point of this shift rule is to shift P by more than one character when possible. In the below example, T(5) = t mismatches with P(3) and R(t) = 1 so P can be shifted right by two positions. After the shift, the comparison of P and T begins again at the right end of P.

Figure 3: Compare from right 1 2

1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7

T A C T C T T G A T G C T C T T A C

P A G A T G A T

VIII.II. Extended Bad Shift Rule When a mismatch occurs at position i of pattern P and the mismatched character in text T is x, then shift P to the right so that the closest x to the left of position i in P is below the mismatched x in T.

VIII.III The Good Suffix Rule Now we introduce another rule called the good suffix rule. Suppose for a given pattern P and text T, a substring t of text T matches a suffix of pattern P, but a mismatch occurs at the next comparison to the left. Then find, if it exists, the rightmost copy t’ of t in P such that t0 is not a suffix of P and the character to the left of t’ in P differs from the character to the left of t in P. Shift P to the right so that substring t0 in P is below substring t in T (see Figure 4). If t’ does not exist, then shift the left end of P. past the left end of t in T by the least amount so that a prefix of the shifted pattern matches a suffix of t in T. If no such shift is possible, then shift P n places to the right. If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T. If no such shift is possible, then shift P by n places, i.e., shifting P past t in T.

Figure 4: case when good suffix rule applies 0 1

1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8

T p r s t a b c t u b a b v q x r s t

^

P q c a b d a B d a b

1 2 3 4 5 6 7 8 9 0

Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457

www.ijcsit.com 5452

Good suffix shift rule, where character x of T mismatches with character y of P. Characters y and z of P are guaranteed to be distinct by the good suffix rule, so z has a chance of matching x. When the mismatch occurs at position 8 of P and position 10 of T, t = ab and t0 occurs in P starting at position 3. Hence P is shifted right by six places resulting in the following alignment.

Figure 5: Shifting using good suffix rule

0 1 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 T p r s t a b c t u b a b v q x r s t ^ P q c a b d a b a b 1 2 3 4 5 6 7 8 9 Now in cases where we have matched the final m characters of pattern P before failing, we clearly wish to shift our attention down string by 1+m. So, L(i) is the largest index j less than n such that Nj (P)≥|P[i..n]| (which is n - i + 1). L’(i) is the largest index j less than n such that Nj(P) = |P[i..n]| = (n - i + 1).Now The pre- processing stage must also prepare for the case when L’(i) = 0 or when an occurrence of P is found. l’(i) equals the largest j ≤ |P[i..n]|, which is n-i+1, such that Nj (P) = j. So we can say that the required shift will be max (L(i),L’(i)).

VIII.IV. The complete Boyer-Moore algorithm: Given the pattern P, //pre-processing stage Compute L’(i) and l(i) for each position i of P, and compute R(x) for each character x ∈ ∑ //Search stage k := n; while k ≤ m do begin i := n; h := k; while i > 0 and P(i) = T(h) do begin i := i - 1; h := h - 1; end; if i = 0 then begin report an occurrence of P in T ending at position k. k := k + n – l’(2); end else shift P (increase k) by the maximum amount determined by the (extended) bad character rule and the good suffix rule. end

Note that although we have always talked about shifting P", and given rules to determine by how much P should be “shifted", there is no shifting in the actual implementation. Rather, the index k is increased to the point where the right end of P would be shifted". Hence, each act of shifting P takes constant time [17]. The good suffix rule in Boyer-Moore method has a worst-case running time of O(m) provided that the pattern does not appear in the text. This was first proved by Knuth, Morris and Pratt [6]. VIII.V. Algorithm Complexity The BM algorithm is successful at achieving a sub linear running time in the average case, and if some special conditions occurred then also was capable of O(n+m) in the worst case.

IX. RABIN-KARP ALGORITHM Previous three algorithms which we’ve seen is based upon string matching to see whether the pattern is matched with the text portion or not. RABIN KARP matcher is one of the most effective string matching algorithms. To find a numeric pattern ‘P’ from a given text ‘T’. It first divides the pattern with a predefined prime number ‘q’ to calculate the modular of the pattern P. Then it tests the first m characters (m=|P|) from text T to compute remainder of m characters from text T. If the remainder of the Pattern and the remainder of the text T are equal only then we compare the characters of the text portion with the pattern otherwise there is no need for the comparison [1]. We’ve to repeat the process for next set of characters from text for all the possible shifts which are from s=0 to nm (where n denotes the length of text and m denotes the length of P). So according to this two numbers n1 and n2 can only be equal if

REM (n1/q) = REM (n2/q) [1] After division we will be having three cases:-

• Case 1: Successful hit: - In this case if REM (n1) = REM(n2) and also characters of n1 matches with characters of n2.

• Case 2: Spurious hit: - In this case REM (n1) = REM (n2) but characters of n1 are not equal to characters of n2.

• Case 3: If REM (n1) is not equal to REM (n2), then no need to compare n1 and n2.

For a given text T, pattern P and prime number q T=234567899797797976534356678886756456890975545343434 24545475655454 P=667888 q=11 So to find out this pattern from the given text T we will take equal number of characters from text as in pattern and divide the value of these characters with predefined number q and also divide the pattern with the same predefined number q. Now compare their remainders to decide whether to compare the text with pattern or not. Rem (Text) =234567/11=3 Rem (Pattern) =667888/11=1 As both the remainders are not equal so there is no need to compare text with pattern. Now move on to set of characters of same length next from text and repeat the procedure. The Boyer Moore Algorithm goes as follows: Rabin_Karp_Matcher (T,P,d,q) { n =Length (T) m= Length (P) t0=0 p=0 h=dm-1mod q for i=1 to m { p = (d * p + P[i]) mod q t0 =(d * t0 + T[i] ) mod q } for s =0 to n-m { if ts=p {

//comparison for spurious hits if P[1….m] = T[s+1…….s+m]

then print pattern matches at shift ‘s’ } if s<= n-m

Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457

www.ijcsit.com 5453

ts+1= (d(ts-h*T[s+1]) + T[s+1+m] ) mod q }

} So the entire process can be written as follows: where Say P has length L and S has length n. One way to search for P in S:

1. Hash P to get h(P). 2. Iterate through all length L substrings of S, hashing those substrings and comparing to h(P). 3. If a substring hash value does match h(P), do a string comparison on that substring and P, stopping if they do match and continuing if they do not.

IX.I Numerical Example: Let’s step back from strings for a second. Say we have P and S be two integer arrays: P = [5; 0; 3; 3; 0] S = [4; 8; 5; 0; 3; 3; 0; 8] The length 5 substrings of S will be denoted as such: S0 = [4; 8; 5; 0; 3] S1 = [8; 5; 0; 3; 3] S2 = [5; 0; 3; 3; 0] And so on … We want to see if P ever appears in S using the three steps in the method above. Our hash function will be: h(k)= (k[0] * 104 + k[1] * 103 + k[2] * 102 + k[3] * 101 + k[4] * 100)mod m Or in other words, we will take the length 5 array of integers and concatenate the integers into a 5 digit number, then take the number mod m. h(P) = 50330 mod m, h(S0) = 48503 mod m, and h(S1) = 85033 mod m. Note that with this hash function, we can use h(S0) to help calculate h(S1). We start with 48503, chop off the first digit to get 8503, multiply by 10 to get 85033, and then add the next digit to get 85033. More formally: h(Si+1) = [(h(Si) - (10

5 * first digit of Si)) * 10 + next digit after Si] mod m We can imagine a window sliding over all the substrings in S. Calculating the hash value of the next substring. In this numerical example, we looked at single digit integers and set our base b = 10 so that we can interpret the arithmetic easier. To generalize for other base b and other substring length L, our hash function is h(k) = (k[0]bL-1 + k[1]bL-2 + k[2]bL-3.... k[L - 1]b0) mod m And calculating the next hash value can be done by: h(Si+1) = b(h(Si) – b

L-1S[i]) + S[i + L] mod m Following is the example taken from [15]:

Figure 6:

The above figure[15] illustrates (a) A text string. A window of length 5 is shown shaded. The numerical value of the shaded

number is computed modulo 13, yielding the value 7. (b) The same text string with values computed modulo 13 for each possible position of a length-5 window. Assuming the pattern P = 31415, we look for windows whose value modulo 13 is 7, since 31415 ≡ 7 (mod 13). The first, beginning at text position 7, is indeed an occurrence of the pattern, while the second, beginning at text position 13, is a spurious hit. (c) Computing the value for a window in constant time, given the value for the previous window. The first window has value 31415. Dropping the high-order digit 3, shifting left (multiplying by 10), and then adding in the low-order digit 2 gives us the new value 14152.

X. IMPROVED IDEA: Theory As we can see, spurious hit is an extra burden on algorithm which increases its time complexity when we have to compare the text with pattern and won’t be able to get the pattern at that shift. So to avoid this extra matching, we’ve improved the Rabin Karp algorithm slightly, called IRK algorithm which says that along with remainders compare the quotients also. That is IRK checks whether, REM(n1/q)=REM(n2/q) and QUOTIENT (n1/q) = QUOTIENT (n2/q), where n1= pattern & n2=Text & q is the prime number. So, according to this technique along with the calculation of remainder, we will also find out the quotient and if both remainder and quotient of text matches with pattern then it is successful hit otherwise it is an unsuccessful hit or spurious hit and then we can remove the possibility of comparing the spurious hits. That means there is no extra computation of spurious hits if remainder and quotient are same then pattern found else pattern not found. Basically the algorithm is same as the original rabin karp algorithm, but with little modifications, which are shown in bold italic font. The algorithm goes as follows: IRK( T, P, d, q ) n ← length (T ) //text length m ← length ( P ) //pattern length h ← dm-1 mod q p ← 0 t0 ← 0 q_p ← 0 //quotient post hash calculation for pattern //quotient post hash calculation for portions of text of size m q_t ← 0 for i ← 1 to m //Preprocessing

do temp_p ← ( d*p + P[ i ] ) q_p ← temp_p / q p ← temp_p mod q temp_t ← ( d*t0 + T[ i ] ) q_t ← temp mod q t0 ← temp mod q

for s ← 0 to n – m // Matching //comparison only if quotient matches, removal of spurious hit do if p = ts && q_p = q_t then print “Pattern occurs with shift” s if s < n – m //quotient, post hash calculation of next m characters in text.

temp_t ← ( d * ( ts – T[ s + 1 ] * h ) + T[ s + m + 1 ] ) / q

q_t ← temp_t /q //subtracting LSB, Shifting and adding MSB then ts+1 ← ( d * ( ts – T[ s + 1 ] * h ) + T[ s + m + 1 ] )mod q ts= ts+1

X.I. Comparison using Graphs: The results of our experiments are depicted in the graphs below. In the first graphs we have represented the performance of the

Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457

www.ijcsit.com 5454

algorithms with a fixed text file size of 1MB. Y axis represents time in microseconds and X-axis represent the corresponding algorithms. Figure 7: comparison of algorithms with respect to 1 MB text file

Now we compare only between Rabin Karp and IRK algorithms with the same text file size of 1 MB, in figure 8. Figure 8: Comparison of Rabin Karp and IRK algorithms using file size 1MB.

Rabin Karp scores the running time of 100750 microseconds and IRK adjusted the running time within 95500 microseconds, both upon same 1 MB text file. Below is the graph which depicts the comparison between Rk and IRK algorithm using a 2MB file size. Also we’ve compared the algorithm upon 2 MB text file size, whose readings are as follows 260750 for Rabin Karp and 175250 for IRK algorithm.

Figure 9: depicts the comparison of Rabin Karp and IRK algorithms using file size 2MB.

X.II Example of IRK algorithm: T= ABBCABCA //text P= BCA //pattern q=13(say) d=256 (for character) Hash(P)= (66 * 2562 + 67 * 2561 + 65) mod q p = 0 // hash value for pattern q_p = 334045 //quotient

A B B C A B C A

hash(ABB) = 0 // same hash q_t0 = 328965 //but quotient different

A B B C A B C A

hash(BBC) = 1 q_t1 = 334026

A B B C A B C A

hash(BCA) = 0 q_t2 = 334045

A B B C A B C A

hash(CAB) = 7 q_t3 = 339047 // both matched

A B B C A B C A

hash(ABC) = 11 q_t1 = 328984

A B B C A B C A

hash(BCA) = 0 // hash matched q_t2 = 334045 // quotient matched

Since the hash =0 and quotient = 334045 both matched. Only the pattern BCA is matched. And hash(ABB) = 0 and quotient = 328965, which has not matched, ABB is not compared. X.3 Time Complexity In Best case doesn’t differ much from the original Rabin Karp algorithm, but the in average case complexity can be improved significantly. Due to imposing of constraint of matching the quotient post hashing as well as the hash value of the text portion of size m , reduction in comparison has been seen. Which reduces the time complexity during worst case from O((n-m+1)m) to O(nm+1). This time complexity is hugely depends on the selected prime number, q. So selecting the right prime number gives this algorithm a satisfiable optimization in terms of worst case time complexity.

XI. CONCLUSION AND FUTURE SCOPES This version of Rabin Karp algorithm can be used with Genetic Algorithm in order to search for a pattern in to huge text files of size >500MB. Implementation using GA can produce an improved version of this algorithm for more sophisticated use and can make the search even faster by using the genetic operators such as selection, mutation, crossover etc. Our Future scope lies among this thinking that it could be possible for us to implement this IRK algorithm using GA for optimize the pattern analysis. Further analysis and improvement of this algorithm is welcome from any scholars.

REFERENCES 1. Richard M. Karp, Michael O. Rabin, Efficient Randomized pattern-

matching algorithms, International Business Machine, 1987 2. Roberto Ierusalimschy, A Text Pattern-Matching Tool based on

Parsing Expression Grammars, 2008 3. Rajesh S., Prathima S., Reddy L.S.S., Unusual Pattern Detection in

DNA Database Using KMP Algorithm, International Journal of Computer Applications (0975 - 8887), Volume 1 – No. 22, 2010

4. Jiyeon Choi, Myka R. Ababon, Mai Soliman, Yong Lin, Linda M. Brzustowicz, Paul G. Matteson, James H. Millonig, Autism Associated Gene, ENGRAILED2, and Flanking Gene Levels Are Altered in Post-Mortem Cerebellum- PLOS ONE, 2014

5. Gupta, A.R., and State, M.W. (2007) Recent Advances in the Genetics of Autism. Biological Psychiatry 61, 429-437.

6. Donald E. Knuth, James H. Morris, Jr And Vaughan R. Pratt, FAST PATTERN MATCHING IN STRINGS, Vol. 6, No. 2, June 1977, SIAM J. COMPUT.

7. R. Boyer and J. Moore, A fast string searching algorithm”, CACM, 20, 10, 1977, pp.262-272.

Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457

www.ijcsit.com 5455

8. Christopher B. Kingsley, Identification of Causal Sequence Variants of Disease in the Next Generation Sequencing Era, Methods in Molecular Biology, Volume 700, 2011, pp 37-46.

9. Melissa Conrad Stoppler MD (2014, Jan 15). Genetic Diseases Overview [Online]. Available: http://www.medicinenet.com/genetic_disease/article.htm.

10. DNA Sequencing, Wikipedia, http://en.wikipedia.org/ wiki/Genetic_analysis#DNA_Sequencing

11 .Biological Databases, http://biotech.fyicenter.com /resource/Biological_databases.html

12. Michael T. Goodrich; Roberto Tamassia; David M. Mount, 2011. Data Structures and Algorithms in C++, Second Edition

13. Akhtar Rasool Amrita Tiwari et al, (IJCSIT) Vol. 3 (2) , 2012,3394 – 3397, International Journal of Computer Science and Information Technologies.

14. Sedgewick, Robert, 1984-Algorithms., ADDISON-WESLEY PUBLISHING COMPANY

15. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein et al. 2009, 3RD edition, Introduction to Algorithms, MIT Press.

16. Dan Gusfield. COMPUTER SCIENCE AND COMPUTATIONALBIOLOGY, University of California, Davis, 2007

17. Boyer-Moore Tutorial, The University of California, Davis, http://www.cs.ucdavis.edu/~gusfield/ cs224f11/ bnotes.pdf, 2007

Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457

www.ijcsit.com 5456

Ashish Prosad Gope et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5450-5457

www.ijcsit.com 5457

A Pattern Matching Algorithm for Codon Optimization and CpG MotifEngineering in DNA Expression Vectors.pdf

A Pattern Matching Algorithm for Codon Optimization and CpG Motif-

Engineering in DNA Expression Vectors

Ravi Vijaya Satya and Amar Mukherjee

School of Engineering and Computer Science

University of Central Florida

Orlando, FL 32816

rvijaya, amar@cs.ucf.edu

Udaykumar Ranga

Jawaharlal Nehru Center for Advanced

Scientific Research

Jakkur, Bangalore, India

udaykumar@jncasr.ac.in

Abstract

Codon optimization enhances the efficiency of DNA

expression vectors used in DNA vaccination and gene

therapy by increasing protein expression. Additionally,

certain nucleotide motifs have experimentally been

shown to be immuno-stimulatory while certain others

immuno-suppressive. In this paper, we present

algorithms to locate a given set of immuno-modulatory

motifs in the DNA expression vectors corresponding to a

given amino acid sequence and maximize or minimize

the number and the context of the immuno-modulatory

motifs in the DNA expression vectors. The main

contribution is to use multiple pattern matching

algorithms to synthesize a DNA sequence for a given

amino acid sequence and a graph theoretic approach for

finding the longest weighted path in a directed graph

that will maximize or minimize certain motifs. This is

achieved using O(n 2 ) time, where n is the length of the

amino acid sequence. Based on this, we develop a

software tool.

Key Words: Codon optimization, immuno-modulatory

motifs, multiple pattern matching, longest weighted path.

1. Introduction

DNA vaccines have revolutionized the field of vaccine

technology by demonstrating the ability to induce humoral

and cellular immune responses in experimental animals

and humans [9]. Immunization of animals with plasmid

DNA encoding a protein antigen was an accidental

observation that eventually led the way to a novel strategy

of immunization. DNA vaccines, also known as "naked

DNA" or "nucleic acid" vaccines, encode the antigens of

pathogenic organisms including viruses, bacteria, fungi

and parasites [40]. The protein antigen is processed

within the cell and presented by the MHC-I and –II

pathways thereby eliciting specific immune responses

essential for controlling pathogenic infections [2].

Although DNA vaccines have been successful in

generating strong immune responses in smaller animal

model such as mouse, they have not been as efficient in

larger species such as primates and humans [23, 3, 25, 4].

Stimulating both the arms of the immune system is often

desirable for efficient control of infectious diseases

especially in the larger animals. In the case of

recombinant protein vaccines, immune-enhancers

technically known as adjuvants, such as Freund’s

adjuvants and Alum, are in use to enhance antigen

specific immune responses. However, no such adjuvants

are available for use in the context of DNA vaccines. The

lack of suitable adjuvants for DNA vaccines is one

important reason for the poor performance of the DNA

vaccines in larger animals.

Nucleotide sequence encoding a foreign protein is

directly placed under the control of a mammalian

promoter to construct a DNA vaccine. Several amino

acids are encoded by more than one triplet codon and

different organisms have variable requirement for codon

preference [13]. Cloning of a wild type gene from a

parasite into a DNA vaccine often leads to insufficient

levels of protein synthesis in the host cell as a result of

codon bias between the species. Successful immunization

with DNA vaccines requires high expression of cloned

genes to synthesize large quantities of the foreign

protein. For instance, the overall genetic content of

Human Immunodeficiency Virus-1 is AT-rich, while that

of the human beings is CG-rich. The codon frequency of

the pathogenic DNA embedded into the mammalian

expression vector may not be optimal for adequate

protein expression in the host resulting in low level

protein expression. A potential solution for the codon

bias is to optimize the codon sequences of a gene to suit

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

the requirements of the host without altering the original

amino acid sequence of the protein [41, 16]. This

approach has been successful in eliciting strong immune

responses in several species of experimental animals [8,

28]. While immunization with synthetic genes, codon-

optimized for mammalian expression stimulated strong

immune response, immunization performed in parallel

with wild type genes generated low or moderate levels of

immune response [34, 36].

In addition to codon optimization of the synthetic

genes, a range of molecular approaches is being

evaluated to up-regulate immune responses generated by

DNA vaccines. Co-expression of cytokine genes [20], co-

stimulatory receptors [12, 35] or other

immunemodulators [33], synthetic assembly of T-helper

or CTL epitopes [6] and formulation with a variety of

chemical adjuvants [11, 24, 33, 37, 39] have been some

of the approaches reported. However, most of these

approaches may not be suitable for human application

due to toxic manifestations of the adjuvants. An ideal

agent used as an adjuvant for DNA vaccine must enhance

the immunogenicity without apparent cytotoxicity to the

host. Engineering CpG islands into DNA vaccines has

been one promising approach that showed enhanced

immune responses [19].

The well-established immune-enhancing property of

bacterial DNA has been mapped to sequence motifs

consisting of un-methylated CpG dinucleotides flanked

by base pairs in a specific context [18]. Two important

differences between the bacterial and mammalian DNA

enable the mammalian innate immune system to

recognize the former as a foreign component. CpG motifs

in bacteria are found at the expected 1:16 frequency;

however, their frequency in mammals is 4 times less than

expected. Bacterial CpG are non-methylated while those

of mammals are mostly methylated. The mammalian

immune system takes advantage of these two chemical

differences between the bacterial and mammalian DNA

to identify a bacterial infection and wage strong and

rapid anti-bacterial immune responses [29].

CpG-mediated activation of the mammalian innate

immune system has been extensively exploited in the

vaccination technology. Co-injection of CpG containing

oligonucleotides or empty vectors with protein antigens

elicited potent immune response to the antigen.

Methylation of the CpG motifs, on the other hand,

abrogated immune response to the antigens suggesting

that the CpG motifs possess adjuvant properties only

when unmethylated or hypomethylated [[5, 17]. Since the

DNA expression vectors used in genetic immunizations

are usually grown in bacteria, several CpG motifs on the

plasmids are not methylated possibly activating the

innate component of the mammalian immune system.

Presence of hypo-methylated CpG motifs is essential for

induction of immune responses to the antigens encoded

by the vectors.

While certain CpG motifs are immuno-stimulatory

(CpG-S), enhancing immune responses when the host is

vaccinated, others CpG motifs in a different nucleotide

context are immuno-inhibitory or -neutralizing (CpG-N)

abrogating antigen-specific immune response [18].

Engineering the nature and the frequency of the CpG

motifs may be critical for the design of DNA expression

vectors. DNA expression vectors are primarily used for

two different applications, expression of a foreign gene in

a host for vaccination or for correcting a genetic defect of

the host [26].

Although the basic design of these two types of vectors

is identical in several respects, their requirement for the

presence and nature of CpG motifs is diagonally

opposite. While genetic vaccines require CpG-S motifs

for efficient stimulation of the host immune system, such

a strong response must be avoided for long-term survival

of the DNA expression vector intended for gene therapy.

An ideal strategy for designing DNA vaccines must

recruit as many CpG-S motifs as possible, concomitantly

eliminating as many CpG-N motifs as possible without

altering the original protein sequence of the genes. In

contrast, design of a DNA expression vector intended for

gene therapy must eliminate as many CpG-S motifs as

possible and recruit as many CpG-N motifs as possible

for the best result. Thus, depending on the application,

some CpG motifs may be desirable, while certain others

may be undesirable.

The software tool we report here is designed to help

the researcher to engineer the composition of a gene with

respect to codon usage and CpG motifs without

modifying the original amino acid sequence of the

protein. Codon optimization is advantageous for protein

expression in a heterologous expression system such as

E.coli, Picchia, Saccharomyces, Baculo virus,

mammalian cells etc. The software could also engineer

the content and nature of the CpG motifs recruited into a

DNA expression vector in addition to optimizing the

codon usage and resolve potential conflicts arising

between these two requirements and find an optimal

nucleotide sequence.

2. Prior Work and Summary of

Contributions

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

Objective of this algorithm is to (1) identify the best

triplet codon for codon optimization and (2) engineer the

nature and content of the CpG motifs of a gene expressed

from a genetic vector. Input for the software is the amino

acid sequence of a gene of interest or the nucleotide

sequence that needs genetic modification. The input

amino acid sequence is reverse-translated to generate

nucleotide sequence that is optimized for the codon and

CpG content. During the process of reverse-translation,

there is a one-to-many relationship between the amino

acid sequence and the DNA sequences. A given amino

acid sequence could correspond to an exponential

number of DNA sequences with respect to the length of

the given amino acid sequence. Our task here is to choose

the particular sequence that has the desirable properties

(maximum number of the desirable motifs, and the

minimum number of the undesirable motifs) from this

large number of possible DNA sequences. A brute force

search on all the possible DNA sequences will take

exponential time. The motifs (or the patterns) that need

optimization are small DNA sequences, generally not

exceeding eight nucleotides in length. If different motifs

have to be optimized simultaneously, it is possible to

have multiple motif occurrences (both desirable and

undesirable) sharing the same position in the amino acid

sequence. In such cases it may not be possible to have all

the motif occurrences in the same DNA sequence. This

problem will be explained in detail in Section 4.

McInerney[27] presented a program to perform the

codon usage analysis on a sequence or a database of

sequences. Other work in this area of bioinformatics is

mostly on finding CpG islands within the genes and

transcriptional elements that regulate gene expression.

Ponger et al [30] developed a software package to locate

CpG islands associated with the transcription start sites

of the genes. Lin et al [22] presented software for

locating non-overlapping maximum average segments in

a given sequence. These publications analyzed DNA

sequences and searched for nucleotide motifs with high

CG content.

Our emphasis in this paper is to synthesize a DNA

sequence that codes for a functional protein, with certain

characteristics and optimization parameters. The main

contribution is to use multiple pattern-matching

algorithms to synthesize a DNA sequence for a given

amino acid sequence and a graph theoretic approach for

finding the longest weighted path in a directed graph that

will maximize or minimize certain motifs as well as

guarantee certain fitness factors of codon frequency usage

for a particular species. This is achieved using O(n 2 ) time

and storage resources compared to the brute force

algorithm that might take exponential amount of

resources, where n is the length of the amino acid

sequence. The software tools developed for the purpose

will find applications in the rapid development of

vaccines and gene therapy.

In Section 3, we introduce the basic terminology. In

Section 4, we give a precise combinatorial formulation of

the problem in terms of pattern matching operations and

present the main algorithm. Section 5 gives the

complexity analysis. Section 6 gives a description of the

software and in Section 7 we present the results.

3. Terms and definitions

The alphabet of a DNA sequence is denoted as ΣN =

{A,C,G,T} 1 where A,C,G,T are the four DNA molecules

adenine, cytosine, guanine, and thymine, respectively.

The alphabet for the amino acid sequence is ΣA =

(A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,stop}, each

letter indicating one of the 20 amino acids, and stop

indicating the stop codon. For example, ‘A’ indicates

Alanine, ‘R’ indicates Arginine, ‘N’ indicates

Asparagine, etc. A codon c is a triplet of the DNA

symbols in ΣN. As the size of DNA alphabet is 4, there

are 64 possible codons. In the context of amino acid

translation from mRNA, each element of ΣA gets mapped

to a maximum of 6 codons 2 . This mapping is also

referred to as the genetic code. Symbolically, the

mapping of an amino acid σ ∈ ΣA can be denoted as a set

C(σ) of n σ codons, where 1 ≤ nσ ≤ 6. The inverse-

mapping associates an amino acid σ with a codon c,

denoted by C -1

(c) = σ. The frequency of occurrence of

members of C(σ) is different for each species. This

relative difference in the frequency of occurrence for each

codon is called codon bias. The codon usage table for a

species gives the codon bias information for that species 3 .

For an amino acid σ, the codon bias (or the frequency

information) is given by the set of fractions F(σ),

corresponding to C(σ). For each fi ∈ F(σ), 1 ≤ i≤ nσ, 0 ≤

fi≤ 1, and 1 1

=∑ =

σ n

i

i f . When designing a DNA vaccine

for a species, it is often desirable to have those codons

that occur more frequently in that species. Therefore, for

each amino acid we generally choose the most frequent

1 The protein synthesis takes place from mRNA. The actual RNA

alphabet is {A,C,G,U}, in which thymine is replaced by uracil. For

simplicity of notation, we use T instead of U in this paper.

2 See http://molbio.info.nih.gov/molbio/gcode.html

3 See http://www.kazusa.or.jp/codon/ for codon usage tables for different

species.

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

codon, henceforth referred to as the best codon. For an

amino acid σ, the index of the best codon, b σ , is defined

as b σ = {i | fi = MAX(F(σ))}. If there are more than one

values for b σ ,, we take one arbitrarily. It is also useful to

have a measure of how desirable a codon is. This

quantity is given by the fitness of the codon. The fitness

of the ith codon, ci of an amino acid σ is given by:

fitness(ci) = σ b

i

f f

, the ratio of the frequency of the ith

codon of σ to the frequency of the best codon of σ. Given

a list of codons, the average fitness of the list of codons

is defined as the arithmetic mean of the fitness values of

all codons in the list.

A motif is a short DNA sequence of length l, denoted

as M = m1m2m3……ml, mi ∈ ΣN,1≤ i ≤ l. We will

consider immuno stimulatory and immune neutralizing

(CPG-S and CPG-N) sequences as possible motifs. A

motif occurs as part of a long DNA sequence and as such

will correspond to a sequence of codons. Since the

beginning of the codon boundary could be aligned with

any one of the first, second, or third positions in the

motif, this gives rise to three possible codon encodings

for the motif. In each codon encoding, the codons that lie

completely within the motif can be translated to an amino

acid sequence. These amino acid sequences will be

referred to as amino acid search patterns (henceforth

referred to as aasp). The parts of the motif (less than

three nucleotide beginning or the end of the motif that do

not completely correspond to a codon are referred to as

the head or tail, respectively, of the motif. When l < 5,

all or some of M 1 , M

2 , M

3 may be null. The

corresponding amino acid search patterns, aasp(M 1 ),

aasp(M 2 ), aasp(M

3 ) will also be empty strings, but their

head and/or tail portions will be non-null. This implies

that, for such an aasp, every position in the amino acid

sequence is a possible occurrence of the motif. The three

possible codon encodings of a motif M and their

corresponding three amino acid search patterns are

shown in Table 1.

As an example, if M = ATCGAT, M 1 = ATCGAT,

aasp(M 1 ) = ID, head(M

1 ) = null, tail(M

1 ) = null; M

2 =

TCG, aasp(M 2 ) = S, head(M

2 ) = A, tail(M

2 ) = AT ; and

M 3 = CGA, aasp(M

3 ) = R, head(M

3 ) = AT, tail(M

3 ) =

AT. For example, searching for the motif M = ATCGAT

in the amino acid sequence …LSI…, we find S =

assp(M 2 ). head(M

2 ) = A and tail(M

2 ) = AT. L, the amino

acid that precedes S, has two codons that end with A

(head(M 2 )), and I, the amino acid that follows S, has

three codons that begin with AT (tail(M 2 )), as shown

below.

L S I

CTA

TTA

TCG ATA

ATC

ATT

Selecting any codon from the first column and any

codon from the third column will result in an occurrence

of ATCGAT. Therefore, there are 2*3 = 6 unique

combination of codons that result in the occurrence of

ATCGAT at the location in the amino acid sequence

beginning with L and ending with I. These will be

referred to as the possible occurrences of the motif. The

codons corresponding to an occurrence will be referred to

Table 1. Head and tail of amino acid search patterns

  )()(

3*3/321

1

l mmmmM LL=

aasp(M 1 ) = a1a2……  3/la ,

where a1 = C -1

(m1m2m3),etc.

nullMhead =)( 1

nullMtail =)( 1

if 03mod =l

l mMtail =)(

1 if 13mod =l

ll mmMtail

1

1 )(

= if 23mod =l

  )()(

13*3/)1(432

2

+− =

l mmmmM LL

aasp(M 2 ) = a1a2……  3/)1( −la ,

where a1 = C -1

(m1m2m3), etc.

1

2 )( mMhead = nullMtail =)(

2 if 13mod =l

l mMtail =)(

2 if 23mod =l

ll mmMtail

1

2 )(

= if 03mod =l

  )()(

23*3/)2(543

3

+− =

l mmmmM LL

aasp(M 3 ) = a1a2……  3/)2( −la ,

where a1 = C -1

(m3m4m5),etc.

21

3 )( mmMhead = nullMtail =)(

3 if 23mod =l

l mMtail =)(

3 if 03mod =l

ll mmMtail

1

3 )(

= if 13mod =l

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

as constituent codons for that occurrence. We call the

collection of all possible occurrences of all motifs as the

potential occurrences, as all of these occurrences might

not be retainable in the final solution. As noted in

Section 1, some of these occurrences may be desirable

while others may be undesirable; we assign a desirability

value of +1 and –1, respectively , for these occurrences.

In general, these values could be arbitrary.

An additional step of pre-processing is required for

undesirable motifs. We want to avoid occurrence of an

undesirable motif. However, we need to make sure that

the undesirable motifs do not re-appear in the final

solution as a result of selecting the best codons for amino

acids that are not part of a desirable motif. We achieve

this by considering all alternate codon combinations of

each potential occurrence of an undesirable motif, and

treating them as potential occurrences with a desirability

value of zero. We call these alternate potential

occurrences as neutral occurrences. In the above

example, if ATCGAT were an undesirable motif, out of a

total of 6*6*3 = 108 possible encodings for the

subsequence LSI, 108-6 = 102 codon combinations do not

result in the occurrence of ATCGAT. Each one of these

102 combinations will be registered as a neutral

occurrence.

The weight of an occurrence is the sum of the

average fitness of the constituent codons and the

desirability for that motif. The total weight of a DNA

sequence is the sum of the individual weights of all the

occurrences from the given set of motifs that are part of

the DNA sequence.

4. The Algorithm

4.1 Problem Definition

Given an amino acid sequence SA = a1a2a3…………aJ,

and a list of motifs M = {M1,M2,M3, ……,Mk}, with their

corresponding desirability values, D = {D1, D2, D3, ...

..., Dk}, the problem is to find the codon sequence, SC =

c1c2c3…………cJ, where each ci ∈ C(ai), 1≤ i ≤ J, and

the DNA sequence, SD = d1d2d3…………d3*J,

corresponding one-to-one to SC, where ci = d3i-2d3i-1d3i,

such that: 1)SD has the maximum total weight possible;

and 2)Every codon ci in SC that is not part of an

occurrence (desirable, undesirable or neutral) of a motif

in M is the best codon for the corresponding amino acid

ai. In other words, we need to find the exact mapping of

the amino acid sequence into a DNA sequence that has

the maximum weighted occurrences of the motifs in M.

We find all possible occurrences of all the motifs,

including neutral occurrences for the undesirable motifs.

Out of these, if two occurrences (of the same motif or of

different motifs) extend over the same position in SA, and

require different codons for that position, then the two

occurrences have a conflict and cannot co-exist. We have

to select the largest weighted subset of these occurrences

that are mutually non-conflicting.

4.2 The Algorithm Steps

Step1: Pre-processing:

A)Form a list Lp of search patterns consisting of all

three search patterns aasp(Mi 1 ), aasp(Mi

2 ) and aasp(Mi

3 )

for each motif Mi ∈ M, 1 ≤ i ≤ k, in the list of motifs.

Also, enumerate the head and tail for all three

alignments of each motif. Each entry in Lp is an element

of a data structure having the following members:

• codon_list, the list of codons that lie entirely

within the motif.

• search_pattern, the aasp corresponding to the

codon_list.

• length, the size of the codon_list. (note: length

may be 0 for short motifs).

• head, the corresponding head – a DNA sequence

which is either null, or 1 or 2 characters in length.

• tail, the corresponding tail – a DNA sequence

which is either null or 1 or 2 characters in length.

• desirability, the desirability value for the

corresponding motif, +1 or –1.

B) Build an Aho-Corasick keyword tree [1] ACT for

all the amino acid search patterns.

• A node in the keyword tree might correspond to

multiple search patterns with different heads and

tails, henceforth referred to as the

list_of_matches, denoted by the indices into Lp.

• When the length of a motif is less than five

nucleotides, some amino acid search patterns

might be empty strings. i.e., they will have non-

empty head and/or tail, but empty search pattern.

In the keyword tree, the root itself will correspond

to such empty amino acid search patterns.

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

Step2: Finding all potential occurrences: In this

step, we build the list of potential occurrences, P. Each

entry in P will have the following attributes:

• pBegin is the position in SA at which the match

starts.

• pEnd, the position in SA at which the match ends.

• A list of codons Lc of size pEnd-pBegin+1. The

codon at position ‘r’ in Lc indicates rth codon in

the match.

• The weight of the occurrence. We will generate

the list P in sorted order in ascending values of

pBegin; if two occurrences have same pBegin

value, they are sorted in ascending value of pEnd.

The following algorithm traverses the tree

according to the given amino acid sequence.

Algorithm 4.2.1 find_potential_occurrences

Inputs: The Aho-Corasick keyword tree ACT, the amino

acid sequence SA, list of search patterns Lp

Output: The list of potential occurrences, P, sorted.

Initialize P ← null, cur_node ← ACT.root

for q=1, until q ≤ J, traverse ACT:

if cur_node.child[aq] != null

cur_node ← cur_node.child[aq]

else

cur_node ← cur_node.failure_link

if cur_node.list_of_matches is not empty,

/*Corresponding aasp has occurred. Check if the motif

corresponding to the aasp can actually occur. The head

and tail are compared with all the codons of the amino

acids on either side aasp*/

for each entry i in the list, do the following:

a. Enumerate position p, the starting position of

Lp(i).search_pattern in SA.

b. pBegin ← p, pEnd ← q

c. if (Lp(i).head ≠ null) /* head is not null */

i. pBegin ← pBegin-1

ii. form a set of codons Chead such that

Lp(i).head is a suffix of each chead ∈

Chead, Chead ⊆ C(ap-1)

d. if (Lp(i).tail ≠ null) /* tail is not null */

i. pEnd ← pEnd+1

ii. form a set of codons Ctail such that

Lp(i).tail is a suffix of each ctail ∈ Ctail,

Chead ⊆ C(aq+1)

e. for each unique combination of chead ∈ Chead and

ctail ∈ Ctail, insert an entry to the list of potential

occurrences, P, with the entries pBegin, pEnd,

Lp(i).codon_list, and the corresponding weight

of the occurrence. Note: Chead and/or Ctail will be

empty if Lp(i).head and/or Lp(i).tail are null.

if a non-zero number of occurrences were added in

the above step, and if Lp(i).desirability < 0, insert an

entry in P for each unique combination of codons

selecting one codon each from the sets

C(SA(pBegin)), ……, C(SA(pEnd)). i.e, insert all

neutral occurrences corresponding to the current

undesirable occurrence into P.

Example: Let us assume we are trying to maximize the

number of occurrences of the motif ATCGAT in the

amino acid sequence MEPRVID. The three amino acid

search patterns, aasp(M 1 ), aasp(M

2 ) and aasp(M

3 ) are

ID, S, and R, respectively. Searching for them, we find R

at position 4, and ID at positions 6-7. R is aasp(M 3 ),

head(M 3 ) is AT, tail(M

3 ) is T, therefore we have to see if

any codons ending with AT can occur at position 3 in

MEPRVID, and if any codons beginning with T can occur

at position 5. The amino acid at position 3 is P, which

has the codons [AGA, AGG, CGA, CGC, CGG and

CGT], none of which end with AT, therefore it is not

possible for ATCGAT to occur from positions 3-5.

Coming to the next match, ID at positions 6-7, ID

=aasp(M 1 ). head(M

1 ) = null, tail(M

1 ) is null, therefore

ID at positions 6-7 can be added to the list of matches.

The corresponding entry in P will be: pBegin ← 6, pEnd

← 7, list of codons, LC ← ATC,GAT.

Step3: Mapping to a directed acyclic graph: In this

step, we find a non-conflicting subset of the occurrences

in P that has the maximum over-all weight. This can be

done by mapping the problem to that of finding the

critical path in a directed acyclic graph and finding the

nodes in the critical path, as follows:

1. Each occurrence in P is represented by a node v

in a weighted directed acyclic graph G = (V,E), where V

and E denote the vertex and the directed edge set of G.

The node weight wi of a node vi, is given by the weight of

the corresponding occurrence, P(i). i.e, wi ← P(i).weight,

1 ≤ i ≤ |P|, wi ∈ W, where W is the weight vector

corresponding to P.

2. Start with an empty edge list E. For each i, 1≤ i

≤ |P|, and j, i < j ≤ |P|, if P(i).pEnd < P(j).pBegin or if

P(i) and P(j) have the same codon encodings for all

positions r, P(j).pBegin ≤ r ≤ min(P(i).pEnd, P(j).pEnd),

then vi, vj will have an edge between them. i.e., E ← E∪

(vi,vj).

Now, we can solve for the longest weighted path

(critical path) in G, and insert the nodes in the longest

path in a list Lv. The longest weighted path can be found

by the critical path algorithm [7].

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

4.3 Translation of the amino acid sequence to a

DNA sequence

Once we have a set of occurrences of motifs that do

not conflict with each other, the next step is to do the

actual mapping from the amino acid sequence to a codon

(or DNA) sequence. At each position in SA that is not part

of any occurrence, we select the best codon for the

corresponding amino acid. At positions in SA that are

part of an occurrence, we select the codon that is required

for the occurrence.

Let Lv be the final list of occurrences.

1. for each ai ∈ SA, 1 ≤ i ≤ J,

ci ← best codon (C(ai))

2. for all i, 1 ≤ i ≤ |Lv|,

for all j, Lv(i).pBegin ≤ j ≤ Lv(i).pEnd, /* j is

between pBegin and pEnd for the occurrence Lv(i) */

cj ← Lv(i). c

L [j-pBegin] /*the codon sequence

for the occurrence of Lv(i) */

At the end of the second step, we will have the output

codon sequence, SC = c1c2c3……cJ. This can be converted

into the output DNA sequence SD by replacing each

codon with the corresponding DNA triplet.

5. Complexity Analysis

Step1-Pre-processing: Enumerating the amino acid

search patterns takes constant time for each motif

(assuming that the upper bound for the length of each

motif is a constant). There are k motifs, therefore, this

step takes O(k) time. Building the search tree is linear,

therefore takes O(k) time.

Step2- Enumeration of all potential occurrences:

Finding the potential occurrences O(L+3k+|P|),

according to the Aho-Corasick Algorithm. The actual

number of matches considered is greater than |P|, but still

in O(|P|). As the length of each motif is bound by a

constant, so is the number of all possible codon

combinations for each motif. Therefore, |P| is in O(L).

Inserting an entry into P takes log|P| time, as P is a

sorted list. Therefore, the overall complexity for this step

is |P|log|P|.

Step3-Mapping P to a directed acyclic graph (DAG):

a. Constructing the graph – involves checking the

upper triangle of the adjacency matrix of the graph -

takes O(|P| .(|P| -1)/2) ~ O(|P| 2 ) time.

b. Finding the critical path takes O(|V| + |E|) time.

Here, |V| ← |P|. The graph is expected to be dense in

general, therefore, |E| ← |P| 2 . Therefore, this part

takes O(|P| 2 ) time.

As the number of potential occurrences found, |P|, is

in O(J), the overall complexity of the algorithm is O(J 2 ).

6. Software

The software is available both as a console program,

and as a windows application with a GUI. The program

requires four input files: a file containing the input amino

acid sequence, a file containing the codon bias

information, and a file each for the desirable and

undesirable motifs. The desirability values can be

supplied in the input files. If desirability values are not

supplied in the input files, a default value of +1 is

assigned for desirable motifs, and –1 for undesirable

motifs. Additionally, it is made sure that each

occurrence of an undesirable motif has a weight less than

zero, by setting it to a small –ve value if it is zero. The

amino acid sequence file should be in the FASTA format,

containing a single sequence. The output is in ASCII text

format, listing different solutions with varying degrees of

codon optimization. The GUI writes the formatted results

to an html file. The sample out put of the program is

shown in figure-3.

7. Results

We tested the validity of the program on different

proteins. The results are shown in Table 2. These test

cases were run with list of 33 desirable motifs and 5

undesirable motifs. Each motif was any where between 3

to 8 nucleotides long. In the sequences that were tested,

all the occurrences of undesirable motifs could be

eliminated. However, in some cases, it might be

impossible to eliminate all occurrences of undesirable

motifs.

Table 2. Results - CpG-S and CpG-N motifs

Amino

acid

sequence

(length)

CpG-S

motfs

found

CpG-S

motfs

retained

CpG-N

motifs

found

CpG-N

motifs

eliminate

d

Tat

(72)

80 7 66 66

C3d

(302)

224 30 273 273

M23809

(226)

164 24 209 209

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

Table 3 shows the average codon fitness, which is a

measure of the degree of codon optimization. It can be

observed that the average fitness could be improved by

approximately 30% for all amino acid sequences tested.

Table 3. Results – average fitness

Amino

acid

sequence

(length)

Wild

type

Maximizin

g CpG-S

motifs

Minimizin

g CpG-N

motifs

Both

max.

and

min.

Tat

(72)

0.671 0.948 0.981 0.930

C3d

(302)

0.672 0.935 0.985 0.922

M23809

(226)

0.654 0.950 0.980 0.937

The analysis of one of these proteins, a 72 amino acid

regulatory protein, Tat, of the Human Immunodeficiency

Virus type-1 (HIV-1), is presented here. The Tat protein

of HIV-1 plays critical role in regulating viral gene

expression [15]. Additionally, Tat also modulated several

functions of the host and a wide range of pathogenic

properties has been ascribed to this viral protein.

Considering the pathogenic significance of Tat, this viral

antigen is being used as an important component of the

viral vaccines, to elicit cell-mediated and humoral

immune responses [31].

One important limitation of using the wild type Tat

gene directly as a vaccine component without any

modification is the sub-optimal codon content of the viral

gene for mammalian immunization. Selective use of

specific codons for protein translation is a characteristic

feature of several species, a phenomenon called codon

bias. Recent approaches of enhancing the immune

responses against antigens have laid emphasis on protein

translation efficiency, which is a function of the base

composition of the pathogen. Codon usage of the Tat

protein of HIV-1 is strikingly different from that of the

human genes. The AT content of HIV-1 Tat is

approximately 54% while that of the host human genome

is about 35%, relatively low in AT content. Sub-optimal

codon use could result in compromised protein

translation efficiency in a mammalian context and may

reflect in the generation of a poor immune response

against the AT-rich gene.

The characteristic features of codon use of the Tat

gene in the virus are depicted in figure-2. The 72 amino

acids encoding Tat exon-1 consisting of 219 bp represent

the consensus sequence of the HIV-1 subtype-C strains.

We used the 72 amino acid sequence (and the stop

codon) as an input for optimizing the nucleotide

sequence for mammalian expression. Alternatively, it is

also possible to use the wild type nucleotide sequence of

the gene directly obtained from the virus as the input.

The results (the occurrences of desirable and undesirable

motifs) were manually verified.

Using the software, we identified a total of 37 codons

that could be replaced by the corresponding best codons

to optimize the codon content of this gene to match that

of the mammalian genome. Of note, optimization of the

codons is fairly a simple task, provided the objective is

only to replace the sub-optimal codons with the best ones,

as is the case with several of the substitutions. For

instance, serine at the position 16 in the viral gene is

encoded by the codon AGT that has a fitness value of

only 0.62, the lowest of all the 6 codons available for this

amino acid (Figure 2). Substituting AGT at this position

with AGC (fitness value 1.0) is likely to improve the

average fitness of the viral gene for mammalian protein

expression.

The algorithm identified seven potential motifs where

CpG-S motifs could be engineered into the gene. One of

these motifs, comprising of the amino acid residues S57

and A58 (AGC GCT), is naturally present in the viral

gene (Figure 2). However, a possible conflict was

identified between the possible codon optimization of

these amino acid residues and not disrupting the CpG-S

motif in this sequence. The codon GCT (fitness 0.67)

encoding for alanine could have been substituted by the

best codon GCC (fitness 1.0) for improved average

fitness of the gene. Importantly, use of GCT in the gene

design allowed recruiting a potential CpG-S island that

could have far more beneficial effect on immunization

potential of the DNA vaccine. The average fitness of this

codon combination (AGC GCT) 0.835 ((1.0 + 0.67) / 2),

as opposed to that of best codon combination 1.0, may

not significantly modulate the protein translation

efficiency.

The algorithm also identified a second site (L35 and

V36) in the viral gene where substituting the original

viral codons CTA GTT (average fitness 0.275) with the

best codons CTG GTG (average fitness 1.0) would

improve the overall fitness of the gene. In contrast,

replacing the original codons with others, CTC GTA

(fitness 0.36), would permit recruiting a putative CpG-S

motif into the gene, though at the expense of average

fitness of the gene (Figure 2). At the moment, it is not

clear how to resolve a conflict that may arise between

optimizing the codon content of a gene and recruiting

CpG-S motifs into the sequence. Most often, the codons

useful in CpG motif design are not the ones with the

highest fitness value thus leading to a potential

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

disagreement between the diagonally juxtaposed

requirements.

Optimization of Tat sequence finally resulted in a

nucleotide sequence that was modified in a total of 37

codons (Figure 2). The AT content of the gene changed

from 54% of the wildtype gene to 43% in the codon-

optimized version thus resembling closely that of the

mammalian context. The overall fitness value of the

codon-optimized gene also improved from 0.67 to 0.93

suggesting a possible improvement at the level of protein

translation.

Using the algorithm, we designed several variants of

the Tat gene that differed from one another in the

number of CpG islands engineered and the extent of

codon content optimized. Experiments are presently

underway to evaluate the antigenic potential of these Tat

expression vectors in different mouse strains.

Acknowledgements

U. R. acknowledges financial support from the

Department of Biotechnology, Government of India.

References

[1] Aho A. and Corasick M: Efficient string matching: an aid

to bibliographic search. Comm. CAN, 18:333-40,1975.

[2] Bagarazzi ML, Boyer JD, Ayyavoo V, Weiner DB: Nucleic

acid-based vaccines as an approach to immunization against

human immunodeficiency virus type-1. Curr Top Microbiol

Immunol 226:107-143 (1998).

[3] Boyer JD, Cohen AD, Vogt S, Schumann K, Nath B, Ahn

L, Lacy K, Bagarazzi ML, Higgins TJ, Baine Y, Ciccarelli RB,

Ginsberg RS, MacGregor RR, Weiner DB: Vaccination of

Seronegative Volunteers with a Human Immunodeficiency

Virus Type 1 env/rev DNA Vaccine Induces Antigen-Specific

Proliferation and Lymphocyte Production of beta-Chemokines.

J Infect Dis 181:476-483 (2000).

[4] Calarota S, Bratt G, Nordlund S, Hinkula J, Leandersson

AC, Sandstrom E, Wahren B: Cellular cytotoxic response

induced by DNA vaccination in HIV-1-infected patients. Lancet

351:1320-1325 (1998).

[5] Chen Y, Lenert P, Weeratna R, McCluskie M, Wu T,

Davis HL, Krieg AM: Identification of methylated CpG motifs

as inhibitors of the immune stimulatory CpG motifs. Gene Ther

8:1024-1032 (2001).

[6] Ciernik IF, Berzofsky JA, Carbone DP: Induction of

cytotoxic T lymphocytes and antitumor immunity with DNA

vaccines expressing single T cell epitopes. J Immunol

156:2369-2375 (1996).

[7] Dolan A., Aldous J.: Networks and algorithms, John Wiley

and Sons, New York, NY, 1993. pp. 366-386.

[8] Egan MA, Charini WA, Kuroda MJ, Schmitz JE, Racz P,

Tenner-Racz K, Manson K, Wyand M, Lifton MA, Nickerson

CE, Fu T, Shiver JW, Letvin NL: Simian immunodeficiency

virus (SIV) gag DNA-vaccinated rhesus monkeys develop

secondary cytotoxic T-lymphocyte responses and control viral

replication after pathogenic SIV infection. J Virol 74:7485-

7495 (2000).

[9] Gurunathan S, Klinman DM, Seder RA: DNA vaccines:

immunology, application, and optimization*. Annu Rev

Immunol 18:927-974 (2000).

[10] Gusfield D: Algorithms on Strings Trees and Sequences.

Cambridge Univeristy Press, 1997.

[11] Hamajima K, Sasaki S, Fukushima J, Kaneko T, Xin KQ,

Kudoh I, Okuda K: Intranasal administration of HIV-DNA

vaccine formulated with a polymer, carboxymethylcellulose,

augments mucosal antibody production and cell- mediated

immune response. Clin Immunol Immunopathol 88:205-210

(1998).

[12] Ihata A, Watabe S, Sasaki S, Shirai A, Fukushima J,

Hamajima K, Inoue J, Okuda K: Immunomodulatory effect of a

plasmid expressing CD40 ligand on DNA vaccination against

human immunodeficiency virus type-1 [In Process Citation].

Immunology 98:436-442 (1999).

[13] Ikemura T: Codon usage and tRNA content in unicellular

and multicellular organisms. Mol Biol Evol 2:13-34 (1985).

[14] Jareborg N, Durbin R, ‘Alfresco--a workbench for

comparative genomic sequence analysis’, Genome Res 2000

Aug;10(8):1148-57.

[15] Jeang,K.T., H.Xiao and E.A.Rich: Multifaceted activities

of the HIV-1 transactivator of transcription, Tat. J Biol Chem

Dev. 274, 28837-28840 (1999).

[16] Kim CH, Oh Y, Lee TH: Codon optimization for high-

level expression of human erythropoietin (EPO) in mammalian

cells. Gene 199:293-301 (1997).

[17] Klinman DM, Yamshchikov G, Ishigatsubo Y:

Contribution of CpG motifs to the immunogenicity of DNA

vaccines. J Immunol 158:3635-3639 (1997).

[18] Krieg AM: CpG motifs in bacterial DNA and their

immune effects. Annu Rev Immunol 20:709-760 (2002).

[19] Krieg AM, Yi AK, Schorr J, Davis HL: The role of CpG

dinucleotides in DNA vaccines. Trends Microbiol 6:23-27

(1998).

[20] Lee AH, Suh YS, Sung YC: DNA inoculations with HIV-1

recombinant genomes that express cytokine genes enhance

HIV-1 specific immune responses [In Process Citation].

Vaccine 17:473-479 (1999).

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

[21] Lee MK, Lynch ED, King MC: SeqHelp: a program to

analyze molecular sequences utilizing common computational

resources. Genome Res 1998 Mar;8(3):306-12.

[22] Lin YL, Huang X, Jiang T, Chao KM: MAVG: locating

non-overlapping maximum average segments in a given

sequence. Bioinformatics 2003 Jan;19(1):151-152.

[23] Liu MA, McClements W, Ulmer JB, Shiver J, Donnelly J:

Immunization of non-human primates with DNA vaccines.

Vaccine 15:909-912 (1997).

[24] Lodmell DL, Ray NB, Ulrich JT, Ewalt LC: DNA

vaccination of mice against rabies virus: effects of the route of

vaccination and the adjuvant monophosphoryl lipid A

(MPL(R)). Vaccine 18:1059-1066 (2000).

[25] MacGregor RR, Boyer JD, Ugen KE, Lacy KE, Gluckman

SJ, Bagarazzi ML, Chattergoon MA, Baine Y, Higgins TJ,

Ciccarelli RB, Coney LR, Ginsberg RS, Weiner DB: First

human trial of a DNA-based vaccine for treatment of human

immunodeficiency virus type 1 infection: safety and host

response. J Infect Dis 178:92-100 (1998).

[26] Manthorpe M, Cornefert-Jensen F, Hartikka J, Felgner J,

Rundell A, Margalith M, Dwarki V: Gene therapy by

intramuscular injection of plasmid DNA: studies on firefly

luciferase gene expression in mice. Hum Gene Ther 4:419-431

(1993).

[27] McInerney JO: GCUA: general codon usage analysis.

Bioinformatics 1998;14(4):372-3.

[28] Nagata T, Uchijima M, Yoshida A, Kawashima M, Koide

Y: Codon optimization effect on translational efficiency of

DNA vaccine in mammalian cells: analysis of plasmid DNA

encoding a CTL epitope derived from microorganisms.

Biochem Biophys Res Commun 261:445-451 (1999).

[29] Ozinsky A, Underhill DM, Fontenot JD, Hajjar AM,

Smith KD, Wilson CB, Schroeder L, Aderem A: The repertoire

for pattern recognition of pathogens by the innate immune

system is defined by cooperation between Toll-like receptors.

Proc Natl Acad Sci U S A 97:13766-13771 (2000).

[30] Ponger L, Mouchiroud D: CpGProD: identifying CpG

islands associated with transcription start sites in large genomic

mammalian sequences. Bioinformatics 2002 Apr;18(4):631-3.

[31] Rusnati,M. and M.Presta: HIV-1 Tat protein: a target for

the development of anti-AIDS therapies. Drugs Fut. Dev. 27,

481-493 (2002).

[32] Sasaki S, Fukushima J, Hamajima K, Ishii N, Tsuji T, Xin

KQ, Mohri H, Okuda K: Adjuvant effect of Ubenimex on a

DNA vaccine for HIV-1 [published erratum appears in Clin

Exp Immunol 1998 May;112(2):354]. Clin Exp Immunol

111:30-35 (1998a).

[33] Sasaki S, Sumino K, Hamajima K, Fukushima J, Ishii N,

Kawamoto S, Mohri H, Kensil CR, Okuda K: Induction of

systemic and mucosal immune responses to human

immunodeficiency virus type 1 by a DNA vaccine formulated

with QS-21 saponin adjuvant via intramuscular and intranasal

routes. J Virol 72:4931-4939 (1998b).

[34] Stratford R, Douce G, Zhang-Barber L, Fairweather N,

Eskola J, Dougan G: Influence of codon usage on the

immunogenicity of a DNA vaccine against tetanus. Vaccine

19:810-815 (2000).

[35] Tsuji T, Hamajima K, Ishii N, Aoki I, Fukushima J, Xin

KQ, Kawamoto S, Sasaki S, Matsunaga K, Ishigatsubo Y, Tani

K, Okubo T, Okuda K: Immunomodulatory effects of a plasmid

expressing B7-2 on human immunodeficiency virus-1-specific

cell-mediated immunity induced by a plasmid encoding the

viral antigen. Eur J Immunol 27:782-787 (1997).

[36] Uchijima M, Yoshida A, Nagata T, Koide Y: Optimization

of codon usage of plasmid DNA vaccine is required for the

effective MHC class I-restricted T cell responses against an

intracellular bacterium. J Immunol 161:5594-5599 (1998).

[37] Verschoor EJ, Mooij P, Oostermeijer H, van der Kolk M,

ten Haaft P, Verstrepen B, Sun Y, Morein B, kerblom L, Fuller

DH, Barnett SW, Heeney JL: Comparison of immunity

generated by nucleic acid-, MF59-, and ISCOM- formulated

human immunodeficiency virus type 1 vaccines in rhesus

macaques: evidence for viral clearance [In Process Citation]. J

Virol 73:3292-3300 (1999).

[38] Wang S, Liu X, Fisher K, Smith JG, Chen F, Tobery TW,

Ulmer JB, Evans RK, Caulfield MJ: Enhanced type I immune

response to a hepatitis B DNA vaccine by formulation with

calcium- or aluminum phosphate [In Process Citation]. Vaccine

18:1227-1235 (2000).

[39] Wang TT, Cheng WC, Lee BH: A simple program to

calculate codon bias index. Mol Biotechnol 1998

Oct;10(2):103-6.

[40] Wolff JA, Malone RW, Williams P, Chong W, Acsadi G,

Jani A, Felgner PL: Direct gene transfer into mouse muscle in

vivo. Science 247:1465-1468 (1990).

[41] zur MJ, Chen MC, Doe B, Schaefer M, Greer CE, Selby

M, Otten GR, Barnett SW: Increased expression and

immunogenicity of sequence-modified human

immunodeficiency virus type 1 gag gene. J Virol 74:2628-2635

(2000).

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

M 1

E 2

P 3

V 4

D 5

P 6

N 7

L 8

E 9

P 10

W 11

N 12

wt ATG GAG CCA GTA GAT CCT AAC CTA GAG CCC TGG CAT

co - - - - - - - - - - - - - - - - - - - - - - - G - - - - - - - - - - - -

H 13

P 14

G 15

S 16

Q 17

P 18

K 719

T 10

A 21

C 22

N 23

N 24

wt CAT CCA GGA AGT CAG CCT AAA ACT GCT TGC AAC AAC

co - - C - - C - - -C - - -C - - - - - C - - G - - C - - C - - - - - - - - -

C 25

Y 26

C 27

K 28

H 29

C 30

S 31

Y 32

H 33

C 34

L 35

V 36

wt TGT TAT TGT AAA CAC TGT AGC TAC CAT TGT CTA GTT

co - - C - - C - - C - - G - - - - - C - - - - - - - - C - - C - - G - - G

C 37

F 38

Q 39

T 40

K 41

G 42

L 43

G 44

I 45

S 46

Y 47

G 48

wt TGC TTT CAG ACA AAA GGC TTA GGC ATT TCC TAT GGC

co - - - - - C - - - - - C - - G - - - C-G - - - - - C AG - - - C - - -

R 49

K 50

K 51

R 52

R 53

Q 54

R 55

R 56

S 57

A 58

P 59

P 60

wt AGG AAG AAG CGG AGA CAG CGA CGA AGC GCT CCT CCA

co - - A - - - - - - A-A - - - - - - A - - A - - - - - - - - - - - - - -

S 61

S 62

E 63

D 64

H 65

Q 66

N 67

L 68

I 69

S 70

K 71

Q 72

wt AGT AGT GAG GAT CAT CAA AAT CTT ATA TCG AAG CAA

co - - C - - C - - - - - C - - C - - - - - - - - - - - - - - S - - - - - G

Figure 1. The wild type Tat amino acid sequence

A

B

C

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

A B C

Codon H 13

P 14

G 15

S 16

L 35

V 36

S 57

A 58

Wt CAT

(0.70)

CCA

(0.83)

GGA

(0.72)

AGT

(0.62)

CTA

(0.17)

GTT

(0.38)

AGC

(1.0)

GCT

(0.67)

Best CAC

(1.0)

CCC

(1.0)

GGC

(1.0)

AGC

(1.0)

CTG

(1.0)

GTG

(1.0)

AGC

(1.0)

GCC

(1.0)

CpG ----- ----- ----- ----- CTC

(0.48)

GTA

(0.24)

AGC

(1.0)

GCT

(0.67)

Figure 2. Codon optimization at three different places in the Tat amino acid sequence

Maximizing the desirable Motifs:

AT content: 50% Avg Fitness: 0.967748%

CAC TTG ATT GTT ACC CCA TCG GGC TGC GGC GAA CAA AAT ATG ATT GGC ATG ACC CCA ACC

GTT ATT GCA GTT CAC TAT CTC GAC GAA ACC GAA CAA TGG GAA AAA TTC GGC TTG GAA AAA

CGC CAA GGG GCG TTG GAA TTG ATT AAA AAA GGC TAT ACC CAA CAA TTG GCG TTC CGC CAA

CCA TCG TCG GCG TTC GCA GCG TTC

Minimizing the Undesirable Motifs:

AT content: 55% Avg Fitness: 0.96046%

CAC TTG ATT GTT ACC CCA TCA GGC TGC GGC GAA CAA AAT ATG ATT GGC ATG ACC CCA ACT

GTT ATT GCA GTT CAC TAT TTG GAC GAA ACT GAA CAA TGG GAA AAA TTT GGC TTG GAA AAA

CGC CAA GGC GCA TTG GAA TTG ATT AAA AAA GGC TAT ACC CAA CAA TTG GCA TTC CGT CAA

CCA TCG AGC GCA TTC GCA GCA TTC

Maximizing Desirable and Minimizing Undesirable Motifs:

AT content: 53% Avg Fitness: 0.94195%

CAC TTG ATT GTT ACC CCA TCA GGC TGC GGC GAA CAA AAT ATG ATT GGC ATG ACC CCA ACT

GTT ATT GCA GTT CAC TAT CTC GAC GAA ACT GAA CAA TGG GAA AAA TTT GGC TTG GAA AAA

CGC CAA GGC GCA TTG GAA TTG ATT AAA AAA GGC TAT ACC CAA CAA TTG GCG TTC CGT CAA

CCA TCG TCG GCG TTC GCA GCG TTC

Figure 3. Sample output of the program for a part of C3d sequence (desirable occurrences are shown in bold and

neutral occurrences are shown in italics)

Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern.pdf

An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern

Mahmoud Moh'd Mhashi

Computers and Information Technology

Tabuk University

Email: mmhashi@ut.edu.sa

Abstract

The objective of string matching is to search and find all the occurrences of the text pattern Pat with size m

characters in the text Text with size n characters, where n >> m. The string matching algorithm is so

fundamental that most computer programs use it, and it continues to receive great attention due to various

applications in text manipulation and other applications, especially in biological sequence analysis and

comparison. The main goal of this research is to develop a new enhanced skipping algorithm with intelligent

issues that outperforms the algorithms introduced in the literature. The string matching algorithms consist of

two main operations: checking and skipping. Traditionally, authors try to reduce the number of comparisons

through the checking operation. The new algorithm in this paper decreases the number of comparisons by

decreasing the number of checking operations through an increase in the shift distance. This enhances the

searching response time. The proposed algorithm is an efficient algorithm that can be used to search for exact

occurrences of long patterns in DNA patterns.

1. Introduction

Exact string searching is one of the most important problems that has been investigated; studies range

from finding the shortest string in DNA sequencing [1-4] to searching for occurrences of a pattern. In general,

string searching algorithms deal with searching the occurrences of a string (the pattern) of size m in a string (the

text) of size n (where n � m) [5-7]. In the literature, many exact string searching and pattern matching

algorithms were introduced and their performances were investigated against classical exact string searching

algorithms such as the Naïve (brute force) algorithm and Boyer-Moore-Horsepool(BMH) algorithms. Some of

these algorithms preprocess both the text and the pattern [8] while others preprocess the pattern [9-12].

The exact string searching problem consists of two major steps: checking and skipping, the checking step itself

consists of two phases:

1) A search along the text for a reasonable candidate string

2) A detailed comparison of the candidate against the pattern to verify the potential match.

Some characters of the candidate string must be selected carefully in order to avoid the problem of

repeated examination of each character of text when patterns are partially matched; the fewer the number of

character comparisons in the checking step the better the algorithm is. There are different algorithms that can be

used to check if the characters in the text match with the corresponding characters in the pattern [13-21]. After

the checking step, the skipping step shifts the pattern to the right to determine the next position in the text where

the substring text can possibly match with the pattern.

IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 13

1857-7202/1202002

This research will focus on searching for exact occurrences of long patterns in the DNA alphabet. Searching

for DNA sequence is studied by many authors in the literature [22-26]. Molecular biologists often search for the

important information from the databases in different directions of different uses [27]. A DNA sequence can be

abstracted as a string over an alphabet of 4 basic characters, (A, C, G, and T). DNA sequences can be very long.

For example, the total size of a human genome (DNA) is around 3Gbp. Researchers tend to search patterns in

DNA data frequently [28]. So, an efficient searching tool which is enough to allow users to locate patterns in

large DNA data is desired. Developing a fast and efficient algorithm to find all occurrences of DNA patterns is

important to the different applications using exact matching. The main goal is to develop a new enhanced

checking and skipping algorithm where intelligent issues outperform the algorithms that are introduced in the

literature.

2. The Naïve or (Brute Force) Algorithm

Let us assume that the target sequence is an array Text[n] of n characters and the pattern sequence is the array

Pat[m] of m characters, then a Naïve approach to the problem would be:

������� � ��� ����������������� ��� ��� ����� ��������� ��� ��� ���

���

������������������ ����������� ��������� ��� ��� � ����� ��� ��� �����!!���

�����������

��������������������������������

����������������" �# ��� ��$� �����!������%�������$�����%���

�������������������

��������������������������������������� ��� � &���

�������������������������

�����������������������'�����()��*��'� �� ����(����� ��������(����(����� �����!���������

���������������������+� �,���

����������������������-��

���������������������� #. �������!!���

����������������-��

������������-��

������ �'�����

���-��

In the outer loop, Text is searched for occurrences of the first character in Pat. In the inner loop, a

detailed comparison of the candidate string is made against Pat to verify the potential match. The algorithm has

a worst case time of O(nm), where O(nm) is the number of comparisons performed by the algorithm to find all

the occurrences of Pat with size m characters in the Text with size n characters. The worst case in this algorithm

14 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern

occurs when we get a match on each of the n Text characters and at each position we may need to perform m

comparisons.

3. Multiple Reference Character Algorithms (MRCA4m)

A string matching algorithm is a succession of checking and skipping, where the aim of a good

algorithm is to minimize the work done during each checking and to maximize the length distance during the

skipping. An algorithm can skip to the next position in the text without missing any pattern occurrence. Most of

the string matching algorithms preprocess the pattern before the search phase to help the algorithm to maximize

the length of the skips. The preprocessing phase in this new MRCA4m algorithm helps in increasing the

performance of maximizing the length of the skips.

In order to test the role of skipping distance on the number of checking operations, the MRCA

algorithm in [29] is enhanced. MRCA algorithm has five reference characters, including three static and two

dynamic references. Two new versions of MRCA algorithms are developed. The first version of MRCA (let us

call it MRCA2m) has three reference characters, including two static characters and one dynamic character.

The second version of MRCA (let us call it MRCA4m) has seven reference characters, including four static

reference characters and three dynamic characters. So, the number of reference characters in MRCA2m,

MRCA, and MRCA4m is three, five, and seven reference characters, respectively. The shift distance ranges

from 2m+1 characters in MRCA2m, 3m+1 characters in MRCA, and 4m+1 characters in MRCA4m. Of course,

increasing the shift distance in the skip operation decreases the number of checking operations. This is, in turn,

reduces the number of comparisons required to find all the occurrences of patterns in the text and decrease the

elapsed time. A case study will be conducted to test the above issues and will be explained in the next section.

A brief description of MRCA follows.

As mentioned earlier, CSA has five reference characters. The Text pointer TextIx always points to the

character, which is next to the character corresponding to the last character in Pat and the reference character ref

always points to the character that corresponds to the last character in Pat (i.e. ref = TextIx – 1). Now let ref1 =

TextIx, then the reference character ref2 can be calculated from ref or ref1, where ref2 can be found as “ref2 =

TextIx + m - 1” or “ref2 = TextIx + m” depending on the existence of ref or ref1 in Pat, where ref2 = TextIx + m

- 1 during the checking step if the character at ref doesn’t exist in Pat, or ref2 = TextIx + m after the checking

step if the character at ref1 doesn’t exist in Pat. The above formulas may be illustrated in the following

example:

Text: K K C D E F G K

Pat: E F G

In the above, Text and Pat assume that the pointers TextIx and ref1 point to the character ‘D’, the

pointer ref points to the character ‘C’, and m = 3, in this case since the character ‘C’ at ref doesn’t exist in Pat,

then ref2 = TextIx + m - 1 = 3 + 3 - 1 = 5. So, start counting from zero, ref2 points to the character ‘F’ in Text.

The occurrence of character ‘D’ at ref1 must be examined whether Pat occurs in Text or not, and since ‘D’

doesn’t occur in Pat then ref2 = TextIx + m = 3 + 3 = 6, hence, ref2 points to the character ‘G’ in Text.

IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 15

1857-7202/1202002

In addition to that, MRCA pre-processes the pattern to produce two different arrays, namely skip and

pos, each array has a length equals to the alphabet size. The skip array is used when the reference character ref1

exists in Pat, it expresses how much the pattern is to be shifted forward after the checking step. While the pos

array defines where each one of the different reference characters ref1, ref2, ref_ref1, or ref_ref2 is located in

Pat, if any one of them exists in Pat, where the two dynamic pointers ref_ref1 and ref_ref2 can be calculated

from the two static pointers ref1 and ref2 respectively. In particular, the dynamic pointer ref_ref1 which is

calculated at the skipping step if ref1 occurs in Pat can be found as ref_ref1 = ref1 + m - pt, where m is the Pat

length and pt is the location of ref1 in Pat (i.e., pt = pos[Text[ref1]]), and the dynamic pointer ref_ref2 which is

calculated and used only during checking step if ref doesn’t occur in Pat or after the checking step if ref1

doesn’t occur in Pat can be found as ref_ref2 = ref2 + m - pt1, where pt1 determines where ref2 is located in

Pat. The above formulas can be illustrated in the following examples:

Example 2: Calculating ref_ref1

0 1 2 3 4 5 6 7

Text: K K G E K F G H

Pat: E F G

Assuming the checking step is performed on the above Text and Pat, then the reference character will be

character ‘E’ at location 3 in Text, and since ‘E’ at ref1 occurs at location 1 in Pat, then pt = pos[Text[ref1]] =

pos[‘E’] = 1 (note that for the case of counting position the counting starts from 1 not zero), and ref_ref1 may be

found as ref_ref1 = ref1 + m - pt = 3 + 3 - 1 = 5 (i.e. ref_ref1 points to the character ‘F’ in Text at location 5).

Example 3: Calculating ref_ref2 when the character at ref doesn’t occur in Pat

0 1 2 3 4 5 6 7 8 9 10

Text: A B C D E F G H F E G

Pat: F E G

Since ref doesn’t occur in Pat, then ref2=5 (calculated above in Example 1). Consequently, ref2 + m -

pt1, where m = 3 and pt1 = pos[Text[ref2]]) = pos[Text[‘F’]]) = 1. Hence, ref_ref2 may be found as ref_ref2 = 5

+ 3 - 1 = 7. In such a case, the alignment will be with ‘H’ at location 7 in Text, however, the letter ‘H” doesn’t

occur in Pat, so, TextIx will move forward 7 locations to point to the character ‘G’ at position 10.

Example 4: Calculating ref_ref2 after the checking step if ref1 doesn’t exist in Pat

0 1 2 3 4 5 6 7 8 9 10 11 12 13

Text: C D E F G H I J K L E D C M

Pat: E D C

Let Text and Pat occur as shown above. Since the character at ref occurs in Pat and there is a mismatch then

ref1 will be considered as a basis for calculating ref2, and since ref1 doesn’t occur in Pat then ref2 = TextIx + m

= 6, and pt1 = pos[Text[ref2]]) = pos[Text[‘I’]]) = 0. Hence, ref_ref2 may be found as ref_ref2 = ref2 + m - pt1 =

6 + 3 - 0 = 9. In addition, the TextIx pointer will be shifted forward 10 positions to align with the letter

16 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern

Text[ref_ref2] = Text[‘L’] = 3m+1 positions, which is the maximum shift distance that this algorithm can skip

with only two–character-access. As a result, the pointer TextIx will point to the letter ‘M’ at position Text[13],

and the result will be as follows:

0 1 2 3 4 5 6 7 8 9 10 11 12 13

Text: C D E F G H I J K L E D C M

Pat: E D C

Based on that, The MRCA algorithm that reflects the above issues is as follows:

&��������� ���� ..����� ����������������� ��� �������/�.�������.,�/���

������

0���� �������

1���2��3�##���+# .�"�� �������#���#' .��2��

4�����������5������5�67�89��5!!�����

:�������/�.$5%����.,�/$5%���0����� ��� ���

������-��

;���2��<�=/'� �. ������.���� �����/�.���������� ����� �.��������2��

>��������5����5����� ��� ��5!!�����

?��������������$5%��/�.$�%��5�!&��.,�/$�%���0������� ��� � �5� &���

������-��

����-��

@���������9<76�� ����������������� ��� ��� ����� ��������� ��� ��� �������/�.�������.,�/���

���������

&��������� �������������#�.�A=�.=��� ��B���

&&�������/���/�&��� ���� �&��� �A� �&��� �0��� �A� �0���

&0������������$67�89%�����-���

&1���2��C/��� ���������+# ��������������� ����.��� ����� ����������2��

&4��������$���$�%%���&��#�.�A=�.=��� ������ ����������� ��� ���

&:���22�7�����7 ��� �����/ �������

&;���" �# �� ������� ��� ��� !&���

����������22�< �,����.� /D�< �,����.��� ����'�� �� ����� �� ����� ������ �/� ���'.�=�.=��� E��

&>����������� ��$� ����� ����� ��� �!�#�.�A=�.=��� %�������$#�.�A=�.=��� %���

IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 17

1857-7202/1202002

�������������

&?������22�< �,���"�� �� ����� ������ ���� ������� ./���.�� ����.��� ���� �����������

&@��������������$� ��$� ����� ����� ��� %%���

0�����������22�< �,�� ����'�� �� ������������ ������=���� �����# ��� ��#'��������.��� ����� ���

0&�����������������B������������������ ��� � �&�������������� ����

00���������������� ��$� ����� �!!B%�F�����$�����%���

�������������������

01����������������#�.�A=�.=��� �����������

04���������������������� �����

������������������-��

0:���������������'���()�6�����'�� �� ����#��������(��� ���� ���� ��� ���(����(��� �����&�� ��#���

���������������-��

�����������-�

0;�����22�7������ �.,�//����/�����

0>������� ��D��

0?�������� �&���� ���������������/����/�.$� ��$� �&%%��

0@�����������F/����

�������������

1�������������� �0���� �&!����� ��� ��/�&���/�.$� ��$� �0%%���

1&�� �������F/�&����

10���������������������� �A� �0���� �0�!����� ��� � �/�&���

10���������������������� �����!��1������� ��� � �/�&��!�&� �/�.$� ��$� �A� �0%%���

������������������-��

11������������� #. ��

���������������������

14������������������ �����!��0������� ��� � �/�&�!�&��

� ������-�

� -�

�1:������ #. �

� ��

��1;������������� �A� �&���� �&�!����� ��� � �/���

��1>������������ �����!��.,�/$� ��$� �A� �&%%� �/��!�&��

18 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern

�������������-�

���1?������� �'�����

���������-�

The first new version of MRCA is MRCA2m. As mentioned earlier, MRCA2m has skipping distance 2m+1.

The algorithm MRCA2m is as the same as the algorithm MRCA except the segment of code starting at line 26.

The new version from line 26 follows:

0;����������22�7������ �.,�//����/����

0>�������������� ��D�

0?������������� �&���� �������

0@������������������F/�.$� ��$� �&%%���

������������������

1������������������ �0���� ��!����� ��� ���

1&����������������/�&���/�.$� ��$� �0%%���

10���������������� �����!��0������� ��� � �/�&��

������������������-�

11��������������� #. �

���������������������

14�������������������� �A� �&���� �&�!����� ��� � �/���

1:������������������� �����!��.,�/$� ��$� �A� �&%%� �/��!�&��

�������������������-��

��������������-����

���1;������ �'����

����������-�

The second new version of MRCA is MRCA4m. As mentioned earlier, MRCA4m has skipping

distance ranges from one character to 4m+1 characters. The segment of code of MRCA4m starting from line 26

in MRCA follows:

���0;������22�7������ �.,�//����/����

���0>������ ��D�

���0?������� �&���� ������

���0@������/����/�.$� ��$� �&%%���

���1������������F/���

��������������

IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 19

1857-7202/1202002

���1&�������������� �0���� �&�!����� ��� ��

���10�������������/�&���/�.$� ��$� �0%%��

���11�����������������F/�&����

���14������������������� �1���� �0�!����� ��� ��

���1:������������������/�0���/�.$� ��$� �1%%��

���1;�������������������� �A� �1���� �1�!����� ��� � �/�0��

���1>������������������� �����!��4������� ��� �!�&� �/�0� �/�.$� ��$� �A� �1%%��

������������������������-

���1?����������������� #. ��

��������������������������

���1@����������������������� �����!��0������� ��� � �/�&�!�&��22� /�.$� ��$� �A� �0%%�

���������������������������-

�����������������-

���4���������� #. �

������������������

���4&�������������� �A� �&���� �&�!����� ��� � �/���

���40�������������� �����!��.,�/$� ��$� �A� �&%%� �/��!�&��

������������������-�

��������������-���������

����41������ �'����

�����������-�

4. Illustration and Discussion

The three algorithms MRCA2m, ECSA, and MRCA4m were implemented and compared on DNA sequence

of 4 basic characters, (A, C, G, T), with a size of more than 1.8 mega characters. The algorithms executed using

Intel(R) Pentium(R) 4 PC with CPU speed 2.40GHz, 246MB RAM, and Windows XP professional operating

system, and a program was designed in C++ to select randomly 6 groups. Each group of patterns consists of

3000 patterns. The number of characters in patterns ranges from 6 to 50 characters. The average number of

occurrences ranges from 0 to 2346. The cost of the searching process to find all the occurrences of the different

patterns in each group in Text is measured by finding 1) the search clock time, where the total clock time

includes the preprocessing clock time of patterns, and 2) the average number of checking operations (i.e., the

first main operation in string searching algorithms) required to find each group of patterns. The second

20 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern

measurement is important to give us an indication about the percentage of reduction in the number of

comparisons. Table (1) shows the clock time required to find each group of patterns.

Table 1: The clock time (seconds) required by MRCA2m, MRCA, and MRCA4m to find all the occurrences of

each group of patterns

Grou

p #

3000

Patterns

with

Length

In

character

Number of

Occurr-

ences

Clock time in seconds

Improvements of latest

version MRCA4m

vs.

MRCA2 ECSA Naive

MRCA2m

2m+1

ECSA

3m+1

MRCA4m

4m+1

1 6 0 22.1 7.75 6.61 5.39 15% 31%

2 6 0 22.4 8.44 7.25 6.19 14% 27%

3 6 2346 22.5 9.94 8.72 7.84 12% 21%

4 25 0 22.3 2.78 2.13 1.44 24% 48%

5 25 0 22.1 3.06 2.88 2.56 6% 16%

6 25 928 24.3 15.33 13.47 13.19 12% 14%

7 50 0 22.3 1.52 1.03 0.81 32% 47%

8 50 0 22.4 2.31 1.89 1.70 18% 26%

9 50 127 31.8 13.52 12.50 12.20 8% 10%

One can notice from Table (1) that the number of occurrences is zero. This is because the performance

of these versions of the algorithm ECSA needs to be tested. The characters in patterns in groups 1, 4, and 7 are

completely different from the characters in Text. About one third of the characters in the patterns in groups 2,

5, and 8 are exists in Text. Finally, the characters in groups 3, 6, and 9 are 100% exist in Text. Each group

consists of 3000 patterns. The time required to find all the occurrences of each group ranges from 1.515

seconds (group 7) to 15.38 seconds (group 6), 1.031 seconds to 13.468 seconds, and 0.812 seconds to 13.187

seconds by the algorithms MRCA2m, ECSA, and MRCA4m, respectively. Furthermore, one can see from

Table (1) that the algorithm MRCA4m (4m+1) reduces the clock time required by MRCA2m and ECSA

algorithms by 6.11% to 31.95% and 9.71 (group 9) to 48.33% (group 4). In order to show the significant of the

time required by these versions of MRCA, the naive algorithm is included in Table (1). The time required to

find all the occurrences of each group of patterns by the naïve algorithm ranges from 22.1 seconds to 31.8

seconds. Figure (1) show the time required to find each group of patterns by the different algorithms.

IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 21

1857-7202/1202002

0

5

10

15

20

1 2 3 4 5 6 7 8 9

ECSA_1 2m +1 ECSA 3m +1 ECSA_2 4m +1

Fig.1: The clock time required by MRCA2m, ECSA, and MRCA4m algorithms to find all the occurrences of patterns in each group.

Table 2: The number of checking operation required by MRCA2m, ECSA, and MRCA4m to find all the occurrences of each group of patterns

Grou

p #

3000

Patterns

with

Length

In

character

Number of

Occurr-

ences

Number of Checking operations

Improvements of latest

version MRCA4m

vs.

MRCA2m ECSA

MRCA2m

2m+1

ECSA

3m+1

MRCA4m

4m+1

1 6 0 138708 86283 65577 53% 24%

2 6 0 152076 99498 78764 48% 21 %

3 6 6517 163794 118503 100550 39% 15%

4 25 0 32028 20755 15493 52% 25%

5 25 0 40160 34588 30629 24% 12%

6 25 4354 224542 210939 207656 8% 2%

7 50 0 16367 10673 7936 52% 26%

8 50 0 23683 22343 21938 7% 2%

9 50 3126 214086 201129 193386 10% 4%

Table (2) presents the three algorithms with the number of checking operations required to find all the

occurrences of Pat in Text for all the different groups. The information of number of groups, pattern length, and

22 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern

the number of occurrences in this table are the same as in Table (1). The number of checking operations ranges

from 16367 (group 7) to 224542 (group 6), 10673 to 210939, and 7936 to 207656 by the algorithms MRCA2m,

ECSA, and MRCA4m, respectively. Furthermore, from Table (2) the algorithm MRCA4m (4m+1) reduces the

percentage of number of checking operations required by MRCA2m and ECSA algorithms by the ranges from

7% (group 8) to 53% (group 1) and from 2% (group 8) to 26% (group 7).

From the analysis of the discussed algorithms, it is clear that the new enhanced algorithm MRCA4m

outperforms the other algorithms, especially when the Pattern does not occur in Text (53%). This is due to the

long shift distance (4m+1). This decreases the number of checking operation, and in turn reduces the number of

basic comparison operations. However, when the characters in Pat exist in Text, the performance of the new

algorithm is not high as that in the previous case, especially when the pattern is long a little bit (group 6 and

group 9). Most of the words, sentences, and patterns have, prefix, suffix, infix, or may be a combination of two

of them. In such a case a lot of not necessary comparisons performed till the algorithm decides that there is no

mismatch. This decreases the performance of the new algorithm.

6. Conclusions

A new exact string searching algorithm MRCA4m was developed, and its performance was compared with

two algorithms, namely, MRCA2m, and MRCA. The new algorithm increases the performance of the skipping

phases needed for the string searching process. In the skipping phase MRCA4m focuses on increasing the shift

distance. The search clock time and counting the number of checking phase operation criteria were used in a

case study to compare the performance of MRCA4m against MRCA2m and MRCA. The results showed that:

1) Using MRCA4m improved the clock time required by MRCA2m and MRCA by 6% to 32% and 10% to 48%

respectively.

2) Using MRCA4m improved the number of checking operation required by MRCA2m and MRCA by 7% to

53% and 2% to 26% respectively.

3) Decreasing the pattern length decreases the system performance.

4) The performance of the new algorithm is positively distinguished in case that the percentage of existence of

the characters of patterns in Text is low. In such a case, if there is some applications required that situation

(i.e., rare existence characters of Pat in Text), it is possible to increase the shift distance more and more, and

in turn, increasing the performance.

5) Increasing the shift distance has a major effect on reducing the number of checking operations, and in turn

decreases the number of comparisons and increasing the system performance.

References

1] Ukkonen E., "A linear–time algorithm for finding approximate shortest common superstring", Algorithmica,

1990; 5:313–323.

2] Stephen G., "String Searching Algorithms", World Scientific, Singapore, 1994.

3] Apostolico, "A, Galil Z. Pattern Matching Algorithms", Oxford University Press, 1997.

IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 23

1857-7202/1202002

4] C. Gibas. And P. Jambeck., Developing Bioinformatics Computer Skills. O’Reilly, Sebastopol First

Edition April 2001 pp. 1, 446.

5] Gusfield D., "Algorithms on strings, trees and sequences", Cambridge University Press, Cambridge, 1997.

6] Fenwick P., "Fast string matching for multiple searches", Software–Practice and Experience, 2001,

31(9):815–833.

7] Liu Z, Du X,and Ishii N., "An improved adaptive string searching algorithm", Software Practice and

Experience, 1988, 28(2):191–198.

8] Sunday D., "A very fast substring search algorithm", Communications of the ACM, 1990, 33(8):132–142.

9] Raita T., "Tuning the Boyer-Moore-Horspool String Searching Algorithm", Software Practice and

Experience, 1992, 22 (10):879-844.

10] Bruce W., Watson, E., "A Boyer-Moore-style Algorithm for Regular Expression Pattern Matching", Science

of Computer Programming, 2003, 48: 99-117.

11] Ager M. S., Danvy O., and Rohde H. K., "Fast partial evaluation of pattern matching in strings",

ACM/SIGPLAN Workshop Partial Evaluation and Semantic-Based Program Manipulation, San Diego,

California, USA, pp. 3 – 9, 2003.W.-K. Chen, Linear Networks and Systems (Book style)., Belmont, CA:

Wadsworth, 1993, pp. 123–135.

12] Fredriksson and Grabowski S., “Practical and Optimal String Matching”, Proceedings of SPIRE'2005,

Lecture Notes in Computer Science 3772, 2005, pp. 374-385, Springer Verlag.

13] Hernandez, and Rosenblueth D., “Disjunctive partial deduction of a right-to-left string-matching algorithm”,

Information Processing Letters, 2003, 87: 235–241.

14] Apostolico A., and R.Giancarlo R., “The Boyer-Moore-Galil string searching strategies revisited”, SIAM J.

Comput., 1986, 15(1): 98-105.

15] Crochemore, “Transducers and repetitions”, Theoret. Comput. Sci., 1986; 45: 63-86.

16] Crochemore M., and Perrin D., “Two-way string-matching”, J. ACM, 1991, 38: 651-675.

17] Galil Z., and Giancarlo R., “On the exact complexity of string matching: upper bounds”, SIAM J. Comput.,

1992, 21: 407-437.

18] Smith P. D., “Experiments with a very fast substring search algorithm”, SP&E, 1991, 21(10): 1065-1074.

19] Smith P., "On Tuning the Boyer-Moore-Horspool String Searching Algorithm", Short Communication,

Software Practice and Experience, 1994, 24(4):435-436.

20] Mhashi M., "A Fast String Matching Algorithm using Double-Length Skip Distances", Dirasat Journal,

University of Jordan, Jordan, 2003, 30(1):84-92.

21] Mhashi, M., "The Performance of the Character-Access On the Checking Phase in String Searching

Algorithms", Transactions on Enformatica, Systems Sciences and Engineering, 2005, 9: pp. 38 –43.

24 M.M. Mhashi: An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern

22] W. Kim, S. Park, J. Won. S. Kim, and J. Yoon., "An efficient DNA sequence searching method using

position specific weighting scheme", Journal of Information Science, 2006, 32(2).

23] A. F. Klaib and H. Osborne. “Searching Protein Sequence Database Using BRBMH Matching Algorithm,”

International Journal of Computer Science and Network Security (IJCSNS), 8(12), pp. 410-414, 2008.

24] Y. Huang, X. Pan, Y. Gao, and G. Cai, "A Fast Pattern Matching Algorithm for Biological Sequences,"

IEEE, pp. 608 – 611, 2008.

25]R. Thathoo, A. Virmani, S. S. Lakshmi, N. Balakrishnan and K. Sekar, “TVSBS: A fast exact pattern

matching algorithm for biological sequences,” Current Science, 2006, 91(1): 47–53.

26] A. L. Delcher, A. Phillippy, J. Carlton, and S. L. Salzberg, "Fast algorithms for large-scale genome

alignment and comparison," Nucleic Acids Res, 29002, 30, pp. 2478-2483.

27] Raju Bhukya., and DVLN Somayajulu., " Exact Multiple Pattern Matching Algorithm using DNA Sequence

and Pattern Pair", International Journal of Computer Applications 17(8):32-38.

28] L. Cheng, D. Cheung, and S. Yiu., Approximate String Matching in DNA Sequences, Proceedings of the

Eighth International Conference on Database Systems for Advanced Applications, 2003

29] Mhashi M., "The Effectof Multiple Reference Characters on Detecting Matches in String Searching

Algorithms", Software Practice and Experience, 2005, 35(13): 1299 - 1315.

IMACST: VOLUME 3 NUMBER 1 FEBRUARY 2012 25

1857-7202/1202002

Analyzing large-scale DNA Sequences on multi-core Architecture.pdf

Analyzing large-scale DNA Sequences on Multi-core Architectures

(CSE-2015, ©IEEE)

Suejb Memeti and Sabri Pllana Department of Computer Science, Linnaeus University

351 95 Växjö, Sweden {suejb.memeti, sabri.pllana}@lnu.se

Abstract—Rapid analysis of DNA sequences is important in preventing the evolution of different viruses and bacteria during an early phase, early diagnosis of genetic predispositions to certain diseases (cancer, cardiovascular diseases), and in DNA forensics. However, real-world DNA sequences may comprise several Gigabytes and the process of DNA analysis demands adequate computational resources to be completed within a reasonable time. In this paper we present a scalable approach for parallel DNA analysis that is based on Finite Automata, and which is suitable for analyzing very large DNA segments. We evaluate our approach for real-world DNA segments of mouse (2.7GB), cat (2.4GB), dog (2.4GB), chicken (1GB), human (3.2GB) and turkey (0.2GB). Experimental results on a dual-socket shared-memory system with 24 physical cores show speedups of up to 17.6×. Our approach is up to 3× faster than a pattern- based parallel approach that uses the RE2 library.

Index Terms—parallel DNA analysis, multi-core architectures, finite automata

I. INTRODUCTION The need for high performance computational biology has

emerged as a result of fast growth in biological information, the complexity of interactions that underlie many processes in biology, as well as the diversity and the interconnectedness of organisms at the molecular level [5]. These biological information are accumulated via different techniques, however they require adequate analysis and processing to extract useful information that make the results evident.

According to Benson et al. [9] the number of Deoxyribonu- cleic Acid (DNA) sequences and nucleotide bases in these sequences is growing exponentially, doubling every 18 months. As these data are collected, motif search and DNA sequencing are just some examples among many for analytics of Next Gen Sequencing Analysis.

A DNA sequence contains specific genetic instructions, which make the living organisms function properly. In a DNA strand there are four bases of nucleotides: A-adenine, C- cytosine, G-guanine and T-thymine. DNA analysis is important for discovery of differences and similarities of organisms and exploration of the evolutionary relationship between them. This process often requires comparisons of the corresponding DNA sequences, for example, checking whether one sequence is a subsequence of another, or comparing the occurrences of specific k-mers in the corresponding DNA sequences. In computational biology k-mers refer to all the possible sub- strings (sub-sequences) of length k of a DNA sequence. They

have an important role during sequence assembly and can be used in sequence alignment as well.

Analyzing DNA sequences within a reasonable time is important for domain scientists to study various phenomenons, such as the evolution of viruses and bacteria during an early phase [20], or diagnosis of genetic predispositions to certain diseases.

Modern parallel computing systems promise to provide the capabilities to cope with the DNA analysis processing requirements. Existing approaches use both hardware and software to accelerate regular expression matching. The hard- ware based approaches (such as [7], [27]) are faster, but less flexible and more expensive, whereas software based acceleration techniques are flexible in terms of updating or adding new patterns [30]. Recently different software based DNA analysis techniques designed for multi-core systems have been proposed [6], [11], [14], [16], [19], [23].

In this paper, we will first explore and discuss the paral- lelization opportunities of DNA analysis, and thereafter we introduce a parallel algorithm for DNA analysis that is based on Finite Automata. We use a domain decomposition approach for parallelization; in our approach the DNA sequence is split into several chunks, and each chunk is assigned to a thread to perform pattern matching. Our algorithm is optimized to do efficient speculations of the possible initial states for each chunk. Only one regular expression matching (REM) for a chunk is required to be completely performed; the remaining REMs stop when the converging point is reached. A converging point is a state where two or more REM starting from different states meet after the same number of symbols is read. Furthermore, we use a memory efficient data structure that saves the necessary information to count and highlights the k-mers. Experiments with real-world DNA segments (for human and various animals) on a dual socket shared-memory system with 48 threads show significant speedups compared to the sequential version (up to 17.6x). The implementation of our algorithm is up to 3x faster than a pattern-based algorithm implemented using the RE2 library [3]. Major contributions of this paper include: • a parallel algorithm for DNA analysis that is based on

Finite Automata; • empirical evaluation of our algorithm with real-world

DNA segments of mouse (2.7GB), cat (2.4GB), dog

ar X

iv :1

50 9.

01 50

6v 1

[ cs

.D C

] 4

S ep

2 01

5

(2.4GB), chicken (1GB), human (3.2GB) and turkey (0.2GB);

• a comparison of our algorithm with a pattern-based algorithm implementation that uses RE2 library.

The rest of the paper is organized as follows. Section II provides background information on pattern matching, whereas Section III presents our algorithm for counting and extracting k-mers in a DNA sequence. Section IV presents the experimental setup and discusses the experimental results. The work described in this paper is compared and contrasted to the related work in Section V. Section VI provides a summary of our work.

II. REGULAR EXPRESSION MATCHING (REM) WITH FINITE AUTOMATA (FA)

Regular expression matching verifies whether a pattern is present in a string. REM is commonly used for determining the locations of a pattern within a sequence of tokens, in search and replace functions, or to highlight important information out of a huge data set. In the context of computational biology, pattern matching is used for analyzing and processing biological information in order to extract the useful parts of the data and make them evident. The formal definition of the REM is as follows: the input text is an array T [1..n] where n is the length of the input, and pattern P [1..m] where the length of the pattern m ≤ n. The alphabet

∑ defines the possible

characters of the input string. A Finite Automaton (FA) is a machine for processing

information by scanning the input text T in order to find the occurrences of the pattern P . A formal definition of the FA is as follows: FA is a quintuple of (Q,

∑ ,δ,q0,F ), where Q

is the finite set of states, ∑

is the finite alphabet, δ is the transition function Q×

∑ −→ Q, q0 is the start state and F

is the distinguished set of final states. A well known algorithm for multiple pattern matching is the

Aho-Corasick algorithm. It is able to match any occurrences (including the overlapped ones) of multiple patterns linearly to the size of the input string. It examines each character of the input string only once. It builds an automaton by creating states and transitions corresponding to these states. It adds failure transitions when there is no regular transition leaving from the current state on a particular character, which makes it possible to match multiple and overlapping occurrences of the patterns. Furthermore, this algorithm is capable of delivering input-independent performance if implemented efficiently in parallel systems, which is a reason why we use this algorithm as basis of our work.

III. DESIGN AND IMPLEMENTATION OF A DNA ANALYSIS ALGORITHM

In this section we first provide the details about the outline of our algorithm. Thereafter we discuss the most important implementation aspects to achieve a scalable algorithm for counting and extracting specific k-mers from a large DNA sequence.

A. Our algorithm for counting and extracting the location of k-mers (k-mers CoEx)

Figure 1 depicts two possible ways for parallel execution of regular expression matching for bio-computing applications: (a) input-based approach that splits the input string into smaller chunks and processes them in separate threads and (b) pattern-based approach that splits the patterns in sub- patterns, creating separate state machines for each of them and processing the same input string with each different machine [16].

(a) Input-based approach

(b) Pattern-based approach

Fig. 1. Load balancing using Input and Pattern partitioning approach.

Our algorithm uses the input-based approach. The challenge of this approach is determining the initial state for each chunk. Finding the correct starting state for each chunk is important for finding the occurrences of the patterns that appear in the crossing border. Other researchers use different ways of finding the initial states, for instance Luchaup et al. [18] use speculation to find the initial state based on the most visited states, Devi and Rajagopalan [12] use an index based technique, Chacon et al. [11] use Suffix-Arrays, Villa et al. [31], uses the pattern length overlapping approach.

Fig. 2. K-mers CoEx finite state automaton for matching the patterns: ”acg”, ”cat”, ”cta” and ”tta”

Our way of determining the possible initial states is as follows: (1) find the set of source states (L) for the first element of the sub-input mapped to the running thread (Tn); (2) find the set of destination states (S) for the last character of the sub-input mapped to the previous thread (Tn−1); (3) find the intersection of S and L (S ∩ L), which is the set of possible initial states [21]. The first thread (T0) always starts from the initial state q0. Each thread is responsible for finding the set of possible initial states, and for each state of this set a regular expression matching is performed. When all threads have finished their job, the results are joined by a binary reduction, which connects the last visited state of Tn to the first visited state of Tn+1.

This method provides very good results for sparse transition tables and good performance for dense matrices for DFAs with relatively small number of states. However, in a DFA with large number of states this method seems to be less efficient. This happens because one thread may be responsible to perform multiple REM for the same input, due to multiple possible initial states. To reduce the operations required for each thread to perform the REM starting from different states, further optimizations are needed.

While investigating the REM using the modified Aho- Corasick DFA representation, we noticed that the result con- verges after several symbols are read (in our experiments, 10 is the max number of steps required to find the converging point). A converging point is a state where two or more REM starting from different states meet after the same number of symbols are examined. This insight allows us to significantly minimize the execution cost required to perform the REM starting from each possible initial state. The details about the process of the convergence are given in the next Section.

B. Implementation Aspects

In this section we will explain the implementation details of our algorithm, including the process of building the DFA,

splitting the input among the available threads, finding the set of possible initial states for each chunk assigned to a thread, running the REM for the first state in this set and the process of finding the converging point. A reference of each process to the corresponding lines of codes in Algorithm 1 will be provided.

Furthermore, we define qi as the state of the automaton where i is the state id, Ei as a sub-pattern of the selected patterns, Ri as the process of performing an REM starting from the item at index i of the set of possible initial states. The values used in the switch-case (Line 25 - 30) (ex. 122, 127, 128...) determine that a specific sub-pattern (Ei) has been matched.

1) Building the Deterministic Finite Automaton: The AC algorithm with failure transitions has a drawback due to the non-deterministic transitions for a single input character. Fig- ure 2 illustrates our solution to eliminate the failure transitions by adding the right transition (indicated by dashed lines) for each state. Having a valid transition for each possible character to another state in the automaton, guarantees that for each symbol the same amount of operations will be performed. The example automaton shown on Figure 2 is able to match the following patterns: ”acg”, ”cat”, ”cta” and ”tta”. For example, if we read string ”ac” we reach state q2, and when ”a”, ”c” or ”t” is read we know exactly that state q6, q5 or q10 is next, respectively.

2) Splitting the input and finding the possible starting states (PSS): The process of splitting the input among the available threads is depicted in Table I.a. This is a straight forward step, where the input length is divided by the number of available threads (Line 5-6). The chunks are assigned to the threads consecutively based on the thread IDs.

The pseudo-code of the process of finding the PSS (See Section III-A) is shown on Algorithm 1 Line 10. When the PSS are determined the thread performs an REM starting from each item of PSS (Line 14). Table I.b,c depicts the REM process starting from each item of PSS; each table corresponds to a thread. For example, the first thread (see Table I.b) initiates the REM starting from the following PSS: q3, q10, q13, and q139.

For every Ri (REM starting from the PSS at index i) a CR structure is created and stored in the results. The CR structure stores the initial state, last state and the total number of occurrences for each of the sub-patterns (Line 61 - 65).

3) Determining the Converging Point: We investigated that while performing an REM of the same input string starting from different states, the REM converges after a certain num- ber of steps. An example of how the convergence happens is depicted in Table I.b,c. For example in Table I.c, the matching of CCCTATACGA... (see Table I.a) starting from q3 leads to the following transitions: q2 → q2 → q2 → q10 → q11 → q130 → q40 → q60 → q15 → q1.... Matching the same input starting from q10 leads to the following transitions: q19 → q2; no further transitions are needed, because the state at position two for both REMs (R0 and R1) is q2, which indicates the point of convergence.

Algorithm 1 k-mers CoEx Input: transition table dfa; set of final states F ; input string I Output: list of CR results

1: procedure KCOEX(dfa, F, I) 2: steps = 10 . The max number of steps required to converge 3: result = list < list < CR >> . Stores the final states for each thread 4: for T0...Tn do in parallel . T - Thread, n - total number of threads 5: start position = t i ∗ (I.length/n) . t i = thread id 6: SI = substring(start position, I.length/n) . SI - sub input 7: if ti ! = 0 then 8: P SS = get possible starting states(I[start position],

I[start position − 1]) 9: else

10: P SS = [0] 11: end if 12: fr list = list < F R > 13: psi i = 0 14: for int cs in P SS do 15: CR cr . stores the init state, last state, and total number of

occurrences for each subexpression 16: char i = 0 17: for char c in SI do 18: if char i == 0 then 19: cr.init state = cs 20: else if char i == SI.length − 1 then 21: cr.last state = dfa[cs][c] 22: end if 23: cs = dfa[cs][c] 24: if cs in F then 25: switch cs do 26: case 117 27: cr.final states[0] + + . agggtaaa | tttaccct is

found 28: case 122 or 128 29: cr.final states[1] + + .

(c|g|t)gggtaaa | tttaccc(a|c|g) is found 30: ... 31: end if 32: if psi i == 0 and char i ≤ steps then 33: F R fr 34: fr.current state = cs 35: fr.final states[0] = cr.final states[0] 36: ... 37: fr.final states[8] = cr.final states[8] 38: fr list.add(fr) 39: else if psi i > 0 and fr list[char i].current state == cs

then . check for convergence 40: cr.final states[0] + = results[t i][0].final states[0]−

fr list[char i].final states[0] 41: ... 42: break 43: end if 44: end for 45: results[t i].add(cr) 46: end for 47: end for 48: end procedure

Input: transition table dfa; the first character of the input mapped to Tn (current thread) first char; the last character of the input mapped to Tn−1 (previous thread) last char Output: list of states

49: procedure GET POSSIBLE STARTING STATES(dfa, first char, last char) 50: S = L = list < q > . q - state of the DFA 51: for q0...qn do 52: if dfa[qi][first char] ∈ Q then . Q-list of states 53: Si = qi 54: end if 55: if dfa[qi][last char] ∈ Q then 56: Li = dfa[qi][last char] 57: end if 58: end for 59: return S ∩ L 60: end procedure

61: struct CR{ 62: int init state 63: int last state 64: int final states[9] . Stores the number of occurrences for each

sub-expression (E1...E9) 65: }

66: struct F R{ 67: int current state 68: int final states[9] 69: }

TABLE I THE PROCESS OF SPLITTING THE INPUT, PERFORMING THE REM

STARTING FROM EACH POSSIBLE INITIAL STATE, AND THE PROCESS OF CONVERGENCE

a) Splitting the input string into chunks

0 1 2 3 4 5 6 7 8 9

T1 → G T G A G C C G A G . . . chunk 1 T2 → C C C T A T A C G A . . . chunk 2

b) REM for chunk 1

In it

ia l

st at

e 1 → 7 21 31 1 7 19 2 9 1 7 . . . 6 → 16 21 . . . 18 → 32 48 13 1 . . . 43 → 63 21 . . .

c) REM for chunk 2

In it

ia l

st at

e 3 → 2 2 2 10 11 130 40 60 15 1 . . . 10 → 19 2 . . . 44 → 67 90 112 129 1 8 11 5 15 . . . 63 → 84 105 2 . . .

TABLE II A TABULAR REPRESENTATION OF THE fr list (LINE 12)

Step CS E1 E2 E3 E4 E5 E6 E7 E8 E9

1 67 0 0 0 0 0 0 0 0 0 ...

4 129 0 0 0 0 0 1 0 0 0

In order to properly count the number of occurrences of k-mers when a converging point is met, we need to store the number of occurrences of k-mers for the first n steps while performing R0. For each of the first n examined characters a FR structure is created and stored in the fr list (Line 32- 38). The FR structure is shown on Algorithm 1 Line 66 - 69, which stores the current state of the automaton, and the current number of occurrences for each of the sub-patterns. An example of this process is depicted in Table II, where each row represents a FR structure. In this example, we assume that q44 is the first state in the PSS, which means that the R0 starts from q44. After four characters (CCCT ) are examined, a final

TABLE III THE TABULAR REPRESENTATION OF THE results (LINE 3)

T Q0 Qn E1 E2 E3 E4 E5 E6 E7 E8 E9

1 1 130 0 0 0 1 0 0 0 0 0 96 130 0 0 0 1 0 0 0 0 0

2 3 130 0 0 1 0 1 0 2 0 1 44 130 0 0 1 0 1 1 2 0 1

T - thread index Q0 - start state, Qn - end state E1 − E9 - sub-expressions CS - current state

state (q129 - representing E6, see Table I.e row 3) is reached, therefore the number of current occurrences of k-mers for E6 becomes 1 (see Table II).

The REM starting from the remaining states of PSS will be performed until the converging point is reached. For instance (see Table I.b), R1,R2 and R3 need only 2, 4, 2 characters to be examined, respectively. When the converging point is reached, the total number of final states (CR.final states) is calculated by adding the total number of states found for R0 (results[t i][0].final states) to the Ri and subtracting the final states (fr list[char i].final states) found of R0 at the converging point (Line 41).

IV. EXPERIMENTAL EVALUATION In this section we describe the experimentation environment

used for the evaluation of our proposed algorithm and we discuss the obtained performance results.

A. Experimentation Environment We have performed experiments on a shared-memory sys-

tem with two 12-core Intel Xeon processors of the type E5- 2695 v2 and 16GB Memory. In total the system has 24 physical cores and each physical core supports two threads (also known as logical cores). We have implemented our algorithm using C++11 programming language and OpenMP. For compilation we used the Intel Compiler icc 15.0.0. In order to address the variability in performance measurements we have repeated each experiment 20 times.

For our experimental evaluation we have selected data-sets of genomes from the GenBank National Center for Biotech- nology Information sequence database [2]: mouse (2.7GB), cat (2.4GB), dog (2.4GB), chicken (1GB), human (3.2GB) and turkey (0.2GB). The information about the data-sets is provided in Table IV.

TABLE IV DNA DATA-SETS

Genome Reference Size (MB)

Mouse GRCm38.p2 2830 Cat Felis catus-6.2 2490 Dog CanFam3.1 2440 Chicken Gallus gullus-4.0 1060 Human GRCh38 3250 Turkey Meleagris gallopavo 193

In our experiments we used the same patterns as in the regex-dna benchmark, listed in Table V [1], which are used to extract and match DNA 8-mers and substitute nucleotides according to standards of International Union of Biochemistry (IUB). In Section IV-B we compare the performance of regex- dna benchmark with our k-mers CoEx algorithm.

The DFA for the given regular expression (Table V) was generated using our PaREM tool [21]. Figure 3 depicts the DFA of 137 states, which is able to find the occurrences (including the overlapping ones) of the selected patterns. For simplicity the failure links are omitted from the DFA graph.

TABLE V PATTERNS OF THE regex-dna BENCHMARK. THE SYMBOL ”|” DETERMINES

THE ”OR” REGEX OPERATOR

E1 agggtaaa | tttaccct E2 (c|g|t)gggtaaa | tttaccc(a|c|g) E3 a(a|c|t)ggtaaa | tttacc(a|g|t)t E4 ag(a|c|t)gtaaa | tttac(a|g|t)ct E5 agg(a|c|t)taaa | ttta(a|g|t)cct E6 aggg(a|c|g)aaa | ttt(c|g|t)ccct E7 agggt(c|g|t)aa | tt(a|c|g)accct E8 agggta(c|g|t)a | t(a|c|g)taccct E9 agggtaa(c|g|t) | (a|c|g)ttaccct

Fig. 5. Speedup of our k-mers CoEx algorithm implementation.

B. Results

We first present the performance results of our k-mers CoEx algorithm for various problem and machine sizes, and thereafter we compare our algorithm with a pattern-based algorithm implementation that uses RE2 library (known as regex-dna benchmark). Figure 4 depicts the execution time in logarithmic scale for each of our selected data-sets and for various numbers of threads {1,2,6,12,24,48}. We observe a good scalability of our algorithm as we increase the number of threads or the input size. For example, the analysis of the human’s DNA sequence using one thread takes 27 seconds, and by increasing the number of threads to 2, 6, 12, 24 and 48 the execution time reduces to 14.3s, 5.3s, 3s, 2s and 1.5s respectively.

Figure 5 depicts the obtained speedup of our algorithm com- pared to a sequential version of the Aho-Corasick algorithm for DNA analysis. We may observe that the k-mers CoEx algorithm scales gracefully with respect to the size of data-sets and the number of threads. The maximal speedup of 17.65× is achieved for the largest data-set (that is the human DNA segment) using 48 threads.

Figure 6 compares the performance of our k-mers CoEx algorithm with the regex-dna benchmark [1], which is im- plemented in C++ using the RE2 library [3] and OpenMP. The RE2 implementation is based on splitting the pattern in smaller patterns, and matching the input string in parallel for each sub-pattern. Since the regex-dna benchmark does not

Fig. 3. The DFA automaton that matches (counts and extracts the location of k-mers) 8-mers from a given DNA sequence. The corresponding regular expression is shown on Table V. For simplicity the failure links are omitted from the figure.

Fig. 4. Performance results of our k-mers CoEx algorithm for various numbers of threads and data-sets. As input are used six DNA sequences of various lengths: mouse (2.7GB), cat (2.4GB), dog (2.4GB), chicken (1GB), human (3.2GB), and turkey (0.2GB). The experiments are performed by varying the number of threads (2, 6, 12, 24, and 48). The performance measurements for each experiment have been repeated 20 times.

support larger data-sets, we have compared the two algorithms for the two smallest data-sets: chicken (1060MB) and turkey (193MB). Our k-mers CoEx algorithm outperforms the regex- dna benchmark for both data-sets. We may observe that the k-mers CoEx algorithm running on one thread takes the same amount of time as the regex-dna benchmark running on the total amount of threads. This happens because one thread has to perform at least one sequential REM for a specific sub- pattern. In this class of algorithms where the execution time is mainly dependent on the length of the input, balancing the work among the available threads should be done by splitting the input string, instead of the pattern length. One could benefit from partitioning a long pattern into smaller one, in cases when the input string is either relatively short or can not be split

(Real-time Network Intrusion Detection).

V. RELATED WORK

In this section we discuss the state-of-the-art in pattern matching and DNA sequence analysis techniques for multi- core architectures.

Existing approaches use both hardware and software to accelerate the process of regular expression matching. In com- parison to hardware based state machines, which are faster, less flexible and more expensive, software based acceleration techniques are flexible in terms of updating or adding new patterns [30].

Herath et al. presented in [16] an implementation of the Aho-Corasick string matching algorithm using POSIX threads,

Fig. 6. Performance comparison between k-mers CoEx and the regex-dna benchmark (RE2)

which is based on the pattern partitioning approach. A repli- cation of the Herath’s study with the intention to improve the software implementation of the Aho-Corasick algorithm was conducted by Arudchutha et al. [6].

Marçais and Kingsford [19] present the Jellyfish tool, which is based on the lock-free hash table that is optimized for counting k-mers of length up to 31 bases. Rizk et al. [28] present a similar approach to Jellyfish [19], so called DSK, which is designed for small-memory servers. The k-mers are counted by traversing the hash tables. Using hash tables for the internal representation resulted to be memory inefficient [14]. As described by Drews et al. [14] a sequence corresponding to a human chromosome with 24-230MB of input data would require gigabytes of memory to store the k-mers information.

Drews et al. [14] achieved significant speedup by partition- ing the input string among the threads in such a way that each thread processes only sequences starting with a specified prefix used to divide the radix tree among the threads. They achieved up to 6.9× speedup on a shared memory system with 8 cores.

The n-step FM-index approach presented by Chacón et al. [11] achieved speedups from 1.4× to 2.4× with respect to their original FM-index search algorithm.

An approach based on the Aho-Corasick string matching algorithm designed for the Cray XMT architecture is proposed by Villa et al. [31]. They split the input among the available threads, and overlap the input by the pattern length. Their approach is applicable for multiple patterns as long as they are of the same length, otherwise, the occurrences of the shortest patterns occurring on the crossing border may be counted by both threads.

A method for searching arbitrary regular expressions using speculation is proposed by Luchaup et al. [18]. The drawback is that if an REM performed by a thread does not converge on its sub-input, then the next thread has to start from a new state that breaks the serialization and limits the scalability.

Our k-mers CoEx algorithm is tailored for large-scale DNA

analysis. In our approach, the DNA segment is split into several chunks, and efficient speculations of the possible initial states for each chunk are performed. Furthermore, our algorithm optimizes the REM using a converging point.

VI. SUMMARY

We have described a parallel algorithm based on Finite Automata for counting and extracting k-mers in a DNA segment. In a series of experiments with real world data- sets we have observed that the algorithm scales well with respect to various problem and machine sizes. We achieved the maximal speedup of 17.65× for the largest data-set (that is the human DNA segment) using 48 threads on a dual-socket shared-memory system with 24 physical cores. In comparison to the regex-dna benchmark our algorithm was up to three times faster.

In this paper we have studied the performance of our approach for DNA sequence analysis on a shared-memory system with two 12-core Intel Xeon processors. It may be useful to compare the performance that we achieved using all available cores of the host Intel Xeon processors with the performance achieved when all available cores of the Intel Xeon Phi coprocessor are used [22]. Furthermore, software technologies, such as [13], enable the use of all cores of homogeneous processors of the host and all available cores of the coprocessor.

Future work could address generalization of our approach for DNA sequence analysis for various types of accelerated systems using techniques that ensure performance portability [8], [17], [24], [29]. The use of modeling and simulation techniques [10], [15], [25], [26] could help to reason about the performance on extreme-scale computing architectures [4].

REFERENCES

[1] The computer language benchmarks game. http://benchmarksgame. alioth.debian.org/. Accessed: Jan. 2015.

[2] National center for biotechnology information u.s. national library of medicine. http://www.ncbi.nlm.nih.gov/genbank. Accessed: Jan. 2015.

[3] Re2: An efficient, principled regular expression library. https://code. google.com/p/re2/. Accessed: Jan. 2015.

[4] E. Abraham, C. Bekas, I. Brandic, S. Genaim, E. B. Johnsen, I. Kondov, S. Pllana, and A. Streit. Preparing hpc applications for exascale: Challenges and recommendations. In 2015 International Conference on Network-Based Information Systems (NBiS). IEEE, 2015.

[5] S. Aluru, N. M. Amato, and D. A. Bader. Editorial: Special section on high-performance computational biology. IEEE Transactions on Parallel and Distributed Systems, 17(8):737–739, 2006.

[6] S. Arudchutha, T. Nishanthy, and R. G. Ragel. String matching with multicore cpus: Performing better with the aho-corasick algorithm. arXiv preprint arXiv:1403.1305, 2014.

[7] Y. Benenson, T. Paz-Elizur, R. Adar, E. Keinan, Z. Livneh, and E. Shapiro. Programmable and autonomous computing machine made of biomolecules. Nature, 414(6862):430–434, 11 2001.

[8] S. Benkner, S. Pllana, J. Traff, P. Tsigas, U. Dolinsky, C. Augonnet, B. Bachmayer, C. Kessler, D. Moloney, and V. Osipov. PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems. Micro, IEEE, 31(5):28–41, Sept 2011.

[9] D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. Genbank. Nucleic acids research, 41(D1):D36–D42, 2013.

[10] I. Brandic, S. Pllana, and S. Benkner. An approach for the high-level specification of qos-aware grid workflows considering location affinity. Scientific Programming, 14(3–4):231–250, 2006.

[11] A. Chacn, J. C. Moure, A. Espinosa, and P. Hernndez. n-step fm- index for faster pattern matching. In V. N. Alexandrov, M. Lees, V. V. Krzhizhanovskaya, J. Dongarra, and P. M. A. Sloot, editors, ICCS, volume 18 of Procedia Computer Science, pages 70–79. Elsevier, 2013.

[12] S. N. Devi and S. P. Rajagopalan. Article: An index based pattern matching using multithreading. International Journal of Computer Applications, 50(6):13–17, July 2012.

[13] J. Dokulil, E. Bajrovic, S. Benkner, S. Pllana, M. Sandrieser, and B. Bachmayer. High-level support for hybrid parallel execution of c++ applications targeting intel xeon phi coprocessors. Procedia Computer Science, 18(0):2508 – 2511, 2013. 2013 International Conference on Computational Science.

[14] F. Drews, J. Lichtenberg, and L. R. Welch. Scalable parallel word search in multicore/multiprocessor systems. The Journal of Supercomputing, 51(1):58–75, 2010.

[15] T. Fahringer, S. Pllana, and J. Testori. Teuta: Tool support for performance modeling of distributed and parallel applications. In Computational Science - ICCS 2004, volume 3038 of Lecture Notes in Computer Science, pages 456–463. Springer Berlin Heidelberg, 2004.

[16] D. Herath, C. Lakmali, and R. Ragel. Accelerating string matching for bio-computing applications on multi-core cpus. In Industrial and Information Systems (ICIIS), 2012 7th IEEE International Conference on, pages 1–6, Aug 2012.

[17] C. Kessler, U. Dastgeer, S. Thibault, R. Namyst, A. Richards, U. Dolin- sky, S. Benkner, J. Traff, and S. Pllana. Programmability and perfor- mance portability aspects of heterogeneous multi-/manycore systems. In Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pages 1403–1408, March 2012.

[18] D. Luchaup, R. Smith, C. Estan, and S. Jha. Speculative parallel pattern matching. IEEE Transactions on Information Forensics and Security, 6(2):438–451, 2011.

[19] G. Marais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764– 770, 2011.

[20] A. Mellmann and et al. Prospective genomic characterization of the german enterohemorrhagic escherichia coli o104:h4 outbreak by rapid next generation sequencing technology. PLoS ONE, 6(7):e22751, 07 2011.

[21] S. Memeti and S. Pllana. Parem: A novel approach for parallel regular expression matching. In Computational Science and Engineering (CSE),

2014 IEEE 17th International Conference on, pages 690–697, Dec 2014. [22] S. Memeti and S. Pllana. Accelerating dna sequence analysis using

intel xeon phi. In 2015 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). IEEE, 2015.

[23] K. J. Nordström, M. C. Albani, G. V. James, C. Gutjahr, B. Hartwig, F. Turck, U. Paszkowski, G. Coupland, and K. Schneeberger. Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nature biotech- nology, 31(4):325–330, 2013.

[24] S. Pllana, S. Benkner, E. Mehofer, L. Natvig, and F. Xhafa. Towards an intelligent environment for programming multi-core computing systems. In Euro-Par 2008 Workshops - Parallel Processing, volume 5415 of Lecture Notes in Computer Science, pages 141–151. Springer Berlin Heidelberg, 2009.

[25] S. Pllana, S. Benkner, F. Xhafa, and L. Barolli. Hybrid performance modeling and prediction of large-scale computing systems. In Complex, Intelligent and Software Intensive Systems, 2008. CISIS 2008. Interna- tional Conference on, pages 132–138, March 2008.

[26] S. Pllana, I. Brandic, and S. Benkner. A Survey of the State of the Art in Performance Modeling and Prediction of Parallel and Distributed Com- puting Systems. International Journal of Computational Intelligence Research (IJCIR), 4(1):17–26, January 2008.

[27] J. Reif and S. Sahu. Autonomous programmable nanorobotic devices using dnazymes. In M. Garzon and H. Yan, editors, DNA Computing, volume 4848 of Lecture Notes in Computer Science, pages 66–78. Springer Berlin Heidelberg, 2008.

[28] G. Rizk, D. Lavenier, and R. Chikhi. Dsk: k-mer counting with very low memory usage. Bioinformatics, 29(5):652–653, 2013.

[29] M. Sandrieser, S. Benkner, and S. Pllana. Using Explicit Platform Descriptions to Support Programming of Heterogeneous Many-Core Systems. Parallel Computing, 38(1-2):52–56, January 2012.

[30] B. Soewito and N. Weng. Methodology for evaluating DNA pattern searching algorithms on multiprocessor. In Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE

2007, October 14-17, 2007, Harvard Medical School, Boston, MA, USA, pages 570–577, 2007.

[31] O. Villa, D. G. Chavarra-Miranda, and K. J. Maschhoff. Input- independent, scalable and fast string matching on the cray xmt. In IPDPS. IEEE, 2009.

  • I Introduction
  • II Regular Expression Matching (REM) with Finite Automata (FA)
  • III Design and Implementation of a DNA Analysis algorithm
    • III-A Our algorithm for counting and extracting the location of k-mers (k-mers CoEx)
    • III-B Implementation Aspects
      • III-B1 Building the Deterministic Finite Automaton
      • III-B2 Splitting the input and finding the possible starting states (PSS)
      • III-B3 Determining the Converging Point
  • IV Experimental Evaluation
    • IV-A Experimentation Environment
    • IV-B Results
  • V Related Work
  • VI Summary
  • References