assignment as attached

profilehelpmeout11
20171015193247f_17_bme_6762_assignment_b2.docx

TUTORIAL SET I

MUTATION

Mutational implications on the residues in biosequences: Mutations can be classified several different ways. This tutorial will focus on sorting such mutations by their effect on the structure of DNA or a chromosome. For this categorization, mutations can be separated into two main groups, each with multiple specific types. The two general categories

are large-scale and small-scale mutations.

Small-scale mutations : These are those that effect the DNA at the molecular level by changing the

normal sequence of nucleotide base pairs. These types of mutations may occur during the process of DNA replication during either meiosis or mitosis. There are three possible small-scale mutations that may occur: Substitution, deletion and insertion as described below. The occurrence of substitutions, deletions and insertions is in general due to mutations . A mutation refers to an epoch wherein a DNA gene is damaged or changed (in such a way as to alter the genetic message carried by that gene). Relevant permanent alteration to the physical composition of a DNA gene (such that the genetic message being changed) is caused by an agent of substance called mutagen.

Large-scale mutations : These mutations effect entire portions of the chromosome. Some large-scale mutations effect only single chromosomes, others occur across nonhomologous pairs. Some large-scale mutations in the chromosome are analogous to the small-scale mutations in DNA; the difference is that for large-scale mutations entire genes or sets of genes are altered rather that only a single nucleotide of the DNA. Single chromosome mutations are most likely to occur by some error in the DNA replication stage of cell growth, and therefore could occur during meiosis or mitosis. Mutation involving multiple chromosomes is more likely to occur in meiosis during the crossing-over that occurs during the prophase I. Large scale mutations are deletion, duplication, inversion, insertion, translocation and non-disjunction types as will be explained later.

Mutation and derivatives: Mutation results in a change in DNA, usually in its sequence, the number of copies of a sequence that are present, how the DNA is arranged, or its location, (namely, at which chromosome). Use one or more of the following methods for mutating the design and build both - the resulting single strand and "duplex" relevant to a query sequence.

Small-scale mutation - definition:

(A) Point mutation: Substitute an individual base with another. It is a type of mutation that causes a single nucleotide base substitution, insertion, or deletion of the genetic material, DNA or RNA. Some common substitutions: A for C; A for G; C for T; G for T, A for T; G for C. A point mutant is an individual that is affected by a point mutation.

Illustration of three types of point mutations to a codon.

Schematic of a single-stranded RNA molecule illustrating a series of three-base codons.Each three-nucleotide codon corresponds to an amino acid when translated to protein.When one of these codons is changed by a point mutation, the corresponding amino acid of the protein is changed.

A point mutation, or single base modification, is a type of mutation that causes a change in a single nucleotide base via substitution, insertion, or deletion of the genetic material, DNA or RNA.

Substitution: A substitution is a mutation that exchanges one base for another (i.e., a change in a single "chemical letter" such as switching an A to a G). Such a substitution could, (i) change a codon to one that encodes a different amino acid and cause a small change in the protein produced. For example, sickle cell anemia is caused by a substitution in the beta-hemoglobin gene, which alters a single amino acid in the protein produced; (ii) change a codon to one that encodes the same amino acid and causes no change in the protein produced. These are called silent mutations and (iii) change an amino-acid-coding codon to a single "stop" codon and cause an incomplete protein. This can have serious effects since the incomplete protein probably may not be functionally useful.

Insertion: These are mutations in which extra base pairs are inserted into a new place in the DNA.

Deletion: These mutations are those in which a section of DNA is lost, or deleted - that is, deleting a segment of a sequence.

Frameshift: The term frameshift mutation indicates the addition or deletion of a base pair. Since protein-coding DNA is divided into trinucleotides, insertions and deletions can alter a gene so that its message is no longer correctly parsed. Such changes are called frameshifts.

(For example, consider the sentence, "The fat cat sat." Each word represents a codon. If we delete the first letter and parse the sentence in the same way, it doesn't make sense).

With frameshifts, a similar error occurs at the DNA level, causing the codons to be parsed incorrectly. This usually generates truncated proteins like "hef atc ats at", which are uninformative.

Transposition: Move a segment of the sequence from one place to the other in the overall order.

Duplication: Repeat a section of the sequence one or more times

Repeat induced point (RIP) mutations: These are recurring point mutations. RIP is a genome defense in fungi that hypermutates repetitive DNA. It is suggested that RIP limits the accumulation of transposable elements [M. E. Hood, M. Katawczik and T. Giraud: Repeat-induced point mutation and the population structure of transposable elements in Microbotryum violaceum. Genetics, 2005, vol. 170(3), 1081–1089].

Large-scale mutations – definitions

Deletion: Large-scale deletion is a single chromosome mutation. This involves the loss of one or more genes from the parent chromosome.

Duplication: Duplication is the addition of one or more genes that are already present in the chromosome. This is a single chromosome mutation.

Inversion: It involves inverting a segment of the sequence, say, a complete reversal of one or more genes within a chromosome. The genes are retained post-inversion, but its order is backwards from the parent chromosome. This is also a single chromosome mutation. That is, inversions refer to one type of genetic mutation that creates changes in a chromosome.

Insertion: Large-scale insertion involves multiple chromosomes. For this type of insertion, one or more genes are removed from one chromosome and inserted into another nonhomologous chromosome. This can occur by an error during the prophase-I of meiosis when the chromosomes are swapping genes to increase diversity.

Translocation: Translocation also involves multiple nonhomologous chromosomes. Here the chromosomes swap one or more genes with another chromosome.

Non-disjunction: A non-disjunction mutation does not involve any errors in DNA replication or crossing-over. Instead these mutations occur during the anaphase and telophase when the chromosomes are not separated properly into the new cells. Common non-disjunctions are missing or extra chromosomes. When gametes with non-disjunctions are produced during meiosis, it can result in an offspring with a monosomy or trisomy (referring to a missing or extra homologous chromosome).

Effects of mutations

The effects of mutations may range from nothing all the way to unviability of a cell. All mutations will affect the proteins created during protein synthesis; but not all mutations will have a significant impact on the final product. Such effects can also be distinct between the small-scale and large-scale mutations.

Conservative substitution: This refers to a nucleotide mutation, which alters the amino acid sequence of the protein, causing substitution of one amino acid with another, which has a side chain with similar charge/polarity characteristics. The size of the side chain may also be an important consideration. Conservative mutations are generally considered unlikely to profoundly alter the structure or function of a protein, but there are many exceptions

Non-conservative substitution: This corresponds to a mutation, which results in the substitution of one amino acid within a polypeptide chain with an amino acid belonging to a different physico-chemical property such as, polarity/charge group.

Convergent and parallel substitutions: In comparisons among orthologous proteins from a given set of species, convergent substitutions at a particular site refer to independent changes from different ancestral amino acids to the same derived amino acid. In the illustration (a) below, there is a change from G (the ancestral state) to T (the derived state) in one species, and a change from A to T in another species. The convergent substitutions are denoted by red bold lines.

Parallel substitutions at a site refer to independent changes from the same ancestral amino acid to the same derived amino acid. In the case of illustration (b), changes from A to T occurred in two different species. The parallel substitutions are denoted by red bold lines.

In sets of closely related species, parallelism is generally more common than convergence simply because - at any given site - close relatives will be more likely to share the same ancestral state prior to the occurrence of independent substitutions [J. F. Storz: Causes of molecular convergence and parallelism in protein evolution. Nature Reviews Genetics, 2016, vol.17, 239-250]

Coincidental substitutions: The occurrence of two substitutions at the same nucleotide site in two homologous sequences.

---------------------------------------------------------------------------------------------------------------------

Example A: Assume a hypothetical initial strand … AAAAGGGGTTTTGACC … and perform an insertion version of mutation with a sub-sequence inserted at an arbitrary location.

Solution

For the assumed strand, the insertion version of mutation say, for example with a sub-sequence ‘CCCC’ at an arbitrary location, will result in the following:

... AAAAGG CCCC GGTTTTGACC ...

-----------------------------------------------------------------------------------Example B: Suppose one or more ancestral sequences are given. Assuming different types of mutational changes occur as indicated, evaluate the outcomes

Presumed mutational

change type

Result on the sequence(s)

No change:

Retained as it is

… A C C C T A C G …

A C C C T A C G …

Single substitution

C A

… A C C C T A C G …

… A A A A T A A G…

Multiple sequential

Substitutions

G A T

… A A A A T A A G

… A A A A T A A C

… A A A A T A A T

Back substitution

C T C

… A A A A T A A C

… A A A A T A A T

… A A A A T A A C

Coincidental

substitutions: With reference to two homologous sequences,

two substitutions at the same nucleotide site

T G

Homolog sequence: Y1

… A A A T A A T…

Homolog sequence: Y2

… C A A T A A T…

Coincidental substitutions are shown bold: 

Y1* … A A A G A A T…

Y2* … C A A G A A T…

Parallel

substitutions

at a site: This refer to independent changes from the same ancestral amino acid to the same derived amino acid.

T C or G

Given homolog species Z:

Z: … G A A A C A A T

Parallel mutations are shown bold :

Z1*: … G A A A C A A C

Z2*: … G A A A C A A G

Convergent

substitutions:

Independent changes from different ancestral amino acids to the same derived amino acid.

Say two different ancestral AAs, Z1 and Z2 are considered:

Z1: … A A T G A T

Z2 : ... A A T

Independent changes from different ancestral amino acids to the same derived residue, T

-----------------------------------------------------------------------------------

Problems on mutational changes

Problem B.1

Construct a matrix of the set {A, C, T, G} to illustrate the characteristic of the transition and transversion mutations.

(Hint: You may use a score of 100 % to depict the element of the matrix pertinent to no mutation – for example for 100% for A-to-A as shown; and, use prorated percentages to represent other elements illustrating the characteristic as above. The spontaneous base substitutions ratio of transitions to transversions is approximately 2:1. Therefore each transition should have a probability of 2/3 and each transversion 1/3).

Answer:

 

A

C

T

G

A

100%

C

T

G

-----------------------------------------------------------------------------------------------------------------------

Problem # B.2 (a)

A strand is presumably mutated at a location underlined in the sequences shown below. In each case, (i) write down the eventual resulting strand for the following additional mutations happening in succession:

1) Inversion of some subsequence part

2) Deletion of some subsequence part

3) Transpose of some subsequence part

4) Duplication of a base-pair in the sequence

5) Point mutation of one base into another

(Hint: The answers may depend on subjective selection as required)

a) …..TTAAGG GGGG CCTTTTGAAA….

Example answer: (5) GGGG → GGCG

(b) ….,, AAAAG GGGGGCCTTGGGACC….

(c) ……CCAAGG GTGT CCTTTTGAGG….

(ii) In each case, also write down the corresponding final resulting duplex strands at preRNA level

1. GGCG

Example answer: CCGC

2. AAATAA

3. TCCT

4. CGTTA

--------------------------------------------------------------------------------------------------------------------

Problem B. 2(b)

Consider an ancestral sequence:

…ACCCTAC …

Suppose a sequence of changes occur on this segment as follows: Single substitution, no change, multiple substitutions, back substitution, parallel substitutions, coincidental substitutions and convergent substitutions.

Write down two possible resulting sequences.

1) Initial - ACCCTAC

Single Sub - (A→G)CCCTAC (Example answer)

No Change -

Multiple Sub -

Back Sub -

Parallel Sub -

Coincidental Sub-

Convergent Sub -

2) Initial - ACCCTAC

Single Sub -

No Change -

Multiple Sub - A(C→T→A)CCTTC (Example answer)

Back Sub -

Parallel Sub -

Coincidental Sub-

Convergent Sub -

----------------------------------------------------------------------------------------------------------------------

Problem B.3

Given a sequence:

X: 5’ - TAC GGA TCG AAT GCT CCC GTA ATC – 3’

Suppose the following mutations have occurred in succession: A single point mutation, deletion of a triplet and duplication of a triplet twice in succession; and, the resulting complementary strand is found to be:

Y: 3’ – ATG CCT AAC TTA CGG CAT CAT CAT TAG – 5’

Find

Single Point Mutation – (Example answer: Highlight in YELLOW. In triplet 3, the second base C was changed to T. .... GGA TCG .... to .... CCT AAC ....)

Deletion of a Triplet – Highlight in RED.

Duplication of a Triplet – Highlight in GREEN.

Trace/identify all the mutational changes occurred from X to Y.

--------------------------------------------------------------------------------------------------------------------

SEQUENCE ALIGNMENT & SCORING: TUTORIAL

EXERCISES: Problems on sequence alignment and scoring

The following notations are used for the alignment status for a given pair:

Notations used to denote the alignment status between a pair of sequences

(i) Vertical bar Identical residues

(ii) One dot Somewhat similar residues

(iii) Two dots Very similar residues

EXAMPLE

x: AAGCTTACGCAAACCG

| · | : | | · | · · · :

y: GCTCACGGTTGCCACT

Problem B.4

(i) Apply the above notations for the alignment status for the given pair:

….L F D E L N R V V..........

| | | : : | . | .

….L F D D I N Q V L ……..

(ii) Denoting “s” for transition mutation and “v” for transversion mutation, determine

the sites at which s or v have occurred in the following test pair (a, b):

Example: Black brackets and V denote transversions, while Green S denotes transition

x: Q [Q D] [I L] F ....

S S

y: Q [D Q] [L V] V ....

V V

Test pair:

a: F Q D I L F R R D D I I I F Q L

b: F D Q L V V R E N D D D N Q F I

V V V V

Find transitions in y after all transversions have occurred.

V→I, V→F, E→R, D→I, N→I, N→L.

---------------------------------------------------------------------------------------------------------------------

Problems on basic pairwise alignment procedure

Example: Consider the following two short nucleotide sequences, each of seven residues only. Construct two possible alignments allowing two gaps. (A gap is defined as any maximal consecutive run of spaces in a single string of a given alignment. They facilitate creating alignments that better conform to underlying biological models and more closely and appropriately fit patterns vis-à-vis a meaningful alignment expected).

X: T A C C A G T

Y: C C C G T A A

Solution(s) :

(i)

X: T A C C A G T

Y: C C C G T A A

(ii)

X: T A C C A G T

Y: C C C G T A A

Problem B.5

Consider the following two short nucleotide sequences, each of 10 residues only. Construct two possible alignments allowing three gaps. (A gap is defined as any maximal consecutive run of spaces in a single string of a given alignment. They facilitate creating alignments that better conform to underlying biological models and more closely and appropriately fit patterns vis-à-vis a meaningful alignment expected).

X: A A C C A G T A AT

Y: T C C C T A A G T T

---------------------------------------------------------------------------------------------------------------------

Problems on scoring the alignments

Example Consider the following alignment:

S :x A T C G G A T G G A C

T : A C G G A A T C C

This alignment has four gaps containing a total of six spaces (). Further, it can be described as having five matches and two mismatches

Problem B.6

Consider the following two pairs of aligned nucleotide sequences (X, Y) and (U, V).

(i) Describe the alignments in each pair in terms of the counts on existing matches,

mismatches, gaps and spaces

(ii) Apply the following award/penalty scoring scheme and compare the alignment (X, Y) versus (U, V) in terms of the overall scores obtained in each case

Scoring scheme: Match: say, 100; Mismatch: (Purine Purine or Pyrimidine Pyrimidine) – Transition, say:75; (Purine Pyrimidine or Purine Pyrimidine)- Transversion: say 10; Space: − 50.

X: A G C C A T A T A

Y: A G G AC A A T T A

U: A G C C A T A T A

V: A G C A A T T A

Example answer for X, Y: This alignment has two gaps with three total spaces. It also has 5 matches and three mismatches. The score for this alignment would be (5 ×100) + (75 ×1) + (10 × 2) − (50 × 3) = 445

Deduce the score for: U : V (550)

---------------------------------------------------------------------------------------------------------------------

Problems on: Sequence similarity and notion of “distance”

Given two character strings, the measures of “distance” between them are: (i) Statistical distances (in Euclidian sense) such as, Mahalanobis distance and its variations. (ii) Hamming distance and (iii) Levenshtein distance (edit distance)

Hamming distance

The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. That is, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.

Examples

A G T C

C G T A Hamming distance (HD) = 2

KENTUCKY

TENTURKI

Hamming Distance = 3 (K/T, C/R, Y/I)

Edit distance

This refers to the edit distance (also known as Levenshtein distance) between two sequences expressed in terms of minimal number of operations (indels and substitutions) exercised to transform one sequence to another. This edit distance approximately specifies the number of DNA replications taken place across two sequences. That is, the Levenshtein distance (LD) is a string metric for measuring the difference between two sequences. Simply, the LD between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

Examples

A G – T C C

C G C T C A Levenshtein distance (LD) = 3

For the two sequences indicated below, determine the minimum number of edit operations required to transform one into the other.

X: A C C U G A

Y: A G C U A

Solution:

Step (i): Substitution of C by G in X: A G C U G A

Step (ii): Indel operation (Insertion of G) in Y: A G C U G A

Thus, there are (a minimum) of two edit operations required for the transformation. (Note that edit distance implies that all operations are exercised on only one sequence). Therefore, the Levenshtein score is: 2

---------------------------------------------------------------------------------------------------------------------

Problem # B.7

(i): Consider the following two sequences, namely:

X: A G T G G G CAT T C C T T T

Y: T C T A G A AT TT C T GT T

The following alignment is done with minimal editing permitting identity matches and transitions, [purines (A G) or pyrimidines (C T)] to stay. Determine the associated Levenshtein score.

X*: A G T G G G C A T T C C T T

Y*: T C T A G A T T C T G T T

Is there a possibility of better scoring feasible in aligning X and Y? If so indicate it..

(ii): Perform a visual comparison of each of the following putatively related pair of nucleic acid sequences x and y. Indicate in your answer the identities by 1s on match and 0s on mismatches. Also indicate mutations by the scoring indices (notations) S for transition and V for transversion:

Pair 1:

x: T CG [C T] G G C G C A A A C C G

1 0 0 0 0 1 0 1 0 0 0 0 0 0 0

y C C [T C] A G G G T T G C A A C A

(Answer Hint

Bases bracketed in red in sequence Y show transversions from red bracketed sequences in X.)

Pair 2

x: A A [G C A G T C] T C A A [A C G G]

1 0 1 0 0 0 0 1 0 0 0 0 0 0 0

y A [C T C A C G] T T T G [G G [A C]] C

(iii): Determine the edit distance in transforming one into the other of the following sequence pair:

X: T A C C A G T

Y: C C C G U A A

Problem # B.8

(a) Determine the Hamming distance between CUMBERLAND and TIMBERLAND

(b) Determine the Levenshtein distance between: (a) BIOINFORMATICS and BIOINFORMATION; (b) TELE-INFORMATICS and HAIL-FORMATION; (c) TELEINFORMATICS and C-O-NFORMATION

---------------------------------------------------------------------------------------------------------------------

Problem # B.9

The Hamming distance (HD) measurements between locally-aligned pair of nucleotide sequences (s and t) are as shown.

(a) I II III

s GGU AGCAA AGCACACA

t UGG ACAUA ACACACUA

HD(s,t) 2 3 6 (Answers?)

New HD is 2.

Considering the III group of the sequences s and t (that is, AGCACACA and ACACACUA), align the associated eight nucleotides by introducing two gaps in each sequence randomly so that the HD score (or cost) is optimally improved. (The gap denotes a deletion in a sequence or an insertion in the sequence being compared).

(b) I II III

s CCU UCGUU UCGUGUGU

t UCC UGUAU UGUGUGAU

HD(s,t) 2 3 6

New HD is 2.

Considering the III group of the sequences s and t (i.e., UCGUGUGU and UGUGUGAU), align the associated eight nucleotides by introducing two gaps in each sequence randomly so that the HD score (or cost) is optimally improved. (The gap denotes a deletion in a sequence or an insertion in the sequence being compared).

---------------------------------------------------------------------------------------------------------------------

Example

Assuming 0 Insertion{s) and 2 Deletion(s), determine the Hamming distance between:

s: AGACCA

t: CACACA

Positions: 3 5

7 1

Insertions: * *

* *

Deletions: C A

T A

Answer: HD(s,t) = 4

Problem # B.10

AGCAACCA

ACACACAT

Reference to the above example, obtain solutions for the following cases:

Assume: 1 Insertion{s) 1 Deletion(s)

==================================

Assume 0 Insertion{s) 2 Deletion(s)

==================================

Assume 1 Insertion{s) 1 Deletion(s)

==================================

Assume 1 Insertion{s) 1 Deletion(s)

==================================

Assume 0 Insertion{s) 2 Deletion(s)

==================================

Assume 0 Insertion{s) 2 Deletion(s)

---------------------------------------------------------------------------------------------------------------------

Problem # B.11

Find the Hamming distance between 2 sequences after applying 2 random insertions or deletions on a pair of original sequences (s, t) given as below:

s: TGCACACC

t: TCACACTC

s’:TGCAACTC

t’:TGCAACTC

(HD is 0?)

==================================

Problem # B. 12

Given the following two binary sequences X and Y, plot the HD between as a function of binary digit locations in the 0 to (about) 100 binary residues listed below. Hence indicate the most common substring locations between them.

X: 0110111001011010011 0110 0011111110100101011111011110011010101110

0011011100010100110100011111001111

Y: 1010001010011011100101100011111110100101011111011110011010100001

011010001111110 00101110011110101111

(Hint: Select a window of size 4. For each window, calculate HD. Plot window # versus HD

Note: In the case of binary strings, the HD is decided by XOR operation across the two residues one below the other in the strings; and the number of 1’s are counted in each window in the resulting XOR output string, depicting the HD )

Problem # B.13

For the two binary sequences X and Y, indicated above in Problem B. 12, plot the Kullback-Leibler (KL) measure between the strings. Hence confirm the most common substring locations between them as decided via HD measure in the previous problem.

(Hint: Again, select a window of size 4. For a given sequence in each window, calculate KL measure. Plot window # versus KL = KL1 + KL2 for each string

KL1 = (p(0)loge[(p(0)/q(1)])window#1 + ….

KL2 = (q(1)loge[(q(1)/p(0)])window#1 + ….

p(0): Probability of 0 in that window; q(1): Probability of 1 in that window)

-----------------------------------------------------------------------------------------------------------------------------

SEQUENCE ALIGNMENT: NW & SW ALGORITHMS

Implementing NW- and SW-algorithms

Outlined earlier are details on sequence alignments of interest in bioinformatics; and, pertinent global- and local-alignment algorithms conforming to (i) a global optimization strategy that enforces the alignment across the entire span of all the query sequences and (ii) local alignment strategy that identifies alignments only in the locally-significant segments of similarity (within the long sequences) are explained. Relevant computational schemes based on dynamic programming are indicated as the NW-algorithm for global alignment and SW-algorithm for local alignment. Presented in this tutorial are exemplars of such algorithms using necessary hypothetical sequences; also, pursuing the exemplar set of solved exercises, a set of problems (with solution hints/answers if needed)are presented.

Sequence alignment - global, local and glocal versions: Examples

To illustrate the differences in the post-alignment results of global, local and glocal alignments enforced on a pairs of sequences, the following examples are furnished

EXAMPLE: This example is indicated to illustrate the differences in the post-alignment results pertinent to global versus local alignments performed on a pairs of sequences. Suppose a pair of hypothetical amino acid sequences X and Y subjected to alignment procedure is as follows:

X: AGPSSKQNGKPSSRIWDN

Y: ANITKSAGKPAIMRLGDD

The results of global and local alignments of X and Y are shown below so as to understand the underlying differences. (The results shown are obtained using NW- and SW- algorithms as explained in later examples; and, performing those algorithms on X and Y given above will be indicated as problem exercises to solve).

1. Global alignment: Result of aligning X and Y over their entire lengths via NW algorithm. When the alignment is completed, both sequences are same length.

X: A G P S S K Q N G K P S − S R I W D N

| | | | | | |

Y: A N − I T K S A G K P A I M R LG D D

2. Local alignment: Result of aligning X and Y via SW algorithm showing the longest or best subsequence pair that has maximum similarity

X: − − − − − − −N G K P − − − − − − − −

| | |

Y: − − − − − − −A G K P − − − − − − − −

EXAMPLE: The results of global and glocal alignments of two nucleotide sequences u and v are presented below to understand the underlying differences.

u: TGTCTGTGGGTGG

v: TGCTTG

The results of global and local alignments of U and V are shown below to understand the underlying differences.

1. Global alignment

u: T G T C TG T G G G T G G

| | | | |

v: T GC − − T TG

T

G

T

C

T

G

T

G

G

G

T

G

G

T

G

C

T

T

G

2. Glocal (semiglobal) alignement

u: T G T C T GT G G G T G G

| | | |

v: T G C T T G

T

G

T

C

T

G

T

G

G

G

T

G

G

T

G

C

T

T

G

The result of (1) shown is obtained using NW-algorithm; and, the semi-global algorithm is also based on NW-algorithm modified as follows: Once the NW-algorithm based updating the values of the underlying matrix is completed, the trace-back is started at the greatest element of the last row of the alignment matrix (scores matrix); or last column, if there are more rows than columns. This is in contrast with NW-algorithm where the starting of the trace-back commences from the absolute last cell (leading cell) of the matrix.

Global sequence alignment: Implementation of NW algorithm

As explained before, the global alignment of sequences is based on dynamic programming. Relevant NW algorithm can be adopted to align protein or nucleotide sequences. The underlying optimal global alignment can be understood with the examples and exercises furnished below.

EXAMPLE: Given a set of sequence pairs, X: C T C G T and Y: C T A A G T, the problem is to determine the best global alignment between them via trace-back procedure using NW algorithm.

Solution

Global alignment refers to aligning sequences over their entire length; resulting in sequences of the same length. The global alignment of the sequences can be determined using the Needleman-Wunsch (NW) algorithm.

Step 1: Construct a matrix for the two sequences as shown.

X

C

T

C

G

T

Y

C

T

A

A

G

T

Step 2: Initialization: Cells representing identities are scored 1; and, cells representing mismatches are scored 0.

X

C

T

C

G

T

Y

C

1

0

1

0

0

T

0

1

0

0

1

A

0

0

0

0

0

A

0

0

0

0

0

G

0

0

0

1

0

T

0

1

0

0

1

Step 3: Add “dummy” columns (I2, I1, ...) and rows (J2, J1, ...) at the end of the matrix as illustrated; and, fill these columns and rows with zeros (shown with bold high-lighting). Mark the leading cell, LC. (The last corner cell on the right-side of the matrix designated as (i = I3, j = J3)th cell, as shown)

I7

I6

I5

I4

I3

Dummy

C

T

C

G

T

J8

C

1

0

1

0

0

I2

I1

J7

T

0

1

0

0

1

0

0

J6

A

0

0

0

0

0

0

0

J5

A

0

0

0

0

0

0

0

J4

G

0

0

0

1

0

0

0

J3

T

0

1

0

0

1

LC

0

0

Dummy

J2

0

0

0

0

0

0

J1

0

0

0

0

0

0

Step 4: Starting from the leading cell, LC, (designated as (i = I3, j = J3)th cell), perform the following procedures:

· Move to (I2, J2)th cell (diagonally going downward) and apply the following algorithm to update the value in LC:

Algorithm: Update the (leading) cell entry by adding the maximum value encountered at A: (i – 1, j – 1)th cell or along the three tracks, B or C as shown:

( Leading cell (i = I 3 , j = J 3 ) )

( C ) ( The maximum value seen at I 2 , I 1 , ..., etc. along the j th row )

1

( A )

0

0

0

0

0

0

0

0

0

( B ) ( The maximum value seen at: (I 2 = i – 1, J 2 = j – 1) th cell )

( The maximum value seen at J 2 , J 1 , ..., etc. along the j th column )

· Observed maximum value at A or along B and C: 0

· Existing value in the LC: 1

· Hence, the updated value for the leading cell is, therefore: (1 + 0) = 1.

Step 5: Use the same procedure and algorithm as above to update the value for each cell in the entire matrix pursuing the track-route indicated below:

(a) After updating the value in the LC, consider the other cells one-by-one along the row- J3 following the track moving leftward and encountering, the cells: I4, I5, ..., and I7. Corresponding updated values are shown bold in the matrix indicated below:

I7

I6

I5

I4

I3

Dummy

C

T

C

G

T

J8

C

1

0

1

0

0

I2

I1

J7

T

0

1

0

0

1

0

0

J6

A

0

0

0

0

0

0

0

J5

A

0

0

0

0

0

0

0

J4

G

0

0

0

1

0

0

0

J3

T

0

1

0

0

1

LC

0

0

Dummy

J2

0

0

0

0

0

0

J1

0

0

0

0

0

0

(b) Next considering the cells one-by-one, I3, I4, ..., and I7.. encountered along the upper row- J4 (by following again the track moving leftward), use the same procedure and algorithm as above to update the existing value in each cell. Corresponding updated values are shown bold in the matrix indicated below:

I7

I6

I5

I4

I3

Dummy

C

T

C

G

T

J8

C

1

0

1

0

0

I2

I1

J7

T

0

1

0

0

1

0

0

J6

A

0

0

0

0

0

0

0

J5

A

0

0

0

0

0

0

0

J4

G

1

1

1

2

0

0

0

J3

T

0

1

0

0

1

LC

0

0

Dummy

J2

0

0

0

0

0

0

J1

0

0

0

0

0

0

(c) Likewise, considering the cells one-by-one, I3, I4, ..., and I7, encountered along the next upper row- J5 (by following the track, moving leftward) and using the same procedure and algorithm indicated above the existing value in each cell is updated. Corresponding updated values are shown bold in the matrix indicated below:

I7

I6

I5

I4

I3

Dummy

C

T

C

G

T

J8

C

1

0

1

0

0

I2

I1

J7

T

0

1

0

0

1

0

0

J6

A

0

0

0

0

0

0

0

J5

A

2

2

2

1

0

0

0

J4

G

1

1

1

2

0

0

0

J3

T

0

1

0

0

1

LC

0

0

Dummy

J2

0

0

0

0

0

0

J1

0

0

0

0

0

0

(d) The aforesaid procedure is repeated for the cells one-by-one, I3, I4, ..., and I7. encountered along the next upper rows- J6 through J7 (by following the track, moving leftward); and, by using the procedure and algorithm indicated above, the existing value in each cell is updated. Corresponding updated values are shown bold in the following matrices:

I7

I6

I5

I4

I3

Dummy

C

T

C

G

T

J8

C

1

0

1

0

0

I2

I1

J7

T

0

1

0

0

1

0

0

J6

A

0

2

2

1

0

0

0

J5

A

2

2

2

1

0

0

0

J4

G

1

1

1

2

0

0

0

J3

T

0

1

0

0

1

LC

0

0

Dummy

J2

0

0

0

0

0

0

J1

0

0

0

0

0

0

I7

I6

I5

I4

I3

Dummy

C

T

C

G

T

J8

C

1

0

1

0

0

I2

I1

J7

T

0

3

2

1

1

0

0

J6

A

0

2

2

1

0

0

0

J5

A

2

2

2

1

0

0

0

J4

G

1

1

1

2

0

0

0

J3

T

0

1

0

0

1

LC

0

0

Dummy

J2

0

0

0

0

0

0

J1

0

0

0

0

0

0

(e) Lastly the procedure repeated for the cells one-by-one, I3, I4, ..., and I7 encountered along the next upper row- J8 (by following the track, moving leftward) completes the updating with the values shown bold in the following matrix

I7

I6

I5

I4

I3

Dummy

C

T

C

G

T

J8

C

4

2

3

1

0

I2

I1

J7

T

0

3

2

1

1

0

0

J6

A

0

2

2

1

0

0

0

J5

A

2

2

2

1

0

0

0

J4

G

1

1

1

2

0

0

0

J3

T

0

1

0

0

1

LC

0

0

Dummy

J2

0

0

0

0

0

0

J1

0

0

0

0

0

0

(f) The best global alignment is then determined using the back-tracing method as follows:

(i) Starting from the lead cell (LC), trace an upward diagonal path. This is done regardless if the cell corresponds to a match or a mismatch.

C

T

C

G

T

C

4

2

3

1

0

T

0

3

2

1

1

A

0

2

2

1

0

A

2

2

2

1

0

G

1

1

1

2

0

T

0

1

0

0

1

LC

(ii) This trace-path is continued until an island is met. An island refers to a set of 4 cells with 3 or more entries that are identical as shown with bold entries below.

C

T

C

G

T

C

4

2

3

1

0

T

2

3

2

1

1

A

2

2

2

1

0

A

2

2

2

1

0

G

1

1

1

2

0

T

0

1

0

0

1

(iii) From the island, there are two possible paths that can be taken:

1. First vertical, then diagonal

C

T

C

G

T

C

4

2

3

1

0

T

2

3

2

1

1

A

2

2

2

1

0

A

2

2

2

1

0

G

1

1

1

2

0

T

0

1

0

0

1

2. First horizontal, then diagonal:

C

T

C

G

T

C

4

2

3

1

0

T

2

3

2

1

1

A

2

2

2

1

0

A

2

2

2

1

0

G

1

1

1

2

0

T

0

1

0

0

1

Back-tracing: Summary:

1. Start from lead cell

2. Go diagonal from the cell to the next regardless of the entry in the new cell corresponds to a match or a mismatch

3. Continue on diagonal trace until an “island” shown with bold score entries is reached. (The island can be identified as the set of 4 cells with 3 or more entries are identical in the four cells)

4. Across the island, go “vertical” and continue on diagonal track; or, go “horizontal” and, then continue on the diagonal trace, as feasible.

The choice of picking one of the two paths depends on the number of cells ahead of the island - whichever path that has the more cells traced ahead should be the selected path. In the present example, taking the vertical path leads to four cells, whereas the horizontal path has only one ahead. Therefore, the vertical path is selected.

(g) Now, the given sequences can be aligned as per the following rules:

I. A vertical track implies introducing a gap in the X (upper) sequence at its site

II. A horizontal track implies introducing a gap in the Y (bottom) sequence at its site

Relevant to present example of sequences X: CTCGT and Y: CTAAGT, the trace-back is illustrated below

( C T C G T C T A A G T Gap )

Hence, the aligned sequences are written as follows, with the gap introduced. on sequence X:

X : C T – C G T

| | | |

Y : C T A A G T

(i) In summary, the NW-algorithm involves: (i) Setting up the matrix, (ii) updating the scores in the matrix cells and (iii) identifying the optimal alignments via a trace-back suite. (The four aspects of trace-back procedure are as follows: (a) Encountering identity of residues at a site between X and Y: Stay tracking along the diagonal; (b) mismatch of residues encountered at a site between X and Y: Stay tracking along the diagonal; (c) gap in top (X) sequence corresponds to tracking vertically in the island; and (d) gap in the bottom (Y) sequence corresponds to track horizontally in the island). Furnished below is an illustrative tutorial on trace-back procedure relevant to NW algorithm.

Thus, based on vertical or horizontal trace-pursuit at the island indicated above, the following can be specified as alignment rules:

· Outside the island where the diagonal pursuit is done, it implies a region where “identity” or ‘mismatch” of residues across the compared sequences exist without any gaps to exist at those sites

· Within the island, if a vertical pursuit is done, it implies a “gap” to be introduced at the site of the upper sequence, for example in U of a hypothetical pair of sequences, U and V

U: A A G – C T G

| | | |

V : A A T C G T G

· Within the island, if a horizontal pursuit is done, it implies a “gap” to be introduced at the site of the bottom sequence, for example in V of a hypothetical pair of sequences, U and V

U: G A G C C T A

| | | |

V : G A T – G T A

______________________________________________

EXAMPLE

Given the sequence pairs:

x: W F G Q E T S A I S

y: S F T Q F S E D A I

Perform NW-algorithm based comparison between the two sequences and elucidate the optimal pathway in aligning them globally.

Solution

Step 1: Develop a score matrix for the two sequences and do initialization of cells representing the identities with the score-value1 and the cells representing mismatches with score-value 0. The constructed initial matrix is shown below:

x

W

F

G

Q

E

T

S

A

I

S

y

S

0

0

0

0

0

0

1

0

0

1

F

0

1

0

0

0

0

0

0

0

0

T

0

0

0

0

0

1

0

0

0

0

Q

0

0

0

1

0

0

0

0

0

0

F

0

1

0

0

0

0

0

0

0

0

S

0

0

0

0

0

0

1

0

0

1

E

0

0

0

0

1

0

0

0

0

0

D

0

0

0

0

0

0

0

0

0

0

A

0

0

0

0

0

0

0

1

0

0

I

0

0

0

0

0

0

0

0

1

0

Step 2: Add “dummy” rows and columns at the left-end of the matrix and fill these columns and rows with zeros. Name the columns (i: I12, I13, ..., I3) and rows (j: J12, J11, ..., J3) as shown. Identify the corner-most cell (last cell of the matrix; not including dummy rows/columns) and designate it as the leading cell, LC specified with the coordinate (i, j).

I12

I11

I10

I9

I8

I7

I6

I5

I4

I3

I2

I1

W

F

G

Q

E

T

S

A

I

S

Dummy

J12

S

0

0

0

0

0

0

1

0

0

1

0

0

J11

F

0

1

0

0

0

0

0

0

0

0

0

0

J10

T

0

0

0

0

0

1

0

0

0

0

0

0

J9

Q

0

0

0

1

0

0

0

0

0

0

0

0

J8

F

0

1

0

0

0

0

0

0

0

0

0

0

J7

S

0

0

0

0

0

0

1

0

0

1

0

0

J6

E

0

0

0

0

1

0

0

0

0

0

0

0

J5

D

0

0

0

0

0

0

0

0

0

0

0

0

J4

A

0

0

0

0

0

0

0

1

0

0

0

0

J3

I

0

0

0

0

0

0

0

0

1

0

0

0

J2

Dummy

0

0

0

0

0

0

0

0

0

0

0

0

J1

0

0

0

0

0

0

0

0

0

0

0

0

Step 3: As described in the previous example, starting from the leading cell (LC: i, j), move to the S (i −1, j −1) cell (downward and diagonal). Then update the LC score value, S(i j) by adding the maximum of one of the three following observed value as per NW algorithm:

· S (i −1, j −1)

· Maximum of S (i − 1, j − 2), S(i −1, j − 3), ... (along I2, I1, ...), S (i −1, j − n)

· Maximum of S (i − 2, j −1), S(i − 3, j −1), ... ... (along J2, J1, ...), S (i − n, j −1)

This procedure is repeated for the entire cells one-by-one, encountered along the columns I3, I4, ..., and I12 and rows J3, J4, ..., and J12 and the updating of the scores completed leading to the values shown bold in the following matrix

X

Y

W

F

G

Q

E

T

S

A

I

S

S

5

4

4

4

4

3

3

1

1

1

F

4

5

4

4

4

3

2

1

1

0

T

4

4

4

3

3

4

2

1

1

0

Q

4

3

3

4

3

3

2

1

1

0

F

3

4

3

3

3

3

2

1

1

0

S

3

3

3

3

2

2

3

1

0

1

E

2

2

2

2

3

2

2

1

0

0

D

2

2

2

2

2

2

2

1

0

0

A

1

1

1

1

1

1

1

2

0

0

I

0

0

0

0

0

0

0

0

1

0

Step 4: Using the final score matrix, a trace-back from the LC is done via diagonal pursuit and resorting to vertical or horizontal path whenever an island is encountered. Shown below is the track prescribed thereof to the present problem:

X

Y

W

F

G

Q

E

T

S

A

I

S

S

5

4

4

4

4

3

3

1

1

1

F

4

5

4

4

4

3

2

1

1

0

T

4

4

4

3

3

4

2

1

1

0

Q

4

3

3

4

3

3

2

1

1

0

F

3

4

3

3

3

3

2

1

1

0

S

3

3

3

3

2

2

3

1

0

1

E

2

2

2

2

3

2

2

1

0

0

D

2

2

2

2

2

2

2

1

0

0

A

1

1

1

1

1

1

1

2

0

0

I

0

0

0

0

0

0

0

0

1

0: LC

The aligned sequence is therefore:

W F G Q E T S − − A I S

| | | | |

S F T Q F − S E D A I

---------------------------------------------------------------------------------------------------------------------

EXAMPLE

The following pair of sequences is indicated in [T. K. Attwood and D. J. Parry-Smith: Introduction to Bioinformatics. Pearson Education Ltd., Essex UK: 1999] and obtaining relevant global alignment via NW algorithm is hence described. The present exercise is to obtain the final score matrix given in [ ] for the test pair and verify the result on alignment as posted:

u: A D L G A V F A L C D R Y F Q

v: A D L G R T Q N C D R Y Y Q

Final gapped alignment shown in [ ] is:

u: A D L G A V F A L C D R Y F Q

| | | | | | | | |

v: A D L G R T Q N − C D R Y Y Q

--------------------------------------------------------------------------------------------------------------------

Local sequence alignment: Implementation of SW algorithm

The local alignment of sequences is based on dynamic programming using SW algorithm, which can be adopted to align protein and/or nucleotide sequences. The underlying optimal local alignment can be understood with the examples and exercises furnished below.

EXAMPLE

Given a set of sequence pairs, U and V as shown below, establish their best local alignment using SW algorithm.

(i) u: ...W R N D C Q E G S A... v: ...W G Q E G S I E A...

Solution

Application of Smith-Waterman (SW) algorithm towards aligning u and v and elucidating the local alignment conforms to the procedure with following steps:

(1) Construct an initial matrix framed with u and v residues as shown below.

(2) Add a set of edge elements x : “0” along the right-most column and the top-most row of

the matrix as shown. Inasmuch as the first row and first column cannot be an end-point of

any alignment, (x: 0’s) are introduced as indicated so as to serve as a dummy

placeholder

(3) Next, populate the cells corresponding to identical matches (of residues of u and v at the

cell site) scored with entries of “1”s; likewise, and cells corresponding to mismatches of residues scored with entries of “0”s

u

W

R

N

D

C

Q

E

G

S

A

v

x

0

0

0

0

0

0

0

0

0

0

W

0

1

0

0

0

0

0

0

0

0

0

G

0

0

0

0

0

0

0

0

1

0

0

Q

0

0

0

0

0

0

1

0

0

0

0

E

0

0

0

0

0

0

0

1

0

0

0

G

0

0

0

0

0

0

0

0

1

0

0

S

0

0

0

0

0

0

0

0

0

1

0

I

0

0

0

0

0

0

0

0

0

0

0

E

0

0

0

0

0

0

0

1

0

0

0

A

0

0

0

0

0

0

0

0

0

0

1

(4) Highlight all the match-score entries (1’s) bold. Relevant to each cell having match score entry (1) and marked bold, pursue the following three possible tracks to populate the rest of the cells with updated entries:

(a) Diagonal track: Suppose the cell having a match score entry (1) and marked bold. It is designated with a coordinate (i − 1, j − 1) and the diagonal tracking is done downward towards the cell: (i, j) as illustrated below. Suppose the score value at (i − 1, j − 1) is S(i − 1, j − 1) then the score on cell S(i, j) is decided as follows:

S(i, j) = S(i − 1, j − 1) + 1.0,

if a similarity (match) of u and v exists at the site, (i − 1, j − 1); otherwise,

S(i, j) = S(i − 1, j − 1) – 0.3,

if a dissimilarity (mismatch) of u and v exists at the site, (i − 1, j − 1).

In the above updating of scores, addition of 1.0 implies an “award” given to similarity-match observed; and, subtracting 0.3 refers to “penalty” given to dissimilarity-(mismatch) observed. (The value 0.3 is an approximation of 1/3 depicting the degree-of-freedom).

( (i − 1, j − 1) ) ( Award: + 1.0 Penalty: − 0.3 (i , j ) )

The above-said diagonal track score update procedure is illustrated below with an example: An nth cell with an existing score Sn gets an award of (+ 1) due to the identity of residues (A A) of u and v and its updated score, therefore becomes (Sn + 1). On the other hand, considering the mth cell as shown, with a score Sm takes a penalty of (− 0.3) due to the mismatch of residues (C T) of u and v, and, as such, its updated score, becomes (Sm − 0.3).

( v: T ≠ C ) ( S n − 1 S n + 1.0 S m − 0.3 u: A v: A u: C Award Penalty S m − 1 )

(5) The diagonal pursuit as per the above step is exercised at all those cells that shoe the score entry of “1s” as confirmed in the initialization; and, all the relevant cells in the diagonal pursuits are updated with the new scores. This diagonal path of updating the score is terminated when the computed score value becomes negative. At that cell and subsequently, the score entries along the diagonal path are rendered as “0s”. Further, this diagonal cell score-filling is discontinued when a cell having a positive score value (possibly, 1 as registered in the initialization) is encountered en passé

(6) The next step involves performing the following two algorithms with a horizontal pursuit along a row towards right and a vertical pursuit along a column downwards as illustrated below so as to populate the cells with updated entries.

Suppose (i, j) is any cell considered. Then, the horizontal track along the row from this cell as shown, leads to a set of sequential set of cells whose entries are updated with scores using the following algorithm:

S(i, j + k) = [S(i, j) − (1.0 + 0.3 × k)], k = 1, 2, ...

The k-value denoting the kth cell as shown is ended, when a negative value of the score results in; and thereupon, the subsequent cells are filled with 0 scores. This horizontal-track based filling is continued and eventually stopped when a match (identity)-value of 1 of the initial matrix is encountered on the row.

( S(i , j + 1 ) S(i , j + 2 ) S(i , j + k ) (i , j ) S(i , j ) )

Next, again considering a cell (i, j), the vertical track along the column from this cell as shown, leads to a set of sequential cells downward whose entries are updated with scores using the following algorithm:

S(i + , j) = [S(i, j) − (1.0 + 0.3 × )], = 1, 2, ...

The -value denotes the th cell as shown is ended, when a negative value of the score results in; and thereupon, fill the subsequent cells with 0 scores. This vertical-track based filling is continued and eventually stopped when a match-value of 1 of the initial matrix is encountered on the column.

( S( i + , j ) ) ( (i , j ) S(i + 1, j ) S(i + 2, j ) S(i , j ) )

In the above procedures, the values of k and denote penalty-lengths that specify the extent of penalties being imposed on the scores of the cells consistent with the possible deletions.

While proceeding along k or , if a negative value of the computed score results in, then the corresponding cell and the rest seen subsequently are filled with 0 scores. (It implies that there is no alignment similarity up to the current cell-position).

Further, score-filling horizontally (along the row) or vertically (along the column) is terminated when a cell with identity (similarity) score of 1 (registered in the initialization) is seen ahead en passé.

Thus, commencing from each of the cell (i − 1, j − 1) with initialized score entry S(i − 1, j − 1) = 1, the updating of score values is done via: (i) Diagonal passage from cell (i − 1, j − 1) to (i, j), (ii) horizontal path-lengths, k (along the row) or (iii) vertical path-lengths,  (along the column) as illustrated below:

( (i − 1, j − 1) ) ( (i , j ) k-path  -path Diagonal path )

(7) Now, considering the alignment exercise in hand, the diagonal, horizontal and vertical scoring procedures indicated above are performed in order to update the cell scores using the aforesaid algorithms pertinent to SW scheme of local alignment. The updated scores as computed are shown bold in the following matrix. The scored out values are the existing scores; and, all the pertinent diagonal pursuits are marked with arrows.

u

W

R

N

D

C

Q

E

G

S

A

x

v

0

0

0

0

0

0

0

0

0

0

W

0

1

0

0

0

0

0

0

0

0

0

G

0

0

0.66 0

0

0

0

0

0

1

0

0

Q

0

0

0

0.33 0

0

0

1

0

0

0.67 0

0

E

0

0

0

0

0

0

0

2 1

0.67 0

0.33 0

0.33 0

G

0

0

0

0

0

0

0

0.67 0

3 1

1.67 0

1.33 0

S

0

0

0

0

0

0

0

0.33 0

1.67 0

4 1

2.67 0

I

0

0

0

0

0

0

0

0

1.33 0

2.67 0

3.67 0

E

0

0

0

0

0

0

0

1

1

0

2.33 0

2.33 0

A

0

0

0

0

0

0

0

0

0.67 0

2.0 0

2.0 1

(8) Alignment via trace-back: The trace-back refers to commencement of a trace bottom-up from the largest score value observed on the final score matrix and proceeding diagonally upward as illustrated below relevant to scores high-lighted bold.

u

W

R

N

D

C

Q

E

G

S

A

v

x

0

0

0

0

0

0

0

0

0

0

W

0

1

0

0

0

0

0

0

0

0

0

G

0

0

0.66

0

0

0

0

0

1

0

0

Q

0

0

0

0.33

0

0

1

0

0

0.66

0

E

0

0

0

0

0

0

0

2

0.67

0.33

0.33

G

0

0

0

0

0

0

0

0.66

3

1.66

1.33

S

0

0

0

0

0

0

0

0.33

1.66

4

2.66

I

0

0

0

0

0

0

0

0

1.33

2.66

3.66

E

0

0

0

0

0

0

0

1

1.00

2.33

2.33

A

0

0

0

0

0

0

0

0

0.66

2

2

Hence, the final result on local alignment is as follows:

Q E G S

| | | |

Q E G S

---------------------------------------------------------------------------------------------------------------------

EXAMPLE

Given a set of sequence pairs, u and v as shown below, determine the best local alignment via trace-back method using SW-algorithm.

Given pair of sequences:

u: AASTHECWCTWH

v: AASRNPSCWTTWHT

Solution

u

A

A

S

T

H

E

C

W

C

T

W

H

v

x

0

0

0

0

0

0

0

0

0

0

0

0

A

0

1

1

A

0

1

1

S

0

1

R

0

N

0

P

0

S

0

1

C

0

1

1

W

0

1

1

T

0

1

1

T

0

1

1

W

0

1

1

H

0

1

1

T

0

1

1

Step 1: As indicated in the previous example, the starting point for the Smith-Waterman algorithm is to construct the matrix with given residue sequences and initialize it with edge-elements to x: 0 denoting the placeholder that accommodates the condition specified as follows: The first row and the first column of the matrix cannot form the endpoint of any specified alignment.

Step 2: Next, the cells in the matrix representing with identities of residues (between u and v) are scored 1; and, rest of the cells representing mismatches are scored 0. Shown below is the resulting matrix after Steps 1 and 2. (The mismatch values “0”s are omitted in the matrix illustration for clarity).

Step 3: Reference to the initialized matrix as above, the cells are filled with updated scores following the algorithm indicated in the last example.

25

Page 25 of 38

The final updated matrix is:

u

A

A

S

T

H

E

C

W

C

T

W

H

v

x

0

0

0

0

0

0

0

0

0

0

0

0

A

0

1

1

0

0

0

0

0

0

0

0

0

0

A

0

1

2

0.7

0.3

0

0

0

0

0

0

0

0

S

0

0

0.7

3

1.7

1.3

1

0.7

0.3

0

0

0

0

R

0

0

0.3

1.7

2.7

1.3

1

0.7

0.3

0

0

0

0

N

0

0

0

1.3

1.3

2.3

1

0.7

0.3

0

0

0

0

P

0

0

0

1

1

1

2

0.7

0.3

0

0

0

0

S

0

0

0

1

0.7

0.7

0.7

1.7

0.3

0

0

0

0

C

0

0

0

0

0.7

0.3

0.3

1.7

1.3

1.3

0

0

0

W

0

0

0

0

0

0.3

0

0

2.7

1.3

1

1

0

T

0

0

0

0

1

0

0

0

0

1.3

2.3

1

0.7

T

0

0

0

0

1

0.7

0

0

0

0

2.3

2

0.7

W

0

0

0

0

0

0.7

0.3

0

1

0

0

3.3

2

H

0

0

0

0

0

1

0.3

0

0

0.7

0

0

4.3

T

0

0

0

0

1

0

0.7

0

0

0

1.7

0

0

Step 4: Using the final score matrix, the trace-back is performed starting from the highest value (4.3), as shown. Relevantly, the diagonal pursuit is continued along the path having the highest values until an “island” is met. (The description of an island is given earlier in the exercise pertinent to NW algorithm). Here the trace is directed horizontal and then vertical direction as shown and then pursued diagonal to 2.3 and further. It implies introducing a gap between H and E residues of u and v respectively. Hence, the aligned sequences are written as follows:

A A S T H − E C W C T W H

| | | : | | | | |

A A S R N P S C W T T W H

And, the local-alignment segment is shown bold.

------------------------------------------------------------------------------------------------------------

EXAMPLE

With reference to the following pair of sequences perform SW-algorithm based comparison and elucidate locally significant, common regions of similarity.

u: W Y G Q E Q S Y I Q

v: W Y T Q E T S D I Q

Solution

Step 1: Pertinent to implementing Smith-Waterman algorithm, construct the initial score-matrix with edge-elements x: 0 being a placeholder as was done in the prior examples. Next, the cells representing identities are scored 1 and those representing mismatches are scored 0. The resulting matrix is shown below with the omission of 0 scores on mismatches for clarity.

u

W

Y

G

Q

E

Q

S

Y

I

Q

v

x

0

0

0

0

0

0

0

0

0

0

W

0

1

Y

0

1

1

T

0

Q

0

1

1

1

E

0

1

T

0

S

0

1

D

0

I

0

1

Q

0

1

1

1

Step 2: The cells are populated with updated scores via application of SW algorithm applied to, diagonal followed by row- and column-wise pursuits exercised at each cell showing the entry 1 (match scores). Hence, the resulting final score matrix is as follows:

I0

I1

I2

I3

I4

I5

I6

I7

I8

I9

I10

u

W

Y

G

Q

E

Q

S

Y

I

Q

J0

X

x

0

0

0

0

0

0

0

0

0

0

J1

W

0

1

0

0

0

0

0

0

0

0

0

J2

Y

0

0

2

0.7

0.3

0

0

0

1

0

0

J3

T

0

0

0.7

1.7

0.3

0

0

0

0

0.7

0

J4

Q

0

0

0.3

0.3

2.7

1.3

1

0.7

0.3

0

1.7

J5

E

0

0

0

0

1.3

3.7

2.3

2

1.7

1.3

1

J6

T

0

0

0

0

1

2.3

3.3

2

1.7

1.3

1

J7

S

0

0

0

0

0.7

1

2

4.3

3

2.7

2.3

J8

D

0

0

0

0

0.3

0.7

1.7

3

4

2.7

2.3

J9

I

0

0

0

0

0

0.3

1.3

2.7

2.7

5

3.7

J10

Q

0

0

0

0

1

0

1.3

2.3

2.3

3.7

6

Step 3: The trace-back pathway conforms to the passage that accumulates most matches. The common regions of similarity can be determined by referencing these matches as illustrated below:

I0

I1

I2

I3

I4

I5

I6

I7

I8

I9

I10

X

W

Y

G

Q

E

Q

S

Y

I

Q

J0

X

x

0

0

0

0

0

0

0

0

0

0

J1

W

0

1

0

0

0

0

0

0

0

0

0

J2

Y

0

0

2

0.7

0.3

0

0

0

1

0

0

J3

T

0

0

0.7

1.7

0.3

0

0

0

0

0.7

0

J4

Q

0

0

0.3

0.3

2.7

1.3

1

0.7

0.3

0

1.7

J5

E

0

0

0

0

1.3

3.7

2.3

2

1.7

1.3

1

J6

T

0

0

0

0

1

2.3

3.3

2

1.7

1.3

1

J7

S

0

0

0

0

0.7

1

2

4.3

3

2.7

2.3

J8

D

0

0

0

0

0.3

0.7

1.7

3

4

2.7

2.3

J9

I

0

0

0

0

0

0.3

1.3

2.7

2.7

5

3.7

J10

Q

0

0

0

0

1

0

1.3

2.3

2.3

3.7

6

The aligned sequence is as follows with the residues shown bold could be the locally-significant aligned pairs of interest.

W Y G Q E Q S Y I Q

| | | | | | |

W Y T Q E Q S D I Q

PROBLEMS/EXERCISES ON NW and SW ALIGNMENTS

EXAMPLE – NW Algorithm

Given a set of sequence pairs, U and V as indicated below. Determine the best global alignment via trace-back using NW algorithm.

U: C A C T H E T W V: C A C S C A T TW

Solution: Hand Calculation

C

A

C

T

H

E

T

W

C

4

2

2

0

0

0

0

0

A

2

3

1

0

0

0

0

0

C

3

2

2

0

0

0

0

0

S

2

2

1

0

0

0

0

0

C

2

1

2

0

0

0

0

0

A

1

2

1

0

0

0

0

0

T

1

1

1

1

0

0

1

0

T

0

0

0

1

0

0

1

0

W

0

0

0

0

0

0

0

1

Alignment =

CACT-HETW

|||: ||

CACSCATTW

-----------------------------------------------------------------------------------------------------------------------

Problem B.14

NW Algorithm

Given a set of sequence pairs, X and Y as indicated below. Determine in each case the best global alignment via trace-back using NW algorithm

(i) X: G A G C A Y: G A T T C A

Solution- Hint

The following is the solution on alignment:

GA-GCA

| | | |

GATTCA

---------------------------------------------------------------------------------------------------------------------

Problem B.15

NW Algorithm

Given the sequence pairs:

x: W F G Q F T S A I W

y: S S T Q F S E D A I

Perform NW-algorithm based comparison between the two sequences and elucidate the optimal pathway in aligning them globally.

-----------------------------------------------------------------------------------------------------------------------

Problem B. 16

Assigned is a pair of amino acid sequences (S and T). Determine the best global alignment

S: C U U A C G C A

T: A U G A G A A C U U

Solution Hint: Final alignment

S: C U U - A C G - - C - A

T: A U - G A - G A A C U U

-----------------------------------------------------------------------------------------------------------------------

Problem B.17

Given a sequence pairs, U and V as indicated below, determine the best global alignment via trace-back using NW algorithm

U: C T C G T V: C T A A G T

Answer:

(ii) Alignment =

CT-CGT

|| ||

CTAAGT

Problem B.18

Via hand calculations, perform NW-algorithm based comparison between the two given sequences indicated belowand elucidate the maximum path-way:

MA V R K L S L E G

M S T A L P G L G S

Problem B.19: Bonus Problem for extra credits

Via hand calculations, perform NW-algorithm based comparison between the two given sequences and elucidate the maximum path-way:

Sequence Pair

W F G Q E T S A I S

S F T Q F S E D A I

-----------------------------------------------------------------------------------------------------------------------

LOCAL ALIGNMENT: SW ALGORITHM

EXAMPLE

Determine the best local alignment via trace-back method of Smith-Waterman method

Step 1: Construct an n × m matrix and add the row of dummy '0's towards initialization

x

C

U

U

A

C

G

C

A

x

0

0

0

0

0

0

0

0

0

A

0

U

0

G

0

A

0

G

0

A

0

A

0

C

0

U

0

U

0

Step 2: Use the Smith-Waterman algorithm to fill table to find the best scoring path (blank spaces represent '0's).

x

C

U

U

A

C

G

C

A

x

0

0

0

0

0

0

0

0

0

A

0

1

1.0

U

0

1.0

1.0

0.7

G

0

0.7

0.7

1.3

A

0

1.7

0.3

1.0

1.0

G

0

0.3

1.3

1.0

0.7

A

0

1.0

1.0

0.7

1.0

A

0

1.0

0.7

0.7

1.0

C

0

1.0

0.7

0.3

1.0

0.3

U

0

2.0

1.3

0.3

0.7

U

0

1.3

3.0

Step 3: The row of dummy '0's is removed.

C

U

U

A

C

G

C

A

A

1.0

1.0

U

1.0

1.0

0.7

G

0.7

0.7

1.3

A

1.7

0.3

1.0

1.0

G

0.3

1.3

0.7

A

1.0

1.0

1.0

A

1.0

0.7

0.7

1.0

C

1.0

0.7

0.3

1.0

0.3

U

2.0

1.3

0.3

0.7

U

1.3

3.0

Results:

Best Local Alignment

C

U

U

C

U

U

EXAMPLE

Given the sequence pair, X and Y as indicated below, determine the best local alignment via trace-back in each case using SW algorithm.

X: P A W H E A E Y: H E A G E W G H E A

Hand Calculation

 

x

P

A

W

H

E

A

E

x

0

0

0

0

0

0

0

0

H

0

0

0

0

1

0

0

0

E

0

0

0

0

0

2

0.6667

1

A

0

0

1

0

0

0.6667

3

1.6667

G

0

0

0

0.6667

0

0.3334

1.6667

2.6667

E

0

0

0

0

0.3334

1

1.3334

1.3334

W

0

0

0

1

0

0

1.0001

1.0001

G

0

0

0

0

0.6667

0

0.6668

0.6668

H

0

0

0

0

1

0.3334

0.3335

0.3335

E

0

0

0

0

0

2

0

1

A

0

0

1

0

0

0.6667

3

1.6667

Answer: Alignment

W-HEA

| |||

WGHEA

-----------------------------------------------------------------------------------------------------------------------

EXAMPLE

Determine the common regions of similarity (that is, optimal local alignment) via SW algorithm

V S T V V L E N P G L G R A L S

M S T V V T P N P G L G K A S

x

V

S

T

V

V

L

E

N

P

G

L

G

R

A

L

S

x

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

M

0

S

0

1.0

T

0

2.0

V

0

1.0

0.7

3.0

V

0

1.0

0.7

0.3

1.7

4.0

T

0

0.7

1.7

0.3

2.7

3.7

P

0

0.3

1.3

2.3

2.3

3.3

N

0

1.0

2.0

4.3

P

0

0.7

5.3

G

0

0.3

6.3

L

0

7.3

G

0

8.3

K

0

8.0

A

0

9.0

S

0

8.7

Problem B.20

SW Algorithm

For the following pair of sequences u and v obtain the relevant local alignment via SW algorithm by obtaining the final score matrix outlined earlier:

u: A C A G C C U C G C U U A G

v: A A U G C C A U U G A C G G

Solution hint: The local alignment is:

... G C C − U C G ...

... G C C A U U G ...

-----------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------

Problem B.21

Given a sequence pairs, X and Y as shown below, determine the best local alignment via trace-back using SW algorithm.

X: W R N D C Q E G S A Y: W G Q E G S I E A

Answers

Alignment =

QEGH

||||

QEGH

Problem B.22 : Bonus problem for extra credits

Given a sequence pairs, U and V as shown below, determine the best local alignment via trace-back using SW algorithm.

U: AASTHECWCTWH V: AASRNPSCWTTWHT

Answers

Alignment =

AASTH-ECWCTWH

||| : || |||

AASRNPSCWTTWH

-----------------------------------------------------------------------------------------------------------------------

SUBMISSION (HARD COPY AND SOFT COPY) DUE DATE: BY 3rd November, 2017

BONUS CREDIT will be given for:

(a) Neat step-by-step demonstration of the problems

(b) Developing your own MatLab/C/C++ codes as necessary

(c) At least 2 hand-calculations of NW and SW problems indicated

________________________________________________________________________

9

38