Help in Bioinformatic
Chapter 3 St. Clair & Visick, continued
A global alignment program in Ruby. The program uses essentially the same steps to fill in the dynamic programming table that we used manually above. Tracing backwards from the bottom right to the top left is done the same way we did manually, but we use a stack to allow keeping multiple optimal paths separated. Each time a backlink is followed an entry is placed on top of the todo-list stack. If there are 2 or 3 backlinks leaving a particular cell then 2 or 3 entries are placed on the todo-list stack. In this manner the alignment path is forked. All the path threads are followed until they reach a cell with no backlinks -- the cell in the top left corner.
The code of global Alignment
# global.rb
# Global Alignment.
# Scoring constants:
Match = 1
Mu = 0
Sigma = -1
# Help:
Help = <<HELP_MESSAGE
global.rb
To align 2 nucleotide sequences using the Needleman-Wunsch algorithm.
The scoring parameters are simplified to:
Match = #{Match}, Mismatch = #{Mu}, Gap = #{Sigma}.
Usage: global.rb [-options] sequence_1 sequence_2
Options:
-d Print out the scoring matrix for debugging.
-h Print out this help message.
HELP_MESSAGE
# See if help is needed.
if ARGV.grep(/^-h/).size > 0 or
ARGV.size - ARGV.grep(/^-[hd]/).size != 2
print Help
exit
end
# Decide if the scoring matrix should be printed. and get the input sequences.
debug = ARGV.delete('-d')
last_col = ARGV[0].size # 1st argument across the top.
last_row = ARGV[1].size # 2nd argument down the left side.
# Pad the sequences to make labels for row and column 0.
seq = ARGV.map {|s| (" " + s).split("")}
# Prepare the dynamic programming table as 2 matrices, one for the score and
# one for the backlinks. Initialize each matrix with each row with just
# the 1st column filled in. Backlink codes t: top, l: left, d: diagonal.
score = (0..last_row).inject([]) {|s,e| s << [e * Sigma]}
backlink = (0..last_row).inject([]) {|b,e| b << [e == 0 ? "":'t']}
(1..last_col).each do |col| # Rest of top row.
score[0][col] = col * Sigma
backlink[0][col] = 'l'
end
(1..last_row).each do |row|
(1..last_col).each do |col|
up = score[row - 1][col] + Sigma
left = score[row][col - 1] + Sigma
delta = seq[1][row] == seq[0][col] ? Match : Mu
diag = score[row - 1][col - 1] + delta
score[row][col] = [up,left,diag].max
# Mark all scores equal to the maximum.
backlink[row][col] = ""
backlink[row][col] += 't' if score[row][col] == up
backlink[row][col] += 'l' if score[row][col] == left
backlink[row][col] += 'd' if score[row][col] == diag
end
end
# Display the dynamic programming table if the debug option was requested.
# Drawing table grid with characters
if debug
puts "Dynamic programming table:"
puts ' ' * 4 + '|' + seq[0].map{|l| " #{l} |"}.join
grid_line = "----+" + ('-' * 6 + '+') * seq[0].size
puts grid_line
(0..last_row).each do |row|
puts " |" + (0..last_col).map{|col|
"#{backlink[row][col].split("").include?('d') ? "\\" : " "} " +
"#{backlink[row][col].split("").include?('t') ? "^" : " "} |"}.join
puts " #{seq[1][row]} |" + (0..last_col).map{|col|
"#{backlink[row][col].split("").include?('l') ? "<" : " "} " +
"%3d |" % score[row][col]}.join
puts grid_line
end
end
# Find the optimal alignment paths starting from the lower right corner.
todo_list = [[last_row,last_col,"",""]] # Entry (row,col,string0,string1)
done_list = [] # Entry (str0,str1)
while !todo_list.empty?
row,col,str0,str1 = todo_list.pop
if !backlink[row][col].empty? # If some back-link exists.
todo_list.push([row - 1,col - 1,
seq[0][col] + str0,seq[1][row] + str1]
) if backlink[row][col].split("").include?('d')
todo_list.push([row - 1,col,
'_' + str0,seq[1][row] + str1]
) if backlink[row][col].split("").include?('t')
todo_list.push([row,col - 1,
seq[0][col] + str0,'_' + str1]
) if backlink[row][col].split("").include?('l')
else
done_list.push([str0,str1])
end
end
# Print each of the alignments in done_list.
i = 1
while !done_list.empty?
str0,str1 = done_list.pop
puts "Alignment #{i}\n#{str0}\n#{str1}"
i += 1
end
Yields:
$ ruby global.rb
global.rb
To align 2 nucleotide sequences using the Needleman-Wunsch algorithm.
The scoring parameters are simplified to:
Match = 1, Mismatch = 0, Gap = -1.
Usage: global.rb [-options] sequence_1 sequence_2
Options:
-d Print out the scoring matrix for debugging.
-h Print out this help message.
$ ruby global.rb -d cacgtat cgca
Here is the output
Dynamic programming table:
| | c | a | c | g | t | a | t |
----+------+------+------+------+------+------+------+------+
| | | | | | | | |
| 0 |< -1 |< -2 |< -3 |< -4 |< -5 |< -6 |< -7 |
----+------+------+------+------+------+------+------+------+
| ^ |\ | |\ | | | | |
c | -1 | 1 |< 0 |< -1 |< -2 |< -3 |< -4 |< -5 |
----+------+------+------+------+------+------+------+------+
| ^ | ^ |\ |\ |\ | | | |
g | -2 | 0 | 1 |< 0 | 0 |< -1 |< -2 |< -3 |
----+------+------+------+------+------+------+------+------+
| ^ |\ ^ |\ ^ |\ | |\ |\ |\ |
c | -3 | -1 | 0 | 2 |< 1 |< 0 |< -1 |< -2 |
----+------+------+------+------+------+------+------+------+
| ^ | ^ |\ | ^ |\ |\ |\ | |
a | -4 | -2 | 0 | 1 | 2 |< 1 | 1 |< 0 |
----+------+------+------+------+------+------+------+------+
Alignment 1
cacgtat
__cgca_
Alignment 2
cacgtat
c__gca_
Alignment 3
cacgtat
cgc__a_