Reinforcement Learning machine learning
Final Exam Part 3
Fall 2021
<enter your name and W# here>
Learning Traveling in a Maze
| 1 | 2 | 3 | 4 | 5 |
| G | 12 | 6 | ||
| 11 | 10 | 9 | 8 | 7 |
Assume that an agent is required to learn the optimal policy given the maze below. Assume that a deterministic Markov decision process exists.
except , =0.9
Use the Q-learning algorithm given to find the best policy.
Update Iteration-Table and Q-table as you are running the algorithm.
Enter “”, “”, and Q-value next to each state-node on the state-flow graph as the algorithm progress.
Show your calculation steps for updating Q-value.
State-flow Graph
6
G
7
5
4
8
9
10
11
12
2
1
3
Start state: 2, action: E. ⟨2,3,…,10,11,𝐺⟩
6
G
7
5
4
8
9
10
11
12
2
1
3
| 1 | 2 | 3 | 4 | 5 |
| G | 12 | 6 | ||
| 11 | 10 | 9 | 8 | 7 |
| a1=E | a2=W | a3=N | a4=S | ||
| STATES | 1 | ||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| 6 | |||||
| 7 | |||||
| 8 | |||||
| 9 | |||||
| 10 | |||||
| 11 | |||||
| 12 | |||||
| G |
| t | s | a | r | s' | Q(s’,a1=E) | Q(s’,a2=W) | Q(s’,a3=N) | Q(s’,a4=S) | maxQ(s',a') | Q(s,a) |
| 1 | 2 | E | ||||||||
Iteration table
Q-table
Start state: 2, action: E. ⟨2,3,…,10,11,𝐺⟩
Show your calculation steps for updating Q-value.
t=1: s = 2, a= E
t=2: s = …, a= …
t=.. : s = …, a= …
…
6
G
7
5
4
8
9
10
11
12
2
1
3
Start state: 7, action: W.
| 1 | 2 | 3 | 4 | 5 |
| G | 12 | 6 | ||
| 11 | 10 | 9 | 8 | 7 |
6
G
7
5
4
8
9
10
11
12
2
1
3
| a1=E | a2=W | a3=N | a4=S | ||
| STATES | 1 | ||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| 6 | |||||
| 7 | |||||
| 8 | |||||
| 9 | |||||
| 10 | |||||
| 11 | |||||
| 12 | |||||
| G |
| t | s | a | r | s' | Q(s’,a1=E) | Q(s’,a2=W) | Q(s’,a3=N) | Q(s’,a4=S) | maxQ(s',a') | Q(s,a) |
| 1 | 7 | W | ||||||||
Iteration table
Q-table
Start state: 7, action: W.
Show your calculation steps for updating Q-value.
…
t=1: s = …, a= …
t=2: s = …, a= …
t=…: s = …, a= …
6
G
7
5
4
8
9
10
11
12
2
1
3
Start state: 5, action: W.
| a1=E | a2=W | a3=N | a4=S | ||
| STATES | 1 | ||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| 6 | |||||
| 7 | |||||
| 8 | |||||
| 9 | |||||
| 10 | |||||
| 11 | |||||
| 12 | |||||
| G |
| t | s | a | r | s' | Q(s’,a1=E) | Q(s’,a2=W) | Q(s’,a3=N) | Q(s’,a4=S) | maxQ(s',a') | Q(s,a) |
| 1 | 5 | W | ||||||||
Iteration table
Q-table
| 1 | 2 | 3 | 4 | 5 |
| G | 12 | 6 | ||
| 11 | 10 | 9 | 8 | 7 |
6
G
7
5
4
8
9
10
11
12
2
1
3
Start state: 5, action: W.
Show your calculation steps for updating Q-value.
…
t=1: s = …, a= …
t=2: s = …, a= …
t=…: s = …, a= …
6
G
7
5
4
8
9
10
11
12
2
1
3
Start state: 9, action: W.
| 1 | 2 | 3 | 4 | 5 |
| G | 12 | 6 | ||
| 11 | 10 | 9 | 8 | 7 |
6
G
7
5
4
8
9
10
11
12
2
1
3
| a1=E | a2=W | a3=N | a4=S | ||
| STATES | 1 | ||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| 6 | |||||
| 7 | |||||
| 8 | |||||
| 9 | |||||
| 10 | |||||
| 11 | |||||
| 12 | |||||
| G |
| t | s | a | r | s' | Q(s’,a1=E) | Q(s’,a2=W) | Q(s’,a3=N) | Q(s’,a4=S) | maxQ(s',a') | Q(s,a) |
| 1 | 9 | W | ||||||||
Iteration table
Q-table
Start state: 9, action: W.
Show your calculation steps for updating Q-value.
…
t=1: s = …, a= …
t=2: s = …, a= …
t=…: s = …, a= …
6
G
7
5
4
8
9
10
11
12
2
1
3
Optimal Policy
| 1 | 2 | 3 | 4 | 5 |
| G | 12 | 6 | ||
| 11 | 10 | 9 | 8 | 7 |
Complete the Q-table that can give an optimal policy. (see note-1)
Obtain a optimal policy for this environment:
Draw the optimal path on the maze using proper action-arrow.
Draw the optimal path on the state-flow graph. Color the optimal path edges in red with appropriate arrow-direction.
Enter the action-value Q(s,a) into the box next to each node.
6
G
7
5
4
8
9
10
11
12
2
1
3
action-arrows
| a1=E | a2=W | a3=N | a4=S | ||
| STATES | 1 | ||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| 6 | |||||
| 7 | |||||
| 8 | |||||
| 9 | |||||
| 10 | |||||
| 11 | |||||
| 12 | |||||
| G |
Q-table
Note-1: You do not have to show your calculations for all other episodes to complete the Q-table.
11
Learning Traveling in a Maze (Non-Deterministic)
| 1 | 2 | 3 | 4 | 5 |
| G | 12 | 6 | ||
| 11 | 10 | 9 | 8 | 7 |
Assume that an agent is required to learn the optimal policy given the maze below. Assume that a non-deterministic Markov decision process exists. , , , and for all other transitions.
except , =0.9
6
G
7
5
4
8
9
10
11
12
2
1
3
State-flow Graph
Use the Q-learning algorithm given to find the best policy.
Update Iteration-Table and Q-table as you are running the algorithm.
Enter “”, “”, and Q-value next to each state-node on the state-flow graph as the algorithm progress.
Show your calculation steps for updating Q-value.
Start state: 9, action: W.
| 1 | 2 | 3 | 4 | 5 |
| G | 12 | 6 | ||
| 11 | 10 | 9 | 8 | 7 |
6
G
7
5
4
8
9
10
11
12
2
1
3
| a1=E | a2=W | a3=N | a4=S | ||
| STATES | 1 | ||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| 6 | |||||
| 7 | |||||
| 8 | |||||
| 9 | |||||
| 10 | |||||
| 11 | |||||
| 12 | |||||
| G |
| t | s | a | r | s' | Q(s’,a1=E) | Q(s’,a2=W) | Q(s’,a3=N) | Q(s’,a4=S) | maxQ(s',a') | Q(s,a) | ||||
| 1 | 9 | W | ||||||||||||
Iteration table
Q-table
Start state: 2, action: E.
Show your calculation steps for updating Q-value.
…
t=1: s = …, a= …
t=2: s = …, a= …
t=…: s = …, a= …
Start state: 9, action: W.
| 1 | 2 | 3 | 4 | 5 |
| G | 12 | 6 | ||
| 11 | 10 | 9 | 8 | 7 |
6
G
7
5
4
8
9
10
11
12
2
1
3
| a1=E | a2=W | a3=N | a4=S | ||
| STATES | 1 | ||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| 6 | |||||
| 7 | |||||
| 8 | |||||
| 9 | |||||
| 10 | |||||
| 11 | |||||
| 12 | |||||
| G |
Iteration table
Q-table
| t | s | a | r | s' | Q(s’,a1=E) | Q(s’,a2=W) | Q(s’,a3=N) | Q(s’,a4=S) | maxQ(s',a') | Q(s,a) | ||||
| 1 | 9 | W | ||||||||||||
Start state: 2, action: E.
Show your calculation steps for updating Q-value.
…
t=1: s = …, a= …
t=2: s = …, a= …
t=…: s = …, a= …
Optimal Policy (Non-Deterministic)
| 1 | 2 | 3 | 4 | 5 |
| G | 12 | 6 | ||
| 11 | 10 | 9 | 8 | 7 |
Complete the Q-table that can give an optimal policy. (See note-1)
Obtain a optimal policy for this environment:
Draw the optimal path on the maze using proper action-arrow.
Draw the optimal path on the state-flow graph. Color the optimal path edges in red with appropriate arrow-direction.
Enter the action-value Q(s,a) into the box next to each node.
6
G
7
5
4
8
9
10
11
12
2
1
3
action-arrows
| a1=E | a2=W | a3=N | a4=S | ||
| STATES | 1 | ||||
| 2 | |||||
| 3 | |||||
| 4 | |||||
| 5 | |||||
| 6 | |||||
| 7 | |||||
| 8 | |||||
| 9 | |||||
| 10 | |||||
| 11 | |||||
| 12 | |||||
| G |
Q-table
Note-1: You do not have to show your calculations for all other episodes to complete the Q-table.
17