Reinforcement Learning machine learning

profilealva6506
quiz.pptx

Final Exam Part 3

Fall 2021

<enter your name and W# here>

Learning Traveling in a Maze

1 2 3 4 5
G 12     6
11 10 9 8 7

Assume that an agent is required to learn the optimal policy given the maze below. Assume that a deterministic Markov decision process exists.

except , =0.9

Use the Q-learning algorithm given to find the best policy.

Update Iteration-Table and Q-table as you are running the algorithm.

Enter “”, “”, and Q-value next to each state-node on the state-flow graph as the algorithm progress.

Show your calculation steps for updating Q-value.

State-flow Graph

6

G

7

5

4

8

9

10

11

12

2

1

3

Start state: 2, action: E. ⟨2,3,…,10,11,𝐺⟩

6

G

7

5

4

8

9

10

11

12

2

1

3

1 2 3 4 5
G 12     6
11 10 9 8 7

    a1=E a2=W a3=N a4=S
STATES 1
2
3
4
5
6
7
8
9
10
11
12
G        
t s a r s' Q(s’,a1=E) Q(s’,a2=W) Q(s’,a3=N) Q(s’,a4=S) maxQ(s',a') Q(s,a)
1 2 E

Iteration table

Q-table

Start state: 2, action: E. ⟨2,3,…,10,11,𝐺⟩

Show your calculation steps for updating Q-value.

t=1: s = 2, a= E

t=2: s = …, a= …

t=.. : s = …, a= …

6

G

7

5

4

8

9

10

11

12

2

1

3

Start state: 7, action: W.

1 2 3 4 5
G 12     6
11 10 9 8 7

6

G

7

5

4

8

9

10

11

12

2

1

3

    a1=E a2=W a3=N a4=S
STATES 1
2
3
4
5
6
7
8
9
10
11
12
G        
t s a r s' Q(s’,a1=E) Q(s’,a2=W) Q(s’,a3=N) Q(s’,a4=S) maxQ(s',a') Q(s,a)
1 7 W

Iteration table

Q-table

Start state: 7, action: W.

Show your calculation steps for updating Q-value.

t=1: s = …, a= …

t=2: s = …, a= …

t=…: s = …, a= …

6

G

7

5

4

8

9

10

11

12

2

1

3

Start state: 5, action: W.

    a1=E a2=W a3=N a4=S
STATES 1
2
3
4
5
6
7
8
9
10
11
12
G        
t s a r s' Q(s’,a1=E) Q(s’,a2=W) Q(s’,a3=N) Q(s’,a4=S) maxQ(s',a') Q(s,a)
1 5 W

Iteration table

Q-table

1 2 3 4 5
G 12     6
11 10 9 8 7

6

G

7

5

4

8

9

10

11

12

2

1

3

Start state: 5, action: W.

Show your calculation steps for updating Q-value.

t=1: s = …, a= …

t=2: s = …, a= …

t=…: s = …, a= …

6

G

7

5

4

8

9

10

11

12

2

1

3

Start state: 9, action: W.

1 2 3 4 5
G 12     6
11 10 9 8 7

6

G

7

5

4

8

9

10

11

12

2

1

3

    a1=E a2=W a3=N a4=S
STATES 1
2
3
4
5
6
7
8
9
10
11
12
G        
t s a r s' Q(s’,a1=E) Q(s’,a2=W) Q(s’,a3=N) Q(s’,a4=S) maxQ(s',a') Q(s,a)
1 9 W

Iteration table

Q-table

Start state: 9, action: W.

Show your calculation steps for updating Q-value.

t=1: s = …, a= …

t=2: s = …, a= …

t=…: s = …, a= …

6

G

7

5

4

8

9

10

11

12

2

1

3

Optimal Policy

1 2 3 4 5
G 12     6
11 10 9 8 7

Complete the Q-table that can give an optimal policy. (see note-1)

Obtain a optimal policy for this environment:

Draw the optimal path on the maze using proper action-arrow.

Draw the optimal path on the state-flow graph. Color the optimal path edges in red with appropriate arrow-direction.

Enter the action-value Q(s,a) into the box next to each node.

6

G

7

5

4

8

9

10

11

12

2

1

3

action-arrows

    a1=E a2=W a3=N a4=S
STATES 1
2
3
4
5
6
7
8
9
10
11
12
G        

Q-table

Note-1: You do not have to show your calculations for all other episodes to complete the Q-table.

11

Learning Traveling in a Maze (Non-Deterministic)

1 2 3 4 5
G 12     6
11 10 9 8 7

Assume that an agent is required to learn the optimal policy given the maze below. Assume that a non-deterministic Markov decision process exists. , , , and for all other transitions.

except , =0.9

6

G

7

5

4

8

9

10

11

12

2

1

3

State-flow Graph

Use the Q-learning algorithm given to find the best policy.

Update Iteration-Table and Q-table as you are running the algorithm.

Enter “”, “”, and Q-value next to each state-node on the state-flow graph as the algorithm progress.

Show your calculation steps for updating Q-value.

Start state: 9, action: W.

1 2 3 4 5
G 12     6
11 10 9 8 7

6

G

7

5

4

8

9

10

11

12

2

1

3

    a1=E a2=W a3=N a4=S
STATES 1
2
3
4
5
6
7
8
9
10
11
12
G        
t s a r s' Q(s’,a1=E) Q(s’,a2=W) Q(s’,a3=N) Q(s’,a4=S) maxQ(s',a') Q(s,a)
1 9 W

Iteration table

Q-table

Start state: 2, action: E.

Show your calculation steps for updating Q-value.

t=1: s = …, a= …

t=2: s = …, a= …

t=…: s = …, a= …

Start state: 9, action: W.

1 2 3 4 5
G 12     6
11 10 9 8 7

6

G

7

5

4

8

9

10

11

12

2

1

3

    a1=E a2=W a3=N a4=S
STATES 1
2
3
4
5
6
7
8
9
10
11
12
G        

Iteration table

Q-table

t s a r s' Q(s’,a1=E) Q(s’,a2=W) Q(s’,a3=N) Q(s’,a4=S) maxQ(s',a') Q(s,a)
1 9 W

Start state: 2, action: E.

Show your calculation steps for updating Q-value.

t=1: s = …, a= …

t=2: s = …, a= …

t=…: s = …, a= …

Optimal Policy (Non-Deterministic)

1 2 3 4 5
G 12     6
11 10 9 8 7

Complete the Q-table that can give an optimal policy. (See note-1)

Obtain a optimal policy for this environment:

Draw the optimal path on the maze using proper action-arrow.

Draw the optimal path on the state-flow graph. Color the optimal path edges in red with appropriate arrow-direction.

Enter the action-value Q(s,a) into the box next to each node.

6

G

7

5

4

8

9

10

11

12

2

1

3

action-arrows

    a1=E a2=W a3=N a4=S
STATES 1
2
3
4
5
6
7
8
9
10
11
12
G        

Q-table

Note-1: You do not have to show your calculations for all other episodes to complete the Q-table.

17