Reinforcement Learning machine learning

alva6506

quiz.pptx

Home >Computer Science homework help >Reinforcement Learning machine learning

Final Exam Part 3

Fall 2021

Learning Traveling in a Maze

1	2	3	4	5
G	12			6
11	10	9	8	7

Assume that an agent is required to learn the optimal policy given the maze below. Assume that a deterministic Markov decision process exists.

except , =0.9

Use the Q-learning algorithm given to find the best policy.

Update Iteration-Table and Q-table as you are running the algorithm.

Enter “”, “”, and Q-value next to each state-node on the state-flow graph as the algorithm progress.

Show your calculation steps for updating Q-value.

State-flow Graph

Start state: 2, action: E. ⟨2,3,…,10,11,𝐺⟩

1	2	3	4	5
G	12			6
11	10	9	8	7

		a1=E	a2=W	a3=N	a4=S
STATES	1
	2
	3
	4
	5
	6
	7
	8
	9
	10
	11
	12
	G

t	s	a	r	s'	Q(s’,a1=E)	Q(s’,a2=W)	Q(s’,a3=N)	Q(s’,a4=S)	maxQ(s',a')	Q(s,a)
1	2	E

Iteration table

Q-table

Start state: 2, action: E. ⟨2,3,…,10,11,𝐺⟩

Show your calculation steps for updating Q-value.

t=1: s = 2, a= E

t=2: s = …, a= …

t=.. : s = …, a= …

…

Start state: 7, action: W.

1	2	3	4	5
G	12			6
11	10	9	8	7

		a1=E	a2=W	a3=N	a4=S
STATES	1
	2
	3
	4
	5
	6
	7
	8
	9
	10
	11
	12
	G

t	s	a	r	s'	Q(s’,a1=E)	Q(s’,a2=W)	Q(s’,a3=N)	Q(s’,a4=S)	maxQ(s',a')	Q(s,a)
1	7	W

Iteration table

Q-table

Start state: 7, action: W.

Show your calculation steps for updating Q-value.

…

t=1: s = …, a= …

t=2: s = …, a= …

t=…: s = …, a= …

Start state: 5, action: W.

		a1=E	a2=W	a3=N	a4=S
STATES	1
	2
	3
	4
	5
	6
	7
	8
	9
	10
	11
	12
	G

t	s	a	r	s'	Q(s’,a1=E)	Q(s’,a2=W)	Q(s’,a3=N)	Q(s’,a4=S)	maxQ(s',a')	Q(s,a)
1	5	W

Iteration table

Q-table

1	2	3	4	5
G	12			6
11	10	9	8	7

Start state: 5, action: W.

Show your calculation steps for updating Q-value.

…

t=1: s = …, a= …

t=2: s = …, a= …

t=…: s = …, a= …

Start state: 9, action: W.

1	2	3	4	5
G	12			6
11	10	9	8	7

		a1=E	a2=W	a3=N	a4=S
STATES	1
	2
	3
	4
	5
	6
	7
	8
	9
	10
	11
	12
	G

t	s	a	r	s'	Q(s’,a1=E)	Q(s’,a2=W)	Q(s’,a3=N)	Q(s’,a4=S)	maxQ(s',a')	Q(s,a)
1	9	W

Iteration table

Q-table

Start state: 9, action: W.

Show your calculation steps for updating Q-value.

…

t=1: s = …, a= …

t=2: s = …, a= …

t=…: s = …, a= …

Optimal Policy

1	2	3	4	5
G	12			6
11	10	9	8	7

Complete the Q-table that can give an optimal policy. (see note-1)

Obtain a optimal policy for this environment:

Draw the optimal path on the maze using proper action-arrow.

Draw the optimal path on the state-flow graph. Color the optimal path edges in red with appropriate arrow-direction.

Enter the action-value Q(s,a) into the box next to each node.

action-arrows

		a1=E	a2=W	a3=N	a4=S
STATES	1
	2
	3
	4
	5
	6
	7
	8
	9
	10
	11
	12
	G

Q-table

Note-1: You do not have to show your calculations for all other episodes to complete the Q-table.

Learning Traveling in a Maze (Non-Deterministic)

1	2	3	4	5
G	12			6
11	10	9	8	7

Assume that an agent is required to learn the optimal policy given the maze below. Assume that a non-deterministic Markov decision process exists. , , , and for all other transitions.

except , =0.9

State-flow Graph

Use the Q-learning algorithm given to find the best policy.

Update Iteration-Table and Q-table as you are running the algorithm.

Enter “”, “”, and Q-value next to each state-node on the state-flow graph as the algorithm progress.

Show your calculation steps for updating Q-value.

Start state: 9, action: W.

1	2	3	4	5
G	12			6
11	10	9	8	7

		a1=E	a2=W	a3=N	a4=S
STATES	1
	2
	3
	4
	5
	6
	7
	8
	9
	10
	11
	12
	G

Q(s’,a1=E)

Q(s’,a2=W)

Q(s’,a3=N)

Q(s’,a4=S)

maxQ(s',a')

Q(s,a)

Iteration table

Q-table

Start state: 2, action: E.

Show your calculation steps for updating Q-value.

…

t=1: s = …, a= …

t=2: s = …, a= …

t=…: s = …, a= …

Start state: 9, action: W.

1	2	3	4	5
G	12			6
11	10	9	8	7

		a1=E	a2=W	a3=N	a4=S
STATES	1
	2
	3
	4
	5
	6
	7
	8
	9
	10
	11
	12
	G

Iteration table

Q-table

Q(s’,a1=E)

Q(s’,a2=W)

Q(s’,a3=N)

Q(s’,a4=S)

maxQ(s',a')

Q(s,a)

Start state: 2, action: E.

Show your calculation steps for updating Q-value.

…

t=1: s = …, a= …

t=2: s = …, a= …

t=…: s = …, a= …

Optimal Policy (Non-Deterministic)

1	2	3	4	5
G	12			6
11	10	9	8	7

Complete the Q-table that can give an optimal policy. (See note-1)

Obtain a optimal policy for this environment:

Draw the optimal path on the maze using proper action-arrow.

Draw the optimal path on the state-flow graph. Color the optimal path edges in red with appropriate arrow-direction.

Enter the action-value Q(s,a) into the box next to each node.

action-arrows

		a1=E	a2=W	a3=N	a4=S
STATES	1
	2
	3
	4
	5
	6
	7
	8
	9
	10
	11
	12
	G

Q-table

Note-1: You do not have to show your calculations for all other episodes to complete the Q-table.