Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4
→ Last time we covered policy evaluation in the model-free setting
How can an agent start making good decisions when it doesn't know how the world works? In other words, how do we make a "good decision"?
Learning to Control Involves...
- Optimization : we want to maximize expected rewards
- Delayed Consequences : it may take time to realize whether a previous action was good or bad
- Exploration : some exploration is required to discover potentially better actions and policies
Today we consider situations that fall into one of the two cases below:
→ MDP model is unknown but can be sampled
→ MDP model is known but computationally infeasible to use directly, except through sampling
On-Policy and Off-Policy Learning
| On-Policy | Learn from direct experience: estimate and evaluate a policy from experience obtained by following that policy |
| --- | --- |
| Off-Policy | Learn to estimate and evaluate a policy from experience obtained by following a different policy |
Generalized Policy Iteration
Let us recall policy iteration in the case where the model is known. You would:
Initialize policy $\pi$
Loop :
→ compute $V^\pi$ (evaluation)
→ update $\pi$ (improvement):
$\pi'(s)=\arg\max_a \left[ R(s,a)+\gamma \sum_{s' \in S} P(s'|s,a)V^\pi(s') \right]=\arg\max_a Q^\pi(s,a)$
Since there are at most $|A|^{|S|}$ deterministic policies, this loop terminates after a finite number of iterations.
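A minimal NumPy sketch of this model-based loop, assuming the MDP is given as arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (rewards); the names and the termination-by-policy-stability check are illustrative, not from the lecture:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_sweeps=200):
    """Model-based policy iteration on a tabular MDP.

    Assumed representation: P[s, a, s2] = P(s2 | s, a), R[s, a] = R(s, a).
    """
    num_S, num_A, _ = P.shape
    pi = np.zeros(num_S, dtype=int)                # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: sweep the Bellman expectation backup for V^pi
        V = np.zeros(num_S)
        for _ in range(eval_sweeps):
            V = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(num_S)])
        # Policy improvement: pi'(s) = argmax_a [ R(s,a) + gamma * sum_s2 P(s2|s,a) V(s2) ]
        Q = R + gamma * P @ V                      # shape (num_S, num_A)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):             # policy unchanged -> stop
            return pi, V
        pi = pi_new
```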
Monte-Carlo for On Policy Q Evaluation and Model-Free Policy Iteration
We first estimate the $Q$ values of the current policy (policy evaluation):
Initialize $N(s,a)=0, G(s,a)=0, Q^\pi(s,a)=0$
Loop :
→ Sample episode $i$ following $\pi$: $s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, \ldots, s_{i,T_i}$
→ For each $(s,a)$ first visited at time $t$ in episode $i$: increment $N(s,a)$ by 1, add the return $G_{i,t}$ to $G(s,a)$, and set $Q^\pi(s,a) = G(s,a)/N(s,a)$
Now that we have an estimate $Q^{\pi_i}(s,a)$, we can compute an updated policy:
$$
\pi_{i+1}(s)=\arg\max_a Q^{\pi_i}(s,a)
$$
Putting evaluation and improvement together, we iterate:
Initialize policy $\pi$
Loop :
→ compute $Q^\pi$ (evaluation)
→ update $\pi$ (improvement)
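A rough sketch of this model-free loop, assuming a hypothetical `sample_episode(pi)` helper that rolls out one full episode under the deterministic policy `pi` and returns a list of `(s, a, r)` tuples:

```python
import numpy as np

def mc_policy_iteration(sample_episode, num_S, num_A, gamma=1.0, iterations=1000):
    """On-policy first-visit Monte Carlo Q evaluation with greedy improvement.

    sample_episode(pi) is an assumed helper that follows pi (an array mapping
    state -> action) and returns a finished episode as [(s, a, r), ...].
    """
    N = np.zeros((num_S, num_A))        # visit counts N(s, a)
    G_sum = np.zeros((num_S, num_A))    # cumulative returns G(s, a)
    Q = np.zeros((num_S, num_A))        # Q^pi estimates
    pi = np.zeros(num_S, dtype=int)

    for _ in range(iterations):
        episode = sample_episode(pi)
        # Backwards pass: return G_t following each time step
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            returns[t] = G
        # First-visit update: use only the first occurrence of each (s, a)
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            N[s, a] += 1
            G_sum[s, a] += returns[t]
            Q[s, a] = G_sum[s, a] / N[s, a]
        # Greedy policy improvement
        pi = Q.argmax(axis=1)
    return pi, Q
```

Because $\pi$ here is purely greedy and deterministic, many $(s,a)$ pairs may never be visited; that is exactly the issue the next section addresses.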
Exploration
We want our estimate of $Q$ to be accurate for every state-action pair. However, if $\pi$ is deterministic, the agent will only ever take one specific action in each state, so the other actions are never evaluated. To address this we introduce the $\epsilon$-greedy policy.
The $\epsilon$-greedy policy selects

$$
a=\begin{cases}
\arg\max_{a'} Q^\pi(s,a') & \text{with probability } 1-\epsilon \\
\text{a uniformly random action} & \text{with probability } \epsilon \ (\text{i.e., } \epsilon/|A| \text{ per action})
\end{cases}
$$
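A minimal sketch of $\epsilon$-greedy action selection over a tabular `Q` array of shape `(num_states, num_actions)` (names assumed for illustration):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """Greedy action w.p. 1 - epsilon, uniformly random action w.p. epsilon.

    Each individual action is therefore taken with probability at least epsilon / |A|.
    """
    if rng is None:
        rng = np.random.default_rng()
    num_A = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_A))   # explore: uniform over all actions
    return int(np.argmax(Q[s]))           # exploit: greedy w.r.t. the current Q
```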
Example - MC for On Policy Q Evaluation and $\epsilon$-greedy policy
Initialize $N(s,a)=0, G(s,a)=0, Q^\pi(s,a)=0$
Ans. $Q^{\epsilon\text{-}\pi}(\cdot, a_1)=[1,0,1,0,0,0,0]$, $Q^{\epsilon\text{-}\pi}(\cdot, a_2)=[0,1,0,0,0,0,0]$
Define GLIE → Greedy in the Limit of Infinite Exploration
All state-action pairs are visited an infinite number of times.
The behavior policy (the policy actually being followed) converges in the limit to the greedy policy with respect to the current $Q$. For example, $\epsilon$-greedy with $\epsilon_i = 1/i$ satisfies GLIE.
Monte Carlo Control
GLIE Monte Carlo control converges to the optimal action-value function, $Q(s,a) \rightarrow Q^*(s,a)$. First-visit and every-visit MC are both fine here.
Temporal Difference Methods for Control
SARSA Algorithm
SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, $Q(s,a) \rightarrow Q^*(s,a)$, under the following conditions:
- the policy sequence $\pi_t(a|s)$ satisfies GLIE
- the step sizes $\alpha_t$ satisfy the Robbins-Monro conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$

In practice, however, the step-size condition is often not strictly enforced.
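For reference, the SARSA update bootstraps from the next action the behavior policy actually takes. A sketch assuming hypothetical environment helpers `env_reset() -> s` and `env_step(s, a) -> (r, s_next, done)`:

```python
import numpy as np

def sarsa(env_reset, env_step, num_S, num_A, alpha=0.1, gamma=0.99,
          epsilon=0.1, episodes=1000):
    """Tabular SARSA with a fixed epsilon-greedy behavior policy."""
    rng = np.random.default_rng()
    Q = np.zeros((num_S, num_A))

    def act(s):
        # Epsilon-greedy action selection w.r.t. the current Q
        if rng.random() < epsilon:
            return int(rng.integers(num_A))
        return int(Q[s].argmax())

    for _ in range(episodes):
        s = env_reset()
        a = act(s)
        done = False
        while not done:
            r, s_next, done = env_step(s, a)
            a_next = act(s_next) if not done else 0
            # On-policy TD target: uses the action a_next actually taken next
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```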
Off-policy Control with Q-Learning
We can estimate the value of the optimal policy $\pi^*$ without knowing $\pi^*$, while following a different behavior policy.
The key idea is to maintain state-action value estimates and bootstrap from the value of the best future action: $Q(s,a) \leftarrow Q(s,a) + \alpha\big(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)$.
Q-Learning with $\epsilon$-greedy
Q-learning only updates the Q function for the state-action pair just visited. Even if a much higher reward is discovered later, a single update does not propagate that information back to earlier states, so in this sense Q-learning can be slower than Monte Carlo at propagating reward information (MC updates every visited pair with the full return).
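A sketch of tabular Q-learning using the same hypothetical `env_reset`/`env_step` helpers as in the SARSA sketch; the only change from SARSA is that the target bootstraps from $\max_{a'} Q(s',a')$ instead of the action the behavior policy takes next:

```python
import numpy as np

def q_learning(env_reset, env_step, num_S, num_A, alpha=0.1, gamma=0.99,
               epsilon=0.1, episodes=1000):
    """Tabular Q-learning: epsilon-greedy behavior, greedy (max) bootstrap target."""
    rng = np.random.default_rng()
    Q = np.zeros((num_S, num_A))
    for _ in range(episodes):
        s = env_reset()
        done = False
        while not done:
            # Behave epsilon-greedily ...
            a = int(rng.integers(num_A)) if rng.random() < epsilon else int(Q[s].argmax())
            r, s_next, done = env_step(s, a)
            # ... but bootstrap from the best action in s_next (off-policy target)
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```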
Que : Conditions sufficient to ensure that Q-learning with $\epsilon$-greedy converges to the optimal $Q^*$?
Ans. The algorithm must visit all $(s,a)$ pairs infinitely often, and the step sizes $\alpha_t$ must satisfy the Robbins-Monro conditions.
Que : Conditions sufficient to ensure that Q-learning with $\epsilon$-greedy converges to the optimal $\pi^*$?
Ans. The policy must additionally satisfy GLIE, along with the conditions above.
Maximization Bias
Double Learning→Double Q-Learning
As seen above, acting greedily with respect to estimated Q values can yield maximization bias: the max of noisy estimates overestimates the true maximum.
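A tiny simulation of this effect under a made-up setup: two actions whose true values are both 0, each estimated from a few noisy samples drawn from $N(0,1)$. The max over the noisy estimates comes out clearly positive, while the split-sample (double) estimator described just below does not:

```python
import numpy as np

# Made-up setup: 2 actions, true Q values both 0, rewards drawn from N(0, 1)
rng = np.random.default_rng(0)
trials, samples = 10_000, 6

biased, split = [], []
for _ in range(trials):
    rewards = rng.normal(0.0, 1.0, size=(2, samples))
    q_hat = rewards.mean(axis=1)                  # noisy estimates of both actions
    biased.append(q_hat.max())                    # max over noisy estimates

    # Split-sample (double) estimator: one half selects, the other half evaluates
    q1 = rewards[:, : samples // 2].mean(axis=1)
    q2 = rewards[:, samples // 2 :].mean(axis=1)
    split.append(q2[q1.argmax()])

print(np.mean(biased))   # clearly > 0 even though every true value is 0
print(np.mean(split))    # approximately 0
```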
Instead of choosing the max of the estimates, we split the samples and create two independent unbiased estimates, $Q_1(s_1,a_i)$ and $Q_2(s_1,a_i)$:
→ Use one estimate to select the max action: $a^*=\arg\max_a Q_1(s_1,a)$
→ Use the other estimate to evaluate $a^*$: $Q_2(s_1,a^*)$
→ This yields an unbiased estimate: $E[Q_2(s_1,a^*)]=Q(s_1,a^*)$
The estimate is unbiased because independent samples are used to select the action and to estimate its value.
Double Q-Learning Algorithm
This process requires double the memory of the original Q-learning.
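A sketch of tabular Double Q-learning, again under the hypothetical `env_reset`/`env_step` helpers: on each step a coin flip decides which table selects the $\arg\max$ action and which table supplies the value used in the target:

```python
import numpy as np

def double_q_learning(env_reset, env_step, num_S, num_A, alpha=0.1, gamma=0.99,
                      epsilon=0.1, episodes=1000):
    """Tabular Double Q-learning with two independent estimates Q1 and Q2."""
    rng = np.random.default_rng()
    Q1 = np.zeros((num_S, num_A))
    Q2 = np.zeros((num_S, num_A))
    for _ in range(episodes):
        s = env_reset()
        done = False
        while not done:
            # Act epsilon-greedily w.r.t. the combined estimate Q1 + Q2
            if rng.random() < epsilon:
                a = int(rng.integers(num_A))
            else:
                a = int((Q1[s] + Q2[s]).argmax())
            r, s_next, done = env_step(s, a)
            # Coin flip: one table selects the action, the other evaluates it
            if rng.random() < 0.5:
                a_star = int(Q1[s_next].argmax())
                target = r if done else r + gamma * Q2[s_next, a_star]
                Q1[s, a] += alpha * (target - Q1[s, a])
            else:
                a_star = int(Q2[s_next].argmax())
                target = r if done else r + gamma * Q1[s_next, a_star]
                Q2[s, a] += alpha * (target - Q2[s, a])
            s = s_next
    return Q1, Q2
```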
As the figure above shows, Q-learning tends to choose actions whose outcomes are stochastic but worse in expectation over a deterministic action with a higher expected value. This happens because taking the max over noisy Q estimates inflates the estimated values of the stochastic actions (maximization bias), which Double Q-learning avoids.