Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4
→ Last time, we covered policy evaluation in the model-free setting.
How can an agent start making good decisions when it doesn't know how the world works? In other words, how do we make a "good decision"?
Learning to Control Involves...
- Optimization : we want to maximize expected rewards
- Delayed Consequences : it may take time to realize whether a previous action was good or bad
- Exploration : the agent must explore to discover potentially higher-reward behavior
We will consider today's setting as one of the following:
→ the MDP model is unknown, but can be sampled
→ the MDP model is known, but is too large to use directly except through sampling
On-Policy and Off-Policy Learning
| On-Policy | Learn from direct experience: estimate and evaluate a policy using experience obtained by following that same policy |
| --- | --- |
| Off-Policy | Learn to estimate and evaluate a policy using experience obtained by following a different policy |
Generalized Policy Iteration
Let us recall policy iteration in the model-based case (when the MDP model is known). You would:
Initialize policy $\pi_0$
Loop :
→ compute $Q^{\pi_i}$ (policy evaluation)
→ update $\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s,a)$ (policy improvement)
We iterate this loop until the policy stops changing.
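As a concrete sketch of this loop, here is a minimal tabular policy iteration in Python on a made-up 3-state, 2-action MDP. The transition tensor `P`, reward table `R`, and all names are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Toy MDP for illustration only: 3 states, 2 actions.
# P[a, s, s'] = transition probability, R[s, a] = expected reward.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

def policy_evaluation(pi, tol=1e-8):
    """Iteratively compute V^pi for a deterministic policy pi (array of actions)."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([R[s, pi[s]] + gamma * P[pi[s], s] @ V for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration():
    pi = np.zeros(n_states, dtype=int)
    while True:
        V = policy_evaluation(pi)                      # evaluation
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        pi_new = np.argmax(Q, axis=1)                  # improvement (greedy)
        if np.array_equal(pi_new, pi):
            return pi, Q
        pi = pi_new

print(policy_iteration())
```

The rest of the lecture replaces the `policy_evaluation` step (which needs the model) with model-free estimates of $Q^{\pi}$.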
Monte-Carlo for On-Policy Q Evaluation and Model-Free Policy Iteration
We will first evaluate the policy and build an estimate of $Q^{\pi}(s,a)$:
Initialize $N(s,a) = 0$, $Q^{\pi}(s,a) = 0$ for all $(s,a)$
Loop :
→ using $\pi$, sample an episode $s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T$
→ for each $(s,a)$ visited at time $t$, compute the return $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$
→ update the count and the running mean: $N(s,a) \leftarrow N(s,a) + 1$, $\;Q^{\pi}(s,a) \leftarrow Q^{\pi}(s,a) + \frac{1}{N(s,a)}\big(G_t - Q^{\pi}(s,a)\big)$
Now that we have an estimate of $Q^{\pi}$, we can use it to improve the policy.
Now we iterate evaluation and improvement:
Initialize policy $\pi$
Loop :
→ compute $Q^{\pi}$ with Monte-Carlo on-policy Q evaluation
→ update $\pi \leftarrow \text{greedy}(Q^{\pi})$, i.e. $\pi(s) = \arg\max_a Q^{\pi}(s,a)$
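A minimal sketch of the evaluation step in Python, assuming a Gymnasium-style discrete environment (`env.reset()` / `env.step()`) and a `policy(state) -> action` callable; the interface and names are assumptions for illustration, not from the lecture.

```python
from collections import defaultdict

def mc_q_evaluation(env, policy, gamma=0.99, n_episodes=1000):
    """First-visit Monte-Carlo estimate of Q^pi for a discrete Gym-style env."""
    Q = defaultdict(float)
    N = defaultdict(int)
    for _ in range(n_episodes):
        # Roll out one episode under the policy being evaluated.
        state, _ = env.reset()
        episode = []
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            done = terminated or truncated
            state = next_state
        # Walk backwards to turn rewards into returns G_t.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            episode[t] = (s, a, G)
        # First-visit updates: running mean of the returns.
        visited = set()
        for s, a, G in episode:
            if (s, a) not in visited:
                visited.add((s, a))
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```

Greedy improvement then just redefines `policy` to pick $\arg\max_a Q(s,a)$ before the next round of evaluation.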
Exploration
We want our estimate of $Q^{\pi}$ to be accurate for every $(s,a)$ pair, so the policy has to keep trying every action. A simple idea is the $\epsilon$-greedy policy: the policy takes either
→ the greedy action $\arg\max_a Q(s,a)$ with probability $1-\epsilon$, or
→ a uniformly random action with probability $\epsilon$.
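A tiny helper sketch (names are mine, chosen for illustration) showing how such a policy is typically implemented on top of a tabular Q estimate:

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon):
    """Pick a random action with probability epsilon, else a greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # Break ties randomly among the best actions.
    values = [Q[(state, a)] for a in range(n_actions)]
    best = max(values)
    return random.choice([a for a, v in enumerate(values) if v == best])
```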
Example - MC for On-Policy Q Evaluation with an $\epsilon$-greedy policy
Define GLIE → Greedy in the Limit of Infinite Exploration:
- All state-action pairs are visited an infinite number of times: $\lim_{i \to \infty} N_i(s,a) \to \infty$
- The behavior policy (the policy you are actually using to act) converges to the policy that is greedy with respect to the current $Q$ estimate: $\pi(a|s) \to \arg\max_a Q(s,a)$ with probability 1
A simple GLIE strategy is $\epsilon$-greedy where $\epsilon$ is decayed to zero, e.g. $\epsilon_i = 1/i$.
Monte Carlo Control

Either first-visit or every-visit Monte-Carlo updates work here.
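Putting the pieces together, here is a sketch of GLIE Monte-Carlo control ($\epsilon$ decayed as $1/k$, step size $1/N(s,a)$), again assuming a Gymnasium-style discrete environment; the function and parameter names are illustrative, not the lecture's.

```python
from collections import defaultdict
import random

def glie_mc_control(env, n_actions, gamma=0.99, n_episodes=5000):
    """GLIE Monte-Carlo control: epsilon-greedy with epsilon_k = 1/k,
    running-mean Q updates (alpha = 1/N(s,a))."""
    Q = defaultdict(float)
    N = defaultdict(int)
    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k  # decaying exploration -> GLIE
        # Sample one episode with the current epsilon-greedy policy.
        state, _ = env.reset()
        episode, done = [], False
        while not done:
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            done = terminated or truncated
            state = next_state
        # Every-visit incremental updates from the sampled returns.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```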
Temporal Difference Methods for Control
SARSA Algorithm
SARSA is on-policy TD control: after observing the transition $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ (hence the name), update
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big)$$
where both actions are chosen by the current (e.g. $\epsilon$-greedy) policy.
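A minimal SARSA sketch under the same assumed Gymnasium-style interface (hyperparameters and names are illustrative):

```python
from collections import defaultdict
import random

def sarsa(env, n_actions, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=5000):
    """On-policy TD control: bootstrap from the action the policy actually takes next."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        state, _ = env.reset()
        action = eps_greedy(state)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = eps_greedy(next_state)
            # TD target uses Q of the *next action actually taken*.
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```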
SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, $Q(s,a) \to Q^*(s,a)$, under the following conditions:
- the policy sequence $\pi_t(a|s)$ satisfies GLIE
- the step sizes $\alpha_t$ satisfy the Robbins-Monro conditions: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$
Empirically, though, the step-size condition is usually not enforced.
Off-policy Control with Q-Learning
We can estimate the value of the optimal policy $\pi^*$ using experience collected by following a different behavior policy $\pi_b$. The key idea is to maintain state-action value estimates $Q(s,a)$ and bootstrap from the best future action, regardless of which action the behavior policy actually takes next:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big)$$
Q-Learning with $\epsilon$-greedy Exploration
The behavior policy is $\epsilon$-greedy with respect to the current $Q$, while the update above always bootstraps from the greedy (max) action.
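A matching Q-learning sketch; compared with the SARSA sketch above, only the TD target changes (max over next actions instead of the next action actually taken). Same assumed interface and illustrative names.

```python
from collections import defaultdict
import random

def q_learning(env, n_actions, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=5000):
    """Off-policy TD control: act epsilon-greedily, bootstrap from the greedy action."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current Q.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Target policy: greedy (the max), regardless of what we do next.
            best_next = max(Q[(next_state, a)] for a in range(n_actions))
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```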
Q-learning only updates the Q value of the (state, action) pair just visited. Even if a much higher reward is discovered later in the episode, that information is not pushed back to earlier states on this update; it propagates backwards one step at a time over repeated visits. In this sense Q-learning can be slower at propagating reward information than Monte-Carlo, which updates every visited pair with the full return of the episode.
Que : What conditions are sufficient to ensure that Q-learning with $\epsilon$-greedy exploration converges to the optimal $Q^*$?
Ans. The algorithm must visit all $(s,a)$ pairs infinitely often, and the step sizes must satisfy the Robbins-Monro conditions.
Que : What conditions are sufficient to ensure that Q-learning with $\epsilon$-greedy exploration converges to the optimal policy $\pi^*$?
Ans. The $\epsilon$-greedy policy must satisfy GLIE, along with all of the conditions above.
Maximization Bias
Using the same noisy samples both to pick the maximizing action and to estimate its value gives an over-optimistic estimate: $\mathbb{E}\big[\max_a \hat{Q}(s,a)\big] \ge \max_a \mathbb{E}\big[\hat{Q}(s,a)\big] = \max_a Q(s,a)$, even when each individual $\hat{Q}(s,a)$ is unbiased.
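A quick numerical illustration (a toy setup of my own, not from the lecture): two actions both have true value 0, each estimated from a few noisy samples; each estimate is unbiased, but the max of the two estimates is positive on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_samples = 100_000, 5

# Two actions, both with true value 0; rewards are pure Gaussian noise.
est_a = rng.normal(0.0, 1.0, size=(n_trials, n_samples)).mean(axis=1)
est_b = rng.normal(0.0, 1.0, size=(n_trials, n_samples)).mean(axis=1)

print(np.mean(est_a))                     # ~0: each estimate is unbiased
print(np.mean(np.maximum(est_a, est_b)))  # clearly > 0: maximization bias
```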
Double Learning → Double Q-Learning
As seen above, a greedy policy with respect to estimated Q values can yield maximization bias.
Instead of taking the max over a single set of estimates, we split the samples and maintain two independent unbiased estimates, $Q_1(s,a)$ and $Q_2(s,a)$:
→ Use one estimate to select the max action : $a^* = \arg\max_a Q_1(s,a)$
→ Use the other estimate to estimate the value of $a^*$ : $Q_2(s, a^*)$
→ This yields an unbiased estimate : $\mathbb{E}\big[Q_2(s,a^*)\big] = Q(s,a^*)$
This works because the value of the selected action is estimated from samples that are independent of the ones used to select it.
Double Q-Learning Algorithm
On each step, one of the two estimates is chosen (e.g. at random) to be updated, using the other estimate to evaluate the bootstrapped action.
This requires double the memory of ordinary Q-learning (two Q tables), though the computation per step is the same.
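A sketch of that update rule under the same assumed Gym-style interface (a coin flip decides which table gets updated each step; names are illustrative):

```python
from collections import defaultdict
import random

def double_q_learning(env, n_actions, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=5000):
    """Two independent Q tables; one selects the argmax, the other evaluates it."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Act epsilon-greedily w.r.t. the sum of the two estimates.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            if random.random() < 0.5:
                # Update Q1: Q1 picks the action, Q2 evaluates it.
                a_star = max(range(n_actions), key=lambda a: Q1[(next_state, a)])
                target = reward + (0.0 if done else gamma * Q2[(next_state, a_star)])
                Q1[(state, action)] += alpha * (target - Q1[(state, action)])
            else:
                # Update Q2: Q2 picks the action, Q1 evaluates it.
                a_star = max(range(n_actions), key=lambda a: Q2[(next_state, a)])
                target = reward + (0.0 if done else gamma * Q1[(next_state, a_star)])
                Q2[(state, action)] += alpha * (target - Q2[(state, action)])
            state = next_state
    return Q1, Q2
```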

As the figure above shows, plain Q-learning tends to prefer an action that is suboptimal but has stochastic rewards over a deterministic but much better action. This is not because Q-learning inherently "favors" stochastic actions; it is maximization bias: the max over noisy value estimates overestimates the noisy branch, making it look artificially attractive, whereas Double Q-learning avoids this and settles on the better action much sooner.
'ML Study > Stanford CS234: Reinforcement Learning' 카테고리의 다른 글
Stanford CS234 Lecture 6 (0) | 2022.08.09 |
---|---|
Stanford CS234 Lecture 5 (0) | 2022.08.08 |
Stanford CS234 Lecture 3 (0) | 2022.08.05 |
Stanford CS234 Lecture2 (0) | 2022.08.05 |
Stanford CS234 Lecture 1 (2) | 2022.08.04 |
댓글