Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4
→ Last time we covered policy evaluation in the model-free setting
How can an agent start making good decisions when it doesn't know how the world works? In other words, how do we make a "good decision"?
Learning to Control Involves...
- Optimization : we want to maximize expected rewards
- Delayed Consequences : it may take time to realize whether a previous action was good or bad
- Exploration : some exploration is required to discover potentially better actions and policies
Today we consider situations that fall into one of the two cases below:
→ MDP model is unknown but can be sampled
→ MDP model is known but computationally infeasible to use directly, except through sampling
On-Policy and Off-Policy Learning
| On-Policy | Learn from direct experience: estimate and evaluate a policy from experience obtained by following that policy |
| --- | --- |
| Off-Policy | Learn to estimate and evaluate a policy from experience obtained by following a different policy |
Generalized Policy Iteration
Let us recall policy iteration in the case where the model is known. You would:
Initialize policy $\pi$
Loop :
→ compute $V^\pi$ (evaluation)
→ update $\pi$ (improvement):
$\pi'(s)=\arg\max_a \left[ R(s,a)+\gamma \sum_{s' \in S} P(s'|s,a)V^\pi(s') \right]=\arg\max_a Q^\pi(s,a)$
Since there are at most $|A|^{|S|}$ deterministic policies, this loop terminates after a finite number of iterations.
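A minimal NumPy sketch of this model-based loop, assuming the MDP is given as arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (rewards); the names and the termination-by-policy-stability check are illustrative, not from the lecture:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_sweeps=200):
    """Model-based policy iteration on a tabular MDP.

    Assumed representation: P[s, a, s2] = P(s2 | s, a), R[s, a] = R(s, a).
    """
    num_S, num_A, _ = P.shape
    pi = np.zeros(num_S, dtype=int)                # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: sweep the Bellman expectation backup for V^pi
        V = np.zeros(num_S)
        for _ in range(eval_sweeps):
            V = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(num_S)])
        # Policy improvement: pi'(s) = argmax_a [ R(s,a) + gamma * sum_s2 P(s2|s,a) V(s2) ]
        Q = R + gamma * P @ V                      # shape (num_S, num_A)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):             # policy unchanged -> stop
            return pi, V
        pi = pi_new
```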
Monte-Carlo for On Policy Q Evaluation and Model-Free Policy Iteration
We first estimate the $Q$ values of the current policy (policy evaluation):
Initialize $N(s,a)=0, G(s,a)=0, Q^\pi(s,a)=0$
Loop :
→ Sample episode $i$ following $\pi$: $s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, \ldots, s_{i,T_i}$
→ For each $(s,a)$ first visited at time $t$ in episode $i$: increment $N(s,a)$ by 1, add the return $G_{i,t}$ to $G(s,a)$, and set $Q^\pi(s,a) = G(s,a)/N(s,a)$
Now that we have an estimate $Q^{\pi_i}(s,a)$, we can compute an updated policy:
$$
\pi_{i+1}(s)=\arg\max_a Q^{\pi_i}(s,a)
$$
Putting evaluation and improvement together, we iterate:
Initialize policy $\pi$
Loop :
→ compute $Q^\pi$ (evaluation)
→ update $\pi$ (improvement)
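A rough sketch of this model-free loop, assuming a hypothetical `sample_episode(pi)` helper that rolls out one full episode under the deterministic policy `pi` and returns a list of `(s, a, r)` tuples:

```python
import numpy as np

def mc_policy_iteration(sample_episode, num_S, num_A, gamma=1.0, iterations=1000):
    """On-policy first-visit Monte Carlo Q evaluation with greedy improvement.

    sample_episode(pi) is an assumed helper that follows pi (an array mapping
    state -> action) and returns a finished episode as [(s, a, r), ...].
    """
    N = np.zeros((num_S, num_A))        # visit counts N(s, a)
    G_sum = np.zeros((num_S, num_A))    # cumulative returns G(s, a)
    Q = np.zeros((num_S, num_A))        # Q^pi estimates
    pi = np.zeros(num_S, dtype=int)

    for _ in range(iterations):
        episode = sample_episode(pi)
        # Backwards pass: return G_t following each time step
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            returns[t] = G
        # First-visit update: use only the first occurrence of each (s, a)
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            N[s, a] += 1
            G_sum[s, a] += returns[t]
            Q[s, a] = G_sum[s, a] / N[s, a]
        # Greedy policy improvement
        pi = Q.argmax(axis=1)
    return pi, Q
```

Because $\pi$ here is purely greedy and deterministic, many $(s,a)$ pairs may never be visited; that is exactly the issue the next section addresses.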
Exploration
We want our estimate of $Q$ to be accurate for every state-action pair. However, if $\pi$ is deterministic, the agent will only ever take one specific action in each state, so the other actions are never evaluated. To address this we introduce the $\epsilon$-greedy policy.
The $\epsilon$-greedy policy selects

$$
a=\begin{cases}
\arg\max_{a'} Q^\pi(s,a') & \text{with probability } 1-\epsilon \\
\text{a uniformly random action} & \text{with probability } \epsilon \ (\text{i.e., } \epsilon/|A| \text{ per action})
\end{cases}
$$
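A minimal sketch of $\epsilon$-greedy action selection over a tabular `Q` array of shape `(num_states, num_actions)` (names assumed for illustration):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """Greedy action w.p. 1 - epsilon, uniformly random action w.p. epsilon.

    Each individual action is therefore taken with probability at least epsilon / |A|.
    """
    if rng is None:
        rng = np.random.default_rng()
    num_A = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_A))   # explore: uniform over all actions
    return int(np.argmax(Q[s]))           # exploit: greedy w.r.t. the current Q
```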
Example - MC for On Policy Q Evaluation and $\epsilon$-greedy policy
Initialize $N(s,a)=0, G(s,a)=0, Q^\pi(s,a)=0$
Ans. $Q^{\epsilon\text{-}\pi}(\cdot, a_1)=[1,0,1,0,0,0,0]$, $Q^{\epsilon\text{-}\pi}(\cdot, a_2)=[0,1,0,0,0,0,0]$
Define GLIE → Greedy in the Limit of Infinite Exploration
All state-action pairs are visited an infinite number of times.
The behavior policy (the policy actually being followed) converges in the limit to the greedy policy with respect to the current $Q$. For example, $\epsilon$-greedy with $\epsilon_i = 1/i$ satisfies GLIE.
Monte Carlo Control
GLIE Monte Carlo control converges to the optimal action-value function, $Q(s,a) \rightarrow Q^*(s,a)$. First-visit and every-visit MC are both fine here.
Temporal Difference Methods for Control
SARSA Algorithm
SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, $Q(s,a) \rightarrow Q^*(s,a)$, under the following conditions:
- the policy sequence $\pi_t(a|s)$ satisfies GLIE
- the step sizes $\alpha_t$ satisfy the Robbins-Monro conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$

In practice, however, the step-size condition is often not strictly enforced.
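For reference, the SARSA update bootstraps from the next action the behavior policy actually takes. A sketch assuming hypothetical environment helpers `env_reset() -> s` and `env_step(s, a) -> (r, s_next, done)`:

```python
import numpy as np

def sarsa(env_reset, env_step, num_S, num_A, alpha=0.1, gamma=0.99,
          epsilon=0.1, episodes=1000):
    """Tabular SARSA with a fixed epsilon-greedy behavior policy."""
    rng = np.random.default_rng()
    Q = np.zeros((num_S, num_A))

    def act(s):
        # Epsilon-greedy action selection w.r.t. the current Q
        if rng.random() < epsilon:
            return int(rng.integers(num_A))
        return int(Q[s].argmax())

    for _ in range(episodes):
        s = env_reset()
        a = act(s)
        done = False
        while not done:
            r, s_next, done = env_step(s, a)
            a_next = act(s_next) if not done else 0
            # On-policy TD target: uses the action a_next actually taken next
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```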
Off-policy Control with Q-Learning
We can estimate the value of the optimal policy $\pi^*$ without knowing $\pi^*$, while following a different behavior policy.
The key idea is to maintain state-action value estimates and bootstrap from the value of the best future action: $Q(s,a) \leftarrow Q(s,a) + \alpha\big(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)$.
Q-Learning with $\epsilon$-greedy
Q-learning only updates the Q function for the state-action pair just visited. Even if a much higher reward is discovered later, a single update does not propagate that information back to earlier states, so in this sense Q-learning can be slower than Monte Carlo at propagating reward information (MC updates every visited pair with the full return).
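A sketch of tabular Q-learning using the same hypothetical `env_reset`/`env_step` helpers as in the SARSA sketch; the only change from SARSA is that the target bootstraps from $\max_{a'} Q(s',a')$ instead of the action the behavior policy takes next:

```python
import numpy as np

def q_learning(env_reset, env_step, num_S, num_A, alpha=0.1, gamma=0.99,
               epsilon=0.1, episodes=1000):
    """Tabular Q-learning: epsilon-greedy behavior, greedy (max) bootstrap target."""
    rng = np.random.default_rng()
    Q = np.zeros((num_S, num_A))
    for _ in range(episodes):
        s = env_reset()
        done = False
        while not done:
            # Behave epsilon-greedily ...
            a = int(rng.integers(num_A)) if rng.random() < epsilon else int(Q[s].argmax())
            r, s_next, done = env_step(s, a)
            # ... but bootstrap from the best action in s_next (off-policy target)
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```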
Que : Conditions sufficient to ensure that Q-learning with $\epsilon$-greedy converges to the optimal $Q^*$?
Ans. The algorithm must visit all $(s,a)$ pairs infinitely often, and the step sizes $\alpha_t$ must satisfy the Robbins-Monro conditions.
Que : Conditions sufficient to ensure that Q-learning with $\epsilon$-greedy converges to the optimal $\pi^*$?
Ans. The policy must additionally satisfy GLIE, along with the conditions above.
Maximization Bias
Double Learning→Double Q-Learning
As seen above, acting greedily with respect to estimated Q values can yield maximization bias: the max of noisy estimates overestimates the true maximum.
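A tiny simulation of this effect under a made-up setup: two actions whose true values are both 0, each estimated from a few noisy samples drawn from $N(0,1)$. The max over the noisy estimates comes out clearly positive, while the split-sample (double) estimator described just below does not:

```python
import numpy as np

# Made-up setup: 2 actions, true Q values both 0, rewards drawn from N(0, 1)
rng = np.random.default_rng(0)
trials, samples = 10_000, 6

biased, split = [], []
for _ in range(trials):
    rewards = rng.normal(0.0, 1.0, size=(2, samples))
    q_hat = rewards.mean(axis=1)                  # noisy estimates of both actions
    biased.append(q_hat.max())                    # max over noisy estimates

    # Split-sample (double) estimator: one half selects, the other half evaluates
    q1 = rewards[:, : samples // 2].mean(axis=1)
    q2 = rewards[:, samples // 2 :].mean(axis=1)
    split.append(q2[q1.argmax()])

print(np.mean(biased))   # clearly > 0 even though every true value is 0
print(np.mean(split))    # approximately 0
```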
Instead of choosing the max of the estimates, we split the samples and create two independent unbiased estimates, $Q_1(s_1,a_i)$ and $Q_2(s_1,a_i)$:
→ Use one estimate to select the max action: $a^*=\arg\max_a Q_1(s_1,a)$
→ Use the other estimate to evaluate $a^*$: $Q_2(s_1,a^*)$
→ This yields an unbiased estimate: $E[Q_2(s_1,a^*)]=Q(s_1,a^*)$
The estimate is unbiased because independent samples are used to select the action and to estimate its value.
Double Q-Learning Algorithm
This process requires double the memory of the original Q-learning.
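A sketch of tabular Double Q-learning, again under the hypothetical `env_reset`/`env_step` helpers: on each step a coin flip decides which table selects the $\arg\max$ action and which table supplies the value used in the target:

```python
import numpy as np

def double_q_learning(env_reset, env_step, num_S, num_A, alpha=0.1, gamma=0.99,
                      epsilon=0.1, episodes=1000):
    """Tabular Double Q-learning with two independent estimates Q1 and Q2."""
    rng = np.random.default_rng()
    Q1 = np.zeros((num_S, num_A))
    Q2 = np.zeros((num_S, num_A))
    for _ in range(episodes):
        s = env_reset()
        done = False
        while not done:
            # Act epsilon-greedily w.r.t. the combined estimate Q1 + Q2
            if rng.random() < epsilon:
                a = int(rng.integers(num_A))
            else:
                a = int((Q1[s] + Q2[s]).argmax())
            r, s_next, done = env_step(s, a)
            # Coin flip: one table selects the action, the other evaluates it
            if rng.random() < 0.5:
                a_star = int(Q1[s_next].argmax())
                target = r if done else r + gamma * Q2[s_next, a_star]
                Q1[s, a] += alpha * (target - Q1[s, a])
            else:
                a_star = int(Q2[s_next].argmax())
                target = r if done else r + gamma * Q1[s_next, a_star]
                Q2[s, a] += alpha * (target - Q2[s, a])
            s = s_next
    return Q1, Q2
```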
As the figure above shows, Q-learning tends to choose actions whose outcomes are stochastic but worse in expectation over a deterministic action with a higher expected value. This happens because taking the max over noisy Q estimates inflates the estimated values of the stochastic actions (maximization bias), which Double Q-learning avoids.