
Stanford CS234 Lecture 4

by 누워있는말티즈 2022. 8. 5.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4

→ Last time we covered policy evaluation in the model-free setting.

How can an agent start making good decisions when it doesn't know how the world works: how do we make a "good decision"?


Learning to Control Involves...

  • Optimization: we want to maximize expected rewards
  • Delayed consequences: it may take time to realize whether a previous action was good or bad
  • Exploration: the agent must explore to discover potentially better actions

We will consider today's setting as one of the following:

→ MDP model is unknown but can be sampled

→ MDP model is known, but it is computationally infeasible to use directly except through sampling

On-Policy and Off-Policy Learning

| Learning type | Description |
| --- | --- |
| On-Policy | Learn from direct experience: estimate and evaluate a policy using experience obtained by following that same policy |
| Off-Policy | Learn to estimate and evaluate a policy using experience obtained by following a different policy |

Generalized Policy Iteration

Let us recall policy iteration in the case where the model is known. You would:

  • Initialize policy $\pi$

  • Loop :

    → compute $V^\pi$ (evaluation)

    → improve the policy using $V^\pi$:

    $\pi'(s)=\arg\max_a \left[ R(s,a)+\gamma \sum_{s' \in S} P(s'|s,a)V^\pi(s') \right]=\arg\max_a Q^\pi(s,a)$

In the worst case this loop runs for at most $|A|^{|S|}$ iterations, the number of distinct deterministic policies.
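For reference, here is a minimal sketch of one such improvement step when the model is known. The tabular array layout (`R[s, a]`, `P[s, a, s']`) and the function name are my own assumptions, not code from the lecture.

```python
import numpy as np

def policy_improvement(R, P, V, gamma):
    """One greedy policy-improvement step for a known tabular MDP.
    Assumes R[s, a] holds expected rewards, P[s, a, s'] transition
    probabilities, and V is the current value estimate V^pi."""
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s')
    Q = R + gamma * P @ V
    # pi'(s) = argmax_a Q(s, a)
    return Q.argmax(axis=1)
```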

Monte-Carlo for On Policy Q Evaluation and Model-Free Policy Iteration

We will first evaluate the policy by estimating its $Q$ values.

Initialize $N(s,a)=0, G(s,a)=0, Q^\pi(s,a)=0$

Loop :

→ sample episode $i$ ($s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, \ldots, s_{i,T_i}$) by following policy $\pi$

→ for each state-action pair $(s,a)$ visited in episode $i$, compute the return $G_{i,t}$ from its first visit at time $t$ and update

$$
N(s,a) \leftarrow N(s,a)+1,\qquad G(s,a) \leftarrow G(s,a)+G_{i,t},\qquad Q^\pi(s,a)=G(s,a)/N(s,a)
$$

Now that we have an estimate $Q^{\pi_i}(s,a)$, we can compute an improved policy:

$$
\pi_{i+1}(s)=\arg\max_a Q^{\pi_i}(s,a)
$$

Now we iterate our policy $\pi$

Initialize policy $\pi$

Loop :

→ compute $Q^\pi$ (evaluation, e.g. with Monte-Carlo as above; see the sketch below)

→ update $\pi$ greedily with respect to $Q^\pi$ (improvement)
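A minimal sketch of the Monte-Carlo evaluation step in this loop, assuming episodes are given as lists of `(state, action, reward)` tuples; the data format and names are my assumptions rather than the lecture's code.

```python
from collections import defaultdict

def mc_q_evaluation(episodes, gamma):
    """First-visit Monte-Carlo estimate of Q^pi from episodes sampled with pi.
    Each episode is a list of (state, action, reward) tuples."""
    N = defaultdict(int)        # visit counts N(s, a)
    G_sum = defaultdict(float)  # cumulative returns G(s, a)
    Q = defaultdict(float)      # Q^pi(s, a) = G(s, a) / N(s, a)
    for episode in episodes:
        # compute returns backwards, keeping the return of the *first* visit
        G = 0.0
        first_visit_return = {}
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            first_visit_return[(s, a)] = G  # earlier visits overwrite later ones
        for (s, a), G_t in first_visit_return.items():
            N[(s, a)] += 1
            G_sum[(s, a)] += G_t
            Q[(s, a)] = G_sum[(s, a)] / N[(s, a)]
    return Q
```

The improvement step is then $\pi_{i+1}(s)=\arg\max_a Q(s,a)$ over the actions observed in state $s$.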


Exploration

We want our estimate of $Q$ to be accurate. However, if $\pi$ is deterministic, the agent will only ever take one particular action in each state, so other actions are never tried. This is why we use an $\epsilon$-greedy policy.

The policy takes

$$
a=\begin{cases}
\arg\max_{a'} Q^\pi(s,a') & \text{with probability } 1-\epsilon \\
\text{a uniformly random action} & \text{with probability } \epsilon
\end{cases}
$$

so each individual action is chosen with probability at least $\epsilon/|A|$.
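A small sketch of $\epsilon$-greedy action selection consistent with the probabilities above; the tabular `Q[state]` array and the function name are my assumptions.

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action (so each action
    gets at least epsilon / |A|), otherwise pick argmax_a Q[state, a]."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # exploration
    return int(np.argmax(Q[state]))           # greedy action
```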

  • Example - MC for On Policy Q Evaluation and $\epsilon$-greedy policy

    Initialize $N(s,a)=0, G(s,a)=0, Q^\pi(s,a)=0$


    Ans. $Q^{\epsilon\text{-}\pi}(\cdot, a_1)=[1,0,1,0,0,0,0]$, $Q^{\epsilon\text{-}\pi}(\cdot,a_2)=[0,1,0,0,0,0,0]$

Define GLIE → Greedy in the Limit of Infinite Exploration

All state-action pairs are visited an infinite number of times.

The behavior policy (the policy the agent actually uses to act) converges in the limit to the greedy policy with respect to the current $Q$, e.g. $\epsilon$-greedy with $\epsilon_i = 1/i$ so that $\epsilon_i \to 0$.


Monte Carlo Control


first-visit and every-visit are both fine here.
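A sketch of GLIE Monte-Carlo control under my own assumptions: a gym-style `env` with `reset()`/`step()`, discrete states and actions, an $\epsilon_k = 1/k$ schedule, and incremental updates with step size $1/N(s,a)$.

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, n_episodes, gamma):
    """GLIE Monte-Carlo control: epsilon-greedy behavior with epsilon_k = 1/k,
    first-visit returns, and incremental Q updates with step size 1/N(s, a)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    N = defaultdict(lambda: np.zeros(n_actions))
    rng = np.random.default_rng()
    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                       # GLIE: epsilon -> 0 as k -> inf
        # sample one episode with the current epsilon-greedy policy
        episode, done = [], False
        s, _ = env.reset()
        while not done:
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            episode.append((s, a, r))
            done = terminated or truncated
            s = s_next
        # first-visit returns, computed backwards through the episode
        G, first_visit_return = 0.0, {}
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            first_visit_return[(s, a)] = G
        # incremental update: Q <- Q + 1/N * (G - Q)
        for (s, a), G_t in first_visit_return.items():
            N[s][a] += 1
            Q[s][a] += (G_t - Q[s][a]) / N[s][a]
    return Q
```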


Temporal Difference Methods for Control

*SARSA Algorithm
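A minimal sketch of the SARSA update rule with an assumed tabular `Q[s, a]` array; this is not the lecture's pseudocode, just the core on-policy TD update.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """On-policy TD control update:
    Q(s, a) <- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)],
    where a' is the action the behavior policy actually takes in s'."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Inside an episode loop, the next action `a_next` is chosen (e.g. $\epsilon$-greedily) before the update and then becomes the current action of the next step, hence the name State-Action-Reward-State-Action.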


SARSA for finite-state and finite-action MDPs converges to the optimal action-value function $Q^*$ under the following conditions:

  • the policy sequence $\pi_t(a|s)$ satisfies GLIE
  • the step sizes $\alpha_t$ satisfy the Robbins-Monro conditions $\sum_{t=1}^\infty \alpha_t=\infty$ and $\sum_{t=1}^\infty \alpha_t^2<\infty$

though in practice the second condition is often ignored.

Off-policy Control with Q-Learning

We can estimate the value of the optimal policy $\pi^*$ without knowing or following $\pi^*$ itself.

The key idea is to maintain state-action value estimates $Q(s,a)$ and to bootstrap, i.e. build the update target from the estimated value of the best next action.


Q-Learning with $\epsilon$-greedy
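A minimal tabular Q-learning sketch with $\epsilon$-greedy exploration; the gym-style environment interface and discrete spaces are my assumptions.

```python
import numpy as np

def q_learning(env, n_episodes, alpha, gamma, epsilon):
    """Off-policy TD control: behave epsilon-greedily, but bootstrap the
    target with max_a' Q(s', a'), i.e. the greedy (target) policy."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng()
    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if rng.random() < epsilon:
                a = int(rng.integers(env.action_space.n))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # greedy bootstrap target (this is what makes it off-policy)
            target = r if terminated else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```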


Q-learning only updates the Q value of the state-action pair it just visited. Even if the agent later discovers a much larger reward, that information propagates backwards only one step per update, so reward information can spread more slowly than with Monte-Carlo, which updates every visited state-action pair with the full return.

Que: What conditions are sufficient to ensure that Q-learning with $\epsilon$-greedy exploration converges to the optimal $Q^*$?

Ans. Every $(s,a)$ pair must be visited infinitely often, and the step sizes $\alpha_t$ must satisfy the Robbins-Monro conditions.

Que: What conditions are sufficient to ensure that Q-learning with $\epsilon$-greedy exploration converges to the optimal policy $\pi^*$?

Ans. The behavior policy must additionally satisfy GLIE, on top of the conditions above.


Maximization Bias

Double Learning→Double Q-Learning

As seen above, acting greedily with respect to estimated Q values can suffer from maximization bias: taking a max over noisy estimates gives a positively biased estimate of the true maximum value.

Instead of taking the max over a single set of estimates, we split the samples and maintain two independent unbiased estimates, $Q_1(s_1,a_i)$ and $Q_2(s_1,a_i)$:

→ Use one estimate to select the maximizing action: $a^*=\arg\max_a Q_1(s_1,a)$

→ Use the other estimate to evaluate that action: $Q_2(s_1,a^*)$

→ This yields an unbiased estimate: $\mathbb{E}[Q_2(s_1,a^*)]=Q(s_1,a^*)$

The estimate is unbiased because the action is selected and evaluated using independent samples.

Double Q-Learning Algorithm
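A sketch of the double Q-learning update, keeping two tables and flipping a coin to decide which one selects the action and which one evaluates it; the function signature is an assumption of mine, not the lecture's pseudocode.

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, done, rng):
    """Double Q-learning update: one table selects the argmax action,
    the other evaluates it, removing the single-max maximization bias."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q1[s_next]))                      # select with Q1
        target = r if done else r + gamma * Q2[s_next, a_star]   # evaluate with Q2
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        a_star = int(np.argmax(Q2[s_next]))                      # select with Q2
        target = r if done else r + gamma * Q1[s_next, a_star]   # evaluate with Q1
        Q2[s, a] += alpha * (target - Q2[s, a])
    return Q1, Q2
```

Actions are typically chosen $\epsilon$-greedily with respect to $Q_1+Q_2$.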


This process requires twice the memory of standard Q-learning, but the computation per step stays essentially the same.


As the figure above shows, standard Q-learning tends to prefer an action whose outcome is stochastic but worse on average over a deterministic action with a higher expected value. This is a consequence of maximization bias: taking a max over noisy Q estimates systematically overestimates the value of such stochastic actions, which Double Q-learning avoids.
