
Stanford CS234 Lecture 4

by 누워있는말티즈 2022. 8. 5.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4

→ Last time we evaluated a given policy in the model-free setting.

How can an agent start making good decisions when it doesn't know how the world works? In other words, how do we make a "good decision"?


Learning to Control Involves...

  • Optimization : we want to maximize expected rewards
  • Delayed Consequences : it may take time to realize whether a previous action was good or bad
  • Exploration : some exploration is required to discover potentially better solutions

We will consider today's setting as one of the following:

→ The MDP model is unknown but can be sampled

→ The MDP model is known, but it is too large or expensive to use directly except through sampling

On-Policy and Off-Policy Learning

| On-Policy | Learn from direct experience: estimate and evaluate a policy from experience obtained by following that policy |
| --- | --- |
| Off-Policy | Learn to estimate and evaluate a policy from experience obtained by following a different policy |

Generalized Policy Iteration

Let us recall policy iteration in the model-based case. You would:

  • Initialize policy π

  • Loop :

    → compute $V^{\pi_i}$ (policy evaluation)

    → update the policy $\pi$ (policy improvement):

    $$\pi_{i+1}(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, V^{\pi_i}(s') \right] = \arg\max_a Q^{\pi_i}(s,a)$$

Since there are at most $|A|^{|S|}$ distinct deterministic policies, this improvement loop terminates after a finite number of iterations.
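To make the improvement step concrete, here is a minimal sketch assuming a small tabular MDP whose model is stored as NumPy arrays `R[s, a]` and `P[s, a, s']` (these names and shapes are my own convention, not from the lecture):

```python
import numpy as np

def greedy_policy_improvement(R, P, V, gamma):
    """One greedy policy-improvement step with a known model.

    R: rewards, shape (S, A), R[s, a] = R(s, a)
    P: transitions, shape (S, A, S), P[s, a, s2] = P(s2 | s, a)
    V: current value estimate, shape (S,)
    """
    # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
    Q = R + gamma * (P @ V)        # shape (S, A)
    return Q.argmax(axis=1)        # improved deterministic policy: pi(s) = argmax_a Q(s, a)
```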

Monte-Carlo for On-Policy Q Evaluation and Model-Free Policy Iteration

We first evaluate the policy by estimating Q values.

Initialize $N(s,a)=0,\; G(s,a)=0,\; Q^{\pi}(s,a)=0$

Loop :

→ Sample the $k$-th episode $(s_{k,1}, a_{k,1}, r_{k,1}, \ldots, s_{k,T})$ by following $\pi$

→ For each (first-visit or every-visit) occurrence of $(s,a)$ in the episode, with return $G_{k,t}$ from that time step onward, update

$$N(s,a) \leftarrow N(s,a) + 1, \quad G(s,a) \leftarrow G(s,a) + G_{k,t}, \quad Q^{\pi}(s,a) \leftarrow \frac{G(s,a)}{N(s,a)}$$


Now that we have an estimate $Q^{\pi_i}(s,a)$, we can compute an improved policy:

$$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s,a)$$
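A minimal sketch of this Monte-Carlo Q evaluation loop, assuming each sampled episode is given as a list of `(state, action, reward)` tuples (the data format and names are my own assumptions):

```python
from collections import defaultdict

def mc_q_evaluation(episodes, gamma=1.0, first_visit=True):
    """Monte-Carlo evaluation of Q^pi from episodes sampled by following pi."""
    N = defaultdict(int)      # visit counts N(s, a)
    G = defaultdict(float)    # cumulative returns G(s, a)
    Q = defaultdict(float)    # estimate Q(s, a) = G(s, a) / N(s, a)

    for episode in episodes:                  # episode = [(s, a, r), ...]
        # Compute the return G_t following each time step
        ret, returns = 0.0, []
        for (_, _, r) in reversed(episode):
            ret = r + gamma * ret
            returns.append(ret)
        returns.reverse()

        seen = set()
        for (s, a, _), g in zip(episode, returns):
            if first_visit and (s, a) in seen:
                continue                       # only the first visit of (s, a) counts
            seen.add((s, a))
            N[(s, a)] += 1
            G[(s, a)] += g
            Q[(s, a)] = G[(s, a)] / N[(s, a)]
    return Q
```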

Putting these together, we iterate on the policy $\pi$:

Initialize policy $\pi$

Loop :

→ compute $Q^{\pi}$ (policy evaluation)

→ update $\pi$ to be greedy with respect to $Q^{\pi}$ (policy improvement)


Exploration

We want our estimate of $Q$ to be accurate. However, if $\pi$ is deterministic, the agent will only ever take one particular action in each state, so many state-action pairs are never tried. To address this we use an ϵ-greedy policy.

Under an ϵ-greedy policy the agent takes the greedy action $\arg\max_a Q^{\pi}(s,a)$ with probability $1-\epsilon$, and otherwise takes an action chosen uniformly at random, so each action is selected at random with probability $\epsilon/|A|$.
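A small sketch of such an ϵ-greedy action choice, assuming Q is stored as a NumPy array of shape `(n_states, n_actions)` (my own layout):

```python
import numpy as np

def epsilon_greedy_action(Q, s, epsilon, rng=None):
    """Pick argmax_a Q[s, a] with probability 1 - epsilon, else a uniformly random action."""
    rng = rng if rng is not None else np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # each action has probability epsilon / |A|
    return int(np.argmax(Q[s]))               # greedy action, probability 1 - epsilon
```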

  • Example - MC for On Policy Q Evaluation and ϵ-greedy policy

    Initialize N(s,a)=0,G(s,a)=0,Qπ(s,a)=0


    Ans. $Q^{\epsilon\text{-}\pi}(\cdot, a_1) = [1, 0, 1, 0, 0, 0, 0]$, $Q^{\epsilon\text{-}\pi}(\cdot, a_2) = [0, 1, 0, 0, 0, 0, 0]$

Define GLIE → Greedy in the Limit of Infinite Exploration

All state-action pairs are visited an infinite number of times.

The behavior policy (the policy actually used to act) converges in the limit to the policy that is greedy with respect to the current $Q$, i.e. $\pi(a|s) \to \arg\max_a Q(s,a)$ with probability 1. A simple way to satisfy GLIE is ϵ-greedy exploration with ϵ decaying to zero, e.g. $\epsilon_k = 1/k$.


Monte Carlo Control


Both first-visit and every-visit Monte-Carlo updates work here.
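Putting evaluation, ϵ-greedy exploration, and improvement together, here is a sketch of GLIE Monte-Carlo control. It assumes a simplified environment interface where `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`; that interface and the $\epsilon_k = 1/k$ schedule are my own simplifications of the lecture's pseudocode:

```python
import numpy as np

def glie_mc_control(env, n_states, n_actions, n_episodes=10_000, gamma=1.0, seed=0):
    """GLIE Monte-Carlo control: epsilon-greedy behavior with epsilon_k = 1/k."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                       # GLIE: epsilon decays to 0
        # Sample one episode with the current epsilon-greedy policy
        episode, s, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # Every-visit incremental Monte-Carlo update
        g = 0.0
        for (s, a, r) in reversed(episode):
            g = r + gamma * g
            N[s, a] += 1
            Q[s, a] += (g - Q[s, a]) / N[s, a]
    return Q
```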


Temporal Difference Methods for Control

SARSA Algorithm

After taking action $a_t$ in state $s_t$ and observing reward $r_t$ and next state $s_{t+1}$, SARSA selects the next action $a_{t+1}$ from its current (e.g. ϵ-greedy) policy and updates

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big)$$

SARSA for finite-state and finite-action MDPs converges to the optimal action-value function $Q^*$ if the following conditions hold:

  • the policy sequence $\pi_t(a|s)$ satisfies GLIE
  • the step sizes $\alpha_t$ satisfy the Robbins-Monro conditions: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$

In practice, however, the second (step-size) condition is often not enforced.
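A tabular SARSA sketch under the same assumed environment interface as above (again a simplification, not the lecture's exact pseudocode):

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=5_000, alpha=0.1, gamma=0.99,
          epsilon=0.1, seed=0):
    """Tabular SARSA: on-policy TD control with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # Bootstrap with the action the behavior policy actually takes next
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```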

Off-policy Control with Q-Learning

We can estimate $\pi^*$ without knowing what $\pi^*$ is.

The key idea is to maintain state-action value estimates $Q(s,a)$ and bootstrap on the value of the best future action:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big)$$


Q-Learning with ϵ-greedy


Q-learning only updates the Q function for the state-action pair it just visited. Even if the agent later discovers a much higher reward, that information is propagated backward only one step per update, so reward information can spread through the Q function more slowly than with Monte-Carlo, which updates every visited pair with the full return.
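For comparison, a Q-learning sketch under the same assumed interface; the only change from SARSA is that the bootstrap target uses $\max_{a'} Q(s_{t+1}, a')$ rather than the action the behavior policy takes next:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=5_000, alpha=0.1, gamma=0.99,
               epsilon=0.1, seed=0):
    """Tabular Q-learning: off-policy TD control with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Bootstrap with the greedy (target-policy) action in the next state
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```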

Que : What conditions are sufficient to ensure that Q-learning with ϵ-greedy exploration converges to the optimal $Q^*$?

Ans. The algorithm must visit every $(s,a)$ pair infinitely often, and the step sizes $\alpha_t$ must satisfy the Robbins-Monro conditions.

Que : What conditions are sufficient to ensure that Q-learning with ϵ-greedy exploration converges to the optimal policy $\pi^*$?

Ans. In addition to the conditions above, the behavior policy must satisfy GLIE (become greedy in the limit).


Maximization Bias

Double Learning→Double Q-Learning

As seen above, acting greedily with respect to estimated Q values can yield maximization bias: because the estimates are noisy, $E[\max_a \hat{Q}(s,a)] \ge \max_a E[\hat{Q}(s,a)]$, so the value of the selected action tends to be overestimated.
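A tiny simulation of this effect: every action's true value is 0, yet the max over noisy (individually unbiased) estimates is positive on average. The action count and noise scale here are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# True Q(s, a) = 0 for every action; estimates are corrupted by zero-mean noise
noisy_estimates = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

print(noisy_estimates.mean())               # ~0.0: each individual estimate is unbiased
print(noisy_estimates.max(axis=1).mean())   # clearly > 0: the max over estimates is biased upward
```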

Instead of taking the max over a single set of estimates, we split the samples and build two independent unbiased estimates, $Q_1(s_1, a_i)$ and $Q_2(s_1, a_i)$.

→ Use one estimate to select the max action: $a^* = \arg\max_a Q_1(s_1, a)$

→ Use the other estimate to estimate the value of $a^*$: $Q_2(s_1, a^*)$

→ This yields an unbiased estimate: $E[Q_2(s_1, a^*)] = Q(s_1, a^*)$

This process yields an unbiased estimate because the samples used to select the action are independent of the samples used to estimate its value.

Double Q-Learning Algorithm


This process requires twice the memory of the original Q-Learning, since two Q estimates are maintained.
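A Double Q-learning sketch under the same assumed environment interface: on each step one of the two tables is chosen at random to be updated, using the other table to evaluate its greedy action:

```python
import numpy as np

def double_q_learning(env, n_states, n_actions, n_episodes=5_000, alpha=0.1,
                      gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Double Q-learning: two independent estimates to reduce maximization bias."""
    rng = np.random.default_rng(seed)
    Q1 = np.zeros((n_states, n_actions))
    Q2 = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behave epsilon-greedily with respect to Q1 + Q2
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q1[s] + Q2[s]))
            s_next, r, done = env.step(a)

            if rng.random() < 0.5:
                # Select the action with Q1, evaluate it with Q2
                a_star = int(np.argmax(Q1[s_next]))
                target = r + (0.0 if done else gamma * Q2[s_next, a_star])
                Q1[s, a] += alpha * (target - Q1[s, a])
            else:
                # Select the action with Q2, evaluate it with Q1
                a_star = int(np.argmax(Q2[s_next]))
                target = r + (0.0 if done else gamma * Q1[s_next, a_star])
                Q2[s, a] += alpha * (target - Q2[s, a])
            s = s_next
    return Q1, Q2
```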


As the figure above shows, standard Q-Learning tends to keep choosing the action whose rewards are stochastic but worse on average over the deterministic, better action. This happens because of maximization bias: the max over noisy estimates overestimates the value of the stochastic action, and Double Q-Learning largely removes this effect.
