
Stanford CS234 Lecture 5

by 누워있는말티즈 2022. 8. 8.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 5

We need to be able to generalize from our experience to make “good decisions”

Value Function Approximation (VFA)

From now on, we will represent the state (or state-action) value function with a parameterized function.


The input is a state (or a state-action pair), and the output is the corresponding value estimate.

The parameter w here is, in the simplest case, just a vector; more generally it could be something like the weights of a DNN.

Motivations

  • We don’t want to store and scan through every single state’s properties individually.
  • We want a more compact, precise, and generalized representation.

Benefits of Generalization

  • Reduce required memory
  • Reduce computation
  • Reduce the amount of experience (number of samples) needed

Function Approximators

Out of the many possible approximators (neural networks, linear feature combinations, Fourier bases, etc.), we will focus on differentiable ones.

Today we will focus on linear feature representations.

Review Gradient Descent

Consider a function J(w) that is a differentiable function of w.

→ we want to find the w that minimizes J

$\nabla_w J(w) = \left[\dfrac{\partial J(w)}{\partial w_1}, \dfrac{\partial J(w)}{\partial w_2}, \ldots, \dfrac{\partial J(w)}{\partial w_N}\right]$

then we update our parameter with this gradient

$w \leftarrow w - \alpha \nabla_w J(w)$

where α is the “learning rate”, a constant step size. With the update rule above, w is guaranteed to move toward a local optimum.
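
As a minimal sketch (not from the lecture), the gradient descent loop above can be written in a few lines of Python; the quadratic objective and its gradient here are made-up stand-ins for whatever differentiable J(w) we actually care about:

```python
import numpy as np

# Hypothetical quadratic objective J(w) = ||w - w_star||^2, gradient 2 (w - w_star).
w_star = np.array([1.0, -2.0, 0.5])   # made-up minimizer, for illustration only

def grad_J(w):
    return 2.0 * (w - w_star)          # ∇_w J(w)

w = np.zeros(3)                        # initial parameters
alpha = 0.1                            # learning rate (constant step size)
for _ in range(100):
    w = w - alpha * grad_J(w)          # w ← w − α ∇_w J(w)

print(w)                               # ≈ w_star, i.e. a local (here global) optimum
```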


VFA for Prediction

We will assume that there is an oracle 🔮 that returns the exact value for a given state, so that we know $V^\pi(s)$.

Stochastic Gradient Descent(SGD)

We use SGD to find the parameter w that minimizes the loss between the true value $V^\pi(s)$ and the estimated value $\hat{V}(s; w)$ (the w here denotes that the estimate is parameterized by w).

We represent states with feature vectors $x(s)$.

Linear VFA for Prediction with Oracle
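
Below is a minimal sketch of the oracle-based SGD step, assuming a linear representation $\hat{V}(s; w) = x(s)^T w$ and a hypothetical oracle value `v_true` standing in for $V^\pi(s)$:

```python
import numpy as np

def sgd_update_with_oracle(w, x_s, v_true, alpha=0.1):
    """One SGD step on (V^pi(s) - x(s)^T w)^2 for a linear VFA.

    w      : parameter vector
    x_s    : feature vector x(s)
    v_true : oracle value V^pi(s)
    """
    v_hat = x_s @ w                            # V̂(s; w) = x(s)^T w
    delta_w = alpha * (v_true - v_hat) * x_s   # Δw = α (V^π(s) − V̂(s;w)) x(s)
    return w + delta_w
```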

*Convergence Guarantees for Linear VFA for Policy Evaluation

Define the mean squared error of a linear VFA for a particular policy π:

$MSVE(w) = \sum_{s \in S} d(s)\left(V^\pi(s) - \hat{V}^\pi(s; w)\right)^2$

where d(s) is the stationary distribution of π in the decision process → a probability distribution over states ($\sum_s d(s) = 1$).
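
As a tiny numeric illustration (all numbers made up), the MSVE is just a d(s)-weighted sum of squared errors:

```python
import numpy as np

d      = np.array([0.5, 0.3, 0.2])        # stationary distribution d(s), sums to 1 (made up)
v_true = np.array([1.0, 2.0, 3.0])        # true values V^pi(s)           (made up)
v_hat  = np.array([1.1, 1.8, 3.3])        # approximate values V̂(s; w)   (made up)

msve = np.sum(d * (v_true - v_hat) ** 2)  # Σ_s d(s) (V^π(s) − V̂(s;w))²
print(msve)
```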

Monte-Carlo Linear VFA

We no longer have an oracle that tells us the true $V^\pi$. In Monte-Carlo we use the return $G_t$ as the target instead.

$\Delta w = \alpha\left(G_t - \hat{V}(s_t; w)\right)\nabla_w \hat{V}(s_t; w) = \alpha\left(G_t - \hat{V}(s_t; w)\right)x(s_t) = \alpha\left(G_t - x(s_t)^T w\right)x(s_t)$

This can be applied identically to both first-visit and every-visit MC.
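
A sketch of every-visit Monte-Carlo policy evaluation with a linear VFA; the episode format (a list of (x(s_t), r_t) pairs) is an assumption made here for illustration:

```python
import numpy as np

def mc_linear_vfa(episodes, n_features, alpha=0.05, gamma=1.0):
    """Every-visit MC policy evaluation with V̂(s; w) = x(s)^T w."""
    w = np.zeros(n_features)
    for episode in episodes:                       # episode: list of (x_s, reward) pairs
        G = 0.0
        for x_s, r in reversed(episode):           # walk backwards to accumulate returns
            G = r + gamma * G                      # return G_t
            w += alpha * (G - x_s @ w) * x_s       # Δw = α (G_t − x(s_t)^T w) x(s_t)
    return w
```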

Algorithm

  • Baird Example - Monte Carlo Policy Evaluation

    Calculate $x(s_1)$ for $s_1$ from its features: $x(s_1) = [2, 0, 0, 0, 0, 0, 0, 1]$.

    Assume $w = [1, 1, 1, 1, 1, 1, 1, 1]$ and $\alpha = 0.5$.

    Trajectory: $(s_1, a_1, 0, s_7, a_1, 0, s_7, a_1, 0, \text{term})$

    Ans. The return is $G_{s_1} = 0$ and $\hat{V}(s_1) = x(s_1)^T w = 3$, so

    $\Delta w = 0.5\,(0 - 3)\,x(s_1) = -1.5\,[2, 0, 0, 0, 0, 0, 0, 1]$

    $w \leftarrow w + \Delta w = [-2, 1, 1, 1, 1, 1, 1, -0.5]$
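
Plugging the numbers above into the update, as a quick check:

```python
import numpy as np

x_s1 = np.array([2., 0., 0., 0., 0., 0., 0., 1.])  # features of s1
w = np.ones(8)                                     # initial parameters
alpha, G = 0.5, 0.0                                # step size and observed return

delta_w = alpha * (G - x_s1 @ w) * x_s1            # 0.5 · (0 − 3) · x(s1)
print(w + delta_w)                                 # [-2.  1.  1.  1.  1.  1.  1. -0.5]
```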

*Convergence Guarantees for Linear VFA for “Monte Carlo” Policy Evaluation

Define the mean squared error of a linear VFA for a particular policy π:

$MSVE(w_{MC}) = \min_w \sum_{s \in S} d(s)\left(V^\pi(s) - \hat{V}^\pi(s; w)\right)^2$

where d(s) is the stationary distribution of π in the decision process → a probability distribution over states ($\sum_s d(s) = 1$).

Batch Monte-Carlo VFA

We take a set of episodes (a.k.a. a batch) generated by policy π. Since this set is finite, it is possible to analytically solve for the approximation that minimizes the MSE:

$\arg\min_w \sum_{i=1}^{N}\left(G(s_i) - x(s_i)^T w\right)^2$

Take the derivative and set it to 0:

$w = (X^T X)^{-1} X^T G$

where X is the matrix whose rows are the feature vectors of each of the N states, and G is the vector of their returns.
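
A sketch of the closed-form batch fit, with X and G as defined above; `np.linalg.lstsq` solves the same least-squares problem while avoiding an explicit matrix inverse:

```python
import numpy as np

def batch_mc_solution(X, G):
    """Batch Monte-Carlo fit: w = (X^T X)^{-1} X^T G.

    X: (N, n_features) matrix whose rows are the feature vectors x(s_i)
    G: (N,) vector of observed returns G(s_i)
    """
    w, *_ = np.linalg.lstsq(X, G, rcond=None)  # min_w ||X w − G||²
    return w
```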

Temporal Difference(TD) Learning VFA

Instead of the return $G_t$ as in Monte-Carlo, here we use the TD target $r_j + \gamma \hat{V}^\pi(s_{j+1}; w)$.

$\Delta w = \alpha\left(r + \gamma \hat{V}^\pi(s'; w) - \hat{V}(s_t; w)\right)\nabla_w \hat{V}(s_t; w) = \alpha\left(r + \gamma \hat{V}^\pi(s'; w) - \hat{V}(s_t; w)\right)x(s_t) = \alpha\left(r + \gamma x(s')^T w - x(s_t)^T w\right)x(s_t)$
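
A sketch of one TD(0) step with a linear VFA; the transition format (x(s), r, x(s'), done) is assumed here for illustration:

```python
import numpy as np

def td0_linear_update(w, x_s, r, x_s_next, done, alpha=0.05, gamma=0.9):
    """One TD(0) step with V̂(s; w) = x(s)^T w."""
    v_next = 0.0 if done else x_s_next @ w         # V̂(s'; w), zero at terminal states
    td_target = r + gamma * v_next                 # r + γ V̂(s'; w)
    w = w + alpha * (td_target - x_s @ w) * x_s    # Δw = α (target − x(s)^T w) x(s)
    return w
```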

Algorithm

Basically identical to the Monte-Carlo method, but with the TD target in place of the return.

  • Baird Example - Temporal Difference Policy Evaluation

    Calculate $x(s_1)$ for $s_1$ from its features: $x(s_1) = [2, 0, 0, 0, 0, 0, 0, 1]$.

    Assume $w = [1, 1, 1, 1, 1, 1, 1, 1]$, $\alpha = 0.5$, and $\gamma = 0.9$.

    Trajectory: $(s_1, a_1, 0, s_7, a_1, 0, s_7, a_1, 0, \text{term})$; take the sample $(s_1, a_1, 0, s_7)$.

    Ans. $\hat{V}(s_1) = x(s_1)^T w = 3$ and $\hat{V}(s_7) = x(s_7)^T w = 3$, so

    $\Delta w = 0.5\,\big(0 + 0.9\,x(s_7)^T w - x(s_1)^T w\big)\,x(s_1) = 0.5\,(0 + 0.9 \cdot 3 - 3)\,[2, 0, 0, 0, 0, 0, 0, 1] = [-0.3, 0, 0, 0, 0, 0, 0, -0.15]$
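
Again plugging in the numbers as a quick check; the feature vector for $s_7$ is assumed to be $[0, 0, 0, 0, 0, 0, 1, 2]$ (the usual Baird features), consistent with $x(s_7)^T w = 3$ above:

```python
import numpy as np

x_s1 = np.array([2., 0., 0., 0., 0., 0., 0., 1.])  # features of s1
x_s7 = np.array([0., 0., 0., 0., 0., 0., 1., 2.])  # assumed features of s7 (x(s7)^T w = 3)
w, alpha, gamma, r = np.ones(8), 0.5, 0.9, 0.0

delta_w = alpha * (r + gamma * (x_s7 @ w) - x_s1 @ w) * x_s1
print(delta_w)                                     # [-0.3  0.  0.  0.  0.  0.  0. -0.15]
```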

*Convergence Guarantees for Linear VFA for “Temporal Difference” Policy Evaluation

Define the mean squared error of a linear VFA for a particular policy π:

$MSVE(w_{TD}) \leq \dfrac{1}{1-\gamma} \min_w \sum_{s \in S} d(s)\left(V^\pi(s) - \hat{V}^\pi(s; w)\right)^2$

where d(s) is the stationary distribution of π in the decision process → a probability distribution over states ($\sum_s d(s) = 1$).

Control with Value Function Approximation

Now we apply the VFA approach above to the state-action value: $\hat{Q}^\pi(s, a; w) \approx Q^\pi(s, a)$.

This process generally involves bootstrapping and off-policy learning combined with function approximation, which makes convergence harder to guarantee.

We take the exact same approach as above, where we dealt with state values → SGD.

Features are now defined over both states and actions, as below.

state-action value representation with features and weight value

$\hat{Q}(s, a; w) = x(s, a)^T w = \sum_{j=1}^{n} x_j(s, a)\, w_j$
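
One common way to build $x(s, a)$, sketched below as an assumption rather than the lecture's specific choice: stack one copy of the state features per action, so only the block of the chosen action is non-zero.

```python
import numpy as np

def state_action_features(x_s, a, n_actions):
    """x(s, a): state features placed in the block corresponding to action a."""
    n = len(x_s)
    x_sa = np.zeros(n * n_actions)
    x_sa[a * n:(a + 1) * n] = x_s
    return x_sa

def q_hat(x_sa, w):
    """Q̂(s, a; w) = x(s, a)^T w = Σ_j x_j(s, a) w_j"""
    return x_sa @ w
```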

Stochastic Gradient Descent

The objective function is the mean squared error between $Q^\pi$ and $\hat{Q}^\pi$; its gradient is

$\nabla_w J(w) = \nabla_w \mathbb{E}_\pi\left[\left(Q^\pi(s, a) - \hat{Q}^\pi(s, a; w)\right)^2\right]$

We use the above to derive the $\Delta w$ terms.

Gradient Descent term for MC, TD, and Q-Learning

  • Monte-Carlo

$\Delta w = \alpha\left(G_t - \hat{Q}(s_t, a_t; w)\right)\nabla_w \hat{Q}(s_t, a_t; w)$

  • SARSA

$\Delta w = \alpha\left(r + \gamma \hat{Q}(s', a'; w) - \hat{Q}(s, a; w)\right)\nabla_w \hat{Q}(s, a; w)$

  • Q-Learning

$\Delta w = \alpha\left(r + \gamma \max_{a'} \hat{Q}(s', a'; w) - \hat{Q}(s, a; w)\right)\nabla_w \hat{Q}(s, a; w)$
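
A sketch of the SARSA and Q-learning steps with a linear $\hat{Q}$; the feature inputs (x(s, a) vectors, plus a list of x(s', a') over all actions for the max in Q-learning) are assumptions made here for illustration:

```python
import numpy as np

def q_hat(x_sa, w):
    return x_sa @ w                                      # Q̂(s,a;w) = x(s,a)^T w

def sarsa_update(w, x_sa, r, x_sa_next, alpha=0.05, gamma=0.9):
    # Δw = α (r + γ Q̂(s',a';w) − Q̂(s,a;w)) x(s,a), with a' the action actually taken next
    target = r + gamma * q_hat(x_sa_next, w)
    return w + alpha * (target - q_hat(x_sa, w)) * x_sa

def q_learning_update(w, x_sa, r, x_sa_next_all, alpha=0.05, gamma=0.9):
    # Δw = α (r + γ max_a' Q̂(s',a';w) − Q̂(s,a;w)) x(s,a)
    target = r + gamma * max(q_hat(x, w) for x in x_sa_next_all)
    return w + alpha * (target - q_hat(x_sa, w)) * x_sa
```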

Convergence of Control Methods with VFA
