
Stanford CS234 Lecture 5

by 누워있는말티즈 2022. 8. 8.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 5

We need to be able to generalize from our experience to make “good decisions”

Value Function Approximation (VFA)

From now on, we will represent the state value function $V(s)$ (or state-action value function $Q(s,a)$) with a parameterized function.


The input is a state or a state-action pair, and the output is the corresponding value estimate.

The parameter $w$ is a vector: in the simplest case the weights of a linear combination of features, and more generally the parameters of a deep neural network.

Motivations

  • We don’t want to store and look up an explicit value for every single state
  • We want a more compact, generalized representation that still gives precise values

Benefits of Generalization

  • Reduce required memory
  • Reduce computation
  • Reduce the amount of experience (number of samples) required

Function Approximators

Out of the many possible approximators (neural networks, linear combinations of features, Fourier bases, etc.), we will focus on differentiable ones.

Today we will focus on linear feature representations.

Review Gradient Descent

Consider a function $J(w)$ that is a differentiable function of $w$

→ we want to find the $w$ that minimizes $J$

$$
\nabla_wJ(w)=[{\partial J(w)\over \partial w_1},{\partial J(w)\over \partial w_2},...,{\partial J(w)\over \partial w_N}]
$$

then we update our parameter with this gradient

$$
w \leftarrow w - \alpha \nabla_wJ(w)
$$

where $\alpha$ is the “learning rate”, a constant step size. With the $w$ update above, $w$ is guaranteed to move toward a local optimum.
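As a concrete illustration, here is a minimal gradient descent loop in NumPy; the quadratic objective, target vector, learning rate, and iteration count are arbitrary choices for this example, not from the lecture.

```python
import numpy as np

# Example objective: J(w) = ||w - w_target||^2, whose gradient is 2(w - w_target).
# The target vector, learning rate, and iteration count are illustrative only.
w_target = np.array([1.0, -2.0, 0.5])

def J(w):
    return np.sum((w - w_target) ** 2)

def grad_J(w):
    return 2.0 * (w - w_target)

w = np.zeros(3)      # initial parameters
alpha = 0.1          # learning rate (constant)
for _ in range(100):
    w = w - alpha * grad_J(w)   # w <- w - alpha * grad J(w)

print(w, J(w))       # w approaches w_target, J(w) approaches 0
```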


VFA for Prediction

We will assume there is an oracle 🔮 that returns the exact value for any given state, so we know $V^\pi(s)$.

Stochastic Gradient Descent(SGD)

We use SGD to find the parameter $w$ that minimizes the loss between the true value $V^\pi(s)$ and the estimated value $\hat{V}(s; w)$ (the $;w$ denotes that the estimate is parameterized by $w$).

We represent each state with a feature vector $x(s)=[x_1(s),x_2(s),...,x_n(s)]^T$.

Linear VFA for Prediction with Oracle
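For a linear VFA, $\hat{V}(s;w)=x(s)^Tw$, so $\nabla_w\hat{V}(s;w)=x(s)$ and the SGD update with an oracle target becomes $\Delta w=\alpha\,(V^\pi(s)-x(s)^Tw)\,x(s)$. Below is a minimal sketch of this update in NumPy; the feature vector and oracle value in the usage lines are made up for illustration.

```python
import numpy as np

def linear_vfa_oracle_update(w, x_s, v_true, alpha):
    """One SGD step for a linear VFA when an oracle supplies V^pi(s).

    V_hat(s; w) = x(s)^T w, so grad_w V_hat(s; w) = x(s) and
    delta_w = alpha * (V^pi(s) - x(s)^T w) * x(s).
    """
    return w + alpha * (v_true - x_s @ w) * x_s

# Toy usage with made-up features and oracle value (illustrative only).
w = np.zeros(4)
x_s = np.array([1.0, 0.0, 2.0, 0.5])
w = linear_vfa_oracle_update(w, x_s, v_true=3.0, alpha=0.1)
```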

*Convergence Guarantees for Linear VFA for Policy Evaluation

Define the mean squared error of a linear VFA for a particular policy $\pi$:

$$
MSVE(w)=\sum_s d(s)\left(V^\pi(s)-\hat{V}^\pi(s; w)\right)^2
$$

where $d(s)$ is the stationary distribution of $\pi$ in the decision process → a probability distribution over states ($\sum_s d(s)=1$).
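Written out directly in code (a sketch; the argument names are my own):

```python
import numpy as np

def msve(d, v_pi, X, w):
    """MSVE(w) = sum_s d(s) * (V^pi(s) - x(s)^T w)^2.

    d    : stationary distribution over states (entries sum to 1)
    v_pi : true values V^pi(s) for each state
    X    : matrix whose s-th row is the feature vector x(s)
    """
    return float(np.sum(d * (v_pi - X @ w) ** 2))
```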

Monte-Carlo Linear VFA

We no longer have an oracle that tells us the true $V^\pi$. In Monte-Carlo we use the return $G_t$ as the target instead.

$$
\begin{matrix}
\Delta w &=& \alpha(G_t-\hat{V}(s_t; w))\nabla_w\hat{V}(s_t;w) \\
&=& \alpha(G_t-\hat{V}(s_t; w))x(s_t) \\
&=& \alpha(G_t-x(s_t)^T w)x(s_t)
\end{matrix}
$$

This update can be applied identically to both first-visit and every-visit MC.

Algorithm

  • Baird Example - Monte Carlo Policy Evaluation

    Compute $x(s_1)$ for $s_1$ from the feature structure: $x(s_1)=[2,0,0,0,0,0,0,1]$.

    Assume the initial weights $w=[1,1,1,1,1,1,1,1]$ and $\alpha=0.5$.

    Trajectory: $(s_1,a_1,0,s_7,a_1,0,s_7,a_1,0,term)$

    Ans. The return is $G_{s_1}=0$ and $\hat{V}(s_1)=x(s_1)^T w=3$, so

    → $\Delta w=0.5(0-3)x(s_1)=-1.5[2,0,0,0,0,0,0,1]=[-3,0,0,0,0,0,0,-1.5]$

    → $w = w+\Delta w=[-2,1,1,1,1,1,1,-0.5]$
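The same calculation in NumPy, as a quick check of the numbers above (the features, initial weights, and return are taken directly from the example):

```python
import numpy as np

x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1], dtype=float)  # x(s_1) from the example
w = np.ones(8)                                           # initial weights
alpha, G = 0.5, 0.0                                      # step size and observed return

delta_w = alpha * (G - x_s1 @ w) * x_s1                  # 0.5 * (0 - 3) * x(s_1)
w = w + delta_w
print(w)  # [-2.  1.  1.  1.  1.  1.  1. -0.5]
```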

*Convergence Guarantees for Linear VFA for “Monte Carlo” Policy Evaluation

Monte-Carlo policy evaluation with a linear VFA converges to the weights with the minimum possible MSE:

$$
MSVE(w_{MC})=\min_w\sum_s d(s)\left(V^\pi(s)-\hat{V}^\pi(s; w)\right)^2
$$

where $d(s)$ is the stationary distribution of $\pi$ in the decision process → a probability distribution over states ($\sum_s d(s)=1$).

Batch Monte-Carlo VFA

We take a set of episodes (a.k.a. a batch) generated by policy $\pi$. Since this set is finite, we can solve analytically for the approximation that minimizes the MSE.

$$
\arg \min_w \sum^N_{i=1}(G(s_i)-x(s_i)^Tw)^2
$$

Take the derivative with respect to $w$ and set it to 0:

$$
w=(X^TX)^{-1}X^TG
$$

where $X$ is the matrix whose rows are the feature vectors of the $N$ states and $G$ is the vector of the corresponding returns.
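A minimal NumPy sketch of the batch solution; the data below are random placeholders just to show the shapes, and in practice `np.linalg.lstsq` is numerically safer than explicitly inverting $X^TX$:

```python
import numpy as np

# X: N x n matrix whose i-th row is x(s_i); G: vector of N Monte-Carlo returns.
# Random placeholders here, only to illustrate the shapes involved.
rng = np.random.default_rng(0)
N, n = 100, 5
X = rng.normal(size=(N, n))
G = rng.normal(size=N)

w_closed = np.linalg.inv(X.T @ X) @ X.T @ G          # w = (X^T X)^{-1} X^T G
w_lstsq, *_ = np.linalg.lstsq(X, G, rcond=None)      # numerically safer equivalent

assert np.allclose(w_closed, w_lstsq)
```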

Temporal Difference(TD) Learning VFA

Instead of the return $G_t$ as in Monte-Carlo, we use the TD target $r_j+\gamma \hat{V}^\pi(s_{j+1};w)$ here.

$$
\begin{matrix}
\Delta w &=& \alpha(r+\gamma \hat{V}^\pi(s';w)-\hat{V}(s_t; w))\nabla_w\hat{V}(s_t;w) \\
&=& \alpha(r+\gamma \hat{V}^\pi(s';w)-\hat{V}(s_t; w))x(s_t) \\
&=& \alpha(r+\gamma x(s')^Tw-x(s_t)^T w)x(s_t)
\end{matrix}
$$

Algorithm

Basically identical to the Monte-Carlo method, except that the TD target replaces the return.

  • Baird Example - Temporal Difference Policy Evaluation


    Compute $x(s_1)$ for $s_1$ from the feature structure: $x(s_1)=[2,0,0,0,0,0,0,1]$.

    Assume the initial weights $w=[1,1,1,1,1,1,1,1]$, $\alpha=0.5$, and $\gamma=0.9$.

    Trajectory: $(s_1,a_1,0,s_7,a_1,0,s_7,a_1,0,term)$; first sample $(s_1,a_1,0,s_7)$.

    Ans. $\hat{V}(s_1)=x(s_1)^T w=3$ and $\hat{V}(s_7)=x(s_7)^T w=3$, so

    $$
    \begin{matrix}
    \Delta w&=&0.5(0+0.9\,x(s_7)^Tw-x(s_1)^Tw)x(s_1) \\
    &=&0.5(0+0.9\cdot3-3)[2,0,0,0,0,0,0,1] \\
    &=&[-0.3,0,0,0,0,0,0,-0.15]
    \end{matrix}
    $$
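The same calculation in NumPy. $x(s_1)=[2,0,0,0,0,0,0,1]$ is given above; the exact form of $x(s_7)$ is my assumption, chosen so that $x(s_7)^Tw=3$ as in the example:

```python
import numpy as np

x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1], dtype=float)
x_s7 = np.array([0, 0, 0, 0, 0, 0, 1, 2], dtype=float)  # assumed; any x(s_7) with x(s_7)^T w = 3 gives the same result
w = np.ones(8)
alpha, gamma, r = 0.5, 0.9, 0.0

td_target = r + gamma * (x_s7 @ w)                       # 0 + 0.9 * 3 = 2.7
delta_w = alpha * (td_target - x_s1 @ w) * x_s1          # 0.5 * (2.7 - 3) * x(s_1)
print(delta_w)  # approximately [-0.3  0.  0.  0.  0.  0.  0.  -0.15]
```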

*Convergence Guarantees for Linear VFA for “Temporal Difference” Policy Evaluation

TD(0) policy evaluation with a linear VFA converges to weights whose MSE is within a constant factor of the minimum:

$$
MSVE(w_{TD})\le{1\over1-\gamma}\min_w\sum_s d(s)\left(V^\pi(s)-\hat{V}^\pi(s; w)\right)^2
$$

where $d(s)$ is the stationary distribution of $\pi$ in the decision process → a probability distribution over states ($\sum_s d(s)=1$).

Control with Value Function Approximation

Now we apply the VFA machinery above to the state-action value: $\hat{Q}^\pi(s,a;w) \approx Q^\pi(s,a)$.

This process generally combines function approximation, bootstrapping, and off-policy learning.

We take the exact same approach as above for state values → SGD.

Features are now defined over both state and action, and the state-action value is represented as a linear combination of features and weights:

$$
\hat{Q}(s,a;w)=x(s,a)^Tw=\sum^n_{j=1}x_j(s,a)w_j
$$
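The lecture does not prescribe how $x(s,a)$ is constructed; one common choice (an assumption on my part) is to stack a copy of the state features for each action, as sketched below with a hypothetical `state_action_features` helper:

```python
import numpy as np

def state_action_features(x_s, a, n_actions):
    """Hypothetical block construction of x(s, a): the state features x(s)
    are placed in the a-th block and all other entries are zero."""
    n = len(x_s)
    x_sa = np.zeros(n * n_actions)
    x_sa[a * n:(a + 1) * n] = x_s
    return x_sa

def q_hat(x_sa, w):
    """Q_hat(s, a; w) = x(s, a)^T w."""
    return x_sa @ w
```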

Stochastic Gradient Descent

The objective function is the expected squared error between the true and approximated state-action values, $J(w)=E_\pi[(Q^\pi(s,a)-\hat{Q}^\pi(s,a;w))^2]$, and its gradient is

$$
\nabla_wJ(w)=\nabla_wE_\pi[(Q^\pi(s,a)-\hat{Q}^\pi(s,a;w))^2]
$$

We use this gradient to derive the $\Delta w$ update for each target below; a code sketch of the three updates follows the list.

Gradient Descent term for MC, TD, and Q-Learning

  • Monte-Carlo

$$
\Delta w = \alpha(G_t-\hat{Q}(s_t,a_t; w))\nabla_w\hat{Q}(s_t,a_t;w)
$$

  • SARSA

$$
\Delta w = \alpha(r+\gamma\hat{Q}(s',a'; w)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$

  • Q-Learning

$$
\Delta w = \alpha(r+\gamma \max_{a'}\hat{Q}(s',a'; w)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$
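For the linear case $\nabla_w\hat{Q}(s,a;w)=x(s,a)$, so all three updates reduce to the same template with different targets. A minimal sketch in NumPy, assuming the feature vectors are supplied externally (e.g. by the hypothetical helper sketched earlier):

```python
import numpy as np

def mc_update(w, x_sa, G, alpha):
    # Monte-Carlo: the target is the observed return G_t.
    return w + alpha * (G - x_sa @ w) * x_sa

def sarsa_update(w, x_sa, r, x_next_sa, gamma, alpha, done=False):
    # SARSA: the target is r + gamma * Q_hat(s', a'; w) for the action a' actually taken.
    target = r if done else r + gamma * (x_next_sa @ w)
    return w + alpha * (target - x_sa @ w) * x_sa

def q_learning_update(w, x_sa, r, x_next_all, gamma, alpha, done=False):
    # Q-learning: the target is r + gamma * max_a' Q_hat(s', a'; w);
    # x_next_all stacks x(s', a') for every action a' as rows.
    target = r if done else r + gamma * np.max(x_next_all @ w)
    return w + alpha * (target - x_sa @ w) * x_sa
```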

Convergence of Control Methods with VFA

Control with VFA is not guaranteed to converge in general: combining function approximation, bootstrapping, and off-policy learning (as in Q-learning with a linear or nonlinear VFA) can make the weights oscillate or even diverge.
