Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 5
We need to be able to generalize from our experience to make “good decisions”
Value Function Approximation (VFA)
From now on, we represent the value function with a parameterized function: the input is a state (or a state-action pair) and the output is the corresponding value estimate.
The parameter $w$ is a vector, e.g. the weights of a linear model or of a deep neural network.
Motivations
- we don’t want to store and look up a separate value for every single state
- we want a more compact, generalized representation
Benefits of Generalization
- Reduce required memory
- Reduce computation
- Reduce the amount of experience (data) needed
Function Approximators
Out of the many possible approximators (neural networks, linear combinations of features, Fourier bases, etc.), we focus on differentiable ones.
Today we focus on linear feature representations.
Review Gradient Descent
Consider a function $J(w)$ that is a differentiable function of $w$
→ we want to find the $w$ that minimizes $J$
$$
\nabla_wJ(w)=[{\partial J(w)\over \partial w_1},{\partial J(w)\over \partial w_2},...,{\partial J(w)\over \partial w_N}]
$$
then we update our parameter with this gradient
$$
w \leftarrow w - \alpha \nabla_wJ(w)
$$
where $\alpha$ is the learning rate, a constant step size. With this update rule, $w$ is guaranteed to move toward a local optimum.
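As a quick illustration, here is a minimal gradient-descent sketch on a made-up quadratic objective; the objective, step size, and iteration count are chosen purely for illustration:

```python
import numpy as np

# Toy objective J(w) = ||w - w_star||^2 with known gradient 2(w - w_star).
w_star = np.array([1.0, -2.0])

def grad_J(w):
    return 2.0 * (w - w_star)

alpha = 0.1          # learning rate
w = np.zeros(2)      # initial parameters
for _ in range(100):
    w = w - alpha * grad_J(w)   # w <- w - alpha * grad J(w)

print(w)             # approaches w_star = [1, -2]
```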
VFA for Prediction
We will assume for now that there is an oracle🔮 that returns the exact value for a given state, so we know $V^\pi(s)$.
Stochastic Gradient Descent(SGD)
We use SGD to find the parameter $w$ that minimizes the loss between the true value $V^\pi(s)$ and the estimated value $\hat{V}(s; w)$ (the $w$ here denotes that the estimate is parameterized by $w$).
We represent each state with a feature vector $x(s)$, so the linear estimate is $\hat{V}(s;w)=x(s)^Tw$.
Linear VFA for Prediction with Oracle
*Convergence Guarantees for Linear VFA for Policy Evaluation
Define the mean squared error of a linear VFA for a particular policy $\pi$:
$$
MSVE(w)=\sum_s d(s)\,(V^\pi(s)-\hat{V}^\pi(s; w))^2
$$
where $d(s)$ is the stationary distribution of $\pi$ in the decision process, i.e. a probability distribution over states ($\sum_s d(s)=1$)
Monte-Carlo Linear VFA
We no longer have an oracle that tells us the true $V^\pi$. In Monte-Carlo we use the return $G_t$ as the target instead.
$$
\begin{aligned}
\Delta w &= \alpha(G_t-\hat{V}(s_t; w))\nabla_w\hat{V}(s_t;w)\\
&= \alpha(G_t-\hat{V}(s_t; w))\,x(s_t) \\
&= \alpha(G_t-x(s_t)^T w)\,x(s_t)
\end{aligned}
$$
This update applies identically to both first-visit and every-visit MC.
Algorithm
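Below is a minimal sketch of what the every-visit Monte-Carlo loop looks like, assuming the caller supplies an `episodes` iterable of per-step (feature vector, return) pairs; the function name and data layout are illustrative, not from the lecture:

```python
import numpy as np

def mc_linear_vfa(episodes, n_features, alpha=0.01):
    """Every-visit Monte-Carlo policy evaluation with a linear VFA.

    `episodes` is assumed to yield episodes, each a list of (x_s, G) pairs:
    the feature vector x(s_t) and the return G_t observed from that state.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        for x_s, G in episode:
            v_hat = x_s @ w                    # V_hat(s_t; w) = x(s_t)^T w
            w += alpha * (G - v_hat) * x_s     # SGD step toward the MC return
    return w
```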
Baird Example - Monte Carlo Policy Evaluation
Calculate $x(s_1)$ for $s_1$ from the feature definition: $x(s_1)=[2,0,0,0,0,0,0,1]$.
Assume the initial weights $w=[1,1,1,1,1,1,1,1]$ and $\alpha=0.5$,
trajectory : $(s_1,a_1,0,s_7,a_1,0,s_7,a_1,0,term)$
Ans. For $G_{s_1}=0$, $V(s_1)=x(s_1)^T w=3$
→ $\Delta w=0.5(0-3)\,x(s_1)=-1.5\,[2,0,0,0,0,0,0,1]=[-3,0,0,0,0,0,0,-1.5]$
→ $w \leftarrow w+\Delta w=[-2,1,1,1,1,1,1,-0.5]$
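A quick numeric check of this update, using the feature vector and weights assumed above:

```python
import numpy as np

x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1], dtype=float)  # x(s_1)
w = np.ones(8)                                            # initial weights
alpha, G = 0.5, 0.0

v_hat = x_s1 @ w                      # = 3
delta_w = alpha * (G - v_hat) * x_s1  # = [-3, 0, ..., 0, -1.5]
w = w + delta_w
print(w)                              # [-2.  1.  1.  1.  1.  1.  1. -0.5]
```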
*Convergence Guarantees for Linear VFA for “Monte Carlo” Policy Evaluation
Monte-Carlo policy evaluation with a linear VFA converges to the weights with minimum mean squared error for the particular policy $\pi$:
$$
MSVE(w_{MC})=\min_w\sum_s d(s)\,(V^\pi(s)-\hat{V}^\pi(s; w))^2
$$
where $d(s)$ is the stationary distribution of $\pi$ in the decision process, i.e. a probability distribution over states ($\sum_s d(s)=1$)
Batch Monte-Carlo VFA
We take a set of episodes (a.k.a. a batch) collected under policy $\pi$. Since the batch is finite, it is possible to solve analytically for the approximation that minimizes the MSE.
$$
\arg \min_w \sum^N_{i=1}(G(s_i)-x(s_i)^Tw)^2
$$
Take the derivative and set it to 0:
$$
w=(X^TX)^{-1}X^TG
$$
where $X$ is the matrix whose rows are the feature vectors of the $N$ states, and $G$ is the vector of their observed returns.
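A minimal sketch of this closed-form batch fit, assuming `X` stacks the $N$ feature vectors as rows and `G` holds the corresponding returns; a least-squares solver is used instead of forming the inverse explicitly, which computes the same solution more stably:

```python
import numpy as np

def batch_mc_linear_vfa(X, G):
    """Closed-form batch Monte-Carlo fit: w = (X^T X)^{-1} X^T G.

    X: (N, n) array, one feature vector x(s_i) per row.
    G: (N,) array of observed returns G(s_i).
    """
    # lstsq solves the same normal equations as the explicit inverse,
    # but avoids forming and inverting X^T X.
    w, *_ = np.linalg.lstsq(X, G, rcond=None)
    return w
```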
Temporal Difference(TD) Learning VFA
Instead of the return $G_t$ as in Monte-Carlo, we use the TD target $r_j+\gamma \hat{V}^\pi(s_{j+1}; w)$ here.
$$
\begin{aligned}
\Delta w &= \alpha(r+\gamma \hat{V}^\pi(s'; w)-\hat{V}(s_t; w))\nabla_w\hat{V}(s_t;w)\\
&= \alpha(r+\gamma \hat{V}^\pi(s'; w)-\hat{V}(s_t; w))\,x(s_t) \\
&= \alpha(r+\gamma x(s')^Tw-x(s_t)^T w)\,x(s_t)
\end{aligned}
$$
Algorithm
Basically identical to the Monte-Carlo method, except that the update is made after every step using the TD target rather than waiting for the full return.
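For concreteness, a minimal sketch of TD(0) policy evaluation with a linear VFA, assuming the caller supplies a `transitions` iterable of (features, reward, next features, done) tuples; the names and data layout are illustrative:

```python
import numpy as np

def td0_linear_vfa(transitions, n_features, alpha=0.01, gamma=0.9):
    """TD(0) policy evaluation with a linear VFA.

    `transitions` is assumed to yield (x_s, r, x_next, done) tuples:
    feature vector of s, reward, feature vector of s', terminal flag.
    """
    w = np.zeros(n_features)
    for x_s, r, x_next, done in transitions:
        v_s = x_s @ w
        v_next = 0.0 if done else x_next @ w    # bootstrap from s'
        td_target = r + gamma * v_next
        w += alpha * (td_target - v_s) * x_s    # SGD step toward TD target
    return w
```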
Baird Example - Temporal Difference Policy Evaluation
Calculate $x(s_1)$ for $s_1$ from the feature definition. Assume the initial weights $w=[1,1,1,1,1,1,1,1]$, $\alpha=0.5$, and $\gamma=0.9$.
Trajectory: $(s_1,a_1,0,s_7,a_1,0,s_7,a_1,0,term)$, sample $(s_1,a_1,0,s_7)$
Ans. The sampled reward is $r=0$, and $\hat{V}(s_1)=x(s_1)^T w=3$, $\hat{V}(s_7)=x(s_7)^T w=3$
$$
\begin{aligned}
\Delta w &= 0.5(0+0.9\,x(s_7)^Tw-x(s_1)^Tw)\,x(s_1) \\
&= 0.5(0+0.9\cdot3-3)\,[2,0,0,0,0,0,0,1] \\
&= [-0.3,0,0,0,0,0,0,-0.15]
\end{aligned}
$$
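The same numeric check with the TD target; $x(s_7)=[0,0,0,0,0,0,1,2]$ is an assumed feature vector, chosen to be consistent with $x(s_7)^Tw=3$ above:

```python
import numpy as np

x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1], dtype=float)  # x(s_1), as above
x_s7 = np.array([0, 0, 0, 0, 0, 0, 1, 2], dtype=float)  # assumed; x(s_7)^T w = 3
w = np.ones(8)
alpha, gamma, r = 0.5, 0.9, 0.0

td_target = r + gamma * (x_s7 @ w)              # = 2.7
delta_w = alpha * (td_target - x_s1 @ w) * x_s1
print(delta_w)                                  # [-0.3  0. ... 0. -0.15]
```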
*Convergence Guarantees for Linear VFA for “Temporal Difference” Policy Evaluation
TD policy evaluation with a linear VFA converges to weights whose error is within a constant factor of the minimum, for the particular policy $\pi$:
$$
MSVE(w_{TD})\le{1\over1-\gamma}\min_w\sum_s d(s)\,(V^\pi(s)-\hat{V}^\pi(s; w))^2
$$
where $d(s)$ is the stationary distribution of $\pi$ in the decision process, i.e. a probability distribution over states ($\sum_s d(s)=1$)
Control with Value Function Approximation
Now we apply the VFA approach above to the state-action value function, $\hat{Q}^\pi(s,a;w) \approx Q^\pi(s,a)$.
This process generally combines bootstrapping and off-policy learning with function approximation.
We take exactly the same approach as above for state values → SGD.
Features are now defined over state-action pairs, and the state-action value is represented as a linear combination of those features and the weight vector:
$$
\hat{Q}(s,a;w)=x(s,a)^Tw=\sum^n_{j=1}x_j(s,a)w_j
$$
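One common way to construct $x(s,a)$ (an illustrative choice, not specified in the lecture) is to copy the state features $x(s)$ into the block corresponding to the chosen action and leave the rest zero:

```python
import numpy as np

def q_features(x_s, a, n_actions):
    """Illustrative x(s, a): place the state features x(s) in the block
    for action a, zeros elsewhere, so Q_hat(s, a; w) = x(s, a)^T w."""
    x_sa = np.zeros(len(x_s) * n_actions)
    x_sa[a * len(x_s):(a + 1) * len(x_s)] = x_s
    return x_sa

def q_hat(x_s, a, w, n_actions):
    return q_features(x_s, a, n_actions) @ w   # x(s, a)^T w
```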
Stochastic Gradient Descent
The objective is the mean squared error between the true action-value $Q^\pi(s,a)$ and its approximation; its gradient is
$$
\nabla_wJ(w)=\nabla_wE_\pi[(Q^\pi(s,a)-\hat{Q}^\pi(s,a;w))^2]
$$
We use the above to derive the $\Delta w$ update for each method.
Gradient Descent term for MC, TD, and Q-Learning
- Monte-Carlo
$$
\Delta w = \alpha(G_t-\hat{Q}(s_t,a_t; w))\nabla_w\hat{Q}(s_t,a_t;w)
$$
- SARSA
$$
\Delta w = \alpha(r+\gamma\hat{Q}(s',a'; w)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$
- Q-Learning
$$
\Delta w = \alpha(r+\gamma \max_{a'}\hat{Q}(s',a'; w)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$
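To tie these together, here is a minimal sketch of the Q-learning case with a linear Q-function. The per-action weight layout (equivalent to the block feature construction sketched earlier), the epsilon-greedy behavior policy, and the Gym-style `env.reset()`/`env.step()` interface are all assumptions for illustration:

```python
import numpy as np

def q_learning_linear(env, featurize, n_state_features, n_actions,
                      alpha=0.01, gamma=0.9, epsilon=0.1, n_episodes=100):
    """Q-learning with a linear VFA.

    One weight vector per action, so Q_hat(s, a; w) = w[a]^T x(s), the
    block form of x(s, a)^T w from the text. `featurize(s)` is assumed to
    return x(s); `env` is assumed to follow a Gym-style interface.
    """
    w = np.zeros((n_actions, n_state_features))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            x_s = featurize(s)
            # epsilon-greedy behavior policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(w @ x_s))
            s_next, r, done, _ = env.step(a)
            x_next = featurize(s_next)
            # Q-learning bootstraps with the max over next actions (off-policy)
            target = r if done else r + gamma * np.max(w @ x_next)
            # gradient of Q_hat(s, a; w) w.r.t. the chosen action's block is x(s)
            w[a] += alpha * (target - w[a] @ x_s) * x_s
            s = s_next
    return w
```

For SARSA, the target would instead use the action actually taken in $s'$ by the behavior policy rather than the max.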
Convergence of Control Methods with VFA