
Stanford CS234 Lecture 5

by 누워있는말티즈 2022. 8. 8.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 5

We need to be able to generalize from our experience to make “good decisions”

Value Function Approximation (VFA)

From now on, we will represent the state (or state-action) value function with a parameterized function.


The input is a state (or a state-action pair), and the output is the corresponding value estimate.

The parameter w here is, in the simplest case, just a vector; more generally it could be something like the weights of a DNN.

Motivations

  • We don’t want to store and scan through every single state’s properties individually.
  • We want a more compact, precise, and generalized representation.

Benefits of Generalization

  • Reduce required memory
  • Reduce computation
  • Reduce the amount of experience (number of samples) needed

Function Approximators

Out of the many possible approximators (neural networks, linear feature combinations, Fourier bases, etc.), we will focus on differentiable ones.

Today we will focus on linear feature representations.

Review Gradient Descent

Consider a function J(w) that is a differentiable function of w.

→ we want to find the w that minimizes J

$\nabla_w J(w) = \left[\dfrac{\partial J(w)}{\partial w_1}, \dfrac{\partial J(w)}{\partial w_2}, \ldots, \dfrac{\partial J(w)}{\partial w_N}\right]$

then we update our parameter with this gradient

$w \leftarrow w - \alpha \nabla_w J(w)$

where α is the “learning rate”, a constant step size. With the update rule above, w is guaranteed to move toward a local optimum.
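
As a minimal sketch (not from the lecture), the gradient descent loop above can be written in a few lines of Python; the quadratic objective and its gradient here are made-up stand-ins for whatever differentiable J(w) we actually care about:

```python
import numpy as np

# Hypothetical quadratic objective J(w) = ||w - w_star||^2, gradient 2 (w - w_star).
w_star = np.array([1.0, -2.0, 0.5])   # made-up minimizer, for illustration only

def grad_J(w):
    return 2.0 * (w - w_star)          # ∇_w J(w)

w = np.zeros(3)                        # initial parameters
alpha = 0.1                            # learning rate (constant step size)
for _ in range(100):
    w = w - alpha * grad_J(w)          # w ← w − α ∇_w J(w)

print(w)                               # ≈ w_star, i.e. a local (here global) optimum
```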


VFA for Prediction

We will assume that there is an oracle 🔮 that returns the exact value for a given state, so that we know $V^\pi(s)$.

Stochastic Gradient Descent(SGD)

We use SGD to find the parameter w that minimizes the loss between the true value $V^\pi(s)$ and the estimated value $\hat{V}(s; w)$ (the w here denotes that the estimate is parameterized by w).

We represent states with feature vectors $x(s)$.

Linear VFA for Prediction with Oracle
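
Below is a minimal sketch of the oracle-based SGD step, assuming a linear representation $\hat{V}(s; w) = x(s)^T w$ and a hypothetical oracle value `v_true` standing in for $V^\pi(s)$:

```python
import numpy as np

def sgd_update_with_oracle(w, x_s, v_true, alpha=0.1):
    """One SGD step on (V^pi(s) - x(s)^T w)^2 for a linear VFA.

    w      : parameter vector
    x_s    : feature vector x(s)
    v_true : oracle value V^pi(s)
    """
    v_hat = x_s @ w                            # V̂(s; w) = x(s)^T w
    delta_w = alpha * (v_true - v_hat) * x_s   # Δw = α (V^π(s) − V̂(s;w)) x(s)
    return w + delta_w
```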

*Convergence Guarantees for Linear VFA for Policy Evaluation

Define the mean squared error of a linear VFA for a particular policy π:

$MSVE(w) = \sum_{s \in S} d(s)\left(V^\pi(s) - \hat{V}^\pi(s; w)\right)^2$

where d(s) is the stationary distribution of π in the decision process → a probability distribution over states ($\sum_s d(s) = 1$).
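
As a tiny numeric illustration (all numbers made up), the MSVE is just a d(s)-weighted sum of squared errors:

```python
import numpy as np

d      = np.array([0.5, 0.3, 0.2])        # stationary distribution d(s), sums to 1 (made up)
v_true = np.array([1.0, 2.0, 3.0])        # true values V^pi(s)           (made up)
v_hat  = np.array([1.1, 1.8, 3.3])        # approximate values V̂(s; w)   (made up)

msve = np.sum(d * (v_true - v_hat) ** 2)  # Σ_s d(s) (V^π(s) − V̂(s;w))²
print(msve)
```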

Monte-Carlo Linear VFA

We no longer have an oracle that tells us the true $V^\pi$. In Monte-Carlo we use the return $G_t$ as the target instead.

$\Delta w = \alpha\left(G_t - \hat{V}(s_t; w)\right)\nabla_w \hat{V}(s_t; w) = \alpha\left(G_t - \hat{V}(s_t; w)\right)x(s_t) = \alpha\left(G_t - x(s_t)^T w\right)x(s_t)$

This can be applied identically to both first-visit and every-visit MC.
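
A sketch of every-visit Monte-Carlo policy evaluation with a linear VFA; the episode format (a list of (x(s_t), r_t) pairs) is an assumption made here for illustration:

```python
import numpy as np

def mc_linear_vfa(episodes, n_features, alpha=0.05, gamma=1.0):
    """Every-visit MC policy evaluation with V̂(s; w) = x(s)^T w."""
    w = np.zeros(n_features)
    for episode in episodes:                       # episode: list of (x_s, reward) pairs
        G = 0.0
        for x_s, r in reversed(episode):           # walk backwards to accumulate returns
            G = r + gamma * G                      # return G_t
            w += alpha * (G - x_s @ w) * x_s       # Δw = α (G_t − x(s_t)^T w) x(s_t)
    return w
```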

Algorithm

  • Baird Example - Monte Carlo Policy Evaluation

    Calculate $x(s_1)$ for $s_1$ from its features: $x(s_1) = [2, 0, 0, 0, 0, 0, 0, 1]$.

    Assume $w = [1, 1, 1, 1, 1, 1, 1, 1]$ and $\alpha = 0.5$.

    Trajectory: $(s_1, a_1, 0, s_7, a_1, 0, s_7, a_1, 0, \text{term})$

    Ans. The return is $G_{s_1} = 0$ and $\hat{V}(s_1) = x(s_1)^T w = 3$, so

    $\Delta w = 0.5\,(0 - 3)\,x(s_1) = -1.5\,[2, 0, 0, 0, 0, 0, 0, 1]$

    $w \leftarrow w + \Delta w = [-2, 1, 1, 1, 1, 1, 1, -0.5]$
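
Plugging the numbers above into the update, as a quick check:

```python
import numpy as np

x_s1 = np.array([2., 0., 0., 0., 0., 0., 0., 1.])  # features of s1
w = np.ones(8)                                     # initial parameters
alpha, G = 0.5, 0.0                                # step size and observed return

delta_w = alpha * (G - x_s1 @ w) * x_s1            # 0.5 · (0 − 3) · x(s1)
print(w + delta_w)                                 # [-2.  1.  1.  1.  1.  1.  1. -0.5]
```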

*Convergence Guarantees for Linear VFA for “Monte Carlo” Policy Evaluation

Define the mean squared error of a linear VFA for a particular policy π:

$MSVE(w_{MC}) = \min_w \sum_{s \in S} d(s)\left(V^\pi(s) - \hat{V}^\pi(s; w)\right)^2$

where d(s) is the stationary distribution of π in the decision process → a probability distribution over states ($\sum_s d(s) = 1$).

Batch Monte-Carlo VFA

We take a set of episodes (a.k.a. a batch) generated by policy π. Since this set is finite, it is possible to analytically solve for the approximation that minimizes the MSE:

$\arg\min_w \sum_{i=1}^{N}\left(G(s_i) - x(s_i)^T w\right)^2$

Take the derivative and set it to 0:

$w = (X^T X)^{-1} X^T G$

where X is the matrix whose rows are the feature vectors of each of the N states, and G is the vector of their returns.
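
A sketch of the closed-form batch fit, with X and G as defined above; `np.linalg.lstsq` solves the same least-squares problem while avoiding an explicit matrix inverse:

```python
import numpy as np

def batch_mc_solution(X, G):
    """Batch Monte-Carlo fit: w = (X^T X)^{-1} X^T G.

    X: (N, n_features) matrix whose rows are the feature vectors x(s_i)
    G: (N,) vector of observed returns G(s_i)
    """
    w, *_ = np.linalg.lstsq(X, G, rcond=None)  # min_w ||X w − G||²
    return w
```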

Temporal Difference(TD) Learning VFA

Instead of the return $G_t$ as in Monte-Carlo, here we use the TD target $r_j + \gamma \hat{V}^\pi(s_{j+1}; w)$.

$\Delta w = \alpha\left(r + \gamma \hat{V}^\pi(s'; w) - \hat{V}(s_t; w)\right)\nabla_w \hat{V}(s_t; w) = \alpha\left(r + \gamma \hat{V}^\pi(s'; w) - \hat{V}(s_t; w)\right)x(s_t) = \alpha\left(r + \gamma x(s')^T w - x(s_t)^T w\right)x(s_t)$
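
A sketch of one TD(0) step with a linear VFA; the transition format (x(s), r, x(s'), done) is assumed here for illustration:

```python
import numpy as np

def td0_linear_update(w, x_s, r, x_s_next, done, alpha=0.05, gamma=0.9):
    """One TD(0) step with V̂(s; w) = x(s)^T w."""
    v_next = 0.0 if done else x_s_next @ w         # V̂(s'; w), zero at terminal states
    td_target = r + gamma * v_next                 # r + γ V̂(s'; w)
    w = w + alpha * (td_target - x_s @ w) * x_s    # Δw = α (target − x(s)^T w) x(s)
    return w
```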

Algorithm

Basically identical to the Monte-Carlo method, but with the TD target in place of the return.

  • Baird Example - Temporal Difference Policy Evaluation

    Calculate $x(s_1)$ for $s_1$ from its features: $x(s_1) = [2, 0, 0, 0, 0, 0, 0, 1]$.

    Assume $w = [1, 1, 1, 1, 1, 1, 1, 1]$, $\alpha = 0.5$, and $\gamma = 0.9$.

    Trajectory: $(s_1, a_1, 0, s_7, a_1, 0, s_7, a_1, 0, \text{term})$; take the sample $(s_1, a_1, 0, s_7)$.

    Ans. $\hat{V}(s_1) = x(s_1)^T w = 3$ and $\hat{V}(s_7) = x(s_7)^T w = 3$, so

    $\Delta w = 0.5\,\big(0 + 0.9\,x(s_7)^T w - x(s_1)^T w\big)\,x(s_1) = 0.5\,(0 + 0.9 \cdot 3 - 3)\,[2, 0, 0, 0, 0, 0, 0, 1] = [-0.3, 0, 0, 0, 0, 0, 0, -0.15]$
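
Again plugging in the numbers as a quick check; the feature vector for $s_7$ is assumed to be $[0, 0, 0, 0, 0, 0, 1, 2]$ (the usual Baird features), consistent with $x(s_7)^T w = 3$ above:

```python
import numpy as np

x_s1 = np.array([2., 0., 0., 0., 0., 0., 0., 1.])  # features of s1
x_s7 = np.array([0., 0., 0., 0., 0., 0., 1., 2.])  # assumed features of s7 (x(s7)^T w = 3)
w, alpha, gamma, r = np.ones(8), 0.5, 0.9, 0.0

delta_w = alpha * (r + gamma * (x_s7 @ w) - x_s1 @ w) * x_s1
print(delta_w)                                     # [-0.3  0.  0.  0.  0.  0.  0. -0.15]
```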

*Convergence Guarantees for Linear VFA for “Temporal Difference” Policy Evaluation

Define the mean squared error of a linear VFA for a particular policy π:

$MSVE(w_{TD}) \leq \dfrac{1}{1-\gamma} \min_w \sum_{s \in S} d(s)\left(V^\pi(s) - \hat{V}^\pi(s; w)\right)^2$

where d(s) is the stationary distribution of π in the decision process → a probability distribution over states ($\sum_s d(s) = 1$).

Control with Value Function Approximation

Now we apply the VFA approach above to the state-action value: $\hat{Q}^\pi(s, a; w) \approx Q^\pi(s, a)$.

This process generally involves bootstrapping and off-policy learning combined with function approximation, which makes convergence harder to guarantee.

We take the exact same approach as above, where we dealt with state values → SGD.

Features are now defined over both states and actions, as below.

state-action value representation with features and weight value

$\hat{Q}(s, a; w) = x(s, a)^T w = \sum_{j=1}^{n} x_j(s, a)\, w_j$
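
One common way to build $x(s, a)$, sketched below as an assumption rather than the lecture's specific choice: stack one copy of the state features per action, so only the block of the chosen action is non-zero.

```python
import numpy as np

def state_action_features(x_s, a, n_actions):
    """x(s, a): state features placed in the block corresponding to action a."""
    n = len(x_s)
    x_sa = np.zeros(n * n_actions)
    x_sa[a * n:(a + 1) * n] = x_s
    return x_sa

def q_hat(x_sa, w):
    """Q̂(s, a; w) = x(s, a)^T w = Σ_j x_j(s, a) w_j"""
    return x_sa @ w
```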

Stochastic Gradient Descent

The objective function is the mean squared error between $Q^\pi$ and $\hat{Q}^\pi$; its gradient is

$\nabla_w J(w) = \nabla_w \mathbb{E}_\pi\left[\left(Q^\pi(s, a) - \hat{Q}^\pi(s, a; w)\right)^2\right]$

We use the above to derive the $\Delta w$ terms.

Gradient Descent term for MC, TD, and Q-Learning

  • Monte-Carlo

$\Delta w = \alpha\left(G_t - \hat{Q}(s_t, a_t; w)\right)\nabla_w \hat{Q}(s_t, a_t; w)$

  • SARSA

$\Delta w = \alpha\left(r + \gamma \hat{Q}(s', a'; w) - \hat{Q}(s, a; w)\right)\nabla_w \hat{Q}(s, a; w)$

  • Q-Learning

$\Delta w = \alpha\left(r + \gamma \max_{a'} \hat{Q}(s', a'; w) - \hat{Q}(s, a; w)\right)\nabla_w \hat{Q}(s, a; w)$
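
A sketch of the SARSA and Q-learning steps with a linear $\hat{Q}$; the feature inputs (x(s, a) vectors, plus a list of x(s', a') over all actions for the max in Q-learning) are assumptions made here for illustration:

```python
import numpy as np

def q_hat(x_sa, w):
    return x_sa @ w                                      # Q̂(s,a;w) = x(s,a)^T w

def sarsa_update(w, x_sa, r, x_sa_next, alpha=0.05, gamma=0.9):
    # Δw = α (r + γ Q̂(s',a';w) − Q̂(s,a;w)) x(s,a), with a' the action actually taken next
    target = r + gamma * q_hat(x_sa_next, w)
    return w + alpha * (target - q_hat(x_sa, w)) * x_sa

def q_learning_update(w, x_sa, r, x_sa_next_all, alpha=0.05, gamma=0.9):
    # Δw = α (r + γ max_a' Q̂(s',a';w) − Q̂(s,a;w)) x(s,a)
    target = r + gamma * max(q_hat(x, w) for x in x_sa_next_all)
    return w + alpha * (target - q_hat(x_sa, w)) * x_sa
```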

Convergence of Control Methods with VFA
