
Stanford CS234 Lecture 10

by 누워있는말티즈 2022. 8. 12.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 10

Continuing our discussion of Updating Parameters Given the Gradient.

Local Approximation

We couldn't compute the equation above because we have no idea what $\tilde{\pi}$ is. So, as an approximation, we replace that term with the previous policy.

We take policy $\pi^i$, run it, collect $D$ trajectories, and use them to obtain the state distribution $\mu$ → use that to compute $\pi^{i+1}$.

In other words, we plug $\mu_\pi(s)$ into the place of $\mu_{\tilde{\pi}}(s)$ purely for the sake of computation.

So we "just say" that this is an objective function, something that can be optimized.

If you evaluate this function under the same policy, you just get that policy's value back.
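Concretely, once $L$ is defined as below, plugging the same policy into both slots gives back that policy's value, because the expected advantage under the policy's own action distribution is zero:

$$
L_\pi(\pi)=V^{\pi}+\sum_s \mu_\pi(s)\sum_a\pi(a|s)A_\pi(s,a)=V^{\pi},\qquad \sum_a\pi(a|s)A_\pi(s,a)=0
$$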

Conservative Policy Iteration

We form a new policy as a blend of the current policy and some other policy: $\pi_{new}(a|s) = (1-\alpha)\,\pi_{old}(a|s) + \alpha\,\pi'(a|s)$.

Again, if $\alpha = 0$, then $\pi_{new}=\pi_{old}$, and so $V^{\pi_{new}}=L_{\pi_{old}}(\pi_{new})=L_{\pi_{old}}(\pi_{old})=V^{\pi_{old}}$.

For any pair of stochastic policies (not just mixtures), you can get a lower bound on the performance of the new policy in terms of the local approximation

$$
L_{\pi}(\tilde{\pi})=V^{\pi}+\sum_s \mu_\pi(s)\sum_a\tilde{\pi}(a|s)A_\pi(s,a)
$$
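The bound itself (the policy improvement bound from the TRPO paper, restated roughly in the notation above) takes the form

$$
V^{\tilde{\pi}} \ge L_{\pi}(\tilde{\pi}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\left(D^{max}_{TV}(\pi,\tilde{\pi})\right)^2, \qquad \epsilon=\max_{s,a}|A_\pi(s,a)|
$$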


$D^{max}_{TV}$ denotes the maximum, over states, of the total-variation distance between the action distributions of the two policies.

This theorem implies that

with objective function $L$, the value of the new policy is at least the objective function minus a penalty proportional to the squared maximum total-variation distance between the two policies.

The TV divergence can be difficult to work with, so we can use the KL divergence instead, via the relation $(D_{TV}(p\,\|\,q))^2 \le D_{KL}(p\,\|\,q)$.
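Substituting that relation into the bound gives the KL form, which is the one TRPO actually works with:

$$
V^{\tilde{\pi}} \ge L_{\pi}(\tilde{\pi}) - C\,D^{max}_{KL}(\pi,\tilde{\pi}), \qquad C=\frac{4\epsilon\gamma}{(1-\gamma)^2}
$$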

How does this guarantee improvement?
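A quick sketch of the argument, following the TRPO paper: define the surrogate

$$
M_i(\pi)=L_{\pi_i}(\pi)-C\,D^{max}_{KL}(\pi_i,\pi)
$$

The bound says $V^{\pi_{i+1}} \ge M_i(\pi_{i+1})$, and $V^{\pi_i}=M_i(\pi_i)$ because the KL term vanishes at $\pi_i$, so

$$
V^{\pi_{i+1}}-V^{\pi_i} \ \ge\ M_i(\pi_{i+1})-M_i(\pi_i)
$$

Maximizing $M_i$ at each iteration therefore guarantees that the true value never decreases, i.e., monotonic improvement.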

Trust Region Policy Optimization (TRPO) Algorithm

This picks up on the step size mentioned above.

our goal is to optimize
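In penalized form (plugging in the bound above), this is roughly

$$
\max_\theta\ \left[\,L_{\theta_{old}}(\theta)-C\,D^{max}_{KL}(\theta_{old},\theta)\,\right]
$$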


Using this in practice requires the step size to be very small, so instead we impose a constraint on the step size.

→ Introduce a trust region as the constraint

Now our objective is as below
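Roughly (TRPO also replaces the max KL with an average KL over states visited by the old policy, which is easier to estimate from samples):

$$
\max_\theta\ L_{\theta_{old}}(\theta)\quad\text{subject to}\quad \mathbb{E}_{s\sim\mu_{\theta_{old}}}\!\left[D_{KL}\!\left(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\right)\right]\le\delta
$$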


where

$$
L_{\theta_{old}}(\theta)=V(\theta_{old})+\sum_s \mu_{\theta_{old}}(s)\sum_a\pi(a|s,\theta)A_{\theta_{old}}(s,a)
$$

Here we run into a problem: we do not know the actual state distribution ($\mu$) nor the true $A_\theta$ → we only have samples!

Therefore we make the following substitutions:

  1. we only look at states that were actually sampled by the current (old) policy, and re-weight them accordingly

  2. we replace the sum over actions with samples from a sampling distribution $q$ (typically the old policy), importance-weighted by its probability

  3. we use $Q$ estimates in place of $A$ (which only changes the objective by a constant)

With all of these substitutions, we finally have an objective we can estimate from samples and optimize.
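Putting the three substitutions together, the sample-based surrogate problem looks roughly like this, with $q$ the sampling distribution over actions (typically just the old policy):

$$
\max_\theta\ \mathbb{E}_{s\sim\mu_{\theta_{old}},\,a\sim q}\!\left[\frac{\pi_\theta(a|s)}{q(a|s)}\,Q_{\theta_{old}}(s,a)\right]\quad\text{subject to}\quad \mathbb{E}_{s\sim\mu_{\theta_{old}}}\!\left[D_{KL}\!\left(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\right)\right]\le\delta
$$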

Algorithm

The TRPO algorithm automatically constrains the weight update to a trust region, which approximates the region where the first-order approximation of the objective is valid.
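As a rough illustration (not the lecture's pseudocode, and much simpler than a real TRPO implementation, which uses the natural gradient via conjugate gradient plus a line search), here is a PyTorch-style sketch that computes the sampled surrogate and the average KL, takes a plain gradient step, and then shrinks the step back toward the old parameters until the KL constraint $\delta$ is met. Names like `policy_net`, `optimizer`, and the batch layout are assumptions made for the example.

```python
# Simplified trust-region-style update sketch (discrete actions).
import torch
from torch.distributions import Categorical, kl_divergence

def surrogate_and_kl(policy_net, old_logits, states, actions, advantages):
    new_dist = Categorical(logits=policy_net(states))
    old_dist = Categorical(logits=old_logits)
    # ratio pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()        # sample estimate of L_{theta_old}(theta)
    kl = kl_divergence(old_dist, new_dist).mean()  # average KL over sampled states
    return surrogate, kl

def trust_region_step(policy_net, optimizer, batch, delta=0.01, max_backtracks=10):
    states, actions, advantages = batch
    with torch.no_grad():
        old_logits = policy_net(states)            # freeze the old policy's logits
    old_params = [p.detach().clone() for p in policy_net.parameters()]

    surrogate, _ = surrogate_and_kl(policy_net, old_logits, states, actions, advantages)
    optimizer.zero_grad()
    (-surrogate).backward()                        # ascend the surrogate objective
    optimizer.step()

    # Shrink the step toward the old parameters until the KL constraint holds.
    for _ in range(max_backtracks):
        with torch.no_grad():
            _, kl = surrogate_and_kl(policy_net, old_logits, states, actions, advantages)
            if kl <= delta:
                break
            for p, p_old in zip(policy_net.parameters(), old_params):
                p.copy_(0.5 * p + 0.5 * p_old)     # halve the remaining step
```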

Common Template of Policy Gradient Algorithms

  • for each iteration, gather trajectories of data by running the policy
  • compute the target ($Q$ estimates or Monte-Carlo returns $R$), trading off bias and variance
  • use it to estimate the policy gradient
  • take a step along the gradient, aiming for monotonic improvement (a minimal sketch follows below)
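To make the template concrete, here is a minimal self-contained sketch of that loop using plain REINFORCE (Monte-Carlo returns as the target) on a tiny toy chain MDP; the environment, step size, and episode counts are all made up for illustration, not taken from the lecture.

```python
# Minimal policy-gradient template: gather trajectories, compute targets,
# estimate the gradient, take a step. Tabular softmax policy, toy chain MDP.
import numpy as np

N_STATES, N_ACTIONS, GAMMA, LR = 3, 2, 0.99, 0.1
rng = np.random.default_rng(0)

def env_step(state, action):
    """Toy chain: action 1 moves right (reward 1 at the end), action 0 ends the episode."""
    if action == 1:
        next_state = state + 1
        if next_state == N_STATES:
            return 0, 1.0, True        # reached the goal
        return next_state, 0.0, False
    return 0, 0.0, True                # quit early, no reward

def softmax_policy(theta, state):
    logits = theta[state]
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

theta = np.zeros((N_STATES, N_ACTIONS))  # tabular policy parameters

for iteration in range(200):
    # 1. gather trajectories by running the current policy
    trajectories = []
    for _ in range(10):
        state, done, traj = 0, False, []
        while not done:
            probs = softmax_policy(theta, state)
            action = rng.choice(N_ACTIONS, p=probs)
            next_state, reward, done = env_step(state, action)
            traj.append((state, action, reward))
            state = next_state
        trajectories.append(traj)

    # 2. compute the target (Monte-Carlo return G_t: unbiased, high variance)
    # 3. estimate the policy gradient  sum_t grad log pi(a_t|s_t) * G_t
    grad = np.zeros_like(theta)
    for traj in trajectories:
        G = 0.0
        for state, action, reward in reversed(traj):
            G = reward + GAMMA * G
            probs = softmax_policy(theta, state)
            grad_log = -probs
            grad_log[action] += 1.0    # gradient of log softmax w.r.t. theta[state]
            grad[state] += grad_log * G
    grad /= len(trajectories)

    # 4. take a step along the gradient
    theta += LR * grad
```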
