
Stanford CS234 Lecture 10

by 누워있는말티즈 2022. 8. 12.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 10

Continuing our discussion of Updating Parameters Given the Gradient.

Local Approximation

We couldn't compute the equation above because we have no idea what $\tilde{\pi}$ is. So, as an approximation, we replace that term with the previous policy.

We take policy $\pi^i$, run it, collect $D$ trajectories, and use them to obtain the state distribution $\mu$ → use that to compute $\pi^{i+1}$.

In other words, we plug $\mu_\pi(s)$ into the place of $\mu_{\tilde{\pi}}(s)$ purely for the sake of computation.

So we "just say" that this is an objective function, something that can be optimized.

If you evaluate this function under the same policy, you just get that policy's value back.
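Concretely, once $L$ is defined as below, plugging the same policy into both slots gives back that policy's value, because the expected advantage under the policy's own action distribution is zero:

$$
L_\pi(\pi)=V^{\pi}+\sum_s \mu_\pi(s)\sum_a\pi(a|s)A_\pi(s,a)=V^{\pi},\qquad \sum_a\pi(a|s)A_\pi(s,a)=0
$$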

Conservative Policy Iteration

We form a new policy as a blend of the current policy and some other policy: $\pi_{new}(a|s) = (1-\alpha)\,\pi_{old}(a|s) + \alpha\,\pi'(a|s)$.

Again, if $\alpha = 0$, then $\pi_{new}=\pi_{old}$, and so $V^{\pi_{new}}=L_{\pi_{old}}(\pi_{new})=L_{\pi_{old}}(\pi_{old})=V^{\pi_{old}}$.

For any pair of stochastic policies (not just mixtures), you can get a lower bound on the performance of the new policy in terms of the local approximation

$$
L_{\pi}(\tilde{\pi})=V^{\pi}+\sum_s \mu_\pi(s)\sum_a\tilde{\pi}(a|s)A_\pi(s,a)
$$
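The bound itself (the policy improvement bound from the TRPO paper, restated roughly in the notation above) takes the form

$$
V^{\tilde{\pi}} \ge L_{\pi}(\tilde{\pi}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\left(D^{max}_{TV}(\pi,\tilde{\pi})\right)^2, \qquad \epsilon=\max_{s,a}|A_\pi(s,a)|
$$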


$D^{max}_{TV}$ denotes the maximum, over states, of the total-variation distance between the action distributions of the two policies.

This theorem implies that

with objective function $L$, the value of the new policy is at least the objective function minus a penalty proportional to the squared maximum total-variation distance between the two policies.

The TV divergence can be difficult to work with, so we can use the KL divergence instead, via the relation $(D_{TV}(p\,\|\,q))^2 \le D_{KL}(p\,\|\,q)$.
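Substituting that relation into the bound gives the KL form, which is the one TRPO actually works with:

$$
V^{\tilde{\pi}} \ge L_{\pi}(\tilde{\pi}) - C\,D^{max}_{KL}(\pi,\tilde{\pi}), \qquad C=\frac{4\epsilon\gamma}{(1-\gamma)^2}
$$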

How does this guarantee improvement?
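A quick sketch of the argument, following the TRPO paper: define the surrogate

$$
M_i(\pi)=L_{\pi_i}(\pi)-C\,D^{max}_{KL}(\pi_i,\pi)
$$

The bound says $V^{\pi_{i+1}} \ge M_i(\pi_{i+1})$, and $V^{\pi_i}=M_i(\pi_i)$ because the KL term vanishes at $\pi_i$, so

$$
V^{\pi_{i+1}}-V^{\pi_i} \ \ge\ M_i(\pi_{i+1})-M_i(\pi_i)
$$

Maximizing $M_i$ at each iteration therefore guarantees that the true value never decreases, i.e., monotonic improvement.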

Trust Region Policy Optimization (TRPO) Algorithm

This picks up on the step size mentioned above.

our goal is to optimize
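In penalized form (plugging in the bound above), this is roughly

$$
\max_\theta\ \left[\,L_{\theta_{old}}(\theta)-C\,D^{max}_{KL}(\theta_{old},\theta)\,\right]
$$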


Using this in practice requires the step size to be very small, so instead we impose a constraint on the step size.

→ Introduce a trust region as the constraint

Now our objective is as below
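Roughly (TRPO also replaces the max KL with an average KL over states visited by the old policy, which is easier to estimate from samples):

$$
\max_\theta\ L_{\theta_{old}}(\theta)\quad\text{subject to}\quad \mathbb{E}_{s\sim\mu_{\theta_{old}}}\!\left[D_{KL}\!\left(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\right)\right]\le\delta
$$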


where

$$
L_{\theta_{old}}(\theta)=V(\theta_{old})+\sum_s \mu_{\theta_{old}}(s)\sum_a\pi(a|s,\theta)A_{\theta_{old}}(s,a)
$$

Here we run into a problem: we do not know the actual state distribution ($\mu$) nor the true $A_\theta$ → we only have samples!

Therefore we make the following substitutions:

  1. we only look at states that were actually sampled by the current (old) policy, and re-weight them accordingly

  2. we replace the sum over actions with samples from a sampling distribution $q$ (typically the old policy), importance-weighted by its probability

  3. we use $Q$ estimates in place of $A$ (which only changes the objective by a constant)

With all of these substitutions, we finally have an objective we can estimate from samples and optimize.
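Putting the three substitutions together, the sample-based surrogate problem looks roughly like this, with $q$ the sampling distribution over actions (typically just the old policy):

$$
\max_\theta\ \mathbb{E}_{s\sim\mu_{\theta_{old}},\,a\sim q}\!\left[\frac{\pi_\theta(a|s)}{q(a|s)}\,Q_{\theta_{old}}(s,a)\right]\quad\text{subject to}\quad \mathbb{E}_{s\sim\mu_{\theta_{old}}}\!\left[D_{KL}\!\left(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\right)\right]\le\delta
$$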

Algorithm

The TRPO algorithm automatically constrains the weight update to a trust region, which approximates the region where the first-order approximation of the objective is valid.
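As a rough illustration (not the lecture's pseudocode, and much simpler than a real TRPO implementation, which uses the natural gradient via conjugate gradient plus a line search), here is a PyTorch-style sketch that computes the sampled surrogate and the average KL, takes a plain gradient step, and then shrinks the step back toward the old parameters until the KL constraint $\delta$ is met. Names like `policy_net`, `optimizer`, and the batch layout are assumptions made for the example.

```python
# Simplified trust-region-style update sketch (discrete actions).
import torch
from torch.distributions import Categorical, kl_divergence

def surrogate_and_kl(policy_net, old_logits, states, actions, advantages):
    new_dist = Categorical(logits=policy_net(states))
    old_dist = Categorical(logits=old_logits)
    # ratio pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()        # sample estimate of L_{theta_old}(theta)
    kl = kl_divergence(old_dist, new_dist).mean()  # average KL over sampled states
    return surrogate, kl

def trust_region_step(policy_net, optimizer, batch, delta=0.01, max_backtracks=10):
    states, actions, advantages = batch
    with torch.no_grad():
        old_logits = policy_net(states)            # freeze the old policy's logits
    old_params = [p.detach().clone() for p in policy_net.parameters()]

    surrogate, _ = surrogate_and_kl(policy_net, old_logits, states, actions, advantages)
    optimizer.zero_grad()
    (-surrogate).backward()                        # ascend the surrogate objective
    optimizer.step()

    # Shrink the step toward the old parameters until the KL constraint holds.
    for _ in range(max_backtracks):
        with torch.no_grad():
            _, kl = surrogate_and_kl(policy_net, old_logits, states, actions, advantages)
            if kl <= delta:
                break
            for p, p_old in zip(policy_net.parameters(), old_params):
                p.copy_(0.5 * p + 0.5 * p_old)     # halve the remaining step
```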

Common Template of Policy Gradient Algorithms

  • for each iteration, gather trajectories of data by running the policy
  • compute the target ($Q$ estimates or Monte-Carlo returns $R$), trading off bias and variance
  • use it to estimate the policy gradient
  • take a step along the gradient, aiming for monotonic improvement (a minimal sketch follows below)
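To make the template concrete, here is a minimal self-contained sketch of that loop using plain REINFORCE (Monte-Carlo returns as the target) on a tiny toy chain MDP; the environment, step size, and episode counts are all made up for illustration, not taken from the lecture.

```python
# Minimal policy-gradient template: gather trajectories, compute targets,
# estimate the gradient, take a step. Tabular softmax policy, toy chain MDP.
import numpy as np

N_STATES, N_ACTIONS, GAMMA, LR = 3, 2, 0.99, 0.1
rng = np.random.default_rng(0)

def env_step(state, action):
    """Toy chain: action 1 moves right (reward 1 at the end), action 0 ends the episode."""
    if action == 1:
        next_state = state + 1
        if next_state == N_STATES:
            return 0, 1.0, True        # reached the goal
        return next_state, 0.0, False
    return 0, 0.0, True                # quit early, no reward

def softmax_policy(theta, state):
    logits = theta[state]
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

theta = np.zeros((N_STATES, N_ACTIONS))  # tabular policy parameters

for iteration in range(200):
    # 1. gather trajectories by running the current policy
    trajectories = []
    for _ in range(10):
        state, done, traj = 0, False, []
        while not done:
            probs = softmax_policy(theta, state)
            action = rng.choice(N_ACTIONS, p=probs)
            next_state, reward, done = env_step(state, action)
            traj.append((state, action, reward))
            state = next_state
        trajectories.append(traj)

    # 2. compute the target (Monte-Carlo return G_t: unbiased, high variance)
    # 3. estimate the policy gradient  sum_t grad log pi(a_t|s_t) * G_t
    grad = np.zeros_like(theta)
    for traj in trajectories:
        G = 0.0
        for state, action, reward in reversed(traj):
            G = reward + GAMMA * G
            probs = softmax_policy(theta, state)
            grad_log = -probs
            grad_log[action] += 1.0    # gradient of log softmax w.r.t. theta[state]
            grad[state] += grad_log * G
    grad /= len(trajectories)

    # 4. take a step along the gradient
    theta += LR * grad
```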
