Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 10
Continuing our discussion of Updating Parameters Given the Gradient.
Local Approximation
We couldn't compute the equation above directly, because we have no idea what trajectories (and hence what state visitation distribution) the new policy would produce.

So we define a local approximation that keeps the current policy's state distribution and only swaps in the new policy's action probabilities:

$$L_{\pi}(\tilde{\pi}) = J(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s)\,A_{\pi}(s,a)$$

We use the old policy's state visitation distribution $\rho_{\pi}$ purely for the sake of computation, so we "just say" this is an objective function, i.e. something that can be optimized. If you evaluate this function at the same policy ($\tilde{\pi} = \pi$), you just get the same value: $L_{\pi}(\pi) = J(\pi)$.
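To make the definition concrete, here is a minimal tabular sketch (the sizes and values are made up for illustration): it evaluates $L_{\pi}(\tilde{\pi})$ using the old policy's state visitation weights and the candidate policy's action probabilities.

```python
import numpy as np

# Toy tabular setup (sizes and values are made up for illustration).
n_states, n_actions = 5, 3
rng = np.random.default_rng(0)

pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)   # current policy pi(a|s)
pi_new = rng.dirichlet(np.ones(n_actions), size=n_states)   # candidate policy pi~(a|s)
rho_old = rng.dirichlet(np.ones(n_states))                  # state visitation rho_pi(s) of pi_old
J_old = 1.0                                                  # placeholder for J(pi)

# Fake advantages, centered so that E_{a~pi_old}[A_pi(s,a)] = 0 in every state,
# as a true advantage function would be.
advantage = rng.normal(size=(n_states, n_actions))
advantage -= (pi_old * advantage).sum(axis=1, keepdims=True)

def local_approx(pi_tilde):
    """L_pi(pi~) = J(pi) + sum_s rho_pi(s) sum_a pi~(a|s) A_pi(s,a)."""
    return J_old + np.sum(rho_old[:, None] * pi_tilde * advantage)

print(local_approx(pi_new))   # surrogate value for the candidate policy
print(local_approx(pi_old))   # equals J_old: evaluating at the same policy gives the same value
```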
Conservative Policy Iteration
We form a new policy as a blend (mixture) of the current policy and some other policy $\pi'$:

$$\pi_{\text{new}}(a|s) = (1-\alpha)\,\pi_{\text{old}}(a|s) + \alpha\,\pi'(a|s)$$

For such a mixture, there is a lower bound on the true performance in terms of the local approximation:

$$J(\pi_{\text{new}}) \;\ge\; L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2, \qquad \epsilon = \max_s \left|\,\mathbb{E}_{a \sim \pi'(\cdot|s)}\big[A_{\pi_{\text{old}}}(s,a)\big]\right|$$

Again, if $\alpha = 0$ (we keep the old policy), both sides coincide and the bound is tight.
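As a quick illustration (a toy example, not from the lecture), blending two tabular policies is just a per-state convex combination of their action distributions:

```python
import numpy as np

def mixture_policy(pi_old, pi_prime, alpha):
    """Conservative policy iteration update: (1 - alpha) * pi_old + alpha * pi_prime.

    pi_old, pi_prime: arrays of shape (n_states, n_actions), each row sums to 1.
    alpha: blend weight in [0, 1]; a small alpha keeps the new policy close to the old one.
    """
    pi_new = (1.0 - alpha) * pi_old + alpha * pi_prime
    return pi_new  # still a valid distribution over actions for every state

# Example: with alpha = 0 we would recover pi_old exactly.
pi_old = np.array([[0.7, 0.3], [0.4, 0.6]])
pi_prime = np.array([[0.1, 0.9], [0.9, 0.1]])
print(mixture_policy(pi_old, pi_prime, alpha=0.2))
```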
For any two stochastic policies (not just mixtures), you can get a bound on the performance:

$$J(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\big(D_{TV}^{\max}(\pi, \tilde{\pi})\big)^2, \qquad \epsilon = \max_{s,a}\,|A_{\pi}(s,a)|$$

This theorem implies that we are guaranteed improvement in the true objective $J$ if, at each step, we maximize the right-hand side of the bound as our objective function (the surrogate minus the divergence penalty).

The TV divergence might be difficult to work with, so, using the fact that $D_{TV}(p\,\|\,q)^2 \le D_{KL}(p\,\|\,q)$, we may state the bound with the KL divergence in this form:

$$J(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) - C\,D_{KL}^{\max}(\pi, \tilde{\pi}), \qquad C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$$
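As a small numerical sanity check of the inequality $D_{TV}(p\,\|\,q)^2 \le D_{KL}(p\,\|\,q)$ used above (a toy example, not from the lecture):

```python
import numpy as np

def tv(p, q):
    """Total variation distance between discrete distributions: 0.5 * sum |p - q|."""
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    """KL divergence D_KL(p || q) between discrete distributions."""
    return np.sum(p * np.log(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.3, 0.4, 0.3])

print(tv(p, q) ** 2, kl(p, q))      # 0.09 vs roughly 0.22
assert tv(p, q) ** 2 <= kl(p, q)    # the inequality that lets us swap TV for KL in the penalty
```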
How does this guarantee improvement?

Define the penalized surrogate $M_i(\pi) = L_{\pi_i}(\pi) - C\,D_{KL}^{\max}(\pi_i, \pi)$. By the bound above, $J(\pi_{i+1}) \ge M_i(\pi_{i+1})$, and $J(\pi_i) = M_i(\pi_i)$ because the KL term vanishes at $\pi = \pi_i$. Therefore

$$J(\pi_{i+1}) - J(\pi_i) \;\ge\; M_i(\pi_{i+1}) - M_i(\pi_i),$$

so any new policy that improves the surrogate $M_i$ is guaranteed not to decrease the true performance (a minorization-maximization argument).
Trust Region Policy Optimization (TRPO) Algorithm
The penalty coefficient $C$ plays the role of the step size mentioned above: our goal is to optimize the penalized objective

$$\max_{\theta}\; L_{\theta_{\text{old}}}(\theta) - C\,D_{KL}^{\max}(\theta_{\text{old}}, \theta)$$

In practice, using the value of $C$ recommended by the theory forces the step size to be very small, so we impose a constraint on the step size instead.
→ Introduce a trust region as the constraint
Now our objective is as below:

$$\max_{\theta}\; L_{\theta_{\text{old}}}(\theta) \quad \text{subject to} \quad D_{KL}^{\max}(\theta_{\text{old}}, \theta) \le \delta$$

where $\delta$ is the size of the trust region. (In practice the maximum KL over states is hard to estimate, so it is replaced by the average KL under the old policy's state distribution.)
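One simple way to enforce such a constraint is a backtracking line search on the proposed update: shrink the step until the average KL falls below $\delta$. This is only a sketch of that idea with made-up helper names (TRPO itself computes the search direction with conjugate gradient before a similar line search):

```python
import numpy as np

def mean_kl(pi_old, pi_new):
    """Average KL(pi_old || pi_new) over sampled states; rows are action distributions."""
    return np.mean(np.sum(pi_old * np.log(pi_old / pi_new), axis=1))

def backtracking_update(theta, full_step, policy_fn, pi_old, delta, max_backtracks=10):
    """Shrink the proposed step until the new policy stays inside the trust region."""
    for i in range(max_backtracks):
        theta_new = theta + (0.5 ** i) * full_step      # halve the step each time
        if mean_kl(pi_old, policy_fn(theta_new)) <= delta:
            return theta_new                            # accepted: KL constraint satisfied
    return theta                                        # give up: keep the old parameters

# Tiny usage with a softmax tabular policy over 2 states x 3 actions (toy numbers).
def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

theta0 = np.zeros((2, 3))
step = np.array([[2.0, -1.0, -1.0], [0.5, 0.5, -1.0]])   # some proposed parameter update
theta1 = backtracking_update(theta0, step, softmax_policy, softmax_policy(theta0), delta=0.01)
print(mean_kl(softmax_policy(theta0), softmax_policy(theta1)))  # stays below delta
```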
Here we encounter a problem: the objective and constraint above still contain expectations we cannot evaluate exactly, because we do not know the actual state distribution (we only have samples from the old policy). Therefore we make the following substitutions:

- We only look at states that were actually sampled by our current (old) policy and re-weight them, replacing the sum over states with an expectation over its visitation distribution:

$$\sum_s \rho_{\theta_{\text{old}}}(s)\,[\dots] \;\to\; \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\,[\dots]$$

- We replace the sum over actions with importance sampling from a sampling distribution $q$, re-weighting each sampled action by its probability ratio:

$$\sum_a \pi_{\theta}(a|s)\,A_{\theta_{\text{old}}}(s,a) \;=\; \mathbb{E}_{a \sim q}\!\left[\frac{\pi_{\theta}(a|s)}{q(a|s)}\,A_{\theta_{\text{old}}}(s,a)\right]$$

- We use the Q-values $Q_{\theta_{\text{old}}}$ instead of the advantages $A_{\theta_{\text{old}}}$, which only changes the objective by a constant.

Now, with all the substitutions, we have a new, sample-based objective to optimize:

$$\max_{\theta}\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim q}\!\left[\frac{\pi_{\theta}(a|s)}{q(a|s)}\,Q_{\theta_{\text{old}}}(s,a)\right] \quad \text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\!\left[D_{KL}\big(\pi_{\theta_{\text{old}}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\big)\right] \le \delta$$
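As an illustration of what this objective looks like on sampled data, here is a minimal numpy sketch (the variable names are mine, and it assumes the sampling distribution $q$ is simply the old policy, the common single-path choice):

```python
import numpy as np

def surrogate_and_kl(logp_new, logp_old, q_values, pi_old, pi_new):
    """Importance-sampled surrogate objective and average KL, from sampled (s, a) pairs.

    logp_new, logp_old: log pi_theta(a|s) and log pi_theta_old(a|s) for the sampled actions.
    q_values:           Q_theta_old(s, a) estimates for the same samples.
    pi_old, pi_new:     full action distributions at the sampled states (for the KL term).
    """
    ratio = np.exp(logp_new - logp_old)                  # pi_theta(a|s) / q(a|s), with q = pi_old
    surrogate = np.mean(ratio * q_values)                # sample estimate of the objective
    kl = np.mean(np.sum(pi_old * np.log(pi_old / pi_new), axis=1))  # constraint: kl <= delta
    return surrogate, kl
```

TRPO maximizes the surrogate subject to `kl <= delta`, typically with a natural-gradient step followed by a line search like the one sketched earlier.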
Algorithm

At each iteration, TRPO roughly does the following: run the current policy $\pi_{\theta_{\text{old}}}$ to collect trajectories, estimate the advantages (or Q-values) from them, then update $\theta$ by maximizing the sample-based surrogate objective above subject to the average-KL constraint, and repeat.

The TRPO algorithm automatically constrains each weight update to a trust region, which approximates the region where the first-order approximation is valid.
Common Template of Policy Gradient Algorithms

- for each iteration, gather trajectories of data by running the policy
- compute the targets (e.g. advantage estimates or rewards), which determines the tradeoff between bias and variance
- use them to estimate the policy gradient
- take a step along the gradient, aiming for monotonic improvement (see the sketch below)
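To make the template concrete, here is a minimal, runnable vanilla policy gradient (REINFORCE-style) example on a made-up one-step bandit problem; it only illustrates the four steps above, not the lecture's code:

```python
import numpy as np

# Minimal vanilla policy gradient on a toy 1-state, 3-action bandit (illustrative only).
rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 2.0, 0.5])      # expected reward of each action (made up)
theta = np.zeros(3)                           # softmax policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

alpha, n_iterations, batch_size = 0.1, 200, 32
for _ in range(n_iterations):
    pi = softmax(theta)
    # 1. Gather "trajectories" (here: one-step episodes) by running the policy.
    actions = rng.choice(3, size=batch_size, p=pi)
    rewards = true_rewards[actions] + rng.normal(scale=0.1, size=batch_size)
    # 2. Compute the targets (the raw reward, minus a baseline to reduce variance).
    targets = rewards - rewards.mean()
    # 3. Estimate the policy gradient: E[ grad log pi(a) * target ].
    grad = np.zeros(3)
    for a, t in zip(actions, targets):
        grad_logp = -pi.copy()
        grad_logp[a] += 1.0                   # d/dtheta log softmax(theta)[a]
        grad += grad_logp * t
    grad /= batch_size
    # 4. Take a step along the gradient.
    theta += alpha * grad

print(softmax(theta))   # should put most probability on the best action (index 1)
```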