Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 9
Continuing the discussion of gradient-based policy search: recall that our goal is to converge as quickly as possible to a local optimum.
We also want each policy update to be a monotonic improvement.

→ it guarantees convergence (empirically)
→ and in real, fielded applications a bad policy update is costly: we simply don’t want to get fired...
Recall that last time we expressed the gradient of the value function as a likelihood-ratio ("score function") estimator over sampled trajectories.

This estimator is unbiased but very noisy, so we reduce its variance with:
- Temporal structure ← did last time
- Baseline
- Alternatives to using Monte-Carlo returns
Baseline
For the original expectation:
$$
\nabla_\theta V(\theta)=\nabla_\theta E_\tau[R]=E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log\pi(a_t|s_t,\theta)\left(\sum_{t'=t}^{T-1}r_{t'}\right)\right]
$$
we introduce a baseline $b(s_t)$ to reduce variance:
$$
\nabla_\theta E_\tau[R]=E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log\pi(a_t|s_t,\theta)\left(\sum_{t'=t}^{T-1}r_{t'}-b(s_t)\right)\right]
$$
For any choice of baseline $b(s_t)$ that depends only on the state (not on the action), the gradient estimator remains unbiased.
To see why, let us take only the baseline term and check that its expectation is zero.
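A minimal sketch of that check (the standard likelihood-ratio argument, shown for a single timestep and a discrete action space): since $b(s_t)$ does not depend on the action,

$$
E_{a_t\sim\pi(\cdot|s_t,\theta)}\big[\nabla_\theta \log\pi(a_t|s_t,\theta)\,b(s_t)\big]
= b(s_t)\sum_{a}\pi(a|s_t,\theta)\,\frac{\nabla_\theta \pi(a|s_t,\theta)}{\pi(a|s_t,\theta)}
= b(s_t)\,\nabla_\theta\sum_{a}\pi(a|s_t,\theta)
= b(s_t)\,\nabla_\theta 1 = 0
$$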

‘Vanilla’ Policy Gradient Algorithm

The basic structure is very similar to that of Monte-Carlo policy evaluation: roll out trajectories with the current policy, compute returns, and update.
We use an estimate of the state-value function as the baseline, refit from the observed returns (a sketch is given below).
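Below is a minimal, self-contained sketch of this loop on a toy corridor MDP; the environment, the tabular softmax policy, and every name and hyperparameter here (`step`, `theta`, `baseline`, `alpha_theta`, ...) are illustrative choices of mine, not from the lecture.

```python
import numpy as np

# Toy episodic MDP (illustrative): a 1-D corridor of n_states.
# Action 1 moves right, action 0 moves left; reward +1 on reaching the rightmost state.
n_states, n_actions, gamma, max_steps = 5, 2, 0.99, 50
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

theta = np.zeros((n_states, n_actions))   # policy logits (tabular softmax policy)
baseline = np.zeros(n_states)             # b(s): running estimate of the return from s
alpha_theta, alpha_b = 0.1, 0.1           # step sizes for policy and baseline

def pi(s):
    p = np.exp(theta[s] - theta[s].max()) # softmax over actions in state s
    return p / p.sum()

for episode in range(2000):
    # 1) Roll out one trajectory with the current policy.
    s, trajectory = 0, []
    for _ in range(max_steps):
        a = rng.choice(n_actions, p=pi(s))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break

    # 2) Walk the trajectory backwards: compute returns-to-go G_t,
    #    take a policy-gradient step weighted by (G_t - b(s_t)), and refit the baseline.
    G = 0.0
    for s_t, a_t, r_t in reversed(trajectory):
        G = r_t + gamma * G
        advantage = G - baseline[s_t]
        grad_log_pi = -pi(s_t)                                # d log pi(a_t|s_t) / d logits = onehot(a_t) - pi(.|s_t)
        grad_log_pi[a_t] += 1.0
        theta[s_t] += alpha_theta * advantage * grad_log_pi   # REINFORCE-with-baseline update
        baseline[s_t] += alpha_b * (G - baseline[s_t])        # move b(s_t) toward the observed return
```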
Alternatives to using Monte-Carlo returns as targets
Policy Gradient Formulas with Value Functions

One form computes the return as the Monte-Carlo return; the other replaces it with (an estimate of) the Q-function, which is biased but lower in variance.
Both forms are written out below; this raises the question of which target to use.
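For reference, the two forms side by side, where the hat and subscript $w$ are my notation for a learned critic estimate:

$$
\nabla_\theta E_\tau[R] = E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta\log\pi(a_t|s_t,\theta)\left(\sum_{t'=t}^{T-1}r_{t'}\right)\right]
\;\approx\; E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta\log\pi(a_t|s_t,\theta)\,\hat Q_w(s_t,a_t)\right]
$$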

Choosing the Target
The return $R_t=\sum_{t'=t}^{T-1} r_{t'}$ along a sampled trajectory is an unbiased but high-variance estimate.
- we want to reduce variance by bootstrapping and function approximation!
- a “critic” estimates V or Q
- actor-critic methods maintain an explicit representation of both the policy and the value function, and update both (a one-step sketch follows this list)
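A minimal one-step actor-critic sketch on the same illustrative corridor MDP as before (again, the environment, names, and hyperparameters are mine, not the lecture's); the critic's TD error serves as a biased but low-variance advantage estimate for the actor.

```python
import numpy as np

# Same toy corridor MDP as in the earlier sketch (illustrative).
n_states, n_actions, gamma, max_steps = 5, 2, 0.99, 50
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

theta = np.zeros((n_states, n_actions))  # actor: policy logits
V = np.zeros(n_states)                   # critic: state-value estimates
alpha_actor, alpha_critic = 0.1, 0.2

def pi(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

for episode in range(2000):
    s = 0
    for _ in range(max_steps):
        a = rng.choice(n_actions, p=pi(s))
        s_next, r, done = step(s, a)

        # Critic: one-step TD target bootstraps from V(s') instead of using the full MC return.
        td_target = r + (0.0 if done else gamma * V[s_next])
        td_error = td_target - V[s]          # biased, low-variance advantage estimate
        V[s] += alpha_critic * td_error

        # Actor: score-function update weighted by the TD error.
        grad_log_pi = -pi(s)
        grad_log_pi[a] += 1.0
        theta[s] += alpha_actor * td_error * grad_log_pi

        s = s_next
        if done:
            break
```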
We can also blend TD and MC estimators to form the target, e.g., with N-step returns (written out below).
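For reference, the standard N-step return targets that interpolate between the two, with discount factor $\gamma$ and $\hat V$ the critic's value estimate: $N=1$ recovers the TD target, $N\to\infty$ the Monte-Carlo return, and larger $N$ trades lower bias for higher variance.

$$
\begin{aligned}
\hat R_t^{(1)} &= r_t + \gamma \hat V(s_{t+1}) \\
\hat R_t^{(N)} &= r_t + \gamma r_{t+1} + \cdots + \gamma^{N-1} r_{t+N-1} + \gamma^{N}\hat V(s_{t+N}) \\
\hat R_t^{(\infty)} &= r_t + \gamma r_{t+1} + \cdots + \gamma^{T-1-t} r_{T-1}
\end{aligned}
$$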

Updating Parameters Given the Gradient
Step Size
Step size is especially important in RL since it determines the next policy, and the next policy determines the data we collect next; a bad step can be very hard to recover from.
A simple step-sizing method is line search in the direction of the gradient:
→ simple, but expensive (each candidate step must be evaluated with fresh rollouts) and naive
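A minimal sketch of that line search, assuming a hypothetical `evaluate_policy(theta)` that runs rollouts and returns an average-return estimate; this helper and all names here are illustrative, not from the lecture.

```python
import numpy as np

def line_search(theta, grad, evaluate_policy, step_sizes=(1.0, 0.5, 0.25, 0.1, 0.05)):
    """Try a few step sizes along the gradient direction and keep the best one.

    Expensive: every candidate requires a fresh (and noisy) policy evaluation,
    i.e. a new batch of rollouts.
    """
    best_theta, best_value = theta, evaluate_policy(theta)
    for alpha in step_sizes:
        candidate = theta + alpha * grad        # step in the gradient direction
        value = evaluate_policy(candidate)      # new rollouts for each candidate
        if value > best_value:
            best_theta, best_value = candidate, value
    return best_theta, best_value
```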
Objective Function
Our goal is to find the policy parameters that maximize the value function.
In particular, we want each new policy to achieve a greater value than the current one.
We can express the return of the updated policy in terms of the current policy's advantages (the identity below).
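For reference, the identity being used (standard in the conservative policy iteration / TRPO line of work), writing $\tilde\pi$ for the new policy and $A_\pi$ for the advantage under the current policy $\pi$:

$$
V(\tilde\pi) = V(\pi) + E_{\tau\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^{t} A_{\pi}(s_t,a_t)\right]
$$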

Local Approximation
We cannot compute the equation above directly, because we have no clue what the state distribution under the new policy $\tilde\pi$ looks like (we have not run that policy yet).

So, purely for the sake of computation, we take the state distribution of the current policy $\pi$ in place of the one under the new policy (the local approximation written out below).
So we “just say” that this surrogate is an objective function, something that can be optimized.
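For reference, the resulting surrogate (the local approximation from the conservative policy iteration / TRPO analysis), where $\mu_\pi$ denotes the discounted state-visitation distribution of the current policy $\pi$ and $A_\pi$ its advantage function:

$$
L_{\pi}(\tilde\pi) = V(\pi) + \sum_{s}\mu_{\pi}(s)\sum_{a}\tilde\pi(a|s)\,A_{\pi}(s,a)
$$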
If you evaluate this surrogate under the same (current) policy, you just get the same value back, $L_\pi(\pi)=V(\pi)$, and its gradient matches the true gradient at that point.
Conservative Policy Iteration

Again, if the surrogate is evaluated at the old policy itself the two objectives coincide; more importantly, the lower bound below shows that when the mixing weight $\alpha$ is small, the $\alpha^2$ penalty term is negligible compared to the surrogate improvement, so improving the surrogate improves the true objective as well: exactly the monotonic improvement we asked for at the start.
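For reference, the conservative (mixture) update and the accompanying lower bound (Kakade & Langford's result, as presented in the TRPO paper), where $\epsilon=\max_s\left|E_{a\sim\pi'}[A_{\pi_{\text{old}}}(s,a)]\right|$:

$$
\pi_{\text{new}}(a|s) = (1-\alpha)\,\pi_{\text{old}}(a|s) + \alpha\,\pi'(a|s),
\qquad
V(\pi_{\text{new}}) \;\ge\; L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^{2}}\,\alpha^{2}
$$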