Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 9
Continuing the discussion of gradient-based policy optimization, recall that our goal is to converge as quickly as possible to a local optimum.
We want our policy update to be a monotonic improvement.
→ guarantees convergence (empirically)
→ we simply don’t want to get fired...
Recall that last time we expressed the gradient of the value function as below.
This estimator is unbiased but very noisy, so we reduce the variance with:
- Temporal structure ← did last time
- Baseline
- Alternatives to using Monte Carlo returns
Baseline
Starting from the original expectation:
$$
\nabla_\theta V(\theta)=\nabla_\theta E_\tau[R]=E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log\pi(a_t|s_t,\theta)\left(\sum_{t'=t}^{T-1}r_{t'}\right)\right]
$$
We introduce a baseline $b(s_t)$ to reduce the variance:
$$
\nabla_\theta E_\tau[R]=E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log\pi(a_t|s_t,\theta)\left(\sum_{t'=t}^{T-1}r_{t'}-b(s_t)\right)\right]
$$
For any choice of $b$ that depends only on the state, the gradient estimator stays unbiased, and a good baseline lowers the variance.
Let us take only the $b(s_t)$ terms to verify that the baseline does not introduce bias (derivation).
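A worked version of that check: pull the baseline term out, condition on $s_t$, and use the fact that the score function has zero mean under the policy.

$$
\begin{aligned}
E_\tau\left[\nabla_\theta \log\pi(a_t|s_t,\theta)\,b(s_t)\right]
&= E_{s_t}\left[b(s_t)\,E_{a_t\sim\pi(\cdot|s_t,\theta)}\left[\nabla_\theta \log\pi(a_t|s_t,\theta)\right]\right]\\
&= E_{s_t}\left[b(s_t)\sum_{a}\pi(a|s_t,\theta)\,\frac{\nabla_\theta\pi(a|s_t,\theta)}{\pi(a|s_t,\theta)}\right]
= E_{s_t}\left[b(s_t)\,\nabla_\theta\underbrace{\sum_{a}\pi(a|s_t,\theta)}_{=\,1}\right]=0
\end{aligned}
$$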
‘Vanilla’ Policy Gradient Algorithm
The basic structure is very similar to that of Monte-Carlo policy evaluation.
We use the state-value function $V^{\pi_i}$ as our baseline, because then the weighting term takes the advantage-function form with the state-action value function $Q^{\pi,\gamma}$: $A^{\pi,\gamma}(s,a)=Q^{\pi,\gamma}(s,a)-V^{\pi,\gamma}(s)$.
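A rough Python sketch of this loop (not the lecture's pseudocode; the `env.reset`/`env.step`, `policy.sample`/`policy.grad_log_prob`, and `baseline.value`/`baseline.fit` interfaces are assumptions for illustration):

```python
import numpy as np

def run_episode(env, policy, theta):
    """Roll out one trajectory with the current parameters theta."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = policy.sample(s, theta)            # a ~ pi(.|s, theta)   (assumed interface)
        s_next, r, done = env.step(a)          # assumed to return (next state, reward, done)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    return states, actions, rewards

def vanilla_policy_gradient(env, policy, baseline, theta, alpha=1e-2, num_iters=100):
    """'Vanilla' policy gradient with a state-value baseline (sketch)."""
    for _ in range(num_iters):
        states, actions, rewards = run_episode(env, policy, theta)
        T = len(rewards)
        # Reward-to-go G_t = sum_{t' >= t} r_{t'}  (undiscounted here for simplicity)
        returns = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            running += rewards[t]
            returns[t] = running
        # Policy gradient estimate with advantage A_t ≈ G_t - V(s_t)
        grad = np.zeros_like(theta)
        for t in range(T):
            advantage = returns[t] - baseline.value(states[t])
            grad += policy.grad_log_prob(states[t], actions[t], theta) * advantage
        baseline.fit(states, returns)          # re-fit V toward the Monte-Carlo returns
        theta = theta + alpha * grad           # gradient *ascent* on V(theta)
    return theta
```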
Alternatives to using Monte-Carlo returns as targets
Policy Gradient Formulas with Value Functions
The equation above computes the target as a Monte-Carlo return; the one below uses a Q-function estimate instead, which is biased but lower in variance.
So this leads to:
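One way to write the resulting estimator, with a learned critic $Q(s_t,a_t;\mathbf{w})$ standing in for the sampled return:

$$
\nabla_\theta E_\tau[R]\approx E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log\pi(a_t|s_t,\theta)\,Q(s_t,a_t;\mathbf{w})\right]
$$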
Choosing the Target
The return $R_t=\sum_{t'=t}^{T-1} r_{t'}$ along a single trajectory is an unbiased but high-variance estimate.
- we want to reduce variance by bootstrapping and function approximation!
- “critic” estimates V or Q,
- actor-critic methods maintain an explicit representation of both the policy and the value function, and update both
We blend TD and MC estimators for the target → N-step estimator.
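A small sketch of how an N-step target blends the two, assuming a critic whose estimates $V(s_k)$ are available in `values` (hypothetical helper, not from the lecture):

```python
def n_step_target(rewards, values, t, n, gamma=0.99):
    """
    N-step return target starting at time t:
        R_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * V(s_{t+n})
    n = 1 gives the TD(0) target; letting n run to the end of the episode
    (with no bootstrap) recovers the Monte-Carlo return.
    `values[k]` is the critic's estimate of V(s_k) (assumed available).
    """
    T = len(rewards)
    n = min(n, T - t)                 # truncate at episode end
    target = 0.0
    for k in range(n):
        target += (gamma ** k) * rewards[t + k]
    if t + n < T:                     # bootstrap with the critic unless the episode ended
        target += (gamma ** n) * values[t + n]
    return target
```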
Updating Parameters Given the Gradient
Step Size
Step size is especially important in RL, since it determines $\pi$ and therefore the data we collect to learn from.
There is a simple step-sizing method: line search in the direction of the gradient.
→ simple but expensive and naive
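A naive sketch of that line search, assuming a hypothetical `estimate_return(theta)` that rolls out the policy and averages returns; every candidate step size costs fresh data collection, which is exactly why this is expensive:

```python
def line_search_step(theta, grad, estimate_return,
                     step_sizes=(1e-3, 1e-2, 1e-1, 1.0)):
    """Try several step sizes along the gradient direction; keep the best.
    `estimate_return(theta)` is an assumed helper that rolls out the policy
    with parameters theta and averages the observed returns."""
    best_theta, best_value = theta, estimate_return(theta)
    for eta in step_sizes:
        candidate = theta + eta * grad
        value = estimate_return(candidate)
        if value > best_value:
            best_theta, best_value = candidate, value
    return best_theta
```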
Objective Function
Our goal is to find the policy parameters that maximize the value function:
$$
V(\theta)=E_{\pi_\theta}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,a_t);\pi_\theta\right]
$$
We want a new policy with greater $V$, while we only have data from previous policies.
We can express the updated return $V(\tilde{\theta})$ of the new policy parameterized by $\tilde{\theta}$:
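One standard way to write it, with $\mu_{\tilde{\pi}}(s)$ the (discounted) state-visitation distribution of the new policy:

$$
V(\tilde{\theta}) = V(\theta) + E_{\pi_{\tilde{\theta}}}\left[\sum_{t=0}^{\infty}\gamma^t A_{\pi_\theta}(s_t,a_t)\right]
= V(\theta) + \sum_{s}\mu_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)\,A_{\pi_\theta}(s,a)
$$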
Local Approximation
We cannot compute the equation above because we do not yet know $\tilde{\pi}$ (and hence its state distribution). So, as an approximation, we replace that term with the previous policy's.
We take policy $\pi^i$, roll it out to collect trajectories $D$, and use them to estimate the state distribution $\mu$ → use this to compute the update toward $\pi^{i+1}$.
Purely for the sake of computation, we plug $\mu_\pi(s)$ in place of $\mu_{\tilde{\pi}}(s)$.
so we “just say” it’s an objective function and something that can be optimized.
If you evaluate the function under the same policy, you just get the same value.
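Written out, the local approximation produced by that substitution is:

$$
L_{\pi}(\tilde{\pi}) = V(\theta) + \sum_{s}\mu_{\pi}(s)\sum_{a}\tilde{\pi}(a|s)\,A_{\pi}(s,a)
$$

At $\tilde{\pi}=\pi$ the inner sum $\sum_a\pi(a|s)A_{\pi}(s,a)$ is zero, so $L_\pi(\pi)=V(\theta)$: evaluating under the same policy returns the same value.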
Conservative Policy Iteration
Conservative policy iteration defines the new policy as a mixture, $\pi_{new}(a|s)=(1-\alpha)\,\pi_{old}(a|s)+\alpha\,\pi'(a|s)$. Again, if $\alpha = 0$, then $\pi_{new}=\pi_{old}$ and so $V^{\pi_{new}}=L_{\pi_{old}}(\pi_{new})=L_{\pi_{old}}(\pi_{old})=V^{\pi_{old}}$.
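The lower bound usually cited for this mixture update (Kakade & Langford, 2002) penalizes the surrogate by a term quadratic in $\alpha$:

$$
V^{\pi_{new}} \ge L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2,
\qquad \epsilon = \max_{s}\left|E_{a\sim\pi'(\cdot|s)}\left[A_{\pi_{old}}(s,a)\right]\right|
$$

Keeping $\alpha$ small keeps the penalty small, which is what makes the improvement "conservative".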