
Stanford CS234 Lecture 9

by 누워있는말티즈 2022. 8. 11.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 9

Continuing the discussion of policy gradient methods, recall that our goal is to converge as quickly as possible to a local optimum.

We want our policy update to be a monotonic improvement.

→ guarantees convergence (at least empirically)

→ we simply don’t want to get fired...

Recall that last time we expressed the gradient of the value function as below.
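In score-function (likelihood-ratio) form, a sketch of that expression is:

$$
\nabla_\theta V(\theta)=\nabla_\theta E_\tau[R(\tau)]=E_\tau\left[R(\tau)\sum^{T-1}_{t=0}\nabla_\theta \log\pi(a_t|s_t,\theta)\right]
$$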


This estimator is unbiased but very noisy, so we reduce the variance with:

  • Temporal structure $\leftarrow$ did last time
  • Baseline
  • Alternatives to using Monte Carlo returns

Baseline

For the original expectation,

$$
\nabla_\theta V(\theta)=\nabla_\theta E_\tau[R]=E_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta \log\pi(a_t|s_t, \theta)\left(\sum^{T-1}_{t'=t}r_{t'}\right)\right]
$$

we introduce a baseline $b(s_t)$ to reduce variance:

$$
\nabla_\theta E_\tau[R]=E_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta \log\pi(a_t|s_t, \theta)\left(\sum^{T-1}_{t'=t}r_{t'}-b(s_t)\right)\right]
$$

For any choice of $b$, the gradient estimator remains unbiased, and a good choice of $b$ lowers the variance.

Let us take only the $b(s_t)$ terms to verify that the baseline does not introduce bias (derivation).
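A quick sketch of that derivation: conditioning on $s_t$, the baseline term vanishes in expectation because the expected score is zero.

$$
E_{a_t\sim\pi(\cdot|s_t)}\left[\nabla_\theta \log\pi(a_t|s_t,\theta)\,b(s_t)\right]
= b(s_t)\sum_{a}\pi(a|s_t,\theta)\frac{\nabla_\theta\pi(a|s_t,\theta)}{\pi(a|s_t,\theta)}
= b(s_t)\,\nabla_\theta\sum_{a}\pi(a|s_t,\theta)
= b(s_t)\,\nabla_\theta 1 = 0
$$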


‘Vanilla’ Policy Gradient Algorithm


The basic structure is very similar to that of Monte-Carlo policy evaluation.

We use the state-value function $V^{\pi_i}$ as our baseline, because then the weight on the score function takes the advantage-function form with the state-action value function $Q^{\pi,\gamma}$: $A^{\pi,\gamma}(s,a)=Q^{\pi,\gamma}(s,a)-V^{\pi,\gamma}(s)$.
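A minimal NumPy sketch of this loop, assuming a softmax policy over linear features and an environment whose `reset()` returns a state vector and whose `step(a)` returns `(next_state, reward, done)`; these interfaces and the learning rates are illustrative assumptions, not from the lecture:

```python
import numpy as np

def vanilla_policy_gradient(env, n_features, n_actions,
                            n_iters=100, n_episodes=10,
                            gamma=0.99, lr=0.01):
    """'Vanilla' policy gradient with a state-value baseline (sketch)."""
    theta = np.zeros((n_actions, n_features))   # softmax policy parameters
    w = np.zeros(n_features)                    # baseline V(s) ~ w . phi(s)

    def phi(s):                                 # assumed feature map: state = feature vector
        return np.asarray(s, dtype=float)

    def policy(s):
        logits = theta @ phi(s)
        p = np.exp(logits - logits.max())
        return p / p.sum()

    for _ in range(n_iters):
        grad = np.zeros_like(theta)
        for _ in range(n_episodes):
            # 1. roll out one trajectory with the current policy
            states, actions, rewards = [], [], []
            s, done = env.reset(), False
            while not done:
                p = policy(s)
                a = np.random.choice(n_actions, p=p)
                s_next, r, done = env.step(a)
                states.append(s); actions.append(a); rewards.append(r)
                s = s_next

            # 2. discounted returns-to-go G_t
            G, returns = 0.0, []
            for r in reversed(rewards):
                G = r + gamma * G
                returns.append(G)
            returns.reverse()

            # 3. accumulate grad log pi(a_t|s_t) * (G_t - b(s_t)), refit baseline
            for s_t, a_t, G_t in zip(states, actions, returns):
                f = phi(s_t)
                adv = G_t - w @ f                  # advantage estimate
                p = policy(s_t)
                grad_logp = -np.outer(p, f)        # d log pi / d theta for softmax
                grad_logp[a_t] += f
                grad += grad_logp * adv
                w += 1e-3 * adv * f                # baseline regression step

        theta += lr * grad / n_episodes            # gradient ascent step
    return theta
```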

Alternatives to using Monte-Carlo returns as targets

Policy Gradient Formulas with Value Functions


The equation above computes the return as a Monte-Carlo return; the one below replaces it with a Q-function estimate, which is biased but lower in variance.

so it leads to,
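Using the advantage function $A^{\pi,\gamma}$ defined above, one common form of this estimator is:

$$
\nabla_\theta E_\tau[R]\approx E_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta \log\pi(a_t|s_t,\theta)\,A^{\pi,\gamma}(s_t,a_t)\right]
$$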


Choosing the Target

The reward-to-go $R_t=\sum^{T-1}_{t'=t} r_{t'}$ along a trajectory is an unbiased but high-variance estimate of the target.

  • we want to reduce variance by bootstrapping and function approximation!
  • “critic” estimates V or Q,
  • actor-critic methods maintain an explicit representation of both the policy and the value function, and update both

we blend TD and MC estimators for the target $\rightarrow$ N-step estimator
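Concretely, a sketch of the blended (N-step) target with a learned value estimate $\hat{V}$:

$$
\hat{R}^{(n)}_t = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n}\hat{V}(s_{t+n})
$$

With $n=1$ this is the TD target (most bias, least variance); letting $n$ run to the end of the episode with no bootstrap recovers the Monte-Carlo return (no bias, most variance).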

Updating Parameters Given the Gradient

Step Size

Step size is especially important in RL, since it determines $\pi$ and therefore the data we collect to learn from.

There is a simple step-sizing method: line search in the direction of the gradient.

→ simple but expensive and naive
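As a rough sketch of that idea, assuming a hypothetical `rollout_return(theta)` helper that estimates the return of the policy with parameters `theta` by running fresh episodes:

```python
import numpy as np

def line_search_step(theta, grad, rollout_return,
                     step_sizes=(1.0, 0.5, 0.25, 0.1, 0.05)):
    """Pick the step size along the gradient direction that gives the best
    empirical return. Simple, but each candidate costs new rollouts."""
    best_theta, best_ret = theta, rollout_return(theta)
    for alpha in step_sizes:
        candidate = theta + alpha * grad          # move along the gradient
        ret = rollout_return(candidate)           # expensive: run the policy
        if ret > best_ret:
            best_theta, best_ret = candidate, ret
    return best_theta
```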

Objective Function

Our goal is to find the policy parameter that maximizes the value function:

$$
V(\theta)=E_{\pi_\theta}\left[\sum^\infty_{t=0}\gamma^tR(s_t,a_t);\pi_\theta\right]
$$

We want a new policy with greater $V$ while we only have data from previous policies.

We can express the return $V(\tilde{\theta})$ of the new policy, parameterized by $\tilde{\theta}$, in terms of the old policy's advantage function.
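A sketch of that standard identity, with the expectation taken over trajectories from the new policy:

$$
V(\tilde{\theta}) = V(\theta) + E_{\pi_{\tilde{\theta}}}\left[\sum^{\infty}_{t=0}\gamma^t A_{\pi_\theta}(s_t,a_t)\right]
$$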

Local Approximation

We cannot calculate the equation above because we have no samples from $\tilde{\pi}$ yet. So, as an approximation, we replace the state distribution with that of the previous policy.
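A sketch of the resulting local approximation, written with the old policy's discounted state visitation distribution $\mu_\pi$:

$$
L_\pi(\tilde{\pi}) = V(\theta) + \sum_s \mu_\pi(s)\sum_a \tilde{\pi}(a|s)\,A_\pi(s,a)
$$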


We take policy $\pi^i$, run it out to collect $D$ trajectories, and use them to estimate the distribution $\mu$ → use it to compute $\pi^{i+1}$.

We substitute $\mu_\pi(s)$ in place of $\mu_{\tilde{\pi}}(s)$ purely to make the computation possible.

So we "just say" this is an objective function, something that can be optimized.

If you evaluate it under the same policy, you just get the same value back: $L_\pi(\pi)=V^\pi$.

Conservative Policy Iteration
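Conservative policy iteration updates to a mixture of the old policy and an improved policy $\pi'$; in the Kakade & Langford form (sketched here), this gives a lower bound on the new policy's value:

$$
\pi_{new}(a|s) = (1-\alpha)\,\pi_{old}(a|s) + \alpha\,\pi'(a|s)
$$

$$
V^{\pi_{new}} \ge L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2,
\qquad \epsilon = \max_s\big|E_{a\sim\pi'(\cdot|s)}[A_{\pi_{old}}(s,a)]\big|
$$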


Again, if $\alpha = 0$, then $\pi_{new}=\pi_{old}$, and so $V^{\pi_{new}}=L_{\pi_{old}}(\pi_{new})=L_{\pi_{old}}(\pi_{old})=V^{\pi_{old}}$.
