Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 9
Continuing the discussion of gradient-based policy optimization, recall that our goal is to converge as quickly as possible to a local optimum.
We want our policy update to be a monotonic improvement.
→ guarantees convergence (empirically)
→ we simply don’t want to get fired...
Recall that last time we expressed the gradient of the value function as below.
This estimator is unbiased but very noisy, so we reduce the variance with:
- Temporal structure ← did last time
- Baseline
- Alternatives to using Monte Carlo returns
Baseline
Starting from the original expectation:
$$
\nabla_\theta V(\theta)=\nabla_\theta E_\tau[R]=E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log\pi(a_t|s_t,\theta)\left(\sum_{t'=t}^{T-1}r_{t'}\right)\right]
$$
We introduce a baseline $b(s_t)$ to reduce the variance:
$$
\nabla_\theta E_\tau[R]=E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log\pi(a_t|s_t,\theta)\left(\sum_{t'=t}^{T-1}r_{t'}-b(s_t)\right)\right]
$$
For any choice of $b$ that depends only on the state, the gradient estimator stays unbiased, and a good baseline lowers the variance.
Let us take only the $b(s_t)$ terms to verify that the baseline does not introduce bias (derivation).
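A worked version of that check: pull the baseline term out, condition on $s_t$, and use the fact that the score function has zero mean under the policy.

$$
\begin{aligned}
E_\tau\left[\nabla_\theta \log\pi(a_t|s_t,\theta)\,b(s_t)\right]
&= E_{s_t}\left[b(s_t)\,E_{a_t\sim\pi(\cdot|s_t,\theta)}\left[\nabla_\theta \log\pi(a_t|s_t,\theta)\right]\right]\\
&= E_{s_t}\left[b(s_t)\sum_{a}\pi(a|s_t,\theta)\,\frac{\nabla_\theta\pi(a|s_t,\theta)}{\pi(a|s_t,\theta)}\right]
= E_{s_t}\left[b(s_t)\,\nabla_\theta\underbrace{\sum_{a}\pi(a|s_t,\theta)}_{=\,1}\right]=0
\end{aligned}
$$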
‘Vanilla’ Policy Gradient Algorithm
The basic structure is very similar to that of Monte-Carlo policy evaluation.
We use the state-value function $V^{\pi_i}$ as our baseline, because then the weighting term takes the advantage-function form with the state-action value function $Q^{\pi,\gamma}$: $A^{\pi,\gamma}(s,a)=Q^{\pi,\gamma}(s,a)-V^{\pi,\gamma}(s)$.
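A rough Python sketch of this loop (not the lecture's pseudocode; the `env.reset`/`env.step`, `policy.sample`/`policy.grad_log_prob`, and `baseline.value`/`baseline.fit` interfaces are assumptions for illustration):

```python
import numpy as np

def run_episode(env, policy, theta):
    """Roll out one trajectory with the current parameters theta."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = policy.sample(s, theta)            # a ~ pi(.|s, theta)   (assumed interface)
        s_next, r, done = env.step(a)          # assumed to return (next state, reward, done)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    return states, actions, rewards

def vanilla_policy_gradient(env, policy, baseline, theta, alpha=1e-2, num_iters=100):
    """'Vanilla' policy gradient with a state-value baseline (sketch)."""
    for _ in range(num_iters):
        states, actions, rewards = run_episode(env, policy, theta)
        T = len(rewards)
        # Reward-to-go G_t = sum_{t' >= t} r_{t'}  (undiscounted here for simplicity)
        returns = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            running += rewards[t]
            returns[t] = running
        # Policy gradient estimate with advantage A_t ≈ G_t - V(s_t)
        grad = np.zeros_like(theta)
        for t in range(T):
            advantage = returns[t] - baseline.value(states[t])
            grad += policy.grad_log_prob(states[t], actions[t], theta) * advantage
        baseline.fit(states, returns)          # re-fit V toward the Monte-Carlo returns
        theta = theta + alpha * grad           # gradient *ascent* on V(theta)
    return theta
```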
Alternatives to using Monte-Carlo returns as targets
Policy Gradient Formulas with Value Functions
The equation above computes the target as a Monte-Carlo return; the one below uses a Q-function estimate instead, which is biased but lower in variance.
So this leads to:
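One way to write the resulting estimator, with a learned critic $Q(s_t,a_t;\mathbf{w})$ standing in for the sampled return:

$$
\nabla_\theta E_\tau[R]\approx E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log\pi(a_t|s_t,\theta)\,Q(s_t,a_t;\mathbf{w})\right]
$$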
Choosing the Target
The return $R_t=\sum_{t'=t}^{T-1} r_{t'}$ along a single trajectory is an unbiased but high-variance estimate.
- we want to reduce variance by bootstrapping and function approximation!
- “critic” estimates V or Q,
- actor-critic methods maintain an explicit representation of both the policy and the value function, and update both
We blend TD and MC estimators for the target → N-step estimator.
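A small sketch of how an N-step target blends the two, assuming a critic whose estimates $V(s_k)$ are available in `values` (hypothetical helper, not from the lecture):

```python
def n_step_target(rewards, values, t, n, gamma=0.99):
    """
    N-step return target starting at time t:
        R_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * V(s_{t+n})
    n = 1 gives the TD(0) target; letting n run to the end of the episode
    (with no bootstrap) recovers the Monte-Carlo return.
    `values[k]` is the critic's estimate of V(s_k) (assumed available).
    """
    T = len(rewards)
    n = min(n, T - t)                 # truncate at episode end
    target = 0.0
    for k in range(n):
        target += (gamma ** k) * rewards[t + k]
    if t + n < T:                     # bootstrap with the critic unless the episode ended
        target += (gamma ** n) * values[t + n]
    return target
```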
Updating Parameters Given the Gradient
Step Size
Step size is especially important in RL, since it determines $\pi$ and therefore the data we collect to learn from.
There is a simple step-sizing method: line search in the direction of the gradient.
→ simple but expensive and naive
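A naive sketch of that line search, assuming a hypothetical `estimate_return(theta)` that rolls out the policy and averages returns; every candidate step size costs fresh data collection, which is exactly why this is expensive:

```python
def line_search_step(theta, grad, estimate_return,
                     step_sizes=(1e-3, 1e-2, 1e-1, 1.0)):
    """Try several step sizes along the gradient direction; keep the best.
    `estimate_return(theta)` is an assumed helper that rolls out the policy
    with parameters theta and averages the observed returns."""
    best_theta, best_value = theta, estimate_return(theta)
    for eta in step_sizes:
        candidate = theta + eta * grad
        value = estimate_return(candidate)
        if value > best_value:
            best_theta, best_value = candidate, value
    return best_theta
```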
Objective Function
Our goal is to find the policy parameters that maximize the value function:
$$
V(\theta)=E_{\pi_\theta}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,a_t);\pi_\theta\right]
$$
We want a new policy with greater $V$, while we only have data from previous policies.
We can express the updated return $V(\tilde{\theta})$ of the new policy parameterized by $\tilde{\theta}$:
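One standard way to write it, with $\mu_{\tilde{\pi}}(s)$ the (discounted) state-visitation distribution of the new policy:

$$
V(\tilde{\theta}) = V(\theta) + E_{\pi_{\tilde{\theta}}}\left[\sum_{t=0}^{\infty}\gamma^t A_{\pi_\theta}(s_t,a_t)\right]
= V(\theta) + \sum_{s}\mu_{\tilde{\pi}}(s)\sum_{a}\tilde{\pi}(a|s)\,A_{\pi_\theta}(s,a)
$$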
Local Approximation
We cannot compute the equation above because we do not yet know $\tilde{\pi}$ (and hence its state distribution). So, as an approximation, we replace that term with the previous policy's.
We take policy $\pi^i$, roll it out to collect trajectories $D$, and use them to estimate the state distribution $\mu$ → use this to compute the update toward $\pi^{i+1}$.
Purely for the sake of computation, we plug $\mu_\pi(s)$ in place of $\mu_{\tilde{\pi}}(s)$.
so we “just say” it’s an objective function and something that can be optimized.
If you evaluate the function under the same policy, you just get the same value.
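Written out, the local approximation produced by that substitution is:

$$
L_{\pi}(\tilde{\pi}) = V(\theta) + \sum_{s}\mu_{\pi}(s)\sum_{a}\tilde{\pi}(a|s)\,A_{\pi}(s,a)
$$

At $\tilde{\pi}=\pi$ the inner sum $\sum_a\pi(a|s)A_{\pi}(s,a)$ is zero, so $L_\pi(\pi)=V(\theta)$: evaluating under the same policy returns the same value.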
Conservative Policy Iteration
Conservative policy iteration defines the new policy as a mixture, $\pi_{new}(a|s)=(1-\alpha)\,\pi_{old}(a|s)+\alpha\,\pi'(a|s)$. Again, if $\alpha = 0$, then $\pi_{new}=\pi_{old}$ and so $V^{\pi_{new}}=L_{\pi_{old}}(\pi_{new})=L_{\pi_{old}}(\pi_{old})=V^{\pi_{old}}$.
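The lower bound usually cited for this mixture update (Kakade & Langford, 2002) penalizes the surrogate by a term quadratic in $\alpha$:

$$
V^{\pi_{new}} \ge L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2,
\qquad \epsilon = \max_{s}\left|E_{a\sim\pi'(\cdot|s)}\left[A_{\pi_{old}}(s,a)\right]\right|
$$

Keeping $\alpha$ small keeps the penalty small, which is what makes the improvement "conservative".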