
Stanford CS234 Lecture 8

by 누워있는말티즈 2022. 8. 11.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 8

Policy-Based Reinforcement Learning

Recall the last lecture → we learned to approximate the state-value ($V$) or state-action value ($Q$) function with parameters $w$ (or $\theta$), and used those estimates to build a (hopefully) optimal policy ($\pi$)

Today we’ll parameterize the policy itself and work with $\pi_\theta$ directly

$$
\pi_\theta(s,a)=P[a|s;\theta]
$$

our goal is to find a policy with maximal value function $V^\pi$

Benefits and Demerits

| Advantages | Disadvantages |
| --- | --- |
| better convergence properties | converges to a local optimum rather than the global one |
| works efficiently in high-dimensional or continuous action spaces | evaluating a policy is typically inefficient and high-variance |
| can learn stochastic policies | |

Stochastic policies are needed in aliased environments

  • Value-based RL would take features (a combination of the action and observations such as “is there a wall?”) and estimate action values

$$
Q_\theta(s,a)=f(\phi(s,a),\theta)
$$

  • Policy-based RL would take those same features and map them directly to action probabilities

$$
\pi_\theta(s,a)=g(\phi(s,a),\theta)
$$

if the policy were deterministic, the agent would get stuck in one of the aliased states (A or B) either way, whereas a stochastic policy keeps a non-zero probability of moving in both directions (a small sketch of such a policy follows the figures below).

[Figure: deterministic policy — the agent gets stuck in one of the aliased states]

[Figure: stochastic policy — both directions keep non-zero probability]
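
To make the $\pi_\theta(s,a)=g(\phi(s,a),\theta)$ idea concrete, here is a minimal sketch (my own illustration, not from the slides) of a softmax policy over linear feature scores: even when two states share the same features, the policy keeps a probability over both actions instead of committing to one. The feature map `phi`, the state names, and the action names are made-up assumptions.

```python
import numpy as np

def softmax_policy(phi, theta, state, actions):
    """pi_theta(a|s) proportional to exp(phi(s,a) . theta)."""
    scores = np.array([phi(state, a) @ theta for a in actions])
    scores -= scores.max()                     # for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# hypothetical features for the aliased states: [wall to the north?, action == "left"?]
def phi(state, action):
    return np.array([1.0, 1.0 if action == "left" else 0.0])

theta = np.array([0.3, -0.2])
print(softmax_policy(phi, theta, "A", ["left", "right"]))   # ~[0.45, 0.55]
print(softmax_policy(phi, theta, "B", ["left", "right"]))   # same distribution: A and B are aliased
```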

Policy Objective Function

our goal is to find the parameters $\theta$ that give the policy $\pi_\theta(s,a)$ the maximal value

  • episodic environments ($H$ steps until termination)

$$
J_1(\theta)=V^{\pi_\theta}(s_1)
$$

  • continuing environments ($\infty$ horizon)

$$
J_{avV}(\theta)=\sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s)
$$

*$d^{\pi_\theta}(s)$ : the stationary distribution of the Markov chain induced by $\pi_\theta$

Policy-based RL is an optimization problem!

there certainly are gradient-free approaches (e.g., hill climbing, genetic algorithms)... they are simple, work even for non-differentiable policies, and are easy to parallelize. However, they are extremely sample-inefficient!
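
For intuition, a gradient-free search might look like the hill-climbing sketch below (my own illustration; `evaluate_policy` is an assumed routine returning, say, the average return of a few rollouts under $\pi_\theta$). Every candidate costs a full policy evaluation, which is where the sample inefficiency comes from.

```python
import numpy as np

def random_search(evaluate_policy, theta, iters=100, noise=0.1, rng=np.random):
    """Gradient-free hill climbing: keep a random perturbation only if it improves the value."""
    best_v = evaluate_policy(theta)
    for _ in range(iters):
        candidate = theta + noise * rng.standard_normal(theta.shape)
        v = evaluate_policy(candidate)          # each step pays for a full (noisy) evaluation
        if v > best_v:
            theta, best_v = candidate, v
    return theta
```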

Policy Gradient

policy gradient works very much like gradient-based training of a neural network (e.g., a CNN): set $V(\theta)=V^{\pi_\theta}$, assuming episodic MDPs

we search for a local maximum of $V(\theta)$ by gradient ascent on $\theta$

$$
\Delta\theta=\alpha\nabla_\theta V(\theta)
$$

where $\alpha$ is the learning rate and $\nabla_\theta V(\theta)$ is the policy gradient


one way to estimate the policy gradient is the finite-differences method: perturb each component of $\theta$ a little and measure the resulting change in the policy’s value
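
A minimal sketch of that idea, again assuming some `evaluate_policy(theta)` routine (e.g., average return of rollouts under $\pi_\theta$) that is not part of the lecture notes:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate grad_theta V(theta) one coordinate at a time (n evaluations for n parameters)."""
    grad = np.zeros_like(theta)
    v = evaluate_policy(theta)                  # baseline value V(theta)
    for k in range(len(theta)):
        theta_k = theta.copy()
        theta_k[k] += eps                       # perturb only the k-th parameter
        grad[k] = (evaluate_policy(theta_k) - v) / eps
    return grad

# one gradient-ascent step: theta = theta + alpha * finite_difference_gradient(evaluate_policy, theta)
```

Simple and usable even for non-differentiable policies, but each estimate costs one evaluation per parameter, and each evaluation is itself noisy.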

Score Function and Policy Gradient Theorem


the policy value → (probability of observing a particular trajectory under $\pi_\theta$) × (reward of that trajectory)

$$
V(\theta)=E[\sum^T_{t=0}R(s_t,a_t)|\pi_\theta]=\sum_\tau P(\tau;\theta)R(\tau)
$$

the very $\theta$ that maximizes $V(\theta)$ is the optimal $\theta$ we are looking for

$$
\arg\max_\theta V(\theta)=\arg\max_\theta \sum_\tau P(\tau;\theta)R(\tau)
$$

take the gradient w.r.t. $\theta$ and use the likelihood-ratio trick

$$
\begin{aligned}
\nabla_\theta V(\theta) &= \nabla_\theta \sum_\tau P(\tau;\theta)R(\tau) \\
&= \sum_\tau \nabla_\theta P(\tau;\theta)\,R(\tau) \\
&= \sum_\tau \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)\,R(\tau) \\
&= \sum_\tau P(\tau;\theta)\,R(\tau)\,\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\
&= \sum_\tau P(\tau;\theta)\,R(\tau)\,\nabla_\theta \log P(\tau;\theta)
\end{aligned}
$$

$\nabla_\theta V(\theta)=\sum_\tau P(\tau;\theta)\,R(\tau)\,\nabla_\theta \log P(\tau;\theta)$

Approximate with an empirical estimate over $m$ sample trajectories

$$
\begin{aligned}
\nabla_\theta V(\theta) \approx \hat{g} &= \frac{1}{m}\sum_{i=1}^m R(\tau^i)\,\nabla_\theta \log P(\tau^i;\theta) \\
&= \frac{1}{m}\sum_{i=1}^m R(\tau^i)\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a^i_t|s^i_t)
\end{aligned}
$$
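
In code, this estimator only needs each sampled trajectory’s total reward and its per-step score terms $\nabla_\theta \log \pi_\theta(a_t|s_t)$; the transition model never appears. A minimal sketch, assuming each trajectory is given as a list of `(grad_log_pi, reward)` pairs computed elsewhere:

```python
import numpy as np

def score_function_gradient(trajectories):
    """g_hat = (1/m) * sum_i R(tau^i) * sum_t grad log pi_theta(a_t^i | s_t^i)."""
    m = len(trajectories)
    grad = 0.0
    for traj in trajectories:                   # traj = [(grad_log_pi_t, r_t), ...]
        R = sum(r for _, r in traj)             # total trajectory reward R(tau)
        score_sum = sum(g for g, _ in traj)     # sum_t grad log pi_theta(a_t|s_t)
        grad = grad + R * score_sum
    return grad / m
```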

“move up the log probability of the sample, weighted by how good it is → the score”


increase weight of things that lead to high reward

Decompose the log term: the trajectory probability factors into the initial-state distribution, the (unknown) dynamics, and the policy; only the policy terms depend on $\theta$, so the dynamics drop out of the gradient

$$
\nabla_\theta \log P(\tau;\theta)=\nabla_\theta \log\Big[\mu(s_0)\prod_{t=0}^{T-1}\pi_\theta(a_t|s_t)P(s_{t+1}|s_t,a_t)\Big]=\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)
$$

we call that $\nabla_\theta \log \pi_\theta(s,a)$ the score function

Policy Gradient Algorithms and Reducing Variance

$$
\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^m R(\tau^i)\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a^i_t|s^i_t)
$$

This approximation is unbiased but noisy → high variance! So we look at a few fixes that make it practical

  • Temporal structure
  • Baseline
  • Alternatives to using Monte Carlo returns

Temporal structure

Each log-probability term only needs the rewards that come after it (earlier rewards cannot be affected by the current action), which lowers variance without adding bias:

$$
\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^m\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a^i_t|s^i_t)\Big(\sum_{t'=t}^{T-1}r^i_{t'}\Big)
$$

REINFORCE(Monte-Carlo policy gradient)


→ after updating on every step of the episode, we sample another episode and repeat (a rough sketch below)
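
A minimal REINFORCE sketch along these lines (my own illustration, not the slide’s pseudocode): a softmax policy over linear features, and a toy `env` object assumed to provide `reset() -> state` and `step(action) -> (next_state, reward, done)`. Using the temporal structure, each score term is weighted by the return from that time step onward rather than by the whole-trajectory return; a baseline could additionally be subtracted from `G_t`.

```python
import numpy as np

def softmax_probs(phi, theta, state, actions):
    scores = np.array([phi(state, a) @ theta for a in actions])
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

def reinforce(env, phi, actions, theta, alpha=0.01, gamma=1.0, episodes=1000):
    """Monte-Carlo policy gradient with a softmax-over-features policy."""
    for _ in range(episodes):
        # 1. roll out one episode with the current policy
        state, done, traj = env.reset(), False, []
        while not done:
            probs = softmax_probs(phi, theta, state, actions)
            a = np.random.choice(len(actions), p=probs)
            next_state, reward, done = env.step(actions[a])
            traj.append((state, a, reward))
            state = next_state

        # 2. rewards-to-go G_t (temporal structure: only rewards after step t matter)
        G, returns = 0.0, []
        for _, _, r in reversed(traj):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        # 3. one gradient-ascent step per visited (s_t, a_t)
        for (s, a, _), G_t in zip(traj, returns):
            probs = softmax_probs(phi, theta, s, actions)
            feats = np.array([phi(s, b) for b in actions])
            score = feats[a] - probs @ feats          # softmax score function
            theta = theta + alpha * G_t * score       # (optionally subtract a baseline from G_t)
    return theta
```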

How do we compute this gradient of the log-policy with respect to the policy parameters?

Classes of policies considered

  • Softmax → discrete action spaces

    score: the feature of the action taken minus the policy-averaged feature, $\phi(s,a)-\mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]$

  • Gaussian → continuous action spaces

    I don’t quite understand this part yet... (see the sketch after this list)

  • Neural network
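
For the Gaussian case, a sketch of what I think the slides mean (the feature map `phi` over states and `sigma` are assumed ingredients): the action is drawn from $\mathcal{N}(\mu(s),\sigma^2)$ with mean $\mu(s)=\phi(s)^\top\theta$ linear in the state features and a fixed variance, and the score function works out to $(a-\mu(s))\,\phi(s)/\sigma^2$.

```python
import numpy as np

def gaussian_policy_sample(phi, theta, state, sigma=1.0, rng=np.random):
    """Draw a continuous action a ~ N(mu(s), sigma^2) with mu(s) = phi(s) . theta."""
    mu = phi(state) @ theta
    return rng.normal(mu, sigma)

def gaussian_score(phi, theta, state, action, sigma=1.0):
    """Score function: grad_theta log pi_theta(a|s) = (a - mu(s)) * phi(s) / sigma^2."""
    mu = phi(state) @ theta
    return (action - mu) * phi(state) / sigma ** 2
```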
