
Stanford CS234 Lecture 8

by 누워있는말티즈 2022. 8. 11.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 8

Policy-Based Reinforcement Learning

Recall the last lecture → we learned to approximate the state-value ($V$) or state-action value ($Q$) function with parameters $w$ (or $\theta$), and used those estimates to build a (hopefully) optimal policy ($\pi$)

Today we’ll parameterize the policy itself and work with $\pi_\theta$ directly

$$
\pi_\theta(s,a)=P[a|s;\theta]
$$

our goal is to find a policy with maximal value function $V^\pi$

Benefits and Demerits

| Advantages | Disadvantages |
| --- | --- |
| better convergence properties | converges to a local optimum rather than the global one |
| works efficiently in high-dimensional or continuous action spaces | evaluating a policy is typically inefficient and high-variance |
| can learn stochastic policies | |

Stochastic policies are needed in aliased environments

  • Value-based RL would take features (a combination of the action and observations such as “is there a wall?”) and estimate action values

$$
Q_\theta(s,a)=f(\phi(s,a),\theta)
$$

  • Policy-based RL would take those same features and map them directly to action probabilities

$$
\pi_\theta(s,a)=g(\phi(s,a),\theta)
$$

if the policy were deterministic, the agent would get stuck in one of the aliased states (A or B) either way, whereas a stochastic policy keeps a non-zero probability of moving in both directions (a small sketch of such a policy follows the figures below).

[Figure: deterministic policy — the agent gets stuck in one of the aliased states]

[Figure: stochastic policy — both directions keep non-zero probability]
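
To make the $\pi_\theta(s,a)=g(\phi(s,a),\theta)$ idea concrete, here is a minimal sketch (my own illustration, not from the slides) of a softmax policy over linear feature scores: even when two states share the same features, the policy keeps a probability over both actions instead of committing to one. The feature map `phi`, the state names, and the action names are made-up assumptions.

```python
import numpy as np

def softmax_policy(phi, theta, state, actions):
    """pi_theta(a|s) proportional to exp(phi(s,a) . theta)."""
    scores = np.array([phi(state, a) @ theta for a in actions])
    scores -= scores.max()                     # for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# hypothetical features for the aliased states: [wall to the north?, action == "left"?]
def phi(state, action):
    return np.array([1.0, 1.0 if action == "left" else 0.0])

theta = np.array([0.3, -0.2])
print(softmax_policy(phi, theta, "A", ["left", "right"]))   # ~[0.45, 0.55]
print(softmax_policy(phi, theta, "B", ["left", "right"]))   # same distribution: A and B are aliased
```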

Policy Objective Function

our goal is to find the parameters $\theta$ that give the policy $\pi_\theta(s,a)$ the maximal value

  • episodic environments ($H$ steps until termination)

$$
J_1(\theta)=V^{\pi_\theta}(s_1)
$$

  • continuing environments ($\infty$ horizon)

$$
J_{avV}(\theta)=\sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s)
$$

*$d^{\pi_\theta}(s)$ : the stationary distribution of the Markov chain induced by $\pi_\theta$

Policy-based RL is an optimization problem!

there certainly are gradient-free approaches (e.g., hill climbing, genetic algorithms)... they are simple, work even for non-differentiable policies, and are easy to parallelize. However, they are extremely sample-inefficient!
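
For intuition, a gradient-free search might look like the hill-climbing sketch below (my own illustration; `evaluate_policy` is an assumed routine returning, say, the average return of a few rollouts under $\pi_\theta$). Every candidate costs a full policy evaluation, which is where the sample inefficiency comes from.

```python
import numpy as np

def random_search(evaluate_policy, theta, iters=100, noise=0.1, rng=np.random):
    """Gradient-free hill climbing: keep a random perturbation only if it improves the value."""
    best_v = evaluate_policy(theta)
    for _ in range(iters):
        candidate = theta + noise * rng.standard_normal(theta.shape)
        v = evaluate_policy(candidate)          # each step pays for a full (noisy) evaluation
        if v > best_v:
            theta, best_v = candidate, v
    return theta
```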

Policy Gradient

policy gradient works very much like gradient-based training of a neural network (e.g., a CNN): set $V(\theta)=V^{\pi_\theta}$, assuming episodic MDPs

we search for a local maximum of $V(\theta)$ by gradient ascent on $\theta$

$$
\Delta\theta=\alpha\nabla_\theta V(\theta)
$$

where $\alpha$ is the learning rate and $\nabla_\theta V(\theta)$ is the policy gradient


one way to estimate the policy gradient is the finite-differences method: perturb each component of $\theta$ a little and measure the resulting change in the policy’s value
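
A minimal sketch of that idea, again assuming some `evaluate_policy(theta)` routine (e.g., average return of rollouts under $\pi_\theta$) that is not part of the lecture notes:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate grad_theta V(theta) one coordinate at a time (n evaluations for n parameters)."""
    grad = np.zeros_like(theta)
    v = evaluate_policy(theta)                  # baseline value V(theta)
    for k in range(len(theta)):
        theta_k = theta.copy()
        theta_k[k] += eps                       # perturb only the k-th parameter
        grad[k] = (evaluate_policy(theta_k) - v) / eps
    return grad

# one gradient-ascent step: theta = theta + alpha * finite_difference_gradient(evaluate_policy, theta)
```

Simple and usable even for non-differentiable policies, but each estimate costs one evaluation per parameter, and each evaluation is itself noisy.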

Score Function and Policy Gradient Theorem


the policy value → (probability of observing a particular trajectory under $\pi_\theta$) × (reward of that trajectory)

$$
V(\theta)=E[\sum^T_{t=0}R(s_t,a_t)|\pi_\theta]=\sum_\tau P(\tau;\theta)R(\tau)
$$

the very $\theta$ that maximizes $V(\theta)$ is the optimal $\theta$ we are looking for

$$
\arg\max_\theta V(\theta)=\arg\max_\theta \sum_\tau P(\tau;\theta)R(\tau)
$$

take the gradient w.r.t. $\theta$ and use the likelihood-ratio trick

$$
\begin{aligned}
\nabla_\theta V(\theta) &= \nabla_\theta \sum_\tau P(\tau;\theta)R(\tau) \\
&= \sum_\tau \nabla_\theta P(\tau;\theta)\,R(\tau) \\
&= \sum_\tau \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)\,R(\tau) \\
&= \sum_\tau P(\tau;\theta)\,R(\tau)\,\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\
&= \sum_\tau P(\tau;\theta)\,R(\tau)\,\nabla_\theta \log P(\tau;\theta)
\end{aligned}
$$

$\nabla_\theta V(\theta)=\sum_\tau P(\tau;\theta)\,R(\tau)\,\nabla_\theta \log P(\tau;\theta)$

Approximate with an empirical estimate over $m$ sample trajectories

$$
\begin{aligned}
\nabla_\theta V(\theta) \approx \hat{g} &= \frac{1}{m}\sum_{i=1}^m R(\tau^i)\,\nabla_\theta \log P(\tau^i;\theta) \\
&= \frac{1}{m}\sum_{i=1}^m R(\tau^i)\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a^i_t|s^i_t)
\end{aligned}
$$
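
In code, this estimator only needs each sampled trajectory’s total reward and its per-step score terms $\nabla_\theta \log \pi_\theta(a_t|s_t)$; the transition model never appears. A minimal sketch, assuming each trajectory is given as a list of `(grad_log_pi, reward)` pairs computed elsewhere:

```python
import numpy as np

def score_function_gradient(trajectories):
    """g_hat = (1/m) * sum_i R(tau^i) * sum_t grad log pi_theta(a_t^i | s_t^i)."""
    m = len(trajectories)
    grad = 0.0
    for traj in trajectories:                   # traj = [(grad_log_pi_t, r_t), ...]
        R = sum(r for _, r in traj)             # total trajectory reward R(tau)
        score_sum = sum(g for g, _ in traj)     # sum_t grad log pi_theta(a_t|s_t)
        grad = grad + R * score_sum
    return grad / m
```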

“move up the log probability of the sample, weighted by how good it is → the score”


increase weight of things that lead to high reward

Decompose the log term: the trajectory probability factors into the initial-state distribution, the (unknown) dynamics, and the policy; only the policy terms depend on $\theta$, so the dynamics drop out of the gradient

$$
\nabla_\theta \log P(\tau;\theta)=\nabla_\theta \log\Big[\mu(s_0)\prod_{t=0}^{T-1}\pi_\theta(a_t|s_t)P(s_{t+1}|s_t,a_t)\Big]=\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)
$$

we call that $\nabla_\theta \log \pi_\theta(s,a)$ the score function

Policy Gradient Algorithms and Reducing Variance

$$
\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^m R(\tau^i)\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a^i_t|s^i_t)
$$

This approximation is unbiased but noisy → high variance! So we look at a few fixes that make it practical

  • Temporal structure
  • Baseline
  • Alternatives to using Monte Carlo returns

Temporal structure

Each log-probability term only needs the rewards that come after it (earlier rewards cannot be affected by the current action), which lowers variance without adding bias:

$$
\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^m\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a^i_t|s^i_t)\Big(\sum_{t'=t}^{T-1}r^i_{t'}\Big)
$$

REINFORCE(Monte-Carlo policy gradient)


→ after updating on every step of the episode, we sample another episode and repeat (a rough sketch below)
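
A minimal REINFORCE sketch along these lines (my own illustration, not the slide’s pseudocode): a softmax policy over linear features, and a toy `env` object assumed to provide `reset() -> state` and `step(action) -> (next_state, reward, done)`. Using the temporal structure, each score term is weighted by the return from that time step onward rather than by the whole-trajectory return; a baseline could additionally be subtracted from `G_t`.

```python
import numpy as np

def softmax_probs(phi, theta, state, actions):
    scores = np.array([phi(state, a) @ theta for a in actions])
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()

def reinforce(env, phi, actions, theta, alpha=0.01, gamma=1.0, episodes=1000):
    """Monte-Carlo policy gradient with a softmax-over-features policy."""
    for _ in range(episodes):
        # 1. roll out one episode with the current policy
        state, done, traj = env.reset(), False, []
        while not done:
            probs = softmax_probs(phi, theta, state, actions)
            a = np.random.choice(len(actions), p=probs)
            next_state, reward, done = env.step(actions[a])
            traj.append((state, a, reward))
            state = next_state

        # 2. rewards-to-go G_t (temporal structure: only rewards after step t matter)
        G, returns = 0.0, []
        for _, _, r in reversed(traj):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        # 3. one gradient-ascent step per visited (s_t, a_t)
        for (s, a, _), G_t in zip(traj, returns):
            probs = softmax_probs(phi, theta, s, actions)
            feats = np.array([phi(s, b) for b in actions])
            score = feats[a] - probs @ feats          # softmax score function
            theta = theta + alpha * G_t * score       # (optionally subtract a baseline from G_t)
    return theta
```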

How do we compute this gradient of the log-policy with respect to the policy parameters?

Classes of policies considered

  • Softmax → discrete action spaces

    score: the feature of the action taken minus the policy-averaged feature, $\phi(s,a)-\mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]$

  • Gaussian → continuous action spaces

    I don’t quite understand this part yet... (see the sketch after this list)

  • Neural network
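
For the Gaussian case, a sketch of what I think the slides mean (the feature map `phi` over states and `sigma` are assumed ingredients): the action is drawn from $\mathcal{N}(\mu(s),\sigma^2)$ with mean $\mu(s)=\phi(s)^\top\theta$ linear in the state features and a fixed variance, and the score function works out to $(a-\mu(s))\,\phi(s)/\sigma^2$.

```python
import numpy as np

def gaussian_policy_sample(phi, theta, state, sigma=1.0, rng=np.random):
    """Draw a continuous action a ~ N(mu(s), sigma^2) with mu(s) = phi(s) . theta."""
    mu = phi(state) @ theta
    return rng.normal(mu, sigma)

def gaussian_score(phi, theta, state, action, sigma=1.0):
    """Score function: grad_theta log pi_theta(a|s) = (a - mu(s)) * phi(s) / sigma^2."""
    mu = phi(state) @ theta
    return (action - mu) * phi(state) / sigma ** 2
```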
