Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 8
Policy-Based Reinforcement Learning
Recall last lecture → we learned to approximate the state-value ($V$) or state-action value ($Q$) with parameters $w$ (or $\theta$), and then used those estimates to build a (hopefully) optimal policy ($\pi$)
Today we parameterize the policy itself, $\pi_\theta$:
$$
\pi_\theta(s,a)=P[a\mid s;\theta]
$$
Our goal is to find the policy with the maximal value function $V^{\pi_\theta}$.
Benefits and Demerits
| Advantages | Disadvantages |
|---|---|
| Better convergence properties | Typically converges to a local optimum rather than the global one |
| Works effectively in high-dimensional or continuous action spaces | Evaluating a policy is typically inefficient and high-variance |
| Can learn stochastic policies | |
Stochastic policies are needed in aliased environments
- Value-based RL would take features (a combination of the action and "is there a wall?") and approximate the value:
$$
Q_\theta(s,a)=f(\phi(s,a),\theta)
$$
- Policy-based RL would take those same features and directly output a distribution over actions:
$$
\pi_\theta(s,a)=g(\phi(s,a),\theta)
$$
If the policy were deterministic, the agent would get stuck at either A or B anyway,
whereas a stochastic policy leaves some probability of moving in both directions.
(Figure: behavior under a deterministic policy vs. a stochastic policy in the aliased gridworld)
Policy Objective Function
Our goal is to find the best parameter $\theta$, the one that gives the policy $\pi_\theta(s,a)$ maximal value.
- episodic env ($H$ steps until termination): use the value of the start state $s_1$
$$
J_1(\theta)=V^{\pi_\theta}(s_1)
$$
- continuing env (infinite horizon): use the average value
$$
J_{avV}(\theta)=\sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s)
$$
*$d^{\pi_\theta}$ : stationary distribution of the Markov chain induced by $\pi_\theta$
Policy-based RL is an optimization problem!
There sure are gradient-free approaches... they are simple, work even for non-differentiable policies, and are easy to parallelize. However, they are extremely sample-inefficient!
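As a concrete illustration, here is a minimal random-search (hill-climbing) style sketch, assuming a hypothetical `evaluate_policy(theta)` helper that estimates $V(\theta)$ from rollouts; the helper name, perturbation scale, and loop structure are my own choices, not from the lecture.

```python
import numpy as np

def hill_climb(evaluate_policy, theta_dim, iters=200, noise=0.1, seed=0):
    """Gradient-free policy search: randomly perturb theta and keep the
    perturbation only if the estimated value improves."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(theta_dim)
    best_value = evaluate_policy(theta)          # rollout-based estimate of V(theta)
    for _ in range(iters):
        candidate = theta + noise * rng.standard_normal(theta_dim)
        value = evaluate_policy(candidate)       # every candidate needs fresh rollouts
        if value > best_value:
            theta, best_value = candidate, value
    return theta, best_value
```

Every candidate $\theta$ needs its own rollouts just to be scored, which is exactly where the sample inefficiency comes from.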
Policy Gradient
Policy gradient search works much like training a CNN with gradient descent. Define $V(\theta)=V^{\pi_\theta}$, assuming episodic MDPs.
We search for a local maximum of $V(\theta)$ by ascending the gradient with respect to $\theta$:
$$
\Delta\theta=\alpha\nabla_\theta V(\theta)
$$
where $\alpha$ is the step size (learning rate) and $\nabla_\theta V(\theta)$ is the policy gradient.
One simple way to estimate the policy gradient is the finite-differences method: perturb the $k$-th coordinate of $\theta$ by a small $\epsilon$ and approximate each partial derivative from rollouts.
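A minimal sketch of that finite-differences estimate, again assuming a hypothetical rollout-based `evaluate_policy(theta)`; it needs one noisy evaluation per parameter dimension, but works even when the policy is not differentiable.

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate dV/dtheta by perturbing one coordinate of theta at a time
    (n evaluations for an n-dimensional theta)."""
    base = evaluate_policy(theta)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        theta_k = theta.copy()
        theta_k[k] += eps
        grad[k] = (evaluate_policy(theta_k) - base) / eps
    return grad

# gradient-ascent step: theta <- theta + alpha * finite_difference_gradient(...)
```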
Score Function and Policy Gradient Theorem
The policy value → (probability of observing a particular trajectory) × (reward of that trajectory), summed over trajectories
$$
V(\theta)=E[\sum^T_{t=0}R(s_t,a_t)|\pi_\theta]=\sum_\tau P(\tau;\theta)R(\tau)
$$
The $\theta$ that maximizes $V(\theta)$ here is exactly the optimal $\theta$ we are looking for
$$
\arg\max_\theta V(\theta)=\arg\max_\theta \sum_\tau P(\tau;\theta)R(\tau)
$$
Take the gradient w.r.t. $\theta$ (the likelihood-ratio trick):
$$
\begin{aligned}
\nabla_\theta V(\theta) &= \nabla_\theta \sum_\tau P(\tau;\theta)R(\tau) \\
&= \sum_\tau \nabla_\theta P(\tau;\theta)\,R(\tau) \\
&= \sum_\tau \frac{P(\tau;\theta)}{P(\tau;\theta)}\,\nabla_\theta P(\tau;\theta)\,R(\tau) \\
&= \sum_\tau P(\tau;\theta)\,R(\tau)\,\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\
&= \sum_\tau P(\tau;\theta)\,R(\tau)\,\nabla_\theta \log P(\tau;\theta)
\end{aligned}
$$
$\nabla_\theta V(\theta)=\sum_\tau P(\tau;\theta)\,R(\tau)\,\nabla_\theta \log P(\tau;\theta)$
Approximate with an empirical estimate over $m$ sample paths:
$$
\begin{aligned}
\nabla_\theta V(\theta) \approx \hat{g} &= \frac{1}{m}\sum_{i=1}^m R(\tau^{(i)})\,\nabla_\theta \log P(\tau^{(i)};\theta) \\
&= \frac{1}{m}\sum_{i=1}^m R(\tau^{(i)}) \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t^{(i)}\mid s_t^{(i)})
\end{aligned}
$$
“Move up the log-probability of the sample, weighted by its quality → its score”
i.e., increase the weight of the things that led to high reward.
Decompose the log term: $\log P(\tau;\theta)=\log \mu(s_0)+\sum_t \log \pi_\theta(a_t\mid s_t)+\sum_t \log P(s_{t+1}\mid s_t,a_t)$. The initial-state distribution $\mu$ and the dynamics $P(s_{t+1}\mid s_t,a_t)$ do not depend on $\theta$, so $\nabla_\theta \log P(\tau;\theta)=\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)$, meaning no model of the dynamics is needed.
We call $\nabla_\theta \log \pi_\theta(a\mid s)$ the score function.
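A small numpy sketch of the estimator $\hat{g}$ above, assuming we already have, for each of the $m$ sampled trajectories, its total reward $R(\tau^{(i)})$ and an array of its per-step score vectors (the function and argument names are my own):

```python
import numpy as np

def policy_gradient_estimate(trajectory_rewards, trajectory_scores):
    """g_hat = (1/m) * sum_i R(tau_i) * sum_t grad log pi(a_t | s_t).

    trajectory_rewards: list of m floats, R(tau_i)
    trajectory_scores:  list of m arrays of shape (T_i, dim_theta) holding
                        the per-step score functions for trajectory i
    """
    m = len(trajectory_rewards)
    g_hat = np.zeros(trajectory_scores[0].shape[1])
    for R, scores in zip(trajectory_rewards, trajectory_scores):
        g_hat += R * scores.sum(axis=0)
    return g_hat / m
```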
Policy Gradient Algorithms and Reducing Variance
$$
\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^m R(\tau^{(i)}) \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t^{(i)}\mid s_t^{(i)})
$$
This approximation is unbiased but noisy → high variance! So we use three fixes that make it practical:
- Temporal structure
- Baseline
- Alternatives to using Monte Carlo returns
Temporal structure
A reward obtained before time $t$ cannot be affected by the action taken at time $t$, so each score term only needs to be weighted by the return from that step onward, $G_t^{(i)}=\sum_{t'=t}^{T-1} r_{t'}^{(i)}$, which reduces variance.
REINFORCE (Monte-Carlo policy gradient)
→ sample an episode with the current policy, update $\theta$ at each step with $\Delta\theta_t=\alpha\,\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,G_t$, then get another episode! A sketch follows below.
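A minimal REINFORCE sketch with a tabular softmax policy, assuming a hypothetical Gym-style environment whose `reset()` returns a state index and whose `step(a)` returns `(next_state, reward, done)`; the hyperparameters are illustrative only.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce(env, n_states, n_actions, episodes=1000, alpha=0.01, gamma=0.99, seed=0):
    """After each sampled episode, push up the log-probability of every
    action taken, weighted by the return observed from that step onward."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))        # logits of a tabular softmax policy
    for _ in range(episodes):
        # --- sample one episode with the current policy ---
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = rng.choice(n_actions, p=softmax(theta[s]))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # --- returns G_t (temporal structure), then the per-step updates ---
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for s, a, G_t in zip(states, actions, returns):
            grad_log = -softmax(theta[s])          # d log pi(a|s) / d theta[s, :]
            grad_log[a] += 1.0
            theta[s] += alpha * G_t * grad_log
    return theta
```

Subtracting a baseline from `G_t` (the second fix listed above) would drop straight into the final update line.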
How do we compute this derivative with respect to the policy parameters? It depends on the class of policy:
Classes of policies considered
Softmax → discrete action spaces
Score function = feature of the action taken − average feature over all actions: $\nabla_\theta \log\pi_\theta(s,a)=\phi(s,a)-\mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]$
Gaussian → continuous action spaces
The mean is a linear combination of state features, $\mu(s)=\phi(s)^\top\theta$, the variance $\sigma^2$ is fixed (or also parameterized), and $a\sim\mathcal{N}(\mu(s),\sigma^2)$; the score function is $\nabla_\theta\log\pi_\theta(s,a)=\frac{(a-\mu(s))\,\phi(s)}{\sigma^2}$ (I don't fully understand this part yet...)
Neural network
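A small sketch of the score functions for the two analytic policy classes above, assuming linear features: a row of $\phi(s,a')$ per action for the softmax policy, and a state feature vector $\phi(s)$ with fixed $\sigma$ for the Gaussian policy (for a neural-network policy the same quantity comes from backpropagating through $\log\pi_\theta(a\mid s)$).

```python
import numpy as np

def softmax_score(phi_sa, theta, a):
    """Softmax policy pi(a|s) ∝ exp(phi(s,a)·theta).
    Score = phi(s,a) - E_pi[phi(s,·)]: taken-action feature minus average feature.
    phi_sa: array (n_actions, dim) of features phi(s, a') for every action a'."""
    logits = phi_sa @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return phi_sa[a] - probs @ phi_sa

def gaussian_score(phi_s, theta, a, sigma=1.0):
    """Gaussian policy a ~ N(mu(s), sigma^2) with mean mu(s) = phi(s)·theta.
    Score = (a - mu(s)) * phi(s) / sigma^2."""
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma**2
```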