Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 8
Policy-Based Reinforcement Learning
Recall the last lecture, where we learned to approximate the state-value and action-value functions with parameters w (V_w(s) ≈ V^π(s), Q_w(s,a) ≈ Q^π(s,a)) and then derived a policy from those values.
Today we parameterize the policy itself, π_θ(a|s), and learn θ directly from experience.
Our goal is to find the policy with the maximal value function V^{π_θ}.
Benefits and Demerits
Advantages | Disadvantages
---|---
Better convergence properties | Typically converges to a local optimum rather than the global optimum
Effective in high-dimensional or continuous action spaces | Evaluating a policy is usually inefficient and has high variance
Can learn stochastic policies |
A stochastic policy can be necessary in an aliased environment (where different states produce identical observations).

- Value-based RL takes the features (a combination of the action and whether there is a wall in each direction), learns an approximate value function, and then acts (near-)greedily on it.
- Policy-based RL takes those same features and directly maps them to a distribution over actions.

The two gray cells are aliased: they produce identical features. If the policy were deterministic, the agent would choose the same action in both and get stuck anyway at either A or B, whereas a stochastic policy leaves probability of moving in both directions, so the agent eventually escapes and reaches the goal. A minimal sketch of this effect follows the figures below.

(figure: deterministic policy in the aliased gridworld)

(figure: stochastic policy in the aliased gridworld)
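As a minimal sketch of this aliasing effect (a hypothetical one-dimensional layout, not the gridworld from the slides):

```python
import numpy as np

# Toy illustration of state aliasing (hypothetical layout, not the exact
# gridworld from the slides). Cells A and B look identical to the agent, so a
# policy that is a deterministic function of the observed features must choose
# the same action in both. A stochastic policy still reaches the goal from
# either cell because it puts probability on both directions.

rng = np.random.default_rng(0)

# corridor:  A(0) -- GOAL(1) -- B(2); A and B are the aliased cells
def run(policy, start, max_steps=100):
    pos = start
    for t in range(max_steps):
        if pos == 1:
            return t                                   # steps until the goal
        a = policy()                                   # 0 = left, 1 = right
        pos = max(0, pos - 1) if a == 0 else min(2, pos + 1)
    return None                                        # never reached the goal

deterministic = lambda: 0                      # always "left" in any aliased cell
stochastic    = lambda: int(rng.integers(2))   # 50/50 left or right

print(run(deterministic, start=0))   # None: stuck against the wall forever
print(run(deterministic, start=2))   # 1: happens to work from B
print(run(stochastic,    start=0))   # finite: escapes in expectation
```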
Policy Objective Functions
Our goal is to find the best parameters θ for a parameterized policy π_θ(a|s). How do we measure the quality of a given θ?
- Episodic environments (H steps until termination): use the value of the start state,
$$
V(s_0, \theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{H} r_t \,\middle|\, s_0\right]
$$
- Continuing environments: use the average value,
$$
J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)
$$
or the average reward per time step,
$$
J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, R(s,a)
$$
where d^{π_θ}(s) is the stationary distribution of the Markov chain induced by π_θ.
Either way, we are looking for θ* = argmax_θ V(θ).
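In practice the episodic objective is simply estimated by rolling out the policy. Here is a minimal Monte-Carlo sketch, where `env_reset`, `env_step`, and `policy` are hypothetical stand-ins rather than any particular library's API:

```python
import numpy as np

# A minimal sketch of estimating the episodic objective V(s0, theta) by
# Monte-Carlo rollouts. `env_reset`, `env_step`, and `policy` are hypothetical
# stand-ins for whatever environment and parameterized policy you are using.

def estimate_value(policy, env_reset, env_step, theta, n_rollouts=100, horizon=100):
    """Average undiscounted return of pi_theta from the start state."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ep_return = env_reset(), 0.0
        for _ in range(horizon):                 # episodic: at most H steps
            a = policy(s, theta)                 # sample a ~ pi_theta(.|s)
            s, r, done = env_step(s, a)
            ep_return += r
            if done:
                break
        total += ep_return
    return total / n_rollouts                    # V(s0, theta) ~ sample mean
```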
Policy-based RL is an optimization problem!
There are gradient-free approaches (e.g. hill climbing, genetic algorithms, the cross-entropy method): they are simple, work even for non-differentiable policies, and are easy to parallelize. However, they ignore gradient information and are usually extremely sample-inefficient!
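For instance, a simple hill-climbing sketch. The assumptions here: `evaluate` is any black-box (and typically noisy) estimator of V(θ), such as the `estimate_value` stand-in above, and Gaussian perturbation is just one of many gradient-free choices:

```python
import numpy as np

# A sketch of one gradient-free approach: hill climbing with Gaussian
# perturbations. It only needs a black-box evaluator of V(theta); no gradients
# and no differentiable policy are required, but every candidate costs a full
# batch of rollouts, hence the sample inefficiency noted above.

def hill_climb(evaluate, theta, iters=200, sigma=0.1, seed=0):
    """evaluate(theta) -> scalar estimate of V(theta)."""
    rng = np.random.default_rng(seed)
    best_value = evaluate(theta)
    for _ in range(iters):
        candidate = theta + sigma * rng.standard_normal(theta.shape)
        value = evaluate(candidate)
        if value > best_value:             # keep the perturbation if it helps
            theta, best_value = candidate, value
    return theta, best_value
```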
Policy Gradient
Policy gradient works much like gradient-based training of a neural network, except that we ascend rather than descend. Let V(θ) = V^{π_θ}, making the dependence on the parameters explicit. We search for a local maximum of V(θ) by ascending its gradient:
$$
\Delta\theta = \alpha \nabla_\theta V(\theta)
$$
where ∇_θ V(θ) is the policy gradient and α is the step-size.

One way to estimate the policy gradient, even when the policy is not differentiable, is the finite-differences method: for each dimension k ∈ {1, ..., n}, perturb θ slightly in the k-th direction and measure the change in value,
$$
\frac{\partial V(\theta)}{\partial \theta_k} \approx \frac{V(\theta + \epsilon u_k) - V(\theta)}{\epsilon}
$$
where u_k is the unit vector with a 1 in the k-th component and 0 elsewhere. This needs n evaluations for an n-dimensional gradient, and each evaluation is itself a noisy rollout estimate, so it is simple but inefficient.
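A sketch of that estimator, again treating `evaluate` as a black-box (noisy) estimate of V(θ), e.g. the `estimate_value` stand-in above:

```python
import numpy as np

# A sketch of the finite-differences estimator: perturb each component of
# theta separately and re-evaluate the policy. n evaluations for an
# n-dimensional theta, and each evaluation is itself a noisy Monte-Carlo
# estimate, so this is simple but expensive.

def finite_difference_gradient(evaluate, theta, eps=1e-2):
    """evaluate(theta) -> scalar estimate of V(theta); returns approx. gradient."""
    base = evaluate(theta)
    grad = np.zeros_like(theta)
    flat = grad.ravel()
    for k in range(theta.size):
        perturbed = theta.copy()
        perturbed.ravel()[k] += eps              # theta + eps * u_k
        flat[k] = (evaluate(perturbed) - base) / eps
    return grad
```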
Score Function and Policy Gradient Theorem

Now assume the policy is differentiable (wherever it is non-zero) and that we can compute ∇_θ π_θ(a|s) analytically. Write the policy value as
$$
V(\theta) = \mathbb{E}_{\pi_\theta}\left[R(\tau)\right] = \sum_\tau P(\tau;\theta)\, R(\tau)
$$
i.e. (probability that we observe a particular trajectory) × (reward of that trajectory), where P(τ;θ) is the probability of trajectory τ under π_θ and R(τ) is the sum of rewards collected along τ.
Take the gradient with respect to θ and apply the likelihood-ratio trick:
$$
\nabla_\theta V(\theta) = \sum_\tau \nabla_\theta P(\tau;\theta)\, R(\tau) = \sum_\tau P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
$$
Approximate with an empirical estimate over m sample paths:
$$
\begin{aligned}
\nabla_\theta V(\theta) \approx \hat{g} &= \frac{1}{m}\sum_{i=1}^m R(\tau^i)\, \nabla_\theta \log P(\tau^i;\theta) \\
&= \frac{1}{m}\sum_{i=1}^m R(\tau^i) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)
\end{aligned}
$$
“Move up the log probability of the sample in proportion to its quality (its return) → the score.”

increase weight of things that lead to high reward
Decompose the log term:
$$
\nabla_\theta \log P(\tau^i;\theta) = \nabla_\theta \log\left[\mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t^i \mid s_t^i)\, P(s_{t+1}^i \mid s_t^i, a_t^i)\right] = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)
$$
The initial-state distribution μ(s_0) and the dynamics P(s'|s,a) do not depend on θ, so their gradients vanish and no dynamics model is required. We call ∇_θ log π_θ(a|s) the score function.
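As a quick sanity check (a made-up one-step softmax "bandit", not an example from the lecture), the score-function form of the gradient matches the exact gradient of the expected reward:

```python
import numpy as np

# Sanity check of the likelihood-ratio / score-function identity on a one-step
# "bandit": grad_theta E[R(a)] == E[R(a) * grad_theta log pi(a)], computed
# exactly over the three actions instead of by sampling.

rewards = np.array([1.0, 3.0, 0.5])            # R(a) for 3 discrete actions
theta = np.array([0.2, -0.1, 0.4])             # softmax preferences

def pi(th):
    e = np.exp(th - th.max())
    return e / e.sum()

# exact gradient of E[R] = sum_a pi(a) R(a) for a softmax distribution
p = pi(theta)
expected_R = p @ rewards
exact_grad = p * (rewards - expected_R)

# score-function form: E[ R(a) * grad_theta log pi(a) ]
score_grad = np.zeros_like(theta)
for a in range(3):
    grad_log = -p.copy()
    grad_log[a] += 1.0                          # grad_theta log pi(a)
    score_grad += p[a] * rewards[a] * grad_log

print(np.allclose(exact_grad, score_grad))      # True
```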
Policy Gradient Algorithms and Reducing Variance
$$
\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^m R(\tau^i) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)
$$
This approximation is unbiased but noisy → high variance! So we apply a few fixes that make it practical:
- Temporal structure
- Baseline
- Alternatives to using Monte Carlo returns
Temporal structure
Rewards collected before time t do not depend on the action taken at time t, so each action only needs to be credited with the rewards that come after it. Rearranging the sum gives
$$
\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \left(\sum_{t'=t}^{T-1} r_{t'}^i\right)
$$
which has the same expectation but lower variance.
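A small sketch of the resulting "reward-to-go" computation (the discount factor is optional and not part of the formula above):

```python
import numpy as np

# The temporal-structure fix in code: each action is credited only with the
# rewards that come after it (the "reward-to-go"), not the full trajectory
# return. `rewards` is the sequence r_0, ..., r_{T-1} of one episode.

def reward_to_go(rewards, gamma=1.0):
    """Return the array of G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

print(reward_to_go([0.0, 0.0, 1.0]))   # [1. 1. 1.] (undiscounted)
```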
REINFORCE (Monte-Carlo policy gradient)
Use the sampled return G_t = Σ_{t'=t}^{T-1} r_{t'} as an unbiased estimate of the expected return from (s_t, a_t):
- Initialize θ arbitrarily
- For each episode {s_0, a_0, r_0, ..., s_{T-1}, a_{T-1}, r_{T-1}} sampled from π_θ:
  - for t = 0, ..., T-1: θ ← θ + α ∇_θ log π_θ(a_t | s_t) G_t
- Return θ

→ we do all the updates for one episode, then collect another episode!
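To make this concrete, here is a minimal REINFORCE sketch assuming a made-up 5-state chain MDP and a tabular softmax policy (neither is from the lecture), using reward-to-go returns as in the temporal-structure section:

```python
import numpy as np

# Minimal REINFORCE sketch on a hypothetical 5-state chain MDP with a tabular
# softmax policy and reward-to-go returns (the temporal-structure form).

n_states, n_actions, gamma, alpha = 5, 2, 0.99, 0.1
rng = np.random.default_rng(0)
theta = np.zeros((n_states, n_actions))          # policy parameters

def policy(s):
    """Softmax over the action preferences theta[s, :]."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Chain MDP: action 0 moves left, 1 moves right; +1 for reaching the end."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

for episode in range(2000):
    # sample one episode with the current policy
    s, traj, done = 0, [], False
    while not done and len(traj) < 100:
        a = rng.choice(n_actions, p=policy(s))
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2
    # reward-to-go returns G_t (temporal structure)
    G, returns = 0.0, []
    for (_, _, r) in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # REINFORCE update: theta <- theta + alpha * G_t * grad log pi(a_t|s_t)
    for (s, a, _), G in zip(traj, returns):
        grad_log = -policy(s)                    # d log pi(a|s) / d theta[s, :]
        grad_log[a] += 1.0
        theta[s] += alpha * G * grad_log

print(policy(0))   # should now put most of its probability on "right"
```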
How do we actually compute ∇_θ log π_θ(a|s) for a given policy parameterization?
Classes of policies considered
- Softmax → discrete action spaces. Action preferences are linear in features, π_θ(a|s) ∝ exp(φ(s,a)·θ), and the score function is the feature of the action taken minus the average feature over all actions:
$$
\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\left[\phi(s,\cdot)\right]
$$
- Gaussian → continuous action spaces. The mean is linear in state features, μ(s) = φ(s)·θ, the variance σ² is fixed (or also parameterized), and actions are sampled as a ~ N(μ(s), σ²). The score function is
$$
\nabla_\theta \log \pi_\theta(a \mid s) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}
$$
(I don't quite understand this part yet...)
- Neural network → parameterize the policy directly with a deep network and obtain the score function by automatic differentiation.
Both linear score functions are sketched in code below.
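A short sketch of both score functions for a single fixed state, where the feature vectors, the dimensions, and σ are made-up placeholders:

```python
import numpy as np

# Score functions of the two linear policy classes above, for one fixed state s.
# The feature vectors `phi_sa` (state-action) and `phi_s` (state-only), the
# dimensions, and sigma are placeholders chosen only for illustration.

d, n_actions, sigma = 4, 3, 0.5
theta = np.zeros(d)                                  # policy parameters
phi_sa = [np.eye(d)[a] for a in range(n_actions)]    # phi(s, a) for each action a
phi_s = np.ones(d)                                   # phi(s) for the Gaussian policy

# Softmax policy: pi_theta(a|s) proportional to exp(phi(s,a) . theta)
def softmax_score(a):
    prefs = np.array([f @ theta for f in phi_sa])
    p = np.exp(prefs - prefs.max()); p /= p.sum()
    expected_phi = sum(p[b] * phi_sa[b] for b in range(n_actions))
    return phi_sa[a] - expected_phi       # phi(s,a) - E_{pi_theta}[phi(s, .)]

# Gaussian policy: a ~ N(mu(s), sigma^2) with mu(s) = phi(s) . theta
def gaussian_score(a):
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma**2    # (a - mu(s)) * phi(s) / sigma^2

print(softmax_score(0))
print(gaussian_score(0.3))
```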