
Stanford CS234 Lecture 8

by 누워있는말티즈 2022. 8. 11.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 8

Policy-Based Reinforcement Learning

Recall the last lecture, where we learned to estimate the state value V or the state-action value Q with parameters w (or θ), and then used those estimates to build a (hopefully) optimal policy π.

Today we parameterize the policy πθ directly:

$$
\pi_\theta(s,a) = P[a \mid s; \theta]
$$

Our goal is to find the policy πθ with the maximal value function V^πθ.
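As a concrete illustration, here is a minimal sketch of one such parameterization (a softmax over linear action preferences); the feature function and dimensions below are made up for the example.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """pi_theta(a | s): softmax over linear action preferences theta^T phi(s, a)."""
    prefs = np.array([theta @ phi(state, a) for a in actions])
    prefs -= prefs.max()                          # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# Toy usage with made-up one-hot features over (state, action) pairs.
n_states, n_actions = 4, 2
phi = lambda s, a: np.eye(n_states * n_actions)[s * n_actions + a]
theta = np.zeros(n_states * n_actions)            # the parameters we will learn

probs = softmax_policy(theta, phi, state=0, actions=range(n_actions))
action = np.random.choice(n_actions, p=probs)     # sample a ~ pi_theta(. | s=0)
```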

Benefits and Demerits

| Advantages | Disadvantages |
| --- | --- |
| better convergence properties | converges to a local optimum rather than the global one |
| works efficiently in high-dimensional or continuous action spaces | inefficient & high variance |
| can learn stochastic policies | |

Stochastic policies are needed in aliased environments (where different states look identical to the agent)

  • Value-based RL would take features (combinations of the action and observations such as "is there a wall?") and estimate

$$
Q_\theta(s,a) = f(\phi(s,a), \theta)
$$

  • Policy-based RL would take those same features and directly decide which action to take

$$
\pi_\theta(s,a) = g(\phi(s,a), \theta)
$$

If the policy were deterministic, the agent would get stuck at either A or B, whereas a stochastic policy keeps some probability of moving in both directions.

(figure: deterministic policy)

(figure: stochastic policy)

Policy Objective Function

Our goal is to find the parameter θ that maximizes the value of the policy πθ(s,a).

  • episodic environments (H steps until termination)

$$
J_1(\theta) = V^{\pi_\theta}(s_1)
$$

  • continuing environments (infinite horizon)

$$
J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)
$$

*d^πθ: the stationary distribution of the Markov chain induced by πθ

Policy-based RL is an optimization problem!

There are gradient-free approaches as well: they are simple, work even for non-differentiable policies, and are easy to parallelize. However, they are extremely sample-inefficient!
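For intuition, here is a minimal sketch of one gradient-free approach (hill climbing with random perturbations); the `evaluate_policy` rollout routine and the noise scale are assumptions for the example, not part of the lecture.

```python
import numpy as np

def hill_climb(evaluate_policy, theta, noise_scale=0.1, iterations=100):
    """Gradient-free search: keep a perturbed theta only if it evaluates better.

    evaluate_policy(theta) -> estimated return of pi_theta (e.g., mean return
    over a few rollouts); this routine is assumed, not defined in the lecture.
    """
    best_return = evaluate_policy(theta)
    for _ in range(iterations):
        candidate = theta + noise_scale * np.random.randn(*theta.shape)
        candidate_return = evaluate_policy(candidate)  # fresh rollouts every time -> sample-hungry
        if candidate_return > best_return:
            theta, best_return = candidate, candidate_return
    return theta
```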

Policy Gradient

Policy gradient is very much like gradient-based training of a neural network (e.g., a CNN). Define V(θ) = V^πθ, assuming episodic MDPs.

We search for a local maximum of V(θ) by ascending the gradient with respect to θ:

$$
\Delta\theta = \alpha \nabla_\theta V(\theta)
$$

where α is the step size (learning rate) and ∇θV(θ) is the policy gradient.


One way to estimate the policy gradient is the finite-differences method: perturb each dimension of θ a little and measure how the value changes.
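A minimal sketch of the finite-differences estimate, again assuming a hypothetical `evaluate_policy(theta)` rollout routine and a small perturbation `eps`:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate dV(theta)/dtheta_k by perturbing each dimension of theta separately.

    Needs one (noisy) policy evaluation per dimension, so it scales poorly,
    but it works even when pi_theta is not differentiable.
    """
    grad = np.zeros_like(theta)
    base = evaluate_policy(theta)
    for k in range(theta.size):
        perturbed = theta.copy()
        perturbed[k] += eps
        grad[k] = (evaluate_policy(perturbed) - base) / eps
    return grad
```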

Score Function and Policy Gradient Theorem


The policy value is the sum over trajectories of (the probability of observing that trajectory) × (the reward of that trajectory):

$$
V(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{T} R(s_t, a_t) \,\middle|\, \pi_\theta\right] = \sum_\tau P(\tau;\theta)\, R(\tau)
$$

The θ that maximizes V(θ) here would be the optimal θ we were looking for:

$$
\arg\max_\theta V(\theta) = \arg\max_\theta \sum_\tau P(\tau;\theta)\, R(\tau)
$$

Take the gradient with respect to θ:

$$
\begin{aligned}
\nabla_\theta V(\theta) &= \nabla_\theta \sum_\tau P(\tau;\theta)\, R(\tau) \\
&= \sum_\tau \frac{P(\tau;\theta)}{P(\tau;\theta)}\, \nabla_\theta P(\tau;\theta)\, R(\tau) \\
&= \sum_\tau P(\tau;\theta)\, R(\tau)\, \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\
&= \sum_\tau P(\tau;\theta)\, R(\tau)\, \nabla_\theta \log P(\tau;\theta)
\end{aligned}
$$

Approximate with an empirical estimate over m sample paths:

$$
\begin{aligned}
\nabla_\theta V(\theta) \approx \hat{g} &= (1/m)\sum_{i=1}^m R(\tau^i)\, \nabla_\theta \log P(\tau^i ; \theta)\\
&= (1/m)\sum^m_{i=1} R(\tau^i) \sum^{T-1}_{t=0}\nabla_\theta \log \pi_\theta(a^i_t \mid s_t^i)
\end{aligned}
$$

“move up the log probability of the sample in proportion to how good it is (its score)”


increase the weight of actions that lead to high reward

Decompose the log term: the trajectory probability factors into the initial-state distribution, the dynamics, and the policy, and only the policy terms depend on θ, so

$$
\nabla_\theta \log P(\tau;\theta) = \nabla_\theta \left[ \log \mu(s_0) + \sum_{t=0}^{T-1}\log P(s_{t+1} \mid s_t, a_t) + \sum_{t=0}^{T-1}\log \pi_\theta(a_t \mid s_t) \right] = \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

We call ∇θ log πθ(s,a) the score function; notably, no model of the dynamics is required.

Policy Gradient Algorithms and Reducing Variance

$$
\nabla_\theta V(\theta) \approx (1/m)\sum^m_{i=1} R(\tau^i) \sum^{T-1}_{t=0}\nabla_\theta \log \pi_\theta(a^i_t \mid s_t^i)
$$

This approximation is unbiased but noisy (high variance). The following fixes make it practical:

  • Temporal structure
  • Baseline
  • Alternatives to using Monte Carlo returns

Temporal structure
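Exploiting the temporal structure: the action at time t can only affect rewards from time t onward, so (in the standard form of this estimator, writing r^i_{t'} for the reward at step t' of trajectory i) each score term is weighted only by the return from that step:

$$
\nabla_\theta V(\theta) \approx (1/m)\sum^m_{i=1}\sum^{T-1}_{t=0} \nabla_\theta \log \pi_\theta(a^i_t \mid s^i_t)\left(\sum^{T-1}_{t'=t} r^i_{t'}\right)
$$

The Baseline fix then subtracts a state-dependent baseline b(s_t) from the inner return, which reduces variance further without introducing bias.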


REINFORCE(Monte-Carlo policy gradient)


→ we perform the update for every step of the episode, then sample another episode and repeat!
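Below is a minimal Python sketch of REINFORCE with a linear softmax policy, assuming a classic Gym-style environment where `env.step(a)` returns `(next_state, reward, done, info)` and a made-up feature function `phi(s, a)`; the step size and other choices are illustrative assumptions.

```python
import numpy as np

def softmax_probs(theta, phi, state, n_actions):
    prefs = np.array([theta @ phi(state, a) for a in range(n_actions)])
    prefs -= prefs.max()
    return np.exp(prefs) / np.exp(prefs).sum()

def reinforce(env, phi, n_actions, n_features, alpha=0.01, gamma=1.0, episodes=1000):
    """REINFORCE (Monte-Carlo policy gradient) sketch for a linear softmax policy.

    phi(s, a) -> feature vector of length n_features (an assumption of this sketch).
    """
    theta = np.zeros(n_features)
    for _ in range(episodes):
        # 1. Sample an episode with the current policy
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            probs = softmax_probs(theta, phi, s, n_actions)
            a = np.random.choice(n_actions, p=probs)
            s_next, r, done, _ = env.step(a)          # classic Gym API assumed
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # 2. For each step, move theta along the score times the return from that step
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G                # Monte Carlo return from time t
            probs = softmax_probs(theta, phi, states[t], n_actions)
            expected_phi = sum(probs[b] * phi(states[t], b) for b in range(n_actions))
            score = phi(states[t], actions[t]) - expected_phi   # grad log pi for linear softmax
            theta += alpha * G * score
    return theta
```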

How do we compute this gradient with respect to the policy parameters?

Classes of policies considered

  • Softmax → discrete action spaces

    score function = the feature of the action taken minus the average feature over all actions (see the formulas after this list)

  • Gaussian → continuous action spaces

    (I didn't quite understand this part...)

  • Neural network
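For reference, the score functions for the first two classes can be written out as follows (assuming, for the Gaussian case, a linear mean μ(s) = φ(s)ᵀθ and a fixed variance σ²):

$$
\begin{aligned}
&\text{Softmax: } \pi_\theta(s,a) \propto e^{\phi(s,a)^\top \theta}, \quad
\nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\!\left[\phi(s,\cdot)\right] \\
&\text{Gaussian: } a \sim \mathcal{N}(\mu(s), \sigma^2),\ \ \mu(s) = \phi(s)^\top\theta, \quad
\nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}
\end{aligned}
$$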

