
Stanford CS234 Lecture 7

by 누워있는말티즈 2022. 8. 9.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 7

Imitation Learning

there are occasions where specifying a reward that is dense in time is hard, or where each iteration of gathering experience is super expensive

→ autonomous driving kind of stuff

So we summon an expert to demonstrate trajectories

Our Problem Setup

we will talk about the three methods below, and their goals are...

  • Behavioral Cloning : learn a policy directly from the teacher’s demonstrations
  • Inverse RL : can we extract the reward $R$ from the demonstrations?
  • Apprenticeship Learning : can we use $R$ to make a good policy?

Behavioral Cloning

seems familiar... a lot like simple supervised learning

we fix a policy class (NN, decision tree, ...) and estimate the policy from the “demonstration sets”
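
As a rough sketch (not from the lecture; the toy dataset and the linear-softmax policy class below are made up for illustration), behavioral cloning is literally supervised learning on the expert’s (state, action) pairs:

```python
# Behavioral cloning as plain supervised learning (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert demonstrations: state features and discrete actions.
expert_states = rng.normal(size=(500, 4))                # 500 states, 4 features
expert_actions = (expert_states[:, 0] > 0).astype(int)   # toy "expert" rule, 2 actions

# Fixed policy class: linear softmax over actions.
W = np.zeros((4, 2))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Maximize the log-likelihood of the expert's actions (i.e. minimize cross-entropy).
lr = 0.1
for _ in range(200):
    probs = softmax(expert_states @ W)
    onehot = np.eye(2)[expert_actions]
    W += lr * expert_states.T @ (onehot - probs) / len(expert_states)

# The cloned policy just picks the action the expert would most likely take.
def bc_policy(state):
    return int(np.argmax(state @ W))

print("training accuracy:",
      (softmax(expert_states @ W).argmax(1) == expert_actions).mean())
```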

let’s go over two notable models

ALVINN

ALVINN encounters two major sources of compounding error:

→ supervised learning’s basic assumption is that all data are *iid (independent and identically distributed)*. But our dataset $(s_0,a_0,s_1,a_1,...)$ is sequential and correlated, so errors can compound along the trajectory.

→ in addition, if the agent falls into a state the Expert has never visited, there is no way to recover because we have no data on what to do there. This also leads to accumulating errors.

DAGGER : Dataset Aggregation

we assume that the Expert can be queried for the correct action in the states the agent actually visits, while the agent itself doesn’t have direct access to $\pi^*$. The agent rolls out its current policy, the Expert labels the visited states with the actions it would have taken, the labeled data are aggregated into the dataset, and the policy is retrained.

the downside is that this is extremely expensive in expert effort, since the Expert has to be queried over and over during training!
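
A rough sketch of the DAGGER loop (the `env`, `expert_action`, and `fit_policy` interfaces below are assumptions for illustration, not the lecture’s notation):

```python
# DAGGER sketch: roll out our own policy, let the Expert label the visited states,
# aggregate everything, retrain. `env.reset()/env.step()`, `expert_action(state)`,
# and `fit_policy(states, actions)` are assumed helper interfaces.

def dagger(env, expert_action, fit_policy, n_iters=10, episode_len=100):
    dataset_states, dataset_actions = [], []
    policy = None  # on the first iteration we roll out the expert itself

    for _ in range(n_iters):
        # 1. Roll out the current policy (the expert on the first iteration).
        state = env.reset()
        for _ in range(episode_len):
            action = expert_action(state) if policy is None else policy(state)
            # 2. Ask the Expert what it would have done in this state.
            dataset_states.append(state)
            dataset_actions.append(expert_action(state))
            state, done = env.step(action)
            if done:
                break

        # 3. Retrain the policy on the aggregated dataset (plain supervised learning).
        policy = fit_policy(dataset_states, dataset_actions)

    return policy
```

The expensive part is step 2: every state the learner visits needs a fresh label from the Expert.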

Inverse RL

we look at the *Expert’s* policy and endeavor to find the *reward function*.

recall the linear value function approximation from Lecture 5

we will consider the reward to be linear in the state features: $R(s)=w^Tx(s)$

value function for policy $\pi$ would be as below

$$
\begin{aligned}
V^\pi &= E\big[\textstyle\sum^\infty_{t=0}\gamma^t R(s_t)\mid\pi\big] \\
&= E\big[\textstyle\sum^\infty_{t=0}\gamma^t w^T x(s_t)\mid\pi\big] \\
&= w^T E\big[\textstyle\sum^\infty_{t=0}\gamma^t x(s_t)\mid\pi\big] = w^T\mu(\pi)
\end{aligned}
$$

* $\mu(\pi)$ : the expected discounted sum of state features under $\pi$

with the method above we may resolve some of the problems, since at least it won’t set every reward to zero.
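
Since everything reduces to $w^T\mu(\pi)$, the quantity we actually have to estimate from data is $\mu(\pi)$. A minimal Monte-Carlo sketch (the toy trajectory below is made up for illustration):

```python
# Monte-Carlo estimate of the discounted feature expectations mu(pi),
# averaged over sampled trajectories of state features x(s_t).
import numpy as np

def feature_expectations(trajectories, gamma=0.9):
    """mu(pi) ~= (1/N) * sum over trajectories of sum_t gamma^t * x(s_t)."""
    mu = np.zeros_like(np.asarray(trajectories[0][0], dtype=float))
    for traj in trajectories:
        for t, x_s in enumerate(traj):
            mu += (gamma ** t) * np.asarray(x_s, dtype=float)
    return mu / len(trajectories)

# With reward weights w, the value of the policy is then just w^T mu(pi).
w = np.array([1.0, -0.5])
toy_trajectory = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(w @ feature_expectations([toy_trajectory]))
```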

Apprenticeship Learning

super similar to Inverse RL! It uses the same representation of $V^\pi$:

$$
V^\pi=w^T\mu(\pi)
$$

before we go further, recall that $V^* \ge V^\pi$ and thus $w^{*T}\mu(\pi^*) \ge w^{*T}\mu(\pi)$.

We wish to find a reward function such that the expert policy performs best. And if $V^\pi$ for a policy $\pi$ is sufficiently close to $V^*$ of $\pi^*$, we can guarantee that we have acquired a policy that performs as well as the expert policy.
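
A one-line justification of that guarantee (assuming bounded reward weights, $\|w\|_2 \le 1$, as in the standard apprenticeship-learning setup): if $\|\mu(\pi)-\mu(\pi^*)\|_2 \le \epsilon$, then

$$
|V^{\pi^*}-V^\pi| = |w^T(\mu(\pi^*)-\mu(\pi))| \le \|w\|_2\,\|\mu(\pi^*)-\mu(\pi)\|_2 \le \epsilon
$$

by the Cauchy-Schwarz inequality.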

  • Algorithm (we don’t really use this anymore)

    pick reward weights $w$ under which the expert looks best, compute the optimal policy for that reward, compare its feature expectations to the expert’s, and repeat the process until the difference $|w^T\mu(\pi^*)-w^T\mu(\pi)|$ is sufficiently small.
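
A rough sketch of that loop (the `solve_rl_for_reward` and `estimate_mu` helpers are assumed to exist, and the simple “closest gap” choice of $w$ below stands in for the max-margin optimization used in the original algorithm):

```python
# Apprenticeship-learning loop sketch. Assumed helpers (not from the lecture):
#   solve_rl_for_reward(w) -> a policy that is (approximately) optimal for R(s) = w^T x(s)
#   estimate_mu(policy)    -> Monte-Carlo feature expectations mu(policy)
import numpy as np

def apprenticeship_learning(mu_expert, solve_rl_for_reward, estimate_mu,
                            epsilon=1e-2, max_iters=50):
    # Start from an arbitrary reward guess and the corresponding policy.
    w = np.random.default_rng(0).normal(size=mu_expert.shape)
    policy = solve_rl_for_reward(w)
    mus = [estimate_mu(policy)]

    for _ in range(max_iters):
        # 1. Choose w so the expert beats the policies found so far; here we
        #    simply point w at the closest feature-expectation gap (a stand-in
        #    for the max-margin step of the original algorithm).
        w = min((mu_expert - mu for mu in mus), key=np.linalg.norm)

        # 2. Stop once |w^T mu(pi*) - w^T mu(pi)| is sufficiently small.
        if np.linalg.norm(w) <= epsilon:
            break

        # 3. Otherwise, compute the optimal policy for R(s) = w^T x(s)
        #    and add its feature expectations to the set of candidates.
        policy = solve_rl_for_reward(w)
        mus.append(estimate_mu(policy))

    return policy, w
```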


Unresolved...

  • there are infinitely many reward functions with the same optimal policy
  • even if we have the reward function, there may be multiple policies that fit it optimally
  • so which one should we choose?