Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 7
Imitation Learning
there are occasions where we would need rewards that are dense in time (which are hard to specify by hand) or where each iteration of experience is super expensive
→ autonomous driving kind of stuff
So we summon an expert to demonstrate trajectories
Our Problem Setup
we are given the state space, action space, and a transition model, but no reward function $R$; instead we get a set of expert demonstrations $(s_0,a_0,s_1,a_1,...)$
we will talk about three methods below; their goals are...
- Behavioral Cloning : learn the teacher’s policy directly via supervised learning
- Inverse RL : can we extract a reward function $R$ from the demonstrations?
- Apprenticeship Learning : can we use that $R$ to produce a good policy?
Behavioral Cloning
seems familiar... a lot like simple supervised learning
we fix a policy class (neural network, decision tree, ...) and estimate the policy from the demonstration set
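To make this concrete, here is a minimal behavioral-cloning sketch, assuming we already have arrays of expert states and actions and using an MLP classifier as an illustrative policy class (the names `demo_states` / `demo_actions` are assumptions, not from the lecture):

```python
# Minimal behavioral-cloning sketch: treat the expert's (state, action) pairs
# as a supervised dataset and fit a fixed policy class to it.
from sklearn.neural_network import MLPClassifier

def behavioral_cloning(demo_states, demo_actions):
    """Fit a policy pi(s) -> a by plain supervised learning on expert data."""
    # demo_states: array of shape [N, state_dim]; demo_actions: array of shape [N]
    policy = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000)
    policy.fit(demo_states, demo_actions)
    return policy

# usage: act greedily with the cloned policy
# action = policy.predict(state.reshape(1, -1))[0]
```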
let’s go over two notable examples
ALVINN
ALVINN (Autonomous Land Vehicle In a Neural Network) was an early neural network that learned to steer a vehicle directly from sensor input by cloning a human driver. It highlights two major problems of behavioral cloning, both of which lead to compounding errors
→ supervised learning assumes the data are *iid (independent and identically distributed)*, but our dataset $(s_0,a_0,s_1,a_1,...)$ is sequential and correlated, so errors accumulate along the trajectory (compounding roughly quadratically in the horizon in the standard analysis)
→ in addition, if the agent drifts into a state the expert never visited, there is no data telling it how to recover, which piles up even more error
DAGGER : Dataset Aggregation
we assume the expert can be queried for the correct action at the states our own policy visits, even though the agent has no direct access to $\pi^*$ itself: roll out the current policy, have the expert label the visited states, add them to one aggregated dataset, and refit (sketch below)
the catch: repeatedly querying the expert makes this method extremely expensive in expert effort!
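A minimal sketch of the DAGGER loop, assuming hypothetical helpers `rollout` (runs a policy and returns visited states), `expert_action` (queries the expert), and `fit_policy` (supervised learning on the aggregated dataset); none of these are a real API:

```python
# DAGGER sketch: roll out OUR current policy, ask the expert to label the
# states we actually visited, aggregate everything into one dataset, refit.
# `rollout`, `expert_action`, and `fit_policy` are assumed helpers, not a real API.

def dagger(env, expert_action, fit_policy, rollout, n_iters=10):
    states, actions = [], []                      # aggregated dataset D
    policy = None                                 # iteration 0 can just follow the expert
    for _ in range(n_iters):
        # run the current policy (or the expert at first) and record visited states
        visited = rollout(env, policy if policy is not None else expert_action)
        for s in visited:
            states.append(s)
            actions.append(expert_action(s))      # expert labels states from OUR distribution
        policy = fit_policy(states, actions)      # supervised learning on aggregated data
    return policy
```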
Inverse RL
we look at the expert’s demonstrations and endeavor to recover the *reward function* the expert is optimizing
recall linear value function approximation from Lecture 5
we will assume the reward is linear in the state features: $R(s)=w^Tx(s)$
the value function for a policy $\pi$ is then
$$
\begin{aligned}
V^\pi &= E\!\left[\sum_{t=0}^{\infty}\gamma^t R(s_t)\,\Big|\,\pi\right]
= E\!\left[\sum_{t=0}^{\infty}\gamma^t w^T x(s_t)\,\Big|\,\pi\right] \\
&= w^T\,E\!\left[\sum_{t=0}^{\infty}\gamma^t x(s_t)\,\Big|\,\pi\right]
= w^T\mu(\pi)
\end{aligned}
$$
*$\mu(\pi)$ : the expected discounted sum of state features under $\pi$
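In practice $\mu(\pi)$ can be estimated by Monte Carlo from sampled rollouts; a small sketch, assuming a feature map `feature_fn` (the $x(s)$ above) and a list of state trajectories generated by running $\pi$ (both names are assumptions):

```python
# Monte Carlo estimate of mu(pi): average discounted sum of state features over
# sampled trajectories. `feature_fn` is the feature map x(s); `trajectories` is a
# list of state sequences obtained by running pi (both are assumed inputs).
import numpy as np

def feature_expectations(trajectories, feature_fn, gamma=0.99):
    mu = np.zeros_like(np.asarray(feature_fn(trajectories[0][0]), dtype=float))
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * np.asarray(feature_fn(s))   # gamma^t * x(s_t)
    return mu / len(trajectories)                            # average over trajectories

# V^pi is then approximately w @ feature_expectations(trajectories, feature_fn)
```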
this formulation lets us address part of the reward-ambiguity problem: at the very least we can avoid the degenerate answer of setting all rewards to zero ($w=0$), under which every policy would look equally good
Apprenticeship Learning
super similar to inverse RL! Everything up to the $V^\pi$ representation is the same.
$$
V^\pi=w^T\mu(\pi)
$$
before we go further, recall that the expert is assumed to act optimally, so $V^{\pi^*} \ge V^\pi$ for all $\pi$, and thus $w^{*T}\mu(\pi^*) \ge w^{*T}\mu(\pi)$.
We wish to find a reward function (weights $w$) such that the expert policy performs best. And if $V^\pi$ of our policy $\pi$ is sufficiently close to $V^{\pi^*}$ of the expert $\pi^*$, we can guarantee that we have acquired a policy that performs nearly as well as the expert.
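“Sufficiently close” can be made precise with the standard apprenticeship-learning bound (for bounded weights, by Hölder’s inequality):

$$
\|\mu(\pi)-\mu(\pi^*)\|_1 \le \epsilon
\;\Longrightarrow\;
\big|w^T\mu(\pi) - w^T\mu(\pi^*)\big| \le \|w\|_\infty\,\|\mu(\pi)-\mu(\pi^*)\|_1 \le \epsilon
\quad\text{for all } \|w\|_\infty \le 1,
$$

so a policy whose feature expectations match the expert’s to within $\epsilon$ has value within $\epsilon$ of the expert under the (unknown) true reward weights.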
Algorithm (we don’t really use this anymore)
alternate between (1) finding weights $w$ under which the expert maximally outperforms every policy found so far, and (2) solving RL for the reward $R(s)=w^Tx(s)$ to get a new policy to add to that set; repeat until the margin $|w^T\mu(\pi^*)-w^T\mu(\pi)|$ is sufficiently small
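A rough sketch of that loop, where `compute_mu`, `solve_max_margin_w`, and `rl_solver` are placeholder helpers (estimate feature expectations, solve the max-margin problem over $w$, and run an RL solver for a given reward), not real library calls:

```python
# Rough sketch of the apprenticeship-learning loop described above.
# `compute_mu`, `solve_max_margin_w`, and `rl_solver` are placeholder helpers,
# not real library calls.
import numpy as np

def apprenticeship_learning(mu_expert, initial_policy, compute_mu,
                            solve_max_margin_w, rl_solver, eps=1e-3, max_iters=50):
    policies = [initial_policy]
    mus = [compute_mu(initial_policy)]        # feature expectations of each policy so far
    for _ in range(max_iters):
        # find w maximizing the margin between the expert and all policies so far
        w, margin = solve_max_margin_w(mu_expert, mus)
        if margin <= eps:                     # |w^T mu(pi*) - w^T mu(pi)| small enough: stop
            break
        # compute an (approximately) optimal policy for the reward R(s) = w^T x(s)
        pi = rl_solver(lambda x_s: np.dot(w, x_s))
        policies.append(pi)
        mus.append(compute_mu(pi))
    return policies, w
```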
Unresolved...
- there are infinitely many reward functions with the same optimal policy
- even given a reward function, there may be multiple policies that are optimal for it
- so which one do we choose???