Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 7
Imitation Learning
there are occasions where we would need rewards that are dense in time (which are hard to specify by hand) or where each iteration of experience is super expensive
→ autonomous driving kind of stuff
So we summon an expert to demonstrate trajectories
Our Problem Setup
we are given the state space, action space, and a transition model, but no reward function $R$; instead we get a set of expert demonstrations $(s_0,a_0,s_1,a_1,...)$
we will talk about three methods below; their goals are...
- Behavioral Cloning : learn the teacher’s policy directly via supervised learning
- Inverse RL : can we extract a reward function $R$ from the demonstrations?
- Apprenticeship Learning : can we use that $R$ to produce a good policy?
Behavioral Cloning
seems familiar... a lot like simple supervised learning
we fix a policy class (neural network, decision tree, ...) and estimate the policy from the demonstration set
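To make this concrete, here is a minimal behavioral-cloning sketch, assuming we already have arrays of expert states and actions and using an MLP classifier as an illustrative policy class (the names `demo_states` / `demo_actions` are assumptions, not from the lecture):

```python
# Minimal behavioral-cloning sketch: treat the expert's (state, action) pairs
# as a supervised dataset and fit a fixed policy class to it.
from sklearn.neural_network import MLPClassifier

def behavioral_cloning(demo_states, demo_actions):
    """Fit a policy pi(s) -> a by plain supervised learning on expert data."""
    # demo_states: array of shape [N, state_dim]; demo_actions: array of shape [N]
    policy = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000)
    policy.fit(demo_states, demo_actions)
    return policy

# usage: act greedily with the cloned policy
# action = policy.predict(state.reshape(1, -1))[0]
```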
let’s go over two notable examples
ALVINN
ALVINN (Autonomous Land Vehicle In a Neural Network) was an early neural network that learned to steer a vehicle directly from sensor input by cloning a human driver. It highlights two major problems of behavioral cloning, both of which lead to compounding errors
→ supervised learning assumes the data are *iid (independent and identically distributed)*, but our dataset $(s_0,a_0,s_1,a_1,...)$ is sequential and correlated, so errors accumulate along the trajectory (compounding roughly quadratically in the horizon in the standard analysis)
→ in addition, if the agent drifts into a state the expert never visited, there is no data telling it how to recover, which piles up even more error
DAGGER : Dataset Aggregation
we assume the expert can be queried for the correct action at the states our own policy visits, even though the agent has no direct access to $\pi^*$ itself: roll out the current policy, have the expert label the visited states, add them to one aggregated dataset, and refit (sketch below)
the catch: repeatedly querying the expert makes this method extremely expensive in expert effort!
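A minimal sketch of the DAGGER loop, assuming hypothetical helpers `rollout` (runs a policy and returns visited states), `expert_action` (queries the expert), and `fit_policy` (supervised learning on the aggregated dataset); none of these are a real API:

```python
# DAGGER sketch: roll out OUR current policy, ask the expert to label the
# states we actually visited, aggregate everything into one dataset, refit.
# `rollout`, `expert_action`, and `fit_policy` are assumed helpers, not a real API.

def dagger(env, expert_action, fit_policy, rollout, n_iters=10):
    states, actions = [], []                      # aggregated dataset D
    policy = None                                 # iteration 0 can just follow the expert
    for _ in range(n_iters):
        # run the current policy (or the expert at first) and record visited states
        visited = rollout(env, policy if policy is not None else expert_action)
        for s in visited:
            states.append(s)
            actions.append(expert_action(s))      # expert labels states from OUR distribution
        policy = fit_policy(states, actions)      # supervised learning on aggregated data
    return policy
```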
Inverse RL
we look at the expert’s demonstrations and endeavor to recover the *reward function* the expert is optimizing
recall linear value function approximation from Lecture 5
we will assume the reward is linear in the state features: $R(s)=w^Tx(s)$
the value function for a policy $\pi$ is then
$$
\begin{aligned}
V^\pi &= E\!\left[\sum_{t=0}^{\infty}\gamma^t R(s_t)\,\Big|\,\pi\right]
= E\!\left[\sum_{t=0}^{\infty}\gamma^t w^T x(s_t)\,\Big|\,\pi\right] \\
&= w^T\,E\!\left[\sum_{t=0}^{\infty}\gamma^t x(s_t)\,\Big|\,\pi\right]
= w^T\mu(\pi)
\end{aligned}
$$
*$\mu(\pi)$ : the expected discounted sum of state features under $\pi$
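In practice $\mu(\pi)$ can be estimated by Monte Carlo from sampled rollouts; a small sketch, assuming a feature map `feature_fn` (the $x(s)$ above) and a list of state trajectories generated by running $\pi$ (both names are assumptions):

```python
# Monte Carlo estimate of mu(pi): average discounted sum of state features over
# sampled trajectories. `feature_fn` is the feature map x(s); `trajectories` is a
# list of state sequences obtained by running pi (both are assumed inputs).
import numpy as np

def feature_expectations(trajectories, feature_fn, gamma=0.99):
    mu = np.zeros_like(np.asarray(feature_fn(trajectories[0][0]), dtype=float))
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * np.asarray(feature_fn(s))   # gamma^t * x(s_t)
    return mu / len(trajectories)                            # average over trajectories

# V^pi is then approximately w @ feature_expectations(trajectories, feature_fn)
```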
this formulation lets us address part of the reward-ambiguity problem: at the very least we can avoid the degenerate answer of setting all rewards to zero ($w=0$), under which every policy would look equally good
Apprenticeship Learning
super similar to inverse RL! Everything up to the $V^\pi$ representation is the same.
$$
V^\pi=w^T\mu(\pi)
$$
before we go further, recall that the expert is assumed to act optimally, so $V^{\pi^*} \ge V^\pi$ for all $\pi$, and thus $w^{*T}\mu(\pi^*) \ge w^{*T}\mu(\pi)$.
We wish to find a reward function (weights $w$) such that the expert policy performs best. And if $V^\pi$ of our policy $\pi$ is sufficiently close to $V^{\pi^*}$ of the expert $\pi^*$, we can guarantee that we have acquired a policy that performs nearly as well as the expert.
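“Sufficiently close” can be made precise with the standard apprenticeship-learning bound (for bounded weights, by Hölder’s inequality):

$$
\|\mu(\pi)-\mu(\pi^*)\|_1 \le \epsilon
\;\Longrightarrow\;
\big|w^T\mu(\pi) - w^T\mu(\pi^*)\big| \le \|w\|_\infty\,\|\mu(\pi)-\mu(\pi^*)\|_1 \le \epsilon
\quad\text{for all } \|w\|_\infty \le 1,
$$

so a policy whose feature expectations match the expert’s to within $\epsilon$ has value within $\epsilon$ of the expert under the (unknown) true reward weights.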
Algorithm (we don’t really use this anymore)
alternate between (1) finding weights $w$ under which the expert maximally outperforms every policy found so far, and (2) solving RL for the reward $R(s)=w^Tx(s)$ to get a new policy to add to that set; repeat until the margin $|w^T\mu(\pi^*)-w^T\mu(\pi)|$ is sufficiently small
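A rough sketch of that loop, where `compute_mu`, `solve_max_margin_w`, and `rl_solver` are placeholder helpers (estimate feature expectations, solve the max-margin problem over $w$, and run an RL solver for a given reward), not real library calls:

```python
# Rough sketch of the apprenticeship-learning loop described above.
# `compute_mu`, `solve_max_margin_w`, and `rl_solver` are placeholder helpers,
# not real library calls.
import numpy as np

def apprenticeship_learning(mu_expert, initial_policy, compute_mu,
                            solve_max_margin_w, rl_solver, eps=1e-3, max_iters=50):
    policies = [initial_policy]
    mus = [compute_mu(initial_policy)]        # feature expectations of each policy so far
    for _ in range(max_iters):
        # find w maximizing the margin between the expert and all policies so far
        w, margin = solve_max_margin_w(mu_expert, mus)
        if margin <= eps:                     # |w^T mu(pi*) - w^T mu(pi)| small enough: stop
            break
        # compute an (approximately) optimal policy for the reward R(s) = w^T x(s)
        pi = rl_solver(lambda x_s: np.dot(w, x_s))
        policies.append(pi)
        mus.append(compute_mu(pi))
    return policies, w
```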
Unresolved...
- there are infinitely many reward functions with the same optimal policy
- even given a reward function, there may be multiple policies that are optimal for it
- so which one do we choose???