
Stanford CS234 Lecture 3

by 누워있는말티즈 2022. 8. 5.

Lecture 3

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 3

Recap: MDP policy evaluation with dynamic programming

Dynamic Programming

  • the case where we know the exact model (not model-free)

Initialize $V_0(s) = 0$ for all $s$

for $k = 1$ until convergence

for all $s$ in $S$

$V_k^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) \, V_{k-1}^\pi(s')$

and we iterate until it converges → $\|V_k^\pi - V_{k-1}^\pi\|_\infty < \epsilon$

  • if $k$ is finite

    $V_k^\pi(s)$ is the exact $k$-horizon value of state $s$ under policy $\pi$

  • if $k$ is infinite

    $V_k^\pi(s)$ is an approximate value of state $s$ under policy $\pi$

    $V_k^\pi(s) \approx E_\pi[r_t + \gamma V_{k-1} \mid s_t = s]$

    the state resulting from an action is handled as an "expectation" over the transition distribution of that state-action pair.

    "when we know the model, we can compute the immediate reward and the exact expected sum of future values. Then, instead of expanding $V_{k-1}$ as a sum of rewards, we can substitute, i.e. bootstrap, and use the current estimate $V_{k-1}$"

    *Bootstrapping: using estimates to update an estimate of the same kind (updating a prediction by plugging in other predictions)
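As a concrete illustration, here is a minimal NumPy sketch of iterative policy evaluation with a known model. The array names `P`, `R`, and `policy` are assumptions for this post, not from the lecture: `P[s, a, s']` holds transition probabilities, `R[s, a]` the expected immediate reward, and `policy[s]` the action of a deterministic policy π.

```python
import numpy as np

def policy_evaluation_dp(P, R, policy, gamma=0.9, eps=1e-6):
    """Iterative policy evaluation when the model (P, R) is known.

    P[s, a, s'] : transition probabilities
    R[s, a]     : expected immediate reward
    policy[s]   : action chosen by the deterministic policy pi in state s
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                    # V_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            # Bellman backup: r(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V_{k-1}(s')
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < eps:   # ||V_k - V_{k-1}||_inf < eps
            return V_new
        V = V_new
```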


Model-Free Policy Evaluation

we do not know the reward & dynamics model

Monte Carlo (MC) Policy Evaluation

→ we sample trajectories by following the policy and average the returns observed from them

  • no bootstrapping

  • can only be applied to episodic MDPs

    → averaging over returns from complete episodes

    → requires episode to terminate

Consider a statistic $\hat{\theta}$ that provides an estimate of $\theta$ and is a function of observed data $x$: $\hat{\theta} = f(x)$

Definitions

  • Bias: $\text{Bias}_\theta(\hat{\theta}) = E_{x|\theta}[\hat{\theta}] - \theta$

  • Variance: $\text{Var}(\hat{\theta}) = E_{x|\theta}[(\hat{\theta} - E[\hat{\theta}])^2]$

  • MSE: $\text{MSE}(\hat{\theta}) = E_{x|\theta}[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + \text{Bias}_\theta(\hat{\theta})^2$

First-Visit MC algorithm

Initialize $N(s) = 0$, $G(s) = 0$ for all $s$, where $N(s)$ is the number of times state $s$ has been visited

Loop the following:

  • Sample episode $i = (s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, \ldots, s_{i,T_i})$

  • Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots$ as the return from time step $t$ onward in episode $i$

  • For each state $s$ visited for the first time at time $t$ in episode $i$: increment $N(s)$, add $G_{i,t}$ to $G(s)$, and set $V^\pi(s) = G(s) / N(s)$

The $V^\pi(s)$ obtained this way is an 'estimate' and thus might be wrong. So how do we evaluate it? (A code sketch covering both the first-visit and every-visit variants follows the next subsection.)

  • the $V^\pi$ estimator in first-visit MC is an unbiased estimator of $E_\pi[G_t \mid s_t = s]$
  • by the law of large numbers, as $N(s) \to \infty$, $V^\pi(s)$ converges to $E_\pi[G_t \mid s_t = s]$ ("consistent")

Every-Visit MC algorithm

Initialize $N(s) = 0$, $G(s) = 0$ for all $s$, where $N(s)$ is the number of times state $s$ has been visited

Loop the following:

  • Sample episode $i$ and compute the returns $G_{i,t}$ exactly as in first-visit MC

  • For every time step $t$ at which state $s$ is visited in episode $i$: increment $N(s)$, add $G_{i,t}$ to $G(s)$, and set $V^\pi(s) = G(s) / N(s)$

Note there is no "first" in this algorithm: every visit to a state contributes to its estimate.

  • the $V^\pi$ estimator in every-visit MC is a biased estimator of $E_\pi[G_t \mid s_t = s]$

    → multiple visits to a state within the same episode are correlated, so the data are no longer 'i.i.d.'

  • still a consistent estimator, and it often has better MSE
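As a rough sketch (not the lecture's pseudocode), the Python below covers both variants. It assumes each episode is given as a list of `(state, reward)` pairs collected by following π until termination; the `first_visit` flag switches between first-visit and every-visit averaging.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9, first_visit=True):
    """Monte-Carlo policy evaluation from complete episodes.

    episodes : iterable of episodes, each a list of (state, reward) pairs
               obtained by following the policy pi until termination.
    Returns a dict mapping state -> G(s) / N(s), the average observed return.
    """
    N = defaultdict(int)        # number of (first) visits to each state
    G_sum = defaultdict(float)  # cumulative return credited to each state
    for episode in episodes:
        # Compute the return G_t for every time step by scanning backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()       # back to chronological order

        seen = set()
        for state, G_t in returns:
            if first_visit and state in seen:
                continue        # first-visit: only the first occurrence counts
            seen.add(state)
            N[state] += 1
            G_sum[state] += G_t
    return {s: G_sum[s] / N[s] for s in N}
```

Reading the worked example below as the single episode s3 → s2 → s2 → s1 with rewards 0, 0, 0, 1 (my reconstruction of the trajectory) and γ = 0.9, `first_visit=True` gives V(s2) = 0.81 and `first_visit=False` gives 0.855, matching the answers there.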

Incremental MC algorithm

we slowly move a running average toward each newly observed return: $V^\pi(s) \leftarrow V^\pi(s) + \alpha \, (G_{i,t} - V^\pi(s))$. With $\alpha = 1/N(s)$ this is identical to every-visit MC; a constant $\alpha$ weights recent episodes more heavily.

Non-stationary domains → the dynamics model changes over time, so forgetting old episodes (constant $\alpha$) helps.

An advanced topic, but incredibly important in the real world.
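A one-line sketch of that update (the function and argument names are mine, not the lecture's):

```python
def incremental_mc_update(V, state, G_t, alpha):
    """One incremental MC update: V(s) <- V(s) + alpha * (G_t - V(s)).

    With alpha = 1 / N(state) this reproduces the plain running average of
    returns (every-visit MC); a constant alpha forgets old episodes
    geometrically, which is useful when the domain is non-stationary.
    """
    V[state] += alpha * (G_t - V[state])
    return V
```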

  • MC estimate Example

    Q1. Let $\gamma = 1$. What is the first-visit MC estimate of $V$ for each state?

    Ans. $V(s_1) = V(s_2) = V(s_3) = 1$, $V(s_4) = \cdots = V(s_7) = 0$ → we know nothing about states 4~7 without experiencing them

    Q2. Let $\gamma = 0.9$. Compare the first-visit and every-visit MC estimates of $s_2$.

    Ans. first-visit: 0.81, every-visit: 0.855

    → For first-visit, along the route $s_2 \to s_2 \to s_1$ only the first encounter of $s_2$ is counted, so

    $G(s_2) = 0 + 0.9 \cdot 0 + 0.9^2 \cdot 1 = 0.81$

    and therefore $V^\pi(s_2) = 0.81 / 1 = 0.81$.

    → For every-visit, along the route $s_2 \to s_2 \to s_1$ both encounters of $s_2$ are counted, so

    $G(s_2) = G_1(s_2) + G_2(s_2) = (0 + 0.9 \cdot 0 + 0.9^2 \cdot 1) + (0 + 0.9 \cdot 1) = 1.71$

    and therefore $V^\pi(s_2) = 1.71 / 2 = 0.855$.

What Monte-Carlo Evaluation is doing

"approximate the average over all possible futures by summing the rewards along one sampled trajectory through the tree"

→ sample a return, add up all the rewards along the way

Key Limitations

  • Generally a high-variance estimator

    → Reducing the variance can require a lot of data

    → Impractical when data is limited

  • Requires an episodic setting

    → An episode must terminate before its return can be used to update $V$


Temporal Difference(TD) Policy Evaluation

Combination of the Monte-Carlo and dynamic programming methods

→ takes the bootstrapping aspect from dynamic programming

→ takes the sampling aspect from Monte-Carlo

Immediately updates the estimate of $V$ after each $(s, a, r, s')$ tuple → no need to wait for the episode to end

TD Learning estimation

→ the target is the immediate reward plus the discounted estimate of future rewards: $r_t + \gamma V^\pi(s_{t+1})$

Now let us compare the $V^\pi(s)$ update of incremental every-visit MC with that of TD

Incremental every-visit MC: $V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \, (G_{i,t} - V^\pi(s_t))$
TD: $V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \, ([r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s_t))$

instead of using $G_{i,t}$, we bootstrap with the immediate reward plus the discounted estimate of future reward.

TD Error

$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$

The TD algorithm can immediately update the estimate after each $(s, a, r, s')$ tuple

  • TD algorithm Example

    Ans. [1 0 0 0 0 0 0]; we sample $(s, a, r, s')$ tuples from the trajectory one at a time and apply the TD update to each

What Temporal-Difference Evaluation is doing

$V^\pi(s_t) = V^\pi(s_t) + \alpha \, ([r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s_t))$

for state $s_t$, we use the sampled next state $s_{t+1}$ to approximate the expectation over the next-state distribution, and then we bootstrap by plugging in the previous estimate of $V^\pi$.
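A rough TD(0) sketch under the same kind of assumptions as the MC code above: `experience` is a stream of `(s, a, r, s_next, done)` tuples collected while following π, and all names are illustrative rather than taken from the lecture.

```python
from collections import defaultdict

def td0_policy_evaluation(experience, gamma=0.9, alpha=0.1):
    """TD(0) policy evaluation from a stream of (s, a, r, s_next, done) tuples.

    Unlike MC, the estimate is updated after every single transition,
    so an episode does not have to terminate before learning happens.
    """
    V = defaultdict(float)                             # V(s) initialized to 0
    for s, a, r, s_next, done in experience:
        target = r if done else r + gamma * V[s_next]  # TD target
        td_error = target - V[s]                       # delta_t
        V[s] += alpha * td_error                       # bootstrap update
    return dict(V)
```

If α = 1 (an assumption; the example above does not show α here), a single pass over the tuples of the example trajectory leaves every state at 0 except s1, which jumps to 1, consistent with the [1 0 0 0 0 0 0] answer.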


Evaluating Policy Evaluation Algorithms

Properties

| Property | Dynamic Programming | Monte-Carlo | Temporal-Difference |
|---|---|---|---|
| Usable without a model of the current domain | X | O | O |
| Handles non-episodic domains | O | X | O |
| Handles non-Markovian domains | X | O | X |
| Converges to the true value in the limit (in Markovian domains) | O | O | O |
| Unbiased estimate of the value | X | O | X |

Bias/variance characteristics of the algorithms → MC uses full sampled returns, so it is unbiased but high-variance; TD bootstraps off its own current estimate, so its target has lower variance but is biased.

Batch MC and TD calculation

Given a fixed batch of episodes, we may want to spend more computation on the same data to get a better estimate, i.e., better "sample efficiency". Replaying the batch repeatedly, MC converges to the values that minimize mean-squared error on the observed returns, while TD(0) converges to the values of the maximum-likelihood MDP model built from the batch (the certainty-equivalence estimate).

