Lecture 3
Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 3
Recap: MDP policy evaluation with dynamic programming
Dynamic Programming
- the case where we know the exact model of the MDP (not model-free)
Initialize $V_0^\pi(s) = 0$ for all $s$
for $k = 1$ until convergence
  for all $s \in S$:
    $V_k^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s)) V_{k-1}^\pi(s')$
and we iterate until it converges
→ if $k$ is finite, $V_k^\pi(s)$ is the exact $k$-horizon value of state $s$ under policy $\pi$
→ if $k$ is infinite (we iterate to convergence), $V_k^\pi(s)$ is an approximate value of state $s$ under policy $\pi$
→ the sum over next states weighted by $p(s' \mid s, \pi(s))$ is an expectation over where the state-action pair can lead
“When we know the model, we can compute the immediate reward and an exact estimate of the expected sum of future rewards. Then, instead of expanding $V^\pi$ as a sum of rewards, we can bootstrap and use the current estimate $V_{k-1}^\pi$.”
*Bootstrapping: using an estimate of the same kind to update an estimate (i.e., updating a prediction with a prediction).
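A minimal sketch of this DP policy evaluation in Python, assuming a tabular MDP given as NumPy arrays `P` (transition probabilities) and `R` (expected rewards); the array layout and function name are my own for illustration, not from the lecture:

```python
import numpy as np

def policy_evaluation_dp(P, R, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation with a known model (tabular DP, sketch).

    P[s, a, s'] : transition probabilities, R[s, a] : expected immediate reward,
    policy[s]   : action index taken in state s (deterministic policy).
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                  # V_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            # Bellman backup: bootstrap from the previous estimate V_{k-1}
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < tol:  # stop once the values converge
            return V_new
        V = V_new
```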
Model-Free Policy Evaluation
We do not know the reward model or the dynamics model.
Monte Carlo(MC) Policy Evaluation

$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}[G_t \mid s_t = s]$, where $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
→ we think of all the possible trajectories we can get by following the policy and average their returns
no bootstrapping
can only be applied to episodic MDPs
→ averaging over returns from complete episodes
→ requires episode to terminate
Consider a statistic $\hat{\theta}$ that provides an estimate of a parameter $\theta$.

| | Definition |
|---|---|
| Bias | $\text{Bias}_\theta(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$ |
| Variance | $\text{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]$ |
| MSE | $\text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + \text{Bias}_\theta(\hat{\theta})^2$ |
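For reference, the bias–variance decomposition used in the last row is standard algebra (not specific to this lecture); the cross term vanishes because $\mathbb{E}[\hat\theta - \mathbb{E}[\hat\theta]] = 0$:

```latex
\begin{aligned}
\mathrm{MSE}(\hat\theta)
 &= \mathbb{E}\big[(\hat\theta - \theta)^2\big]
  = \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta] + \mathbb{E}[\hat\theta] - \theta)^2\big] \\
 &= \underbrace{\mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]}_{\mathrm{Var}(\hat\theta)}
  + 2\,\underbrace{\mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big]}_{=\,0}\big(\mathbb{E}[\hat\theta] - \theta\big)
  + \underbrace{\big(\mathbb{E}[\hat\theta] - \theta\big)^2}_{\mathrm{Bias}_\theta(\hat\theta)^2}
\end{aligned}
```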
First-Visit MC algorithm
Initialize $N(s) = 0$, $G(s) = 0$ for all $s \in S$
Loop the following:
- Sample episode $i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, \dots, s_{i,T_i}$ by following policy $\pi$
- Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \dots + \gamma^{T_i - t} r_{i,T_i}$ as the return from time step $t$ onward in episode $i$
- For each state $s$ visited in episode $i$:
  - For the first time step $t$ that $s$ is visited in episode $i$:
    - Increment the counter of total first visits: $N(s) = N(s) + 1$
    - Increment the total return: $G(s) = G(s) + G_{i,t}$
    - Update the estimate: $V^\pi(s) = G(s) / N(s)$

The first-visit MC estimator is an unbiased estimator of the true value $V^\pi(s)$.
→ By the law of large numbers, as $N(s) \to \infty$, the estimate converges to the true value, i.e., the estimator is “consistent”.
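A runnable sketch of first-visit MC, assuming episodes are supplied as lists of `(state, reward)` pairs collected under the policy (this input format and the function name are my own, for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation (sketch)."""
    N = defaultdict(int)          # number of first visits to each state
    G_total = defaultdict(float)  # sum of first-visit returns per state
    V = defaultdict(float)

    for episode in episodes:
        # Compute the return G_t from each time step by scanning backwards.
        returns, G = [], 0.0
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()

        seen = set()
        for (state, _), G_t in zip(episode, returns):
            if state in seen:
                continue          # only the FIRST visit in this episode counts
            seen.add(state)
            N[state] += 1
            G_total[state] += G_t
            V[state] = G_total[state] / N[state]
    return V
```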
Every-Visit MC algorithm
Initialize $N(s) = 0$, $G(s) = 0$ for all $s \in S$
Loop the following:
- Sample episode $i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, \dots, s_{i,T_i}$ by following policy $\pi$
- Define $G_{i,t}$ as the return from time step $t$ onward in episode $i$, as before
- For each state $s$ visited in episode $i$:
  - For every time step $t$ that $s$ is visited in episode $i$:
    - Increment the counter of total visits: $N(s) = N(s) + 1$
    - Increment the total return: $G(s) = G(s) + G_{i,t}$
    - Update the estimate: $V^\pi(s) = G(s) / N(s)$

Note that there is no “first” in this algorithm: every visit to a state within an episode contributes.
The every-visit MC estimator is a biased estimator of $V^\pi(s)$
→ multiple returns for the same state within one episode are correlated, so the data are no longer i.i.d.
It is still a consistent estimator and often has better MSE.
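The code differs from the first-visit sketch above only in dropping the first-visit check; a compact variant under the same assumed episode format:

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=0.9):
    """Every-visit Monte Carlo policy evaluation (sketch)."""
    N, G_total, V = defaultdict(int), defaultdict(float), defaultdict(float)
    for episode in episodes:
        G = 0.0
        # Scan backwards so G always equals the return from the current step.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1          # every visit counts, no first-visit filter
            G_total[state] += G
            V[state] = G_total[state] / N[state]
    return V
```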
Incremental MC algorithm
In this algorithm we can keep a slowly moving running average: after computing the return $G_{i,t}$ for a visited state,
$V^\pi(s) = V^\pi(s) + \alpha (G_{i,t} - V^\pi(s))$
→ with $\alpha = 1/N(s)$ this is identical to every-visit MC; with a larger (e.g., constant) $\alpha$, older episodes are gradually forgotten, which helps in non-stationary domains.
Non-stationary domains → the dynamics model changes over time.
Advanced topic, but incredibly important in the real world.
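A one-function sketch of the incremental update; the dict-based interface is an illustrative assumption, not the lecture's notation:

```python
def incremental_mc_update(V, state, G_t, alpha):
    """One incremental MC update after observing return G_t from `state`.

    V     : dict mapping state -> current value estimate
    alpha : step size; 1/N(state) recovers the running average (every-visit MC),
            while a constant alpha weights recent episodes more, which is
            useful when the domain is non-stationary.
    """
    V[state] = V[state] + alpha * (G_t - V[state])
    return V
```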
MC estimate Example
Q1. Let $\gamma = 1$ and suppose the only sampled episode under the policy, in a 7-state MDP, is $(s_3, a_1, 0, s_2, a_1, 0, s_2, a_1, 0, s_1, a_1, 1, \text{terminal})$. What is the first-visit MC estimate of $V$ for each state?
Ans. $V(s_1) \sim V(s_3) = 1$, $V(s_4) \sim V(s_7) = 0$ → we don't know anything about states $s_4 \sim s_7$ without experiencing them.
Q2. Let $\gamma = 0.9$. Compare the first-visit and every-visit MC estimates of $V(s_2)$.
Ans. first-visit: 0.81, every-visit: 0.855
→ For first-visit, the episode follows the route $s_3 \to s_2 \to s_2 \to s_1$ and only the first encounter of $s_2$ is counted, so $G(s_2) = 0 + 0.9 \cdot 0 + 0.9^2 \cdot 1 = 0.81$,
therefore $V(s_2) = 0.81 / 1 = 0.81$.
→ For every-visit, both encounters of $s_2$ along the route are counted, so $G(s_2) = G_1(s_2) + G_2(s_2) = (0 + 0.9 \cdot 0 + 0.9^2 \cdot 1) + (0 + 0.9 \cdot 1) = 1.71$,
therefore $V(s_2) = 1.71 / 2 = 0.855$.
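A quick numerical check of Q2 in plain Python, using the same `(state, reward)` episode format assumed in the MC sketches above:

```python
gamma = 0.9
# Episode s3 -> s2 -> s2 -> s1 (terminal), with reward 1 received at s1.
episode = [("s3", 0.0), ("s2", 0.0), ("s2", 0.0), ("s1", 1.0)]

# Return G_t from each time step, computed by a backward scan.
returns, G = [], 0.0
for _, r in reversed(episode):
    G = r + gamma * G
    returns.append(G)
returns.reverse()                              # ≈ [0.729, 0.81, 0.9, 1.0]

first_visit = returns[1]                       # first visit to s2 (t = 1)  → ≈ 0.81
every_visit = (returns[1] + returns[2]) / 2    # average over both s2 visits → ≈ 0.855
print(first_visit, every_visit)
```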
What Monte-Carlo Evaluation is doing

“approximate averaging over all possible futures by summing up one trajectory through the tree”
→ sample a return, add up all the rewards along the way
Key Limitations
Generally high variance estimator
→ Reducing variance can require a lot of data
→ Impractical when lacking data
Requires episodic settings
→ Episode must terminate before we can use it to update
Temporal Difference(TD) Policy Evaluation
Combination of the Monte-Carlo and dynamic programming methods
→ takes the bootstrapping aspect from dynamic programming
→ takes sampling from Monte-Carlo
Immediately updates the estimate of $V^\pi$ after each $(s, a, r, s')$ tuple
TD Learning estimation
$V^\pi(s_t) = V^\pi(s_t) + \alpha \big( [r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s_t) \big)$
→ the target $r_t + \gamma V^\pi(s_{t+1})$ stands in for the return: the immediate reward plus the discounted estimate of the sum of future rewards
Now let us compare the two update rules:

| | Update |
|---|---|
| Incremental every-visit MC | $V^\pi(s_t) = V^\pi(s_t) + \alpha (G_{i,t} - V^\pi(s_t))$ |
| TD | $V^\pi(s_t) = V^\pi(s_t) + \alpha ([r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s_t))$ |

Instead of using the full return $G_{i,t}$, TD uses the TD target $r_t + \gamma V^\pi(s_{t+1})$.
TD error: $\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$
The TD algorithm can immediately update the estimate after each $(s, a, r, s')$ tuple, without waiting for the episode to terminate.
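A runnable sketch of this tabular TD(0) update, assuming experience arrives as `(s, r, s_next)` tuples with `s_next = None` at termination (the tuple format and names are illustrative):

```python
from collections import defaultdict

def td0_policy_evaluation(transitions, alpha=0.1, gamma=0.9):
    """TD(0) policy evaluation (sketch): one update per observed tuple."""
    V = defaultdict(float)                    # V(s) initialized to 0
    for s, r, s_next in transitions:
        # TD target bootstraps from the current estimate of the next state.
        target = r + gamma * (V[s_next] if s_next is not None else 0.0)
        V[s] += alpha * (target - V[s])       # move V(s) toward the TD target
    return V
```

With `alpha=1, gamma=1` and the four tuples of the trajectory in the example below, this reproduces the answer $[1\ 0\ 0\ 0\ 0\ 0\ 0]$.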
TD algorithm Example
Q. For the same trajectory $(s_3, a_1, 0, s_2, a_1, 0, s_2, a_1, 0, s_1, a_1, 1, \text{terminal})$, with $\alpha = 1$, $\gamma = 1$ and all values initialized to 0, what is the TD estimate of $V$?
Ans. $[1\ 0\ 0\ 0\ 0\ 0\ 0]$; we can process the tuples sampled from the trajectory one by one:
→ $(s_3, a_1, 0, s_2)$: $V(s_3) = 0 + V(s_2) = 0$
→ $(s_2, a_1, 0, s_2)$: $V(s_2) = 0 + V(s_2) = 0$
→ $(s_2, a_1, 0, s_1)$: $V(s_2) = 0 + V(s_1) = 0$
→ $(s_1, a_1, 1, \text{terminal})$: $V(s_1) = 1 + 0 = 1$
What Temporal-Difference Evaluation is doing

→ for state $s_t$, sample one next state $s_{t+1}$ and bootstrap: back up $r_t + \gamma V^\pi(s_{t+1})$ using the current estimate, instead of expanding the whole tree of possible futures
Evaluating Policy Evaluation Algorithms
Properties
| Property | Dynamic Programming | Monte-Carlo | Temporal-Difference |
|---|---|---|---|
| Usable without a model of the current domain | X | O | O |
| Handles non-episodic domains | O | X | O |
| Handles non-Markovian domains | X | O | X |
| Converges to true value in the limit (in Markovian domains) | O | O | O |
| Unbiased estimate of value | X | O | X |
Bias/Variance characteristics of the algorithms
→ the MC return $G_t$ is an unbiased but high-variance estimate of $V^\pi(s_t)$, since it sums many random rewards along a trajectory
→ the TD target $r_t + \gamma V^\pi(s_{t+1})$ is a biased estimate (it bootstraps from the current $V^\pi$) but has much lower variance, since it depends on only one random action, transition, and reward
Batch MC and TD calculation
Given a fixed batch of episodes, we might want to spend more computation on the same data to get a better estimate, i.e., better “sample efficiency”: repeatedly apply MC or TD updates over the batch until the estimates converge.

'ML Study > Stanford CS234: Reinforcement Learning' 카테고리의 다른 글
Stanford CS234 Lecture 6 (0) | 2022.08.09 |
---|---|
Stanford CS234 Lecture 5 (0) | 2022.08.08 |
Stanford CS234 Lecture 4 (0) | 2022.08.05 |
Stanford CS234 Lecture2 (0) | 2022.08.05 |
Stanford CS234 Lecture 1 (2) | 2022.08.04 |
댓글