
Stanford CS234 Lecture 3

by 누워있는말티즈 2022. 8. 5.

Lecture 3

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 3

Recap: MDP policy evaluation with dynamic programming

Dynamic Programming

  • the case where we know the exact model (not model-free)

Initialize $V_0(s) = 0$ for all $s$

for $k = 1$ until convergence

for all $s$ in $S$

$V_k^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) \, V_{k-1}^\pi(s')$

and we iterate until it converges → $\|V_k^\pi - V_{k-1}^\pi\|_\infty < \epsilon$

  • if $k$ is finite

    $V_k^\pi(s)$ is the exact $k$-horizon value of state $s$ under policy $\pi$

  • if $k$ is infinite

    $V_k^\pi(s)$ is an approximate value of state $s$ under policy $\pi$

    $V_k^\pi(s) \approx E_\pi[r_t + \gamma V_{k-1} \mid s_t = s]$

    the state resulting from an action is handled as an "expectation" over the transition distribution of that state-action pair.

    "when we know the model, we can compute the immediate reward and the exact expected sum of future values. Then, instead of expanding $V_{k-1}$ as a sum of rewards, we can substitute, i.e. bootstrap, and use the current estimate $V_{k-1}$"

    *Bootstrapping: using estimates to update an estimate of the same kind (updating a prediction by plugging in other predictions)
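As a concrete illustration, here is a minimal NumPy sketch of iterative policy evaluation with a known model. The array names `P`, `R`, and `policy` are assumptions for this post, not from the lecture: `P[s, a, s']` holds transition probabilities, `R[s, a]` the expected immediate reward, and `policy[s]` the action of a deterministic policy π.

```python
import numpy as np

def policy_evaluation_dp(P, R, policy, gamma=0.9, eps=1e-6):
    """Iterative policy evaluation when the model (P, R) is known.

    P[s, a, s'] : transition probabilities
    R[s, a]     : expected immediate reward
    policy[s]   : action chosen by the deterministic policy pi in state s
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)                    # V_0(s) = 0 for all s
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            # Bellman backup: r(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V_{k-1}(s')
            V_new[s] = R[s, a] + gamma * P[s, a] @ V
        if np.max(np.abs(V_new - V)) < eps:   # ||V_k - V_{k-1}||_inf < eps
            return V_new
        V = V_new
```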


Model-Free Policy Evaluation

we do not know the reward & dynamics model

Monte Carlo (MC) Policy Evaluation

→ we sample trajectories by following the policy and average the returns observed from them

  • no bootstrapping

  • can only be applied to episodic MDPs

    → averaging over returns from complete episodes

    → requires episode to terminate

Consider a statistic $\hat{\theta}$ that provides an estimate of $\theta$ and is a function of observed data $x$: $\hat{\theta} = f(x)$

Definitions

  • Bias: $\text{Bias}_\theta(\hat{\theta}) = E_{x|\theta}[\hat{\theta}] - \theta$

  • Variance: $\text{Var}(\hat{\theta}) = E_{x|\theta}[(\hat{\theta} - E[\hat{\theta}])^2]$

  • MSE: $\text{MSE}(\hat{\theta}) = E_{x|\theta}[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + \text{Bias}_\theta(\hat{\theta})^2$

First-Visit MC algorithm

Initialize $N(s) = 0$, $G(s) = 0$ for all $s$, where $N(s)$ is the number of times state $s$ has been visited

Loop the following:

  • Sample episode $i = (s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, \ldots, s_{i,T_i})$

  • Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots$ as the return from time step $t$ onward in episode $i$

  • For each state $s$ visited for the first time at time $t$ in episode $i$: increment $N(s)$, add $G_{i,t}$ to $G(s)$, and set $V^\pi(s) = G(s) / N(s)$

The $V^\pi(s)$ obtained this way is an 'estimate' and thus might be wrong. So how do we evaluate it? (A code sketch covering both the first-visit and every-visit variants follows the next subsection.)

  • the $V^\pi$ estimator in first-visit MC is an unbiased estimator of $E_\pi[G_t \mid s_t = s]$
  • by the law of large numbers, as $N(s) \to \infty$, $V^\pi(s)$ converges to $E_\pi[G_t \mid s_t = s]$ ("consistent")

Every-Visit MC algorithm

Initialize $N(s) = 0$, $G(s) = 0$ for all $s$, where $N(s)$ is the number of times state $s$ has been visited

Loop the following:

  • Sample episode $i$ and compute the returns $G_{i,t}$ exactly as in first-visit MC

  • For every time step $t$ at which state $s$ is visited in episode $i$: increment $N(s)$, add $G_{i,t}$ to $G(s)$, and set $V^\pi(s) = G(s) / N(s)$

Note there is no "first" in this algorithm: every visit to a state contributes to its estimate.

  • the $V^\pi$ estimator in every-visit MC is a biased estimator of $E_\pi[G_t \mid s_t = s]$

    → multiple visits to a state within the same episode are correlated, so the data are no longer 'i.i.d.'

  • still a consistent estimator, and it often has better MSE
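As a rough sketch (not the lecture's pseudocode), the Python below covers both variants. It assumes each episode is given as a list of `(state, reward)` pairs collected by following π until termination; the `first_visit` flag switches between first-visit and every-visit averaging.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9, first_visit=True):
    """Monte-Carlo policy evaluation from complete episodes.

    episodes : iterable of episodes, each a list of (state, reward) pairs
               obtained by following the policy pi until termination.
    Returns a dict mapping state -> G(s) / N(s), the average observed return.
    """
    N = defaultdict(int)        # number of (first) visits to each state
    G_sum = defaultdict(float)  # cumulative return credited to each state
    for episode in episodes:
        # Compute the return G_t for every time step by scanning backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()       # back to chronological order

        seen = set()
        for state, G_t in returns:
            if first_visit and state in seen:
                continue        # first-visit: only the first occurrence counts
            seen.add(state)
            N[state] += 1
            G_sum[state] += G_t
    return {s: G_sum[s] / N[s] for s in N}
```

Reading the worked example below as the single episode s3 → s2 → s2 → s1 with rewards 0, 0, 0, 1 (my reconstruction of the trajectory) and γ = 0.9, `first_visit=True` gives V(s2) = 0.81 and `first_visit=False` gives 0.855, matching the answers there.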

Incremental MC algorithm

we slowly move a running average toward each newly observed return: $V^\pi(s) \leftarrow V^\pi(s) + \alpha \, (G_{i,t} - V^\pi(s))$. With $\alpha = 1/N(s)$ this is identical to every-visit MC; a constant $\alpha$ weights recent episodes more heavily.

Non-stationary domains → the dynamics model changes over time, so forgetting old episodes (constant $\alpha$) helps.

An advanced topic, but incredibly important in the real world.
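A one-line sketch of that update (the function and argument names are mine, not the lecture's):

```python
def incremental_mc_update(V, state, G_t, alpha):
    """One incremental MC update: V(s) <- V(s) + alpha * (G_t - V(s)).

    With alpha = 1 / N(state) this reproduces the plain running average of
    returns (every-visit MC); a constant alpha forgets old episodes
    geometrically, which is useful when the domain is non-stationary.
    """
    V[state] += alpha * (G_t - V[state])
    return V
```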

  • MC estimate Example

    Q1. Let $\gamma = 1$. What is the first-visit MC estimate of $V$ for each state?

    Ans. $V(s_1) = V(s_2) = V(s_3) = 1$, $V(s_4) = \cdots = V(s_7) = 0$ → we know nothing about states 4~7 without experiencing them

    Q2. Let $\gamma = 0.9$. Compare the first-visit and every-visit MC estimates of $s_2$.

    Ans. first-visit: 0.81, every-visit: 0.855

    → For first-visit, along the route $s_2 \to s_2 \to s_1$ only the first encounter of $s_2$ is counted, so

    $G(s_2) = 0 + 0.9 \cdot 0 + 0.9^2 \cdot 1 = 0.81$

    and therefore $V^\pi(s_2) = 0.81 / 1 = 0.81$.

    → For every-visit, along the route $s_2 \to s_2 \to s_1$ both encounters of $s_2$ are counted, so

    $G(s_2) = G_1(s_2) + G_2(s_2) = (0 + 0.9 \cdot 0 + 0.9^2 \cdot 1) + (0 + 0.9 \cdot 1) = 1.71$

    and therefore $V^\pi(s_2) = 1.71 / 2 = 0.855$.

What Monte-Carlo Evaluation is doing

"approximate the average over all possible futures by summing the rewards along one sampled trajectory through the tree"

→ sample a return, add up all the rewards along the way

Key Limitations

  • Generally a high-variance estimator

    → Reducing the variance can require a lot of data

    → Impractical when data is limited

  • Requires an episodic setting

    → An episode must terminate before its return can be used to update $V$


Temporal Difference(TD) Policy Evaluation

Combination of the Monte-Carlo and dynamic programming methods

→ takes the bootstrapping aspect from dynamic programming

→ takes the sampling aspect from Monte-Carlo

Immediately updates the estimate of $V$ after each $(s, a, r, s')$ tuple → no need to wait for the episode to end

TD Learning estimation

→ the target is the immediate reward plus the discounted estimate of future rewards: $r_t + \gamma V^\pi(s_{t+1})$

Now let us compare the $V^\pi(s)$ update of incremental every-visit MC with that of TD

Incremental every-visit MC: $V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \, (G_{i,t} - V^\pi(s_t))$
TD: $V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \, ([r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s_t))$

instead of using $G_{i,t}$, we bootstrap with the immediate reward plus the discounted estimate of future reward.

TD Error

$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$

The TD algorithm can immediately update the estimate after each $(s, a, r, s')$ tuple

  • TD algorithm Example

    Ans. [1 0 0 0 0 0 0]; we sample $(s, a, r, s')$ tuples from the trajectory one at a time and apply the TD update to each

What Temporal-Difference Evaluation is doing

$V^\pi(s_t) = V^\pi(s_t) + \alpha \, ([r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s_t))$

for state $s_t$, we use the sampled next state $s_{t+1}$ to approximate the expectation over the next-state distribution, and then we bootstrap by plugging in the previous estimate of $V^\pi$.
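A rough TD(0) sketch under the same kind of assumptions as the MC code above: `experience` is a stream of `(s, a, r, s_next, done)` tuples collected while following π, and all names are illustrative rather than taken from the lecture.

```python
from collections import defaultdict

def td0_policy_evaluation(experience, gamma=0.9, alpha=0.1):
    """TD(0) policy evaluation from a stream of (s, a, r, s_next, done) tuples.

    Unlike MC, the estimate is updated after every single transition,
    so an episode does not have to terminate before learning happens.
    """
    V = defaultdict(float)                             # V(s) initialized to 0
    for s, a, r, s_next, done in experience:
        target = r if done else r + gamma * V[s_next]  # TD target
        td_error = target - V[s]                       # delta_t
        V[s] += alpha * td_error                       # bootstrap update
    return dict(V)
```

If α = 1 (an assumption; the example above does not show α here), a single pass over the tuples of the example trajectory leaves every state at 0 except s1, which jumps to 1, consistent with the [1 0 0 0 0 0 0] answer.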


Evaluating Policy Evaluation Algorithms

Properties

| Property | Dynamic Programming | Monte-Carlo | Temporal-Difference |
|---|---|---|---|
| Usable without a model of the current domain | X | O | O |
| Handles non-episodic domains | O | X | O |
| Handles non-Markovian domains | X | O | X |
| Converges to the true value in the limit (in Markovian domains) | O | O | O |
| Unbiased estimate of the value | X | O | X |

Bias/variance characteristics of the algorithms → MC uses full sampled returns, so it is unbiased but high-variance; TD bootstraps off its own current estimate, so its target has lower variance but is biased.

Batch MC and TD calculation

Given a fixed batch of episodes, we may want to spend more computation on the same data to get a better estimate, i.e., better "sample efficiency". Replaying the batch repeatedly, MC converges to the values that minimize mean-squared error on the observed returns, while TD(0) converges to the values of the maximum-likelihood MDP model built from the batch (the certainty-equivalence estimate).

