
Stanford CS234 Lecture 2

by 누워있는말티즈 2022. 8. 5.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 2

Given the model of the world

Markov Property → a stochastic process evolving over time in which the future is independent of the past given the present state (e.g., whether or not I invest in stocks, the stock market changes)

Markov Chain

  • sequence of random states with Markov property
  • no rewards, no actions

Let $S$ be the set of states ($s \in S$) and $P$ a transition model that specifies $P(s_{t+1}=s' | s_t=s)$.

For a finite number ($N$) of states, we get a transition matrix $P$:

$$P = \begin{pmatrix} P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N|s_1) \\ P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N) \end{pmatrix}$$

Example discussed in the previous lecture (we set aside rewards and actions for easier understanding):

At state $s_1$ we have a 0.4 chance of transitioning to $s_2$ ($P(s_2|s_1) = 0.4$) and a 0.6 probability of staying in $s_1$ ($P(s_1|s_1) = 0.6$). These probabilities form the matrix $P$ above.

Let's say we start at $s_1$. We can compute the distribution over the agent's next state by multiplying the row vector $[1, 0, 0, 0, 0, 0, 0]$ by $P$ above, which gives $[0.6, 0.4, 0, 0, 0, 0, 0]$.
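
As a quick numerical check, here is a minimal numpy sketch; only the first row of $P$ is given in the text, so the remaining rows are placeholder values:

```python
import numpy as np

# Transition matrix for the 7-state chain. Only the first row (0.6 stay in s1,
# 0.4 move to s2) comes from the text; the other rows are placeholders
# (each state just stays where it is) so the matrix is well-formed.
P = np.eye(7)
P[0] = [0.6, 0.4, 0, 0, 0, 0, 0]

# Start deterministically in s1: a one-hot row vector over the states.
p0 = np.array([1, 0, 0, 0, 0, 0, 0], dtype=float)

# One step of the chain: the next-state distribution is p0 @ P.
print(p0 @ P)  # [0.6 0.4 0. 0. 0. 0. 0.]
```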


Markov Reward Process(MRP)

  • Markov Chain + rewards
  • no actions

For a finite number ($N$) of states:

$S$ : set of states ($s \in S$)

$P$ : transition model that specifies $P(s_{t+1}=s' | s_t=s)$

$R$ : reward function $R(s_t=s) = E[r_t | s_t=s]$

Horizon

  • number of time steps in each episode
  • can be either finite or infinite

Return ($G_t$)

  • Discounted sum of rewards from time step t to horizon

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \dots$$

State Value Function ($V(s)$)

  • ‘Expected’ return from starting in state $s$ → the average of all returns starting at state $s$
  • $V$ is defined for every state $s$, so it can be written as a vector over states.

$$V(s_t=s) = E[G_t | s_t=s] = E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \dots | s_t=s]$$

If the process is deterministic, Return = State Value, since there is only a single possible trajectory from each state.

Example

Return for a sample episode, with $\gamma = 0.5$:

Episode $s_4, s_5, s_6, s_7$, return: $0 + 0.5 \cdot 0 + 0.5^2 \cdot 0 + 0.5^3 \cdot 10 = 1.25$
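
The same number falls out of a couple of lines of Python (assuming, as the example implies, rewards of 0 in $s_4$–$s_6$ and +10 in $s_7$):

```python
# Discounted return of the episode s4 -> s5 -> s6 -> s7 with gamma = 0.5.
gamma = 0.5
rewards = [0, 0, 0, 10]          # reward received at each step of the episode
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)                         # 1.25
```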

Computing the Value of Markov Reward Process

  • can be estimated by simulation (generate many episodes and average the returns)
  • MRP value function satisfies

$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s) V(s')$$

  • Proof: split the return into the immediate reward and the discounted return from the next state, $V(s) = E[r_t + \gamma G_{t+1} | s_t = s] = R(s) + \gamma \sum_{s' \in S} P(s'|s) V(s')$, where the second step uses the Markov property.

Matrix form of Bellman Equation for MRP

  • analytic method of calculation

For a finite MRP, we know $R(s)$ and $P(s'|s)$, so we can express $V(s)$ via the matrix equation $V = R + \gamma P V$, which has the analytic solution $V = (I - \gamma P)^{-1} R$.
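
A minimal numpy sketch of the analytic solution; the $P$ and $R$ below are illustrative placeholders (reward +1 in $s_1$, +10 in $s_7$, consistent with the practice problem later in these notes):

```python
import numpy as np

# Analytic MRP value: solve V = R + gamma * P V, i.e. (I - gamma*P) V = R.
gamma = 0.5
P = np.eye(7)
P[0] = [0.6, 0.4, 0, 0, 0, 0, 0]                   # placeholder transition matrix
R = np.array([1, 0, 0, 0, 0, 0, 10], dtype=float)  # placeholder rewards

V = np.linalg.solve(np.eye(7) - gamma * P, R)
print(V)
```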

Iterative Algorithm for Computing Value of a MRP

Dynamic Programming

  • Initialize $V_0(s) = 0$ for all $s$

for $k = 1$ until convergence

for all $s$ in $S$

$$V_k(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s) V_{k-1}(s')$$

** computationally cheaper than the analytic method: each iteration costs $O(|S|^2)$, versus $O(|S|^3)$ for the matrix inversion
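
A sketch of the same loop in numpy (reusing the placeholder $P$ and $R$ from the analytic example above); each sweep is one matrix-vector product:

```python
import numpy as np

def mrp_value_dp(P, R, gamma, tol=1e-8):
    """Iteratively apply V_k = R + gamma * P @ V_{k-1}, starting from V_0 = 0,
    until the value function stops changing."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

P = np.eye(7); P[0] = [0.6, 0.4, 0, 0, 0, 0, 0]    # same placeholders as above
R = np.array([1, 0, 0, 0, 0, 0, 10], dtype=float)
print(mrp_value_dp(P, R, gamma=0.5))               # matches the analytic solution
```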


Markov Decision Process(MDP)

  • Markov Reward Process + actions

For a finite number ($N$) of states:

$S$ : set of states ($s \in S$)

$A$ : set of actions ($a \in A$)

$P$ : transition model for each action, $P(s_{t+1}=s' | s_t=s, a_t=a)$

$R$ : reward function $R(s_t=s, a_t=a) = E[r_t | s_t=s, a_t=a]$

An MDP is a tuple: $(S, A, P, R, \gamma)$

Example


action $a_1$ : move left, action $a_2$ : move right


MDP policies can be either *deterministic* (in a given state, I always take the same action) or *stochastic* (a distribution over the possible actions in that state).

MDP + Policy

  • MDP + policy $\pi(a|s)$ ⇒ MRP $(S, R^\pi, P^\pi, \gamma)$

$$R^\pi(s) = \sum_{a \in A} \pi(a|s) R(s,a)$$

$$P^\pi(s'|s) = \sum_{a \in A} \pi(a|s) P(s'|s,a)$$
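
A small sketch of this collapse in numpy; the array shapes ($|A|\times|S|\times|S|$ for $P$, $|S|\times|A|$ for $R$ and $\pi$) are my own convention, not something fixed by the lecture:

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Collapse an MDP under a stochastic policy pi into an MRP.
    P:  |A| x |S| x |S| array, P[a, s, s2] = P(s2 | s, a)
    R:  |S| x |A| array,       R[s, a]     = R(s, a)
    pi: |S| x |A| array,       pi[s, a]    = pi(a | s), rows sum to 1
    Returns R_pi (length |S|) and P_pi (|S| x |S|)."""
    R_pi = np.sum(pi * R, axis=1)             # R_pi(s)    = sum_a pi(a|s) R(s,a)
    P_pi = np.einsum('sa,ast->st', pi, P)     # P_pi(s,s') = sum_a pi(a|s) P(s'|s,a)
    return R_pi, P_pi
```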

MDP Policy Evaluation - Iterative Algorithm

  • Initialize $V_0(s) = 0$ for all $s$

for $k = 1$ until convergence

for all $s$ in $S$

$$V_k^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s)) V_{k-1}^\pi(s')$$

this is a Bellman backup for a particular policy

Example


There are only two actions $a_1, a_2$; let's say $\pi(s) = a_1$ and $\gamma = 0$.

compute iterations!

$$V_k^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s)) V_{k-1}^\pi(s')$$

since $\gamma = 0$, each state's value is just its immediate reward, so for example $V(s_4) = 0$

Practice

  • Dynamics : $p(s_6|s_6,a_1) = 0.5$, $p(s_7|s_6,a_1) = 0.5$
  • Let $\pi(s) = a_1$, $V_k^\pi = [1, 0, 0, 0, 0, 0, 10]$, $k = 1$, $\gamma = 0.5$

Find $V_2^\pi(s_6)$.

— — — — — — — — — — — — — — — — — — — — — — — — — — —

$$
V^\pi_2(s_6) = 0 + 0.5(0.5 \cdot 0 + 0.5 \cdot 10) = 2.5
$$
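
The same arithmetic in Python (states 0-indexed, so $s_6$ is index 5; the immediate reward in $s_6$ is 0, as the given $V_k^\pi$ implies):

```python
gamma = 0.5
V1 = [1, 0, 0, 0, 0, 0, 10]                        # V_k for k = 1, as given
r_s6 = 0                                           # immediate reward in s6
V2_s6 = r_s6 + gamma * (0.5 * V1[5] + 0.5 * V1[6])
print(V2_s6)                                       # 2.5
```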

MDP Control

  • compute optimal policy

$$\pi^*(s) = \arg\max_\pi V^\pi(s)$$

  • there exists a unique optimal value function
  • the optimal policy for an infinite-horizon MDP is stationary (in the same state, the time step doesn't matter)
  • → not necessarily unique, since another action may achieve the same optimal value
  • → deterministic

* the optimal policy is not necessarily unique, whereas the optimal value function is unique


MDP Policy Iteration(PI)

for infinite horizon,

State-Action Value Q

$$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\pi(s')$$

immediate reward + discounted expected value of the next state, following $\pi$ thereafter

Policy Improvement(steps)

  1. compute the Q values for all $s \in S$ and $a \in A$

$$Q^{\pi_i}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^{\pi_i}(s')$$

  2. compute the new policy $\pi_{i+1}$, for all $s \in S$, that maximizes $Q^{\pi_i}$

$$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s,a)$$

    $\pi_{i+1}(s)$ is either $\pi_i(s)$ or a different action $a$ that maximizes the $Q$ value, thus

$$\max_a Q^{\pi_i}(s,a) \ge Q^{\pi_i}(s, \pi_i(s)) = R(s, \pi_i(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_i(s)) V^{\pi_i}(s') = V^{\pi_i}(s)$$

"If we take the new policy for one action and then follow $\pi_i$ forever after, we are guaranteed to do at least as well as we did before in terms of the value function."
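
A sketch of the improvement step in numpy (same hypothetical array shapes as in the MDP-to-MRP sketch above); a full policy iteration simply alternates policy evaluation with this greedy step until the policy stops changing:

```python
import numpy as np

def q_from_v(P, R, V, gamma):
    """Q_pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V_pi(s')."""
    return R + gamma * np.einsum('ast,t->sa', P, V)

def policy_improvement(P, R, V, gamma):
    """Greedy improvement: pi_{i+1}(s) = argmax_a Q_{pi_i}(s, a)."""
    return np.argmax(q_from_v(P, R, V, gamma), axis=1)
```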

Monotonic Improvement in Policy Value

the value of the new policy is greater than or equal to that of the old policy in all states

Proposition : $V^{\pi_{i+1}} \ge V^{\pi_i}$, with strict inequality if $\pi_i$ is suboptimal, where $\pi_{i+1}$ is the new policy we get from policy improvement on $\pi_i$.

  • Proof: chain the inequality above: $V^{\pi_i}(s) \le \max_a Q^{\pi_i}(s,a) = R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_{i+1}(s)) V^{\pi_i}(s')$; repeatedly expanding the remaining $V^{\pi_i}$ terms under $\pi_{i+1}$ gives $V^{\pi_i}(s) \le V^{\pi_{i+1}}(s)$.

MDP Value Iteration (VI)

  • Idea: compute the optimal value and policy as if the agent only gets to act for $k$ more time steps, then let $k$ grow until the values converge

Bellman Equation and Bellman Backup Operators

Bellman Equation → the value function of a policy must satisfy

$$V^\pi(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s) V^\pi(s')$$

Bellman backup operator

  • applied to a value function, returns a new value function
  • yields a value function over all states s

$$BV(s) = \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V(s') \right]$$

Policy Iteration as Bellman Operations

  • Bellman backup operator Bπ for particular policy

$$B^\pi V(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s) V(s')$$

  • for policy evaluation, repeatedly apply the operator until $V$ stops changing

$$V^\pi = B^\pi B^\pi \cdots B^\pi V$$

Value Iteration(VI)
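
A minimal numpy sketch of the loop, using the Bellman backup $B$ defined above (same hypothetical array shapes as in the earlier sketches):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman backup
    V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    until V stops changing; also return the greedy policy."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # |S| x |A|
        V_new = Q.max(axis=1)                          # Bellman backup B V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```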

Contraction Operator

  • Let $O$ be an operator and $\|x\|$ denote a norm of $x$
  • if $\|OV - OV'\| \le \|V - V'\|$, then $O$ is a contraction operator
  • → applying the operator does not increase the distance between two value functions

Question : Will Value Iteration Converge?

A : Yes, because Bellman backup is a contraction operator

  • Proof: for any $V, V'$, $\|BV - BV'\|_\infty \le \gamma \|V - V'\|_\infty$; since $\gamma < 1$, $B$ is a contraction, so repeated application converges to its unique fixed point, the optimal value function.
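
A quick numerical spot-check of the contraction property on random (placeholder) value functions and dynamics:

```python
import numpy as np

rng = np.random.default_rng(0)
A, S, gamma = 2, 7, 0.5
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)   # random valid dynamics
R = rng.random((S, A))

def bellman_backup(V):
    """B V(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]."""
    return (R + gamma * np.einsum('ast,t->sa', P, V)).max(axis=1)

V, W = rng.random(S), rng.random(S)
lhs = np.max(np.abs(bellman_backup(V) - bellman_backup(W)))    # ||BV - BW||_inf
rhs = gamma * np.max(np.abs(V - W))                            # gamma * ||V - W||_inf
print(lhs <= rhs + 1e-12)                                      # True
```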

Reference link

[https://sumniya.tistory.com/5](https://sumniya.tistory.com/5)

