Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 2
Given the model of the world
Markov Property → a stochastic process evolving over time (e.g., whether or not I invest in stocks, the stock market changes)
Markov Chain
- sequence of random states with Markov property
- no rewards, no actions
Let $S$ be set of states ($s \in S$) and $P$ a transition model that specifies $P(s_{t+1}=s'|s_t=s)$
for a finite number ($N$) of states, we can write $P$ as an $N \times N$ transition matrix
example discussed previously (we set aside rewards and actions for simplicity)
at state $s_1$ we have a 0.4 chance of transferring to $s_2$ ($P(s_2|s_1)=0.4$) and a 0.6 probability of staying at $s_1$ ($P(s_1|s_1)=0.6$). These probabilities are the entries of the transition matrix $P$ above.
Let’s say we start at $s_1$. We can compute the agent’s distribution over the next state by multiplying the row vector $[1,0,0,0,0,0,0]$ by $P$ above. As a result we get $[0.6,0.4,0,0,0,0,0]$, as shown in the sketch below.
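A minimal NumPy sketch of this next-state computation. Only the first row of $P$ is given in the notes; the rest of the matrix below is a placeholder assumption.

```python
import numpy as np

# Transition matrix for the 7-state chain. Only the first row
# (P(s1|s1)=0.6, P(s2|s1)=0.4) comes from the notes; the other rows are
# placeholder assumptions (each remaining state simply stays put).
P = np.eye(7)
P[0, 0], P[0, 1] = 0.6, 0.4

mu0 = np.array([1, 0, 0, 0, 0, 0, 0], dtype=float)  # start in s1 with probability 1
mu1 = mu0 @ P                                       # distribution over the next state
print(mu1)  # [0.6 0.4 0.  0.  0.  0.  0. ]
```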
Markov Reward Process(MRP)
- Markov Chain + rewards
- no actions
for finite number($N$) of states,
$S$ : set of states ($s \in S$)
$P$ : transition model that specifies $P(s_{t+1}=s'|s_t=s)$
$R$ : reward function $R(s_t=s)=E[r_t|s_t=s]$
Horizon
- number of time steps in each episode
- can be either finite or infinite
Return($G_t$)
- Discounted sum of rewards from time step t to horizon
$$
G_t=r_t + \gamma r_{t+1}+\gamma ^2 r_{t+2}+\gamma ^3 r_{t+3}+...
$$
State Value Function(V(s))
- ‘Expected’ return from starting in state $s$ → average of all returns at state $s$
- $V$ is defined for each state $s$ thus may be displayed in array format.
$$
V(s_t=s)=E[G_t|s_t=s]=E[r_t + \gamma r_{t+1}+\gamma ^2 r_{t+2}+\gamma ^3 r_{t+3}+...|s_t=s]
$$
if the process is deterministic, Return = State Value, since there is only a single possible trajectory from each state.
Example
return value for sample episode, where $\gamma=0.5$
episode $s_4$→$s_5$→$s_6$→$s_7$ , $return$ : $0+0.5\cdot 0+0.5^2\cdot 0+0.5^3\cdot 10=1.25$
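A quick check of this arithmetic in Python (the per-step rewards 0, 0, 0, 10 are taken from the example above):

```python
# Discounted return for the episode s4 -> s5 -> s6 -> s7 with gamma = 0.5
gamma = 0.5
rewards = [0, 0, 0, 10]
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 1.25
```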
Computing the Value of Markov Reward Process
- can estimate by simulation: sample many episodes and average their returns (see the sketch after this list)
- MRP value function satisfies
$$
V(s)=R(s) + \gamma \sum_{s' \in S} P(s'|s)V(s')
$$
- Proof
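A minimal sketch of the simulation estimate mentioned in the first bullet above, assuming the MRP is given as a transition matrix `P` and reward vector `R`; the function name and parameters are illustrative, not from the lecture.

```python
import numpy as np

def mc_estimate_value(P, R, gamma, start, n_episodes=1000, horizon=50, seed=0):
    """Monte Carlo estimate of V(start) for an MRP (P, R, gamma).

    P is an |S| x |S| transition matrix, R a length-|S| reward vector.
    Roll out truncated episodes from `start`, compute each discounted
    return, and average them.
    """
    rng = np.random.default_rng(seed)
    n_states = len(R)
    returns = []
    for _ in range(n_episodes):
        s, G, discount = start, 0.0, 1.0
        for _ in range(horizon):
            G += discount * R[s]
            discount *= gamma
            s = rng.choice(n_states, p=P[s])   # sample the next state
        returns.append(G)
    return float(np.mean(returns))
```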
Matrix form of Bellman Equation for MRP
- analytic method of calculation
for a finite MRP, where we know $R(s)$ and $P(s'|s)$, we can express $V(s)$ as a matrix equation and solve it directly:
$$
V = R + \gamma PV \quad \Rightarrow \quad V = (I-\gamma P)^{-1}R
$$
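A one-function sketch of this analytic solution (names are mine); `np.linalg.solve` is used instead of forming the inverse explicitly, which is the standard numerically safer choice:

```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Solve the matrix Bellman equation V = R + gamma * P V exactly,
    i.e. V = (I - gamma * P)^(-1) R."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, np.asarray(R, dtype=float))
```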
Iterative Algorithm for Computing Value of a MRP
Dynamic Programming
- Initialize $V_0 (s)=0$ for all s
for k = 1 until convergence
for all $s$ in $S$
$V_k(s)=R(s) + \gamma \sum_{s' \in S} P(s'|s)V_{k-1}(s')$
** computational cost: each iteration is $O(|S|^2)$, cheaper than the $O(|S|^3)$ matrix inversion required by the analytic method
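A sketch of this dynamic-programming loop (the function name and convergence tolerance are my own choices):

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Iterate V_k = R + gamma * P V_{k-1} until the values stop changing."""
    P, R = np.asarray(P, dtype=float), np.asarray(R, dtype=float)
    V = np.zeros(len(R))                 # V_0(s) = 0 for all s
    while True:
        V_next = R + gamma * P @ V       # one sweep over all states, O(|S|^2)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
```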
Markov Decision Process(MDP)
- Markov Reward Process + actions
for finite number($N$) of states,
$S$ : set of states ($s \in S$)
$A$ : set of actions($a \in A$)
$P$ : transition model for each action $P(s_{t+1}=s'|s_t=s,a_t=a)$
$R$ : reward function $R(s_t=s, a_t=a)=E[r_t|s_t=s,a_t=a]$
MDP is a tuple: ($S,A,P,R,\gamma$)
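One way to carry this tuple around in code; the array shapes below (e.g. `P[a, s, s']`) are conventions I'm assuming for the later sketches, not something fixed by the lecture:

```python
from typing import NamedTuple
import numpy as np

class MDP(NamedTuple):
    """Container mirroring the tuple (S, A, P, R, gamma).

    Assumed conventions: states and actions are indexed 0..N-1,
    P[a, s, s'] = P(s'|s, a), and R[s, a] = E[r_t | s_t=s, a_t=a].
    """
    n_states: int
    n_actions: int
    P: np.ndarray      # shape (|A|, |S|, |S|)
    R: np.ndarray      # shape (|S|, |A|)
    gamma: float
```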
Example
action $a_1$ : move left, action $a_2$ : move right
MDP policies can be either *deterministic (I always take the same action in this state) or stochastic (a probability distribution over the possible actions for this state)*
MDP + Policy
- MDP + $\pi(a|s)$ ⇒ MRP ($S, R^\pi,P^\pi,\gamma$) (see the sketch after the equations below)
$$
R^\pi (s)=\sum_{a \in A}\pi(a|s)R(s,a)
$$
$$
P^\pi(s'|s)=\sum_{a \in A}\pi(a|s)P(s'|s,a)
$$
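A sketch of this collapse in NumPy, using the array conventions assumed above (`P[a, s, s']`, `R[s, a]`, `pi[s, a]` $= \pi(a|s)$):

```python
import numpy as np

def mdp_with_policy_to_mrp(P, R, pi):
    """Collapse an MDP plus a stochastic policy into an MRP (R_pi, P_pi)."""
    R_pi = np.einsum('sa,sa->s', pi, R)    # R_pi(s)    = sum_a pi(a|s) R(s,a)
    P_pi = np.einsum('sa,ast->st', pi, P)  # P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    return R_pi, P_pi
```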
MDP Policy Evaluation - Iterative Algorithm
- Initialize $V_0 (s)=0$ for all s
for k = 1 until convergence
for all $s$ in $S$
$V^\pi_k(s)=r(s,\pi(s)) + \gamma \sum_{s' \in S} P(s'|s,\pi(s))V^\pi_{k-1}(s')$
this is a Bellman backup for a particular policy
Example
there are only two actions $a_1, a_2$; let’s say $\pi(s)=a_1$ and $\gamma=0$
compute the iterations:
$V^\pi_k(s)=r(s,\pi(s)) + \gamma \sum_{s' \in S} P(s'|s,\pi(s))V^\pi_{k-1}(s')$
since $\gamma=0$, $V(s_4)$ is just its immediate reward, which is zero
Practice
- Dynamics : $p(s_6|s_6,a_1)=0.5, p(s_7|s_6, a_1)=0.5$
- Let $\pi(s)=a_1$, $V^\pi_k=[1,0,0,0,0,0,10]$ with $k=1$ and $\gamma = 0.5$
Find $V^\pi_{k+1}(s_6)$.
---
$$
V^\pi_2(s_6)=0 + 0.5(0.5\cdot 0+0.5\cdot 10)=2.5
$$
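The same backup written out in Python (the immediate reward of 0 in $s_6$ is implied by the answer, so it is hard-coded here as an assumption):

```python
gamma = 0.5
V_k = [1, 0, 0, 0, 0, 0, 10]   # V^pi_k for s1..s7, as given in the practice
r_s6 = 0                        # assumed immediate reward in s6 under a1
# one Bellman backup for s6 under pi(s) = a1, using the given dynamics
V_kp1_s6 = r_s6 + gamma * (0.5 * V_k[5] + 0.5 * V_k[6])
print(V_kp1_s6)  # 2.5
```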
MDP Control
- compute optimal policy
$$
\pi^*(s) = \arg \max_\pi V^\pi(s)
$$
- there exists a unique optimal value function
- Optimal policy for an infinite-horizon MDP → stationary (in the same state, the time step doesn’t matter)
- → not necessarily unique, since several state-action pairs may achieve the same optimal value
- → Deterministic
* an optimal policy is not necessarily unique, whereas the optimal value function is unique
Question
MDP Policy Iteration(PI)
for infinite horizon,
State-Action Value $Q$
$$
Q^\pi(s,a)=R(s,a)+\gamma\sum_{s' \in S}P(s'|s,a)V^\pi(s')
$$
immediate reward + discounted expected value of the next state, thereafter following $\pi$
Policy Improvement(steps)
- compute $Q^{\pi_i}(s,a)$ for all $s \in S$ and $a \in A$
$$
Q^{\pi_i}(s,a)=R(s,a)+\gamma\sum_{s' \in S}P(s'|s,a)V^{\pi_i}(s')
$$
- compute the new policy $\pi_{i+1}$ that maximizes $Q^{\pi_i}$ at every state (a full policy-iteration sketch follows this subsection)
$$
\pi_{i+1}(s)=\arg \max_a Q^{\pi_i}(s,a)
$$
$\pi_{i+1}(s)$ either equals $\pi_i(s)$ or picks a different action with a higher $Q$ value, thus
$$
\max_aQ^{\pi_i}(s,a) \ge Q^{\pi_i}(s, \pi_i(s)) = R(s, \pi_i(s)) + \gamma \sum_{s' \in S}P(s'|s, \pi_i(s))V^{\pi_i}(s')=V^{\pi_i}(s)
$$
“if we took a new policy for an action and then follow $\pi_i$ forever, we’re guaranteed to be at least as good as we were before in terms of value function.”
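A sketch of the full policy-iteration loop under the same array conventions as above (exact policy evaluation via a linear solve, then greedy improvement); the function and variable names are mine:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration for a finite MDP with P[a, s, s'] and R[s, a].

    Returns a deterministic policy (one action index per state) and its value.
    """
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial policy
    while True:
        # policy evaluation: solve V = R_pi + gamma * P_pi V exactly
        R_pi = R[np.arange(n_states), pi]
        P_pi = P[pi, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # policy improvement: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):                 # policy stable -> stop
            return pi, V
        pi = pi_new
```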
Monotonic Improvement in Policy Value
the value of the new policy is greater than or equal to the value of the old policy at every state
Proposition : $V^{\pi_{i+1}} \ge V^{\pi_i}$ with strict inequality if $\pi_i$ is suboptimal, where $\pi_{i+1}$ is the new policy we get from policy improvement on $\pi_i$.
- Proof
MDP Value Iteration (VI)
- idea: compute the optimal value (and policy) as if we only get to act for $k$ time steps, and let $k$ grow
Bellman Equation and Bellman Backup Operators
Bellman Equation → the value function of a policy must satisfy
$$
V^\pi(s)=R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s)V^\pi(s')
$$
Bellman backup operator
- applied to a value function, returns a new value function
- yields a value function over all states $s$
$$
BV(s)=\max_{a}\left[R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)V(s')\right]
$$
Policy Iteration as Bellman Operations
- Bellman backup operator $B^\pi$ for particular policy
$$
B^\pi V(s)=R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s)V(s')
$$
- for policy evaluation, repeatedly apply the operator until $V$ stops changing
$$
V^\pi=B^\pi B^\pi ... B^\pi V
$$
Value Iteration(VI)
- repeatedly apply the Bellman backup operator $B$ to an initial value function until it converges:
$$
V_{k+1}=BV_k
$$
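A sketch of this loop under the array conventions used above; the stopping tolerance and the final greedy-policy extraction are my additions:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Value iteration for a finite MDP with P[a, s, s'] and R[s, a]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                              # V_0(s) = 0 for all s
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)    # Q(s,a) backup
        V_next = Q.max(axis=1)                          # B V(s) = max_a Q(s,a)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next, Q.argmax(axis=1)             # value and greedy policy
        V = V_next
```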
Contraction Operator
- Let $O$ be an operator and $|x|$ denote the norm of $x$
- if $|OV-OV'| \le |V-V'|$, then $O$ is a contraction operator
- → applying the operator does not increase the distance between two value functions
Question : Will Value Iteration Converge?
A : Yes, if $\gamma < 1$, because the Bellman backup is a contraction operator
- Proof