Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 2
Given the model of the world
Markov Property → a stochastic process evolving over time (e.g., whether or not I invest in stocks, the stock market changes)
Markov Chain
- sequence of random states with Markov property
- no rewards, no actions
Let $S$ be set of states ($s \in S$) and $P$ a transition model that specifies $P(s_{t+1}=s'|s_t=s)$
for a finite number ($N$) of states, we can write $P$ as an $N \times N$ transition matrix
example discussed previously (we set aside rewards and actions for simplicity)
at state $s_1$ we have a 0.4 chance of transferring to $s_2$ ($P(s_2|s_1)=0.4$) and a 0.6 probability of staying at $s_1$ ($P(s_1|s_1)=0.6$). These probabilities are the entries of the transition matrix $P$ above.
Let’s say we start at $s_1$. We can compute the agent’s distribution over the next state by multiplying the row vector $[1,0,0,0,0,0,0]$ by $P$ above. As a result we get $[0.6,0.4,0,0,0,0,0]$, as shown in the sketch below.
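A minimal NumPy sketch of this next-state computation. Only the first row of $P$ is given in the notes; the rest of the matrix below is a placeholder assumption.

```python
import numpy as np

# Transition matrix for the 7-state chain. Only the first row
# (P(s1|s1)=0.6, P(s2|s1)=0.4) comes from the notes; the other rows are
# placeholder assumptions (each remaining state simply stays put).
P = np.eye(7)
P[0, 0], P[0, 1] = 0.6, 0.4

mu0 = np.array([1, 0, 0, 0, 0, 0, 0], dtype=float)  # start in s1 with probability 1
mu1 = mu0 @ P                                       # distribution over the next state
print(mu1)  # [0.6 0.4 0.  0.  0.  0.  0. ]
```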
Markov Reward Process(MRP)
- Markov Chain + rewards
- no actions
for finite number($N$) of states,
$S$ : set of states ($s \in S$)
$P$ : transition model that specifies $P(s_{t+1}=s'|s_t=s)$
$R$ : reward function $R(s_t=s)=E[r_t|s_t=s]$
Horizon
- number of time steps in each episode
- can be either finite or infinite
Return($G_t$)
- Discounted sum of rewards from time step t to horizon
$$
G_t=r_t + \gamma r_{t+1}+\gamma ^2 r_{t+2}+\gamma ^3 r_{t+3}+...
$$
State Value Function(V(s))
- ‘Expected’ return from starting in state $s$ → average of all returns at state $s$
- $V$ is defined for each state $s$ thus may be displayed in array format.
$$
V(s_t=s)=E[G_t|s_t=s]=E[r_t + \gamma r_{t+1}+\gamma ^2 r_{t+2}+\gamma ^3 r_{t+3}+...|s_t=s]
$$
if the process is deterministic, Return = State Value, since there is only a single possible trajectory from each state.
Example
return value for sample episode, where $\gamma=0.5$
episode $s_4$→$s_5$→$s_6$→$s_7$ , $return$ : $0+0.5\cdot 0+0.5^2\cdot 0+0.5^3\cdot 10=1.25$
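A quick check of this arithmetic in Python (the per-step rewards 0, 0, 0, 10 are taken from the example above):

```python
# Discounted return for the episode s4 -> s5 -> s6 -> s7 with gamma = 0.5
gamma = 0.5
rewards = [0, 0, 0, 10]
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 1.25
```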
Computing the Value of Markov Reward Process
- can estimate by simulation: sample many episodes and average their returns (see the sketch after this list)
- MRP value function satisfies
$$
V(s)=R(s) + \gamma \sum_{s' \in S} P(s'|s)V(s')
$$
- Proof
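A minimal sketch of the simulation estimate mentioned in the first bullet above, assuming the MRP is given as a transition matrix `P` and reward vector `R`; the function name and parameters are illustrative, not from the lecture.

```python
import numpy as np

def mc_estimate_value(P, R, gamma, start, n_episodes=1000, horizon=50, seed=0):
    """Monte Carlo estimate of V(start) for an MRP (P, R, gamma).

    P is an |S| x |S| transition matrix, R a length-|S| reward vector.
    Roll out truncated episodes from `start`, compute each discounted
    return, and average them.
    """
    rng = np.random.default_rng(seed)
    n_states = len(R)
    returns = []
    for _ in range(n_episodes):
        s, G, discount = start, 0.0, 1.0
        for _ in range(horizon):
            G += discount * R[s]
            discount *= gamma
            s = rng.choice(n_states, p=P[s])   # sample the next state
        returns.append(G)
    return float(np.mean(returns))
```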
Matrix form of Bellman Equation for MRP
- analytic method of calculation
for a finite MRP, where we know $R(s)$ and $P(s'|s)$, we can express $V(s)$ as a matrix equation and solve it directly:
$$
V = R + \gamma PV \quad \Rightarrow \quad V = (I-\gamma P)^{-1}R
$$
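A one-function sketch of this analytic solution (names are mine); `np.linalg.solve` is used instead of forming the inverse explicitly, which is the standard numerically safer choice:

```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Solve the matrix Bellman equation V = R + gamma * P V exactly,
    i.e. V = (I - gamma * P)^(-1) R."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, np.asarray(R, dtype=float))
```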
Iterative Algorithm for Computing Value of a MRP
Dynamic Programming
- Initialize $V_0 (s)=0$ for all s
for k = 1 until convergence
for all $s$ in $S$
$V_k(s)=R(s) + \gamma \sum_{s' \in S} P(s'|s)V_{k-1}(s')$
** computational cost: each iteration is $O(|S|^2)$, cheaper than the $O(|S|^3)$ matrix inversion required by the analytic method
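A sketch of this dynamic-programming loop (the function name and convergence tolerance are my own choices):

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Iterate V_k = R + gamma * P V_{k-1} until the values stop changing."""
    P, R = np.asarray(P, dtype=float), np.asarray(R, dtype=float)
    V = np.zeros(len(R))                 # V_0(s) = 0 for all s
    while True:
        V_next = R + gamma * P @ V       # one sweep over all states, O(|S|^2)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
```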
Markov Decision Process(MDP)
- Markov Reward Process + actions
for finite number($N$) of states,
$S$ : set of states ($s \in S$)
$A$ : set of actions($a \in A$)
$P$ : transition model for each action $P(s_{t+1}=s'|s_t=s,a_t=a)$
$R$ : reward function $R(s_t=s, a_t=a)=E[r_t|s_t=s,a_t=a]$
MDP is a tuple: ($S,A,P,R,\gamma$)
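One way to carry this tuple around in code; the array shapes below (e.g. `P[a, s, s']`) are conventions I'm assuming for the later sketches, not something fixed by the lecture:

```python
from typing import NamedTuple
import numpy as np

class MDP(NamedTuple):
    """Container mirroring the tuple (S, A, P, R, gamma).

    Assumed conventions: states and actions are indexed 0..N-1,
    P[a, s, s'] = P(s'|s, a), and R[s, a] = E[r_t | s_t=s, a_t=a].
    """
    n_states: int
    n_actions: int
    P: np.ndarray      # shape (|A|, |S|, |S|)
    R: np.ndarray      # shape (|S|, |A|)
    gamma: float
```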
Example
action $a_1$ : move left, action $a_2$ : move right
MDP policies can be either *deterministic (I always take the same action in this state) or stochastic (a probability distribution over the possible actions for this state)*
MDP + Policy
- MDP + $\pi(a|s)$ ⇒ MRP ($S, R^\pi,P^\pi,\gamma$) (see the sketch after the equations below)
$$
R^\pi (s)=\sum_{a \in A}\pi(a|s)R(s,a)
$$
$$
P^\pi(s'|s)=\sum_{a \in A}\pi(a|s)P(s'|s,a)
$$
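A sketch of this collapse in NumPy, using the array conventions assumed above (`P[a, s, s']`, `R[s, a]`, `pi[s, a]` $= \pi(a|s)$):

```python
import numpy as np

def mdp_with_policy_to_mrp(P, R, pi):
    """Collapse an MDP plus a stochastic policy into an MRP (R_pi, P_pi)."""
    R_pi = np.einsum('sa,sa->s', pi, R)    # R_pi(s)    = sum_a pi(a|s) R(s,a)
    P_pi = np.einsum('sa,ast->st', pi, P)  # P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    return R_pi, P_pi
```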
MDP Policy Evaluation - Iterative Algorithm
- Initialize $V_0 (s)=0$ for all s
for k = 1 until convergence
for all $s$ in $S$
$V^\pi_k(s)=r(s,\pi(s)) + \gamma \sum_{s' \in S} P(s'|s,\pi(s))V^\pi_{k-1}(s')$
this is a Bellman backup for a particular policy
Example
there are only two actions $a_1, a_2$; let’s say $\pi(s)=a_1$ and $\gamma=0$
compute the iterations:
$V^\pi_k(s)=r(s,\pi(s)) + \gamma \sum_{s' \in S} P(s'|s,\pi(s))V^\pi_{k-1}(s')$
since $\gamma=0$, $V(s_4)$ is just its immediate reward, which is zero
Practice
- Dynamics : $p(s_6|s_6,a_1)=0.5, p(s_7|s_6, a_1)=0.5$
- Let $\pi(s)=a_1$, $V^\pi_k=[1,0,0,0,0,0,10]$ with $k=1$ and $\gamma = 0.5$
Find $V^\pi_{k+1}(s_6)$.
---
$$
V^\pi_2(s_6)=0 + 0.5(0.5\cdot 0+0.5\cdot 10)=2.5
$$
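The same backup written out in Python (the immediate reward of 0 in $s_6$ is implied by the answer, so it is hard-coded here as an assumption):

```python
gamma = 0.5
V_k = [1, 0, 0, 0, 0, 0, 10]   # V^pi_k for s1..s7, as given in the practice
r_s6 = 0                        # assumed immediate reward in s6 under a1
# one Bellman backup for s6 under pi(s) = a1, using the given dynamics
V_kp1_s6 = r_s6 + gamma * (0.5 * V_k[5] + 0.5 * V_k[6])
print(V_kp1_s6)  # 2.5
```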
MDP Control
- compute optimal policy
$$
\pi^*(s) = \arg \max_\pi V^\pi(s)
$$
- there exists a unique optimal value function
- Optimal policy for an infinite-horizon MDP → stationary (in the same state, the time step doesn’t matter)
- → not necessarily unique, since several state-action pairs may achieve the same optimal value
- → Deterministic
* an optimal policy is not necessarily unique, whereas the optimal value function is unique
Question
MDP Policy Iteration(PI)
for infinite horizon,
State-Action Value $Q$
$$
Q^\pi(s,a)=R(s,a)+\gamma\sum_{s' \in S}P(s'|s,a)V^\pi(s')
$$
immediate reward + discounted expected value of the next state, thereafter following $\pi$
Policy Improvement(steps)
- compute $Q^{\pi_i}(s,a)$ for all $s \in S$ and $a \in A$
$$
Q^{\pi_i}(s,a)=R(s,a)+\gamma\sum_{s' \in S}P(s'|s,a)V^{\pi_i}(s')
$$
- compute the new policy $\pi_{i+1}$ that maximizes $Q^{\pi_i}$ at every state (a full policy-iteration sketch follows this subsection)
$$
\pi_{i+1}(s)=\arg \max_a Q^{\pi_i}(s,a)
$$
$\pi_{i+1}(s)$ either equals $\pi_i(s)$ or picks a different action with a higher $Q$ value, thus
$$
\max_aQ^{\pi_i}(s,a) \ge Q^{\pi_i}(s, \pi_i(s)) = R(s, \pi_i(s)) + \gamma \sum_{s' \in S}P(s'|s, \pi_i(s))V^{\pi_i}(s')=V^{\pi_i}(s)
$$
“if we took a new policy for an action and then follow $\pi_i$ forever, we’re guaranteed to be at least as good as we were before in terms of value function.”
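A sketch of the full policy-iteration loop under the same array conventions as above (exact policy evaluation via a linear solve, then greedy improvement); the function and variable names are mine:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration for a finite MDP with P[a, s, s'] and R[s, a].

    Returns a deterministic policy (one action index per state) and its value.
    """
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial policy
    while True:
        # policy evaluation: solve V = R_pi + gamma * P_pi V exactly
        R_pi = R[np.arange(n_states), pi]
        P_pi = P[pi, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # policy improvement: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):                 # policy stable -> stop
            return pi, V
        pi = pi_new
```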
Monotonic Improvement in Policy Value
the value of the new policy is greater than or equal to the value of the old policy at every state
Proposition : $V^{\pi_{i+1}} \ge V^{\pi_i}$ with strict inequality if $\pi_i$ is suboptimal, where $\pi_{i+1}$ is the new policy we get from policy improvement on $\pi_i$.
- Proof
MDP Value Iteration (VI)
- idea: compute the optimal value (and policy) as if we only get to act for $k$ time steps, and let $k$ grow
Bellman Equation and Bellman Backup Operators
Bellman Equation → the value function of a policy must satisfy
$$
V^\pi(s)=R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s)V^\pi(s')
$$
Bellman backup operator
- applied to a value function, returns a new value function
- yields a value function over all states $s$
$$
BV(s)=\max_{a}\left[R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)V(s')\right]
$$
Policy Iteration as Bellman Operations
- Bellman backup operator $B^\pi$ for particular policy
$$
B^\pi V(s)=R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s)V(s')
$$
- for policy evaluation, repeatedly apply the operator until $V$ stops changing
$$
V^\pi=B^\pi B^\pi ... B^\pi V
$$
Value Iteration(VI)
- repeatedly apply the Bellman backup operator $B$ to an initial value function until it converges:
$$
V_{k+1}=BV_k
$$
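A sketch of this loop under the array conventions used above; the stopping tolerance and the final greedy-policy extraction are my additions:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Value iteration for a finite MDP with P[a, s, s'] and R[s, a]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                              # V_0(s) = 0 for all s
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)    # Q(s,a) backup
        V_next = Q.max(axis=1)                          # B V(s) = max_a Q(s,a)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next, Q.argmax(axis=1)             # value and greedy policy
        V = V_next
```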
Contraction Operator
- Let $O$ be an operator and $|x|$ denote the norm of $x$
- if $|OV-OV'| \le |V-V'|$, then $O$ is a contraction operator
- → applying the operator does not increase the distance between two value functions
Question : Will Value Iteration Converge?
A : Yes, if $\gamma < 1$, because the Bellman backup is a contraction operator
- Proof