Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 6
We will combine Neural Network (NN) function approximation with RL.
Basic Deep Neural Network
A DNN is a neural network structure with multiple hidden layers of differentiable functional operators (linear transformations followed by nonlinearities). The benefits of using a DNN are listed below; a minimal training sketch follows the list.
- A DNN is a universal function approximator
- Requires fewer nodes/parameters to represent the same function
- Uses distributed representations instead of local representations
- Can learn the parameters using SGD
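As a minimal sketch of what such a network looks like in code, here is a small PyTorch MLP trained with SGD; the layer sizes and the toy regression target are my own assumptions, not from the lecture.

```python
import torch
import torch.nn as nn

# A DNN as a composition of differentiable layers (linear maps + nonlinearities).
net = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)

# Toy regression target, only to show the SGD loop; any differentiable loss works.
x = torch.randn(128, 4)
y = torch.sin(x).sum(dim=1, keepdim=True)

for step in range(200):
    loss = ((net(x) - y) ** 2).mean()  # mean squared error
    optimizer.zero_grad()
    loss.backward()                    # gradients by backpropagation
    optimizer.step()                   # SGD parameter update
```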
I will skip the Convolutional Neural Network section.
I have gone over it too many times already...
Deep Q Learning → Deep Q Network(DQN)
We will use a DNN to represent the value function, policy, and model, and optimize the loss function with SGD.
We approximate the state-action value function with a Q-network parameterized by $w$ → $\hat{Q}(s,a;w) \approx Q(s,a)$
With statement above, let us recall State-Action Value Function Approximation and Model-Free Control from Lecture 5.
This is the setup we expect when we attempt to play an Atari game with RL:
- game screen → state ($s_t$): we stack 4 frames of pixels to represent position and velocity
- moves → action ($a_t$): the 18 joystick moves are the actions the agent can take
- game score → reward ($r_t$): we prefer a higher score
Feeding the state images into a CNN (a sketch of such a network follows):
Input : 4 previous frames → Output : $Q(s,a)$ for 18 joystick moves
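A sketch of what this Q-network might look like in PyTorch. The 84×84 preprocessed frames and the specific convolution sizes follow the original DQN paper and are assumptions beyond these notes.

```python
import torch
import torch.nn as nn

# DQN-style Q-network: 4 stacked 84x84 frames in, one Q-value per joystick action out.
class QNetwork(nn.Module):
    def __init__(self, n_actions: int = 18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # Q(s, a) for each of the 18 actions
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frames / 255.0))

q_net = QNetwork()
dummy_state = torch.randint(0, 256, (1, 4, 84, 84), dtype=torch.uint8).float()
print(q_net(dummy_state).shape)  # torch.Size([1, 18])
```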
Why DQN?
Q-Learning with value function approximation is indeed useful. However, it has a critical flaw: it can diverge, for the following reasons
- Correlation between samples
- Non-stationary targets
This is where DQN comes in: it addresses these issues with Experience Replay and Fixed Q-Targets.
Experience Replay
In TD or Q-learning algorithms, we sample a tuple, use it for one update, and discard it right away.
Now we instead store the data in a replay buffer from which tuples are sampled and replayed.
We repeat the following (a code sketch of this loop follows the update rule below):
- sample an experience tuple from dataset → $(s,a,r,s') \in D$
- compute target value → $r + \gamma \max_{a'}\hat{Q}(s',a';w)$
- use SGD to update $w$
$$
\Delta w = \alpha(r+\gamma \max_{a'}\hat{Q}(s',a'; w)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$
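Here is a minimal sketch of that loop in PyTorch, assuming a `q_net` like the one sketched above and a buffer of tensor-valued tuples; for clarity it samples a single tuple per step, whereas DQN in practice samples a minibatch.

```python
import random
from collections import deque

import torch

GAMMA = 0.99
replay_buffer = deque(maxlen=100_000)   # D: stored (s, a, r, s') tuples

def replay_update(q_net, optimizer):
    """One update from a sampled experience tuple, following the rule above (sketch)."""
    s, a, r, s_next = random.choice(replay_buffer)

    # Target: r + gamma * max_a' Q_hat(s', a'; w)
    with torch.no_grad():
        target = r + GAMMA * q_net(s_next.unsqueeze(0)).max().item()

    # Prediction: Q_hat(s, a; w). Minimizing the squared TD error yields the
    # gradient step alpha * (target - Q_hat) * grad_w Q_hat from the equation above.
    q_sa = q_net(s.unsqueeze(0))[0, a]
    loss = (target - q_sa) ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```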
Notice that the $\Delta w$ equation is identical to ordinary Q-learning. However, it is important to understand that we repeat this process: the Q function changes after every update, so the same tuple can return a different value each time it is replayed.
Fixed Q-Targets
We fix the weights used in the target calculation ($r+\gamma V(s')$, standing in for the oracle $V^*$).
We denote the fixed weights by $w^-$, while $w$ denotes the weights being updated (a sketch with a separate target network follows the update rule below).
- sample an experience tuple from dataset → $(s,a,r,s') \in D$
- compute target value → $r + \gamma \max_{a'}\hat{Q}(s',a';w^-)$
- use SGD to update $w$
$$
\Delta w = \alpha(r+\gamma \max_{a'}\hat{Q}(s',a'; w^-)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$
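A minimal sketch of the same update with a separate target network holding $w^-$; the function names and the sync interval are assumptions.

```python
import torch

GAMMA = 0.99
TARGET_SYNC_EVERY = 1_000   # how often to copy w into w^- (an assumed value)

def fixed_target_update(q_net, target_net, optimizer, s, a, r, s_next):
    """One update of w while the target is computed with the frozen weights w^-."""
    with torch.no_grad():                                    # no gradient through the target
        target = r + GAMMA * target_net(s_next.unsqueeze(0)).max()

    q_sa = q_net(s.unsqueeze(0))[0, a]                       # Q_hat(s, a; w)
    loss = (target - q_sa) ** 2                              # squared TD error

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(q_net, target_net):
    """Every TARGET_SYNC_EVERY steps, copy w into w^- so the target stays fixed in between."""
    target_net.load_state_dict(q_net.state_dict())
```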
Example
We can see that the sampling order makes a significant difference in the resulting value function.
3 Big Ideas for Deep RL
Double DQN
Recall Double Q-Learning from the maximization bias discussion in Lecture 4.
Extend this idea to DQN
- Current Q-network $w$ is used to select actions
- Older Q-network $w^-$ is used to evaluate actions
$$
\Delta w = \alpha(r+\gamma \hat{Q}(s',\arg\max_{a'}\hat{Q}(s',a';w); w^-)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$
We select the action with $\arg\max_{a'}\hat{Q}(s',a';w)$ on the inside, and then evaluate that action using the fixed weights $w^-$.
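A minimal sketch of computing this Double DQN target, assuming the same `q_net`/`target_net` pair as above.

```python
import torch

GAMMA = 0.99

def double_dqn_target(q_net, target_net, r, s_next):
    """Double DQN target: select a' with the current weights w, evaluate it with w^-."""
    with torch.no_grad():
        a_star = q_net(s_next.unsqueeze(0)).argmax(dim=1)            # selection uses w
        q_eval = target_net(s_next.unsqueeze(0))[0, a_star.item()]   # evaluation uses w^-
    return r + GAMMA * q_eval
```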
Prioritized Replay
We use a “priority function” when sampling tuples for the update: tuple $i$ is sampled with probability proportional to $p_i^\alpha$, where $p_i$ is its priority (in prioritized replay, typically the magnitude of its TD error).
If $\alpha = 0$, the priorities are uniform and we recover ordinary (uniform) sampling over tuples.
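A small sketch of the sampling distribution this implies, assuming (as in the prioritized replay paper) that the priority of a tuple is the magnitude of its TD error.

```python
import numpy as np

def sampling_probabilities(td_errors, alpha):
    """Priority-based sampling distribution over the replay buffer (sketch).

    Priorities are |TD error|^alpha, normalized to sum to 1.
    alpha = 0 recovers uniform sampling.
    """
    priorities = np.abs(td_errors) ** alpha
    return priorities / priorities.sum()

td_errors = np.array([0.1, 2.0, 0.5])
print(sampling_probabilities(td_errors, alpha=0.0))  # uniform: [1/3, 1/3, 1/3]
print(sampling_probabilities(td_errors, alpha=1.0))  # biased toward large TD errors
```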
Dueling DQN
We decouple $Q^\pi(s,a)$ into $V^\pi(s)$ and the “advantage” $A^\pi(s,a)$ of an action (a sketch of the dueling head follows the equation).
$$
A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)
$$
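A minimal sketch of a dueling head in PyTorch. Subtracting the mean advantage when recombining is the identifiability trick from the Dueling DQN paper, not something stated in these notes; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling DQN head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a) (sketch)."""
    def __init__(self, in_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.advantage = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                       # V(s), shape (batch, 1)
        a = self.advantage(features)                   # A(s, a), shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)     # recombine into Q(s, a)

head = DuelingHead(in_dim=3136, n_actions=18)
print(head(torch.randn(2, 3136)).shape)  # torch.Size([2, 18])
```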