Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 6
We will combine Neural Network (NN) function approximation with RL.
Basic Deep Neural Network


A DNN is a neural network composed of multiple hidden layers of differentiable functional operators (linear transformations followed by nonlinearities). The benefits of using a DNN are as below (a small code sketch follows the list):
- A DNN is a universal function approximator
- Requires fewer nodes/parameters to represent the same function
- Uses distributed representations instead of local representations
- Can learn the parameters using SGD
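As a rough sketch (not from the lecture), here is a small fully connected DNN in PyTorch; the `SmallDNN` name and the layer sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# A small fully connected DNN: a stack of differentiable linear layers with
# nonlinearities in between, trainable end-to-end with SGD via backpropagation.
class SmallDNN(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

model = SmallDNN(in_dim=4, hidden_dim=64, out_dim=2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```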
I will skip the Convolutional Neural Network section; I have gone over it too many times already...
Deep Q-Learning → Deep Q-Network (DQN)
We will use a DNN to represent the value function, policy, and model, and optimize the loss function with SGD. In particular, we represent the state-action value function by a Q-network with weights $w$: $\hat{Q}(s, a; w) \approx Q^\pi(s, a)$.

With the statement above, let us recall State-Action Value Function Approximation and Model-Free Control from Lecture 5.
This is what we expect when we attempt to play an Atari game with RL:

- game screen → state ($s_t$): we take 4 frames (pixels) to represent position and velocity
- joystick moves → action ($a_t$): 18 joystick moves are the actions the agent can take
- game score → reward ($r_t$): we prefer a higher score
Applying the state images to a CNN structure...

Input: 4 previous frames → Output: $\hat{Q}(s, a; w)$ for each of the 18 actions
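A hedged sketch of such a Q-network in PyTorch, assuming 84x84 preprocessed frames and the convolutional layer sizes from the published DQN paper (the exact sizes are an assumption, not stated in these notes):

```python
import torch
import torch.nn as nn

# Sketch of a DQN-style convolutional Q-network: the input is a stack of
# 4 preprocessed 84x84 frames, the output is one Q-value per joystick action.
class AtariQNetwork(nn.Module):
    def __init__(self, num_actions=18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, frames):                # frames: (batch, 4, 84, 84)
        return self.head(self.conv(frames))   # Q-values: (batch, num_actions)
```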
Why DQN?
Q-Learning with value function approximation is indeed useful. However, it has a critical flaw: it can diverge, for the following reasons:
- Correlation between samples
- Non-stationary targets
This is where DQN comes in; it addresses both issues with Experience Replay and Fixed Q-Targets.
Experience Replay
In TD or Q-learning algorithms, we sample a tuple, use it for a single update, and discard it right away. Now we will instead store experience tuples in a replay buffer, from which they can be sampled and replayed.

We repeat the following (a code sketch follows the list):
- sample an experience tuple from the dataset: $(s, a, r, s') \sim D$
- compute the target value for the sampled state: $r + \gamma \max_{a'} \hat{Q}(s', a'; w)$
- use SGD to update the network weights: $\Delta w = \alpha \left( r + \gamma \max_{a'} \hat{Q}(s', a'; w) - \hat{Q}(s, a; w) \right) \nabla_w \hat{Q}(s, a; w)$
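A minimal sketch of this loop in PyTorch, assuming a `q_net` like the network above and an `optimizer` over its parameters; the buffer size, batch size, and $\gamma$ are placeholders:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# Minimal replay buffer: store experience tuples, sample minibatches later.
buffer = deque(maxlen=100_000)

def store(s, a, r, s_next, done):
    buffer.append((s, a, r, s_next, done))

def replay_update(q_net, optimizer, batch_size=32, gamma=0.99):
    # 1) sample experience tuples (s, a, r, s') from the dataset D
    batch = random.sample(buffer, batch_size)
    s, a, r, s_next, done = zip(*batch)
    s      = torch.stack(s)
    a      = torch.tensor(a)
    r      = torch.tensor(r, dtype=torch.float32)
    s_next = torch.stack(s_next)
    done   = torch.tensor(done, dtype=torch.float32)

    # 2) compute the target r + gamma * max_a' Q(s', a'; w)
    #    (the bootstrap term is zeroed at terminal states)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values

    # 3) SGD step on the squared error between the target and Q(s, a; w)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```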
Notice that the target value itself still depends on the weights $w$ being updated, so the target keeps moving; this motivates the fixed Q-targets below.
Fixed Q-Targets
We fix the target weights used in the target calculation for a number of updates; we denote the fixed weights as $w^-$, distinct from the weights $w$ being updated. The loop becomes (a code sketch follows the list):
- sample an experience tuple from the dataset: $(s, a, r, s') \sim D$
- compute the target value using the fixed weights: $r + \gamma \max_{a'} \hat{Q}(s', a'; w^-)$
- use SGD to update: $\Delta w = \alpha \left( r + \gamma \max_{a'} \hat{Q}(s', a'; w^-) - \hat{Q}(s, a; w) \right) \nabla_w \hat{Q}(s, a; w)$
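A sketch of the same update with a separate target network holding $w^-$, reusing the assumed `q_net` and `optimizer` names from the earlier sketches:

```python
import copy

import torch
import torch.nn.functional as F

# target_net holds the fixed weights w^- used in the target calculation;
# it is only synced to the online weights w once in a while.
target_net = copy.deepcopy(q_net)

def fixed_target_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch  # already-batched tensors, as sampled above

    # target uses the frozen weights w^-
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

    # the SGD step only updates the online weights w
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# every C updates, copy w into w^-:
# target_net.load_state_dict(q_net.state_dict())
```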

Example
We can see that the sampling order makes a significant difference in the resulting value function.
3 Big Ideas for Deep RL
Double DQN
Recall Double Q-Learning from the maximization bias discussion in Lecture 4.

Extend this idea to DQN
- The current Q-network with weights $w$ is used to select actions
- The older Q-network with weights $w^-$ is used to evaluate actions
We select the action with the current network and evaluate it with the older one, giving the target $r + \gamma \hat{Q}\left(s', \arg\max_{a'} \hat{Q}(s', a'; w); w^-\right)$. A code sketch follows.
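A sketch of the Double DQN target computation, under the same assumed `q_net` (weights $w$) and `target_net` (weights $w^-$) as above:

```python
import torch

# Double DQN target: the current/online network (weights w) selects the action,
# the older/target network (weights w^-) evaluates it.
def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # select with w
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluate with w^-
    return r + gamma * (1 - done) * q_eval
```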
Prioritized Replay
We use a "priority function" when sampling tuples for updates: tuple $i$ is sampled with probability $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$, where the priority $p_i$ is proportional to the DQN (TD) error of that tuple. If $\alpha = 0$, we recover uniform sampling.
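A sketch of turning priorities into sampling probabilities; maintaining the per-tuple priorities (e.g. absolute TD errors) is assumed to happen elsewhere:

```python
import numpy as np

# P(i) = p_i^alpha / sum_k p_k^alpha, where p_i is the priority of tuple i
# (e.g. its absolute TD error plus a small epsilon). alpha = 0 is uniform.
def sample_indices(priorities, batch_size, alpha=0.6):
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()
    idx = np.random.choice(len(priorities), size=batch_size, p=probs)
    return idx, probs[idx]
```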
Dueling DQN
We decouple the state value function $V(s)$ and the advantage function $A(s, a)$, combining them as $\hat{Q}(s, a; w) = \hat{V}(s; w) + \hat{A}(s, a; w)$; in practice the advantage stream is centered by subtracting its mean over actions so the two streams are identifiable.
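A sketch of a dueling head that could replace the final fully connected layers of the Q-network above; names and sizes are placeholders:

```python
import torch
import torch.nn as nn

# Dueling head: separate streams estimate V(s) and A(s, a), then recombine.
# The advantage is centered by its mean over actions so V and A are identifiable.
class DuelingHead(nn.Module):
    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)
        self.advantage = nn.Linear(feature_dim, num_actions)

    def forward(self, features):
        v = self.value(features)                    # (batch, 1)
        a = self.advantage(features)                # (batch, num_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # Q-values, (batch, num_actions)
```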

