Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 6
We will combine Neural Network (NN) function approximation with RL.
Basic Deep Neural Network
A DNN is a neural network structure with multiple hidden layers of differentiable functional operators (linear transformations followed by nonlinearities). The benefits of using a DNN are listed below; a minimal training sketch follows the list.
- A DNN is a universal function approximator
- Requires fewer nodes/parameters to represent the same function
- Uses distributed representations instead of local representations
- Can learn the parameters using SGD
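As a minimal sketch of what such a network looks like in code, here is a small PyTorch MLP trained with SGD; the layer sizes and the toy regression target are my own assumptions, not from the lecture.

```python
import torch
import torch.nn as nn

# A DNN as a composition of differentiable layers (linear maps + nonlinearities).
net = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)

# Toy regression target, only to show the SGD loop; any differentiable loss works.
x = torch.randn(128, 4)
y = torch.sin(x).sum(dim=1, keepdim=True)

for step in range(200):
    loss = ((net(x) - y) ** 2).mean()  # mean squared error
    optimizer.zero_grad()
    loss.backward()                    # gradients by backpropagation
    optimizer.step()                   # SGD parameter update
```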
I will skip the Convolutional Neural Network section.
I have gone over it too many times already...
Deep Q Learning → Deep Q Network(DQN)
We will use a DNN to represent the value function, policy, and model, and optimize the loss function with SGD.
We approximate the state-action value function with a Q-network parameterized by $w$ → $\hat{Q}(s,a;w) \approx Q(s,a)$
With statement above, let us recall State-Action Value Function Approximation and Model-Free Control from Lecture 5.
This is the setup we expect when we attempt to play an Atari game with RL:
- game screen → state ($s_t$): we stack 4 frames of pixels to represent position and velocity
- moves → action ($a_t$): the 18 joystick moves are the actions the agent can take
- game score → reward ($r_t$): we prefer a higher score
Feeding the state images into a CNN (a sketch of such a network follows):
Input : 4 previous frames → Output : $Q(s,a)$ for 18 joystick moves
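A sketch of what this Q-network might look like in PyTorch. The 84×84 preprocessed frames and the specific convolution sizes follow the original DQN paper and are assumptions beyond these notes.

```python
import torch
import torch.nn as nn

# DQN-style Q-network: 4 stacked 84x84 frames in, one Q-value per joystick action out.
class QNetwork(nn.Module):
    def __init__(self, n_actions: int = 18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # Q(s, a) for each of the 18 actions
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frames / 255.0))

q_net = QNetwork()
dummy_state = torch.randint(0, 256, (1, 4, 84, 84), dtype=torch.uint8).float()
print(q_net(dummy_state).shape)  # torch.Size([1, 18])
```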
Why DQN?
Q-Learning with value function approximation is indeed useful. However, it has a critical flaw: it can diverge, for the following reasons
- Correlation between samples
- Non-stationary targets
This is where DQN comes in: it addresses these issues with Experience Replay and Fixed Q-Targets.
Experience Replay
In TD or Q-learning algorithms, we sample a tuple, use it for one update, and discard it right away.
Now we instead store the data in a replay buffer from which tuples are sampled and replayed.
We repeat the following (a code sketch of this loop follows the update rule below):
- sample an experience tuple from dataset → $(s,a,r,s') \in D$
- compute target value → $r + \gamma \max_{a'}\hat{Q}(s',a';w)$
- use SGD to update $w$
$$
\Delta w = \alpha(r+\gamma \max_{a'}\hat{Q}(s',a'; w)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$
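Here is a minimal sketch of that loop in PyTorch, assuming a `q_net` like the one sketched above and a buffer of tensor-valued tuples; for clarity it samples a single tuple per step, whereas DQN in practice samples a minibatch.

```python
import random
from collections import deque

import torch

GAMMA = 0.99
replay_buffer = deque(maxlen=100_000)   # D: stored (s, a, r, s') tuples

def replay_update(q_net, optimizer):
    """One update from a sampled experience tuple, following the rule above (sketch)."""
    s, a, r, s_next = random.choice(replay_buffer)

    # Target: r + gamma * max_a' Q_hat(s', a'; w)
    with torch.no_grad():
        target = r + GAMMA * q_net(s_next.unsqueeze(0)).max().item()

    # Prediction: Q_hat(s, a; w). Minimizing the squared TD error yields the
    # gradient step alpha * (target - Q_hat) * grad_w Q_hat from the equation above.
    q_sa = q_net(s.unsqueeze(0))[0, a]
    loss = (target - q_sa) ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```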
Notice that the $\Delta w$ equation is identical to ordinary Q-learning. However, it is important to understand that we repeat this process: the Q function changes after every update, so the same tuple can return a different value each time it is replayed.
Fixed Q-Targets
We fix the weights used in the target calculation ($r+\gamma V(s')$, standing in for the oracle $V^*$).
We denote the fixed weights by $w^-$, while $w$ denotes the weights being updated (a sketch with a separate target network follows the update rule below).
- sample an experience tuple from dataset → $(s,a,r,s') \in D$
- compute target value → $r + \gamma \max_{a'}\hat{Q}(s',a';w^-)$
- use SGD to update $w$
$$
\Delta w = \alpha(r+\gamma \max_{a'}\hat{Q}(s',a'; w^-)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$
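A minimal sketch of the same update with a separate target network holding $w^-$; the function names and the sync interval are assumptions.

```python
import torch

GAMMA = 0.99
TARGET_SYNC_EVERY = 1_000   # how often to copy w into w^- (an assumed value)

def fixed_target_update(q_net, target_net, optimizer, s, a, r, s_next):
    """One update of w while the target is computed with the frozen weights w^-."""
    with torch.no_grad():                                    # no gradient through the target
        target = r + GAMMA * target_net(s_next.unsqueeze(0)).max()

    q_sa = q_net(s.unsqueeze(0))[0, a]                       # Q_hat(s, a; w)
    loss = (target - q_sa) ** 2                              # squared TD error

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(q_net, target_net):
    """Every TARGET_SYNC_EVERY steps, copy w into w^- so the target stays fixed in between."""
    target_net.load_state_dict(q_net.state_dict())
```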
Example
We can see that the sampling order makes a significant difference in the resulting value function.
3 Big Ideas for Deep RL
Double DQN
Recall Double Q-Learning from the maximization bias discussion in Lecture 4.
Extend this idea to DQN
- Current Q-network $w$ is used to select actions
- Older Q-network $w^-$ is used to evaluate actions
$$
\Delta w = \alpha(r+\gamma \hat{Q}(s',\arg\max_{a'}\hat{Q}(s',a';w); w^-)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$
We select the action with $\arg\max_{a'}\hat{Q}(s',a';w)$ on the inside, and then evaluate that action using the fixed weights $w^-$.
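A minimal sketch of computing this Double DQN target, assuming the same `q_net`/`target_net` pair as above.

```python
import torch

GAMMA = 0.99

def double_dqn_target(q_net, target_net, r, s_next):
    """Double DQN target: select a' with the current weights w, evaluate it with w^-."""
    with torch.no_grad():
        a_star = q_net(s_next.unsqueeze(0)).argmax(dim=1)            # selection uses w
        q_eval = target_net(s_next.unsqueeze(0))[0, a_star.item()]   # evaluation uses w^-
    return r + GAMMA * q_eval
```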
Prioritized Replay
We use a “priority function” when sampling tuples for the update: tuple $i$ is sampled with probability proportional to $p_i^\alpha$, where $p_i$ is its priority (in prioritized replay, typically the magnitude of its TD error).
If $\alpha = 0$, the priorities are uniform and we recover ordinary (uniform) sampling over tuples.
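A small sketch of the sampling distribution this implies, assuming (as in the prioritized replay paper) that the priority of a tuple is the magnitude of its TD error.

```python
import numpy as np

def sampling_probabilities(td_errors, alpha):
    """Priority-based sampling distribution over the replay buffer (sketch).

    Priorities are |TD error|^alpha, normalized to sum to 1.
    alpha = 0 recovers uniform sampling.
    """
    priorities = np.abs(td_errors) ** alpha
    return priorities / priorities.sum()

td_errors = np.array([0.1, 2.0, 0.5])
print(sampling_probabilities(td_errors, alpha=0.0))  # uniform: [1/3, 1/3, 1/3]
print(sampling_probabilities(td_errors, alpha=1.0))  # biased toward large TD errors
```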
Dueling DQN
We decouple $Q^\pi(s,a)$ into $V^\pi(s)$ and the “advantage” $A^\pi(s,a)$ of an action (a sketch of the dueling head follows the equation).
$$
A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)
$$
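A minimal sketch of a dueling head in PyTorch. Subtracting the mean advantage when recombining is the identifiability trick from the Dueling DQN paper, not something stated in these notes; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling DQN head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a) (sketch)."""
    def __init__(self, in_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.advantage = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                       # V(s), shape (batch, 1)
        a = self.advantage(features)                   # A(s, a), shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)     # recombine into Q(s, a)

head = DuelingHead(in_dim=3136, n_actions=18)
print(head(torch.randn(2, 3136)).shape)  # torch.Size([2, 18])
```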