
Stanford CS234 Lecture 6

by 누워있는말티즈 2022. 8. 9.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 6

We will combine Neural Network (NN) features with RL.

Basic Deep Neural Network

A DNN is a neural network structure that stacks multiple hidden layers of differentiable functional operators (linear transforms followed by nonlinearities). The benefits of using a DNN are as follows (a minimal training sketch follows the list):

  • A DNN is a universal function approximator
  • Requires fewer nodes/parameters to represent the same function
  • Uses distributed representations instead of local representations
  • Can learn the parameters using SGD
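
Here is a minimal sketch (not from the lecture; the layer sizes, data, and learning rate are arbitrary) showing a DNN as a composition of differentiable layers whose parameters are learned with SGD:

```python
import torch
import torch.nn as nn

# A DNN as a composition of differentiable operators; sizes here are arbitrary.
model = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),   # hidden layer 1
    nn.Linear(64, 64), nn.ReLU(),  # hidden layer 2
    nn.Linear(64, 64), nn.ReLU(),  # hidden layer 3
    nn.Linear(64, 1),              # output layer
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(32, 8)   # a dummy minibatch of inputs
y = torch.randn(32, 1)   # dummy regression targets
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()          # gradients flow through every (differentiable) layer
optimizer.step()         # one SGD update of all parameters
```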

I will skip the Convolutional Neural Network section.

I've covered it way too many times already...


Deep Q-Learning → Deep Q-Network (DQN)

We will use a DNN to represent the value function, policy, and model, and optimize the loss function with SGD.

We approximate the state-action value function with a Q-network parameterized by $w$ → $\hat{Q}(s,a;w) \approx Q(s,a)$

With the statement above, let us recall State-Action Value Function Approximation and Model-Free Control from Lecture 5.

This is what we expect when we attempt to play an Atari game with RL:

  • game screen → state ($s_t$): we stack 4 frames (pixels) so that position and velocity can be represented
  • moves → action ($a_t$): the 18 joystick moves are the actions the agent can take
  • game score → reward ($r_t$): we prefer a higher score

Applying the state images through a CNN structure (a sketch of such a Q-network follows below)...

Input: 4 previous frames → Output: $Q(s,a)$ for the 18 joystick moves
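
A rough sketch of such a Q-network; the convolution sizes below follow the commonly cited DQN architecture for 84×84 frames, so treat them as one possible choice rather than exactly what the lecture specifies:

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to Q(s, a) for 18 joystick actions."""
    def __init__(self, n_actions: int = 18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q-value per joystick move
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84) -> Q-values: (batch, n_actions)
        return self.head(self.features(frames))

q_net = AtariQNetwork()
q_values = q_net(torch.zeros(1, 4, 84, 84))   # shape (1, 18)
greedy_action = q_values.argmax(dim=1)        # exploration (e.g. ε-greedy) not shown
```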

Why DQN?

Q-Learning with Value Function Approximation is indeed useful. However, it has a critical flaw: it can diverge, for the following reasons:

  • Correlation between samples
  • Non-stationary targets

DQN comes in here and addresses these issues with Experience Replay and Fixed Q-targets.

Experience Replay

In TD or Q-learning algorithms, we sample a tuple, use it for a single update, and discard it right away.

Now we store experience tuples in a replay buffer, from which they can be repeatedly sampled and replayed.


We repeat the following:

  • sample an experience tuple from the dataset → $(s,a,r,s') \in D$
  • compute target value → $r + \gamma \max_{a'}\hat{Q}(s',a';w)$
  • use SGD to update $w$

$$
\Delta w = \alpha(r+\gamma \max_{a'}\hat{Q}(s',a'; w)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$

Notice that the $\Delta w$ equation is identical to normal Q-learning. However, it is important to understand that we repeat this process!!! With every update the Q function changes, so the same tuple can return a different target value each time. (A minimal sketch of one replay update follows.)
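
A minimal sketch of one replay update, reusing `q_net` from the sketch above and assuming the buffer stores $(s,a,r,s')$ tuples of tensors; terminal-state handling and exploration are omitted for brevity:

```python
import random
from collections import deque

import torch

buffer = deque(maxlen=100_000)   # replay buffer D of (s, a, r, s') tuples
gamma = 0.99
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-4)   # q_net from the sketch above

def replay_update(batch_size: int = 32):
    # 1) sample experience tuples (s, a, r, s') from D
    s, a, r, s_next = zip(*random.sample(buffer, batch_size))
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)

    # 2) compute the target r + gamma * max_a' Q(s', a'; w)
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1).values

    # 3) SGD on (target - Q(s, a; w))^2, whose gradient matches the Δw rule above
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```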

Fixed Q-Targets

We fix the target weights used in the target calculation ($r+\gamma V(s')$ approximates the oracle $V^*$).

We represent the “fixed weights” as $w^-$, while $w$ denotes the weights that are being updated (a sketch with a separate target network follows the update rule below).

  • sample an experience tuple from the dataset → $(s,a,r,s') \in D$
  • compute target value → $r + \gamma \max_{a'}\hat{Q}(s',a';w^-)$
  • use SGD to update $w$

$$
\Delta w = \alpha(r+\gamma \max_{a'}\hat{Q}(s',a'; w^-)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$
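
A small sketch of fixed Q-targets, reusing `q_net`, `gamma`, and `optimizer` from the replay sketch above; how often to copy the weights is an arbitrary choice:

```python
import copy

import torch

target_net = copy.deepcopy(q_net)   # w^- : a frozen copy of the current weights w
target_net.requires_grad_(False)

def fixed_target_update(s, a, r, s_next):
    # the target uses the fixed weights w^-, not the weights w being updated
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # called every so often (e.g. every few thousand steps) to refresh w^- <- w
    target_net.load_state_dict(q_net.state_dict())
```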

  • Example

We can see that the sampling order makes a significant difference in the resulting value function.


3 Big Ideas for Deep RL

Double DQN

Recall Double Q-Learning from the maximization bias discussion in Lecture 4.

We extend this idea to DQN:

  • Current Q-network $w$ is used to select actions
  • Older Q-network $w^-$ is used to evaluate actions

$$
\Delta w = \alpha(r+\gamma \hat{Q}(s',\arg\max_{a'}\hat{Q}(s',a';w); w^-)-\hat{Q}(s,a; w))\nabla_w\hat{Q}(s,a;w)
$$

We select the action inside with $\arg\max_{a'}\hat{Q}(s',a';w)$ (the current network) and then evaluate that action with the fixed weights $w^-$, as in the sketch below.
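
A sketch of the Double DQN target, reusing `q_net` ($w$), `target_net` ($w^-$), and `gamma` from the sketches above; the rest of the update (squared-error loss and SGD step) stays the same:

```python
import torch

def double_dqn_target(r, s_next):
    with torch.no_grad():
        # select the greedy action a' with the CURRENT network w ...
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)
        # ... but evaluate that action with the OLDER / fixed network w^-
        return r + gamma * target_net(s_next).gather(1, best_a).squeeze(1)
```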

Prioritized Replay

We use a “priority function” when sampling tuples for the update: each tuple's sampling probability is proportional to its priority raised to an exponent $\alpha$.

If $\alpha = 0$, all tuples get uniform priority and we recover ordinary (uniform) experience replay. A small sketch of this sampling rule follows.
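
A small sketch of proportional prioritized sampling; the priorities would typically come from each tuple's DQN error, and the importance-sampling correction used in practice is omitted here:

```python
import numpy as np

def sample_prioritized(priorities: np.ndarray, batch_size: int, alpha: float) -> np.ndarray:
    """Sample tuple indices with probability p_i^alpha / sum_k p_k^alpha."""
    probs = priorities ** alpha
    probs = probs / probs.sum()
    return np.random.choice(len(priorities), size=batch_size, p=probs)

# e.g. priorities proportional to each stored tuple's absolute TD error
td_errors = np.array([0.5, 2.0, 0.1, 1.2])
sample_prioritized(td_errors, batch_size=2, alpha=0.0)  # alpha = 0 -> uniform sampling
sample_prioritized(td_errors, batch_size=2, alpha=1.0)  # favors large-error tuples
```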

Dueling DQN

We decouple $Q^\pi(s,a)$ into the state value $V^\pi(s)$ and the “advantage” $A^\pi(s,a)$ of an action (a sketch of a dueling head follows the equation below).

$$
A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)
$$
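
A sketch of a dueling head that could replace the final layers of the Q-network above; the feature size of 512 is an assumption, and the mean-subtraction is the standard trick that keeps $V$ and $A$ identifiable when recombining them into $Q$:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits features into a state-value stream V(s) and an advantage stream A(s, a)."""
    def __init__(self, n_actions: int = 18, feat_dim: int = 512):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)              # V(s)
        self.advantage = nn.Linear(feat_dim, n_actions)  # A(s, a)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                         # (batch, 1)
        a = self.advantage(features)                     # (batch, n_actions)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)
```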



