Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 5
We need to be able to generalize from our experience to make “good decisions”
Value Function Approximation (VFA)
From now on, we will represent the value function with a parameterized function:
$\hat{V}(s; w) \approx V^\pi(s)$ or $\hat{Q}(s, a; w) \approx Q^\pi(s, a)$
The input is a state (or state-action pair), the output is the estimated value, and $w$ is the parameter vector we learn.
Motivations
- We don't want to store and look up a separate value for every single state
- We want a more compact, generalized representation that still gives precise estimates
Benefits of Generalization
- Reduce required memory
- Reduce computation
- Reduce required experience (number of samples needed)
Function Approximators
Out of the many possible approximators (neural networks, linear combinations of features, Fourier bases, etc.), we will focus on differentiable ones.
Today we will focus on linear feature representations.
Review Gradient Descent
Consider a function $J(w)$ that is differentiable with respect to the parameter vector $w = (w_1, \dots, w_n)^\top$.
→ We want to find the $w$ that minimizes $J(w)$, using the gradient $\nabla_w J(w) = \left(\frac{\partial J(w)}{\partial w_1}, \dots, \frac{\partial J(w)}{\partial w_n}\right)^\top$.
Then we update our parameters in the direction of the negative gradient: $\Delta w = -\frac{1}{2}\alpha\,\nabla_w J(w)$,
where $\alpha$ is the step size (learning rate).
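As a concrete illustration, here is a minimal gradient-descent sketch in Python (NumPy); the quadratic objective, matrix `A`, vector `b`, and step size are arbitrary choices for illustration, not from the lecture.

```python
import numpy as np

# Hypothetical quadratic objective J(w) = ||A w - b||^2 and its gradient,
# used only to illustrate the generic gradient-descent update rule.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])

def J(w):
    return np.sum((A @ w - b) ** 2)

def grad_J(w):
    return 2 * A.T @ (A @ w - b)

w = np.zeros(2)          # initial parameters
alpha = 0.1              # step size (learning rate)
for _ in range(100):
    w = w - 0.5 * alpha * grad_J(w)   # Delta w = -(1/2) * alpha * grad J(w)

print(J(w), w)           # J should be close to 0 at the minimizer
```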
VFA for Prediction
We will first assume that there is an oracle 🔮 that returns the exact value $V^\pi(s)$ for any given state, so we know the target we want our approximator to match.
Stochastic Gradient Descent (SGD)
We use SGD to find the parameters $w$ that minimize the mean squared error between the true value and the approximation:
$J(w) = \mathbb{E}_\pi\left[(V^\pi(s) - \hat{V}(s; w))^2\right]$
SGD samples this gradient, giving the update $\Delta w = \alpha\,(V^\pi(s) - \hat{V}(s; w))\,\nabla_w \hat{V}(s; w)$.
We use feature vectors to represent a state:
$x(s) = (x_1(s), x_2(s), \dots, x_n(s))^\top$
Linear VFA for Prediction with Oracle
Represent the value function as a linear combination of features: $\hat{V}(s; w) = x(s)^\top w = \sum_{j=1}^n x_j(s)\,w_j$.
The objective is $J(w) = \mathbb{E}_\pi\left[(V^\pi(s) - x(s)^\top w)^2\right]$, and since $\nabla_w \hat{V}(s; w) = x(s)$, the SGD update is
$\Delta w = \alpha\,(V^\pi(s) - \hat{V}(s; w))\,x(s)$
i.e. update = step size × prediction error × feature value.
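A minimal NumPy sketch of this update, assuming a hypothetical oracle and hand-picked features (both are illustrative, not from the lecture):

```python
import numpy as np

def v_hat(x_s, w):
    """Linear value estimate: V_hat(s; w) = x(s)^T w."""
    return x_s @ w

def sgd_update(w, x_s, v_true, alpha):
    """One SGD step: w <- w + alpha * (V_pi(s) - V_hat(s; w)) * x(s)."""
    error = v_true - v_hat(x_s, w)
    return w + alpha * error * x_s

# Hypothetical oracle values and features for a 3-state example.
features = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 1.0])}
oracle   = {0: 1.0, 1: 2.0, 2: 3.0}

w = np.zeros(2)
for _ in range(1000):
    s = np.random.randint(3)
    w = sgd_update(w, features[s], oracle[s], alpha=0.05)

print(w)  # should approach weights that fit the oracle values
```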
*Convergence Guarantees for Linear VFA for Policy Evaluation
Define the mean squared error of a linear VFA for a particular policy $\pi$:
$MSVE(w) = \sum_{s \in S} d(s)\,\big(V^\pi(s) - \hat{V}^\pi(s; w)\big)^2$
where $d(s)$ is the stationary distribution of states under $\pi$ in the true decision process.
Monte-Carlo Linear VFA
We no longer have an oracle that tells us the true $V^\pi(s)$. Instead, we use the return $G_t$ as an unbiased (but noisy) sample of $V^\pi(s_t)$ and substitute it for the oracle value in the SGD update:
$\Delta w = \alpha\,(G_t - \hat{V}(s_t; w))\,x(s_t)$
This can be identically applied to both first-visit and every-visit MC.
Algorithm
1. Initialize $w = 0$, $k = 1$
2. Sample the $k$-th episode $(s_{k,1}, a_{k,1}, r_{k,1}, \dots, s_{k,L_k})$ under policy $\pi$
3. For each time step $t$ (first visit to state $s_t$ in the episode, for first-visit MC): compute the return $G_t$ and update $w \leftarrow w + \alpha\,(G_t - \hat{V}(s_t; w))\,x(s_t)$
4. $k \leftarrow k + 1$, go to step 2
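A runnable sketch of first-visit MC policy evaluation with a linear VFA, assuming episodes are supplied as lists of `(feature_vector, reward)` pairs (a hypothetical interface, not the lecture's):

```python
import numpy as np

def mc_linear_vfa(episodes, n_features, alpha=0.01, gamma=1.0):
    """First-visit Monte-Carlo policy evaluation with a linear VFA.

    episodes: list of episodes, each a list of (feature_vector, reward) pairs
              generated by following the policy being evaluated.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        G = 0.0
        returns = []                      # (features, return G_t) per time step
        for x_s, r in reversed(episode):
            G = r + gamma * G
            returns.append((x_s, G))
        returns.reverse()

        seen = set()                      # first-visit check (states identified by features here)
        for x_s, G in returns:
            key = tuple(x_s)
            if key in seen:
                continue
            seen.add(key)
            w += alpha * (G - x_s @ w) * x_s   # SGD step toward the sampled return
    return w
```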
Baird Example - Monte Carlo Policy Evaluation
Using the features, initial weight vector, and step size given in the lecture, calculate the MC update for the sampled trajectory.
Every reward along the sampled trajectory is 0, so the return from the visited states is $G = 0$,
→ the update for a visited state $s$ is $\Delta w = \alpha\,(0 - x(s)^\top w)\,x(s)$,
→ which moves the weights of that state's features toward predicting a value of 0.
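A quick arithmetic check of this update with made-up numbers (the feature vector, weights, and step size below are hypothetical placeholders, not the lecture's actual example values):

```python
import numpy as np

# Hypothetical numbers only: one visited state with return G = 0.
x_s   = np.array([2.0, 0.0, 1.0])   # x(s), made up for illustration
w     = np.array([1.0, 1.0, 1.0])   # current weights, made up
alpha = 0.5                          # step size, made up
G     = 0.0                          # return observed along the trajectory

v_hat   = x_s @ w                    # 2*1 + 0*1 + 1*1 = 3
delta_w = alpha * (G - v_hat) * x_s  # 0.5 * (0 - 3) * x(s) = [-3, 0, -1.5]
w_new   = w + delta_w                # [-2, 1, -0.5]
print(v_hat, delta_w, w_new)
```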
*Convergence Guarantees for Linear VFA for "Monte Carlo" Policy Evaluation
Monte-Carlo policy evaluation with a linear VFA converges to the weights with the minimum possible mean squared error:
$MSVE(w_{MC}) = \min_w \sum_{s \in S} d(s)\,\big(V^\pi(s) - \hat{V}^\pi(s; w)\big)^2$
where $d(s)$ is the stationary distribution of states under $\pi$.
Batch Monte-Carlo VFA
We take a set of episodes (a.k.a. a batch) for policy $\pi$, giving returns $G(s_1), \dots, G(s_N)$ for $N$ visited states, and minimize the batch squared error
$\arg\min_w \sum_{i=1}^N \big(G(s_i) - x(s_i)^\top w\big)^2$
Take the derivative and set it to 0, which gives the closed-form least-squares solution
$w = (X^\top X)^{-1} X^\top G$
where $X$ stands for the matrix of features of each of the $N$ states and $G$ is the vector of the $N$ returns.
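A minimal NumPy version of this closed-form solution (the feature matrix and returns below are made-up placeholders):

```python
import numpy as np

def batch_mc_weights(X, G):
    """Least-squares fit: w = (X^T X)^{-1} X^T G.

    X: (N, n) matrix whose i-th row is x(s_i)
    G: (N,) vector of Monte-Carlo returns G(s_i)
    """
    # np.linalg.lstsq solves the same normal equations in a numerically safer way.
    w, *_ = np.linalg.lstsq(X, G, rcond=None)
    return w

# Hypothetical batch: 4 visited states with 2 features each, and their returns.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
G = np.array([1.0, 2.0, 3.0, 4.0])
print(batch_mc_weights(X, G))   # approximately [1., 2.]
```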
Temporal Difference (TD) Learning VFA
Instead of the full return $G_t$, TD uses the bootstrapped TD target $r + \gamma\hat{V}(s'; w)$ in place of the oracle value, giving the update
$\Delta w = \alpha\,\big(r + \gamma\hat{V}(s'; w) - \hat{V}(s; w)\big)\,x(s)$
Algorithm
Basically identical to the Monte-Carlo method, except the weights are updated after every transition using the TD target, rather than waiting until the end of the episode for the return.
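A sketch of TD(0) policy evaluation with a linear VFA, assuming transitions arrive as `(x_s, r, x_s_next, done)` tuples (a hypothetical interface):

```python
import numpy as np

def td0_update(w, x_s, r, x_s_next, done, alpha=0.01, gamma=0.99):
    """One TD(0) step: w <- w + alpha * (r + gamma*V_hat(s') - V_hat(s)) * x(s)."""
    v_s = x_s @ w
    v_next = 0.0 if done else x_s_next @ w   # bootstrap unless s' is terminal
    td_target = r + gamma * v_next
    return w + alpha * (td_target - v_s) * x_s

# Hypothetical stream of transitions for a 2-feature problem.
w = np.zeros(2)
transitions = [
    (np.array([1.0, 0.0]), 0.0, np.array([0.0, 1.0]), False),
    (np.array([0.0, 1.0]), 1.0, np.array([0.0, 0.0]), True),
]
for x_s, r, x_s_next, done in transitions * 500:
    w = td0_update(w, x_s, r, x_s_next, done)
print(w)   # V_hat of the second state -> ~1, first state -> ~gamma * 1
```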
Baird Example - Temporal Difference Policy Evaluation
Using the same features, weight vector, and step size as above, calculate the TD(0) update for a sampled transition from the trajectory.
With reward $r = 0$, the update for the sampled transition $(s, a, 0, s')$ is
$\Delta w = \alpha\,\big(0 + \gamma\hat{V}(s'; w) - \hat{V}(s; w)\big)\,x(s)$
*Convergence Guarantees for Linear VFA for "Temporal Difference" Policy Evaluation
TD(0) policy evaluation with a linear VFA converges to weights within a constant factor of the minimum mean squared error:
$MSVE(w_{TD}) \le \frac{1}{1-\gamma}\min_w \sum_{s \in S} d(s)\,\big(V^\pi(s) - \hat{V}^\pi(s; w)\big)^2$
where $d(s)$ is the stationary distribution of states under $\pi$.
Control with Value Function Approximation
Now we apply the VFA above to state-action values: $\hat{Q}(s, a; w) \approx Q^\pi(s, a)$.
This process generally involves bootstrapping and off-policy learning combined with function approximation, a combination that can make learning unstable and is not guaranteed to converge.
We take the exact same approach as above for state values → SGD on the squared error between a target and $\hat{Q}(s, a; w)$.
The features now describe both the state and the action:
$x(s, a) = (x_1(s, a), x_2(s, a), \dots, x_n(s, a))^\top$
The linear state-action value representation with features and weights is
$\hat{Q}(s, a; w) = x(s, a)^\top w = \sum_{j=1}^n x_j(s, a)\,w_j$
Stochastic Gradient Descent
The objective function is $J(w) = \mathbb{E}_\pi\left[(Q^\pi(s, a) - \hat{Q}(s, a; w))^2\right]$.
We use the same derivation as above: since $Q^\pi(s, a)$ is unknown, we substitute a target for it in the SGD update $\Delta w = \alpha\,(\text{target} - \hat{Q}(s, a; w))\,\nabla_w \hat{Q}(s, a; w)$.
Gradient descent terms for MC, SARSA, and Q-Learning
- Monte-Carlo: target is the return → $\Delta w = \alpha\,(G_t - \hat{Q}(s_t, a_t; w))\,\nabla_w \hat{Q}(s_t, a_t; w)$
- SARSA: target is the on-policy TD target → $\Delta w = \alpha\,(r + \gamma\hat{Q}(s', a'; w) - \hat{Q}(s, a; w))\,\nabla_w \hat{Q}(s, a; w)$
- Q-Learning: target is the max TD target → $\Delta w = \alpha\,(r + \gamma\max_{a'}\hat{Q}(s', a'; w) - \hat{Q}(s, a; w))\,\nabla_w \hat{Q}(s, a; w)$
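A compact sketch of these three targets with a linear $\hat{Q}$; the block one-hot feature construction and the transition numbers are hypothetical placeholders, not prescribed by the lecture:

```python
import numpy as np

def x_sa(x_s, a, n_actions):
    """Block one-hot state-action features: x(s) placed in action a's block."""
    out = np.zeros(len(x_s) * n_actions)
    out[a * len(x_s):(a + 1) * len(x_s)] = x_s
    return out

def q_hat(x_s, a, w, n_actions):
    """Linear state-action value: Q_hat(s, a; w) = x(s, a)^T w."""
    return x_sa(x_s, a, n_actions) @ w

def update(w, x_s, a, target, alpha, n_actions):
    """Generic SGD step: w <- w + alpha * (target - Q_hat(s,a;w)) * x(s,a)."""
    phi = x_sa(x_s, a, n_actions)
    return w + alpha * (target - phi @ w) * phi

# Placeholder experience (hypothetical numbers): (s, a, r, s', a') and an MC return G_t.
s_feat, a, r = np.array([1.0, 0.0]), 0, 1.0
s2_feat, a2  = np.array([0.0, 1.0]), 1
G_t, alpha, gamma, n_actions = 2.5, 0.05, 0.99, 3
w = np.zeros(2 * n_actions)

# Monte-Carlo: target is the sampled return G_t.
w = update(w, s_feat, a, G_t, alpha, n_actions)
# SARSA: target is r + gamma * Q_hat(s', a') for the action a' actually taken next.
w = update(w, s_feat, a, r + gamma * q_hat(s2_feat, a2, w, n_actions), alpha, n_actions)
# Q-learning: target is r + gamma * max_a' Q_hat(s', a').
best = max(q_hat(s2_feat, b, w, n_actions) for b in range(n_actions))
w = update(w, s_feat, a, r + gamma * best, alpha, n_actions)
```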
Convergence of Control Method with VFA
Control with VFA does not inherit the tabular convergence guarantees: Monte-Carlo control and SARSA with a linear VFA tend to chatter around a near-optimal value function rather than converge, and Q-learning with function approximation (especially with a nonlinear approximator) can diverge.