Stanford7 Stanford CS234 Lecture 10 Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 10 continuing our discussion over Updating Parameters Given the Gradient Local Approximation we couldn’t calculate equation above because we had no clue of what $\tilde{\pi}$ is. So for approximation, we replace the term with previous policy. we take policy $\pi^i$ run it out, get $D$ trajectories and use them to obtain distribution .. 2022. 8. 12. Stanford CS234 Lecture 9 Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 9 Continuing discussion about gradient descent, recall that our goal is to converge ASAP to a local optima. We want our policy update to be a monotonic improvement. → guarantees to converge (emphirical) → we simply don’t want to get fired... Recall last time, we expressed gradient of value function as below this term is unbiased but .. 2022. 8. 11. Stanford CS234 Lecture 8 Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 8 Policy-Based Reinforcement Learning Recall last lecture → where we learned to find state-value($V$) or state-action value($Q$) for parameter $w$(or $\theta$) and use such to build (hopefully)optimal policy($\pi$) Today we’ll take that policy $\pi^\theta$ as parameter $$ \pi^\theta(s,a)=P[a|s;\theta] $$ our goal is to find a policy .. 2022. 8. 11. Stanford CS234 Lecture 7 Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 7 Imitation Learning there are occasions where rewards are dense in time or each iteration is super expensive → autonomous driving kind of stuff So we summon an expert to demonstrate trajectories Our problem Setup we will talk about three methods below and their goal are... Behavoiral Cloning : learn directly from teacher’s policy In.. 2022. 8. 9. Stanford CS234 Lecture 5 Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 5 We need to be able to generalize from our experience to make “good decisions” Value Function Approximation(VFA) from now on, we will represent $(s,a)$ value function with parameterized function input would be state or state-action pair, output would be value in any kinds. parameter $w$ here would a vector in simple terms such as DN.. 2022. 8. 8. Stanford CS234 Lecture 4 Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4 →We evaluated policy in model-free situation last time How can an agent start making good decisions when it doen’t know how the world works: How do we make a “good decision”? Learning to Control Invovles... Optimization : we want maximal expected rewards Delayed Consequences : may take time to realize wheter previous action aws goo.. 2022. 8. 5. 이전 1 2 다음 반응형