
Stanford (7)

Stanford CS234 Lecture 10

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 10. Continuing our discussion of updating parameters given the gradient. Local approximation: we couldn't calculate the equation above because we had no idea what $\tilde{\pi}$ is. So, as an approximation, we replace that term with the one from the previous policy: we take policy $\pi^{i}$, run it out, collect $D$ trajectories, and use them to obtain the distribution .. 2022. 8. 12.
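The local approximation can be sketched in a few lines. The excerpt cuts off before the formula, so this assumes the TRPO-style surrogate $L_{\pi}(\tilde{\pi}) = V(\pi) + \sum_s \mu_{\pi}(s) \sum_a \tilde{\pi}(a \mid s) A_{\pi}(s, a)$. In it, the unknown state distribution of $\tilde{\pi}$ is replaced by the discounted state-visitation distribution $\mu_{\pi}$ of the old policy, estimated from the sampled trajectories. The tabular setting, the given advantages $A_{\pi}$, and the function names `discounted_state_visitation` and `local_approximation` are all illustrative assumptions.

```python
import numpy as np

def discounted_state_visitation(trajectories, n_states, gamma):
    """Monte Carlo estimate of mu_pi(s) = sum_t gamma^t P(s_t = s),
    using trajectories sampled with the OLD policy pi^i.
    Each trajectory is a list of integer state indices (assumed format)."""
    mu = np.zeros(n_states)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu[s] += gamma ** t
    return mu / len(trajectories)

def local_approximation(v_old, mu_old, pi_new, adv_old):
    """Surrogate L_pi(pi_tilde) = V(pi) + sum_s mu_pi(s) sum_a pi_tilde(a|s) A_pi(s, a).
    mu_old: (n_states,), pi_new and adv_old: (n_states, n_actions) arrays.
    The old policy's state distribution stands in for the new one's."""
    return float(v_old + np.sum(mu_old[:, None] * pi_new * adv_old))
```

As a sanity check, if $\tilde{\pi}$ equals the old policy, then $\sum_a \pi(a \mid s) A_{\pi}(s, a) = 0$ in every state, and the surrogate reduces to $V(\pi)$.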

Stanford CS234 Lecture 9

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 9. Continuing the discussion of policy gradient: recall that our goal is to converge to a local optimum as quickly as possible. We want each policy update to be a monotonic improvement → guarantees convergence (empirically) → we simply don't want to get fired... Recall that last time we expressed the gradient of the value function as below; this term is unbiased but .. 2022. 8. 11.
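The excerpt stops before the formula. Suppose it is the usual likelihood-ratio (REINFORCE-style) estimate $\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$. In that case, a minimal sketch for a tabular softmax policy could look like the following. The tabular softmax parameterization, the undiscounted return $R(\tau)$, the trajectory format, and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(.|s) for a tabular softmax policy; theta has shape (n_states, n_actions)."""
    z = theta[s] - theta[s].max()      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Score function d/dtheta log pi_theta(a|s) for the tabular softmax:
    only row s is non-zero, and it equals onehot(a) - pi_theta(.|s)."""
    g = np.zeros_like(theta)
    g[s] = -softmax_policy(theta, s)
    g[s, a] += 1.0
    return g

def likelihood_ratio_gradient(theta, trajectories):
    """(1/m) * sum_i R(tau_i) * sum_t grad log pi_theta(a_t|s_t).
    Each trajectory is a list of (state, action, reward) tuples."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        ret = sum(r for _, _, r in traj)                        # undiscounted return R(tau)
        score = sum(grad_log_pi(theta, s, a) for s, a, _ in traj)
        grad += ret * score
    return grad / len(trajectories)
```

The estimate uses only sampled trajectories, so it is unbiased. With a handful of trajectories, though, it can swing a lot from batch to batch.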

Stanford CS234 Lecture 8

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 8. Policy-Based Reinforcement Learning. Recall last lecture → we learned to find the state value ($V$) or state-action value ($Q$) with parameter $w$ (or $\theta$) and used it to build a (hopefully) optimal policy ($\pi$). Today we'll work with the policy $\pi^{\theta}$ directly, where $\pi^{\theta}(s, a) = P[a \mid s; \theta]$. Our goal is to find a policy .. 2022. 8. 11.
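One common way to realize $P[a \mid s; \theta]$ is a softmax over linear scores. The sketch below assumes features $\phi(s, a)$ and weights $\theta$ with $\pi^{\theta}(s, a) \propto \exp(\phi(s, a)^{\top}\theta)$. The excerpt does not say which parameterization the post uses, so this choice and the function names are assumptions.

```python
import numpy as np

def softmax_policy(theta, features):
    """P[a | s; theta] for every action a in state s.
    features: (n_actions, d) array whose row a is phi(s, a); theta: (d,) weights.
    Probabilities are proportional to exp(phi(s, a) . theta)."""
    logits = features @ theta
    logits = logits - logits.max()     # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sample_action(theta, features, rng):
    """Draw an action a ~ pi^theta(s, .) using a NumPy Generator."""
    p = softmax_policy(theta, features)
    return int(rng.choice(len(p), p=p))
```

For example, with one-hot features and $\theta = (0, \ln 3)$, the two actions get probabilities 0.25 and 0.75.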

Stanford CS234 Lecture 7

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 7. Imitation Learning. There are occasions where we would need rewards that are dense in time, or where each iteration is super expensive → autonomous driving kind of stuff. So we summon an expert to demonstrate trajectories. Our problem setup: we will talk about the three methods below, and their goals are... Behavioral Cloning: learn directly from the teacher's policy In.. 2022. 8. 9.
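Behavioral cloning treats the expert's (state, action) pairs as a supervised-learning dataset. Here is a minimal sketch that assumes discrete states and actions and a tabular policy. A simple fit picks the expert's most frequent action in each state, which is the mode of the maximum-likelihood tabular policy. `behavioral_cloning` is a hypothetical helper, not code from the lecture.

```python
from collections import Counter, defaultdict

def behavioral_cloning(demonstrations):
    """Fit a deterministic tabular policy from expert (state, action) pairs:
    in each demonstrated state, pick the action the expert chose most often."""
    counts = defaultdict(Counter)
    for s, a in demonstrations:
        counts[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}
```

States the expert never visited get no entry in the learned policy. This hints at why a cloned policy can drift once it leaves the demonstrated states.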

Stanford CS234 Lecture 5

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 5. We need to be able to generalize from our experience to make "good decisions". Value Function Approximation (VFA): from now on, we will represent the state value function $V(s)$ or the state-action value function $Q(s, a)$ with a parameterized function. The input is a state or a state-action pair, and the output is the corresponding value. The parameter $w$ here would be a vector; in simple terms, such as DN.. 2022. 8. 8.
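To make the idea concrete, take the simplest case: a linear approximator $\hat{V}(s; w) = x(s)^{\top} w$, trained by stochastic gradient descent on the squared error to some target (for instance a Monte Carlo return). The excerpt cuts off before the details, so the linear form, the choice of target, and the function names are assumptions.

```python
import numpy as np

def v_hat(w, x):
    """Linear VFA: v_hat(s; w) = x(s) . w, where x is the feature vector of state s."""
    return float(x @ w)

def sgd_update(w, x, target, alpha):
    """One stochastic gradient step on 0.5 * (target - v_hat(s; w))^2.
    For a linear model, grad_w v_hat = x, so w <- w + alpha * (target - v_hat) * x."""
    return w + alpha * (target - v_hat(w, x)) * x
```

For example, starting from $w = 0$ with $x(s) = (1, 2)$, a target of 3, and $\alpha = 0.1$, one step gives $w = (0.3, 0.6)$. The prediction moves from 0 to 1.5.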

Stanford CS234 Lecture 4

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 4. → We evaluated a policy in the model-free setting last time. How can an agent start making good decisions when it doesn't know how the world works? How do we make a "good decision"? Learning to control involves... Optimization: we want maximal expected rewards. Delayed consequences: it may take time to realize whether a previous action was goo.. 2022. 8. 5.
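Delayed consequences show up directly in the return: a reward that arrives late still counts toward earlier steps, discounted by $\gamma$. The sketch below assumes the convention $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$, and `discounted_returns` is a hypothetical helper name.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step of one episode, computed backwards via
    G_t = r_t + gamma * G_{t+1}; rewards[t] is the reward received at step t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

With rewards `[0, 0, 1]` and $\gamma = 0.9$, the returns are about `[0.81, 0.9, 1.0]` (up to floating-point rounding). The first step gets credit for a reward that arrives two steps later.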

