
Stanford CS234 Lecture 1

by 누워있는말티즈 2022. 8. 4.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 1

What is Reinforcement Learning (RL)?

How an intelligent agent learns to make good sequences of decisions through repeated interactions with the world

Key aspects of RL

  • Optimization
  • → the goal is to find an optimal way to make decisions!
  • Delayed consequences
  • → decisions made now can impact situations far in the future...
  • Exploration → the agent only gets censored data (the reward for the decision it actually made): it doesn’t know what would have happened had it chosen differently
  • → decisions impact what the agent learns
  • → the agent has to learn about the world by acting in it
  • Generalization
  • → the policy is a mapping from past experience to actions, and it needs to work in situations the agent has never encountered before

Comparing RL with related AI approaches

  • AI Planning (e.g., Go) → does not require Exploration, since a model of the world is already given
  • → involves Optimization, Generalization, Delayed Consequences
  • Supervised Machine Learning → does not involve Exploration or Delayed Consequences, since the dataset and labels are already given
  • → the learner can immediately see whether its decisions were right or wrong (e.g., for classification problems)
  • → involves Optimization, Generalization
  • Unsupervised Machine Learning → does not involve Exploration or Delayed Consequences, since the dataset is given (though labels are not)
  • → involves Optimization, Generalization
  • Imitation Learning → does not require Exploration
  • → observes and learns from another agent’s experiences
  • → involves Optimization, Generalization, Delayed Consequences

Sequential Decision Making

Goal : choose a sequence of actions to maximize total expected future reward

→ may require strategic behavior to achieve high rewards (need to balance immediate & long-term rewards)

Agent & World Interaction (Discrete Time)

At each time step $t$ :

  • Agent takes action $a_t$
  • The action $a_t$ affects the world, which emits observation $o_t$ and reward $r_t$
  • Agent receives observation $o_t$ and reward $r_t$

Agent decides action based on history : $h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)$

State → information ‘assumed’ to determine what happens next; a function of history, $s_t = f(h_t)$
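
In code, this loop might look like the minimal sketch below (not lecture code; `world`, `policy`, and `state_fn` are hypothetical stand-ins for the environment, the agent’s policy, and the state-construction function $s_t = f(h_t)$):

```python
# A minimal sketch of the discrete-time agent-world loop (not from the lecture).
# `world` is a hypothetical environment exposing reset() and step(action),
# `policy` maps the agent's state to an action, and `state_fn` builds the
# agent's state from the history, i.e. s_t = f(h_t).

def run_episode(world, policy, state_fn=lambda h: tuple(h), num_steps=100):
    history = []                          # h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)
    world.reset()                         # start a new episode
    for t in range(num_steps):
        state = state_fn(history)         # agent's state is a function of history
        action = policy(state)            # agent takes action a_t
        obs, reward = world.step(action)  # world emits observation o_t and reward r_t
        history.append((action, obs, reward))
    return history
```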

World State

The world state is the actual state of the world, which determines how the world generates the next observations and rewards.

The agent usually sees none or only a small part of the world state; what the agent actually uses is its own, often much smaller, agent state.


Markov Assumption

  • We assume that the state used by the agent is a sufficient statistic of the history
  • The future is independent of the past given the present
  • State $s_t$ is Markov if and only if : “in order to predict the future, you only need the current state of the environment”
  • $$
    p(s_{t+1}|s_t,a_t)=p(s_{t+1}|h_t,a_t)
    $$

In practice we often simplify by using the most recent observation, or a short stack of recent observations (say the last 4), as the state:

$$
s_t=o_t
$$
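
One concrete way to do this (a sketch under the assumption that the last $k$ observations are enough to be roughly Markov) is to keep a rolling buffer of recent observations:

```python
from collections import deque

# Sketch: approximate a Markov state by keeping only the k most recent observations.
k = 4
recent_obs = deque(maxlen=k)   # automatically drops observations older than k steps

def agent_state(new_obs):
    recent_obs.append(new_obs)
    return tuple(recent_obs)   # s_t = (o_{t-k+1}, ..., o_t); with k = 1 this is s_t = o_t
```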

Types of Sequential Decision Processes

Whether actions affect the next observation, observability, and how the world changes are three axes that classify types of sequential decision processes.

  • Bandits
  • → actions have no influence on the next observation
  • Full Observability / Markov Decision Process (MDP) → the agent’s state is the same as the world state : $s_t = o_t$
  • → actions influence future observations
  • Partial Observability / Partially Observable Markov Decision Process (POMDP) → the agent’s state is not the same as the world state, so the agent constructs its own state, e.g. $s_t = h_t$ or a belief over world states
  • (e.g.) a poker game where a player only knows her own cards while the full card distribution strongly affects the outcome
  • → actions influence future observations

How the world changes

  • Deterministic : given history & action, a single observation & reward is possible
  • → common assumption in robotics and control
  • Stochastic : given history & action, multiple observations & rewards are possible
  • → common for customers, patients... difficult to model (see the sketch below)
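
The difference can be seen in a small sketch of a hypothetical world’s step function (illustrative only, not from the lecture), where the deterministic version has exactly one outcome and the stochastic version has several:

```python
import random

# Toy 1-D world where the state is an integer position and the action is +1 or -1.

def step_deterministic(state, action):
    # exactly one possible next observation & reward
    return state + action, 1.0

def step_stochastic(state, action):
    # several outcomes are possible; here the action succeeds only 80% of the time
    if random.random() < 0.8:
        return state + action, 1.0
    return state, 0.0
```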

Components of RL Algorithms

**We assume rewards occur when the agent is in a specific state.**

We write the reward as $r(s,a)$ : the agent arrives at a state, chooses an action, receives the reward, and then the transition to the next state occurs.

We illustrate these components with the example below.

Model

A representation of how the world changes in response to the agent’s actions

  • Transition or dynamics model that predicts the next agent state
  • the probability that the next state is $s'$ given that the current state is $s$ and the agent takes action $a$

$$
p(s_{t+1}=s'|s_t=s,a_t=a)
$$

  • Reward model that predicts the immediate reward after an action
  • the reward at a certain (state, action) = the expected value over all possible rewards at that (state, action)

$$
r(s_t=s,a_t=a)=E[r_t|s_t=s, a_t=a]
$$

Model may be wrong!!
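
As a rough sketch (not the lecture’s code; all numbers below are made up for illustration), a tabular model can simply store these two quantities:

```python
# Sketch of a tabular model: P[s][a] is a distribution over next states,
# R[s][a] is the expected immediate reward. All numbers are illustrative.

P = {
    "s1": {"right": {"s1": 0.1, "s2": 0.9}},   # p(s_{t+1} = s' | s_t = s1, a_t = right)
    "s2": {"right": {"s2": 0.1, "s3": 0.9}},
}
R = {
    "s1": {"right": 0.0},                      # r(s1, right) = E[r_t | s_t = s1, a_t = right]
    "s2": {"right": 1.0},
}

def predicted_next_state_dist(s, a):
    return P[s][a]

def predicted_reward(s, a):
    return R[s][a]
```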

Policy ($\pi$)

A function describing how the agent chooses its action

$\pi : S \rightarrow A$ , a mapping from states to actions

  • Deterministic policy :

$$
\pi(s)=a
$$

  • Stochastic policy :
  • $$
    \pi(a|s)=P(a_t=a|s_t=s)
    $$

Example of a deterministic policy :

$\pi(s_1)=\pi(s_2)=\dots=\pi(s_7)=\text{move right}$
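
In code, the two kinds of policy might look like this sketch, reusing the “move right” example above (the action names and probabilities are illustrative, not from the lecture):

```python
import random

ACTIONS = ["left", "right"]

# Deterministic policy: pi(s) = a  (always "move right", as in the example above)
def pi_deterministic(s):
    return "right"

# Stochastic policy: pi(a|s) = P(a_t = a | s_t = s), with illustrative probabilities
def pi_stochastic(s):
    probs = {"left": 0.2, "right": 0.8}
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
```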

Value Function (V)

The expected future rewards from being in a state (and possibly taking an action) while following a particular policy

$V^{\pi}$ : expected discounted sum of future rewards under a particular policy

$$
V^{\pi}(s_t=s)=E_\pi[r_t + \gamma r_{t+1}+\gamma ^2 r_{t+2}+\gamma ^3 r_{t+3}+...|s_t=s]
$$

where the discount factor $\gamma$ lies between 0 and 1 and weights immediate against future rewards

Example of the value at each state with $\gamma = 0$ → $V^{\pi}(s_t=s)=E_\pi[r_t|s_t=s]$

$\pi(s_1)=\pi(s_2)=\dots=\pi(s_7)=\text{move right}$

If $\gamma$ were not zero, fewer of the values would be zero, since rewards from later states would propagate back to earlier ones...
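
To make the definition concrete, the sketch below (with made-up reward sequences) computes the discounted return and a simple Monte Carlo estimate of $V^{\pi}$; `sample_rollout` is a hypothetical rollout generator, not something from the lecture:

```python
import numpy as np

def discounted_return(rewards, gamma):
    # sum_k gamma^k * r_{t+k}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def mc_value_estimate(sample_rollout, gamma, n_rollouts=1000):
    # V^pi(s) is approximated by the average discounted return over rollouts
    # starting from s under pi; `sample_rollout` is a hypothetical function
    # returning one sampled list of rewards [r_t, r_{t+1}, ...].
    return np.mean([discounted_return(sample_rollout(), gamma) for _ in range(n_rollouts)])

# With gamma = 0 only the immediate reward counts:
print(discounted_return([0, 0, 1, 1], gamma=0.0))   # -> 0
print(discounted_return([0, 0, 1, 1], gamma=0.5))   # -> 0.375
```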

Types of RL Agents

Figure from David Silver’s Reinforcement Learning course: agents are categorized by which components they use (value function, policy, model), giving value-based, policy-based, and actor-critic agents, each either model-free or model-based.

Key Challenges in Learning to make Good Decisions

  • Planning → an algorithm computes how to act to maximize reward
  • → given a model of how the world works
  • Reinforcement Learning → the agent improves its policy
  • → the agent doesn’t know how the world works, so it interacts with the world to learn

The Exploration-Exploitation Dilemma

The agent only learns from the experiences of actions it has actually taken; it has no idea what would have happened with other actions → dilemma!!

The agent needs to balance between Exploration and Exploitation

  • Exploration : trying new things that might lead the agent to better outcomes in the future
  • Exploitation : choosing actions that are expected to yield the greatest rewards according to past experience

Such balancing may sacrifice some immediate reward for the possibility of greater reward in the future.
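
A standard (not lecture-specific) way to strike this balance is an epsilon-greedy rule: with small probability explore, otherwise exploit the action that currently looks best. A minimal sketch, assuming we keep per-action reward estimates in a dict:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: dict mapping action -> current estimate of its expected reward
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore: try something at random
    return max(q_values, key=q_values.get)        # exploit: best-looking action so far

# Example: mostly picks "right", occasionally tries "left"
print(epsilon_greedy({"left": 0.2, "right": 0.8}))
```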
