
Stanford CS234 Lecture 1

by 누워있는말티즈 2022. 8. 4.

Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 1

What is Reinforcement Learning (RL)?

How an intelligent agent learns to make good sequences of decisions through repeated interactions with the world

Key aspects of RL

  • Optimization
  • → the goal is to find an optimal way to make decisions!
  • Delayed consequences
  • → decisions made now can impact situations far in the future...
  • Exploration → the agent only gets censored data (the reward for the decision it actually made): it doesn’t know what would have happened had it chosen differently
  • → decisions impact what the agent learns
  • → the agent has to learn about the world by acting in it
  • Generalization
  • → the policy is a mapping from past experience to actions, and it needs to work in situations the agent has never encountered before

Comparing RL with related AI approaches

  • AI Planning (e.g., Go) → does not require Exploration, since a model of the world is already given
  • → involves Optimization, Generalization, Delayed Consequences
  • Supervised Machine Learning → does not involve Exploration or Delayed Consequences, since the dataset and labels are already given
  • → the learner can immediately see whether its decisions were right or wrong (e.g., for classification problems)
  • → involves Optimization, Generalization
  • Unsupervised Machine Learning → does not involve Exploration or Delayed Consequences, since the dataset is given (though labels are not)
  • → involves Optimization, Generalization
  • Imitation Learning → does not require Exploration
  • → observes and learns from another agent’s experiences
  • → involves Optimization, Generalization, Delayed Consequences

Sequential Decision Making

Goal : choose a sequence of actions to maximize total expected future reward

→ may require strategic behavior to achieve high rewards (need to balance immediate & long-term rewards)

Agent & World Interaction (Discrete Time)

At each time step $t$ :

  • Agent takes action $a_t$
  • The action $a_t$ affects the world, which emits observation $o_t$ and reward $r_t$
  • Agent receives observation $o_t$ and reward $r_t$

Agent decides action based on history : $h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)$

State → information ‘assumed’ to determine what happens next; a function of history, $s_t = f(h_t)$
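
In code, this loop might look like the minimal sketch below (not lecture code; `world`, `policy`, and `state_fn` are hypothetical stand-ins for the environment, the agent’s policy, and the state-construction function $s_t = f(h_t)$):

```python
# A minimal sketch of the discrete-time agent-world loop (not from the lecture).
# `world` is a hypothetical environment exposing reset() and step(action),
# `policy` maps the agent's state to an action, and `state_fn` builds the
# agent's state from the history, i.e. s_t = f(h_t).

def run_episode(world, policy, state_fn=lambda h: tuple(h), num_steps=100):
    history = []                          # h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)
    world.reset()                         # start a new episode
    for t in range(num_steps):
        state = state_fn(history)         # agent's state is a function of history
        action = policy(state)            # agent takes action a_t
        obs, reward = world.step(action)  # world emits observation o_t and reward r_t
        history.append((action, obs, reward))
    return history
```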

World State

The world state is the actual state of the world, which determines how the world generates the next observations and rewards.

The agent usually sees none or only a small part of the world state; what the agent actually uses is its own, often much smaller, agent state.


Markov Assumption

  • We assume that the state used by the agent is a sufficient statistic of the history
  • The future is independent of the past given the present
  • State $s_t$ is Markov if and only if : “in order to predict the future, you only need the current state of the environment”
  • $$
    p(s_{t+1}|s_t,a_t)=p(s_{t+1}|h_t,a_t)
    $$

In practice we often simplify by using the most recent observation, or a short stack of recent observations (say the last 4), as the state:

$$
s_t=o_t
$$
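
One concrete way to do this (a sketch under the assumption that the last $k$ observations are enough to be roughly Markov) is to keep a rolling buffer of recent observations:

```python
from collections import deque

# Sketch: approximate a Markov state by keeping only the k most recent observations.
k = 4
recent_obs = deque(maxlen=k)   # automatically drops observations older than k steps

def agent_state(new_obs):
    recent_obs.append(new_obs)
    return tuple(recent_obs)   # s_t = (o_{t-k+1}, ..., o_t); with k = 1 this is s_t = o_t
```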

Types of Sequential Decision Processes

Whether actions affect the next observation, observability, and how the world changes are three axes that classify types of sequential decision processes.

  • Bandits
  • → actions have no influence on the next observation
  • Full Observability / Markov Decision Process (MDP) → the agent’s state is the same as the world state : $s_t = o_t$
  • → actions influence future observations
  • Partial Observability / Partially Observable Markov Decision Process (POMDP) → the agent’s state is not the same as the world state, so the agent constructs its own state, e.g. $s_t = h_t$ or a belief over world states
  • (e.g.) a poker game where a player only knows her own cards while the full card distribution strongly affects the outcome
  • → actions influence future observations

How the world changes

  • Deterministic : given history & action, a single observation & reward is possible
  • → common assumption in robotics and control
  • Stochastic : given history & action, multiple observations & rewards are possible
  • → common for customers, patients... difficult to model (see the sketch below)
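
The difference can be seen in a small sketch of a hypothetical world’s step function (illustrative only, not from the lecture), where the deterministic version has exactly one outcome and the stochastic version has several:

```python
import random

# Toy 1-D world where the state is an integer position and the action is +1 or -1.

def step_deterministic(state, action):
    # exactly one possible next observation & reward
    return state + action, 1.0

def step_stochastic(state, action):
    # several outcomes are possible; here the action succeeds only 80% of the time
    if random.random() < 0.8:
        return state + action, 1.0
    return state, 0.0
```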

Components of RL Algorithms

**We assume rewards occur when the agent is in a specific state.**

We write the reward as $r(s,a)$ : the agent arrives at a state, chooses an action, receives the reward, and then the transition to the next state occurs.

We illustrate these components with the example below.

Model

A representation of how the world changes in response to the agent’s actions

  • Transition or dynamics model that predicts the next agent state
  • the probability that the next state is $s'$ given that the current state is $s$ and the agent takes action $a$

$$
p(s_{t+1}=s'|s_t=s,a_t=a)
$$

  • Reward model that predicts the immediate reward after an action
  • the reward at a certain (state, action) = the expected value over all possible rewards at that (state, action)

$$
r(s_t=s,a_t=a)=E[r_t|s_t=s, a_t=a]
$$

Model may be wrong!!
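
As a rough sketch (not the lecture’s code; all numbers below are made up for illustration), a tabular model can simply store these two quantities:

```python
# Sketch of a tabular model: P[s][a] is a distribution over next states,
# R[s][a] is the expected immediate reward. All numbers are illustrative.

P = {
    "s1": {"right": {"s1": 0.1, "s2": 0.9}},   # p(s_{t+1} = s' | s_t = s1, a_t = right)
    "s2": {"right": {"s2": 0.1, "s3": 0.9}},
}
R = {
    "s1": {"right": 0.0},                      # r(s1, right) = E[r_t | s_t = s1, a_t = right]
    "s2": {"right": 1.0},
}

def predicted_next_state_dist(s, a):
    return P[s][a]

def predicted_reward(s, a):
    return R[s][a]
```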

Policy ($\pi$)

A function describing how the agent chooses its action

$\pi : S \rightarrow A$ , a mapping from states to actions

  • Deterministic policy :

$$
\pi(s)=a
$$

  • Stochastic policy :
  • $$
    \pi(a|s)=P(a_t=a|s_t=s)
    $$

Example of a deterministic policy :

$\pi(s_1)=\pi(s_2)=\dots=\pi(s_7)=\text{move right}$
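
In code, the two kinds of policy might look like this sketch, reusing the “move right” example above (the action names and probabilities are illustrative, not from the lecture):

```python
import random

ACTIONS = ["left", "right"]

# Deterministic policy: pi(s) = a  (always "move right", as in the example above)
def pi_deterministic(s):
    return "right"

# Stochastic policy: pi(a|s) = P(a_t = a | s_t = s), with illustrative probabilities
def pi_stochastic(s):
    probs = {"left": 0.2, "right": 0.8}
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
```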

Value Function (V)

The expected future rewards from being in a state (and possibly taking an action) while following a particular policy

$V^{\pi}$ : expected discounted sum of future rewards under a particular policy

$$
V^{\pi}(s_t=s)=E_\pi[r_t + \gamma r_{t+1}+\gamma ^2 r_{t+2}+\gamma ^3 r_{t+3}+...|s_t=s]
$$

where the discount factor $\gamma$ lies between 0 and 1 and weights immediate against future rewards

Example of the value at each state with $\gamma = 0$ → $V^{\pi}(s_t=s)=E_\pi[r_t|s_t=s]$

$\pi(s_1)=\pi(s_2)=\dots=\pi(s_7)=\text{move right}$

If $\gamma$ were not zero, fewer of the values would be zero, since rewards from later states would propagate back to earlier ones...
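
To make the definition concrete, the sketch below (with made-up reward sequences) computes the discounted return and a simple Monte Carlo estimate of $V^{\pi}$; `sample_rollout` is a hypothetical rollout generator, not something from the lecture:

```python
import numpy as np

def discounted_return(rewards, gamma):
    # sum_k gamma^k * r_{t+k}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def mc_value_estimate(sample_rollout, gamma, n_rollouts=1000):
    # V^pi(s) is approximated by the average discounted return over rollouts
    # starting from s under pi; `sample_rollout` is a hypothetical function
    # returning one sampled list of rewards [r_t, r_{t+1}, ...].
    return np.mean([discounted_return(sample_rollout(), gamma) for _ in range(n_rollouts)])

# With gamma = 0 only the immediate reward counts:
print(discounted_return([0, 0, 1, 1], gamma=0.0))   # -> 0
print(discounted_return([0, 0, 1, 1], gamma=0.5))   # -> 0.375
```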

Types of RL Agents

Figure from David Silver’s Reinforcement Learning course: agents are categorized by which components they use (value function, policy, model), giving value-based, policy-based, and actor-critic agents, each either model-free or model-based.

Key Challenges in Learning to make Good Decisions

  • Planning → an algorithm computes how to act to maximize reward
  • → given a model of how the world works
  • Reinforcement Learning → the agent improves its policy
  • → the agent doesn’t know how the world works, so it interacts with the world to learn

The Exploration-Exploitation Dilemma

The agent only learns from the experiences of actions it has actually taken; it has no idea what would have happened with other actions → dilemma!!

The agent needs to balance between Exploration and Exploitation

  • Exploration : trying new things that might lead the agent to better outcomes in the future
  • Exploitation : choosing actions that are expected to yield the greatest rewards according to past experience

Such balancing may sacrifice some immediate reward for the possibility of greater reward in the future.
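
A standard (not lecture-specific) way to strike this balance is an epsilon-greedy rule: with small probability explore, otherwise exploit the action that currently looks best. A minimal sketch, assuming we keep per-action reward estimates in a dict:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: dict mapping action -> current estimate of its expected reward
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore: try something at random
    return max(q_values, key=q_values.get)        # exploit: best-looking action so far

# Example: mostly picks "right", occasionally tries "left"
print(epsilon_greedy({"left": 0.2, "right": 0.8}))
```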
