Sutton & Barto - RL: An Introduction

Chapter 1: Introduction

Discover the fundamentals of Reinforcement Learning — a computational approach to learning from interaction to achieve goals.

Learning from Interaction

Agents learn through trial-and-error, discovering optimal actions by interacting with their environment.

Reward Maximization

The goal is to maximize cumulative reward over time, not just immediate gains.

Exploration vs Exploitation

Balance between trying new actions (exploration) and using known good actions (exploitation).

Section 1.1

What is Reinforcement Learning?

Learning what to do — how to map situations to actions — so as to maximize a numerical reward signal.

Reinforcement Learning

Reinforcement Learning (RL) is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.

Two characteristics distinguish reinforcement learning from other types of learning:

1. Trial-and-Error Search

The agent must try different actions and observe their consequences to learn which ones lead to good outcomes.

2. Delayed Reward

Actions may affect not just the immediate reward but also future situations and all subsequent rewards.

The Agent-Environment Interaction

At the core of RL is the interaction loop between an agent and its environment. The agent observes states, takes actions, and receives rewards:

Diagram: the agent sends an action At to the environment; the environment responds with the next state St+1 and reward Rt+1.

At each time step t, the agent receives state St, selects action At, and receives reward Rt+1 along with new state St+1.
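
To make the loop concrete, here is a minimal Python sketch of one episode of interaction. The class and method names (Environment.step, Agent.select_action) and the random dynamics are illustrative assumptions, not part of the book's formalism.

```python
# A minimal sketch of the agent-environment interaction loop.
# Class names, method names, and the dynamics are illustrative only.
import random

class Environment:
    def reset(self):
        """Return the initial state S_0."""
        return 0

    def step(self, action):
        """Apply the action; return (reward R_{t+1}, next state S_{t+1}, done flag)."""
        reward = 1.0 if action == "right" else 0.0   # stand-in reward function
        next_state = random.randint(0, 9)            # stand-in state transition
        done = random.random() < 0.1                 # episode ends with 10% chance
        return reward, next_state, done

class Agent:
    def select_action(self, state):
        """Pick an action A_t given the current state S_t (here: at random)."""
        return random.choice(["left", "right"])

env, agent = Environment(), Agent()
state = env.reset()                              # S_0
done = False
while not done:
    action = agent.select_action(state)          # A_t
    reward, next_state, done = env.step(action)  # R_{t+1}, S_{t+1}
    # A learning agent would update its policy or value estimates here.
    state = next_state
```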

RL as a Third Paradigm

Reinforcement learning is often considered a third machine learning paradigm, alongside supervised learning and unsupervised learning. Here's how they compare:

Supervised Learning

Learn from labeled examples provided by an external supervisor.

Example:

Image classification: Given photos labeled "cat" or "dog", learn to classify new photos.

Unsupervised Learning

Find hidden structure in unlabeled data.

Example:

Customer segmentation: Group customers by purchasing behavior without predefined categories.

Reinforcement Learning

Learn through trial-and-error to maximize cumulative reward.

Example:

Game playing: Learn to play chess by winning/losing games, without being told the best moves.

Feature | Supervised | Unsupervised | Reinforcement
Requires Labeled Data | Yes | No | No
Type of Feedback | Correct answer provided | None | Reward signal (better/worse)
Primary Goal | Minimize prediction error | Find patterns/structure | Maximize cumulative reward
Needs Exploration | No | No | Yes
Handles Delayed Consequences | No | No | Yes

Key Insight: Reinforcement Learning is considered a third paradigm of machine learning, distinct from both supervised and unsupervised learning. It uniquely handles the challenges of learning from delayed consequences and balancing exploration with exploitation.

The Exploration-Exploitation Dilemma

A defining challenge in RL is balancing exploration (trying new actions) with exploitation (using known good actions):

Exploration

Try new actions

Discover potentially better actions by trying things you haven't tried before or actions that seem suboptimal based on current knowledge.

  • Pro: May discover better rewards
  • Pro: Improves knowledge of the environment
  • Con: Risks lower immediate rewards

Exploitation

Use best known action

Select the action that looks best based on your current knowledge. Maximize immediate expected reward using what you've learned.

  • Pro: Guaranteed reasonable rewards
  • Pro: Makes use of learned knowledge
  • Con: May miss optimal actions
Finding the Balance

Restaurant Example

Exploitation: You always go to your favorite restaurant because you know it's good.

Exploration: You try a new restaurant you've never been to — it might be worse, but it could also become your new favorite!

The Dilemma: If you always exploit, you'll never find better restaurants. If you always explore, you'll waste time on bad ones. The optimal strategy involves some of each.
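
One simple way to strike this balance, studied in detail in Chapter 2, is an epsilon-greedy rule: exploit most of the time, but explore with a small probability ε. The sketch below is only an illustration; the restaurant value estimates are invented numbers.

```python
import random

def epsilon_greedy(value_estimates, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known option.

    value_estimates: dict mapping each option to its current estimated value.
    """
    if random.random() < epsilon:
        # Explore: pick a uniformly random option.
        return random.choice(list(value_estimates))
    # Exploit: pick the option with the highest current estimate.
    return max(value_estimates, key=value_estimates.get)

# Restaurant analogy (numbers invented): values are how much you currently
# expect to enjoy each place.
restaurants = {"favorite place": 0.9, "new thai place": 0.5, "diner": 0.4}
tonight = epsilon_greedy(restaurants, epsilon=0.1)
print(tonight)   # usually "favorite place", occasionally something new
```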

Important: The exploration-exploitation dilemma is unique to reinforcement learning. It doesn't arise in supervised or unsupervised learning, and despite decades of study, remains mathematically unresolved in its full generality.

Section 1.2

Examples of RL in Action

From chess to robots, RL principles appear everywhere agents learn from interaction.

Chess Master

A chess player must plan several moves ahead while evaluating board positions intuitively.

  • Requires planning (anticipating opponent moves)
  • Intuitive position evaluation
  • Balances immediate tactics with long-term strategy

Refinery Controller

Optimizes parameters in real-time to balance yield, cost, and quality.

  • Real-time parameter optimization
  • Multiple competing objectives
  • Continuous state and action spaces

Gazelle Calf

Goes from stumbling to running 20 mph in just 30 minutes through rapid motor learning.

  • Rapid motor skill acquisition
  • Learning from physical feedback
  • Survival-critical learning

Mobile Robot

Must decide: explore for more trash or return to recharge station?

  • Battery management decisions
  • Explore vs. exploit tradeoff
  • Sequential decision making

Breakfast Preparation

Phil making breakfast involves hierarchical goals and complex sensorimotor coordination.

  • Hierarchical goal structure
  • Sensorimotor coordination
  • Multiple sub-tasks and dependencies
Common Features

All these examples share several important features:

  • Interaction between agent and environment
  • Goal the agent seeks to achieve despite uncertainty
  • Actions affect future states and opportunities
  • Requires accounting for delayed consequences
  • Can learn from experience to improve
Section 1.3

Elements of Reinforcement Learning

Four main subelements of any RL system: policy, reward signal, value function, and model.

Beyond the agent and environment, a reinforcement learning system has four main subelements:

  • Policy: a mapping from perceived states to actions; it defines the agent's way of behaving at a given time.
  • Reward signal: the number the environment sends at each time step; it defines the goal and what is good in an immediate sense.
  • Value function: an estimate of the total reward the agent can expect to accumulate starting from a state; it specifies what is good in the long run.
  • Model (optional): something that mimics the behavior of the environment, allowing the agent to plan by predicting future states and rewards.

The policy uses values (estimated from rewards) to make decisions. An optional model enables planning ahead.

The Importance of Value Functions

Value estimation is the most important component of almost all RL algorithms. This is arguably the most important thing learned about RL over the past six decades. While rewards define what is good in an immediate sense, values define what is good in the long run.
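
As a small illustration of the reward/value distinction, consider an agent doing a one-step lookahead over two candidate next states. The state names and numbers below are hypothetical, chosen only to show that the greedy choice by immediate reward can differ from the greedy choice by long-run value.

```python
# Hypothetical one-step lookahead over two candidate next states.
# Immediate rewards and long-run value estimates are invented for illustration.
candidates = {
    "state_A": {"immediate_reward": 1.0, "estimated_value": 0.2},
    "state_B": {"immediate_reward": 0.0, "estimated_value": 0.9},
}

greedy_by_reward = max(candidates, key=lambda s: candidates[s]["immediate_reward"])
greedy_by_value = max(candidates, key=lambda s: candidates[s]["estimated_value"])

print(greedy_by_reward)  # state_A: looks best right now
print(greedy_by_value)   # state_B: expected to yield more reward in the long run
```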

Section 1.4

Limitations and Scope

Understanding what the book covers and what approaches it doesn't focus on.

What This Book Covers

  • Methods that estimate value functions
  • Markov Decision Processes (MDPs)
  • Learning from interaction
  • Both model-free and model-based methods
  • Tabular and function approximation approaches

What's Not Covered

  • State representation and construction
  • Genetic algorithms and evolutionary methods
  • Simulated annealing
  • Methods that don't use value functions
Why Not Evolutionary Methods?

Evolutionary methods (genetic algorithms, etc.) are excluded because:

  • They don't learn while interacting — the focus here is online learning
  • They ignore much of the useful structure of the problem: that the policy is a function from states to actions
  • They don't use information about which states were visited or actions selected
  • They are generally less efficient than methods that exploit the details of individual behavioral interactions

The Role of State

The book relies heavily on the concept of state as input to the policy, value function, and model. A state signal conveys the agent's sense of "how the environment is" at a particular time. The formal definition uses Markov Decision Processes, covered in Chapter 3.

Section 1.5

Tic-Tac-Toe: An Extended Example

See RL in action with an interactive demonstration of temporal-difference learning.

This example demonstrates the key ideas of RL using a simple game. The AI learns to play by updating state values using the temporal-difference (TD) method:

V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right]

where:

  • V(S_t): value of the current state
  • α: learning rate (step-size parameter)
  • V(S_{t+1}) - V(S_t): the temporal-difference error
Interactive demo: the learner maintains a value table for player X, assigning every board state a value between 0 and 1:

  • 1.0 = Guaranteed win
  • 0.5 = Unknown / draw likely
  • 0.0 = Guaranteed loss

States not yet evaluated start at 0.5.
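
Below is a minimal sketch of the tabular TD(0) update the demo performs, assuming states are encoded as 9-character board strings and values are stored in a dictionary; both the encoding and the helper names are assumptions for illustration.

```python
from collections import defaultdict

# State values default to 0.5 ("unknown"); wins are 1.0, losses 0.0.
values = defaultdict(lambda: 0.5)
alpha = 0.1  # step-size parameter (learning rate)

def td_update(state, next_state):
    """Tabular TD(0) update: V(S_t) <- V(S_t) + alpha * [V(S_{t+1}) - V(S_t)]."""
    values[state] += alpha * (values[next_state] - values[state])

# Example (assumed encoding: a board is a 9-character string, '.' = empty).
values["XXXOO...."] = 1.0              # a state known to be a win for X
td_update("XX.OO....", "XXXOO....")    # preceding state's value moves toward 1.0
print(round(values["XX.OO...."], 3))   # 0.55
```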

Key Insight: RL vs. Evolutionary Methods

Notice how the RL approach differs from evolutionary methods: it evaluates individual states, not just final outcomes. This allows proper credit assignment — moves that actually mattered get updated, even if the game hasn't ended yet. Evolutionary methods would only learn from wins/losses at the end, giving credit even to moves that never occurred!

Section 1.6

Summary

Key takeaways from Chapter 1.

1. RL is a computational approach to goal-directed learning from interaction
2. It's distinguished by learning without exemplary supervision
3. The first field to seriously address learning from interaction to achieve long-term goals
4. Uses Markov Decision Processes as its formal framework
5. Represents states, actions, and rewards
6. Captures essential AI features: cause-and-effect, uncertainty, explicit goals
7. Value and value functions are key to most RL methods

Next up: Chapter 2 - Multi-armed Bandits
Section 1.7

Early History of Reinforcement Learning

Three independent threads that converged into modern RL.

Modern reinforcement learning emerged from three distinct research threads spanning over a century. Key milestones in each thread:

1850s (Trial-and-Error)

Origins of Trial-and-Error

Alexander Bain describes "groping and experiment" as a learning mechanism.

1911 (Trial-and-Error)

Law of Effect

Edward Thorndike formulates the Law of Effect: responses followed by satisfaction become more likely.

1927 (Trial-and-Error)

"Reinforcement" Term Coined

The term "reinforcement" appears in the English translation of Pavlov's work on conditioned reflexes.

1948 (Trial-and-Error)

Turing's Pleasure-Pain System

Alan Turing describes a "pleasure-pain system" for machine learning.

1954 (Trial-and-Error)

SNARCs - First RL Hardware

Marvin Minsky builds Stochastic Neural-Analog Reinforcement Calculators at Princeton.

1957 (Optimal Control)

Bellman Equation & MDPs

Richard Bellman develops dynamic programming and defines Markov Decision Processes.

1959 (TD Learning)

Samuel's Checkers Player

Arthur Samuel creates a checkers program that learns evaluation functions, an early implementation of temporal-difference ideas.

1960 (Optimal Control)

Policy Iteration

Ronald Howard develops the policy iteration method for MDPs.

1961-63 (Trial-and-Error)

MENACE & BOXES

Donald Michie creates the MENACE (matchbox tic-tac-toe) and BOXES (pole-balancing) systems.

1972 (Trial-and-Error)

Klopf Revives Trial-and-Error

Harry Klopf argues for returning to the hedonic (pleasure-seeking) aspects of learning.

1977 (TD Learning)

First TD Rule Publication

Ian Witten publishes the first explicit description of the tabular TD(0) learning rule.

1983 (TD Learning)

Actor-Critic Architecture

Andrew Barto, Richard Sutton, and Charles Anderson combine TD learning with trial-and-error search in the actor-critic framework.

1988 (TD Learning)

TD(λ) Algorithm

Richard Sutton separates TD learning from control, introduces TD(λ), and proves convergence properties.

1989 (TD Learning)

Q-Learning

Chris Watkins develops Q-learning, fully integrating all three threads of RL research.

1992 (TD Learning)

TD-Gammon

Gerry Tesauro's backgammon player combines TD learning with neural networks, reaching the level of the world's best human players.

The Three Threads Converge

Modern reinforcement learning emerged from the convergence of three distinct research threads: trial-and-error learning from psychology, optimal control from control theory and dynamic programming, and temporal-difference learning from AI research. Chris Watkins' Q-learning (1989) was the first to fully integrate all three threads.