Sutton & Barto - RL: An Introduction

Chapter 1: Introduction

Discover the fundamentals of Reinforcement Learning — a computational approach to learning from interaction to achieve goals.

Learning from Interaction

Agents learn through trial-and-error, discovering optimal actions by interacting with their environment.

Reward Maximization

The goal is to maximize cumulative reward over time, not just immediate gains.

Exploration vs Exploitation

Balance between trying new actions (exploration) and using known good actions (exploitation).

Section 1.1

What is Reinforcement Learning?

Learning what to do — how to map situations to actions — so as to maximize a numerical reward signal.

Reinforcement Learning

Reinforcement Learning (RL) is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.

Two characteristics distinguish reinforcement learning from other types of learning:

1. Trial-and-Error Search

The agent must try different actions and observe their consequences to learn which ones lead to good outcomes.

2. Delayed Reward

Actions may affect not just the immediate reward but also future situations and all subsequent rewards.

The Agent-Environment Interaction

At the core of RL is the interaction loop between an agent and its environment. The agent observes states, takes actions, and receives rewards:

Diagram: the agent sends an action At to the environment; the environment responds with the next state St+1 and reward Rt+1.

At each time step t, the agent receives state St, selects action At, and receives reward Rt+1 along with new state St+1.
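
To make the loop concrete, here is a minimal Python sketch of one episode of interaction. The class and method names (Environment.step, Agent.select_action) and the random dynamics are illustrative assumptions, not part of the book's formalism.

```python
# A minimal sketch of the agent-environment interaction loop.
# Class names, method names, and the dynamics are illustrative only.
import random

class Environment:
    def reset(self):
        """Return the initial state S_0."""
        return 0

    def step(self, action):
        """Apply the action; return (reward R_{t+1}, next state S_{t+1}, done flag)."""
        reward = 1.0 if action == "right" else 0.0   # stand-in reward function
        next_state = random.randint(0, 9)            # stand-in state transition
        done = random.random() < 0.1                 # episode ends with 10% chance
        return reward, next_state, done

class Agent:
    def select_action(self, state):
        """Pick an action A_t given the current state S_t (here: at random)."""
        return random.choice(["left", "right"])

env, agent = Environment(), Agent()
state = env.reset()                              # S_0
done = False
while not done:
    action = agent.select_action(state)          # A_t
    reward, next_state, done = env.step(action)  # R_{t+1}, S_{t+1}
    # A learning agent would update its policy or value estimates here.
    state = next_state
```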

RL as a Third Paradigm

Reinforcement learning is often considered a third machine learning paradigm, alongside supervised learning and unsupervised learning. Here's how they compare:

Supervised Learning

Learn from labeled examples provided by an external supervisor.

Example:

Image classification: Given photos labeled "cat" or "dog", learn to classify new photos.

Unsupervised Learning

Find hidden structure in unlabeled data.

Example:

Customer segmentation: Group customers by purchasing behavior without predefined categories.

Reinforcement Learning

Learn through trial-and-error to maximize cumulative reward.

Example:

Game playing: Learn to play chess by winning/losing games, without being told the best moves.

Feature | Supervised | Unsupervised | Reinforcement
Requires Labeled Data | Yes | No | No
Type of Feedback | Correct answer provided | None | Reward signal (better/worse)
Primary Goal | Minimize prediction error | Find patterns/structure | Maximize cumulative reward
Needs Exploration | No | No | Yes
Handles Delayed Consequences | No | No | Yes

Key Insight: Reinforcement Learning is considered a third paradigm of machine learning, distinct from both supervised and unsupervised learning. It uniquely handles the challenges of learning from delayed consequences and balancing exploration with exploitation.

The Exploration-Exploitation Dilemma

A defining challenge in RL is balancing exploration (trying new actions) with exploitation (using known good actions):

Exploration

Try new actions

Discover potentially better actions by trying things you haven't tried before or actions that seem suboptimal based on current knowledge.

  • Pro: May discover better rewards
  • Pro: Improves knowledge of the environment
  • Con: Risks lower immediate rewards

Exploitation

Use best known action

Select the action that looks best based on your current knowledge. Maximize immediate expected reward using what you've learned.

  • Pro: Guaranteed reasonable rewards
  • Pro: Makes use of learned knowledge
  • Con: May miss optimal actions
Finding the Balance

Restaurant Example

Exploitation: You always go to your favorite restaurant because you know it's good.

Exploration: You try a new restaurant you've never been to — it might be worse, but it could also become your new favorite!

The Dilemma: If you always exploit, you'll never find better restaurants. If you always explore, you'll waste time on bad ones. The optimal strategy involves some of each.
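
One simple way to strike this balance, studied in detail in Chapter 2, is an epsilon-greedy rule: exploit most of the time, but explore with a small probability ε. The sketch below is only an illustration; the restaurant value estimates are invented numbers.

```python
import random

def epsilon_greedy(value_estimates, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known option.

    value_estimates: dict mapping each option to its current estimated value.
    """
    if random.random() < epsilon:
        # Explore: pick a uniformly random option.
        return random.choice(list(value_estimates))
    # Exploit: pick the option with the highest current estimate.
    return max(value_estimates, key=value_estimates.get)

# Restaurant analogy (numbers invented): values are how much you currently
# expect to enjoy each place.
restaurants = {"favorite place": 0.9, "new thai place": 0.5, "diner": 0.4}
tonight = epsilon_greedy(restaurants, epsilon=0.1)
print(tonight)   # usually "favorite place", occasionally something new
```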

Important: The exploration-exploitation dilemma is unique to reinforcement learning. It doesn't arise in supervised or unsupervised learning, and despite decades of study, remains mathematically unresolved in its full generality.

Section 1.2

Examples of RL in Action

From chess to robots, RL principles appear everywhere agents learn from interaction.

Chess Master

A chess player must plan several moves ahead while evaluating board positions intuitively.

  • Requires planning (anticipating opponent moves)
  • Intuitive position evaluation
  • Balances immediate tactics with long-term strategy

Refinery Controller

Optimizes parameters in real-time to balance yield, cost, and quality.

  • Real-time parameter optimization
  • Multiple competing objectives
  • Continuous state and action spaces

Gazelle Calf

Goes from stumbling to running 20 mph in just 30 minutes through rapid motor learning.

  • Rapid motor skill acquisition
  • Learning from physical feedback
  • Survival-critical learning

Mobile Robot

Must decide: explore for more trash or return to recharge station?

  • Battery management decisions
  • Explore vs. exploit tradeoff
  • Sequential decision making

Breakfast Preparation

Phil making breakfast involves hierarchical goals and complex sensorimotor coordination.

  • Hierarchical goal structure
  • Sensorimotor coordination
  • Multiple sub-tasks and dependencies
Common Features

All these examples share several important features:

  • Interaction between agent and environment
  • Goal the agent seeks to achieve despite uncertainty
  • Actions affect future states and opportunities
  • Requires accounting for delayed consequences
  • Can learn from experience to improve
Section 1.3

Elements of Reinforcement Learning

Four main subelements of any RL system: policy, reward signal, value function, and model.

Beyond the agent and environment, a reinforcement learning system has four main subelements:

  • Policy: a mapping from perceived states to actions; it defines the agent's way of behaving at a given time.
  • Reward signal: the number the environment sends at each time step; it defines the goal and what is good in an immediate sense.
  • Value function: an estimate of the total reward the agent can expect to accumulate starting from a state; it specifies what is good in the long run.
  • Model (optional): something that mimics the behavior of the environment, allowing the agent to plan by predicting future states and rewards.

The policy uses values (estimated from rewards) to make decisions. An optional model enables planning ahead.

The Importance of Value Functions

Value estimation is the most important component of almost all RL algorithms. This is arguably the most important thing learned about RL over the past six decades. While rewards define what is good in an immediate sense, values define what is good in the long run.
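
As a small illustration of the reward/value distinction, consider an agent doing a one-step lookahead over two candidate next states. The state names and numbers below are hypothetical, chosen only to show that the greedy choice by immediate reward can differ from the greedy choice by long-run value.

```python
# Hypothetical one-step lookahead over two candidate next states.
# Immediate rewards and long-run value estimates are invented for illustration.
candidates = {
    "state_A": {"immediate_reward": 1.0, "estimated_value": 0.2},
    "state_B": {"immediate_reward": 0.0, "estimated_value": 0.9},
}

greedy_by_reward = max(candidates, key=lambda s: candidates[s]["immediate_reward"])
greedy_by_value = max(candidates, key=lambda s: candidates[s]["estimated_value"])

print(greedy_by_reward)  # state_A: looks best right now
print(greedy_by_value)   # state_B: expected to yield more reward in the long run
```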

Section 1.4

Limitations and Scope

Understanding what the book covers and what approaches it doesn't focus on.

What This Book Covers

  • Methods that estimate value functions
  • Markov Decision Processes (MDPs)
  • Learning from interaction
  • Both model-free and model-based methods
  • Tabular and function approximation approaches

What's Not Covered

  • State representation and construction
  • Genetic algorithms and evolutionary methods
  • Simulated annealing
  • Methods that don't use value functions
Why Not Evolutionary Methods?

Evolutionary methods (genetic algorithms, etc.) are excluded because:

  • They don't learn while interacting — the focus here is online learning
  • They ignore much of the useful structure of the problem: that the policy is a function from states to actions
  • They don't use information about which states were visited or actions selected
  • They are generally less efficient than methods that exploit the details of individual behavioral interactions

The Role of State

The book relies heavily on the concept of state as input to the policy, value function, and model. A state signal conveys the agent's sense of "how the environment is" at a particular time. The formal definition uses Markov Decision Processes, covered in Chapter 3.

Section 1.5

Tic-Tac-Toe: An Extended Example

See RL in action with an interactive demonstration of temporal-difference learning.

This example demonstrates the key ideas of RL using a simple game. The AI learns to play by updating state values using the temporal-difference (TD) method:

V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right]

where:

  • V(S_t): value of the current state
  • α: learning rate (step-size parameter)
  • V(S_{t+1}) - V(S_t): the temporal-difference error
Interactive demo: the learner maintains a value table for player X, assigning every board state a value between 0 and 1:

  • 1.0 = Guaranteed win
  • 0.5 = Unknown / draw likely
  • 0.0 = Guaranteed loss

States not yet evaluated start at 0.5.
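
Below is a minimal sketch of the tabular TD(0) update the demo performs, assuming states are encoded as 9-character board strings and values are stored in a dictionary; both the encoding and the helper names are assumptions for illustration.

```python
from collections import defaultdict

# State values default to 0.5 ("unknown"); wins are 1.0, losses 0.0.
values = defaultdict(lambda: 0.5)
alpha = 0.1  # step-size parameter (learning rate)

def td_update(state, next_state):
    """Tabular TD(0) update: V(S_t) <- V(S_t) + alpha * [V(S_{t+1}) - V(S_t)]."""
    values[state] += alpha * (values[next_state] - values[state])

# Example (assumed encoding: a board is a 9-character string, '.' = empty).
values["XXXOO...."] = 1.0              # a state known to be a win for X
td_update("XX.OO....", "XXXOO....")    # preceding state's value moves toward 1.0
print(round(values["XX.OO...."], 3))   # 0.55
```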

Key Insight: RL vs. Evolutionary Methods

Notice how the RL approach differs from evolutionary methods: it evaluates individual states, not just final outcomes. This allows proper credit assignment — moves that actually mattered get updated, even if the game hasn't ended yet. Evolutionary methods would only learn from wins/losses at the end, giving credit even to moves that never occurred!

Section 1.6

Summary

Key takeaways from Chapter 1.

1. RL is a computational approach to goal-directed learning from interaction
2. It's distinguished by learning without exemplary supervision
3. The first field to seriously address learning from interaction to achieve long-term goals
4. Uses Markov Decision Processes as its formal framework
5. Represents states, actions, and rewards
6. Captures essential AI features: cause-and-effect, uncertainty, explicit goals
7. Value and value functions are key to most RL methods

Next up: Chapter 2 - Multi-armed Bandits
Section 1.7

Early History of Reinforcement Learning

Three independent threads that converged into modern RL.

Modern reinforcement learning emerged from three distinct research threads spanning over a century. Key milestones in each thread:

1850s (Trial-and-Error)

Origins of Trial-and-Error

Alexander Bain describes "groping and experiment" as a learning mechanism.

1911 (Trial-and-Error)

Law of Effect

Edward Thorndike formulates the Law of Effect: responses followed by satisfaction become more likely.

1927 (Trial-and-Error)

"Reinforcement" Term Coined

The term "reinforcement" appears in the English translation of Pavlov's work on conditioned reflexes.

1948 (Trial-and-Error)

Turing's Pleasure-Pain System

Alan Turing describes a "pleasure-pain system" for machine learning.

1954 (Trial-and-Error)

SNARCs - First RL Hardware

Marvin Minsky builds Stochastic Neural-Analog Reinforcement Calculators at Princeton.

1957 (Optimal Control)

Bellman Equation & MDPs

Richard Bellman develops dynamic programming and defines Markov Decision Processes.

1959 (TD Learning)

Samuel's Checkers Player

Arthur Samuel creates a checkers program that learns evaluation functions, an early implementation of temporal-difference ideas.

1960 (Optimal Control)

Policy Iteration

Ronald Howard develops the policy iteration method for MDPs.

1961-63 (Trial-and-Error)

MENACE & BOXES

Donald Michie creates the MENACE (matchbox tic-tac-toe) and BOXES (pole-balancing) systems.

1972 (Trial-and-Error)

Klopf Revives Trial-and-Error

Harry Klopf argues for returning to the hedonic (pleasure-seeking) aspects of learning.

1977 (TD Learning)

First TD Rule Publication

Ian Witten publishes the first explicit description of the tabular TD(0) learning rule.

1983 (TD Learning)

Actor-Critic Architecture

Andrew Barto, Richard Sutton, and Charles Anderson combine TD learning with trial-and-error search in the actor-critic framework.

1988 (TD Learning)

TD(λ) Algorithm

Richard Sutton separates TD learning from control, introduces TD(λ), and proves convergence properties.

1989 (TD Learning)

Q-Learning

Chris Watkins develops Q-learning, fully integrating all three threads of RL research.

1992 (TD Learning)

TD-Gammon

Gerry Tesauro's backgammon player combines TD learning with neural networks, reaching the level of the world's best human players.

The Three Threads Converge

Modern reinforcement learning emerged from the convergence of three distinct research threads: trial-and-error learning from psychology, optimal control from control theory and dynamic programming, and temporal-difference learning from AI research. Chris Watkins' Q-learning (1989) was the first to fully integrate all three threads.