Chapter 1: Introduction
Discover the fundamentals of Reinforcement Learning — a computational approach to learning from interaction to achieve goals.
Learning from Interaction
Agents learn through trial-and-error, discovering optimal actions by interacting with their environment.
Reward Maximization
The goal is to maximize cumulative reward over time, not just immediate gains.
Exploration vs Exploitation
Balance between trying new actions (exploration) and using known good actions (exploitation).
What is Reinforcement Learning?
Learning what to do — how to map situations to actions — so as to maximize a numerical reward signal.
Reinforcement Learning (RL) is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
Two characteristics distinguish reinforcement learning from other types of learning:
1. Trial-and-Error Search
The agent must try different actions and observe their consequences to learn which ones lead to good outcomes.
2. Delayed Reward
Actions may affect not just the immediate reward but also future situations and all subsequent rewards.
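To make the reward-maximization goal concrete, the toy calculation below (the numbers are invented for illustration, not from the book) shows how an action with a smaller immediate reward can still win on the cumulative sum that RL cares about:

```python
def cumulative_reward(rewards):
    """Sum of rewards collected over one episode.

    The agent's objective is this long-run total, not any single
    immediate reward.
    """
    return sum(rewards)

# Invented numbers: a greedy first move pays off now but leads to poor
# follow-up situations; a patient first move pays less now but more later.
greedy_episode = [10, 0, 0, 0]
patient_episode = [1, 5, 5, 5]

print(cumulative_reward(greedy_episode))   # 10
print(cumulative_reward(patient_episode))  # 16
```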
The Agent-Environment Interaction
At the core of RL is the interaction loop between an agent and its environment. The agent observes states, takes actions, and receives rewards:
At each time step t, the agent receives state S_t, selects action A_t, and receives reward R_{t+1} along with the new state S_{t+1}.
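In code, the interaction loop might look like the sketch below. The `env` object with `reset()` and `step(action)` methods and the `policy` function are hypothetical stand-ins for whatever environment and decision rule you have; this is not a specific library API.

```python
import random

def run_episode(env, policy, max_steps=100):
    """One pass through the agent-environment interaction loop.

    Assumes a hypothetical environment with reset() -> state and
    step(action) -> (next_state, reward, done), and a policy that
    maps a state to an action.
    """
    state = env.reset()                          # agent observes initial state S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                   # agent selects action A_t
        state, reward, done = env.step(action)   # environment returns R_{t+1}, S_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward

def random_policy(actions):
    """A placeholder policy that ignores the state and acts at random."""
    return lambda state: random.choice(actions)
```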
RL as a Third Paradigm
Reinforcement learning is often considered a third machine learning paradigm, alongside supervised learning and unsupervised learning. Here's how they compare:
Supervised Learning
Learn from labeled examples provided by an external supervisor.
Image classification: Given photos labeled "cat" or "dog", learn to classify new photos.
Unsupervised Learning
Find hidden structure in unlabeled data.
Customer segmentation: Group customers by purchasing behavior without predefined categories.
Reinforcement Learning
Learn through trial-and-error to maximize cumulative reward.
Game playing: Learn to play chess by winning/losing games, without being told the best moves.
| Feature | Supervised | Unsupervised | Reinforcement |
|---|---|---|---|
| Requires Labeled Data | Yes | No | No |
| Type of Feedback | Correct answer provided | None | Reward signal (better/worse) |
| Primary Goal | Minimize prediction error | Find patterns/structure | Maximize cumulative reward |
| Needs Exploration | No | No | Yes |
| Handles Delayed Consequences | No | No | Yes |
Key Insight: Reinforcement Learning is considered a third paradigm of machine learning, distinct from both supervised and unsupervised learning. It uniquely handles the challenges of learning from delayed consequences and balancing exploration with exploitation.
The Exploration-Exploitation Dilemma
One of the most unique challenges in RL is balancing exploration (trying new actions) with exploitation (using known good actions):
Exploration
Try new actions
Discover potentially better actions by trying things you haven't tried before or actions that seem suboptimal based on current knowledge.
- Pro: May discover better rewards
- Pro: Improves knowledge of the environment
- Con: Risks lower immediate rewards
Exploitation
Use best known action
Select the action that looks best based on your current knowledge. Maximize immediate expected reward using what you've learned.
- Pro: Guaranteed reasonable rewards
- Pro: Makes use of learned knowledge
- Con: May miss the optimal action
Restaurant Example
Exploitation: You always go to your favorite restaurant because you know it's good.
Exploration: You try a new restaurant you've never been to — it might be worse, but it could also become your new favorite!
The Dilemma: If you always exploit, you'll never find better restaurants. If you always explore, you'll waste time on bad ones. The optimal strategy involves some of each.
Important: The exploration-exploitation dilemma is unique to reinforcement learning. It doesn't arise in supervised or unsupervised learning, and despite decades of study, remains mathematically unresolved in its full generality.
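A simple and widely used way to mix the two is an epsilon-greedy rule: with a small probability, pick an action at random (explore); otherwise pick the action with the highest estimated value (exploit). The sketch below is illustrative only; the value estimates in `Q` are assumed to come from the agent's experience so far.

```python
import random

def epsilon_greedy(Q, actions, epsilon=0.1):
    """Choose an action, exploring with probability epsilon.

    Q is a dict mapping each action to its current estimated value
    (assumed to be maintained elsewhere); unseen actions default to 0.
    """
    if random.random() < epsilon:
        return random.choice(actions)                    # explore: try any action
    return max(actions, key=lambda a: Q.get(a, 0.0))     # exploit: best known action
```

With epsilon = 0 the agent always exploits and can get stuck on the first decent restaurant it finds; with epsilon = 1 it never benefits from what it has learned. Intermediate values trade the two off.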
Examples of RL in Action
From chess to robots, RL principles appear everywhere agents learn from interaction.
Chess Master
A chess player must plan several moves ahead while evaluating board positions intuitively.
- Requires planning (anticipating opponent moves)
- Intuitive position evaluation
- Balances immediate tactics with long-term strategy
Refinery Controller
Optimizes parameters in real-time to balance yield, cost, and quality.
- Real-time parameter optimization
- Multiple competing objectives
- Continuous state and action spaces
Gazelle Calf
Goes from stumbling to running 20 mph in just 30 minutes through rapid motor learning.
- Rapid motor skill acquisition
- Learning from physical feedback
- Survival-critical learning
Mobile Robot
Must decide: explore for more trash or return to recharge station?
- Battery management decisions
- Explore vs. exploit tradeoff
- Sequential decision making
Breakfast Preparation
Phil making breakfast involves hierarchical goals and complex sensorimotor coordination.
- Hierarchical goal structure
- Sensorimotor coordination
- Multiple sub-tasks and dependencies
All these examples share several important features:
- Interaction between agent and environment
- Goal the agent seeks to achieve despite uncertainty
- Actions affect future states and opportunities
- Requires accounting for delayed consequences
- Can learn from experience to improve
Elements of Reinforcement Learning
Four main subelements of any RL system: policy, reward signal, value function, and model.
Beyond the agent and environment, a reinforcement learning system has four main subelements: a policy, a reward signal, a value function, and, optionally, a model of the environment.
The policy uses values (estimated from rewards) to make decisions. An optional model enables planning ahead.
Value estimation is the central component of almost all RL algorithms; arguably, its importance is the most significant lesson learned about RL over the past 60 years. While rewards define what is good in an immediate sense, values define what is good in the long run.
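As a rough sketch of how the four subelements relate in code (the names and types below are illustrative, not notation from the book):

```python
from typing import Callable, Dict, Tuple

State, Action = int, int  # placeholders; real problems use richer representations

# Policy: maps situations (states) to actions; the agent's way of behaving.
Policy = Callable[[State], Action]

# Reward signal: the immediate, short-term goodness of an event.
RewardSignal = Callable[[State, Action], float]

# Value function: the estimated long-run goodness of each state,
# built up from observed rewards.
ValueFunction = Dict[State, float]

# Model (optional): predicts how the environment responds, enabling planning.
Model = Callable[[State, Action], Tuple[State, float]]
```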
Limitations and Scope
Understanding what the book covers and what approaches it doesn't focus on.
What This Book Covers
- Methods that estimate value functions
- Markov Decision Processes (MDPs)
- Learning from interaction
- Both model-free and model-based methods
- Tabular and function approximation approaches
What's Not Covered
- State representation and construction
- Genetic algorithms and evolutionary methods
- Simulated annealing
- Methods that don't use value functions
Evolutionary methods (genetic algorithms, etc.) are excluded because:
- They don't learn while interacting — the focus here is online learning
- They ignore much of the problem's structure, in particular that the policy being sought is a function from states to actions
- They don't use information about which states were visited or actions selected
- Generally less efficient than methods using individual behavioral interactions
The Role of State
The book relies heavily on the concept of state as input to the policy, value function, and model. A state signal conveys the agent's sense of "how the environment is" at a particular time. The formal definition uses Markov Decision Processes, covered in Chapter 3.
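As a preview of that formalism (developed properly in Chapter 3), an MDP bundles the states, actions, and one-step dynamics of the environment. The minimal container below is only a sketch with illustrative field names, not the book's notation:

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

@dataclass
class MDP:
    """Bare-bones finite MDP container (illustrative sketch).

    dynamics maps (state, action) to a distribution over
    (next_state, reward) outcomes, given as outcome -> probability.
    """
    states: Set[int]
    actions: Set[int]
    dynamics: Dict[Tuple[int, int], Dict[Tuple[int, float], float]]
    gamma: float = 0.9  # discount factor; introduced formally in Chapter 3
```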
Tic-Tac-Toe: An Extended Example
See RL in action with an interactive demonstration of temporal-difference learning.
This example demonstrates the key ideas of RL using a simple game. The AI learns to play by updating state values using the temporal-difference (TD) method:
Value Table (Player X)
1.0 = Guaranteed win
0.5 = Unknown/Draw likely
0.0 = Guaranteed loss
Notice how the RL approach differs from evolutionary methods: it evaluates individual states, not just final outcomes. This allows proper credit assignment — moves that actually mattered get updated, even if the game hasn't ended yet. Evolutionary methods would only learn from wins/losses at the end, giving credit even to moves that never occurred!
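The update behind this demonstration is the temporal-difference rule described in the chapter: after a greedy move takes the learner from board position s to position s', the value of s is nudged toward the value of s', V(s) <- V(s) + alpha * (V(s') - V(s)). A minimal sketch, assuming board positions are stored as hashable keys and unseen positions start at 0.5:

```python
def td_update(V, state, next_state, alpha=0.1, default=0.5):
    """Temporal-difference update for a tic-tac-toe value table.

    Moves the estimate V[state] a fraction alpha of the way toward
    V[next_state]; states not yet in the table default to 0.5.
    """
    v_s, v_next = V.get(state, default), V.get(next_state, default)
    V[state] = v_s + alpha * (v_next - v_s)
    return V[state]

# Example: a position that led to a winning position gains value.
V = {"X wins": 1.0}
td_update(V, "one move before the win", "X wins")  # rises from 0.5 toward 1.0
```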
Summary
Key takeaways from Chapter 1.
- RL is a computational approach to goal-directed learning from interaction
- It is distinguished by learning without exemplary supervision
- It is the first field to seriously address learning from interaction to achieve long-term goals
- It uses Markov Decision Processes as its formal framework
- The framework represents states, actions, and rewards
- The framework captures essential features of AI: cause and effect, uncertainty, and explicit goals
- Values and value functions are key to most RL methods
Early History of Reinforcement Learning
Three independent threads that converged into modern RL.
Modern reinforcement learning emerged from three distinct research threads spanning over a century. Key milestones include the following:
Origins of Trial-and-Error
Alexander Bain describes "groping and experiment" as a learning mechanism.
Law of Effect
Edward Thorndike formulates the Law of Effect: responses followed by satisfaction become more likely.
"Reinforcement" Term Coined
The term "reinforcement" appears in English translation of Pavlov's conditioned reflexes work.
Turing's Pleasure-Pain System
Alan Turing describes a "pleasure-pain system" for machine learning.
SNARCs - First RL Hardware
Marvin Minsky builds Stochastic Neural-Analog Reinforcement Calculators at Princeton.
Bellman Equation & MDPs
Richard Bellman develops dynamic programming and defines Markov Decision Processes.
Samuel's Checkers Player
Arthur Samuel creates a checkers program that learns evaluation functions, an early forerunner of temporal-difference learning.
Policy Iteration
Ronald Howard develops policy iteration method for MDPs.
MENACE & BOXES
Donald Michie creates MENACE (matchbox tic-tac-toe) and BOXES (pole-balancing) systems.
Klopf Revives Trial-and-Error
Harry Klopf argues for returning to hedonic (pleasure-seeking) aspects of learning.
First TD Rule Publication
Ian Witten publishes the first clear description of tabular TD(0) learning rule.
Actor-Critic Architecture
Barto, Sutton & Anderson combine TD learning with trial-and-error in actor-critic framework.
TD(λ) Algorithm
Sutton separates TD from control, introduces TD(λ), proves convergence properties.
Q-Learning
Chris Watkins develops Q-learning, fully integrating all three threads of RL research.
TD-Gammon
Gerry Tesauro's backgammon player combines TD with neural networks, achieving superhuman play.
The Three Threads Converge
Modern reinforcement learning emerged from the convergence of three distinct research threads: trial-and-error learning from psychology, optimal control from control theory and dynamic programming, and temporal-difference learning from AI research. Chris Watkins' Q-learning (1989) was the first to fully integrate all three threads.