Teaching an AI to Think: A Beginner's Guide to Reinforcement Learning

Have you ever wondered how an AI learns to play a game, master a strategy, or navigate a maze? The answer often lies in a fascinating field called Reinforcement Learning (RL). Unlike supervised learning, where we provide labeled data, RL is all about learning through trial and error, much like training a pet.
In this guide, we’ll explore the core concepts of RL and walk through building a simple AI that learns to solve a classic problem using a powerful technique called Q-Learning.
The Language of Learning: Core Terminology 🗣️
Before we dive in, let’s get familiar with the key players in any reinforcement learning task. Imagine we’re teaching an AI to play a game of Super Mario.
- Agent: This is our learner or decision-maker. In our example, Mario is the agent.
- Environment: This is the world the agent interacts with. The game level is the environment.
- State: This is the agent’s current situation. The agent’s location (x, y coordinates) on the screen is its state.
- Action: These are the moves the agent can make. Jumping, moving left, or even doing nothing are all actions.
- Reward: This is the feedback the agent gets after an action. Grabbing a coin is a positive reward, while getting hit by an enemy is a negative reward.
The ultimate goal of the agent is to take actions within the environment to maximize its total reward. The tricky part is defining the rewards correctly to guide the agent toward the desired outcome.
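To make these terms concrete, here is a minimal sketch of the agent-environment loop in code, using Gymnasium's FrozenLake-v1 environment (the same one we'll train on later) and an "agent" that simply acts at random. The variable names are just for illustration.

import gymnasium as gym

env = gym.make('FrozenLake-v1')        # Environment: the world the agent lives in
state, info = env.reset()              # State: the agent's starting situation
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()  # Action: this agent just picks a random move
    state, reward, terminated, truncated, _ = env.step(action)  # Reward: feedback from the environment
    total_reward += reward
    done = terminated or truncated
print("Total reward this episode:", total_reward)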
What is Q-Learning? The Agent’s “Cheat Sheet”
Q-Learning is a popular RL algorithm that helps an agent figure out the best action to take in any given state. It does this by creating and updating a “cheat sheet” called a Q-Table.
This table has a row for every possible state and a column for every possible action. The value inside each cell Q[state, action] represents the expected future reward of taking that action in that state.
Once the table has been learned, the agent can find the best move in any situation simply by picking the action with the highest Q-value for its current state.
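As a tiny, purely illustrative example (the numbers below are made up), here is what that lookup might look like with NumPy:

import numpy as np

# A toy Q-Table for 3 states and 2 actions, filled with made-up values
Q = np.array([
    [0.1, 0.7],   # state 0: action 1 currently looks best
    [0.5, 0.2],   # state 1: action 0 currently looks best
    [0.0, 0.0],   # state 2: nothing learned yet
])
state = 0
best_action = np.argmax(Q[state, :])   # the column with the highest Q-value
print(best_action)                     # prints 1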
How an Agent Learns: Exploration vs. Exploitation
So, how do we fill out this Q-Table? The agent learns by exploring the environment. But it faces a classic dilemma:
- Exploration: Should it try a random action to discover new paths and potentially better rewards?
- Exploitation: Should it use the Q-Table and choose the action it already knows gives the best reward?
A good agent needs to balance both. At the beginning of training, it will act mostly randomly to explore as much of the environment as possible. As it gains experience and its Q-Table becomes more accurate, it will start relying more on what it has learned (exploitation). This balance is often controlled by a variable called epsilon (ε), which represents the probability of taking a random action.
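In code, this epsilon-greedy choice is just a coin flip weighted by ε. The helper below is a sketch (the function name and arguments are my own, not part of any library):

import numpy as np

def choose_action(Q, state, epsilon, n_actions):
    # With probability epsilon, explore with a random action;
    # otherwise exploit the Q-Table and take the best-known action.
    if np.random.uniform(0, 1) < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[state, :]))        # exploit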
The Magic Update Formula ✨
After every action, the agent updates the corresponding cell in the Q-Table using the Q-Learning update rule, which is based on the Bellman equation. It looks a bit scary, but the concept is simple!
Q[state, action] = Q[state, action] + α * (reward + γ * max(Q[newState, :]) - Q[state, action])
Let’s break it down: The new Q-value is the old Q-value plus a small adjustment.
- α (Alpha): The learning rate. This controls how much we update the Q-value. A small α means the agent learns slowly and cautiously.
- γ (Gamma): The discount factor. This determines the importance of future rewards. A high γ makes the agent farsighted, caring more about long-term success.
In simple terms, the formula updates our expectation based on the immediate reward we got, plus the potential future reward from the best action in the next state.
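Written as a small helper function (again a sketch with hypothetical names, mirroring the formula above):

import numpy as np

def q_update(Q, state, action, reward, new_state, alpha, gamma):
    # Nudge the old estimate toward: the immediate reward
    # plus the discounted value of the best action in the next state.
    best_next = np.max(Q[new_state, :])
    Q[state, action] = Q[state, action] + alpha * (reward + gamma * best_next - Q[state, action])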
Let’s Build It! The Frozen Lake Challenge 🧊
Time to put theory into practice! We’ll use Gymnasium (the maintained successor to OpenAI’s Gym) to train an agent to navigate the FrozenLake-v1 environment.
The goal is to move from the start (S) to the goal (G) without falling into a hole (H). The frozen squares (F) are safe to walk on. There are 16 states (one for each square) and 4 possible actions (Left, Down, Right, Up).
Setting Up the Q-Table and Training Loop
First, we’ll set up our environment, initialize our Q-Table with zeros, and define our training parameters.
import gymnasium as gym
import numpy as np
env = gym.make('FrozenLake-v1')
STATES = env.observation_space.n
ACTIONS = env.action_space.n
# Initialize the Q-Table with all zeros
Q = np.zeros((STATES, ACTIONS))
# Hyperparameters
EPISODES = 2000 # How many times to run the environment from the beginning
MAX_STEPS = 100 # Max number of steps allowed for each run
LEARNING_RATE = 0.81
GAMMA = 0.96
The Training Logic
Now for the main loop. In each episode, the agent will navigate the lake. We’ll use our epsilon (ε) value to decide whether to take a random action (explore) or the best-known action (exploit). After each step, we update the Q-Table with our formula. Epsilon will slowly decrease over time, making the agent more confident in its Q-Table.
epsilon = 0.9 # Start with a 90% chance of picking a random action
rewards = []
for episode in range(EPISODES):
    # Reset the environment and get the initial state.
    # In newer Gymnasium versions, reset() returns a tuple of (observation, info).
    state, info = env.reset()
    for _ in range(MAX_STEPS):
        # Choose an action: randomly (explore) or based on the Q-Table (exploit)
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])
        # Take the action and observe the new state and reward
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Combine terminated and truncated to get the 'done' condition
        done = terminated or truncated
        # Update the Q-Table using the update formula
        Q[state, action] = Q[state, action] + LEARNING_RATE * (reward + GAMMA * np.max(Q[next_state, :]) - Q[state, action])
        state = next_state
        if done:
            rewards.append(reward)
            epsilon -= 0.001  # Slowly shift from exploration to exploitation
            break  # Episode is over
print("--- Final Q-Table ---")
print(Q)
# Avoid division by zero if no rewards were collected
if rewards:
    print(f"\nAverage reward: {sum(rewards)/len(rewards)}")
else:
    print("\nNo rewards collected during training.")
After training, the Q-Table will be filled with values that represent the best path to the goal. Our agent has successfully learned to navigate the frozen lake!
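As a quick sanity check, you can watch the trained agent follow its Q-Table greedily. The sketch below reuses the env and Q objects from the training code; note that FrozenLake-v1 is slippery by default, so even a well-trained agent won't reach the goal on every single run.

# Run one greedy episode with the learned Q-Table
state, info = env.reset()
done = False
while not done:
    action = np.argmax(Q[state, :])   # always exploit: pick the best-known action
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
print("Reached the goal!" if reward == 1.0 else "Fell in a hole (or ran out of steps) this time.")
env.close()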