Building Your First Image Classifier: A Deep Dive into CNNs

Have you ever wanted to build a program that can look at a picture and tell you what’s in it? That’s the power of Convolutional Neural Networks (CNNs), the workhorse of modern computer vision.
In this tutorial, we’ll build a CNN from scratch to classify 10 different types of everyday objects. We’ll cover the entire process, from loading data to training the model, and then explore advanced techniques like data augmentation and transfer learning to make our model even smarter. Let’s get started! 🚀
Step 1: The Dataset
Our project will use the famous CIFAR-10 dataset, which is conveniently built into TensorFlow. It’s a perfect dataset for learning because it’s complex enough to be interesting but small enough to train quickly.
The dataset includes:
- 60,000 color images, each 32x32 pixels (50,000 for training and 10,000 for testing).
- 10 distinct classes, with 6,000 images per class.
- The classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
First, we’ll load the data and normalize it. Normalizing means scaling the pixel values from their original range of 0-255 down to a range of 0-1. This simple step helps our network train more effectively.
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
# Load and split the dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
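Before building anything, it's a good idea to eyeball a few samples and confirm the data loaded correctly. Here's a quick sanity check (assuming you have matplotlib installed):
import matplotlib.pyplot as plt
# Display the first 9 training images with their class names
plt.figure(figsize=(6, 6))
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(train_images[i])
    plt.title(class_names[train_labels[i][0]])
    plt.axis('off')
plt.show()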
Step 2: Building the CNN Architecture 🏗️
A common CNN architecture consists of two main parts:
- The Convolutional Base: A stack of Conv2D and MaxPooling2D layers. Its job is to extract features (like edges, textures, and shapes) from the images.
- The Classifier: A set of Dense layers at the end. Its job is to take the extracted features and decide which class the image belongs to.
The Convolutional Base
Let’s build the feature-extracting part of our model.
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
Let’s break that down:
- Layer 1 (Conv2D): This layer looks for 32 different patterns (filters) using a 3x3 window. It takes our 32x32 image with 3 color channels as input. Because we don't pad the input, the 3x3 window can't be centered on the border pixels, so the output shrinks slightly, to 30x30.
- Layer 2 (MaxPooling2D): This layer simplifies the information by taking the maximum value from every 2x2 grid, halving the height and width of the feature maps (here, from 30x30 down to 15x15).
- Other Layers: The subsequent layers repeat this process. Notice that we increase the number of filters from 32 to 64. As the image gets spatially smaller, we can afford to detect more complex and numerous patterns.
We can see how the output shape changes at each step by looking at the model summary:
model.summary()
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv2d (Conv2D)                 │ (None, 30, 30, 32)     │           896 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d (MaxPooling2D)    │ (None, 15, 15, 32)     │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_1 (Conv2D)               │ (None, 13, 13, 64)     │        18,496 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_1 (MaxPooling2D)  │ (None, 6, 6, 64)       │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_2 (Conv2D)               │ (None, 4, 4, 64)       │        36,928 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 56,320 (220.00 KB)
Trainable params: 56,320 (220.00 KB)
Non-trainable params: 0 (0.00 B)
As you can see, the image’s height and width shrink, but its depth (the number of feature maps) grows.
The Classifier
Now that we have our feature extractor, we need to add the classifier on top. This will take the final feature map, flatten it into a single vector, and make a final prediction.
model.add(layers.Flatten())  # turn the 4x4x64 feature maps into a vector of 1,024 values
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))  # 10 outputs, one for each class
Note that the final Dense layer has no activation: it outputs raw scores (logits). We'll account for that when we choose the loss function in the next step.
Step 3: Training the Model
With our model fully designed, we just need to compile it and start training.
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_images, train_labels,
                    epochs=10,  # we'll train for 10 epochs
                    validation_data=(test_images, test_labels))
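The fit() call returns a History object that records loss and accuracy after every epoch. Plotting those curves is the quickest way to spot overfitting, where training accuracy keeps climbing while validation accuracy stalls (again assuming matplotlib):
import matplotlib.pyplot as plt
# Compare training and validation accuracy across epochs
plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()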
Step 4: Evaluating Our Model
After training, let’s see how well our model performs on the test data it has never seen before.
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(f"\nTest accuracy: {test_acc:.2f}")
313/313 - 4s - 12ms/step - accuracy: 0.6951 - loss: 0.8937
Test accuracy: 0.70
You should see an accuracy of around 70%. That’s not bad for a simple model built from scratch! But we can do better.
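Before we try to push the score higher, here's how you'd actually use the trained model on a single image. Remember that our final layer outputs raw logits, so we wrap the model with a softmax layer to turn them into probabilities (a small sketch reusing the objects defined above):
# Attach a softmax so the model outputs probabilities instead of logits
probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])
predictions = probability_model.predict(test_images[:1])
print(class_names[predictions[0].argmax()])  # predicted class for the first test image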
Level Up: Advanced Techniques 🌟
How do we push past 70% accuracy? Here are two powerful techniques used by professionals.
1. Data Augmentation
One of the biggest challenges in deep learning is overfitting, where a model gets too good at classifying the training data but fails on new data. A great way to fight this is with data augmentation.
This technique artificially creates more training data by applying random transformations—like rotations, zooms, and flips—to our existing images. This teaches the model to be more robust and generalize better.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Create a data generator that applies random transformations to the images
datagen = ImageDataGenerator(
    rotation_range=40,       # rotate randomly by up to 40 degrees
    width_shift_range=0.2,   # shift horizontally by up to 20% of the width
    height_shift_range=0.2,  # shift vertically by up to 20% of the height
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')     # fill newly exposed pixels with the nearest value
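On its own, the generator doesn't change anything; you have to stream augmented batches from it during training. A minimal sketch of how that might look with the model from earlier (the batch size of 32 is an arbitrary choice):
# Train on augmented batches streamed from the generator
history = model.fit(datagen.flow(train_images, train_labels, batch_size=32),
                    epochs=10,
                    validation_data=(test_images, test_labels))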
2. Pre-trained Models (Transfer Learning)
Why build a feature extractor from scratch when you can use one built by experts? Transfer learning involves taking a powerful, pre-trained CNN—one that has already learned from millions of images on a dataset like ImageNet—and using its convolutional base for your own project.
The early layers of these models are excellent at detecting universal features like edges, colors, and textures. We can simply chop off the original classifier and bolt on our own dense layers to classify our specific dataset. This approach allows you to achieve very high accuracy even with a small dataset.
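As a concrete sketch, here's one way that might look in Keras, using MobileNetV2 as the pre-trained base (any model from tf.keras.applications would do; note that 32x32 is at the lower limit of the input sizes these models accept, and upscaling the images first often works better):
import tensorflow as tf
from tensorflow.keras import applications, layers, models
# Load a convolutional base pre-trained on ImageNet, minus its original classifier
base_model = applications.MobileNetV2(input_shape=(32, 32, 3),
                                      include_top=False,
                                      weights='imagenet')
base_model.trainable = False  # freeze the pre-trained feature extractor
# Bolt our own classifier onto the frozen base
transfer_model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10)  # logits for our 10 classes
])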
Sometimes, you might want to fine-tune this model by unfreezing the last few layers of the pre-trained base and retraining them on your data. This helps the model specialize its feature detection for your unique problem, often leading to even better performance.
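In code, fine-tuning is just a matter of flipping the trainable flags and recompiling with a much lower learning rate, so the pre-trained weights are nudged rather than destroyed. Continuing the sketch above (how many layers to unfreeze and the learning rate of 1e-5 are arbitrary starting points, not tuned values):
# Unfreeze only the last few layers of the base (the cutoff of 20 is arbitrary)
base_model.trainable = True
for layer in base_model.layers[:-20]:
    layer.trainable = False
# Recompile with a low learning rate before retraining
transfer_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                       loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                       metrics=['accuracy'])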