Oct 14, 2025
7 min read

14. Seeing the World Through Convolutional Neural Networks

In this episode, we’re diving into one of the most powerful and widely used architectures in deep learning — the Convolutional Neural Network (CNN).

CNNs are the foundation of modern computer vision — the technology behind facial recognition, autonomous driving, medical imaging, and even the camera filters on your phone. Let’s unpack how they work, why they’re so effective, and how you can build one from scratch.


🏔️ The Challenge: Classifying Images

Imagine you’re trying to classify whether an image contains mountains or no mountains. A natural first idea might be to use a fully connected neural network (FCNN). Each pixel becomes an input neuron, connected to every neuron in the next layer.

But here’s the problem:

  • A 200×200 image has 40,000 pixels, and each pixel has 3 color values (R, G, B).
  • That’s 120,000 inputs in total!
  • Even a single hidden layer of 60,000 neurons (half the input size) would mean over 7 billion parameters to learn.

Not only is this computationally impractical, but the model is also sensitive to where an object appears in the image. Shift the mountain to a different corner of the frame and the model might suddenly think it’s not there anymore.

Clearly, we need a smarter approach — one that understands patterns like edges, shapes, and textures, regardless of their position.


🧩 The Key Idea: Convolution and Kernels

CNNs borrow a concept from image processing called filtering.

A filter (or kernel) is a small matrix — often 3×3 or 5×5 — that “slides” over the image and transforms it. For example:

  • A blur filter smooths the image.
  • An edge detection filter highlights boundaries.

Each pixel in the new image is computed by multiplying the kernel’s values elementwise with the pixel values underneath it and summing the result. This sum-of-products is the convolution operation.

Now, instead of using a fixed filter (like an edge detector), CNNs learn their own kernels during training. These learned filters automatically pick up useful patterns — corners, lines, textures, and eventually objects.
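
To make the sliding concrete, here is a minimal NumPy sketch of the operation. (Strictly speaking, deep learning libraries compute cross-correlation, with no kernel flip, but call it convolution; the Laplacian edge kernel below is one classic hand-crafted example.)

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over a 2-D image (valid mode, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # elementwise multiply, then sum
    return out

# A classic edge-detection kernel; a CNN would learn values like these instead
edge_kernel = np.array([[ 0, -1,  0],
                        [-1,  4, -1],
                        [ 0, -1,  0]], dtype=float)

image = np.random.rand(6, 6)                 # toy grayscale image
print(convolve2d(image, edge_kernel).shape)  # (4, 4)
```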


🧮 From Math to Neural Network

Mathematically, the convolution operation looks like a dot product between a kernel and a local patch of the image. In neural network terms:

$$\text{Output} = \sigma(W^\top X + b)$$

Here, $W$ holds the kernel’s weights, shared across all regions of the image, a concept called weight sharing. This drastically reduces the parameter count and gives CNNs the ability to detect the same feature anywhere in the image.
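
To see what weight sharing buys, compare parameter counts; the 8-kernel conv layer below is an illustrative choice that echoes the architecture later in this post:

```python
# Fully connected: every one of the 120,000 inputs feeds every hidden neuron
fc_params = (200 * 200 * 3) * 60_000        # 7,200,000,000 weights

# Convolutional: 8 kernels of size 3x3 over 3 color channels, plus one bias each
conv_params = 8 * (3 * 3 * 3) + 8           # 224 parameters

print(f"{fc_params:,} vs {conv_params:,}")  # 7,200,000,000 vs 224
```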


🔍 Important CNN Concepts

1. Receptive Field

Each neuron in a CNN “sees” only a small patch of the image — its receptive field. Unlike fully connected layers, CNN neurons don’t look at the entire image at once, making them efficient and spatially aware.

2. Padding

Sometimes, to preserve image size after convolution, we pad the image with zeros.

  • Valid Padding: No padding → smaller output.
  • Same Padding: Zero padding → output size same as input.

3. Stride

Stride determines how far the kernel moves at each step. A stride of 1 visits every pixel position, while a stride of 2 moves two pixels at a time, roughly halving the output size.

4. Feature Maps

Each kernel produces one feature map, highlighting a specific pattern (like edges or textures). Using multiple kernels in a layer gives multiple feature maps — each capturing a different feature.


📏 Example: Computing Output Size

If the input image is 185×185, with:

  • Kernel = 5×5
  • Stride = 2
  • No padding

Then the output size is:

$$\text{Output} = \frac{185 - 5}{2} + 1 = 91$$

So the feature map becomes 91×91.
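
The general rule, with input size $n$, kernel size $k$, padding $p$, and stride $s$, is $\lfloor (n - k + 2p) / s \rfloor + 1$. A tiny helper makes these checks painless:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * padding) // stride + 1

print(conv_output_size(185, k=5, stride=2))  # 91
```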


🌊 Pooling Layers: Summarizing Information

Pooling layers help reduce dimensions while preserving key information.

  • Max Pooling: Takes the maximum value from each region (most common).
  • Average Pooling: Takes the average of the region.

Pooling makes the network more robust to small shifts — it can recognize objects even if they move slightly — and helps denoise the input.
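
A small NumPy sketch of max pooling over one feature map (window size and stride here are just for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling over a 2-D feature map."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = region.max()   # keep only the strongest activation
    return out

fm = np.arange(16).reshape(4, 4)
print(max_pool(fm))  # [[ 5.  7.]
                     #  [13. 15.]]
```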


🏗️ Building the Architecture

A simple CNN might look like this:

  1. Input Layer – Image (RGB, 335×335)
  2. Conv Layer 1 – 8 kernels, 3×3, stride 2
  3. Max Pooling – 3×3, stride 4
  4. Conv Layer 2 – 16 kernels, 3×3, stride 2
  5. Max Pooling – 3×3, stride 4
  6. Flatten Layer – Converts 3D feature maps into 1D vector
  7. Fully Connected Layer – 250 neurons
  8. Output Layer – Binary classification (mountain / no mountain)
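
Translated into Keras, that stack might look like the sketch below (the activation functions are assumptions; the post only fixes the layer shapes):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(335, 335, 3)),                      # RGB input
    layers.Conv2D(8, kernel_size=3, strides=2, activation="relu"),
    layers.MaxPooling2D(pool_size=3, strides=4),
    layers.Conv2D(16, kernel_size=3, strides=2, activation="relu"),
    layers.MaxPooling2D(pool_size=3, strides=4),
    layers.Flatten(),                                       # 3-D feature maps -> 1-D vector
    layers.Dense(250, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                  # mountain / no mountain
])
model.summary()
```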

🚛 Real-World Example: Tracking Trucks with CNNs

Let’s consider a hedge fund in early 2017 trying to predict company earnings. They hypothesize that the number of trucks leaving a distribution center could indicate product sales — and thus earnings.

Using traffic camera footage, they train a CNN to detect whether a truck is in each video frame. They label 100,000 images (half with trucks, half without) using Amazon Mechanical Turk, train the CNN, and achieve near 100% accuracy on the test set.

This “truck count” becomes a feature in their broader financial prediction model — ultimately helping them make profitable trades.

That’s the beauty of CNNs: from mountains to trucks to tumors, they can learn to see what humans can’t easily quantify.


⚙️ Improving Model Performance

During training, they faced two common issues:

  • Overfitting → solved with Dropout Layers.
  • Slow Convergence → solved with Batch Normalization.

They also used:

  • ReLU activation
  • Adam optimizer
  • Kaiming initialization
  • L1/L2 regularization
  • 80/10/10 data split (train/validation/test)
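
Here is how those pieces might fit together in one Keras conv block (filter count, dropout rate, and L2 strength are illustrative choices):

```python
from tensorflow.keras import layers, regularizers

def conv_block(filters):
    """One conv block combining the tricks listed above."""
    return [
        layers.Conv2D(
            filters, kernel_size=3,
            kernel_initializer="he_normal",            # Kaiming initialization
            kernel_regularizer=regularizers.l2(1e-4),  # L2 weight penalty
        ),
        layers.BatchNormalization(),  # stabilizes and speeds up convergence
        layers.Activation("relu"),
        layers.Dropout(0.25),         # randomly zero activations to curb overfitting
    ]

# Later: model.compile(optimizer="adam", loss="binary_crossentropy")
```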

🌐 State-of-the-Art CNN Architectures

| Model | Key Idea | Notable Features |
| --- | --- | --- |
| AlexNet | Two CNNs in parallel | Introduced ReLU, Dropout |
| GoogLeNet (Inception) | Multiple kernel sizes per layer | Inception blocks |
| ResNet | Residual (skip) connections | Enables >100 layers |

ResNet’s innovation — residual learning — helps networks go deeper without vanishing gradients. It remains one of the most successful architectures in image recognition.
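
The core idea fits in a few lines. A simplified residual block in the Keras functional API (real ResNet blocks also add batch normalization and projection shortcuts; this sketch assumes the input already has `filters` channels):

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two conv layers whose output is added back to the input."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])          # skip connection: F(x) + x
    return layers.Activation("relu")(y)

inputs = keras.Input(shape=(32, 32, 16))
outputs = residual_block(inputs, filters=16)  # 16 matches the input's channels
```

Because the block only has to learn the residual $F(x)$ on top of the identity, gradients can flow through the skip connections even in very deep stacks.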


🧰 Tools for Building CNNs

If you’re ready to experiment, try Keras, a high-level deep learning API:

  • Conv2D() – For convolutional layers
  • MaxPooling2D() – For pooling
  • Flatten() – For flattening feature maps
  • Dense() – For fully connected layers

You can even use Keras Applications to load pre-trained models like ResNet, Inception, or VGGNet, and fine-tune them for your own tasks.
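
For instance, a minimal fine-tuning sketch with a pre-trained ResNet50 (the input size and binary head are assumptions for a task like mountain / no mountain):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Load ResNet50 pre-trained on ImageNet, without its classification head
base = keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False   # freeze the pre-trained feature extractor

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),   # new head for a binary task
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```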


🏁 Wrapping Up

Convolutional Neural Networks revolutionized how machines see the world. They reduced the need for manual feature extraction and made computer vision scalable, flexible, and astonishingly powerful.

From counting trucks to classifying galaxies, CNNs form the foundation of most modern AI vision systems.

Next up in our crash course: we’ll continue exploring deep learning and uncover how large-scale architectures and GPUs make these models possible.