January 2026

Gradient Descent

The simple algorithm behind every neural network, explained from scratch.

I took a course on AI and Machine Learning in my final year at college. Like most college courses, it was a mix of theory, assignments, and exams. We covered neural networks, backpropagation, CNNs, RNNs - the usual suspects. I understood the math, passed the exams, and promptly forgot most of it. As you do.

But lately, AI has been impossible to ignore. Every product I use has some AI feature. Every newsletter I read mentions transformers or diffusion models. I find myself constantly reading about machine learning - not because I have to, but because it's genuinely fascinating. And one concept keeps coming up, again and again: gradient descent.

A few months back, I was chatting with my friends about all this AI stuff. We were discussing how these models actually "learn" - like, what's really happening under the hood. One of them asked, "But how does the model know which direction to improve?" That question stuck with me. It's such a simple question but the answer is the foundation of nearly all modern AI.

Then I decided to actually build something. I'm working on an app that listens to live streams in Gujarati - specifically religious discourses - and answers questions based on that content. Think of it as a real-time Q&A system for spiritual teachings. To build this, I needed to fine-tune models, understand embeddings, work with loss functions. And every time I hit a wall, the answer came back to gradient descent. It's everywhere.

So I went back to basics. Forgot everything I "learned" in college and started fresh. This article is my attempt to explain gradient descent the way I wish it was explained to me - with intuition first, math second, and lots of interactive demos to play with.

The Learning Problem

Before we dive into gradient descent, let's understand the problem it solves. What does it mean for a machine to "learn"?

Say you want to predict house prices based on their size. You have some data - houses of various sizes and their actual prices. You want to find a relationship that lets you predict the price of a new house given its size. In mathematical terms, you're looking for a function that maps inputs (size) to outputs (price).

Let's start simple. Assume the relationship is linear: price = w × size + b. Here, w (weight) is the slope and b (bias) is the y-intercept. The question is: what values of w and b give us the best predictions?

"Best" here means: predictions that are as close as possible to the actual prices. We measure this "closeness" using something called a loss function. A common one is Mean Squared Error (MSE): take the difference between predicted and actual price for each house, square it (to make all errors positive), and average them.

So the learning problem becomes: find the values of w and b that minimize the loss function. That's it. That's what "training" a model means.
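To make this concrete, here's a minimal sketch of an MSE loss in Python. The dataset, names, and scales here are made up purely for illustration:

```python
def mse_loss(w, b, sizes, prices):
    """Mean squared error of the linear model: price = w * size + b."""
    errors = [(w * s + b - p) ** 2 for s, p in zip(sizes, prices)]
    return sum(errors) / len(errors)

# Hypothetical toy data: sizes in hundreds of sq ft, prices in lakhs
sizes = [10, 15, 20]
prices = [50, 75, 100]

print(mse_loss(5.0, 0.0, sizes, prices))   # a perfect fit gives a loss of 0.0
print(mse_loss(4.0, 0.0, sizes, prices))   # a worse fit gives a higher loss
```

Training just means searching for the (w, b) pair that makes this number as small as possible.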

Finding the Minimum

Alright, so we need to find values that minimize a function. How hard can that be?

If we only had one parameter to optimize, we could just try a bunch of values and pick the one with the lowest loss. But neural networks have millions, sometimes billions of parameters. GPT-4 is reported to have over a trillion. Trying every combination is impossible - there are more possibilities than atoms in the universe.

We need something smarter. And here's where the intuition comes in. Imagine you're blindfolded and dropped somewhere on a hilly landscape. Your goal is to find the lowest point - the valley. How would you do it?

You'd feel the ground around you. If it slopes down to your left, you'd step left. If it slopes down forward, you'd step forward. You'd keep taking small steps in the direction that goes downhill. Eventually, you'd reach a point where every direction leads uphill - that's the bottom of the valley.

That's gradient descent. Instead of blindly searching, you use the slope of the landscape to guide your steps. The "landscape" is your loss function plotted against your parameters. The "slope" is the gradient.

3D Loss Surface


The orange ball follows the gradient downhill toward the green minimum. Height = loss value. Color shows loss intensity.

In the demo above, you can see a loss surface - the height represents the loss for different values of w and b. Watch how gradient descent navigates down the surface, always moving in the direction of steepest descent. Click anywhere to drop a new starting point and watch it find its way down.

What is the Gradient?

The gradient is just a fancy word for "which direction is uphill, and how steep is it?" More precisely, it's a vector of partial derivatives. Each element tells you: if I increase this parameter slightly, how much does the loss change?

Let's say you have two parameters, w and b. The gradient is:

∇L = [∂L/∂w, ∂L/∂b]

If ∂L/∂w is positive, it means increasing w increases the loss - so we should decrease w. If it's negative, increasing w decreases the loss - so we should increase w.

The gradient points in the direction of steepest ascent. Since we want to minimize, we go in the opposite direction - the direction of steepest descent. Hence the name.

The update rule is beautifully simple:

w_new = w_old - α × ∂L/∂w

Where α (alpha) is the learning rate - how big of a step we take. This single equation is the heart of nearly all modern AI training.
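In code, the update rule is a single line inside a loop. Here's a one-parameter sketch, using a made-up loss L(w) = (w − 3)² whose derivative is 2(w − 3):

```python
def descend(dL_dw, w0, lr=0.1, steps=100):
    """Repeatedly apply the update rule: w_new = w_old - lr * dL/dw."""
    w = w0
    for _ in range(steps):
        w = w - lr * dL_dw(w)
    return w

# Minimize L(w) = (w - 3)^2; its derivative is 2 * (w - 3)
w = descend(lambda w: 2 * (w - 3), w0=0.0)
print(w)  # converges very close to 3, the minimum
```

Everything else - momentum, Adam, learning rate schedules - is a refinement of this one loop.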

The Learning Rate: A Balancing Act

The learning rate is arguably the most important hyperparameter in machine learning. Get it wrong, and your model either learns nothing or explodes.

Too small: Your model learns extremely slowly. It takes forever to converge. Each step is so tiny that you might need millions of iterations to get anywhere. It's like trying to walk from Mumbai to Delhi by taking baby steps.

Too large: Your model overshoots the minimum. It bounces around wildly, never settling down. The loss might even increase instead of decreasing. Imagine trying to land a plane by alternating between diving and climbing - you'd crash.

Just right: The sweet spot. Fast enough to make progress, small enough to converge smoothly. Finding this sweet spot is one of the key skills in training ML models.

Learning Rate Effect


Try different learning rates. Too small = slow progress. Too large = overshoots and may diverge. Watch how step size changes!

Play with the slider above. Watch how a tiny learning rate (0.001) creeps along slowly, while a large one (1.0) bounces around chaotically. The middle values (0.01 - 0.1) usually work best - but "best" depends on your specific problem.

In practice, people often use learning rate schedules - starting with a larger rate to make quick progress, then reducing it as training progresses to fine-tune. It's like driving fast on a highway, then slowing down as you navigate into a parking spot.

Flavors of Gradient Descent

So far, I've described the basic version. But there are several variants, each with different trade-offs. Understanding these helped me a lot when I was training models for my Gujarati Q&A app.

Batch Gradient Descent

The "vanilla" version. Compute the gradient using all your training data, then take one step. This gives you the true gradient - the exact direction of steepest descent.

The problem? If you have millions of data points (which is common), you need to process all of them just to take a single step. That's extremely slow and needs huge amounts of memory.

Stochastic Gradient Descent (SGD)

The opposite extreme. Compute the gradient using just one randomly chosen data point, then take a step. Rinse and repeat.

This is much faster per step - you process one example instead of millions. But the gradient estimate is noisy. One data point might not be representative. The path to the minimum becomes jagged and erratic.

Surprisingly, this noise can actually help! It can shake the optimizer out of shallow local minima and help it find better solutions. Sometimes random jiggling helps you discover a path you wouldn't find by being too precise.

Mini-Batch Gradient Descent

The best of both worlds, and what everyone actually uses in practice. Compute the gradient using a small batch of data - typically 32, 64, or 128 examples.

You get a reasonably accurate gradient estimate (averaging over multiple examples reduces noise) while still being computationally efficient. Plus, modern GPUs are optimized for processing batches in parallel - you get much better hardware utilization.
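Here's a sketch of a mini-batch loop for the line-fitting example, with hypothetical toy data. Setting batch_size to 1 turns it into SGD; setting it to the full dataset size turns it into batch gradient descent:

```python
import random

def minibatch_gd(xs, ys, lr=0.02, batch_size=2, epochs=1000, seed=0):
    """Mini-batch gradient descent for y = w*x + b under MSE (toy sketch)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)                     # new random batches each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            n = len(batch)
            # Gradient of MSE estimated on this batch only
            dw = (2 / n) * sum((w * x + b - y) * x for x, y in batch)
            db = (2 / n) * sum((w * x + b - y) for x, y in batch)
            w -= lr * dw
            b -= lr * db
    return w, b

w, b = minibatch_gd([1, 2, 3, 4], [2, 4, 6, 8])   # true line: y = 2x
```

Each batch gives a slightly different gradient estimate, which is exactly the noise the demos below visualize.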

Gradient Descent Variants


Batch takes smooth, direct steps. SGD is noisy and erratic. Mini-batch balances both. Toggle "3D Surface" off to see paths clearly from above.

Watch how the three variants behave differently in the demo above. Batch gradient descent takes smooth, confident steps. SGD is erratic and noisy. Mini-batch is somewhere in between - controlled but with some healthy randomness.

The Challenges

Gradient descent sounds elegant, but real-world loss surfaces are messy. Here are some challenges that kept me up at night when training my models.

Local Minima

Remember the blindfolded hiker analogy? What if there are multiple valleys? You might end up in a small valley when there's a much deeper one nearby. Gradient descent only sees the local landscape - it has no way to know if there's something better over the next hill.

For simple problems like linear regression, this isn't an issue - the loss surface is convex (bowl-shaped), with exactly one minimum. But neural networks have incredibly complex, non-convex loss surfaces with countless local minima.

The good news: research has shown that in high-dimensional spaces (millions of parameters), most local minima are actually pretty good - close to the global minimum in terms of loss. The landscape is more like an egg carton than a mountain range.

Saddle Points

These are trickier than local minima. A saddle point is where the gradient is zero, but it's not a minimum - it's a minimum in some directions and a maximum in others. Like the middle of a horse's saddle.

At a saddle point, basic gradient descent gets stuck. The gradient is zero, so there's no direction to move. Yet there are clearly better solutions nearby - you just can't see them from where you're standing.

Plateaus and Flat Regions

Sometimes the loss surface is nearly flat over a large region. The gradient is close to zero, so steps are tiny. Training seems to stall. You might think you've converged, but actually you're just in a flat area with better regions beyond.

Optimization Challenges

Two valleys - the ball might settle in the shallower one


Rotate to explore the surface. Try different starting points to see how gradient descent can get stuck in challenging landscapes.

The demo above shows these scenarios. Try different starting points and watch how gradient descent behaves. Notice how it can get stuck at saddle points, settle in local minima, or slow down dramatically in flat regions.

Modern Optimizers

Vanilla gradient descent has limitations. Over the years, researchers have developed smarter variants that address these challenges. When I was fine-tuning models for my app, understanding these made a real difference.

Momentum

Imagine a ball rolling down a hill. It doesn't just follow the local slope - it builds up speed. Even if it hits a small bump (local minimum), its momentum carries it through.

Momentum does exactly this. Instead of just using the current gradient, it accumulates past gradients:

v = β × v + (1-β) × ∇L
w = w - α × v

Where β (typically 0.9) controls how much history to remember. This helps blast through saddle points, escape shallow local minima, and accelerate in consistent directions.
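The two equations above translate directly into code. A single-parameter sketch, using a quadratic loss purely for illustration:

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One momentum update: v is an EMA of gradients, w steps along v."""
    v = beta * v + (1 - beta) * grad
    w = w - lr * v
    return w, v

# Minimize L(w) = w^2 (gradient 2w), starting from w = 5
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
```

Because v remembers past gradients, the parameter keeps moving even through regions where the current gradient is briefly small.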

RMSprop

Different parameters might need different learning rates. A parameter that rarely updates should take bigger steps when it does. A parameter that updates constantly should take smaller steps.

RMSprop (Root Mean Square Propagation) adapts the learning rate per parameter based on the recent history of gradients:

s = β × s + (1-β) × (∇L)²
w = w - α × ∇L / (√s + ε)

Parameters with large recent gradients get smaller effective learning rates. Parameters with small recent gradients get larger ones. It's automatic adjustment.
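As code, again for a single parameter on an illustrative quadratic loss:

```python
def rmsprop_step(w, s, grad, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update: divide the step by the RMS of recent gradients."""
    s = beta * s + (1 - beta) * grad ** 2
    w = w - lr * grad / (s ** 0.5 + eps)
    return w, s

# Minimize L(w) = w^2, starting from w = 5
w, s = 5.0, 0.0
for _ in range(2000):
    w, s = rmsprop_step(w, s, grad=2 * w)
```

The eps term only guards against division by zero when s is still tiny.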

Adam

Why choose between momentum and adaptive learning rates when you can have both? Adam (Adaptive Moment Estimation) combines the best of both worlds. It maintains both a momentum term and an adaptive learning rate term.

Adam is the default choice for most practitioners. When in doubt, start with Adam. It works well across a wide range of problems with minimal tuning. That said, it's not always the best - for some problems, plain SGD with momentum actually generalizes better.
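A single-parameter sketch of the Adam update, combining the momentum and RMSprop pieces above (the bias-correction terms compensate for m and v starting at zero):

```python
def adam_step(w, m, v, grad, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus per-parameter scaling (v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)    # correct m's bias toward zero early on
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# Minimize L(w) = w^2, starting from w = 5; t starts at 1
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)
```

The defaults shown (beta1=0.9, beta2=0.999) are the ones most libraries ship with.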

Optimizers Race


Adam typically wins the curved valley. Toggle "3D Surface" off to see paths clearly from above.

Watch the four optimizers race in the demo above. Notice how SGD struggles with the curved valley while momentum helps it navigate. Adam usually reaches the minimum fastest, adapting its steps automatically.

A Quick Word on Backpropagation

I've been glossing over something important. I've said "compute the gradient" many times, but how do you actually do that for a neural network with millions of parameters?

The answer is backpropagation. It's an application of the chain rule from calculus. The loss depends on the output, the output depends on the last layer's weights, which depend on the second-to-last layer's outputs, and so on.

Backpropagation efficiently computes how the loss changes with respect to each parameter by working backwards through the network. The "back" in backpropagation refers to this backward pass through the network.

The key insight: you can compute gradients for millions of parameters in roughly the same time as one forward pass. That's what makes training deep networks feasible.
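To get a feel for the chain rule at work, here's a hand-rolled backward pass through a tiny, hypothetical two-weight "network" - nothing more than two multiplications chained together:

```python
def forward_backward(x, target, w1, w2):
    """Manual backprop through a tiny two-weight chain: x -> h -> y."""
    # Forward pass
    h = w1 * x
    y = w2 * h
    loss = (y - target) ** 2
    # Backward pass: chain rule, reusing dL/dy for both weight gradients
    dL_dy = 2 * (y - target)
    dL_dw2 = dL_dy * h        # because dy/dw2 = h
    dL_dh = dL_dy * w2        # propagate back through the second "layer"
    dL_dw1 = dL_dh * x        # because dh/dw1 = x
    return loss, dL_dw1, dL_dw2

loss, g1, g2 = forward_backward(x=2.0, target=8.0, w1=1.0, w2=1.0)
```

Notice how dL_dy is computed once and reused for both gradients - that sharing of intermediate results is what makes backpropagation efficient at scale.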

Backpropagation deserves its own article (maybe I'll write that next!). For now, just know that it's the efficient algorithm that makes gradient descent practical for neural networks.

Putting It Together: Linear Regression

Let's see gradient descent in action with the simplest possible example: fitting a line to some points. No neural networks, no complexity - just a line.

We have some data points. We want to find the line y = w × x + b that best fits them. "Best" means minimizing the mean squared error between our predictions and the actual values.
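Here's the whole procedure as one self-contained sketch, using hypothetical data drawn from the line y = 2x + 1:

```python
def train_line(xs, ys, lr=0.05, steps=1000):
    """Fit y = w*x + b by full-batch gradient descent on MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE with respect to w and b
        dw = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        db = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * dw
        b -= lr * db
    return w, b

w, b = train_line([1, 2, 3, 4], [3, 5, 7, 9])   # data from y = 2x + 1
print(w, b)  # close to 2 and 1
```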

Watch gradient descent find the optimal line:

Linear Regression Loss Surface


The orange ball descends the MSE loss surface to find optimal w and b. Toggle "3D Surface" off to see the path from above.

Click "Add Point" to create your own dataset and watch gradient descent figure out the best line. Notice how it starts with a random line and gradually improves - each step reducing the total error.

Where is Gradient Descent Used?

Everywhere. I mean it. If you've interacted with any AI system today, gradient descent was involved in training it.

Large Language Models: ChatGPT, Claude, Gemini - all trained using gradient descent. When these models "learn" to predict the next word, they're doing millions of gradient updates across billions of parameters. The scale is mind-boggling, but the core algorithm is the same one we've been discussing.

Image Recognition: Every photo app that recognizes faces, every self-driving car that identifies pedestrians, every medical system that spots tumors - trained with gradient descent.

Recommendation Systems: Netflix suggesting your next show, Spotify creating playlists, Amazon recommending products - all powered by models trained with gradient descent.

Voice Assistants: Siri, Alexa, Google Assistant - the speech recognition and natural language understanding models are trained with... you guessed it.

My Gujarati Q&A App: Even my small project uses gradient descent. Fine-tuning embedding models for Gujarati text, training the question-answering components - all gradient descent under the hood.

It's remarkable that such a simple idea - "take small steps downhill" - is the foundation of all these diverse applications.

When NOT to Use Gradient Descent

Gradient descent is powerful, but it's not always the right tool.

Non-differentiable functions: Gradient descent needs gradients. If your function has discontinuities or isn't differentiable, you can't compute gradients. Techniques like evolutionary algorithms or reinforcement learning might be better choices.

Discrete optimization: If your parameters are discrete (integers, categories), gradients don't exist. You can't take "half a step" from category A to category B. Techniques like genetic algorithms, simulated annealing, or integer programming are more appropriate.

Problems with closed-form solutions: Linear regression actually has a closed-form solution - you can compute the optimal weights directly without iteration. Using gradient descent for such problems is like taking a flight when you could teleport. But for neural networks, no closed-form solution exists, so gradient descent is our best option.
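For one-variable linear regression, that closed form (ordinary least squares) is just two summations - no loop, no learning rate:

```python
def closed_form_line(xs, ys):
    """Ordinary least squares for y = w*x + b, solved directly."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

w, b = closed_form_line([1, 2, 3, 4], [3, 5, 7, 9])   # recovers y = 2x + 1
```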

Very noisy gradients: If your gradient estimates are extremely noisy (high variance), gradient descent might never converge. Variance reduction techniques or alternative optimization methods might be needed.

Practical Tips

If you're just starting to train models, here are some things I've learned (often the hard way):

Start with Adam. It's robust and works well out of the box. You can experiment with others once you have a baseline.

Normalize your inputs. If your features have wildly different scales (e.g., age 0-100 vs. salary 0-1000000), gradient descent will struggle. Normalize everything to similar scales.
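A common way to do this is z-scoring each feature. A minimal sketch (the feature values here are invented):

```python
def standardize(values):
    """Z-score a feature: zero mean, unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

ages = standardize([25, 40, 60])                    # was on a 0-100 scale
salaries = standardize([30_000, 80_000, 500_000])   # was on a 0-1,000,000 scale
# Both features now live on comparable scales
```

In practice you'd compute the mean and std on the training set only, and reuse them for validation and test data.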

Monitor the loss curve. If it's not going down, something is wrong. If it's going down too slowly, try a larger learning rate. If it's oscillating wildly, try a smaller one.

Use learning rate schedules. Start with a higher rate, decay it over time. Common schedules include step decay, exponential decay, and cosine annealing.
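Cosine annealing, for instance, is a one-line formula:

```python
import math

def cosine_annealing(lr_max, lr_min, step, total_steps):
    """Cosine annealing: decay smoothly from lr_max at step 0 to lr_min at the end."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

for step in [0, 50, 100]:
    print(step, cosine_annealing(0.1, 0.001, step, total_steps=100))
```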

Watch for overfitting. If training loss goes down but validation loss goes up, you're memorizing the training data instead of learning general patterns. Add regularization or get more data.

Gradient clipping for RNNs. Recurrent networks are prone to exploding gradients. Clipping the gradient norm can prevent training from blowing up.
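Clipping by global norm can be sketched in a few lines (treating the gradient as a flat list of numbers for simplicity):

```python
def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector if its L2 norm exceeds max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        return [g * (max_norm / norm) for g in grads]
    return grads

clip_by_norm([3.0, 4.0], max_norm=1.0)     # norm 5.0 -> scaled to roughly [0.6, 0.8]
clip_by_norm([0.03, 0.04], max_norm=1.0)   # small gradient passes through unchanged
```

The direction of the step is preserved; only its magnitude is capped.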

Conclusion

When I started digging back into machine learning, I expected complexity. Sophisticated algorithms, advanced mathematics, intricate techniques. And yes, there's plenty of that. But at the core of it all is this surprisingly simple idea: take small steps in the direction that reduces your error.

That's gradient descent. Follow the slope downhill. It's almost embarrassingly straightforward. And yet it powers every neural network, every language model, every image classifier, every recommendation system. The same basic algorithm that fits a line to points also trains GPT-4. The scale is different, but the principle is identical.

Like Merkle trees in my previous article, gradient descent is another example of how simple ideas, applied correctly, can achieve remarkable things. It's not about complexity - it's about finding the right abstraction.

I'm still working on my Gujarati Q&A app. Every time I fine-tune a model, every time I watch the loss curve go down, I think about gradient descent. Millions of tiny steps, each one making the model slightly better. It's a beautiful thing.

So next time you ask ChatGPT a question or get a Netflix recommendation, remember: somewhere in the past, a simple algorithm took billions of small steps downhill to make that possible.