Stochastic Gradient Descent Made Fun

The expression for the slope of a function is:

What does it mean when the slope is zero?

What is the expression for the x-coordinate of the vertex of a parabola?

If you connect two dots that are equidistant from the vertex with a line, what is the slope of that line?

For the upward-opening parabola y = x², how does the slope change from the left to the right side of the vertex?

Take this quartic equation as an example: y = x⁴ + 2x³ − 3x² + x + 1. It has something that looks like a vertex, but since it's a higher-order polynomial, we call it a local minimum, and there's no simple vertex formula for it! It's at about (-2.22, -13.60).

In Calculus mode, we can find the x-values of extrema (local mins/maxes) by taking the derivative and solving where that derivative equals zero.
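Here is a rough Python sketch of that idea, using the quartic from above. The derivative is written out by hand with the power rule, and a simple search called bisection keeps halving an interval until it pins down where the slope is zero. The function names and the starting interval from -3 to -2 are choices made for this sketch, not part of the lesson.

```python
def f(x):
    """The quartic from the example: y = x^4 + 2x^3 - 3x^2 + x + 1."""
    return x**4 + 2 * x**3 - 3 * x**2 + x + 1


def slope(x):
    """Derivative of f, found with the power rule."""
    return 4 * x**3 + 6 * x**2 - 6 * x + 1


def find_flat_spot(lo, hi, tolerance=1e-9):
    """Bisection: shrink [lo, hi] around the x where slope(x) is zero.

    Assumes slope(lo) and slope(hi) have opposite signs.
    """
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        # Keep the half of the interval where the slope still changes sign.
        if slope(lo) * slope(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


x_min = find_flat_spot(-3.0, -2.0)
print(round(x_min, 2), round(f(x_min), 2))
```

Running it should print values close to the point (-2.22, -13.60) mentioned above.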

In real life problem solving, why might finding a vertex or local min/max be useful?

If f(x) represents the amount of water pollution, and x is the amount of a resource used in an engineering project, would it be better to find the local minimum or local maximum of this function? Explain your reasoning.

But the world is a lot more complicated than one number in and one number out! We can actually extend this concept to 3D! Don't worry about all the underlying math, I want to focus on intuition. 3D functions are cool; they are normally not taught until Calculus III. Below is the 3D equivalent of a parabola called a paraboloid:

Click and drag to rotate the paraboloid!

Why do you think the real world doesn't normally use 2D functions for science and engineering?

Let's say you could walk inside that paraboloid above, but you must be blindfolded. If you didn't exactly know where the bottom is, how would you figure out where the lowest point is? What concept from the start of the lesson helps us solve this question? Hint: There is no equation for this problem.

The "Blindfolded in a Valley" Analogy

Drag to orbit the valley. The person stays visible as you compare the terrain from different angles.

10.

Yes! The world is messy and we can't just plug stuff into equations. When we have a bunch of data, we can't rely on the quadratic formula, completing the square, etc. I mean, it would be insane to solve quartic equations by hand with this nightmare formula:

So what in the world do we do? We use computers!

Computers can crunch a bunch of math insanely fast. We can also use context clues to help computers make a lot of rapid guesses back-to-back, so they get closer to the output we want.

Remember the concept of local minima/maxima and how we talked about slope? Those concepts all exist in 3D too.

So how do you think a computer can find the local minimum of a function if all we know is the input, output, and slope of where we are "standing"? Hint: Imagine you are walking in the paraboloid from the last page trying to find the very bottom.

At each point, we can calculate the slope at that point and use it to decide which downhill step to take.

11.

Spoiler for Q10: A computer can act like someone standing at the edge of a paraboloid, taking small steps downward until it finds that the slope is flat. Remember that line we drew through the vertex of the parabola from Q4? Computers can figure that out, but in 3D too.

In Algebra II mode, we can say the slope is flat when the secant plane has no incline, like a perfectly level sheet across the bowl.

Computers really just make a bunch of educated guesses until they get close enough to the result we want: the lowest point of the paraboloid.

This is the concept of Gradient Descent, a foundational reason why ChatGPT works!!
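To make the walking idea concrete, here is a minimal Python sketch of gradient descent on a bowl. It assumes the paraboloid is z = x² + y² (the lesson does not give an equation), and the starting point, step size, and function names are made up for illustration.

```python
def height(x, y):
    """An assumed paraboloid (bowl): z = x^2 + y^2."""
    return x**2 + y**2


def slopes(x, y):
    """Slope of the bowl in the x direction and in the y direction."""
    return 2 * x, 2 * y


def walk_downhill(x, y, step_size=0.1, flat_enough=1e-6, max_steps=10_000):
    """Take small downhill steps until the ground feels flat."""
    for step in range(max_steps):
        slope_x, slope_y = slopes(x, y)
        if abs(slope_x) < flat_enough and abs(slope_y) < flat_enough:
            break
        # Step against the slope, because the slope points uphill.
        x -= step_size * slope_x
        y -= step_size * slope_y
    return x, y, step


x, y, steps_taken = walk_downhill(3.0, -4.0)
print(x, y, height(x, y), steps_taken)
```

Notice that the code never solves an equation for the bottom. It only asks "which way is downhill from here?" over and over.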

So instead of walking inside the paraboloid, let's imagine we have a super weird bowl with marbles inside. We drop marbles somewhere randomly in this bowl, and wherever the marbles end up landing decides how much money is saved. The deeper inside the bowl they end up, the better.

We have two holes. Think of the deepest hole as the global minimum and the other hole as a local minimum.

Let's say you could control the bowl with your hands and you want as many marbles as possible to end up in the deeper hole. You are blindfolded once again, so you can't tilt the bowl in the direction you know has the deeper hole.

What could you do to try to get the most marbles in the deepest hole after the marbles are dropped? Could anything be done?

The Training Simulator: "The 3D Marble Bowl"

Below is a 3D version of our weird bowl. The "best" result is the deepest part of the bowl (the Global Minimum). But sometimes, a marble gets stuck in a shallow dent on the side (a Local Minimum). The slider controls Stochastic Noise: the middle gives a useful shake that fades over time, while the far right can shake marbles right out of the bowl.

A little shaking helps the marbles escape shallow dents, but the shaking has to calm down. If the bowl keeps shaking forever, the marbles never settle into the best spot.

Simulator controls: a Noise / Shakiness slider and a running count of Marbles in Global Minimum. The loop runs continuously, and the marbles reset automatically.

12.

You just learned the concept of Stochastic Gradient Descent! This concept is not normally taught until 500-level graduate math classes 😱!

Stochastic is a fancy word for random.

To summarize, adding a bit of random "shakiness" helps computers get closer to the best minimum (the global minimum) than they would by only going straight down.

Some shakiness is good, but too much causes the marbles to fall out, so we need just the right amount.
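The marble bowl can be sketched in Python too. This is only a toy version under some assumptions: the "bowl" is the quartic from earlier, which has a deep hole near x = -2.22 and a shallow dent at x = 0.5; every marble starts at x = 1.5, on the shallow dent's side; the shake is random Gaussian noise that fades in a straight line to zero; and a marble counts as "fallen out" if it leaves the range from -4 to 3. None of these numbers come from the simulator above.

```python
import random


def slope(x):
    """Slope of the quartic y = x^4 + 2x^3 - 3x^2 + x + 1."""
    return 4 * x**3 + 6 * x**2 - 6 * x + 1


def roll_marble(start, noise, rng, steps=2000, step_size=0.01):
    """Roll one marble downhill with a shake that fades to zero.

    Returns the final x, or None if the marble is shaken out of the bowl.
    """
    x = start
    for t in range(steps):
        shake = noise * (1 - t / steps)  # the shaking calms down over time
        x = x - step_size * slope(x) + shake * rng.gauss(0, 1)
        if x < -4 or x > 3:  # hypothetical rim of the bowl
            return None
    return x


def count_in_deep_hole(noise, marbles=100, seed=0):
    """Count how many marbles finish near the deep hole at x = -2.22."""
    rng = random.Random(seed)
    landed = 0
    for _ in range(marbles):
        x = roll_marble(1.5, noise, rng)
        if x is not None and abs(x - (-2.22)) < 0.1:
            landed += 1
    return landed


for noise in (0.0, 0.05, 2.0):
    print(f"noise={noise}: {count_in_deep_hole(noise)} of 100 marbles in the deep hole")
```

With no noise, every marble should stop in the shallow dent. A small fading shake should let many of them reach the deep hole, and a huge shake should throw most of them out of the bowl.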

In AI training, this "shakiness" is usually called noise or stochasticity. Shaping the bowl is one analogy for training AI. Temperature is different: it matters later, when ChatGPT is actually writing its response.

Instead of marbles in weird bowl holes, picture the depth of the hole as the amount of pollution reduced, resources saved, etc.

In your own words, why does “shakiness” increase accuracy?

You think that’s crazy? It’s about to get nuts.

13.

What in the world does this have to do with ChatGPT?

AI like ChatGPT stores how words relate to each other as a whole bunch of numbers organized in a specific way. Check this out:

In this first example, ChatGPT stores the concept of “Royalty” on the x-axis and “Gender Identity” on the y-axis.

Word vector example with royalty and gender axes

In this second example, ChatGPT stores the concept of “Activity Intensity” on the x-axis and “Temperature” on the y-axis.

Word vector example with activity intensity and temperature axes

Word vector example with concept arithmetic

Here, we can see how we can perform math operations between word concepts in the way ChatGPT thinks. If we go by the coordinate pairs, we can calculate: Monarch = Queen - Woman.
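Here is a small Python sketch of that kind of word math. The coordinates are invented for illustration (they are not read from the figure or from ChatGPT): the first number is royalty and the second is gender identity.

```python
# Hypothetical (royalty, gender identity) coordinates, invented for this sketch.
words = {
    "king": (1, -1),
    "queen": (1, 1),
    "monarch": (1, 0),
    "man": (0, -1),
    "woman": (0, 1),
}


def subtract(a, b):
    """Subtract two word vectors, one coordinate at a time."""
    return tuple(ai - bi for ai, bi in zip(a, b))


def closest_word(point):
    """Name of the stored word nearest to `point`."""
    def distance_squared(name):
        return sum((p - w) ** 2 for p, w in zip(point, words[name]))
    return min(words, key=distance_squared)


result = subtract(words["queen"], words["woman"])
print(result, closest_word(result))
```

Taking away the "woman" part of "queen" leaves only the royalty part, which lands on "monarch" in this made-up map.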

Before we jump into four dimensions, it helps to think about Flatland, written by Edwin Abbott Abbott: a world where the inhabitants can only see two dimensions. Its class depictions are also a satire of the rigidity of Victorian-era social hierarchy. A 3D object passing through their world would look like a changing 2D slice or shadow, because they can only observe its projection into their space.

Flatland inhabitants discussing how a higher-dimensional object appears in their two-dimensional world

Vector projection is the same core idea in math: we take something with more information and ask, "How much of it points in this direction?" A word vector can be projected onto axes like royalty, temperature, or activity intensity, and each projection gives us one meaningful part of the full concept.
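As a sketch, "how much of it points in this direction?" is a dot product divided by the length of the axis. The word vector and the axes in this Python example are hypothetical numbers chosen for illustration.

```python
def dot(a, b):
    """Multiply matching coordinates and add them up."""
    return sum(ai * bi for ai, bi in zip(a, b))


def projection_length(vector, axis):
    """How much of `vector` points along `axis` (the scalar projection)."""
    axis_length = dot(axis, axis) ** 0.5
    return dot(vector, axis) / axis_length


# Hypothetical word vector: (royalty, temperature, activity intensity).
word = (0.9, 0.2, 0.4)
royalty_axis = (1, 0, 0)
temperature_axis = (0, 1, 0)

print("royalty part:", projection_length(word, royalty_axis))
print("temperature part:", projection_length(word, temperature_axis))
```

Each projection throws away everything except one direction, just like a Flatlander only seeing a slice of a 3D object.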

Now let’s add one more dimension. Below, the words sit in 3D concept space, but they only "exist" when the slider matches their hidden 4th dimension. You can choose different "lenses" to see how AI organizes concepts like Flavor, Vibe, or Arcane Power.

4D Word Vector Visualization: The Flavor-Scanner

Choose a 4D Lens:

Drag to rotate. Move the slider to scan the 4th dimension (Edibility). Watch as words morph into view and grow in size when the "Flavor-Scanner" hits their specific coordinate.

Legend: Cosmic / Massive, Microscopic / Life, Synthetic / Artifacts, Culinary Concepts

So ChatGPT does all of this, but instead of pairs of numbers, it is able to perform these operations on lists of thousands of numbers at a time. This is because there are so many more relationships between words outside of “Royalty” and “Gender Identity” or “Activity Intensity” and “Temperature”.

So instead of 2D space (with parabolas) or even 3D space (with paraboloids)...

We are working with functions in THOUSAND-PLUS DIMENSIONAL SPACE!!!!

Illustration of high-dimensional space

What does it mean to "train AI"? We are using two analogies. One analogy is words coming into place. Another analogy is a bowl being shaped.

Both visuals are simplified ways to imagine training: the model is gradually learning patterns from data.

What Does it Mean to Train AI?

This animation is a depiction of how ChatGPT is "trained" to know what to say. Words coming into place and a bowl being shaped are both analogies for what it means to train AI.

Words Falling Into Place

The Bowl Being Shaped

What do you think happens when we raise or lower the temperature of the output of ChatGPT? Yes, this is shaking a bowl in a thousand-plus dimensional space. (Hint: "Shaping" the bowl is different from "shaking" it.)

14.

How ChatGPT Generates an Output

After training, the model uses the patterns it learned to choose one word after another. The word space helps represent possible next words, and the shaped bowl is an analogy for the trained model already being ready to guide the output.

Simulator panels: a Temperature slider, Words Selected in Order, Trained Bowl Guiding Choices, and Next Word Probability.
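One common way temperature is used when picking the next word is a formula called softmax: divide each word's score by the temperature, then turn the results into probabilities. The Python sketch below assumes that approach; the words, scores, and temperature values are hypothetical, and real models choose among many thousands of options.

```python
import math


def next_word_probabilities(scores, temperature):
    """Turn raw word scores into probabilities (softmax with temperature)."""
    scaled = [s / temperature for s in scores.values()]
    biggest = max(scaled)  # subtract the max so exp() never overflows
    weights = [math.exp(s - biggest) for s in scaled]
    total = sum(weights)
    return {word: w / total for word, w in zip(scores, weights)}


# Hypothetical scores for the word after "The marble rolled into the ..."
scores = {"bowl": 3.0, "hole": 2.0, "ocean": 0.5}

for temperature in (0.2, 1.0, 2.0):
    probs = next_word_probabilities(scores, temperature)
    print(temperature, {w: round(p, 3) for w, p in probs.items()})
```

A low temperature piles almost all the probability onto the top word, while a high temperature spreads it out so less likely words get picked more often.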

To conclude, we started by talking about how to calculate the slope of a function, and we ended with the insanity of dropping marbles in strange, thousand-plus-dimensional bowls to explain how ChatGPT works. What you are learning in class now aligns with the foundation of not only AI, but how the world works in SO many ways. Below, please connect ONE topic you learned in class to ONE new idea you learned today.

P.S.: What we talked about today is exactly why TikTok and Instagram Reels are so addicting and the Oxford Word of the Year is “Rage Bait”. The “marbles at the bottom of the bowl” for social media have to do with watch time, comments, and shares. Outrage and dishonesty are more effective at this than positivity and the full truth.

One last note: examples like walking down a hill blindfolded, shaping a bowl, and dropping marbles into a bowl are analogies. They are not a perfect representation of how AI works, but they help get the general ideas across.

Thank you so much for listening and participating!!

About the Speaker (Griffin Rutherford)

I am a graduate of Santa Fe Prep and hold a Bachelor and Master of Science in Computer Science from Colorado School of Mines. People say I have "golden retriever energy", and I love being physically active. I am an avid weightlifter, trail cyclist, snowboarder, runner, and more. I moved back to Santa Fe to work a remote position as an AI Research Engineer. I was a member of the Alpha Tau Omega fraternity as Philanthropy Chair, and I lived in the frat house basement for two years (at the expense of my GPA). I also love to sing, write, and enjoy long conversations with friends and family.


The Math Behind ChatGPT and AI

Guest Speaker - Griffin Rutherford - MS Computer Science - Mountain Runner
