Stochastic Gradient Descent

The Math Behind ChatGPT!

Guest Speaker - Griffin Rutherford - MS Computer Science - Mountain Runner

This topic is normally only taught at the master's degree level. But I'm sure you can learn it!

1.
The expression for the slope of a function is:
2.
What does it mean when the slope is zero?
3.
What is the expression for the x-coordinate of the vertex of a parabola?
4.
If you connect two dots that are equidistant from the vertex with a line, what is the slope of that line?
5.
How does the slope change from the left to right side of the vertex of any parabola?
6.
Using this quartic equation as an example: y = x⁴ + 2x³ − 3x² + x + 1. We have something that looks like a vertex, but since it's a higher-order polynomial, it's called a local minimum, and there's no simple formula for it! It's at about (-2.22, -13.60).
In real life problem solving, why might finding a vertex or local min/max be useful?
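
A computer can double-check the point from Q6 without any formula. The sketch below is one assumed approach, not necessarily how the point in this lesson was found: it tries thousands of x values between -3 and 0 and keeps the one with the lowest y. The function names and the search range are made up for illustration.

```python
def f(x):
    """The quartic from Q6."""
    return x**4 + 2 * x**3 - 3 * x**2 + x + 1


def lowest_point_on_grid(lo, hi, steps):
    """Try evenly spaced x values in [lo, hi] and keep the one with the smallest y."""
    best_x = lo
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        if f(x) < f(best_x):
            best_x = x
    return best_x, f(best_x)


# Assumed search range: the dip is somewhere between x = -3 and x = 0.
x_min, y_min = lowest_point_on_grid(-3.0, 0.0, 3000)
print(x_min, y_min)
```

Guessing thousands of times is slow for a person but takes a computer a fraction of a second, which is the whole point of the rest of this lesson.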
7.
If f(x) represents the amount of water pollution, and x is the amount of a resource used in an engineering project, would it be better to find the local minimum or local maximum of this function? Explain your reasoning.
8.
But the world is a lot more complicated than one number in and one number out! We can actually extend this concept to 3D! Don't worry about all the underlying math, I want to focus on intuition. 3D functions are cool; they are normally not taught until Calculus III. Below is the 3D equivalent of a parabola called a paraboloid:

[Interactive figure: a paraboloid you can click and drag to rotate]

Why do you think the real world doesn't normally use 2D functions for science and engineering?
9.
Let's say you could walk inside that paraboloid above, but you must be blindfolded. If you didn't exactly know where the bottom is, how would you figure out where the lowest point is? What concept from the start of the lesson helps us solve this question? Hint: There is no equation for this problem.
10.

Yes! The world is messy and we can't just plug stuff into equations. As we have a bunch of data, we can't rely on the quadratic formula, completing the square, etc. I mean it would be insane to solve quartic equations by hand with this nightmare formula:

So what in the world do we do? We use computers!

Computers can crunch a bunch of math insanely fast. We can also use context clues to help computers make a lot of rapid guesses back-to-back, so they get closer to the output we want.

Remember the concept of local minima/maxima and how we talked about slope? Those concepts all exist in 3D too.

So how do you think a computer can find the local minimum of a function if all we know is the input, output, and slope of where we are "standing"? Hint: Imagine you are walking in the paraboloid from the last page trying to find the very bottom.
[Figure: Paraboloid surface]
11.

Spoiler for Q10: Computers can think like someone standing at the edge of a paraboloid, taking small steps downward until they can tell the slope is flat. Remember the line from Q4 that we drew between two dots on either side of the vertex? Computers can figure that out, but in 3D too.

Computers really just make a bunch of educated guesses until they get close enough to the result we want: the lowest point of the paraboloid.

This is the concept of Gradient Descent, the foundational reason why ChatGPT works!!
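
Here is a minimal sketch of that blindfolded walk, assuming the simplest possible bowl, z = x² + y², whose bottom is at (0, 0). The bowl, the starting spot, the step size, and the "flat enough" cutoff are all choices made up for this example. The computer only ever looks at the slope where it is "standing" and steps the opposite way.

```python
def height(x, y):
    """A paraboloid bowl whose lowest point is at (0, 0). Chosen for illustration."""
    return x**2 + y**2


def slope(x, y):
    """Slope of the bowl in the x direction and in the y direction."""
    return 2 * x, 2 * y


def gradient_descent(x, y, step_size=0.1, flat_enough=1e-6, max_steps=10_000):
    """Keep stepping downhill until the ground feels flat."""
    for step in range(max_steps):
        slope_x, slope_y = slope(x, y)
        if abs(slope_x) < flat_enough and abs(slope_y) < flat_enough:
            break  # the slope is basically zero, so we are at the bottom
        # Step against the slope: uphill is +slope, so downhill is -slope.
        x -= step_size * slope_x
        y -= step_size * slope_y
    return x, y, step


# Assumed starting spot somewhere up the side of the bowl.
x, y, steps = gradient_descent(3.0, -4.0)
print(f"stopped near ({x:.6f}, {y:.6f}) after {steps} steps")
```

Notice the code never uses a formula for where the bottom is. It just keeps checking the slope and stepping downhill.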

So instead of walking inside the paraboloid, let's imagine we have a super weird bowl with marbles inside. We drop marbles somewhere randomly in this bowl, and wherever the marbles end up landing decides how much money is saved. The deeper inside the bowl they end up, the better.

We have two holes. Think of the deeper hole as the global minimum and the other hole as a local minimum.

Let's say you could control the bowl with your hands and you want as many marbles as possible to end up in the deeper hole. You are blindfolded once again, so you can't tilt the bowl in the direction you know has the deeper hole.

What could you do to try to get the most marbles in the deepest hole after the marbles are dropped? Could anything be done?
12.

You just learned the concept of Stochastic Gradient Descent! This concept is not normally taught until 500-level graduate math classes 😱!

Stochastic is a fancy word for random.

To summarize, adding a bit of random "shakiness" helps computers land in the best minimum (the deepest hole) more often than going straight down does.

Some shakiness is good, but too much causes the marbles to fall out, so we need just the right amount.
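
The marble experiment can be sketched in code. Everything specific here is an assumption made for illustration: the bowl is a made-up 1D curve with a deep hole and a shallow hole, and the "shake" is random noise added by hand to each step and faded out over time. (In real Stochastic Gradient Descent the randomness usually comes from using a random sample of the data at each step, not from added noise.)

```python
import random


def bowl_height(x):
    """A made-up bowl with two holes: a deep one near x = -1.06, a shallow one near x = 0.93."""
    return x**4 - 2 * x**2 + 0.5 * x


def bowl_slope(x):
    """Slope of the bowl at x."""
    return 4 * x**3 - 4 * x + 0.5


def roll_marble(x, shake, rng, steps=3000, step_size=0.02):
    """Roll one marble downhill; the random shake fades to almost nothing by the last step."""
    for k in range(steps):
        current_shake = shake * (1 - k / steps)
        x = x - step_size * bowl_slope(x) + current_shake * rng.gauss(0, 1)
    return x


def count_in_deep_hole(starts, shake, rng):
    """Count marbles that finish at x < 0, which is the side with the deep hole."""
    return sum(1 for x in starts if roll_marble(x, shake, rng) < 0)


rng = random.Random(0)  # fixed seed so the run is repeatable
starts = [rng.uniform(-2, 2) for _ in range(100)]  # drop 100 marbles at random spots
no_shake = count_in_deep_hole(starts, shake=0.0, rng=rng)
some_shake = count_in_deep_hole(starts, shake=0.15, rng=rng)
print(f"no shake: {no_shake} of 100 in the deep hole")
print(f"some shake: {some_shake} of 100 in the deep hole")
```

Try changing `shake` to a much bigger or much smaller number and see how the counts change.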

In the world of AI, this "shakiness" is referred to as "noise" or "temperature".

Instead of marbles in weird bowl holes, picture the depth of the hole as reducing pollution, saving resources, etc.

In your own words, why does “shakiness” increase accuracy?

You think that’s crazy? It’s about to get nuts.

13.
What in the world does this have to do with ChatGPT?
AI like ChatGPT stores how words relate to each other as a whole bunch of numbers organized in a specific way. Check this out:

In this first example, ChatGPT stores the concept of “Royalty” on the x-axis and “Gender Identity” on the y-axis.

[Figure: Word vector example with royalty and gender axes]

In this second example, ChatGPT stores the concept of “Activity Intensity” on the x-axis and “Temperature” on the y-axis.

[Figure: Word vector example with activity intensity and temperature axes]

[Figure: Word vector example with concept arithmetic]

Here, we can see how we can perform math operations between word concepts in the way ChatGPT thinks. If we go by the coordinate pairs, we can calculate: Monarch = Queen - Woman.
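
To make that arithmetic concrete, here is a tiny sketch. The coordinate pairs are invented for this example (they are not read off the figures, and they are not real ChatGPT numbers): the first number stands for royalty and the second for gender identity.

```python
# Hypothetical (royalty, gender identity) coordinates, invented for this sketch.
words = {
    "queen": (1, 1),
    "king": (1, -1),
    "woman": (0, 1),
    "man": (0, -1),
    "monarch": (1, 0),
}


def subtract(a, b):
    """Subtract two coordinate pairs, one axis at a time."""
    return (a[0] - b[0], a[1] - b[1])


def closest_word(point):
    """Find the stored word whose coordinates are nearest to the given point."""
    def squared_distance(word):
        wx, wy = words[word]
        return (wx - point[0]) ** 2 + (wy - point[1]) ** 2

    return min(words, key=squared_distance)


result = subtract(words["queen"], words["woman"])
print(result, closest_word(result))  # prints: (1, 0) monarch
```

Subtracting "woman" removes the gender part of "queen" and leaves only the royalty part, which lands exactly on "monarch" in this made-up map.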

So ChatGPT does all of this, but instead of pairs of numbers, it is able to perform these operations on lists of MILLIONS of numbers at a time. This is because there are so many more relationships between words outside of “Royalty” and “Gender Identity” or “Activity Intensity” and “Temperature”.

So instead of 2D space (with parabolas) or even 3D space (with paraboloids)...

We are working with functions in MILLION DIMENSIONAL SPACE!!!!

[Figure: Illustration of high-dimensional space]

Remember when we talked about dropping marbles in bowls and shaking a little bit? ChatGPT does this while working in a dimensional space WE CAN’T EVEN COMPREHEND!! Long story short, ChatGPT’s outputs are based on sets of predictions that use information from relationships between words. Accuracy, comprehension, and output from these AI models can be thought of as a more complicated version of finding a local minimum, like the vertex of a parabola.

Reminder on our new terms: stochastic is a fancy word for random, and this randomness is also referred to as “noise” or “temperature”. “Temperature” is the term most often used for language models like ChatGPT.

What do you think happens when we raise or lower the temperature of the output of ChatGPT? Yes, this is shaking a bowl in a million-dimensional space.
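
If you want to explore that question yourself, here is a small sketch to play with. It assumes a common way language models turn scores into chances for the next word (called a softmax) with a temperature knob; the words and scores are made up, and this lesson does not say exactly how ChatGPT applies temperature.

```python
import math


def next_word_chances(scores, temperature):
    """Turn raw scores into chances that add up to 1, using the temperature knob."""
    scaled = [s / temperature for s in scores.values()]
    biggest = max(scaled)  # subtracting the max keeps exp() from overflowing
    weights = [math.exp(s - biggest) for s in scaled]
    total = sum(weights)
    return {word: w / total for word, w in zip(scores, weights)}


# Made-up scores for three possible next words.
scores = {"dog": 2.0, "cat": 1.0, "pizza": 0.0}

for temperature in (0.5, 1.0, 2.0):
    chances = next_word_chances(scores, temperature)
    print(temperature, {word: round(p, 2) for word, p in chances.items()})
```

Run it and compare the three lines of output before you write your answer.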
14.
To conclude, we started by talking about how to calculate the slope of a function, and we ended with the insanity of dropping marbles in strange, million-dimensional bowls to explain how ChatGPT works. What you are learning in class now aligns with the foundation of not only AI, but how the world works in SO many ways. Below, please connect ONE topic you learned in class to ONE new idea you learned today.

P.S.: What we talked about today is exactly why TikTok and Instagram Reels are so addicting and the Oxford Word of the Year is “Rage Bait”. The “marbles at the bottom of the bowl” for social media have to do with watch time, comments, and shares. Outrage and dishonesty are more effective at this than positivity and the full truth.

Thank you so much for listening and participating!!

About the Speaker (Griffin Rutherford)

I am a 24-year-old graduate of Santa Fe Prep, and I earned both a Bachelor and a Master of Science in Computer Science from Colorado School of Mines. People say I have "golden retriever energy", and I love being physically active. I am an avid weightlifter, trail cyclist, snowboarder, runner, and more. I moved back to Santa Fe to work a remote position as an AI Research Engineer. I was a member of the Alpha Tau Omega fraternity as Philanthropy Chair, and I lived in the frat house basement for two years (at the expense of my GPA). I also love to sing, write, and enjoy long conversations with friends and family.

[Photo: Alpine Lake trail selfie]