A Brief(ish) Story of ChatGPT
What is this ChatGPT thing? And why is everyone & their mom talking about it?
It is a language model. But that doesn’t say much. To understand this complex AI model we’ll break it into its basic components.
The Chat part is obvious if you’ve used it. You give it prompts and it responds. Below are a variety of examples:
The GPT part means Generative Pre-Trained Transformers. Pretty self explanatory. Moving on…
NOT, let’s examine one at a time.
Generative: 😺 vs 🐶
Short and sweet: ChatGPT is a model that generates content. More specifically, in Machine Learning (ML) you can divide models into two categories: generative and discriminative. The difference is this: generative models learn the characteristics that make up a group of data. For instance, given pictures of cats and dogs, a generative model will learn what a dog looks like to the extent that you can say "hey, draw a picture of a dog" and it will be able to generate an image of a dog, one that didn't exist before. Discriminative models don't learn the intrinsic characteristics of data, so you can't ask one to draw a dog.
Instead, you can give it an image of a dog or a cat and it will say "this is a dog" or "this is a cat". In other words, it learns to discriminate (differentiate) between cats and dogs.
Thus, by being Generative, ChatGPT is able to generate content because it has learned the characteristics of some data. In this case, that data isn't pictures of cats and dogs, it is… language itself.
Sexy stuff.
We mentioned ML. You’ve heard the term, but how do you eat it? Think of it as a subset of AI. A way to program computers so that they can learn patterns in data. As you give them new data, they keep learning. In the example above, how does a computer learn to differentiate between cats and dogs? Well, you find tons of images of cats and dogs and give them to the model. If you have a label (“cat” or “dog”) for each image the model will do some mathematical computations and learn patterns like “If I see triangular ears and a little triangular nose, then it’s probably a cat!”
In the context of ChatGPT, generative means that it learns the probability distribution of language. What the hell does that even mean? Well, as I wrote the sentence “What the hell does that even “ you probably could guess what the next word would be: “mean”. That’s because you’ve had years of training. Meaning, for years you’ve heard and spoken English and you’ve heard that phrase before many times. For that reason, this sounds off: “What the hell does that even chicken?” If you analyze it, not only does it sound funny, but it doesn’t make sense because the sentence structure calls for a verb to be where chicken is. That’s the probability distribution of language. There is a structure to it, and certain words appear together more often than others. That’s what Generative means for models. They learn that, probably, “mean” is way more likely to appear at the end of my “What the hell…” sentence than chicken.
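To make that concrete, here's a tiny Python sketch of my own (nowhere near what ChatGPT actually does, and the three-sentence "corpus" is made up) that "learns" next-word probabilities simply by counting which word tends to follow which:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus standing in for "years of training".
corpus = [
    "what the hell does that even mean",
    "what does that even mean",
    "does that even mean anything",
    "does that even matter to you",
]

# Count how often each word follows the two words that come before it.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(2, len(words)):
        context = (words[i - 2], words[i - 1])
        next_word_counts[context][words[i]] += 1

def next_word_probabilities(context):
    counts = next_word_counts[tuple(context)]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probabilities(["that", "even"]))
# {'mean': 0.75, 'matter': 0.25}  ->  "mean" beats "chicken" (which gets probability 0)
```

Swap the four toy sentences for a big chunk of the internet, and the counting for a giant neural network, and you have the spirit of what a language model learns.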
Have you ever played that game where you get in a circle and one person says a word, any word? The next person repeats that word and adds one new word to it. So on and so forth, thus socially building a sentence on the go, à la kumbaya. Well, in essence, that's what ChatGPT does, but by itself, without friends 😢
Basically, all ChatGPT does is guess the next word in a sentence, given the words that came before it. It's just really good at guessing what that next word is because it learned language and can understand context. But how does a model learn language? To answer that, let's move on to the Pre-Trained part, or the P in GPT.
Pre-Trained
Like the 😺 vs 🐶 model above, when you ask ChatGPT for something, it doesn't go out to the internet, learn the probability distribution of language from scratch, and then respond to you. Just like it took you years to learn English, models too take time to learn. ChatGPT is Pre-Trained, meaning it has already seen billions of sentences and learned that "Best day ever" is a more likely phrase than "Best day water". Hence, when you prompt ChatGPT, it responds in seconds.
So what does training actually entail? A computer doesn’t recognize “Best day ever” as language. It’s not a human. It can work with numbers though. The first step to train a language model (like ChatGPT) is to convert phrases into a series of numbers. We call these vectors and they look like this: [9, 0, 2, 4, 7]. It’s just a list of numbers, in this case a list of five numbers. They can be of any size. This is a vector: [0], this is also a vector: [0, 1, 2, 3, 4, … , 10000]. A vector (a.k.a. list of numbers) doesn’t do much by itself. But, the difference or similarity between two or more vectors contains knowledge! To exemplify:
- [9, 0, 2, 4, 7] and [9, 0, 2, 4, 0]. Are these two lists similar? Very much so! At any given position, all numbers are the same except for the last ones (7 and 0).
  [9, 0, 2, 4, 7]
  [9, 0, 2, 4, 0]
- [9, 0, 2, 4, 7] and [0, 3, 3, 7, 2]. How about these? Very different.
- Here’s a challenge: Is the list [1, 7] more similar to [3, 6] or to [4, 5] (remember, vectors can be of any size, here they’re size 2)? Hard to tell. But maybe if we plot them we can see. The plot below is just a 2D grid of numbers from 0 to 10, like coordinates. It’s easy to see that coordinates [1, 7] are closer to [3, 6] than to [4, 5]. We can say that red is more similar to blue than it is to green.
Vectors of size 2 (or two-dimensional) can be plotted in 2D grids. But we can have a vector of just one dimension. For instance, vectors [3] and [7]:
One-dimensional vectors are just points in a line. Two-dimensional vectors are points in a 2D grid. Three-dimensional vectors are points in a 3D space:
Then we can have a 4D vector like [5, 2, 0, 9] but that’s really hard for humans to imagine because our world is 3D. Go ahead, take your time and imagine a 4D graph.
…
Welcome back after failing to imagine a 4D graph. It doesn't matter that we can't picture those 4+ dimensions. What matters is that the same rules that apply to 1D, 2D, and 3D vectors apply to vectors of any size. Why is this relevant? Because the vectors that ChatGPT uses have approximately 1000 dimensions! How do you even plot that? Like I said, you can't, but just like we knew that the 5-dimensional vectors [9, 0, 2, 4, 7] and [9, 0, 2, 4, 0] are alike, we can similarly compare how close or different two 1000-dimensional vectors are. You just apply some mathematical operations, like addition and subtraction, to measure distances. Computers are really good at that.
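If you're curious what "measuring distances" looks like in practice, here's a minimal Python sketch using plain Euclidean distance (one of several possible measures; real models tend to use fancier ones). The same function works for 2, 5, or 1000 dimensions:

```python
import math

def euclidean_distance(v1, v2):
    """Distance between two vectors of the same size (any number of dimensions)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# The 5-dimensional vectors from above: very similar (only the last value differs).
print(euclidean_distance([9, 0, 2, 4, 7], [9, 0, 2, 4, 0]))   # 7.0
# A very different pair.
print(euclidean_distance([9, 0, 2, 4, 7], [0, 3, 3, 7, 2]))   # ~11.2
# The 2D challenge: [1, 7] really is closer to [3, 6] than to [4, 5].
print(euclidean_distance([1, 7], [3, 6]))   # ~2.24
print(euclidean_distance([1, 7], [4, 5]))   # ~3.61
```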
Ok we get it, you can have vectors of any size and calculate the degree of closeness (aka similarity) between them… Why does that matter? It matters because ChatGPT represents a sentence like “Dude, where is my car?” as a list of like 1000 numbers. Big vectors. And how crazy is it that we can convert a sentence into just a list of numbers? Crazy I tell you! More importantly, the model will learn to represent sentences that are similar as vectors that are closer than sentences that are dissimilar. And in doing so it captures knowledge. Let’s look at an example. Note: Because I don’t want to be typing 1000 numbers throughout this example, and you don’t want to be reading them either, for simplicity I’m going to represent sentences as 2D vectors such as [8, 2]. Technically, that’s allowed, but there is very little information you can capture in just two numbers. There’s much more space for information storage in 1000 numbers.
Example: Of Monsters 🤖 and Men 👨
This section is a little more technical, but not too much. If you’re patient, it will help you truly understand ChatGPT.
Say that we have four sentences:
- Monsters are horrific and evil
- Humans can be good creatures
- Dogs are loyal and happy
- Children are a joyful and playful bunch
If I say, “ok computer, we’re going to represent the above sentences as the following vectors”:
- [7, 5]
- [9, 8]
- [9, 5]
- [3, 7]
Are those good representations? If your answer is "Good for what?" then you're on to something. Let me introduce a new concept: goals. Just like we set New Year's goals such as exercising more, complaining less, or learning that TikTok dance (I'm not the only one, right?), machines also need goals. In ML (machine learning, that is) we train computer models to perform a task. Basically, to reach a goal. For ChatGPT, the goal was to learn language. More specifically, it was tasked with responding to prompts, and doing so while understanding context (it's different to say "ChatGPT, tell me a joke" than to say "ChatGPT, write code in Python"). And to do all this while using human-like English.
That’s a lofty goal and we’ll get there later, but for this example our goal will be to simply categorize a sentence as positive 🍏 or negative 🔴. Now, if I ask “are those four vectors above good representations?”, can we answer that question? Now that we know what the goal is, we might. Let’s analyze the vectors (each circle in the image is a vector, numbered by the order above, and colored as follows: green = positive, purple = neutral, red = negative).
If the goal is to differentiate between positive and negative sentences, this is a bit of a mess. Sentence 3 is closer to 1 than it is to 4, which doesn’t make much sense because 4 and 3 have a positive sentiment (we can tell by the words loyal, happy, joyful, playful), but sentence 1 not so much (horrific, evil).
The following would be more useful representations because the cute sentences are closer to each other, and the ugly sentence is far from them. We want vectors that end up being plotted like that.
When we say we train a model, this is what we mean: in the beginning, the model randomly guesses vector representations for each sentence because it doesn’t know better. Then it goes through a series of iterations where it checks whether it is doing it right or not. How does it know if it’s getting things right? In this case we would need to give a little test to the model, something that says “you got it right — gold star”. Like a teacher grading a test, the model sort of grades itself. We give it a cheatsheet and say “study until you get most of these right”.
The very first step would be to label the data ourselves (i.e. create a cheatsheet for the model), like this:
We (a human) manually assign labels to sentences: negative, neutral, or positive. The model will then tweak the vector representations of each sentence so that the positive ones are closer together, neutral ones are a little farther away, and negative ones are furthest away. Remember that computers don't know what words mean, so we must convert labels to numbers too. A useful convention is to denote negative as -1, neutral as 0, and positive as 1. Why we use those three numbers isn't necessarily obvious, but we have learned that they are good representations after years of creating and testing models. And it kinda makes sense because -1 (negative) is farther away from 1 (positive) than it is from 0 (neutral). In other words, negative is more like neutral than it is like positive.
Now, how does a model find the best vectors to represent a sentence? Remember that Calculus thing you learned back in college and thought "this is BS"? Well, models use math from that Calculus thing and go through the following five steps:
- Come up with some random vectors — It’s a random numbers party. Doesn’t matter what they are, the point here is just to get started. Example:
a. [0, 0]
b. [9, 0]
c. [0, 5]
d. [1, 3]
- Pass vectors through a series of transformations — These can range from a simple addition to a complex equation. As an example, let's pass our last sentence vector [1, 3] through three transformations. Note: I'm choosing three random transformations to illustrate a point; the process of choosing good transformations is more technical and scientific. In ChatGPT, the main transformations are called, you guessed it, Transformers! The T in ChatGPT! (stay tuned for the next section). Ok, here we go:
a. First transformation: multiply by 3
i. 1 x 3 = 3
ii. 3 x 3 = 9
iii. So we end up with [3, 9]
b. Second transformation: subtract 1 → [2, 8]
c. Third transformation: divide by 10 → [0.2, 0.8]
- Apply a Reduction Transformation — Since our vectors are 2-dimensional but our labels are 1-dimensional, we need a way to compare apples to apples. If I ask "is the vector [0.2, 0.8] closer to -1 or to 1?", that's hard to tell because -1 is a 1D vector (a point on a line) and [0.2, 0.8] is a 2D vector (a point on a grid). So we convert the 2D vector into a 1D one. There are many ways to do this; one is to simply take the average of [0.2, 0.8], so let's do that:
a. Average = (0.2 + 0.8) / 2 = 0.5
- Compare — We know that sentence #4 is a positive sentence because a human labeled it as such. So we know it must be closer to 1 than to -1. Plotting 0.5 on a line that goes from -1 to 1 (see below), we can easily see that hey! It's not that bad!
-1 |----------------o----| 1
- Make Changes and Repeat Steps 1 through 4 — It's not that bad, but it could be better. The model uses some of that Calculus magic (tools like backprop) to make modifications to the vector in the right direction. That is, modify it so that when we go through the transformations in steps 2 and 3, the end result is closer to our goal, which is 1. Example:
a. Step one — Instead of [1, 3], make it something like [2, 3] because Calculus says so
b. Step two — multiply, subtract, divide to end up with [0.5, 0.8]
c. Step three — take average to get 0.65
d. Step four — What do you know?! 0.65 is even closer to 1 (which is what we want). Representing “Children are a joyful and playful bunch” with the vector [2, 3] is a good representation!
Now if we do this for the other three sentences and plot the vectors we would see how positive sentences cluster together, far away from neutral and furthest away from negative. The model has learned 💪 But it isn’t ready to rebel against humans quite yet.
By the way, do you know what those five steps above are? An algorithm. It’s just a series of ordered steps used together to accomplish a goal. Feel free to throw that fancy algorithm word around from now on.
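For the code-curious, here's that five-step algorithm as a toy Python loop. Everything here is made up for illustration: the transformations are the multiply/subtract/divide from above, and instead of proper backprop I use a cheap numerical nudge. The flow, though, is the same:

```python
import random

LABEL = 1  # "Children are a joyful and playful bunch" was labeled positive (+1)

def transform(vector):
    """Steps 2 and 3: a few toy transformations, then reduce 2D -> 1D by averaging."""
    transformed = [((x * 3) - 1) / 10 for x in vector]   # multiply by 3, subtract 1, divide by 10
    return sum(transformed) / len(transformed)           # reduction: take the average

def error(vector):
    """Step 4: how far is the reduced value from the label we want (1)?"""
    return (transform(vector) - LABEL) ** 2

# Step 1: start with a random vector.
vector = [random.uniform(0, 5), random.uniform(0, 5)]

# Step 5: make small changes and repeat. Real models use Calculus (backprop) to know
# which way to nudge; here we cheat with a tiny numerical estimate of the slope.
for _ in range(200):
    for i in range(len(vector)):
        bumped = vector.copy()
        bumped[i] += 0.01
        slope = (error(bumped) - error(vector)) / 0.01
        vector[i] -= 0.5 * slope   # nudge in the direction that reduces the error

print(vector, transform(vector))   # the reduced value ends up very close to 1
```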
More importantly, what you just learned is the basis for how most Machine Learning models work. In particular, it's how ChatGPT learned. It converted human text to vectors, transformed those vectors, tested how good those vectors were at performing a task, and then made adjustments. Now when you ask it something, ChatGPT converts your text to vectors and runs them through its transformations to predict what the best response would be.
What we have then is a computer that spent hours looking at sentences (it Pre-Trained) to learn. But ChatGPT can do more than answer things like “is this a positive or negative sentence?”. It learned to generate language because it learned the underlying structure of the data (it is Generative). The data, in this case, was a big chunk of the internet. Billions of sentences. ChatGPT learned the structure of language because that’s what it was tasked to learn. That was its goal.
How does that look in practice? Now things get interesting…
Learning to Speak
Some years ago, researchers discovered that computers could become particularly good at two games. The first is called “Guess the next word!”. The second is “Guess the next sentence!”. They don’t call them games though, but tasks instead. Let’s see how you do.
Task 1: Guess the MASKED Words
Can you guess the hidden word behind [MASK] in each sentence below?
- These enchiladas are too [MASK].
- [MASK] has the largest number of Tour de France titles.
- President Trump is the [MASK] president the U.S. has had in 30 years.
From the whole vocabulary of English words (well, technically speaking ChatGPT does not guess words but tokens, but think of tokens as words) ChatGPT would then choose the best fitting MASKED word.
What are these “right” words? They are, of course:
- spicy
- France
- worst
Or is it:
- expensive
- England
- best
🤔 Hard to tell… “spicy” and “expensive” are equally likely without more context. France is the right answer in this case, but if ChatGPT never saw a sentence that connected France to the most Tour de France titles it could easily guess it wrong. And whether it is “worst” or “best” on the last sentence is a rather subjective (and controversial) topic.
During training, ChatGPT would be given sentences like the one above, and have to guess which word was hidden behind [MASK]. Then it would check to see if it got it right. If it didn’t, it was “penalized” and had to go back and make adjustments so that next time it would get it right. Just like the vector modifications we learned above for negative and positive sentences.
Large Language Models (like ChatGPT) learn probability distributions of language from looking at lots of sentences and actively guessing hidden words. For instance, ChatGPT would “think”: I have seen spicy and enchiladas as part of the same sentence many times before… chances are high that they go together. For the enchiladas sentence however, there are several options that make sense (see image below). What the model (ChatGPT) does is to read a sentence and when it sees MASK it goes through the English vocabulary and scores each word depending on its level of fit. The more likely the model thinks a word to be the hidden word, the higher the score. How did it know to give spicy a better score?
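Before we answer that, here's a toy Python sketch of the scoring step itself. The word vectors, the MASK vector, and the four-word "vocabulary" are all numbers I made up (the real model scores tens of thousands of tokens), but the idea of scoring every candidate and turning scores into probabilities is the same:

```python
import numpy as np

# Toy 4-dimensional word vectors (made-up numbers, just for illustration).
vocabulary = {
    "spicy":     np.array([0.9, 0.1, 0.8, 0.2]),
    "expensive": np.array([0.7, 0.3, 0.6, 0.1]),
    "chicken":   np.array([0.1, 0.9, 0.0, 0.7]),
    "happy":     np.array([0.2, 0.5, 0.1, 0.9]),
}

# Pretend this is the vector the model computed for the [MASK] position in
# "These enchiladas are too [MASK]."
mask_vector = np.array([0.8, 0.2, 0.7, 0.2])

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

words = list(vocabulary)
# Score each vocabulary word by how well it matches the MASK vector (dot product),
# then turn the scores into probabilities.
scores = np.array([vocabulary[w] @ mask_vector for w in words])
probabilities = softmax(scores)

for word, p in sorted(zip(words, probabilities), key=lambda x: -x[1]):
    print(f"{word:10s} {p:.2f}")
# "spicy" and "expensive" get the highest probabilities; "chicken" gets a low one.
```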
That "how did it know" is the magic question. How does it guess words so well that it seems like a human wrote the damn thing?! The answer is context, and it's the topic of the next section (Task 2) and the following one (Transformers).
Task 2: Guess the Next Sentence!
Many important tasks ChatGPT gets asked to do, such as Question Answering, are based on understanding the relationship between two sentences, which is not directly captured by language modeling (that is, by learning language via guessing the next word).
Therefore, ChatGPT was not only trained to predict the next word. It was also trained to predict the next sentence. It’s a very similar mechanism to the one for masked words.
- I’m not really good with Mexican food. [MASKED SENTENCE].
The model is tasked with choosing the right sentence from a bunch of choices. If it gets it wrong it’s penalized so that it has to make adjustments to its sentence vectors, just like it has to when it guesses a word wrong. This is how it is able to learn that “These enchiladas are too spicy” is a likely sentence after “I’m not really good with Mexican food”. Because from context we know that Mexican food is more well known for being spicy than for being expensive, or big, for instance.
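Here's a toy sketch of what "choosing the right next sentence" could look like: score each candidate sentence by how similar its vector is to the conversation so far. The sentence vectors below are invented for illustration (the real model computes them itself with its transformer layers), and cosine similarity is just one possible measure:

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up vector for the context: "I'm not really good with Mexican food."
context_vector = np.array([0.9, 0.2, 0.7])

candidates = {
    "These enchiladas are too spicy.": np.array([0.8, 0.3, 0.6]),
    "These enchiladas are too big.":   np.array([0.3, 0.9, 0.2]),
    "Orange is the new black.":        np.array([0.1, 0.9, 0.1]),
}

# Rank candidate next sentences by how well they "fit" the context.
ranked = sorted(candidates.items(),
                key=lambda item: cosine_similarity(context_vector, item[1]),
                reverse=True)
for sentence, vector in ranked:
    print(f"{cosine_similarity(context_vector, vector):.2f}  {sentence}")
# The spicy sentence comes out on top as the most likely continuation.
```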
We know what the model goals are (to predict the next word and sentence) and how models learn (by playing the game with millions of sentences and finding probability distributions of words, modifying its vectors when it doesn’t guess things right). It is now time to explain where the magic comes from 🪄
Hiding words and sentences and training language models to predict the hidden parts are techniques we've used for several years. But we haven't had something as powerful as ChatGPT until recently. What, then, made ChatGPT the magical creature that it is? There are three main sources of magic:
- Transformers — context machines
- Size — bigger models
- Fine-tuning — learning to understand instructions
First, let’s talk Transformers! The T in ChatGPT.
Magic 1: Transformers
Look at the following sentences
- That haunted house, there is no way you can go inside and come out [MASK].
- That smelly house, there is no way you can go inside and come out [MASK].
They are the same, except for one word. That word (haunted vs smelly) changes the meaning of the sentence. So much so that you would probably guess completely different words behind MASK. You, human, know that. But how does ChatGPT know that it is this word in particular that is of importance? How does it know to focus there and not so much on, say, "inside"? After all, "inside" is closer to MASK than haunted/smelly is.
Similarly, how does a model know that “python” here means two very different things? The first sentence refers to python snakes, the second to the Python programming language.
- Pythons are mostly found in the tropics and subtropics.
- Python tends to be beginner friendly because of how readable it is.
We need some history for this one… Back in 2017 (that's about two pandemics ago), Google Brain published a paper called "Attention Is All You Need". The paper showed that some specific vector transformations (remember transformations from the "Of Monsters and Men" example) were able to teach the model which parts of a sentence were the most important, so to speak.
You know how we used vectors to represent a whole sentence? Well, you can represent words as vectors too. Let's recap: a vector is simply a list of numbers. There isn't much value in any one of them by itself; it's the relationships between them (similarity, or closeness) that have value. Because if two word vectors are similar (e.g. [0, 9] and [0, 8] are alike, but [1, 5] and [8, 0] not so much), then the words they represent are similar. In the image below, the vectors for "apple" and "pear" are close together and away from "sandwich" and "burger" because, although they're all food, fruits are more similar amongst themselves than they are to a sandwich.
The details get technical, but in simple terms, what researchers at Google did was to create a data transformation that would take word vectors and find which words commanded the attention of the sentence (aka which words have the greatest influence in the rest of the words in that sentence). They named that process: Self Attention.
Why is that useful? I'll answer that question with a question… If you read the sentence "The animal didn't cross the street because it…", what is "it" referring to? The animal, or the street? Reading the rest of the sentence helps clarify: if it ends with "…because it was too wide", then "it" is most likely talking about the street. What Self Attention allows us to do is to mathematically determine which word "it" is most connected to. In this case (image below) we can see that it correctly determines a stronger connection to "street".
Cool, but how does a model determine that? Turns out if you multiply word vectors in a sentence against each other and use the outputs to guess the MASKED words, you end up getting representations of words and sentences that “know” which words are most important. Just like in the negative-vs-positive example above (where the model learned to tweak the vectors so that positive sentences became more similar to other positive sentences) here the model learns how to represent word vectors to find the most important words in each sentence, and use that context to best guess the masked word.
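Here's a bare-bones Python sketch of self-attention: every word vector is multiplied against every other word vector, the results become weights, and each word gets a new vector that mixes in the words it pays attention to. The word vectors are made-up numbers, and real Transformers add learned query/key/value projections and multiple attention heads, but this is the core trick:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Made-up 4-dimensional word vectors for "this restaurant was not too terrible".
words = ["this", "restaurant", "was", "not", "too", "terrible"]
X = np.array([
    [0.1, 0.0, 0.2, 0.1],   # this
    [0.9, 0.4, 0.1, 0.8],   # restaurant
    [0.1, 0.1, 0.0, 0.2],   # was
    [0.2, 0.9, 0.7, 0.1],   # not
    [0.3, 0.2, 0.1, 0.1],   # too
    [0.8, 0.9, 0.6, 0.9],   # terrible
])

# Simplest possible self-attention: dot every word vector with every other one,
# turn the scores into weights, and build each word's new vector as a weighted
# mix of all the word vectors in the sentence.
attention_weights = softmax(X @ X.T / np.sqrt(X.shape[1]))
attention_aware = attention_weights @ X

# How much does "terrible" pay attention to each word in the sentence?
for word, weight in zip(words, attention_weights[words.index("terrible")]):
    print(f"{word:10s} {weight:.2f}")
```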
Let’s illustrate. The image below shows what happens inside ChatGPT. To the left of the vertical line is the architecture, and to the right is an example with values. We start from the bottom where it says “Input”. To train ChatGPT they gave it billions of sentences from text found on the internet. Sentences like “this restaurant was not too terrible”. But to force the model to learn, they covered 15% of words. So, in this example, instead of “terrible” we cover it with [MASK] so that the model has to learn how to correctly guess “terrible”.
The first step is to convert words to vectors. Here we call them Embeddings, but they are the same thing: a list of numbers. We can start with random numbers like in the "Of Monsters and Men" example. It is the model's job to figure out what those embeddings (aka vectors) should be, which are shown in green. We then pass those embeddings through a transformation layer. We will explain this further in a bit, but for now think of it as code that tells the model "multiply word vectors against each other". The end results are the Transformed Embeddings we see in yellow.

Our point of interest during training is the masked Transformed Embedding (TEmask). The model takes that embedding and applies some transformations to it, then compares the end result with a dictionary of embeddings. That dictionary contains a vector for each word in the English vocabulary, so that we can ask "this TEmask embedding, which word vector is it closest to?" As an example, say that the model computes the transformed vector for MASK to be [2, 4, 0, 5, 1, 8] (yellow box on the right-hand side). It takes that embedding, compares it to the English vocabulary vectors (blue box), and scores how similar each vocabulary word is to that vector. In this case, it finds that the vector for terrible, which is [2, 4, 4, 5, 1, 8], has a "similarity" score of 0.64. The model outputs the content in red, which is a list of words and their "similarity" scores. The higher the score, the more likely that word is to be the word behind MASK.
The model has just said "hey, this masked word here, I think it's terrible". The model then gets to see what is actually hidden behind MASK. In this case it confirms that it correctly guessed "terrible". Well done ChatGPT 👏! But what happens if it had instead scored "loud" higher than "terrible", for instance? If the model gets it wrong, it has to go back (using the Calculus thing we discussed in the "Of Monsters and Men" example) and change the transformations (remember, these are just operations like addition, multiplication, etc.) so that next time the results are closer to the right word. The model iterates through this process for a while until it gets good enough at guessing.
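In code, that "penalize it when it's wrong" step often looks something like this. I'm treating the similarity scores as probabilities for simplicity, and the numbers are the made-up ones from the example; the log-based penalty is a common choice (the cross-entropy idea), not necessarily the exact one ChatGPT uses:

```python
import numpy as np

# Scores the model produced for the [MASK] in "this restaurant was not too [MASK]"
# (made-up numbers, mirroring the red box in the figure).
probabilities = {"terrible": 0.64, "loud": 0.20, "pretty": 0.10, "banana": 0.06}
true_word = "terrible"

# The penalty (loss) is small when the true word got a high probability,
# and large when it got a low one.
loss = -np.log(probabilities[true_word])
print(f"loss: {loss:.2f}")       # ~0.45: pretty good, so only small adjustments are needed

# If the model had bet on "loud" and given "terrible" only a 0.06 probability...
loss_bad = -np.log(0.06)
print(f"loss: {loss_bad:.2f}")   # ~2.81: a bigger penalty, so bigger adjustments via backprop
```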
One last thing, the model is trained to guess words bidirectionally. What’s that? It means moving forward and backward. In the sentence “The baby will be [MASK] the following week.” if we move forward it will use the context provided by “The baby will be” to guess the masked word. If we move backward, it will use the context from “the following week.” to guess the masked word. By using both we’re making the model more robust because, as you know, what comes after the masked word is as helpful as what comes before.
- Forward: The → baby → will → be → [MASK] → the → following → week → .
- Backward: . → week → following → the → [MASK] → be → will → baby → The
The image below shows how we do this everyone-against-everyone multiplication of word vectors to get the attention-aware vectors. The model basically multiplies every word vector in "this restaurant was not too terrible" against every other. To get the attention-aware vector for "terrible" (y-terrible) it multiplies the "terrible" word vector with "this", "restaurant", "was", and so on. The image highlights the moment when the "terrible" word vector is being multiplied by the "not" word vector.
The model then combines all the green vectors (similarly to how we converted 2D vectors to 1D by taking their average). One way to do that, for instance, could be to just sum their values:
- Vector 1 = [1, 2, 3]
- Vector 2 = [9, 3, 5]
- Vector 3 = [0, 8, 8]
- Combination of all three vectors = V1 + V2 + V3 = [1, 2, 3] + [9, 3, 5] + [0, 8, 8] = [10, 13, 16]
That’s it. That’s our sentence vector. Like the [2, 3] vector we used to represent “Children are a joyful and playful bunch”, in this case [10, 13, 16] represents the sentence “this restaurant was not too terrible”. The powerful things here is that the mechanism used to obtain this vector basically measures which word has the highest impact in the sentence. In other words, by forcing the word vectors to look at itself and each other (that’s self-attention), you get a final consolidated vector that “knows” what the important words are.
To drive that point home, let’s say we use embeddings that are the same size as the sentence, so that each word corresponds to a value (a number) in the embedding, like in the image below. Using transformers with self-attention (green arrows) we would get a vector that captures what the most important parts of the sentence are. In this case, it would give higher values to “restaurant” and “terrible” (7 and 9, respectively), while giving lower values to “this” and “was” (1 and 2, respectively). Without self-attention (red arrows) we would get vectors that don’t really know which words are important and which aren’t.
Alright, enough Transformers talk. The key insight here is that training the model with a middle step that lets the sentence "look" at itself allows it to learn which words carry more value, and therefore to learn context, so that it can intelligently predict what word to add next depending on whether it saw the word "haunted" or "smelly", for example.
Now that we know how the model learns context, it’s time to learn more magic. Why is it that humans are smarter than birds? Without getting too technical, the answer is mostly: size. Not physical size, but brain-power size. So let’s learn about machine “brain” power.
Magic 2: Size
Bigger models are “smarter” (usually). ChatGPT is only the latest of similar models, and this is how they’ve grown over the years:
But what does bigger mean? What are these parameters you’re talking about, sir?
We’ve talked about transformations, but they probably seem like black boxes to you. Fair enough, let’s demystify. Look at the purple boxes in the image below.
Inside those boxes we find a series of operations (aka transformations). If you’ve heard the term Neural Networks, well that’s what’s inside. Note: They are not exactly like what I will show below, and ChatGPT uses a system of Encoder-Decoders as the wrapper for those networks, but the concepts are all the same.
All you need to know is that a neural network is a series of operations (multiplications, additions, etc.). Here there are three layers of nodes (blue circles, also called neurons because researchers based this design on how brains work — a network of neurons). Each circle in a layer is connected to all circles in the following layer. It's everyone against everyone here. Free-for-all. Choose-your-battle. It's a party and everyone's invited. In the image below we start with a sentence, "penguins are some elegant birds". We convert its words into vector form as shown (it can be at random, so we get penguin = [1, 4, 0, 9]), then we pass them through the neural network.
The image below shows what “passing through the network” means. I removed all connections to only show those from the top node at each layer for explanation purposes. I also colored them to help explain. Let’s start with the “penguin” vector: [1, 4, 0, 9]. That will be the value of the orange node. We take that vector and multiply it by the numbers (weights) shown next to the lines coming from the orange node connections. Here’s how the penguin vector gets transformed:
- Orange node = [1, 4, 0, 9]
- Purple node = orange node x purple weight = [1, 4, 0, 9] x 0.4 = [0.4, 1.6, 0, 3.6]
- Green node = purple node x green weight = [0.4, 1.6, 0, 3.6] x 5 = [2, 8, 0, 18]
At the end the network spits out a vector that results from combining values in its last layer (the green circle and its buddies underneath it). Don’t worry about the actual math. The important thing here is that data passes through the network and is being transformed (in this case via a multiplication) by the weights. Those weights, those are the parameters. So when we say that ChatGPT is more magical than its predecessors because it’s bigger, that’s what we mean: it has more parameters, meaning, there are more values by which we multiply the data as it passes through. The more parameters (neurons) in a network, the more knowledge it can store. That’s why we humans are smarter than them birds.
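Here's a tiny forward pass in Python so you can see the parameters with your own eyes. The sizes are made up (4 inputs, 3 hidden neurons, 2 outputs), and real networks also have biases, attention blocks, and many more layers, but "parameters = the weights the data gets multiplied by" is exactly this:

```python
import numpy as np

rng = np.random.default_rng(0)

# "penguin" as a 4-dimensional input vector (made-up numbers).
penguin = np.array([1.0, 4.0, 0.0, 9.0])

# A tiny network: 4 inputs -> 3 hidden neurons -> 2 outputs.
# The weights are the parameters: the numbers the data gets multiplied by.
W1 = rng.normal(size=(4, 3))   # 12 parameters
W2 = rng.normal(size=(3, 2))   # 6 parameters

def forward(x):
    hidden = np.maximum(0, x @ W1)   # multiply by the first layer's weights (plus a ReLU)
    return hidden @ W2               # multiply by the second layer's weights

print(forward(penguin))
print("number of parameters:", W1.size + W2.size)   # 18

# ChatGPT-scale models do the same thing, except with billions of these weights
# instead of 18, and many more layers.
```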
We can add operations (a.k.a. make the model bigger) in three main ways:
- by making the networks deeper (grow it horizontally)
- by making it wider (grow it vertically)
- by adding more of those purple boxes in the first table in this section
Conclusion: want to make your Language Models smarter? Make them bigger (that's why they're now called Large Language Models: we've made them huge).
There is one caveat to this whole "bigger is better" paradigm. Recent research has shown that you can achieve the same level of performance (aka magic) by increasing not the model size but the amount of data it is trained on. For instance, ChatGPT was trained on ~10 billion sentences; this research suggests that OpenAI (who created ChatGPT) could improve the model by keeping it the same size (same number of parameters) and training it on, say, 100 billion sentences instead.
In any case, GPT-4 is in the works and word on the AI street is that it will be made up of 100 trillion parameters. Not sure if they will also increase the amount of training data, but they should.
Moving on to the final technical section of this writeup that started as an outline and now is pretending to be a book…. Magic #3: Fine-Tuning.
Magic 3: Fine-Tuning
ChatGPT was trained on so much internet text, it saw so many things, that it now has PTSD. Just kidding, it saw so many things that you can ask it pretty much anything and it will at least respond (maybe inaccurately, but respond). It saw:
- Poems
- Math problems
- Code
- Books
- Conversations
- Technical papers
- Jokes
And so much more.
That’s great because it knows so much. But it creates a problem: when you ask ChatGPT to tell you a joke, how does it know to not respond on the form of a poem, or to not give you code? What you’re asking is a subset of the what the model can do, how does it know how to respond? After all, ChatGPT only learned to predict the next word and to connect sentences coherently.
The answer, surprise surprise, is Fine-Tuning! Yayy! 🥳 Now, what the hell is that?!
When a kid is learning how to speak, it's not enough that they know how to put together sentences in a coherent way. If your 6-year-old started cussing like a sailor, you would probably intervene and say "we don't talk like that". And via this social context, over the years we learn to talk and behave differently when in a classroom, when at a bar, when at a funeral, or when talking to our best friend. That is fine-tuning.
Here, ChatGPT is the 6-year-old, and the researchers who built it are like the teachers, friends, or anyone who would say “hey don’t talk like that!” or even “hey that was funny. More of that!”. It’s the good old conditioning game of reward and punishment. For ChatGPT, two things were done:
- Supervised Fine-Tuning
- Reinforcement Learning from Human Feedback
Supervised Fine-Tuning
The intent of ChatGPT is to be, well, a chatbot. Remember goals? If we want a model to be good at something, we need to train it to do that, then say “bad robot” when it doesn’t do it right. That’s the supervised part: a human telling it “you did it right/wrong” (like adult supervision). The first step to fine-tune the model was to teach it to be a chatbot. How?
First, two humans had a conversation, one of them pretending to be just a regular human, and the other pretending to be the ideal chatbot.
Remember when we introduced vectors and wanted to classify sentences as positive, neutral or negative; and we required some human labeling to check whether the model got it right?
This is the same principle. We feed these conversations to the model and ask it to predict the next response. We know what that next response is, of course, so once the model predicts something, it can take a peek at the actual response and use that Calculus magic to go back and modify its parameters (that is, the series of operations it uses to multiply and add sentence vectors) so that next time it gets it right (or closer to right). It’s the same principles we’ve covered before, the only thing that changes is what data goes through the transformations. In this case the data is vectors representing the history of conversation, and the label (analogous to the positive, neutral, negative we saw before) is a vector representing the next response.
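As a (drastically simplified) sketch of that fine-tuning loop: pretend each conversation history and each ideal response has already been turned into a vector, the "model" is a single weight matrix, and the penalty is just the squared difference between the predicted and the actual response vectors. The real thing predicts words with a giant transformer and a different loss, but the peek-compare-adjust rhythm is the same:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend we already turned 100 (conversation history, ideal response) pairs into
# 8-dimensional vectors. The numbers are random placeholders, not real data.
conversations = rng.normal(size=(100, 8))
ideal_responses = conversations @ (rng.normal(size=(8, 8)) * 0.5)

W = rng.normal(size=(8, 8))   # the "model": one weight matrix (wildly simplified)
learning_rate = 0.05

for step in range(500):
    predicted = conversations @ W              # the model predicts a response vector
    error = predicted - ideal_responses        # peek at the human-written response
    loss = (error ** 2).mean()
    gradient = conversations.T @ error / len(conversations)
    W -= learning_rate * gradient              # the Calculus magic: adjust the parameters

print(f"final loss: {loss:.6f}")   # close to 0: the model now mimics the "ideal chatbot"
```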
Fantastic, the model can now spit out language and behave like a smart chatbot. But a problem was found 😨. You see, the model learned by reading reasonably coherent conversations between two humans. But say you have a trained model and you start having the following conversation with it:
- Human: “Hey model, why does the government ask me to file taxes if it already knows how much I owe?”
- Bot: “Hi human, it does it for the lols”
- Human: “That sucks. Hey can you help me with my taxes?”
- Bot: “Orange is the new black”.
The first bot answer is kind of strange but reasonable. The second one, though, is completely off (or maybe it's making a dark joke about you going to jail). ChatGPT reviews the whole history of a conversation (again, it's a terrific context machine!) to answer intelligently. So if at one point it makes a mistake (because it's imperfect), each subsequent answer will probably be increasingly off, because each time it's building on those previous wrong answers 😟
We don’t want that, nah ah! So we bring some fancy Machine Learning technique. Meet: Reinforcement Training (I swear we’re almost done).
Reinforcement Learning from Human Feedback
Have you seen videos of those robots that make you think “ok maybe I do need to be nicer to my toaster”? The way they train those robots is through Reinforcement Learning. In simple terms, it consists of coding a reward-punishment system so that the robot seeks rewards and in doing so, learns. For instance, if the robot falls, it learns to avoid stepping in that particular way, or learns to balance itself better.
The same principle was used as the last step in building ChatGPT. It consists of two main steps:
- Collect comparison data and train a reward model — Researchers would have a conversation with the model, giving it prompts like “Explain the moon landing to a 6-year-old”. Then for any given response, they would sample alternative responses. Frankly, I’m a little unclear as to how they got these alternative responses, and it’s not too important, but my intuition is that they asked the model the same prompt several times (these models are stochastic, meaning, there is some randomness to them so if you give ChatGPT the same prompt back to back, it will give you different responses). So, for each prompt they have 4 possible answers (A, B, C, D). A human then ranks these based on their preference (D > C > A = B). Then they used these rankings to create a reward model. Just like the dog-looking robot above was told “falling is bad, avoid falling”, here they make a reward model that says “answer D for this type of prompt is better than answer C”.
- Optimize a policy against the reward model — the model must then internally learn a policy system (PPO in the image below) that approximates that reward model (RM, which we developed in step 1) as closely as possible. For instance, the reward policy might be that when asked to explain things to a 6-year-old it should use some light humor. Strange policy, but let's run with it for a second. The model doesn't know that policy. It needs to learn it, instead of us just giving it away, because that process of learning, that is the magic. Just like you as an adult know that playing with fire burns, but then you see dumb kid Jimmy about to burn his little finger and you say nothing and think "let dumb Jimmy learn the lesson". And sure enough he burns his finger but learns not to do it again 🔥 (follow me on Twitter for more parenting advice). Well, same thing: we can't just take the RM and hand it to the model. Instead, we let it learn by giving it prompts, asking it to choose the right answer, and then allowing the model (kid) to ask the RM (life) "hey, did I get it right?". If it did, reward. If it didn't, get burned. Tough love, but it works.
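To make the reward model idea a bit more concrete, here's a small Python sketch of a ranking loss in the spirit of the one used to train reward models. The scores and the D > C > A = B ranking are invented for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Made-up reward-model scores for four answers (A, B, C, D) to the same prompt.
scores = {"A": 0.2, "B": 0.1, "C": 0.9, "D": 1.5}

# The human ranking was D > C > A = B. Turn it into (preferred, rejected) pairs.
preference_pairs = [("D", "C"), ("D", "A"), ("D", "B"), ("C", "A"), ("C", "B")]

# The loss is small when the preferred answer scores higher than the rejected one.
loss = -np.mean([
    np.log(sigmoid(scores[preferred] - scores[rejected]))
    for preferred, rejected in preference_pairs
])
print(f"ranking loss: {loss:.3f}")
# Training nudges the reward model's parameters to make this loss smaller,
# i.e. to make its scores agree with the human rankings.
```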
Let’s review:
- They trained a computer to understand and generate language by asking it to guess what the next word would be
- They made it a context-understanding machine by pushing it to learn which words in a sentence are the most important.
- Because we don’t want the model to say things like “Last night I was wondering about windows and doors. Why did Rose not let Jack on that door?” where each sentence on its own makes sense but they have nothing to do with each other, they asked the model to guess the next sentence so it learns to connect them.
- To make it smarter, they built a pretty large model and fed it a large amount of data.
- We don’t just want a model that can talk English nicely. We want it to understand and respond to instructions. So researchers fine-tuned it by rewarding it when it responded what a human deemed to be a good response.
And just like that, after years of research, dozens of brilliant people working hard, months of development, $100+ million USD spent on training, and a dash of fairy dust, we have this breakthrough model that we can ask things like:
And be dissatisfied with both answers 😒. Which brings up a timely point… how do you increase your chances of getting a good response from these models?
ChatGPT In Practice
These Language Models, well, they can't do everything for us. The quality of the response is directly related to the quality of the prompt/question you provide. The awesome-chatgpt-prompts repo (linked in the Resources section below) is a great collection of prompts that get the model in the right "mood" to answer your questions as best it can.
Curious as to where this AI magic is going? Let’s walk to the next section to hear my subjective thoughts on the matter…
Last Thoughts: The Future
How unique is ChatGPT? Will it make Google obsolete? Can I train a ChatGPT-like model myself? Is it factual? Is it biased? How long until machines take over?
ChatGPT is somewhat unique, but not that much. It’s the first publicly available model of its kind, but its underlying technology (what we went over in this long article) is well known. And it’s the result of years of research and development not only from OpenAI (who created ChatGPT) but from many other labs that have pushed the frontiers of Language Models in the last two decades.
Google, Meta, DeepMind, Microsoft, IBM, and others are all working on similar models. They have been for quite a while. For instance, Google Brain (a research lab at Google) came up with Transformers in their current form, which allowed them to build the next generation of models (BERT) four years ago.
There’s speculation in the air about what this means to the search engine giant. My inclination is to think that Google isn’t really falling behind, they’re just being quieter. They built Sparrow with 500+ billion parameters. DeepMind (owned by Alphabet, which is Google’s parent company) built Chinchilla which is smaller but performs at a similar level to GPT-3 (which is the basis for ChatGPT). Meta’s Atlas is a smaller model but is able to retrieve documents in real time to find factual answers (something that ChatGPT has been openly critiqued for lacking). And there are several others.
I don’t believe that the future of language models belongs to any one of them. I believe that several 4–5 companies with big research departments and deep (I mean deep) pockets will be at the forefront of this movement which I dare to say may be as revolutionary for the internet as Google was ni 1999.
Because of how costly training these models is, it's unlikely that individuals or smaller organizations will develop their own, at least not at a level to compete with the mega models. But these smaller organizations and individuals will find the uses for the mega models. They (we) will probably also be able to take the mega models and further train them on more specific tasks, if those models are made public (like BERT). Imagine taking ChatGPT and making it exceedingly good at working with legal documents. That would be super boring, but also super useful to lawyers.
A big portion of work in the coming years will be spent in making these models more factual and less biased. Currently they are trained with internet data, which comes with errors and biases. We are already seeing efforts to both clean the input data better, and to fine-tune models further to reduce the biases and errors they’ve shown to have.
So, is it a matter of months or years until these models become sentient? I belong to the camp that thinks that won't happen, maybe ever. At least not with the methods we're using now. As you saw, the technology is not intelligence per se. It's a machine that very accurately predicts what the next word will be, within the parameters provided by the humans training it.
Whatever the future of Language Models holds, it is bright and something to be excited about. It might just turn upside down the way in which we interact with the internet and with written information as a whole.
Thank you for reading! I hope this was helpful. Please let me know if you have any questions or anything isn’t clear so I can clarify and also improve this document.
RESOURCES
Technical Papers
The following papers explain how ChatGPT came into existence. It’s a list of (arguably) the most seminal research that allowed ChatGPT (and similar models) to be built. They are in chronological order, and it’s best to study them in such order.
- Transformers — Attention Is All You Need (2017)
  - by Google Brain
  - Breakthrough: a revolutionary method to contextualize meaning via transformer layers
  - https://arxiv.org/pdf/1706.03762v5.pdf
- Large Language Models Training — Proximal Policy Optimization (2017)
  - https://arxiv.org/pdf/1707.06347.pdf
- GPT — Improving Language Understanding by Generative Pre-Training (2018)
  - by OpenAI
  - https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
- BERT — Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
  - by Google
  - https://arxiv.org/pdf/1810.04805.pdf
- GPT-2 — Language Models are Unsupervised Multitask Learners (2019)
  - https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- GPT-3 — Language Models are Few-Shot Learners (2020)
  - https://arxiv.org/pdf/2005.14165.pdf
- ChatGPT
  - NOTE: OpenAI hasn't published a ChatGPT paper, but most of the important tweaks from GPT-3 to ChatGPT are captured in the following papers:
  - Learning to summarize from human feedback (2020): https://arxiv.org/pdf/2009.01325.pdf
  - Training language models to follow instructions with human feedback (2022): https://arxiv.org/pdf/2203.02155.pdf
- Chinchilla — Training Compute-Optimal Large Language Models (2022)
  - by DeepMind
  - Lesson: scaling the number of training tokens (the amount of text training data) is as important as scaling model size
  - https://arxiv.org/pdf/2203.15556.pdf
- Demonstrate-Search-Predict (Stanford, 2023)
  - https://arxiv.org/pdf/2212.14024.pdf
Semi-Technical Explanations
- BERT Explanations
  - What is BERT?: The Transformer neural network architecture EXPLAINED. "Attention is all you need" (NLP)
  - What are Transformers?: The Illustrated Transformer
- ChatGPT Explanations
  - How ChatGPT actually works
  - How ChatGPT was trained
Additional Resources
- Prompting ChatGPT like a pro:
- GitHub — f/awesome-chatgpt-prompts: This repo includes ChatGPT prompt curation to use ChatGPT better.
Appendix
Vectors
If I say to an alien from a faraway galaxy "meet me at [48.85, 2.35]", it wouldn't understand. First because it probably doesn't speak English, but more importantly, because those are Earth coordinates. Do they look familiar? They too are vectors. And by themselves they don't mean much, but in context they do. Coordinates contain knowledge because we know that there is a place on Earth with coordinates [0, 0]. So when I say [48.85, 2.35], other people understand that I need to move 48.85 degrees North and 2.35 degrees East from that [0, 0] point. It's the relationship, or distance, between one coordinates vector and another that is helpful to us. So I'll be meeting the alien in Paris, France.