Self-attention adds context

Project Source Code

Get the project source code and follow along with the lesson material.

To set up the project on your local machine, please follow the directions in the README.md file. If you run into any issues running the project source code, feel free to reach out to the author in the course's Discord channel.

  • [00:00 - 00:04] So how do transformers actually predict? Now we're going to dive deeper into the internals of a transformer layer.

    [00:05 - 00:15] Okay. So to give you a little bit more context: a large language model, as we said before, is basically a transformer-based model.

    [00:16 - 00:31] And a transformer is a single layer, or a single block, that is repeated many, many times. All of those blocks together, along with two extra operations, embed and nearest neighbors, are what we call a large language model.
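To make that stack concrete, here is a minimal runnable sketch of the pipeline in Python. Everything in it is a toy invented for illustration (the two-word vocabulary, the do-nothing block); only the overall shape, embed, then repeated blocks, then nearest neighbors, comes from the lesson.

```python
import numpy as np

# Toy two-word vocabulary; real models have tens of thousands of entries.
VOCAB = {"feel": np.array([-1.0, 0.0]), "cold": np.array([0.0, -1.0])}

def embed(word):
    return VOCAB[word]                      # embed: word -> vector lookup

def toy_block(vectors):
    return vectors                          # a real block would transform these

def nearest_neighbor(vector):
    # Pick the vocabulary word whose vector is most similar (highest dot product).
    return max(VOCAB, key=lambda w: VOCAB[w] @ vector)

def llm_next_word(words, blocks):
    vectors = [embed(w) for w in words]     # 1. embed
    for block in blocks:                    # 2. the same block type, repeated
        vectors = block(vectors)
    return nearest_neighbor(vectors[-1])    # 3. nearest neighbors -> a word

print(llm_next_word(["feel", "cold"], [toy_block] * 4))  # prints 'cold'
```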

    [00:32 - 00:35] Okay. So now let's look at this transformer.

    [00:36 - 00:40] This is a transformer in the center of our network. It manipulates vectors for us.

    [00:41 - 00:45] So, as we saw, it manipulates vectors for us.

    [00:46 - 00:58] Before I go deeper: usually I like to introduce things as they come up, but here we have a concept that I have to introduce first for the rest of this lesson to make sense. So let's look at these points again.

    [00:59 - 01:01] These are the words that we saw before. Right.

    [01:02 - 01:06] Now, let's say that we had done something called a dot product. Right.

    [01:07 - 01:18] So remember, a dot product, or inner product, is what you compute when you, quote unquote, look at the similarity between two vectors. So here's mechanically what a dot product is.

    [01:19 - 01:28] Let's say we're computing the dot product of this green array and this blue array. To do that, we would take 0 × 0.7 + 1 × 0.7.

    [01:29 - 01:35] So you take each of the elements, you multiply them together and you sum the products. And so that's mechanically what a dot product is.
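In plain Python, that recipe (multiply matching elements, then sum the products) is a one-liner. The green and blue values below are assumptions inferred from the arithmetic quoted above.

```python
# Dot product, mechanically: multiply matching elements, sum the products.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

green = [0, 1]      # assumed from "0 times 0.7 plus 1 times 0.7"
blue = [0.7, 0.7]

print(dot(green, blue))  # 0*0.7 + 1*0.7 = 0.7
```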

    [01:36 - 01:42] But you don't really need to memorize that. The most important part is that the dot product means similarity. Right.

    [01:43 - 01:56] So what you'd expect is that once you compute the similarity, these two, green and blue, should have the highest similarity. These two, blue and purple, should have the next highest similarity, and green and purple should have the least similarity.

    [01:57 - 01:59] Right. Just intuitively, similarity is also how close the points are.

    [02:00 - 02:01] Right. So that's roughly what we expect.

    [02:02 - 02:04] And it turns out that's true. Right.

    [02:05 - 02:14] Green and blue have the highest dot product, 0.7. Blue and purple have the next highest dot product, -0.7. And green and purple have the lowest dot product, which is -1.
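You can verify all three pairs in a few lines of NumPy. The purple vector below is an assumption, chosen so that the three dot products reproduce the quoted numbers.

```python
import numpy as np

green = np.array([0.0, 1.0])
blue = np.array([0.7, 0.7])
purple = np.array([0.0, -1.0])  # assumed; consistent with the results above

print(green @ blue)    # 0.7  -> highest: the most similar pair
print(blue @ purple)   # -0.7 -> next highest
print(green @ purple)  # -1.0 -> lowest: the least similar pair
```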

    [02:15 - 02:17] Right. That matches our intuition: dot products tell us how similar vectors are.

    [02:18 - 02:23] I mean, there are some extra conditions and caveats, but you can ignore those for now.

    [02:24 - 02:30] Now, again, like always, the best way to explain this is by looking at code. So let's go back to our notebook and explore a little bit.

    [02:31 - 02:36] So here we're going to explore dot products.

    [02:37 - 02:45] So we're going to take a few words that we know should be related. And then we're going to take a word that we know should definitely not be related.

    [02:46 - 02:59] So just to jog your memory: whenever I write model and then use this syntactic sugar, I'm basically accessing a dictionary that maps from orange to its corresponding vector. So now let's do dog.

    [03:00 - 03:04] So now I have these three vectors. Let's actually compute their inner products.

    [03:05 - 03:16] So fortunately there's, again, some more syntactic sugar, but I like using this one and you might see it a lot in code. Whenever there's an at sign (@) like this, it just means I'm taking the inner product between these two.

    [03:17 - 03:22] So I'll use inner product and dot product interchangeably. In this particular case, they're exactly the same thing.

    [03:23 - 03:31] So I'm going to take a dot product between dog and apple, right. And what we expect is that this number should be fairly big, right?

    [03:32 - 03:39] It should be bigger than the inner product between orange and apple, because orange and apple are more similar than dog and apple are. So let's try that dog and apple.

    [03:40 - 03:43] They get us some number, right? Now let's try this again with orange and apple.

    [03:44 - 03:50] Sorry, I might have said this in reverse earlier. We expect orange and apple to have a higher dot product, right?

    [03:51 - 03:55] They're very, very similar. Like we saw before, the more similar you are, the higher the dot product.

    [03:56 - 04:01] And here, dog and apple are going to have a relatively lower dot product. In fact, it's half the magnitude, right?
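If you want to reproduce this outside the course notebook, here is a minimal sketch using gensim's pretrained GloVe vectors. This model choice is an assumption; the lesson's notebook may load its word vectors differently, so the exact numbers will vary.

```python
import gensim.downloader as api

# A small pretrained word-embedding model (downloads on first use).
model = api.load("glove-wiki-gigaword-50")

# model["word"] is the dictionary-style sugar from the lesson: word -> vector.
print(model["dog"] @ model["apple"])     # expected: the smaller dot product
print(model["orange"] @ model["apple"])  # expected: the larger dot product
```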

    [04:02 - 04:06] So that matches our intuition. So you can also try this with a bunch of other random pairs, right?

    [04:07 - 04:14] You can try king and queen versus launch, right? And you'll see the same idea happen again and again, where the more similar the words, the higher the dot product.

    [04:15 - 04:22] Okay. So now that I've done that, moving forward, in your head you should equate dot product with similarity, right?

    [04:23 - 04:26] And that's just the idea I wanted to convey. Okay.

    [04:27 - 04:28] So we did this demo. We explored dot products for vectors.

    [04:29 - 04:34] Okay. So now that we know that let's compartmentalize that, put that aside for now.

    [04:35 - 04:47] And let's talk about a different, but related, idea. In particular, generally speaking, some words change meaning based on other words.

    [04:48 - 04:50] So this is nothing related to math, nothing related to LLMs. This is just generally true in English.

    [04:51 - 04:53] We know that this is true. And so let me give an example.

    [04:54 - 04:57] Cold, right? When you look at this word, there are several different possible meanings.

    [04:58 - 05:00] Yeah, exactly. We need context to understand what a word means.

    [05:01 - 05:07] So in this case, cold could mean not warm. And as soon as I say feel cold, you immediately know what that means, right?

    [05:08 - 05:20] And if I say "a cold," you immediately know that cold means sickness. So based on "feel cold" or "a cold," just one extra word, you have context for understanding what the meaning of this word is.

    [05:21 - 05:29] And that's important. But now that we know that cold has multiple meanings, we actually need different vectors for each meaning of cold.

    [05:30 - 05:40] We need to distinguish between those two somehow. Regardless of what computation we do and what we do with that vector, we need to understand when there are two different meanings of the word: we need a vector for each meaning.

    [05:41 - 05:44] All right. So how do we do that though?

    [05:45 - 05:52] Let's say I have cold here and I convert it into a vector, (0, -1). And let's say I have the extra word now.

    [05:53 - 05:58] I have the word that provides the context we need. So feel here would be the extra word that we needed.

    [05:59 - 06:04] So let's say we have feel cold. Each of those words is embedded, or converted, to a vector: (-1, 0) and (0, -1).

    [06:05 - 06:12] And we don't have access to the other words yet. So when we're embedding cold, we don't have access to feel yet.

    [06:13 - 06:19] So we need to do that in the transformer. In the transformer, we somehow modify these two vectors so that they incorporate context.

    [06:20 - 06:26] Okay. And then also, a question from Maya: was that @ sign the Python matrix multiply operator?

    [06:27 - 06:27] Exactly. It absolutely is.

    [06:28 - 06:36] It just happens that for vectors, matrix multiplication degenerates to just the inner product. But you're absolutely right.

    [06:37 - 06:41] It is also the NumPy matrix multiply operator. Okay.
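That equivalence is easy to check: for 1-D NumPy arrays, @ computes the inner product, and for 2-D arrays it is matrix multiplication.

```python
import numpy as np

a = np.array([0.0, 1.0])
b = np.array([0.7, 0.7])
print(a @ b, np.dot(a, b))  # 0.7 0.7 -> for 1-D arrays, @ is the inner product

M = np.array([[2.0, 0.0],
              [0.0, 3.0]])
print(M @ a)  # [0. 3.] -> for 2-D arrays, @ is matrix multiplication
```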

    [06:42 - 06:49] So back to here: we have now taken two words, and we've embedded each of those words separately. Right.

    [06:50 - 06:55] But we need to somehow manipulate these vectors so that cold incorporates its meaning. Right. Sorry.

    [06:56 - 07:00] So that there is a different vector for each meaning of cold. So how do we do that?

    [07:01 - 07:04] Let's look at this transformer. This transformer has two different parts.

    [07:05 - 07:06] The first part is self-attention. Right.

    [07:07 - 07:12] At a very, very high level, self-attention adds context across words. Right.

    [07:13 - 07:21] So it uses all of the words, in theory, to add context to each one of them. Now, the multi-layer perceptron is the second part.

    [07:22 - 07:27] And this one modifies words individually. Note that we've been using words and tokens interchangeably.

    [07:28 - 07:36] Here, self-attention is operating across all the tokens. The multi-layer perceptron is operating on individual tokens, one by one.
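Here is that split as a toy sketch. The bodies of self_attention and mlp are placeholders invented for illustration; only the structure, one part mixing across tokens and one part acting per token, reflects the lesson.

```python
import numpy as np

def self_attention(vectors):
    # Placeholder: mixes information ACROSS tokens (here, a plain average).
    # Real self-attention uses a weighted mixture computed from the inputs.
    mean = np.mean(vectors, axis=0)
    return [mean for _ in vectors]

def mlp(vector):
    # Placeholder: transforms ONE token's vector on its own.
    return 2.0 * vector

def transformer_block(vectors):
    vectors = self_attention(vectors)  # part 1: operates across all tokens
    return [mlp(v) for v in vectors]   # part 2: operates token by token

feel, cold = np.array([-1.0, 0.0]), np.array([0.0, -1.0])
print(transformer_block([feel, cold]))  # [array([-1., -1.]), array([-1., -1.])]
```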

    [07:37 - 07:41] So that is one important distinction. Let's focus first on self-attention.

    [07:42 - 07:53] So self-attention here is going to be what we use to ensure that cold has a different vector for each meaning. So what's one way that we can do that?

    [07:54 - 08:04] Well, one way we can do that is: wherever cold is, whatever its vector is, we can take the word that comes before it. In this case, that's feel, and we can just take the average of those two vectors.

    [08:05 - 08:09] Now we get (-0.5, -0.5). Right.

    [08:10 - 08:14] And if the word before was different, then this first vector would be different, and as a result, this vector would be different too.

    [08:15 - 08:24] So in some sense, we've achieved our objective just by averaging in the word before it. We now have a different vector for cold for each of its different meanings.
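In code, that averaging step looks like this, using the feel and cold embeddings from the slide:

```python
import numpy as np

feel = np.array([-1.0, 0.0])
cold = np.array([0.0, -1.0])

# Contextualize cold by averaging it with the word that comes before it.
cold_in_context = (feel + cold) / 2
print(cold_in_context)  # [-0.5 -0.5] -> a new vector for this meaning of cold
```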

    [08:25 - 08:33] But there's a problem. If we take an average of all the inputs for every output, all the outputs are going to be the same.

    [08:34 - 08:35] Right. So we can't have that.

    [08:36 - 08:44] And there's also a different problem, an intuitive one. Intuitively, the meaning of feel doesn't change just because cold comes after it.

    [08:45 - 08:50] Right. Feel has meaning on its own that is unchanging.

    [08:51 - 08:55] So intuitively, this vector should not change. So here's what we're actually going to do.

    [08:56 - 09:09] Instead of taking an average, we're going to take a weighted average that's computed based on the inputs. I haven't explained how these weights are calculated just yet, but for now, just imagine somehow the weights are calculated and now I have a weighted average.

    [09:10 - 09:17] And then this allows us to represent no change for feel and some sort of change for cold.
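As a sketch of that weighted average, here is the same two-word example with weights picked by hand. The weights are an assumption for illustration only; computing them from the inputs is exactly what self-attention does, and that comes next.

```python
import numpy as np

X = np.array([[-1.0, 0.0],   # row 0: feel
              [0.0, -1.0]])  # row 1: cold

# Hand-picked weights; self-attention would compute these from the inputs.
W = np.array([[1.0, 0.0],    # feel's output: 100% feel -> no change
              [0.5, 0.5]])   # cold's output: half feel, half cold

print(W @ X)
# [[-1.   0. ]  feel: unchanged
#  [-0.5 -0.5]] cold: changed to incorporate its context
```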