Demo - Adding "context" to a vector
Get the project source code below, and follow along with the lesson material.
Download Project Source Code
To set up the project on your local machine, please follow the directions provided in the README.md file. If you run into any issues with running the project source code, feel free to reach out to the author in the course's Discord channel.
[00:00 - 00:05] Okay, so now we're going to build self-attention manually. And that's going to tell us how these weights are actually calculated.
[00:06 - 00:18] Okay, so now let's go back to our notebook and I'm going to show you right here. Okay, so for this demo, we're going to build self-attention manually.
[00:19 - 00:31] So: manual self-attention. This is going to be the time to post questions in the chat if you have any.
[00:32 - 00:43] Because this is a core part of the process in a transformer, but there are quite a few different steps. So I'm going to do my best to break down those steps.
[00:44 - 00:48] Okay, so let's start off with importing torch, just like usual. I think I do this up above.
[00:49 - 00:58] I'll do it again just in case. Okay, so now we're going to embed these words, converting them into vectors.
[00:59 - 01:11] Right, so 'you' and 'are', and then I'm going to write, oh, whoops: you is model['you'], are is model['are'], and then cool.
[01:12 - 01:17] Cool is model['cool']. Okay.
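For reference, the embedding step at this point looks roughly like the sketch below; the model dictionary here is a tiny random stand-in for the lesson's real 300-dimensional word vectors, just so the snippet runs on its own.

```python
import torch

# Minimal sketch of the embedding lookup. In the lesson, `model` maps words to
# real 300-dimensional word vectors; here it's a random stand-in so this runs alone.
dim = 300
model = {w: torch.randn(dim) for w in ["you", "are", "cool"]}  # stand-in embeddings

you, are, cool = model["you"], model["are"], model["cool"]  # each has shape (300,)
```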
[01:18 - 01:38] Okay, so the first thing that we actually do in self-attention is we need to convert each of these into up to three different things: queries, keys, and values. But for now we're going to start off with just two of them. Right, so we're going to convert 'you' into queries and keys.
[01:39 - 01:43] Right, so let's do that. So to do that, we're going to do a matrix multiply.
[01:44 - 01:52] So here it is. No worries, yes, the model is basically a dictionary that maps from words to vectors.
[01:53 - 02:01] I'll actually show you really briefly what a vector here looks like. Right, we're going to have a vector that has shape 300, and it's just a bunch of numbers like that.
[02:02 - 02:12] So model here is a dictionary, mapping words to vectors. Okay, so what I'm going to do now is actually instantiate a matrix.
[02:13 - 02:19] So for now, ignore the intuition for what this is. I can explain later as well, but there's a lot to explain there.
[02:20 - 02:29] So for now we've got these three. Right, and what this will do is allow us to generate the queries and keys.
[02:30 - 02:35] So sorry, ignore value for now. Oh, okay, we've actually used that name already.
[02:36 - 02:46] So I'll use a general name like generate. Okay, so to actually give us the query, we're going to take a matrix product with this vector.
[02:47 - 02:56] So query, sorry, is equal to x @ W_Q, and then key is equal to x @ W_K. And I'm going to return both the query and the key.
[02:57 - 03:07] And the first thing we're going to do is generate the query and the key for 'you'. So actually, yeah, you query and you key, and then we're going to do the same thing for the other words as well.
[03:08 - 03:17] We're going to generate all the queries and all the keys. That's the full query and key generation.
[03:18 - 03:28] Okay, so this is upset at me because not all of these are torch tensors. So I'm actually just going to do this.
[03:29 - 03:43] Oh, ha ha, these tensors are really big. So that's why it's taking a while.
[03:44 - 03:51] Okay, so I'll just have to make those tensors smaller so that this code runs faster. But for now, we've now converted these into queries and keys.
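Continuing that sketch, the query/key generation step looks roughly like this; W_Q and W_K are random stand-ins for the matrices that would be learned during training.

```python
# Random stand-ins for the trained projection matrices W_Q and W_K.
W_Q = torch.randn(dim, dim)
W_K = torch.randn(dim, dim)

def generate(x):
    """Return the query and key vectors for one embedding x."""
    return x @ W_Q, x @ W_K

# Queries and keys for every token in "you are cool".
you_query, you_key = generate(you)
are_query, are_key = generate(are)
cool_query, cool_key = generate(cool)
```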
[03:52 - 04:03] Okay, so now the question is what happens when we take an inner product. So we're going to take an inner product between query 'you' and key 'are'.
[04:04 - 04:07] Okay, so what does this do? Right, what does this inner product mean?
[04:08 - 04:21] We mentioned before that inner products represent similarity. Right, what we've done is we've trained WQ and WK, these matrices, in such a way that when I do this inner product...
[04:22 - 04:34] Oops, I actually typed these variables wrong. ...such that this inner product has a special meaning. This inner product says: how much does the word 'you' modify the meaning of 'are'?
[04:35 - 04:41] Right, so query is always doing the modification. So let me write that down here.
[04:42 - 04:50] Query is doing the modification. It's providing context.
[04:51 - 04:58] Right, and the key is being modified. This is what we're trying to understand based on context.
[04:59 - 05:07] Right, so this is going to be true generally. So now let's see, the inner product is just some sort of value.
[05:08 - 05:11] Now we're going to do a few more inner products. So let's say we did you
[05:12 - 05:24] query, and then we do cool key. I realize I should have used a different pair of example words, but it's okay.
[05:25 - 05:26] We've already started with this example. So I'll continue with this one.
[05:27 - 05:35] So: does 'you' modify the meaning of 'cool'? Again, the key is what we're modifying.
[05:36 - 05:45] The query is what provides that modification. Okay, so now that we've completed this inner product, let's actually use these to compute some weights for the weighted average that we talked about earlier.
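Before that, here is roughly what those individual inner products look like in code, continuing the earlier sketch:

```python
# Each query-key inner product is a single score for how much the query token
# modifies (provides context for) the key token.
score_you_are  = you_query @ are_key    # does "you" modify the meaning of "are"?
score_you_cool = you_query @ cool_key   # does "you" modify the meaning of "cool"?
print(score_you_are.item(), score_you_cool.item())
```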
[05:46 - 05:56] So we're going to compute a set of weights. It's going to be a list here, and this list is going to be you query @ cool key.
[05:57 - 06:03] Right, so this will tell us how much 'you' modifies 'cool'. And then we're going to do are query with cool key.
[06:04 - 06:26] So: does 'are' modify 'cool'? And then finally, does 'cool' modify 'cool' itself? Right, so what we've done here, notice, is that every single one of these determines how we modify the key 'cool', and then we're going to use these three numbers to actually compute, as a weighted average, the new vector that corresponds to 'cool'.
[06:27 - 06:36] But right now, let's say I ran this. I have three weights, and they're just three random numbers, and these numbers don't sum to one.
[06:37 - 06:41] So that's not really a weighted average. That's just some multiplication.
[06:42 - 06:43] Right. So let's actually fix that.
[06:44 - 06:50] We want all of these three to sum to one. So let me actually import something else.
[06:51 - 06:59] And I import torch.nn.functional as F. And then we're going to apply a softmax.
[07:00 - 07:02] I saw the question. I'll get back to that in a second.
[07:03 - 07:05] So this is softmax. Right.
[07:06 - 07:13] And so softmax will allow us to actually ensure that all of these weights sum to one. So let's actually try that.
[07:14 - 07:20] So this list, I actually need to convert to a tensor. Perfect.
[07:21 - 07:25] So now I've got these three weights. This one is basically zero.
[07:26 - 07:29] This one is 0.99.
[07:30 - 07:36] And then this one is very, very close to zero as well. So basically this one dominated all of the weight.
[07:37 - 07:43] That's okay. So here we have three weights, and now we can finally use them for our weighted average.
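Put together, the scores-plus-softmax step looks roughly like this, continuing the sketch above:

```python
import torch.nn.functional as F

# Raw scores for how much each token modifies the key "cool".
scores = torch.stack([
    you_query  @ cool_key,
    are_query  @ cool_key,
    cool_query @ cool_key,
])

# Softmax makes the weights positive and sum to one, so they form a true weighted average.
weights = F.softmax(scores, dim=0)
print(weights, weights.sum())  # three weights that sum to 1
```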
[07:44 - 07:45] Okay. Great.
[07:46 - 07:51] So I have a few questions. So the question is, is this leveraging what we learned in the training of the foundation model?
[07:52 - 07:56] Oh, is this leveraging what was learned in the training of the foundation model? Yes, exactly.
[07:57 - 07:58] So that's a good question. That's exactly it.
[07:59 - 08:11] During the training of this model, the WQ and WK will be fully tuned. And so in theory, those WQs and WKs, those weights, will provide the meaning that I'm proposing here.
[08:12 - 08:25] And this is just one interpretation of what's going on. This is just one way to intuitively explain what's going on and why we did all of this.
[08:26 - 08:30] Okay. And the second question is, can you explain weights a bit? Is this weights and biases in this case?
[08:31 - 08:38] Weights are basically parameters of your network. So if you remember from grade school, I think it's y = mx + b.
[08:39 - 08:44] Right. So you have the equation for a line and something like this. In this case, this is actually a model, right?
[08:45 - 08:50] It's not a large language model, but it's a model. It's a model because we have some inputs x, right?
[08:51 - 09:00] And we know those to be points along the x-axis, and we have outputs y, and we know those to be points along the y-axis. But most importantly, you have m and b, which determine the slope and intercept of your line.
[09:01 - 09:10] Right. So in this case, you would say m and b are your parameters, because they determine how you predict. In the same way, LLMs are no different, except the equation becomes a lot more complicated.
[09:11 - 09:21] Right. Yeah. So for example, the equation for attention, if I did it all in one line, is something like this. It's really confusing, so don't worry if it doesn't make complete sense. But basically, this would be something like this.
[09:22 - 09:37] Right. So something long and nasty. And basically, this is, I guess actually there's no x here, so let me write that out. So it'd be WQ x, and then WK x. And then this part would be the values, but I haven't introduced values yet.
[09:38 - 09:52] So don't worry about this, but you would have something like this, right? And so WQ, WK, WV are your weights and x is your input. And this is exactly the same thing, or not exactly the same, the same concept as the line parameters that I talked about.
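For reference, the one-line equation being sketched here is, in its standard single-head form, roughly the following, where X stacks the input vectors as rows; the lesson's simplified demo drops the square-root scaling and computes one output position at a time.

$$\operatorname{Attention}(X) = \operatorname{softmax}\!\left(\frac{(X W_Q)\,(X W_K)^\top}{\sqrt{d}}\right)(X W_V)$$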
[09:53 - 10:08] Okay. Now, the third question is whether we discussed what a tensor is. Yeah, so a tensor we haven't really talked about, but a tensor you can just imagine is like a vector, and a vector is just an array of numbers. And that array of numbers, it turns out, has meaning. It has semantic meaning for us. So we can add them, subtract them.
[10:09 - 10:20] And those mathematical relations actually have some semantic meaning, meaning some meaning in English. Okay. Oh, and what I said was we didn't actually talk about tensors, so if this doesn't make sense, you can just ignore it.
[10:21 - 10:34] So, a tensor is general. The question is, are arrays always two-dimensional? And that ties into my response here, actually. A tensor can be n-dimensional: it can be one-dimensional, two-dimensional, three-dimensional, four-dimensional, and so on.
[10:35 - 10:55] A tensor has kind of any arbitrary number of dimensions. So let's say I create a tensor like torch.randn(1, 2, 3, 4), right. This would be a four-dimensional tensor, and I can add as many dimensions as I want. Right. Now, if a tensor only has one dimension, then we call that a vector.
[10:56 - 11:13] If a tensor only has two dimensions, then we call that a matrix. And that's pretty much where the special names end; after that we just call it a tensor. So the question here is: are arrays always two-dimensional? Sometimes we'll have matrices, right? But in this particular case, I've been doing everything with only vectors.
[11:14 - 11:29] So with one-dimensional tensors. And I've been doing that because it makes the code a little bit easier to understand. But if you look in the solution notebook that I gave you, there's actually a full tensor version of the code. It just looks way more complicated. And so I didn't really want to introduce it here.
[11:30 - 11:37] But yeah, you just add more dimensions to each of these operations. Yeah, and we can talk about that code later too.
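A quick illustration of that naming, with hypothetical shapes:

```python
v = torch.randn(3)           # 1 dimension  -> called a vector
m = torch.randn(3, 4)        # 2 dimensions -> called a matrix
t = torch.randn(1, 2, 3, 4)  # 4 dimensions -> just called a tensor
print(v.ndim, m.ndim, t.ndim)  # 1 2 4
```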
[11:38 - 11:54] So I think I do want to keep going for now, though. Okay, so now that we have the weights here, we can use these weights to actually compute the weighted average. So let's do that. We're going to have attention.
[11:55 - 12:17] So I put 'for the third position', because, okay, so basically which token the key comes from tells you which token you're modifying. So in this case, I'm modifying the key 'cool', right, and 'cool' we just assume is the third word in our sequence, right.
[12:18 - 12:46] And we're assuming that our input sequence is 'you are cool'. So that means the third word is 'cool'. And that's the word that we're trying to modify. That's the word we're going to output from attention. To do that, we're going to compute you times weights[0], plus are times weights[1], plus cool times weights[2]. This is now our weighted average of inputs.
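In code, this first (values-free) version of the weighted average looks roughly like this, continuing the sketch above:

```python
# Weighted average of the original embeddings for the third position ("cool"),
# using the softmax weights computed earlier (values come next in the lesson).
attention_3 = weights[0] * you + weights[1] * are + weights[2] * cool
print(attention_3.shape)  # torch.Size([300])
```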
[12:47 - 13:08] Now let's make this slightly more accurate. In reality, we don't actually take a weighted average of the original vectors, you, are, and cool. We actually take a weighted average of another set of vectors. So earlier we had mentioned that we have queries and keys.
[13:09 - 13:28] But if you look in the documentation for Hugging Face and whatnot, for these transformers, you'll see there's actually a third type. There's a value, which you'll see a lot. So let's actually create these value tensors as well. So query, key, value. This is the same thing.
[13:29 - 13:40] Okay, I know this code down below is going to take a while to run. Let's run it anyway while I explain. Oh, whoops, so actually let's do you value.
[13:41 - 13:59] Are value. And then finally, we have cool value. So every single one of these input tokens now has three vectors associated with it: query, key, and value. And as we discussed, query and key tell you what the weights of the weighted average are.
[14:00 - 14:14] The value is what you're actually taking the weighted average of. So here we actually use you value, are value, and cool value.
[14:15 - 14:33] And this finally gives us the output of the attention module in the third position. And you would do something similar for the other tokens as well. But for now we're just going to do this for a single token. Okay, so that was the manual version of self-attention.
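Putting the value step together, the end of the demo looks roughly like this; W_V is another random stand-in for a learned matrix, continuing the sketches above:

```python
# Values: a third projection of each token. The real attention output averages
# the value vectors, not the raw embeddings.
W_V = torch.randn(dim, dim)

you_value, are_value, cool_value = you @ W_V, are @ W_V, cool @ W_V

attention_3 = (weights[0] * you_value
               + weights[1] * are_value
               + weights[2] * cool_value)
print(attention_3.shape)  # torch.Size([300]): the attention output for "cool"
```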
[14:34 - 14:52] Oh, and I'm running out of power. So maybe, now that I've been talking nonstop for an hour and a half, let's take a very quick five-minute break until 12:30. I'm going to grab my charger, you can go grab some water, and then we'll be back shortly. Hey, Stephen. So we're just taking a really quick five-minute break until 12:30. But yes, this is the middle of module four.
[14:53 - 15:03] We just finished building the manual version of self-attention. And then after this we're going to be building the manual version of... oh, okay, so I guess I should explain real quick.
[15:04 - 15:18] Quick recap: module four is that there are transformer blocks, like Lego building blocks, inside of a large language model. And inside of the transformer block, there are two parts. One is called self-attention. The other one is called the multi-layer perceptron, or MLP for short.
[15:19 - 15:24] And we just built the first part of the transformer, which is the self attention. We're going to be building the MLP next.
[15:25 - 15:29] Thank you.