Absolute positional encoding
Get the project source code below, and follow along with the lesson material.
To set up the project on your local machine, please follow the directions provided in the README.md file. If you run into any issues with running the project source code, feel free to reach out to the author in the course's Discord channel.
[00:00 - 00:09] Okay, so we've finished why we need nonlinearities. Okay, so the next part here is we're going to talk about position.
[00:10 - 00:17] We're going to use the information that position provides us. So let's say we have these two sentences.
[00:18 - 00:23] Again, let's back away from large language models. Let's back away from math and all these diagrams in general.
[00:24 - 00:30] We're just looking at what English provides us now. English generally provides us information based on the position of a word.
[00:31 - 00:35] So here we've got "something really something something something." Right.
[00:36 - 00:43] And then in this other sentence, we have "really" at the very end. You might have a rough idea of what "really" means in these two contexts.
[00:44 - 00:52] Right. I know what words are underneath these black boxes, but "really" here basically could mean "very," it could mean "extremely."
[00:53 - 00:56] Right. And down here at the bottom, "really" could mean something like "actually."
[00:57 - 01:04] Right. And maybe the roles are switched, but to some degree, just by looking at the position, we can guess what that word might mean.
[01:05 - 01:07] So here I've got, "I really love ice cream." Right.
[01:08 - 01:12] And so "really" emphasizes "love." And then here we've got, "I love ice cream, really."
[01:13 - 01:14] Right. So this "really" means something like "actually."
[01:15 - 01:24] So position alone tells us what a word might mean. Position suggests meaning.
[01:25 - 01:33] But here's the problem. If you look at our self attention from before, self attention was a weighted average of the inputs.
[01:34 - 01:40] Right. The problem with the weighted average is that you could simply change the ordering of the inputs.
[01:41 - 01:49] And as long as you change the ordering of the weights as well, you get the exact same output. So in this case, let's say that X and Y are my input values.
[01:50 - 01:57] Three and two are my weights. If I switch X and Y, as long as I switch two and three as well, my weighted average is going to give me the same result.
[01:58 - 02:02] Right. And so this is the issue with self attention here.
[02:03 - 02:13] In self attention, I could just take, let's say, the blue vector with the blue weight and the purple vector with the purple weight, and just switch the two. And I'd get exactly the same output.
[02:14 - 02:29] So the problem is that self attention doesn't really have a notion of position. And what I mean specifically by that is if I change the position of tokens, I get the same output and I shouldn't have that happen.
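To make this concrete outside the slides, here is a minimal sketch of that symmetry, assuming NumPy and made-up vectors for X and Y; it is my own illustration rather than code from the lesson.

```python
import numpy as np

# Hypothetical token vectors (stand-ins for the blue and purple vectors).
x = np.array([1.0, 4.0])
y = np.array([3.0, 0.0])

# Weighted average with weights 3 and 2, normalized like attention weights.
original = (3 * x + 2 * y) / 5

# Swap the inputs AND their weights: the result is unchanged.
swapped = (2 * y + 3 * x) / 5

print(original)                        # [1.8 2.4]
print(swapped)                         # [1.8 2.4]
print(np.allclose(original, swapped))  # True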
[02:30 - 02:37] If I change the position of my input tokens, I should get a different output vector, so that the model can distinguish between the two cases. So here's an example.
[02:38 - 02:51] Let's say that I say "you are cool" versus "are you cool?" Obviously, these two sentences have completely different meanings, and I need to be able to distinguish between these two, or rather, the large language model needs to distinguish between these two.
[02:52 - 03:05] So position suggests meaning, but self attention doesn't have a notion of position and isn't able to leverage this fact. So we need to introduce something called a positional encoding.
[03:06 - 03:10] Right. So this is our diagram from before.
[03:11 - 03:19] It looks slightly different. It's just a simplified version, but this is our diagram from before with our transformer, and inside the transformer, we have our attention and our MLP.
[03:20 - 03:26] We're going to add a new operation. This new operation is going to be our positional encoding.
[03:27 - 03:38] And the goal is that different positions of inputs should give us different outputs from attention. So here's one way to do that.
[03:39 - 03:53] One way to do that is using something called an absolute positional encoding. So let's say we have three tokens, right? For the first token, we're going to add the number one to all of its dimensions.
[03:54 - 04:00] For the second token, we're going to add the number two to all of its dimensions. For the third token, we're going to add the number three, and so on and so forth.
[04:01 - 04:06] So to best explain this, I'll actually jump right back into the notebook, just as usual.
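As a rough sketch of the idea before the notebook, assuming NumPy and a made-up 3-token, 4-dimensional embedding matrix (my own illustration, not the notebook's code):

```python
import numpy as np

# Three hypothetical tokens, each a 4-dimensional embedding.
tokens = np.random.randn(3, 4)

# Absolute positional encoding as described: add 1 to every dimension of the
# first token, 2 to the second, 3 to the third, and so on.
positions = np.arange(1, tokens.shape[0] + 1).reshape(-1, 1)  # [[1], [2], [3]]
encoded = tokens + positions  # broadcasts each position across all dimensions

# Now the same token at a different position carries a different vector, so
# reordering the input tokens changes what attention sees and what it outputs.
```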