Demo: Cons of absolute positional bias

  • [00:00 - 00:03] So let's go to the notebook. All right.

    [00:04 - 00:09] Let's create a new cell. Let's create a demo.

    [00:10 - 00:15] This demo is going to be absolute positional encoding. All right.

    [00:16 - 00:22] So in this case, we have three different vectors, which I'm going to copy from above. We have three vectors.

    [00:23 - 00:36] You are cool. All right. So as we said before, we're going to add one to all of the input dimensions. And this right here is syntactic sugar.

    [00:37 - 00:40] Technically what I should be doing is the following. I should be doing something like this.

    [00:41 - 00:45] Right. I should add a vector of all ones. And so what happens if I do that?

    [00:46 - 00:49] What happens if I add a vector? Oh, I want to add two vectors, actually, right?

    [00:50 - 00:54] I guess I never explained that. So let's say I do this: a tensor of four, five, six.

    [00:55 - 01:07] What will happen is the first dimensions will sum together, the second dimensions will sum together and the third dimensions will sum together. So if I sum these two together, then you should expect something like five, seven, nine, right?

    [01:08 - 01:09] And there we go. Five, seven, nine.
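
A minimal sketch of that cell in PyTorch, assuming the first tensor is [1, 2, 3] (which matches the [5, 7, 9] result):

```python
import torch

# Element-wise addition: corresponding dimensions sum together.
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
print(a + b)  # tensor([5, 7, 9])
```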

    [01:10 - 01:13] So that's how to add two tensors together. Now, what if I did the following?

    [01:14 - 01:18] Instead of adding two tensors, I just added a two. PyTorch is smart.

    [01:19 - 01:27] It knows that if I just add a scalar, it should add two to every single one of these. So this is the same thing as doing this.

    [01:28 - 01:35] So if I do this, I get three, four, five. And if I just add two, I get the exact same result.
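
A quick sketch of the broadcasting behavior being described, again assuming the [1, 2, 3] tensor:

```python
import torch

a = torch.tensor([1, 2, 3])

# Adding a scalar broadcasts it across every dimension...
print(a + 2)                        # tensor([3, 4, 5])
# ...which is the same as adding a tensor of all twos.
print(a + torch.tensor([2, 2, 2]))  # tensor([3, 4, 5])
```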

    [01:36 - 01:45] So we're going to leverage that. We're going to say that 'you' plus one gives us the positionally encoded version of the 'you' vector.

    [01:46 - 01:48] Yeah, this is just a bunch of numbers. It doesn't really tell us a whole lot.

    [01:49 - 01:56] So I'm just going to say 'you' encoded, like, positionally encoded. 'Are' positionally encoded is 'are' plus two.

    [01:57 - 02:03] And then 'cool' positionally encoded is 'cool' plus three. Right.

    [02:04 - 02:13] And then you would continue this indefinitely. So you can imagine this seems pretty straightforward, but there's actually a very, very fundamental problem with this approach.
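
A minimal sketch of this absolute positional encoding step; the embedding values for 'you', 'are', and 'cool' below are placeholders, not the notebook's actual vectors:

```python
import torch

# Placeholder embeddings for "you", "are", "cool" (illustrative values only).
you  = torch.tensor([0.1, 0.2, 0.3])
are  = torch.tensor([0.4, 0.5, 0.6])
cool = torch.tensor([0.7, 0.8, 0.9])

# Absolute positional encoding: add each word's position in the sequence.
you_encoded  = you  + 1   # "you" is the 1st word
are_encoded  = are  + 2   # "are" is the 2nd word
cool_encoded = cool + 3   # "cool" is the 3rd word
```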

    [02:14 - 02:22] So let's see what that fundamental problem is. I'm going to scroll back up and I'm going to pull these words, orange, apple and dog.

    [02:23 - 02:31] So here we've got orange, apple and dog. And as we said before, there is an interesting property here.

    [02:32 - 02:38] When I take the inner product of orange and apple, they are very close together. And therefore the dot product is higher.

    [02:39 - 02:47] Whereas if I do dog and apple, the dot product is lower. And that's because dog and apple are less similar than orange and apple are.

    [02:48 - 02:51] So this is good. This is a property that we desire.

    [02:52 - 02:59] We want this difference to be significant. And in this case, the magnitude of this dot product is twice as large as this one.
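
A sketch of that property with made-up unit-length embeddings (not the notebook's actual values), chosen so 'orange' and 'apple' point in similar directions while 'dog' does not:

```python
import torch

# Made-up unit-length embeddings (not the notebook's actual vectors).
orange = torch.tensor([0.8, 0.6, 0.0])
apple  = torch.tensor([0.6, 0.8, 0.0])
dog    = torch.tensor([0.0, 0.6, 0.8])

print(orange @ apple)  # 0.96 -> similar words, higher dot product
print(dog @ apple)     # 0.48 -> less similar words, lower dot product
```

With these illustrative values the first dot product is about twice the second, which is the kind of "pretty significant" difference being described.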

    [03:00 - 03:05] So I would say that the difference is pretty significant. But now let's say that dog

    [03:06 - 03:12] is the 1000th word in its sequence.

    [03:13 - 03:16] And let's say that orange

    [03:17 - 03:23] and apple are also the 1000th words of their sequences.

    [03:24 - 03:31] So all of these words would be positionally encoded by adding 1000 to them. Right.

    [03:32 - 03:36] Because remember, we're adding the position of each word to the word itself.

    [03:37 - 03:41] So here, let's see what that looks like. You are cool.

    [03:42 - 03:47] And all of them now have 1000 added to them. Right.

    [03:48 - 03:51] Now, let's see what happens to the inner product. You have orange and apple.

    [03:52 - 03:56] Oh, whoops, I encoded the wrong words there.

    [03:57 - 04:00] So orange. Apple.

    [04:01 - 04:04] Dog. And then this is orange.

    [04:05 - 04:08] Apple. And dog.

    [04:09 - 04:16] And so now if I do orange, positionally encoded, Apple, positionally encoded. OK, so first we have a massive number.

    [04:17 - 04:22] And I realize there is a problem here that I need to fix in just a second. And then let me do dog.

    [04:23 - 04:25] and apple, positionally encoded.

    [04:26 - 04:36] OK, so basically these are two really big numbers, but let me fix something. Actually, I need to explain something real quick.
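
For reference, those two "really big numbers" look something like this with the made-up embeddings from before:

```python
import torch

orange = torch.tensor([0.8, 0.6, 0.0])
apple  = torch.tensor([0.6, 0.8, 0.0])
dog    = torch.tensor([0.0, 0.6, 0.8])

# Each word is (hypothetically) the 1000th token, so 1000 gets added everywhere.
orange_encoded = orange + 1000
apple_encoded  = apple  + 1000
dog_encoded    = dog    + 1000

# Both dot products are enormous, dominated by the 1000s.
print(orange_encoded @ apple_encoded)  # roughly 3 million
print(dog_encoded @ apple_encoded)     # roughly 3 million
```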

    [04:37 - 04:45] So, for each of these vectors, dog, apple, orange, they actually have a special property.

    [04:46 - 04:58] If we look at the length of any of these vectors, so dog, apple, and orange, you'll notice something, which is all of them have a length of one.

    [04:59 - 05:05] So length is the same thing as saying the norm. And it's also the same thing as saying the distance from the origin.

    [05:06 - 05:16] So all of these points are a distance of one from the origin. And these vectors that I've created don't have that property.
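
With the same made-up embeddings, the length check looks like this:

```python
import torch

orange = torch.tensor([0.8, 0.6, 0.0])

# Length = norm = distance from the origin.
print(torch.linalg.norm(orange))         # tensor(1.)
# The positionally encoded version no longer has length 1.
print(torch.linalg.norm(orange + 1000))  # roughly 1733
```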

    [05:17 - 05:21] And we need to make sure that property is maintained. Let's create a new cell.

    [05:22 - 05:30] Oh, and we need that property to be maintained so that these inner products actually make sense. So let's do that.

    [05:31 - 05:33] Let's write orange. Oh, whoops, let's write normalized.

    [05:34 - 05:41] And it takes a vector and we divide by its norm. OK, great.

    [05:42 - 05:46] Actually, yeah, these are all NumPy vectors. These are NumPy vectors.

    [05:47 - 05:52] OK, so now I'm going to normalize all of these. Perfect.
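
A minimal sketch of that helper, assuming it simply divides a vector by its norm:

```python
import torch

def normalized(v):
    # Scale a vector to length 1 by dividing it by its norm.
    return v / torch.linalg.norm(v)

orange = torch.tensor([0.8, 0.6, 0.0])
orange_encoded = normalized(orange + 1000)
print(torch.linalg.norm(orange_encoded))  # tensor(1.) again
```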

    [05:53 - 05:56] And now we're going to redo this check for dot product. So this is approximately one.

    [05:57 - 06:01] You can ignore that point-oh-one at the right end. That's floating-point imprecision in Python.

    [06:02 - 06:04] You can ignore that. And this is also one.

    [06:05 - 06:18] OK, so now dog and apple are similar in the way that orange and apple are similar, according to these numbers. And as we know, that's certainly not true.

    [06:19 - 06:30] Dog and apple should be much less similar than orange and apple are. So what we've done by adding this 1000 is we've completely dominated this inner product with just the position index.

    [06:31 - 06:50] Right. So in effect, we've completely annihilated any differences in the vector here, and this vector is now basically just 1000 plus a tiny number. Right. And this issue only gets worse if we have more and more tokens in our input, right, because this number gets bigger and bigger.
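
Putting the pieces together, a sketch of the failure mode with the same made-up embeddings:

```python
import torch

def normalized(v):
    return v / torch.linalg.norm(v)

# Made-up unit-length embeddings.
orange = torch.tensor([0.8, 0.6, 0.0])
apple  = torch.tensor([0.6, 0.8, 0.0])
dog    = torch.tensor([0.0, 0.6, 0.8])

# Add the position index 1000, then re-normalize to length 1.
orange_encoded = normalized(orange + 1000)
apple_encoded  = normalized(apple  + 1000)
dog_encoded    = normalized(dog    + 1000)

# Word identity barely matters any more: both similarities are ~1.0.
print(orange_encoded @ apple_encoded)  # ~0.9999
print(dog_encoded @ apple_encoded)     # ~0.9999
```

Both similarities collapse to roughly 1.0, which is exactly the information loss being described.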

    [06:51 - 06:59] So this is the issue with absolute positional encoding. We completely lose information, and all of our dot products no longer make sense.

    [07:00 - 07:06] So how can we fix that? We can fix that by using something called relative positional encoding.

    [07:07 - 07:16] Instead of adding the index of the position, we're going to add the index of a relative position. In this case, you can see that we've got the position one, two, one, zero, one.

    [07:17 - 07:26] And it's basically following a sinusoid. In this case, I've got a, yeah, I've actually got a sine function plotted right here, and you can see that it fluctuates.

    [07:27 - 07:42] Now, luckily for us, you can see that this never goes above two, and it will also never go below negative two. So we'll never have this issue where our positional encoding added to the input fully dominates the inner product.

    [07:43 - 07:54] Right. We've now limited its effect on the inner product. And at the same time, you can see that adding one or two would ensure that we can't switch words and have the results be the same.
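
The exact sinusoid plotted in the notebook isn't reproduced here; as an illustrative stand-in, any bounded sinusoidal positional value avoids the blow-up, for example the sine of the position index:

```python
import torch

positions = torch.arange(10).float()

# Illustrative only: a sinusoid-based positional value stays bounded
# (here within [-1, 1]), so it can never swamp the word embedding the
# way adding a raw position index like 1000 does.
positional_values = torch.sin(positions)
print(positional_values)
print(positional_values.abs().max())  # always <= 1
```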

    [07:55 - 08:06] OK, so that's relative positional encoding versus absolute positional encoding. I'm going to pause here to see if there are any questions.

    [08:07 - 08:22] Give it like a few more seconds. OK. Great.

    [08:23 - 08:24] So