Demo - The necessity of non-linearities
Get the project source code below, and follow along with the lesson material.
To set up the project on your local machine, please follow the directions provided in the README.md file. If you run into any issues with running the project source code, then feel free to reach out to the author in the course's Discord channel.
This lesson is part of the Fundamentals of transformers - Live Workshop course.
[00:00 - 00:06] I'm going to say the same things anyways, just maybe a little slower because I can't type as fast as I can draw. Okay, so let me stop share here.
[00:07 - 00:16] And then let me share again. Okay, so back to here.
[00:17 - 00:27] Now what I'm going to do is I'm going to open up the exact same app that I was going to use, but on the desktop. Okay, perfect.
[00:28 - 00:29] All right. Oh, I see, I see.
[00:30 - 00:31] I know why. Okay, great.
[00:32 - 00:32] There we go. Let's do that.
[00:33 - 00:34] Okay. All right.
[00:35 - 00:42] So here I've now got a massive whiteboard for us to draw on. And the question that we wanted to answer was, why do we need non-linearities?
[00:43 - 00:51] The best way to explain this is to actually look at lines. So let's say we've graphed a line here.
[00:52 - 00:53] Okay. So these are some axes.
[00:54 - 00:57] This is our x axis right over here. And we also have our y axis.
[00:58 - 01:04] Our y axis looks something like this. So let's say that we have a line.
[01:05 - 01:07] Right. So this line might look something like this.
[01:08 - 01:13] Now that's fine. But let's say we had a bunch of points now.
[01:14 - 01:19] Right. So I have a bunch of points in space and we have a bunch of circles in the middle, something like this.
[01:20 - 01:26] And we also have a second group of shapes. So maybe we have stars.
[01:27 - 01:34] So we have stars that go all around this object, sorry, all around these points. So stars out here.
[01:35 - 01:43] And let's say that I wanted to find a line. I want to define a function that separates these two.
[01:44 - 01:53] Right. The problem is there's clearly going to be no such line that is ever able to successfully divide these two groups of points.
[01:54 - 01:57] Right. So what can we do about that?
[01:58 - 02:02] Let's look at the equation for a line again. So the equation for a line is y is equal to mx plus b.
[02:03 - 02:09] And I'll make this bigger. So let's actually do this.
[02:10 - 02:18] Let's make this bigger or bigger. OK, so now we've got this equation y is equal to mx plus b.
[02:19 - 02:26] Oh, let me open up my chat again to see if there are questions. OK, so the question was, I wish you'd spend more time explaining QKV.
[02:27 - 02:30] OK, cool. Let me write that down so we can come back to revisit QKV.
[02:31 - 02:36] And if there are any questions there, we can certainly answer those. OK, so here we have y is equal to mx plus b.
[02:37 - 02:41] Now, what we can do is we can actually make a line of lines. Right.
[02:42 - 02:47] So let's say that instead of m and b, we have something like p and r. Right.
[02:48 - 02:53] But now the input is y. OK, so z is equal to py plus r and y is equal to mx plus b.
[02:54 - 02:58] So this y here is the line that we defined up here. Right.
[02:59 - 03:00] So let's actually just plug that in. Let's plug in y.
[03:01 - 03:07] And bear with me here, this is the limit of the math that we're going to do. Right.
[03:08 - 03:13] OK, so here we plug in what y is. And then now let's actually simplify this expression.
[03:14 - 03:18] Right. So here we have pmx plus b plus r.
[03:19 - 03:35] Now, if you look at this, p is a parameter, m is a parameter, and b and r are both parameters as well. These are all learned parameters.
[03:36 - 03:39] We basically have the equation for a line again.
[03:40 - 03:43] Yeah, I thought I was missing something. That looked weird.
[03:44 - 03:44] Thank you. Thank you.
[03:45 - 03:51] OK, so we have pb here plus r. So basically, p, b, and r are all parameters, and p and m are parameters too.
[03:52 - 03:58] So this is basically just the equation for a line again. We've got something multiplying x plus another thing.
[03:59 - 04:06] So what I'm trying to illustrate here is that a line of a line is still a line. Right.
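Written out, the substitution from the whiteboard looks like the following. This is a small worked derivation that uses only the symbols already introduced above (m, b, p, r):

```latex
\begin{align*}
z &= p\,y + r \\
  &= p\,(mx + b) + r \\
  &= (pm)\,x + (pb + r)
\end{align*}
```

The slope pm and the intercept pb + r are each just a single number, so z is again a line in x.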
[04:07 - 04:15] A linear transformation of a linear transformation is still just another linear transformation. So no matter how many times we do something like this, it's always going to end up being a line.
[04:16 - 04:25] Right. In other words, if I were to make a multilayer perceptron, an MLP, and I just had matrix multiply after matrix multiply after matrix multiply,
[04:26 - 04:31] I could just have one matrix multiply and it'd be the same. Exactly.
[04:32 - 04:33] Right. The models would be exactly equivalent.
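The same collapse can be checked numerically for matrices. The sketch below is illustrative only: the layer sizes, the random weights, and the helper names (`matvec`, `matmul`, `add`) are assumptions made for this example, not code from the workshop. It stacks two linear layers with no non-linearity in between and compares them with a single merged layer.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def rand_matrix(rows, cols):
    """Hypothetical helper: a rows x cols matrix of random weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def rand_vector(n):
    """Hypothetical helper: a length-n vector of random values."""
    return [rng.uniform(-1, 1) for _ in range(n)]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Add two vectors element by element."""
    return [a + b for a, b in zip(u, v)]

# Two stacked linear layers: 3 inputs -> 4 hidden -> 2 outputs.
W1, b1 = rand_matrix(4, 3), rand_vector(4)
W2, b2 = rand_matrix(2, 4), rand_vector(2)

# One merged layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

x = rand_vector(3)
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)
one_layer = add(matvec(W, x), b)

# The two outputs agree up to floating-point rounding.
max_diff = max(abs(p - q) for p, q in zip(two_layers, one_layer))
print(max_diff)
```

Any number of stacked linear layers can be merged this way, one pair at a time.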
[04:34 - 04:38] So we need to actually insert some sort of nonlinearity. Right.
[04:39 - 04:42] We need to modify x, for example. So let's say that we do that.
[04:43 - 04:51] Let's say that we did x squared. Oh, because there's no way to do superscripts here.
[04:52 - 04:55] Okay. Let's do squared like this.
[04:56 - 05:05] So now we have y squared, x squared, plus b. If you remember the equation for a circle, it's exactly just this.
[05:06 - 05:13] Right. So if I square my x's and I square my y's, we can now represent this equation.
[05:14 - 05:23] And this equation allows us to finally create a circle. Right.
[05:24 - 05:35] So if you use a platform like Desmos or some other plotting software, you can just try this equation, x squared plus y squared equal to some number, like two or three. And you'll find that it creates a circle for you.
[05:36 - 05:40] Right. So this right here, I did this to be so.
[05:41 - 05:45] Right. So this would finally create that circle for us.
[05:46 - 05:54] So what we know now is that a nonlinearity like squared can help us create more interesting shapes with a simple equation like this. Right.
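To make the circles-and-stars picture concrete, here is a minimal sketch. The point counts, the radii, and the threshold of 2.25 are all made up for illustration. It places "circles" near the origin and "stars" in a ring around them, then separates the two groups by comparing x squared plus y squared against a single number.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def ring_points(n, r_min, r_max):
    """Hypothetical helper: n random points whose distance from the
    origin lies between r_min and r_max."""
    points = []
    for _ in range(n):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        radius = rng.uniform(r_min, r_max)
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points

circles = ring_points(100, 0.0, 1.0)  # inner cluster
stars = ring_points(100, 2.0, 3.0)    # ring surrounding the cluster

def is_inside(point, threshold=2.25):
    """Squared feature: x^2 + y^2 compared against one number.
    The boundary x^2 + y^2 = 2.25 is a circle of radius 1.5."""
    x, y = point
    return x ** 2 + y ** 2 < threshold

circles_correct = all(is_inside(p) for p in circles)
stars_correct = all(not is_inside(p) for p in stars)
print(circles_correct, stars_correct)
```

Every point lands on the correct side of the boundary. A rule of the form y = mx + b cannot do that for this layout, because the stars surround the circles on all sides.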
[05:55 - 06:02] But that nonlinearity doesn't have to be a square. In fact, in most neural networks, the nonlinearity is not a square.
[06:03 - 06:09] In most neural networks, the nonlinearity is something that we call a ReLU. At least that was one of the first ones.
[06:10 - 06:17] So here I have a new plot and we're going to draw here. I'm going to make this a little thicker.
[06:18 - 06:23] And then I'm going to make it red. Right.
[06:24 - 06:30] So ReLU, if you plot it, looks something like the following. So we'll get something like this.
[06:31 - 06:41] And the equation for ReLU is y is equal to the max of x and zero. Right.
[06:42 - 06:45] So let's say that x is greater than zero. Right.
[06:46 - 06:55] So on this side of the axis, on the right side, when x is greater than zero, this max doesn't affect anything. We just have the simple identity line, y is equal to x.
[06:56 - 07:06] But when x is less than zero, it's on the negative side, then this maximum takes full effect and y is now equal to zero. So that's why this line flattens out.
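As a quick sketch, the definition just described fits in one line of Python. The sample inputs are arbitrary values chosen for illustration.

```python
def relu(x: float) -> float:
    """ReLU: the identity for positive x, zero otherwise."""
    return max(x, 0.0)

# Negative inputs flatten to zero; positive inputs pass through unchanged.
outputs = [relu(x) for x in (-2.0, -0.5, 0.0, 0.5, 2.0)]
print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
```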
[07:07 - 07:11] Right. And like we said before, these are lines of lines.
[07:12 - 07:25] So anything that's not a line like this, as simple as it is as a non-linearity, is sufficient. So in between layers in an MLP, we'll add a non-linearity like this ReLU.
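A tiny illustration of why this is enough: the sketch below hand-picks the weights of a two-layer network (they are assumptions chosen for the example, not learned values) so that, with a ReLU between the layers, it computes the absolute value of x. That V shape is something no single line can produce. Removing the ReLU from the same network collapses it back to a line.

```python
def relu(x: float) -> float:
    """ReLU: the identity for positive x, zero otherwise."""
    return max(x, 0.0)

def tiny_mlp(x: float) -> float:
    """Two layers with a ReLU in between (hand-picked weights)."""
    h1 = relu(1.0 * x)    # hidden unit 1: weight +1, bias 0
    h2 = relu(-1.0 * x)   # hidden unit 2: weight -1, bias 0
    return 1.0 * h1 + 1.0 * h2  # output layer: both weights 1, bias 0

def tiny_linear(x: float) -> float:
    """The same weights with the ReLU removed."""
    h1 = 1.0 * x
    h2 = -1.0 * x
    return 1.0 * h1 + 1.0 * h2  # simplifies to 0 * x, a flat line

for x in (-3.0, -1.0, 0.0, 2.0):
    print(x, tiny_mlp(x), tiny_linear(x))
```

With the ReLU, the outputs trace out |x|. Without it, the two hidden units cancel and the output is the constant line y = 0.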
[07:26 - 07:31] Now, finally, let's go to LLMs. In LLMs, the non-linearity is typically not a ReLU.
[07:32 - 07:41] Typically, it's something called SwiGLU. And I tried to link this before and at least on my trackpad, it's a little hard to draw, but I'll do my absolute best.
[07:42 - 07:51] Unless there's no draw tool, then I guess I'm out of luck. OK, whatever, I'm just going to have to explain it.
[07:52 - 08:01] A SwiGLU is almost like a ReLU, except at this corner, it rounds smoothly. So you can imagine there's a rounded edge right here and that's a SwiGLU.
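Since the drawing did not work out, here is a rough numerical sketch of the rounded corner. The smooth curve described here corresponds to SiLU (also called Swish), which is x times sigmoid(x). SwiGLU uses that curve as a gate that multiplies a second linear projection. The scalar `swiglu` function and its parameter names `w_gate` and `w_up` are simplifications assumed for illustration. In a real model these are weight matrices.

```python
import math

def relu(x: float) -> float:
    """ReLU: sharp corner at x = 0."""
    return max(x, 0.0)

def silu(x: float) -> float:
    """SiLU / Swish: x * sigmoid(x), a smooth curve close to ReLU."""
    return x / (1.0 + math.exp(-x))

def swiglu(x: float, w_gate: float, w_up: float) -> float:
    """Scalar sketch of SwiGLU: a SiLU-activated gate times a linear path.
    w_gate and w_up are hypothetical scalar stand-ins for weight matrices."""
    return silu(w_gate * x) * (w_up * x)

# Far from zero the two curves nearly agree; near zero SiLU is rounded.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, relu(x), round(silu(x), 4))
```

One difference from the description above is that SiLU dips slightly below zero for negative inputs before flattening out, whereas ReLU is exactly zero there.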
[08:02 - 08:12] And I can explain why that is and why we do this, but in short, it is rounded. If you want to learn more, let me know and then I'll explain it. But right now, the most important part is non-linearities are important.
[08:13 - 08:26] Right, they're important for us to add expressivity so that we can express more interesting functions like this circle instead of just more lines. Non-linearities help us to achieve that. OK, great.
[08:27 - 08:29] So now let's go back to our content.