Add the remaining components (Skip connections, norms, positional encodings)
We will ignore multi-head attention as it introduces unnecessary complexity into the code. We can mention this in passing.
We will ignore multi-head attention as it introduces unnecessary complexity into the code. We can mention this in passing.