Add the remaining components (skip connections, normalization layers, positional encodings)

We will leave out multi-head attention, since it adds unnecessary complexity to the code. A brief mention in passing is enough.
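
As a rough sketch of how these components might fit together around a single attention head, the block below uses only the Python standard library. Everything in it is an assumption made for illustration rather than the final code: the function names, the pre-norm ordering (normalize before each sub-layer), the sinusoidal form of the positional encodings, the ReLU feed-forward layer, and a layer norm without a learned scale or shift.

```python
import math
import random


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings (assumed form): sin on even dims, cos on odd dims."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned parameters)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum; this is all a skip connection needs."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def feed_forward(x, w1, w2):
    """Two linear maps with a ReLU in between (biases omitted for brevity)."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def block(x, params):
    """Pre-norm block (an assumption): x + sublayer(layer_norm(x)), applied twice."""
    x = add(x, attention(layer_norm(x), params["w_q"], params["w_k"], params["w_v"]))
    x = add(x, feed_forward(layer_norm(x), params["w1"], params["w2"]))
    return x


if __name__ == "__main__":
    random.seed(0)
    seq_len, d_model = 3, 4
    # Hypothetical random weights, all d_model x d_model to keep the sketch small.
    params = {
        name: [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_model)]
        for name in ("w_q", "w_k", "w_v", "w1", "w2")
    }
    tokens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]
    # Positional encodings are added once, before the first block.
    out = block(add(tokens, positional_encoding(seq_len, d_model)), params)
    print(len(out), len(out[0]))
```

The skip connections are just the two `add` calls in `block`, so if both sub-layers output zeros the input passes through unchanged. A multi-head variant would split the projections into several smaller heads and concatenate their outputs, which is the extra bookkeeping we are choosing to avoid here.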