Great video :)! But I have one remark: at 8:45 you say that the 8 attention vectors get averaged, but in the original paper "Attention Is All You Need" (page 4) the outputs of the different attention heads are concatenated and then projected, rather than averaged, which I think also makes more sense. But maybe this is just a misunderstanding on my side.
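To illustrate what the paper describes (MultiHead = Concat(head_1, ..., head_h) W^O), here is a minimal NumPy sketch; the shapes are my own illustrative choices, not from the video:

```python
import numpy as np

# Per the paper: h = 8 heads, each producing d_v = 64 dims,
# concatenated to h * d_v = 512 and projected back to d_model = 512.
h, d_v, seq_len, d_model = 8, 64, 10, 512

heads = [np.random.randn(seq_len, d_v) for _ in range(h)]  # per-head outputs
W_O = np.random.randn(h * d_v, d_model)                    # output projection

concat = np.concatenate(heads, axis=-1)  # (seq_len, h * d_v) -- concatenated, not averaged
output = concat @ W_O                    # (seq_len, d_model)
```

So each head's information is kept separate until the final projection, instead of being blended by an average.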