1. High-dimensional embedding space is far more vast than you'd think, so adding two vectors together doesn't destroy information the way addition does in low-dimensional Cartesian space: random directions in high dimensions are nearly orthogonal, so the semantic and positional information remain separable.
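A minimal sketch of that near-orthogonality point, using NumPy and a made-up embedding width: two random high-dimensional vectors have almost zero overlap, so each component of their sum can still be recovered along its own direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096                               # embedding width (illustrative)

semantic = rng.standard_normal(d)      # stand-in for a token embedding
position = rng.standard_normal(d)      # stand-in for a position embedding
combined = semantic + position         # what the model actually sees

# Random high-d vectors are nearly orthogonal: cosine similarity ~ 0.
cos = semantic @ position / (np.linalg.norm(semantic) * np.linalg.norm(position))
print(f"cosine(semantic, position) ~ {cos:.3f}")

# Projecting the sum back onto each direction recovers each component
# almost exactly, i.e. the two pieces of information coexist in the sum.
print(combined @ semantic / (semantic @ semantic))   # ~ 1: semantic survives
print(combined @ position / (position @ position))   # ~ 1: position survives
```

In 2D or 3D the same experiment gives large overlaps, which is why the intuition from low-dimensional space is misleading here.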
2. I find the QKV nomenclature unintuitive too. Cross-attention makes it clearer: the queries come from one sequence (e.g. the decoder) while the keys and values come from another (e.g. the encoder). In self-attention all three are projected from the same sequence, but the terminology stuck.
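A minimal sketch of that distinction, assuming PyTorch, a single head, made-up shapes, and no masking: the same attention function does cross-attention when the query input and the key/value input differ, and self-attention when they are the same tensor.

```python
import torch
import torch.nn.functional as F

d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

def attention(x_q, x_kv):
    Q = x_q @ W_q                 # queries from the "asking" sequence
    K = x_kv @ W_k                # keys from the sequence being attended to
    V = x_kv @ W_v                # values from that same attended-to sequence
    scores = Q @ K.T / d ** 0.5   # scaled dot-product attention
    return F.softmax(scores, dim=-1) @ V

decoder_states = torch.randn(5, d)   # e.g. target-side tokens
encoder_states = torch.randn(9, d)   # e.g. source-side tokens

cross = attention(decoder_states, encoder_states)   # Q and K/V from different places
self_ = attention(decoder_states, decoder_states)   # self-attention: same input for all three
print(cross.shape, self_.shape)                     # both (5, 64)
```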