Isn't the final token at some position N?
And given a context size limit Y, when we generate the next token, does the model currently attend only to positions N − Y through N?
Or does attention span the full range from 0 to N, with the attention weight decaying exponentially as we approach token 0?
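To make the two patterns in these questions concrete, here is a minimal sketch of the attention masks they describe: plain causal attention, where position N may attend to every position 0 through N (there is no built-in exponential decay; any falloff in attention weights is learned, not architectural), versus sliding-window attention, where position N attends only to the last Y positions. The `causal_mask` helper is hypothetical, for illustration only, and not taken from any specific model.

```python
import numpy as np

def causal_mask(n, window=None):
    """Boolean mask: mask[i, j] is True iff query position i may attend to key j.

    Plain causal attention lets position i see positions 0..i; a sliding
    window of size Y restricts that to max(0, i - Y + 1)..i.
    """
    i = np.arange(n)[:, None]  # query positions, as a column
    j = np.arange(n)[None, :]  # key positions, as a row
    mask = j <= i              # causal: never attend to future tokens
    if window is not None:
        mask &= j > i - window  # sliding window: only the last `window` positions
    return mask

# With N = 5 and window Y = 3, the final token attends to positions 3, 4, 5:
m = causal_mask(6, window=3)
print(np.nonzero(m[5])[0])  # -> [3 4 5]

# Without a window, the final token attends to every position 0..5:
full = causal_mask(6)
print(np.nonzero(full[5])[0])  # -> [0 1 2 3 4 5]
```

So under a hard window limit Y the earliest tokens are simply invisible to the model, rather than attended to with exponentially small weight.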