What happened is that "transformers go whrrrrrr." (yes, that's the academic term)

In the end, LLMs trained with causal language modeling (CLM) or masked language modeling (MLM) learn to best solve their objectives by building an efficient global model of language patterns. CLM is actually the harder problem, since MLM can leak information through the surrounding context on both sides of a masked token, and with the transformer scaling law research that came after BERT/GPT it's no surprise CLM won out in the long run.
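To make the CLM vs. MLM difference concrete, here's a rough toy sketch of what each objective gets to see when predicting a token. The function names (`clm_examples`, `mlm_example`) and the `[MASK]` placeholder are made up for illustration; real training pipelines work on token IDs and attention masks, not word lists.

```python
def clm_examples(tokens):
    """Causal LM: each target is predicted from the tokens to its left only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mlm_example(tokens, masked_positions, mask_token="[MASK]"):
    """Masked LM: masked targets are predicted from context on both sides."""
    masked = set(masked_positions)
    inputs = [mask_token if i in masked else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked)}
    return inputs, targets


tokens = ["the", "cat", "sat", "on", "the", "mat"]

# CLM: one prediction per position, each with left context only.
for context, target in clm_examples(tokens):
    print(context, "->", target)

# MLM: the model sees everything except the masked slot, including
# the words that come after it.
print(mlm_example(tokens, [2]))
```

When predicting "sat", the CLM setup only has "the cat" to go on, while the MLM setup also sees "on the mat" after the gap. That extra right-hand context is the "leak" that makes MLM the easier objective.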