It makes sense if you think of the LLM as building a data-aware model that compresses the noisy data through parsimony (the principle that the simplest explanation that fits the data is the best one). Typical text compression algorithms are neither data-aware nor robust to noise.
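As a toy sketch of this idea (my own illustration, not from the original): suppose the data are noisy samples of a simple linear rule. A data-agnostic compressor must deflate the raw bytes, noise and all, while a data-aware encoding stores a parsimonious model plus small residuals. For simplicity the model parameters are assumed known here; discovering them is, of course, the hard part that the LLM analogy points at.

```python
import random
import struct
import zlib

random.seed(0)
xs = list(range(200))
# Noisy samples of a simple underlying rule: y = 3x + 7 plus small noise.
ys = [3 * x + 7 + random.randint(-2, 2) for x in xs]

# Data-agnostic compression: pack the raw values and deflate them.
raw = b"".join(struct.pack("<i", y) for y in ys)
raw_size = len(zlib.compress(raw, 9))

# Data-aware compression: store the model (two small ints) plus the
# residuals, which are tiny and therefore highly compressible.
residuals = bytes((y - (3 * x + 7)) % 256 for x, y in zip(xs, ys))
model_size = 2 * 4  # slope and intercept as 4-byte ints
aware_size = model_size + len(zlib.compress(residuals, 9))

print(raw_size, aware_size)  # the model-based encoding is much smaller
```

The point is not the exact byte counts but the shape of the trade: once you have the parsimonious model, what remains to transmit is mostly noise.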
In lossy compression, the compression itself is the goal. In prediction, compression is the road that leads to parsimonious models.