Yes it's just doing compression. No it's not the diss you think it is.
In Ted Chiang’s New Yorker article, he likened language models to “a blurry JPEG”. JPEG is a lossy compression method for images, and some people absolutely hated the comparison. I’m going to attempt to convince you that the objective of maximising log-likelihood is optimising for compression. Then I’ll cover something perhaps a little more controversial: compression and understanding aren’t antithetical concepts.
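To give a taste of the link between likelihood and compression before digging in: by the source coding theorem, an ideal entropy coder (arithmetic coding, say) can encode a sequence in essentially −log₂ P(sequence) bits under a given model. Here's a minimal sketch with a toy character model (the probabilities are made up purely for illustration):

```python
import math

# Toy character model over a 3-symbol alphabet.
# These probabilities are hypothetical, chosen only for illustration.
model = {"a": 0.5, "b": 0.25, "c": 0.25}

text = "aabac"

# Log-likelihood of the text under the model, in bits (hence log base 2).
log_likelihood_bits = sum(math.log2(model[ch]) for ch in text)

# An ideal entropy coder can encode the text in about -log2 P(text) bits,
# so maximising log-likelihood is exactly minimising the compressed length.
ideal_code_length_bits = -log_likelihood_bits

print(ideal_code_length_bits)  # three 'a's at 1 bit + 'b' and 'c' at 2 bits = 7.0
```

The higher the probability the model assigns to the text, the fewer bits you need to encode it, which is the equivalence the rest of this post builds on.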