A (mini-) Compression View of Large Language Models (followup)

"Why are LLMs so good at summarization tasks?"

The deep dive on using compression ideas as a bridge is a little longer than I wanted it to be. So, in keeping with the theme, I asked GPT-4o for an editorial one-page version. If you didn’t read the whole thing, here’s its TL;DR response, and then we’ll feature creep a little to discuss why LLMs are so good at this kind of summarization task:

Editorial: A Fresh Perspective on Large Language Models

Large Language Models (LLMs) like GPT-2 and GPT-3 are often dismissed as mere "statistical parrots" or "auto-complete on steroids." However, this view overlooks the sophisticated predictive capabilities that make these models so powerful. By examining LLMs through the lens of data compression, we can gain a deeper understanding of how they generate coherent and contextually relevant text.

The Compression Connection

Data compression is all about reducing the size of data by identifying and leveraging patterns. A good compression tool predicts the next piece of data based on the previous data, much like how weather forecasts predict the path of a hurricane. This prediction isn't a random guess; it's a calculated estimate based on known patterns.

In the context of LLMs, the model predicts the next word in a sequence based on the context provided by the preceding words. This is akin to lifting structural information out of data to create a more compact representation. For example, a date-time string can be compressed by removing redundant structural information, reducing its size significantly.

Predictive Power

The real magic of LLMs lies in their predictive modeling. These models are trained on vast amounts of text, enabling them to make highly accurate predictions about what comes next in a sentence. This predictive capability can be harnessed for data compression. By using an LLM to predict the next word, we can encode the prediction and only store the deviations from this prediction, thereby compressing the data.

Practical Application

To test this theory, a custom compression tool was built using GPT-2. The results were impressive: the tool compressed text to about 25% of its original size, outperforming traditional methods like gzip, which achieved around 38.6%. This demonstrates that LLMs can serve as highly effective predictive models for data compression.

Generating Text

Interestingly, the same principles can be applied in reverse to generate text. By feeding a prompt into the LLM and allowing it to predict the next words, we can create coherent and contextually relevant text. This process is not random; it involves a controlled addition of entropy, which allows the model to produce text that feels creatively inspired.

Conclusion

Viewing LLMs through the lens of data compression offers a new perspective on their capabilities. These models are not just regurgitating data; they are making sophisticated predictions based on learned patterns. This controlled addition of entropy is what enables LLMs to generate text that is both coherent and contextually appropriate.

In summary, LLMs are far more than "statistical parrots." They are powerful predictive models that can be used for both data compression and text generation, offering a nuanced understanding of how these models operate.

This is pretty spectacular, and no matter how often I get results like this, I’m still amazed.

To get this summary, I used a system prompt of "I need a one-page editorial version of the following article. It should contain the salient details while avoiding jargon and detailed explanations." I also set a "zero temperature" so it would use the most probable word at each step. This is a poorly named setting, which we'll come back to in a second.
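For the curious, here is a minimal sketch of roughly what that request looks like using the standard OpenAI Python client. The prompt text matches the description above; the filename and other details are placeholders, not the exact call I made.

```python
# Rough sketch of the summarization request (openai >= 1.0 client).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

article_text = open("compression_article.txt").read()  # hypothetical filename

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # "zero temperature": always take the most probable token
    messages=[
        {"role": "system", "content": (
            "I need a one-page editorial version of the following article. "
            "It should contain the salient details while avoiding jargon "
            "and detailed explanations."
        )},
        {"role": "user", "content": article_text},
    ],
)

print(response.choices[0].message.content)
```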

Using the language and analogies from the article, this request asks the LLM to deterministically predict the words needed to reproduce as much of the article's information as possible in a much shorter space. That is, it extends the prompt with text containing as much of the article's entropy as possible, without adding anything new.

The temperature setting affects only how much randomness is used when selecting which of the predicted next words to output; it says nothing about the information in the prompt or the model. Because the selected word is fed back in at the next step, that randomness is the only source of new, unpredictable information (i.e. entropy). So with a zero setting, the entropy of the text cannot increase; the model can only rehash what is already in the prompt.
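To make that concrete, here is a small sketch of what the temperature knob does to the model's raw output scores. The function and the example numbers are made up for illustration; real models do this over tens of thousands of tokens.

```python
import numpy as np

def sample_next_token(logits, temperature):
    """Pick the next token from the model's raw scores (logits)."""
    if temperature == 0:
        # "Zero temperature": no randomness at all, just the most probable token.
        return int(np.argmax(logits))
    # Scale the logits, then convert them to probabilities with a softmax.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # Higher temperature -> flatter distribution -> more new entropy per token.
    return int(np.random.choice(len(logits), p=probs))

logits = np.array([4.0, 3.5, 1.0, -2.0])   # hypothetical scores for 4 tokens
print(sample_next_token(logits, 0))        # always token 0
print(sample_next_token(logits, 1.2))      # often token 0 or 1, occasionally others
```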

Next, we have to look inside the LLM at what it's doing to predict the next word.

However, we have to be careful applying these compression analogies to the inside of an LLM because, unlike the compression or generation tool wrapped around the outside of the model, the internal computations are all mushed together and quite opaque. We have a good general idea of what the layers and components of the architecture do, but even with detailed analysis of a specific model and its internal data, stating anything categorical is risky. Often we simply do not know what the model has learned or how it reaches the conclusions it does. This is the Interpretability Problem, and much has been written about it elsewhere.

With that said, we can still recognize some patterns, and this one is strong enough that I feel okay stating it here:

The reason an LLM is good at summarization tasks is that the entire point of the transformer architecture, and specifically its attention layers (again, there are many good descriptions of this architecture out there), is to calculate which parts of the input are related and important, and so should be paid attention to. In other words, which parts of the prompt add the most information and are therefore least predictable from the surrounding context. Put yet another way, it identifies the "entropy hot spots."
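As a rough sketch of that machinery, here is the core of a single attention step, stripped of the learned projections, multiple heads, and everything else in a real transformer. The shapes and inputs are made up for illustration.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position decides how much to
    'pay attention to' every other position, then mixes their values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # how related is each pair of positions?
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: attention weights per position
    return weights @ V                                # weighted mix of the attended values

# Hypothetical 3-token input with a 4-dimensional vector per token.
x = np.random.randn(3, 4)
print(attention(x, x, x).shape)                       # (3, 4): one mixed vector per position
```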

Then, to predict the next word of the output, it copies the information from a hot spot and slots it into grammatically correct words (which, again, is pretty amazing). Even though it's predicting one word at a time, it can see the whole article, plus everything it has already output, allowing it to work through the hot spots in priority order while ignoring the low-entropy cold spots.

Note: Here "information" refers to the specific internal calculations, while "entropy" measures the impact of these calculations on the model's overall predictions. In this case they are not interchangeable.
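A toy version of that one-word-at-a-time loop, using the small GPT-2 model from the Hugging Face transformers library; the prompt and output length are placeholders, and a real summarizer would be far more careful than this sketch.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical prompt; in practice this would be the article plus instructions.
ids = tokenizer.encode("Summary of the article:", return_tensors="pt")

for _ in range(40):                                   # predict one token at a time
    with torch.no_grad():
        logits = model(ids).logits                    # scores for every position, all context visible
    next_id = logits[0, -1].argmax()                  # zero temperature: take the top token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1) # feed it back in and repeat

print(tokenizer.decode(ids[0]))
```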

This process mirrors lossy compression techniques like MP3 or JPEG, where the encoder keeps the parts of the data that matter most (e.g., discernible audio or low-frequency image features) and discards the less relevant details (entropy).

So, again, using the compression or information theory idea gives us a view of WHAT this LLM is doing and why it can do it so well, while unfortunately leaving the HOW it’s doing it internally unanswered.