Storing the web in a form that works with our language-model-driven future is probably something we should be considering. Search engines rely on sitemaps, then crawl and crunch that data down after collection. We could instead preprocess the content ourselves and publish ready-made files alongside it, rather than trusting somebody else's processing and collection pipeline. I'm not sure people will actually end up packaging content for later distribution like that. It would be a shift from delivering hypertext to the online world to an entirely different method of sharing. We could just pre-build our own knowledge graph node and be ready to share it with the world as the internet, as it was constructed, functionally vanishes. Agentic interactions are on the rise, and the number of people visiting and reading online content will diminish. Our interface will be with, or through, an agent, and it will be a totally different online experience.
I actually spent a bunch of time yesterday working on distilling language models, starting with how to work with GPT-2. A few of those notes are shared out on GitHub. They are fully functional, and you can run them on Google Colab or anywhere else you can work with a Python-based notebook. I'm really interested in model distillation right now. A lot of libraries and frameworks for GPT distillation already exist and have for some time. You could grab Hugging Face's transformers (DistilBERT, DistilGPT2), torchdistill, Google's T5 distillation techniques, or DeepSpeed and FasterTransformer for efficient inference. You could do some testing and see what the results and benefits of GPT-2 distillation actually look like. A smaller model means fewer parameters and better efficiency. Faster inference means you could run on lower-end GPUs, or even a CPU if you absolutely had to go that route. Distillation also helps preserve performance, retaining most of the teacher model's capabilities.
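If you just want a feel for the size difference before building anything yourself, the pre-distilled checkpoints on the Hugging Face Hub make for a quick comparison. Here is a minimal sketch using the public "gpt2" and "distilgpt2" checkpoints; the prompt and generation settings are just illustrative:

```python
# Quick comparison of GPT-2 and its distilled counterpart via Hugging Face
# transformers. "gpt2" and "distilgpt2" are the public Hub checkpoints;
# the prompt and generation settings below are only illustrative.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

teacher = GPT2LMHeadModel.from_pretrained("gpt2")        # ~124M parameters
student = GPT2LMHeadModel.from_pretrained("distilgpt2")  # ~82M parameters

def count_params(model):
    return sum(p.numel() for p in model.parameters())

print(f"gpt2 parameters:       {count_params(teacher):,}")
print(f"distilgpt2 parameters: {count_params(student):,}")

# Both models share the GPT-2 tokenizer, so prompts are interchangeable.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
inputs = tokenizer("Model distillation is", return_tensors="pt")
output = student.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The parameter gap is roughly a third of the model, which is where the faster inference on modest hardware comes from.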
Breakdown of the potential code steps (a rough sketch follows the list):
- Loads GPT-2 as a Teacher Model: The full-size GPT-2 model is used for generating soft targets.
- Defines a Smaller GPT-2 as Student Model: Reduced layers and attention heads for efficiency.
- Applies Knowledge Distillation Loss: Uses KL divergence between student and teacher logits. Adds cross-entropy loss so the student also learns from the ground-truth tokens.
- Trains the Student Model: Uses AdamW optimizer and trains for a few epochs.
- Saves the Distilled Model: The final distilled model is saved for future use.
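Putting those steps together, a minimal sketch of that distillation loop with PyTorch and Hugging Face transformers could look like the following. The student dimensions, temperature, loss weighting, tiny in-memory batch, and epoch count are placeholders chosen for illustration, not values from my notes:

```python
# A minimal sketch of the distillation steps listed above. The toy corpus,
# student sizes, temperature, alpha, and epoch count are illustrative
# placeholders; swap in a real dataset and tuned hyperparameters for real use.
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load GPT-2 as the teacher model (frozen; only used for soft targets).
teacher = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

# 2. Define a smaller GPT-2 as the student (fewer layers and attention heads).
student_config = GPT2Config(n_layer=4, n_head=8, n_embd=512)  # assumed sizes
student = GPT2LMHeadModel(student_config).to(device).train()

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Placeholder training text; replace with a real corpus for actual distillation.
texts = ["The quick brown fox jumps over the lazy dog."] * 8
batch = tokenizer(texts, return_tensors="pt", padding=True,
                  truncation=True, max_length=64).to(device)

optimizer = AdamW(student.parameters(), lr=5e-5)
temperature, alpha = 2.0, 0.5  # softening and loss-mixing hyperparameters

# 3-4. Knowledge-distillation loss (KL divergence on softened logits plus
# cross-entropy against the ground-truth tokens), trained for a few epochs.
for epoch in range(3):
    optimizer.zero_grad()

    with torch.no_grad():
        teacher_logits = teacher(**batch).logits

    student_out = student(**batch, labels=batch["input_ids"])
    student_logits = student_out.logits

    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    ce_loss = student_out.loss  # next-token cross-entropy on ground truth

    loss = alpha * kd_loss + (1 - alpha) * ce_loss
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: kd={kd_loss.item():.4f} ce={ce_loss.item():.4f}")

# 5. Save the distilled student (and tokenizer) for later use.
student.save_pretrained("distilled-gpt2-student")
tokenizer.save_pretrained("distilled-gpt2-student")
```

The temperature softens both distributions so the student picks up the teacher's relative preferences across tokens, and alpha balances imitating the teacher against fitting the ground-truth text.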
I’m absolutely getting the most out of the free ChatGPT interface these days. I keep hitting the limit and having to wait until the next day to get more analysis time. That’s probably a good argument for paying for a subscription, but I’m not going to do that. It makes it a little more fun to try to squeeze as much out of the free tier as humanly possible.