Considering the future of research
Sunday weblog notes from February 15, 2026 that were compiled and shared.
This missive started out as a Lindahl Letter and devolved into a weblog post. Over the last few days, I have watched all the Olympic hockey games, and it has been great. During that time, this missive has been bubbling up to the forefront of my thoughts. It’s been percolating and building, but I’m not done with the argument. I’m still in that thinking phase, and to that end, you can see a few paragraphs formed out of that consideration.
I really have spent a lot of time thinking about how all the shared AI models published online could be combined, distilled, or just reused. So much effort went into creating all these models that have been shared on Hugging Face, but they are ultimately disposable in their current form. At some point, the big project of training and growing larger models may very well refocus on gathering the very best of all the models into a grand model. That sounds fantastical, but I think it is actually something that will be possible. People have written about continuous distillation or data flywheels that may start down that path.
Google researchers have even published reports about increases in distillation attacks, which is really just a change in methodology [1]. Based on what they shared about adversarial use of distillation, people are trying to figure out how to take a well-trained model and extract a continuous improvement advantage outside of just training new models from large corpus efforts. Some of the largest corpus builds have actually come from book scanning projects. Google Books involved a huge scanning project, and Anthropic researchers followed a similar effort [2]. The team at Anthropic literally scanned millions of physical books to create a corpus of well-structured, gated, and known works. From those massive corpora, models get trained in the same way you can download an export from Wikipedia or The Pile to build a model.
Many people have built and trained models based on large corpus efforts. Based on the current state of the public internet, I’m not sure the quality remains the same as it was before. Something has changed, and the quality just is not there anymore. You could say that slop has destroyed the continuity of the written word. That is true in academics and in general online writing. I read a lot of academic papers, and you can tell that much of the current output is just not as incisive anymore. My argument would be that to stand on the shoulders of giants and advance the academy, papers have to make a substantial contribution to the field to stand out and become well-referenced and accepted within academia. Right now, the academic intake process is getting flooded with papers that don’t make a substantial contribution and are lesser academic works. That does not make the inquiry process any less important or the contributions of research any less valuable. It just means that finding the signal in the noise is infinitely harder, and that may be the permanent reality of the academy going forward.
[2] https://www.theverge.com/podcast/872998/anthropic-claude-books-netflix-theaters-vergecast

