Throughout the last five days I have been sprinting forward trying to really understand how to use the GPT-2 model. It took me a lot longer to dig into this one than my normal mean time to solve. Documentation on these things is challenging because of two distinct factors that increase complexity. First, the instructions are typically not purely step by step for the reader. You have to have some understanding of what you are doing to be able to work the instructions to conclusion. Second, the instructions were written at a specific point in time, and the dependencies, versions, and deprecations that have accumulated since then are daunting to overcome. At the heart of the joy that Jupyter Notebooks create is the ability to do something rapidly and share it. Environmental dependencies change over time, and once-working notebooks slowly drift from being useful to being a time capsule of perpetually failing code. That in some ways is the ephemeral nature of the open source coding world that is currently expanding. Things work in the moment, but you have to ruthlessly maintain and upgrade to stay current on the open source wave of change.
My argument above is not an indictment of open source or of code that depends on specific versions and libraries. Things just get real over time as a code base that was once bulletproof proves to be dependent on something that has since been deprecated. Keep in mind that my journey to use the GPT-2 model included working with a repository that was published on GitHub just 15 months ago with very limited documentation. The file with developer instructions did not include a comprehensive environment list. I was told that this is why people build Docker containers that can be a snapshot in time, deployed again and again to essentially freeze time. That is not how I work in real time or code when I’m actively developing. My general use case is to sit down and work with the latest version of everything. That might not be a good idea, as code is generally not assumed to be future proof. An environmental dependency file would serve as a signpost for future developers to know exactly where things stood when the code base was shared to GitHub.
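As a small illustration of what that signpost could look like, here is a minimal sketch of freezing an environment from inside a Colab or Jupyter notebook. The file name requirements.txt is just a convention, and the commands assume a pip-based setup rather than whatever the original repository actually used.

# Record the exact package versions the notebook was last run against.
!pip freeze > requirements.txt

# A future reader can then rebuild roughly the same environment with:
!pip install -r requirements.txt

Even that one generated file tells a future developer which versions of TensorFlow, NumPy, and everything else the notebook was actually tested against.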
Really digging into the adventure of working through the code base for the last five days has been fun and full of learning. Digging into something for me involves opening the code up in Microsoft Visual Studio Code and trying to understand each block that was shared. The way I learned to tinker with and edit Python code was one programming debugging session at a time. I’ll admit that learning was a lot easier in a Jupyter Notebook environment. That setup allows you to run each section one after another and see any errors that are spit out so you can work to debug the code and get to that perfect possible future of working code. Oh that resplendent moment of working code where you move on to the next problem. It is a wonderful feeling of accomplishment to see code work. It is a supremely frustrating feeling to watch errors flood the screen or, even worse, to get nothing in return beyond obvious failure. Troubleshooting general failure is a lot harder than working to resolve a specific error. Right now between the two sessions of my Google Chrome browser I have maybe 70 tabs open. On reboot it is so bad that I end up having to go to browser settings, history, and recently closed to bulk reopen this massive string of browser tabs that at one point were holding my attention.
One of the best features I learned about in GitHub was the ability to search for recently updated repositories. To accomplish that I searched for what I was looking for and then sorted the results by last update. Given the problems described above, that type of searching was highly useful for learning the right environmental setup necessary to do the other things I wanted in a Google Colab notebook. On a side note, when somebody publishes to GitHub using a notebook from Google Colab, enough bread crumbs exist to find interesting use cases by searching for “colab” plus whatever you are looking for from the main page of GitHub. Out of pure frustration about learning how to set up the environment to get going, I used searches filtered to most recently updated for “colab machine learning” and “colab gpt” to get going. Out of that frustration I learned something useful about just looking around to see what people are actively working on and what they are actively sharing on GitHub. My searching involved looking at a lot of code repositories that did not have any stars, reviews, or interactions. As my GPT skills improve I’ll make suggestions for some of those repositories on how to get their code bases working again, now that a lot of them are throwing massive numbers of errors that essentially conclude in, “ModuleNotFoundError: No module named ‘tensorflow.contrib’.” That error is truly deflating when it appears. Given how important that module is to a lot of models and code, and given that it was intentionally deprecated, I probably would have built handling for it into base TensorFlow.
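For what it is worth, the workaround I kept running into for that specific error is simply forcing the notebook back onto the 1.x line of TensorFlow, since tensorflow.contrib was removed in TensorFlow 2.0. The sketch below assumes a Google Colab notebook, and the exact 1.x version number is illustrative rather than a requirement of any particular repository.

# tensorflow.contrib only exists in TensorFlow 1.x, so pin the runtime back.
# In Google Colab this magic command switches the preinstalled version:
%tensorflow_version 1.x

# Outside of Colab, installing a 1.x release accomplishes the same thing:
!pip install tensorflow==1.15

import tensorflow as tf
print(tf.__version__)  # should report a 1.x version, e.g. 1.15.x

It is a stopgap rather than a fix, but it keeps those older GPT-2 repositories running long enough to learn from them.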
My next big adventure will be to take the environmental setup necessary to get the GPT-2 model working, figure out the best method to ingest my corpus of 20 years’ worth of my writing, and see what it spits out as the next post. That has been my main focus in learning how to use this model and potentially even learning how to use the GPT-3 model that OpenAI released earlier this week. Part of the fun of doing this is not messing with it locally on my computer and creating a research project that cannot be reproduced. Within what I’m trying to do, the fun will be releasing the Jupyter notebook and the corpus file to allow other researchers to build more complex models based on my large writing database, or to verify the results by reproducing the steps in the notebook. That is the really key part of the whole thing. Giving somebody the tools to freely reproduce the research on Google Colab without any real limitations is a positive step forward in research quality. Observing a phenomenon and being able to describe it is great. Being able to reproduce the phenomenon being described is how the scientific method can be applied to the effort.
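To make that plan a little more concrete, the sketch below shows roughly what the fine-tuning step could look like using the gpt-2-simple library, which wraps GPT-2 for exactly this kind of Colab experiment. The file name corpus.txt, the model size, the step count, and the generation settings are all assumptions for illustration, not the settings I have settled on.

!pip install gpt-2-simple

import gpt_2_simple as gpt2

# Download the small 124M-parameter GPT-2 model checkpoint.
gpt2.download_gpt2(model_name="124M")

# corpus.txt is a placeholder for the plain text file holding 20 years of posts.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, dataset="corpus.txt", model_name="124M", steps=1000)

# Ask the fine-tuned model to write the "next post."
gpt2.generate(sess, length=500, temperature=0.7)

Publishing a notebook like that alongside the corpus file is what would let another researcher rerun every step on Google Colab and check the output for themselves.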