Last night was full of really deep and resonating thunder. It is still raining this morning as I sip my two shots of Nespresso. Sitting here writing and thinking about how a real thunderstorm impacts my sleep is how my day is starting. Yesterday I was able to publish another Jupyter notebook to GitHub. This one trained the OpenAI GPT-2 model on that cleaned up Graduation with Civic Honors (GCH) corpus file. Strangely enough, the gpt-2-simple installation did not need the start and stop text encoding tokens. They actually made the output worse. Given that, they had to be removed from the text file. One of the other things I learned is that some of the formatting needs to be removed to help the model be more successful. The indentation at the start of the paragraphs and any block text indentation seemed to really confuse the model. This was an important set of learnings for me.
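The cleanup described above could be scripted rather than done by hand. What follows is only a minimal sketch, assuming the corpus is a plain UTF-8 text file already loaded into a string; the function name clean_corpus and the sample text are hypothetical and are not taken from the published notebook.

```python
import re

# The start and stop tokens mentioned in the post.
TOKENS = ("<|startoftext|>", "<|endoftext|>")


def clean_corpus(text: str) -> str:
    """Remove start/stop tokens and leading indentation from every line."""
    for token in TOKENS:
        text = text.replace(token, "")
    # Strip spaces and tabs at the start of each line (paragraph and block indents).
    text = re.sub(r"(?m)^[ \t]+", "", text)
    # Collapse runs of three or more newlines left behind by removed token lines.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


# Hypothetical sample standing in for a slice of the GCH corpus file.
sample = "<|startoftext|>\n    Chapter 1: Civic Honors\n\tAn indented paragraph.\n<|endoftext|>\n"
print(clean_corpus(sample))
```

The same function could be pointed at the full corpus by reading the file, cleaning it, and writing the result back out before training.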
Making sure the training file is really clean and well formatted is an important part of getting solid output. The GCH text file contains a lot of text that is very focused on a single topic. The phrasing and the content are very similar. The OpenAI GPT-2 model should be able to replicate the style and reasonably replicate the content with the 355 million parameter model after a couple runs of training. Previously during my weekend efforts I had given the model a training dataset of my last 20 years of writing from this weblog. That corpus of weblog posts spanned a very long time and was in a variety of styles and formats. On the brighter side of things that corpus does not include any indentation for the paragraphs or very many block text examples. Generally, the larger corpus had a larger degree of loss in the model, and it produced a mixed bag of reasonable and very unreasonable blocks of prose. Some of them did very much sound like something I might have written. Admittedly, I had to check on some of it to see if it was spitting out reused text or if it was making novel prose.
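One rough way to check whether generated output is reused text is to count how many of its word sequences appear verbatim in the training corpus. This is a sketch under stated assumptions, not the check actually used for the post: the function names are hypothetical, the comparison is lowercase and whitespace-based, and the window of five words is an arbitrary choice.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def reused_fraction(generated: str, corpus: str, n: int = 5) -> float:
    """Fraction of the generated text's n-grams that appear verbatim in the corpus."""
    gen = word_ngrams(generated, n)
    if not gen:
        return 0.0  # text shorter than n words has nothing to compare
    return len(gen & word_ngrams(corpus, n)) / len(gen)


# Hypothetical strings standing in for the corpus and a model output.
corpus = "the promise of civil society is worth the effort every single day"
generated = "the promise of civil society is a new sentence entirely"
print(reused_fraction(generated, corpus))
```

A value near 1.0 suggests the model is mostly repeating the corpus, while a value near 0.0 suggests mostly novel word sequences.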
This next paragraph was going to be epic…
Just now when I sat down to start writing something strange happened. I started to type and nothing happened. Somehow the Corsair cube computer had lost track of the keyboard. Right now I’m using a Microsoft Natural Ergonomic Keyboard 4000 to type. Windows 10 forgetting about the keyboard is a first for this machine build. I’ll have to keep an eye on that one. Apparently, this keyboard arrived on Tuesday, March 31, 2015. That means this keyboard is actually holding up pretty well after five years of my daily typing. My typing is not really a soft keystroke type of thing. After a few minutes of getting going my typing is more like thrashing around and hoping for the best. It is entirely possible that buying a new keyboard is something that will happen at some point this year. The action on this keyboard actually still works pretty well. Nothing on the left hand side of the keyboard has even given out yet. Normally, the first thing to end up failing is a key in the S, D, or F section of the keyboard.
Based on my success with using my 20 year writing corpus and the OpenAI GPT-2 model together this weekend, I started to think about what would happen if I cleaned up the data on my “Graduation with Civic Honors” (GCH) book and used that corpus as a start to seed the GPT-2 model to create some longer form outputs. Right now my efforts have focused primarily on smaller weblog post size outputs. I was rather literally trying to get a well trained version of the GPT-2 model to spit out my next weblog post. I figured with 20 years of examples it might very well be able to string one together. Getting the GPT-2 model to turn out eloquent prose about the promise of civil society might be a better project for this week. The book has 14 chapters and to format the text for training it might be as easy as dropping in the <|startoftext|> token before the chapter and the <|endoftext|> token after the end of each chapter. Of course, by dropping those tokens into this post I created a false positive for anybody who scrapes this weblog post. Later on when I grab an updated writing corpus I’ll probably forget to extract this tag and frown at the artificial blip in the data.
Cleaning up the text of GCH and producing a single text file for training will be pretty easy. I should be able to do that today. The process today is going to be done by hand: 1) save the MS Word document as a “Plain Text (*.txt)” file with Unicode UTF-8 encoding, 2) remove all of the extraneous text outside of the chapter titles, headers, and content, and 3) insert the <|startoftext|> token with the corresponding <|endoftext|> token for each chapter. Given that the file is about 142 kilobytes in size and fairly regular in structure it should take about 20 minutes to clean up. After that creating a Jupyter notebook to ingest the corpus and run the GPT-2 model will be a pretty straightforward replication of my efforts from this weekend. Working to automate the 3 steps listed above in a Jupyter notebook to clean a text file might be more fun. I’m assuming any file that clearly has a chapter header like “Chapter 1:” will be easy enough to transform with tokens for start and stop. Most passages of text would not have a false positive for that string of text. That is pretty much exclusively used to signpost the start of a chapter. The word chapter on the other hand probably occurs throughout the file and would create false positives if it were used without the trailing number and colon combination.
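Step 3 above, along with the “Chapter 1:” signpost idea, could be automated along these lines. This is a sketch only, assuming every chapter header sits at the start of a line in the form “Chapter <number>:” and that anything before the first header is front matter that can be dropped; the function name wrap_chapters and the sample text are hypothetical.

```python
import re

START, END = "<|startoftext|>", "<|endoftext|>"
# Matches a line that starts with "Chapter <number>:", so the bare word
# "chapter" inside a paragraph does not count as a header.
CHAPTER_HEADER = re.compile(r"(?m)^Chapter \d+:")


def wrap_chapters(text: str) -> str:
    """Wrap each chapter in start/end tokens; text before the first header is dropped."""
    starts = [m.start() for m in CHAPTER_HEADER.finditer(text)]
    chapters = []
    # Pair each header position with the next one (or the end of the file).
    for begin, finish in zip(starts, starts[1:] + [len(text)]):
        body = text[begin:finish].strip()
        chapters.append(f"{START}\n{body}\n{END}")
    return "\n".join(chapters) + "\n"


# Hypothetical sample; the real input would be the exported GCH text file.
sample = (
    "Front matter\n"
    "Chapter 1: Intro\n"
    "The word chapter appears here.\n"
    "Chapter 2: Next\n"
    "More text.\n"
)
print(wrap_chapters(sample))
```

Anchoring the pattern to the start of a line and requiring the number and colon is what keeps ordinary uses of the word chapter from splitting the text in the wrong place.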
Yesterday, I was able to share on GitHub a working GPT-2 model that ingested a 20 year corpus of my writing. Using the OpenAI 355 million parameter version of GPT-2 did produce some outputs that looked reasonably like something that I would write. Normally training would take the longest amount of time, but in this case the entire Jupyter notebook takes less than an hour to run to create a reasonable text creation engine. Part of what I’m going to work on today is cleaning up the input corpus file to make it better. A really good text encoding process might help move the needle from reasonable to mostly accurate. The idea that within an hour of training on my writing corpus the GPT-2 model could reasonably approximate my writing style is very surprising. Seriously, I was super surprised when it worked the first time yesterday. A victory lap was taken around the house. That pretty much means shuffling around the house arms in the air with victory at the forefront of my thoughts.
Now that the experience of having victory at the forefront of my thoughts is fresh in my mind perhaps today will be the day that the next iteration of the Jupyter notebook will be shared on GitHub. Being able to share working notebooks in real time is really satisfying. Part of my journey into natural language processing and machine learning models has always been about being able to reproduce the actions being taken. I do not like to write or produce one-off code that requires hand holding to run successfully. Working with TensorFlow has to be about sharing and doing things that are repeatable. Over the last few years I have taken a ton of courses on how to use and work with TensorFlow. People have shared so much content online and GitHub is overflowing with different projects that people have undertaken. A lot of that is so very specifically targeted that generalizing it to a pattern others can repeat is a challenge. All of my efforts begin with the idea of sharing the pattern in the end. That helps me ensure that every step gets covered along the way and people can reasonably work through my examples.
Throughout the last five days I have been sprinting forward trying to really understand how to use the GPT-2 model. It took me a lot longer to dig into this one than my normal mean time to solve. Documentation on these things is challenging because of two distinct factors that increase complexity. First, the instructions are typically not purely step by step for the reader. You have to have some understanding of what you are doing to be able to work the instructions to conclusion. Second, the instructions were written at a specific point in time and the dependencies, versions, and deprecations that have happened since are daunting to overcome. At the heart of the joy that Jupyter Notebooks create is the ability to do something rapidly and share it. Environmental dependencies change over time and once-working notebooks slowly drift away from being useful to being a time capsule of now perpetually failing code. That in some ways is the ephemeral nature of the open source coding world that is currently expanding. Things work in the moment, but you have to ruthlessly maintain and upgrade to stay current on the open source wave of change.
My argument above is not an indictment of open source or of code that depends on specific versions and libraries. Things just get real over time as a code base that was once bulletproof proves to be dependent on something that was deprecated. Keep in mind that my journey to use the GPT-2 model included working with a repository that was published on GitHub just 15 months ago with very limited documentation. The file with developer instructions did not include a comprehensive environment list. I was told that this is why people build Docker containers that can be a snapshot in time deployed again and again to essentially freeze time. That is not how I work in real time or code when I’m actively developing. My general use case is to sit down and work with the latest version of everything. That might not be a good idea as code is generally not assumed to be future proof. An environmental dependency file would be a signpost for future developers to know exactly where things stood when this code base was shared via repository to GitHub.
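An environmental dependency file of the kind described above can be produced from inside a notebook. This is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the function names and the output filename requirements-snapshot.txt are hypothetical. Running “pip freeze” from a terminal gives a similar list.

```python
from importlib import metadata


def snapshot_environment() -> list:
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs with missing metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def write_requirements(path: str = "requirements-snapshot.txt") -> int:
    """Write the snapshot to a file and return the number of packages recorded."""
    lines = snapshot_environment()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)


print(len(snapshot_environment()), "installed packages found")
```

Committing a file like this alongside a notebook records which versions were in place on the day the code last worked.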
The adventure of digging into the code base for the last five days has been fun and full of learning. Digging into something for me involves opening the code up in Microsoft Visual Studio Code and trying to understand each block that was shared. The way I learned to tinker with and edit Python code was one programming debugging session at a time. I’ll admit that learning was a lot easier in a Jupyter Notebook environment. That allows you to pretty much run each section one after another and see any errors that are spit out so you can work to debug the code to get to that perfect possible future of working code. Oh that resplendent moment of working code where you move on to the next problem. It is a wonderful feeling of accomplishment to see code work. It is a supremely frustrating feeling to watch errors flood the screen or even worse to get nothing in return beyond obvious failure. Troubleshooting general failure is a lot harder than working to resolve a specific error. Right now between the two sessions of my Google Chrome browser I have maybe 70 tabs open. On reboot it is so bad that I end up having to go to browser settings, history, and recently closed to bulk reopen this massive string of browser tabs that at one point were holding my attention.
One of the best features I learned about in GitHub was to search for recently updated repositories. To accomplish that I searched for what I was looking for then sorted the results by last update. Based on the problems described above that type of searching was highly useful to learn the right environmental setup necessary to do the other things I wanted in a Google Colab notebook. On a side note when somebody publishes to GitHub using a notebook from Google Colab enough breadcrumbs exist to find interesting use cases by searching for “colab” plus whatever you are looking for from the main page of GitHub. Out of pure frustration on learning how to set up the environment to get going I used searches filtered to most recently updated for “colab machine learning” and “colab gpt” to get going. Out of that frustration I learned something useful about just looking around to see what people are actively working on and taking a look at what they are actively sharing on GitHub. My searching involved looking at a lot of code repositories that did not have any stars, reviews, or interactions. As my GPT skills improve I’ll make suggestions for some of those repositories on how to get their code bases working again now that a lot of them are getting massive numbers of errors that essentially end up concluding in “ModuleNotFoundError: No module named 'tensorflow.contrib'”. That error is truly deflating when it appears. Given how important it is to a lot of models and code I probably would have developed handling for it in the base TensorFlow given that it was intentionally deprecated.
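The tensorflow.contrib module was removed in TensorFlow 2.0, which is why older notebooks fail this way on a current install. A notebook could check for it up front and fail with a clearer message. This is a sketch only; the helper name module_available is hypothetical, and the check simply asks Python whether the module can be located.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the named module can be located in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (for example "tensorflow") is missing
        # or fails to import, so the submodule cannot be located either.
        return False


if not module_available("tensorflow.contrib"):
    print("tensorflow.contrib is not available; this code likely needs TensorFlow 1.x.")
```

Putting a check like this in the first cell turns a deflating traceback halfway through a run into an early, readable warning about the environment.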
My next big adventure will be to take the environmental setup necessary to get the GPT-2 model working and work out the best method to ingest my corpus of 20 years worth of my writing and see what it spits out as the next post. That has been my main focus in learning how to use this model and potentially even learning how to use the GPT-3 model that was released earlier this week by OpenAI. Part of the fun of doing this is not messing with it locally on my computer and creating a research project that cannot be reproduced. Within what I’m trying to do the fun will be releasing the Jupyter notebook and the corpus file so that other researchers can build more complex models based on my large writing database or verify the results by reproducing the steps taken in the notebook. That is the really key part here of the whole thing. Giving somebody the tools to freely reproduce the research on Google Colab without any real limitations is a positive step forward in research quality. Observing a phenomenon and being able to describe it is great. Being able to reproduce the phenomenon being described is how the scientific method can be applied to the effort.
Getting back into the groove of writing and working on things really just took a real and fun challenge to kickstart. Having a set of real work to complete always makes things a little bit easier and clearer. Instead of thinking about the possible you end up thinking about the pathing to get things done. Being focused on in-flight work has been a nice change of direction. Maybe I underestimated how much a good challenge would improve my quarantine experience. Things have been a little weird since March when the quarantine came into being, and it is about to be June on Monday. That is something to consider in a moment of reflection.
I have been actively working in the Google Colab environment and on my Windows 10 Corsair Cube to really understand the GPT-2 model. My interest in that has been pretty high the last couple of days. I started out working locally in Windows, and after that became frustrating I switched over to using GCP hardware via the Google Colab environment. One of the benefits of switching over is that instead of trying to share a series of commands and some notes on what happened I can work out of a series of Jupyter notebooks. They are easy to share, download, and most importantly to create from scratch. The other major benefit of working in the Google Colab environment is that I can dump everything and reset the environment. Being able to share the notebook with other people is important. That allows me to actively look at and understand other methods being used.
One of the things that happened after working in Google Colab for a while was that the inactivity timeouts made me sad. I’m not the fastest Python coder in the world. I frequently end up trying things and moving along very quickly for short bursts that are followed by longer periods of inactivity while I research an error, think about what to do next, or wonder what went wrong. Alternatively, I might be happy that something went right and that might create enough of a window that a timeout occurs. At that point, the Colab environment connection to the underlying hardware in the cloud drops off and things have to be restarted from the beginning. That is not a big deal unless you are in the middle of training something and did not have proper checkpoints saved off to preserve your efforts. I ended up subscribing to Google’s Colab Pro which apparently has faster GPUs, longer runtimes (fewer idle timeouts), and more memory. At the moment, the subscription costs $9.99 a month and that seems reasonable to me based on my experiences so far this week.
Anyway, I was actively digging into the GPT-2 model and making good progress in Google Colab and then on May 28 the OpenAI team dropped another model called GPT-3 with a corresponding paper, “Language Models are Few-Shot Learners.” That one is different and has proven a little harder to work with at the moment. I’m slowly working on a Jupyter notebook version.
Throughout the last few days I have been devoting all my spare time to learning about and working with the GPT-2 model from OpenAI. They published a paper about the model and it makes for an interesting read. The more interesting part of the equation is actually working with the model and trying to understand how it was constructed and working with all the moving parts. My first efforts were to install it locally on my Windows 10 box. Every time I do that I always think it would have been easier to manage in Ubuntu, but that would be less of a challenge. I figured giving Windows 10 a chance would be a fun part of the adventure. Giving up on Windows has been getting easier and easier. I actually ran Ubuntu Studio as my main operating system for a while with no real problems.
My training data set for my big GPT-2 adventure is everything published on my weblog. That includes about 20 years of content. The local copy of the original Microsoft Word document with all the formatting was 217,918 kilobytes whereas the text document version dropped all the way down to 3,958 kilobytes. I did go and manually open the text document version to make sure it was still readable and structured content.
The first problem is probably easily solved and it is related to a missing module named “numpy”:
PS F:\GPT-2\gpt-2-finetuning> python encode.py nlindahl.txt nlindahl.npz
Traceback (most recent call last):
  File "encode.py", line 7, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
Resolving that required a simple “pip install numpy” in PowerShell. That got me all the way to line 10 in the encode.py file, where this new error occurred:
PS F:\GPT-2\gpt-2-finetuning> python encode.py nlindahl.txt nlindahl.npz
Traceback (most recent call last):
  File "encode.py", line 10, in <module>
    from load_dataset import load_dataset
  File "F:\GPT-2\gpt-2-finetuning\load_dataset.py", line 4, in <module>
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
Solving this one required a similar method in PowerShell, “pip install --upgrade pip install https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.8.0-py3-none-any.whl”, which also included a specific path to tell pip where to get TensorFlow.
I gave up on that path and went a different route…
Getting the GPT-2 model set up on this Windows 10 machine was not as straightforward as I had hoped it would be yesterday. Python got upgraded, CUDA got upgraded, cuDNN got installed, and some flavor of the C++ build tools got installed on this machine. Normally when I elect to work with TensorFlow I boot into an Ubuntu instance instead of trying to work with Windows. That is where I am more proficient at managing and working with installations and things. I’m also a lot more willing to destroy my Ubuntu installation and spin up another one to start whatever installation steps I was working on again from the start in a clean environment. My Windows installation here has all sorts of things installed on it and some of them were in conflict or something with my efforts to get GPT-2 running. In fairness to my efforts yesterday, I only had a very limited amount of time after work to figure it all out. Installation had occurred via the steps on GitHub, but no magic was happening. Time ran out and that was a truly disappointing scenario to have happened.
Yesterday, I started looking around at all the content I have online. The only base I do not have covered is probably needing to share a new speaking engagement photo online. I need to set up a page for speaking engagements at some point with that photo and a few instructions on how best to request my engagement. Every time I have done a speaking engagement my weblog and Twitter traffic picked up for a little bit. Using the “Print My Blog” plugin I was able to export 1,328 pages of content for a backup yesterday. My initial reaction to that was wondering how many pages of that were useful content and how much of it was muddled prose. Beyond that question of usefulness, I also wondered what would come out as the predicted next batch of writing if I loaded that file into the OpenAI GPT-2 model. That is probably enough content to spit out something that reasonably resembles my writing. I started to wonder if the output would be more akin to my better work or my lesser work. Given that most of my writing is somewhat iterative and I build on topics and themes the GPT-2 model might very well be able to generate a weblog in my style of writing.
Just for fun I’m going to try to install and run that little project. When that model got released I spent a lot of time thinking about it, but did not put it into practice. Nothing would be more personal than having it generally create the same thing that I tend to generate on a daily basis. A controlled experiment would be to set it up and let it produce content each day and compare what I produce during my morning writing session to what it spits out as the predicted next batch of prose. It would have the advantage or disadvantage of being able to review 1,328 pages and predict what is coming next. My honest guess on that one is that the last 90 days are probably more informative for prediction than the last 10 years of content. However, that might not be accurate based on how the generative model works. All that content might very well help fuel the right parameters to generate that next best word selection. I had written “choice” to end that last sentence, but it felt weird to write that the GPT-2 model was making a choice so I modified the sentence to end with selection.