Last night was full of really deep and resonating thunder. It is still raining this morning as I sip my two shots of Nespresso. My day is starting with me sitting here writing and thinking about how a real thunderstorm impacts my sleep. Yesterday I was able to publish another Jupyter notebook to GitHub. This one helped train the OpenAI GPT-2 model on the cleaned-up Graduation with Civic Honors (GCH) corpus file. Strangely enough, the gpt-2-simple installation did not need the start and stop text encoding tokens. They actually made the output worse, so they had to be removed from the text file. One of the other things I learned is that some of the formatting needs to be removed to help the model be more successful. The indentation at the start of the paragraphs and any block text indentation seemed to really confuse the model. This was an important set of learnings for me.
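For anyone wanting to try the same cleanup, here is a minimal sketch of what stripping the tokens and the indentation could look like in Python. It is not the code from my notebook. The exact token strings and the function name `clean_training_text` are assumptions made for illustration.

```python
import re

# Assumed token strings; adjust them to whatever the corpus file actually uses.
START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"


def clean_training_text(raw: str) -> str:
    """Strip delimiter tokens and leading indentation from a training corpus."""
    text = raw.replace(START_TOKEN, "").replace(END_TOKEN, "")
    # Remove paragraph and block-text indentation, plus trailing whitespace.
    lines = [line.strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

Running the corpus through a function like this before training leaves plain paragraphs separated by a single blank line, which is the shape that seemed to work best for me.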
Making sure the training file is really clean and well formatted is an important part of getting solid output. The GCH text file contains a lot of text that is very focused on a single topic. The phrasing and the content are very similar. The OpenAI GPT-2 model should be able to replicate the style and reasonably replicate the content with the 355 million parameter model after a couple of runs of training. Previously during my weekend efforts I had given the model a training dataset of my last 20 years of writing from this weblog. That corpus of weblog posts spanned a very long time and was in a variety of styles and formats. On the brighter side of things, that corpus does not include any indentation for the paragraphs or very many block text examples. Generally, the larger corpus had a higher degree of loss in the model and produced a mixed bag of reasonable and very unreasonable blocks of prose. Some of them did very much sound like something I might have written. Admittedly, I had to check on some of it to see if it was spitting out reused text or if it was making novel prose.
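One rough way to automate that reused-versus-novel check is to measure how many word n-grams in a generated sample also appear in the training corpus. The sketch below is only an illustration of that idea, not something I actually ran. The function names and the default window of eight words are assumptions.

```python
def word_ngrams(words, n):
    """Return the set of all n-word windows in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def reused_fraction(generated: str, corpus: str, n: int = 8) -> float:
    """Fraction of the generated text's n-grams that also occur in the corpus."""
    generated_grams = word_ngrams(generated.lower().split(), n)
    if not generated_grams:
        # The sample is shorter than n words, so there is nothing to compare.
        return 0.0
    corpus_grams = word_ngrams(corpus.lower().split(), n)
    return len(generated_grams & corpus_grams) / len(generated_grams)
```

A value near 1.0 suggests the model is mostly copying from the corpus, while a value near 0.0 suggests the wording is new. It only catches exact word-for-word reuse, so close paraphrases would slip past it.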
This next paragraph was going to be epic…