Just now, when I sat down to start writing, something strange happened. I started to type and nothing happened. Somehow the Corsair cube computer had lost track of the keyboard. Right now I’m using a Microsoft Natural Ergonomic Keyboard 4000 to type. Windows 10 forgetting about the keyboard is a first for this machine build. I’ll have to keep an eye on that one. Apparently, this keyboard arrived on Tuesday, March 31, 2015. That means this keyboard is actually holding up pretty well after five years of my daily typing. My typing is not really a soft keystroke type of thing. After a few minutes of getting going my typing is more like thrashing around and hoping for the best. It is entirely possible that buying a new keyboard is something that will happen at some point this year. The action on this keyboard actually still works pretty well. Nothing on the left-hand side of the keyboard has even given out yet. Normally, the first thing to end up failing is a key in the S, D, or F section of the keyboard.
Based on my success with using my 20-year writing corpus and the OpenAI GPT-2 model together this weekend, I started to think about what would happen if I cleaned up the data on my “Graduation with Civic Honors” (GCH) book and used that corpus as a start to seed the GPT-2 model to create some longer form outputs. Right now my efforts have focused primarily on smaller, weblog-post-sized outputs. I was rather literally trying to get a well-trained version of the GPT-2 model to spit out my next weblog post. I figured with 20 years of examples it might very well be able to string one together. Getting the GPT-2 model to turn out eloquent prose about the promise of civil society might be a better project for this week. The book has 14 chapters, and to format the text for training it might be as easy as dropping in the <|startoftext|> token before each chapter and the <|endoftext|> token after the end of each chapter. Of course, by dropping those tokens into this post I created a false positive for anybody who scrapes this weblog post. Later on when I grab an updated writing corpus I’ll probably forget to extract these tags and frown at the artificial blip in the data.
Cleaning up the text of GCH and producing a single text file for training will be pretty easy. I should be able to do that today. The process today is going to be done by hand: 1) save the MS Word document as a “Plain Text (*.txt)” file with Unicode UTF-8 encoding, 2) remove all of the errata outside of the chapter titles, headers, and content, and 3) insert the <|startoftext|> token with the corresponding <|endoftext|> token for each chapter. Given that the file is about 142 kilobytes in size and fairly regular in structure, it should take about 20 minutes to clean up. After that, creating a Jupyter notebook to ingest the corpus and run the GPT-2 model will be a pretty straightforward replication of my efforts from this weekend. Working to automate those 3 steps listed above in a Jupyter notebook to clean a text file might be more fun. I’m assuming any file that clearly has a chapter header like “Chapter 1:” will be easy enough to transform with tokens for start and stop. Most passages of text would not have a false positive for that string of text. That is pretty much exclusively used to signpost the start of a chapter. The word chapter, on the other hand, probably occurs throughout the file and would create false positives if it were used without the trailing number and colon combination.
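Here is a rough sketch of what automating steps 2 and 3 might look like in a notebook cell. It is only a starting point built on a few assumptions: the book has already been saved as UTF-8 plain text, every chapter header sits at the start of its own line in the “Chapter 1:” form, and everything before the first header can be thrown away. The names wrap_chapters and CHAPTER_HEADER are ones I made up for illustration, and the sample text is invented.

```python
import re

START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"

# Assumption: a chapter header is "Chapter", a space, a number, and a colon
# at the very start of a line. A bare "chapter" inside a sentence is ignored.
CHAPTER_HEADER = re.compile(r"^Chapter \d+:", re.MULTILINE)


def wrap_chapters(text: str) -> str:
    """Wrap each chapter in start/end tokens, dropping text before the first header."""
    starts = [match.start() for match in CHAPTER_HEADER.finditer(text)]
    if not starts:
        return ""
    # Each chapter runs from its header to the next header (or the end of the file).
    bounds = starts + [len(text)]
    chapters = []
    for begin, end in zip(bounds, bounds[1:]):
        body = text[begin:end].strip()
        chapters.append(f"{START_TOKEN}\n{body}\n{END_TOKEN}")
    return "\n".join(chapters) + "\n"


# Invented sample standing in for the real book text.
sample = (
    "Title page and other front matter\n"
    "Chapter 1: Introduction\n"
    "This chapter mentions the word chapter in passing.\n"
    "Chapter 2: Civil Society\n"
    "More text here.\n"
)
wrapped = wrap_chapters(sample)
print(wrapped)
```

For the real file, the sample string would be swapped for the contents of the saved text file, read with UTF-8 encoding. Anything after the last chapter (an index or references, say) would end up inside the final chapter with this approach, so that part would still need a look by hand.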