Consider what research trajectories to really deep dive

Yesterday I started to consider what research trajectories to really deep dive into this year. The next few weeks are going to be devoted to some literature reviews within the polling methodology space. That is an area where I want to dig in and see some contemporary research. I have some quality content written for the next couple of Substack posts, but that writing needs some time and attention to get it over the finish line. It’s the continued release of all these new models that keeps pulling my attention in other directions. Pacing these releases to any normal schedule would be impossible at this point. So many companies are trying to put forward chat-enabled bots that it is nearly overwhelming. Right now I have been using ChatGPT from OpenAI and Google’s Bard. I’ll admit right now that the one that I interacted with over a longer duration was the OpenAI version. It has helped me write a couple of books that are passable: one on ethics in AI and the other about political debt. The one on AI ethics will show up in my Substack post today.

One of the things that I have considered a bit is how I’m picking titles for these weblog posts. Given that the only social media they are shared on is Twitter, as a single tweet, maybe I should be crafting the title of each post as a stand-alone tweet. Right now a title is simply selected from some interesting block of words (more than 3) that were written as a part of the post. Over the long history of this blog a lot of good titles have been used and now we are at a point where generating unique ones is a bit more challenging. To that end it has become easier to just select a passable title from the text and move along with the writing project. Recently, I have been highlighting where the title came from in bold text to help make that title selection reality a little bit more obvious to the reader.

Tomorrow morning is going to be key in terms of recording and editing down a couple of Substack posts to get back on track in terms of a backlog. I had successfully worked ahead and now I’m back in a position where all the content that is complete will be published. Later today the last of the recorded and ready Substack editions will go out for publication. I have been working on the next ones of course and that content has had review and effort put into it, but none of it has been recorded. I have not had my microphone set up for the last two weekends, making it harder to record high quality audio. I had considered trying to use my Pixel 7 Pro smartphone to do the recording, but it does not sound as good as the Yeti X microphone. A huge difference exists in the audio quality on those two microphone systems. One is markedly bigger than the other. The Yeti X is bigger than my phone to begin with and the microphone on a smartphone is exceedingly small as a component.

A few research trajectory notes

Today I finished working on the main content for week 88 of The Lindahl Letter. That one is a bridge piece between two sets of more academic side efforts. I went from working on an introductory syllabus to starting to prepare a bit for the more advanced set of content. Initially, I had considered making the advanced versions a collection of research notes that were built around very specific and focused topics. That is entirely a path that might be taken after the 104th week. The content will instead be packaged into another companion syllabus to allow an introductory look and a more advanced topic follow up for people looking for a bit more machine learning content. Functionally those two documents put together will be the summation of 104 weeks of my efforts in the machine learning space. It is the bookend to my journey into really diving deep into machine learning and studying it every weekend and a lot of weekdays.

After two years of digging into the machine learning space I’m going to pivot over and focus on writing and studying artificial intelligence in general for year 3 of The Lindahl Letter. It should be a fun departure and hopefully it will mix things up a little bit with a broader collection of literature. A lot of people talk about deploying AI in the business world and almost all of that conjecture is entirely based on deploying a machine learning model into a production environment. When those same people deploy an actual AI product they will hopefully see the difference.

A bit of regular blogging

Today I have a few hours blocked off to engage in a bit of writing and creating. Based on a review of the recent stats it looks like the Substack posts are working well enough. My draft of the week 73 Substack post is mostly complete. Drafting went well enough Friday into Saturday on that one. I’m going to rework it a bit here in a few minutes after finishing this post. It seemed like a good idea to just go ahead and do a bit of regular blogging right here before jumping back over to working on that post. Recently, I have been more likely to work on academic efforts and less likely to just produce a bit of stream of consciousness style prose. That seems to be a developing trend. I’m trying to avoid it becoming a pattern. 

Strangely enough yesterday was actually a really big Google Keep day for me. I took a ton of notes on things to work on and things to write about in the future. Sometimes that happens when I end up in a reflective mood. My notes are a wonderful little roadmap to producing future content. For those of you who do not take writing topic notes, it is a method of keeping track of things that could be promising without investing a ton of time into them immediately. My massive “Substack Posts” Google Doc has all the posts written so far and the backlog list, which is functionally a collection of notes that goes out to 120 total topics. At this point along the journey, I have written posts 1 to 73 within that backlog. Right now I’m still working at a pace where one or more posts are being produced each weekend. Each one of those posts is just a written set of research notes cataloging my journey to better understand a topic. To that end my research notes are essentially provided free of charge to an audience on Substack.

My plan is still to take all that content and drop it into a 2nd edition of The Lindahl Letter book when the 2nd full year of publishing is complete. I’m not entirely sure Substack as a platform will last forever and memorializing the content in an actual printed publication format is not a bad method of preserving things along the way. Sure, the content is available for free to people who want to consume it that way and as a book for people who prefer to read it in that format. That is just a part of the process and adventure along the way. I had done all the setup for posts and the pre-work on a bunch of posts up until the last post that was completed. I have now staged the posts up until week 104 which will be a big 2nd year of posting recap.

At the two year mark I’m planning on moving away from machine learning posts into just generally covering artificial intelligence and producing research notes related to a planned set of academic articles. That means that it is possible that weeks of ongoing coverage of something being worked on as an academic article could be distributed. That is probably a good method to really dig deep into a few topics along the way. One of the things I have worked pretty hard to avoid is producing coverage of the same topic over and over again. Somebody who sits down with the book later on will see ongoing coverage of topics and not encounter a repetitive reading experience. One of the things that I have really tried to avoid along the way is providing weblog-like coverage within The Lindahl Letter which would end up blending this type of content with that more research note type of coverage. I’m sure a blending of the two types of content would be possible, but that is not really the intended vibe.

Latest paper research notes

Over the last few days, I have been looking at sketches of the healthcare landscape in the United States. My research is strictly limited to that universe of care at the moment. Maybe later I could do some comparative analysis, but at the moment a limited universe is necessary to make progress on this initial research effort. I have a very large Moleskine sketchbook that has A3 size pages, which for those of you who do not know happens to be roughly 11.75 inches by 16.5 inches. That gives me plenty of space to sketch out ideas. At the moment, I have been working on three different sketches that will be converted from sketch to slide at some point. That effort includes mapping the healthcare space, plotting the next 5 years, and a sketch of where ML will be in that 5 year mapping of healthcare. My initial analysis showed a bunch of different ways to look at things. It feels like the overall ecosystem is being pushed from a lot of directions instead of being driven organically into a cohesive mesh.

Upcoming Research

Yesterday I completed an order for a couple of St. Vincent vinyl records. I’m going to give them a listen and see what I think about the records after a couple of spins. 

Today I added a static page to the weblog called, “Upcoming Research.” As a space for online content it is going to be devoted to the 5 research projects I’m working on and as part of my daily focus on having a trajectory statement it makes sense to codify current work.

My modus operandi for creating prose is to open a new word processing document every day and begin with a blank page. To this end my tabula rasa approach requires me to bring forward something from nothing. However, my renewed focus on producing papers and other manuscripts means a sustained focus will be required. Maintaining a sustained focus on one thing is a different type of modus operandi compared to trying to really clear your mind and work from a state of a pure stream of consciousness that approaches a true state of tabula rasa. While it is totally possible to use both methods on different days, within a single writing session they are mutually exclusive by definition. One is a seeded method to preload content and the other is a method to avoid preseeding ideas or intentions.

I’m back on my intermittent fasting diet of only eating two meals between 1100 hours and 1800 hours. For the most part the meal plan works out well enough, but it is challenging to sustain for several weeks.

Considering data permanence again

At the end of my writing session yesterday I accidentally sent out 307 tweets. Deleting every one of those by hand on Twitter as the rate limited API spit them out was a little bit nerve-racking. My expectation was that either the deduplication feature over at Twitter would catch this or the integration code on my side was written well enough not to post things modified using the bulk edit feature. Neither of those things held true and logic failed. That really did mean that a bunch of people who had alerts turned on received a lot of update notifications. Given that I have recently started using a VTech landline headset system to obfuscate my cellular connection to avoid notifications, I’m feeling a little bit of shame related to that glaring coding mistake.

Releasing those posts from the private mode back to published brings the public archive to a complete status from 2020 to current. At some point, I’m going to bring all the posts back from the 3,000 word a day writing habit period of 2018, but I’m going to need to fix that integration with Twitter before making that update. The easiest way to fix that integration would be to simply go to the settings menu and disconnect Twitter. Right now the setting for “Sharing posts to your Twitter feed” has been enabled. It would just take one click to disconnect it and that would pretty much solve the problem. It would not solve it via code; it would solve it by literally removing the potential for the problem to occur again. Maybe later this week that is what it will come to after some contemplation about the problem. I am really considering releasing the 153 posts from that highly productive writing period that are currently set to private mode.
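For what it is worth, the code-based version of that fix could look something like the sketch below. It is a minimal illustration and not the actual integration: the function name posts_to_share, the max_age_days cutoff, and the idea of keeping a set of already shared post IDs are all assumptions made for the example. The guard skips anything that was tweeted before and anything whose original publish date is old, which is exactly the case of archive posts being moved from private back to published.

```python
from datetime import date, timedelta


def posts_to_share(posts, already_shared, today, max_age_days=2):
    """Return the IDs of posts that are safe to auto-share.

    posts: list of (post_id, publish_date) tuples for posts just set to published.
    already_shared: set of post IDs that were tweeted at some point in the past.
    max_age_days: hypothetical cutoff; older posts are treated as archive re-releases.
    """
    cutoff = today - timedelta(days=max_age_days)
    safe = []
    for post_id, publish_date in posts:
        if post_id in already_shared:
            continue  # never tweet the same post twice
        if publish_date < cutoff:
            continue  # old archive post being re-released, not a new post
        safe.append(post_id)
    return safe


# Example batch: one old archive post, one new post, one post already shared.
batch = [
    (101, date(2020, 3, 1)),
    (102, date(2021, 7, 20)),
    (103, date(2021, 7, 20)),
]
print(posts_to_share(batch, already_shared={103}, today=date(2021, 7, 20)))
```

With a check like that in front of the sharing step, a bulk edit of a few hundred archive posts would produce an empty list instead of a flood of tweets.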

I have really spent a fair amount of time thinking about the nature of permanence and the written word recently. Until we start saving content to crystals (5D optical data storage) all of this writing and posting is going to be ephemeral at best. It is possible that my code on GitHub will be stored that way at some point and the GPT-2 model trained on my writing corpus would fall into that storage process and be saved for posterity. However, just because content got saved to crystal and was potentially accessible for ages does not mean any interest in the content would exist. People might not boot up the Nels bot for dialogue and exchange. Most of the interest in complex language modeling right now is based on overwhelmingly large datasets vs. contained individual personality development.

To that end I was reading an article on the arXiv website called “The Pile: An 800GB Dataset of Diverse Text for Language Modeling” by Gao et al., 2020. That diverse collection of data includes 825 gigabytes of content which functionally has been cleared of all sources and had the authorship removed. This action has removed individuality from the language model in favor of generalization. Future models might end up going the other direction and favoring personality over generalization, but that might end up being more isolated based on what I’m seeing so far in terms of language modeling.

On the brighter side of things, these experiences are focusing my research interests on that pivotal point of consideration between generalized and personality specific language models. I have a sample IEEE paper format template saved as a Microsoft Word document ready to house that future paper on my desktop screen right now. It’s entirely possible that after hitting publish on this missive that is where my attention will be placed for the rest of the day.

Still thinking about research

Over the last couple of days I have spent some time researching and thinking about finding a solid IRB partner to do some survey research in the technology space. It seems like the best solution may be to find a research partner at a university that has better access to that type of support. That is probably something that is possible to figure out, but it was part of a realization that some things are easier with an institutional partner. Outside of sending out physical surveys, which would require approval, I could write and deploy an automated analysis tool that compiles survey-like scores to support research papers. Alternatively, it would be possible to grab some publicly available data sets and do work within that space. My strong preference is to do work by creating my own method of data collection. That lets me really target the research project to the question being tested.

Papers for the sake of papers

More papers are getting published than any human could possibly read. Those publications are getting stacked up across many different fields and in the case of machine learning the sheer volume of content is staggering. You could try to only focus on a specific journal or two, but some of the most cutting edge research barely goes into the journal system anymore. A lot of it is just sort of pushed out online and those cutting edge researchers are on to the next project. It feels like a vicious cycle of papers for the sake of papers. My efforts to communicate and share my thoughts are generally focused on the medium I’m using and having some reason to share it with people. Within that framework, the necessity to communicate something is hopefully a better line in the sand for what should end up in a paper. We may hit an inflection point where only the top researchers in a field are able to pull together references and share content in a way that is widely read and dispersed. It would be a method of gatekeeping by sustained successful communication, but this could create a type of bubble around a top set of researchers and it could very well obscure the future edge of technology.

This is a topic that I’m really concerned about obviously. I have spent a good portion of my morning thinking about the future of academic research and the fragile current nature of the broader academy of academic thought. Intergenerational equity within the academy is about the effective storage and sharing of knowledge across the shoulders of giants as the intersection of technology and modernity occurs. Solutions to that quandary are probably beyond any single weblog post or thinking session. It will take a collective action within the academy to rebalance the means of communication toward something new. Somebody within a major field will need to hold some type of conference, lead a nationwide chautauqua, or create an institute to begin that process. Ultimately the system of introducing knowledge in any academic discipline involves lectures where a professor reduces a mountain of content into a presentable set of mapped coursework. That process sometimes ends up in books being published and other times a few of those textbooks become the standard across a discipline. Even the best ones either evolve over time or are replaced by the next set. That is a natural part of communicating the essence of an ever growing mountain of knowledge.

I keep thinking that maybe every discipline will end up with a sort of encyclopedia of knowledge for that core area of exploration. Just like people built out giant tomes of knowledge to share content when print was the primary medium of communication, some type of modern encyclopedia for a field could provide a foundation for beginning to understand a vast accumulation of knowledge within a field. You have to have some way of opening the door to people wanting to learn about the content, but most of them cannot start at the end of the stream of knowledge by reading the latest work by the foremost experts in the field. They need some type of foundational knowledge to be able to understand and consider that work at the bleeding edge of what is possible. In some of the sciences reading the mathematics presented on the page alone requires a certain amount of knowledge before it could be comprehended. My abilities in mathematics are decent, but occasionally when reading a machine learning paper the mathematics on a page are daunting. It takes me a bit to figure out exactly what the researcher is trying to communicate to me, as long strings of math are not annotated and commented like code to help people along the way of reading them from start to finish.

Thinking about research schedules

Some of my thoughts have drifted toward the productivity that having due dates from classes provided me over the years. Maybe I need to start thinking about turning the cycle of writing a paper into a scheduled thing like a class. A few different examples exist related to creating timelines for writing a research paper. Waiting for the paper to organically develop is not really working at the moment. It may very well be time to start working toward a new writing plan that involves a research schedule just like starting a class and working toward turning in a research paper at the end. I’m going to start working toward an August 16, 2021 start date for a planned cycle of research. 
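To make the class-style idea concrete, here is a small sketch that turns a start date into due dates. Only the August 16, 2021 start date comes from my plan above. The 16 week length and the milestone names are hypothetical placeholders, and the function name research_schedule is just something made up for the example.

```python
from datetime import date, timedelta


def research_schedule(start, milestones):
    """Map each (name, week_offset) milestone to a calendar due date."""
    return [(name, start + timedelta(weeks=week)) for name, week in milestones]


# Hypothetical semester-style plan; adjust the names and week offsets as needed.
MILESTONES = [
    ("Literature review", 4),
    ("Outline and research question", 6),
    ("First full draft", 12),
    ("Final paper", 16),
]

for name, due in research_schedule(date(2021, 8, 16), MILESTONES):
    print(f"{due.isoformat()}  {name}")
```

Having the dates written down like a syllabus is the whole point. The paper stops being something that might organically develop and becomes something that is due.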

I sort of started my effort by going out to consider literature reviews: