Managing all those backup files

Today was a day where some old file sorting happened. On a day-to-day basis the number of files I’m actually using is fairly limited. My backup, however, has built up way more files over the years than I can account for, and I seriously wonder about some of them. This massive collection of files stored in the cloud just hangs out and does nothing really. I’m not entirely sure why I still keep it all and back it up with such care. One of the strategies I have been considering is setting up a new folder, putting only the files that really matter to me into it, and then at some point allowing the rest to vanish from existence. To be fair about the whole thing, I’m guessing that some of those files are not necessary and probably should not have been saved in the first place. The degree to which my digital pack-rat tendencies were effective is somewhat astonishing at this point.

Sadly, I’m not the only one with massive collections of archived documents in a variety of clouds. Something is going to happen to all these clouds as people end up abandoning their files over time. I have a plan for my main cloud account where, without any action on my part for 90 days, the files are shared out. I’m not sure exactly what the people who are slated to get access to these files will end up doing with them; it might just be an overwhelming pile of digital artifacts. Within the grand aggregate of cloud files, I’m really wondering how many of them are needed, or if we have just created a reality where data center after data center is busy keeping records of nothing really. That is a thought related to both data permanence and necessity. While I cannot bring myself to just hit delete on all the files and move along, I’m sure that is the inevitable outcome unless we learn how to store files on crystals. At that point, it won’t matter how much data people want to store forever; it will all be possible. The bigger question will be if anybody ever does anything with all those stored files on what I’m sure will be a mountain-like pile of data storage crystals.

One of the things I have started doing within my more academic writing pursuits is finding ways to publish and store my work in places outside of my weblog or my cloud storage. That type of public sharing is an effort to help the writing stand the test of time in a better way. It is an attempt to achieve some type of data permanence based on making the content accessible. My weblog is scraped by the Wayback Machine, which does mean that my writings here are generally backed up beyond the ways that I back them up myself. You could, if you wanted to, scroll across the years and see the various weblog styles and other elements that go back a long time within that archive. That is one of the ways that the internet itself is backed up and accessible to people interested in that type of archive. I’m always curious about the freshness of a weblog, and nostalgic browsing of things from 20 years ago does not really appeal to me at the moment.
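For anybody curious about checking that kind of archive programmatically, here is a minimal sketch that asks the Wayback Machine’s public availability API whether a page has a saved snapshot. The example address is just a placeholder rather than this weblog’s actual URL, and the function name is my own invention for illustration.

```python
# Minimal sketch: query the Wayback Machine availability API for a snapshot.
# The URL passed in is a placeholder; swap in whatever page you want to check.
import json
import urllib.parse
import urllib.request


def latest_snapshot(url: str):
    """Return the address of the closest archived snapshot, or None if none exists."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None


if __name__ == "__main__":
    print(latest_snapshot("example.com"))
```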

Considering a bit of considering

Right now, at this very moment, I’m working on Substack post 88 of 104 and starting to consider moving that effort out of Google Docs and over to Microsoft Word as well. That move has not happened just yet, as I’m still writing some of that content on my Chromebook and I’m not a big fan of the Office 365 online version of Microsoft Word. I really got used to using Google Docs and the experience it provides. At the moment, I’m writing out of the desktop application while sitting at my main computer running Windows. That means that for the most part my writing has become something that happens at my desk in my office and not on the go. Given that most of the writing I end up doing happens in the morning during my scheduled blocks of time, that works out well enough. Right now, I’m pretty far ahead of the Friday publishing schedule again. Week 82 just went live, and the next 5 are ready to go already. That means that the entire introduction to machine learning syllabus is now draft complete. I do plan on going back and reading it again from start to finish and doing any final edits.

Probably the principal thing keeping me from moving completely out of Google Docs and over to Microsoft Word is that the final version of The Lindahl Letter that gets published does not contain the Tweets of the week or the links to things that are included in the Substack newsletter. When I go back to format the content for final publication, I have been removing those two sections. That is certainly something that could be adjusted moving forward: any links or content being shared could be worked into the main body of the post to avoid having to use those two links-only sections, or I could just include them in the final product as well. Based on the statistics available to me, it does not appear that that content is really consumed very much. People tend to read the prose at the top of the post and are not opening the email to see what Tweets I have enjoyed the most that week. Maybe the reason those got included was purely indulgent on my part, which is interesting as an aside to consider.

After finishing up that syllabus, I’m interested in working on some more research-note-type efforts where I’m really digging into the relevant scholarly articles as well as covering topics within the machine learning space. That is the goal of my Substack efforts moving forward. Of course, I broke that trajectory with my first set of writing efforts for the week 88 content. I’m probably going to need to reconsider the topics listed from 88 to 104 to make sure that they are ones that could support solid research notes. I’m not sure if they will end up getting converted over to Overleaf and eventually published that way, but that would be the general idea of what needs to happen moving forward.

Working with all that leftover data

Yesterday I was wondering about just deleting my data archives. Over the last 20 years I have accumulated so much data. Sure, most of it is backed up to the cloud, and the writing part of it is a very slim slice of the overall data. I was considering just wiping it all and starting over. My thoughts then drifted to what I might miss after that great data purge. I might need my writing archives, photos, and videos. Maybe keeping all my photos and family videos would be a good thing to actually follow through on. From what I know, all three sets of that data are backed up to a couple of clouds, and I have alternate copies. I’m not sure why I went to all of that trouble to keep all of that data. We used to print our favorite photographs and share them in books with people. Nobody is going to want to look through 20 or 30 thousand digital photos. They are not even shared online anymore. At one point, all of my digital photos were posted on Flickr or just shared in an online directory. I have considered bringing back a digital photograph section of my weblog to share content. I’m not talking about sharing all my photos this time around, but it would be good to share the ones that people might enjoy or use as a desktop background.

Thinking about online permanence again

Today I’m really focused on what parts of the internet are more permanent than others. I’m curious about what will happen in the future: “A decade from now, will GitHub and YouTube still be housing content?” It is really about my effort to question what will remain online year after year. Back on May 20, 2021, I released an album on YouTube called, “This is an ambient music recording called dissonant dystopia.” That work of art is 33 minutes of dissonant music, and it will exist online as long as YouTube houses it. That means its existence is pretty much tied to the permanence of YouTube as a platform. I’m going to guess that a lot of content faces the same constraint. The continued existence of that art is tied to the platform where it is hosted. I could probably post the album to a few other places to increase the odds of it outlasting YouTube as a platform, but I’m not sure that effort is worth my time. My guess about the future of online permanence is that Instagram and YouTube will continue to exist for as long as the modern internet persists as a technology.

It is times like these when I begin to wonder what will happen to the world wide web when pockets of private isolation creep up within the walls of applications. We are seeing a fragmentation of what was the open internet, be it the continued growth of dark pockets of the online world or just application-based islands. You are seeing parts of the internet where you can get to the front door, but they are not truly part of an open internet. They are something else, and that something else is evolving right now before our eyes. We could very well see a change in the format of the content in the next decade. Sure, hypertext has connected the world, but a metaverse will potentially be a video/image stream that goes way beyond a text-based communication method. Keep in mind that this weblog barely contains any imagery and the primary method of communicating content is text-based. In a metaverse of rooms, zones, areas, or community spaces, it is entirely possible that everything will be immersive and that image and sound will define the method of communication.

Really, the most advanced method of communication I have considered is either recording these missives as audio for a podcast or making a video version of a podcast that really just includes a view of me reading the content. Either way, that would be a one-way method of communication, whether via text dissemination, audio recording, or video recording. It would be nothing outside of an asynchronous method of communication. I might respond to a comment or a note that somebody provided, but it would not be within an immersive environment. It would be purely asynchronous in nature.

Considering data permanence again

At the end of my writing session yesterday I accidentally sent out 307 tweets. Deleting every one of those by hand on Twitter as the rate-limited API spit them out was a little bit nerve-racking. My expectation was that either the deduplication feature over at Twitter would catch this or the integration code on my side was written well enough not to post things modified using the bulk edit feature. Neither of those things held true, and logic failed. That really did mean that a bunch of people who had alerts turned on received a lot of update notifications. Given that I have recently started using a VTech landline headset system to obfuscate my cellular connection and avoid notifications, I’m feeling a little bit of shame related to that glaring coding mistake.
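For what it’s worth, the kind of guard I expected the integration to have is easy enough to sketch. This is a hypothetical example rather than my actual integration code: the shared_posts.json file, the post IDs, and the post_to_twitter function are all stand-ins. The idea is simply to record which posts have already been announced so that a bulk edit of old posts stays silent instead of firing off hundreds of tweets.

```python
# Hypothetical sketch of a "share to Twitter" guard that only announces new posts.
import json
from pathlib import Path

SHARED_LOG = Path("shared_posts.json")  # stand-in store of already-announced post IDs


def load_shared_ids() -> set:
    """Return the set of post IDs that have already been tweeted."""
    if SHARED_LOG.exists():
        return set(json.loads(SHARED_LOG.read_text()))
    return set()


def save_shared_ids(ids: set) -> None:
    """Persist the announced post IDs for the next run."""
    SHARED_LOG.write_text(json.dumps(sorted(ids)))


def share_new_posts(posts, post_to_twitter):
    """Tweet only posts that have never been announced, so bulk edits stay quiet."""
    shared = load_shared_ids()
    for post in posts:
        if post["id"] in shared:
            continue  # already announced; an edit or re-save should not re-tweet it
        post_to_twitter(post["title"], post["url"])
        shared.add(post["id"])
    save_shared_ids(shared)
```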

Releasing those posts from private mode back to published brings the public archive to a complete status from 2020 to current. At some point, I’m going to bring back all the posts from the 3,000-word-a-day writing habit period of 2018, but I’m going to need to fix that integration with Twitter before making that update. The easiest way to fix that integration would be to simply go to the settings menu and disconnect Twitter. Right now the setting for “Sharing posts to your Twitter feed” is enabled. It would just take one click to disconnect it, and that would pretty much solve the problem, but it would not solve it via code; it would solve it by literally removing the potential for the problem to occur again. Maybe later this week that is what it will come to after some contemplation about the problem. I am really considering releasing the 153 posts from that highly productive writing period that are currently set to private mode.

I have really spent a fair amount of time thinking about the nature of permanence and the written word recently. Until we start saving content to crystals (5D optical data storage), all of this writing and posting is going to be ephemeral at best. It is possible that my code on GitHub will be stored that way at some point, and the GPT-2 model trained on my writing corpus would fall into that storage process and be saved for posterity. However, just because content got saved to crystal and was potentially accessible for ages does not mean any interest in the content would exist. People might not boot up the Nels bot for dialogue and exchange. Most of the interest in complex language modeling right now is based on overwhelmingly large datasets vs. contained individual personality development.

To that end, I was reading an article on arXiv called “The Pile: An 800GB Dataset of Diverse Text for Language Modeling” from Gao et al., 2020. That diverse collection includes 825 gigabytes of content where, functionally, the individual sources and authorship get blended away into the aggregate. This action has removed individuality from the language model in favor of generalization. Future models might end up going the other direction and favoring personality over generalization, but that might end up being more isolated based on what I’m seeing so far in terms of language modeling.

On the brighter side of things, these experiences are focusing my research interests on that pivotal point of consideration between generalized and personality-specific language models. I have a sample IEEE paper format template, saved as a Microsoft Word document, open on my desktop screen right now, ready to house that future paper. It’s entirely possible that after hitting publish on this missive, that is where my attention will be placed for the rest of the day.