Yesterday I was wondering about just deleting my data archives. Over the last 20 years I have accumulated so much data. Sure, most of it is backed up to the cloud, and the writing is a very slim slice of the overall total. I was considering just wiping it all and starting over. My thoughts then drifted to what I might miss after that great data purge. I might need my writing archives, photos, and videos. Maybe keeping all my photos and family videos would be a good thing to actually commit to. From what I know, all three sets of that data are backed up to a couple of clouds and I have alternate copies. I’m not sure why I went to all of that trouble to keep it. We used to print our favorite photographs and share them in books with people. Nobody is going to want to look through 20 or 30 thousand digital photos. They are not even shared online anymore. At one point, all of my digital photos were posted on Flickr or just shared in an online directory. I have considered bringing back a digital photography section of my weblog to share content. I’m not talking about sharing all my photos this time around, but it would be good to share the ones that people might enjoy or use as a desktop background.
Today I’m really focused on what parts of the internet are more permanent than others. I’m curious about what will happen in the future: a decade from now, will GitHub and YouTube still be housing content? It is really about my effort to question what will remain online year after year. Back on May 20, 2021 I released an album on YouTube called, “This is an ambient music recording called dissonant dystopia.” That work of art is 33 minutes of dissonant music and it will exist online as long as YouTube houses it. That means its existence is pretty much tied to the permanence of YouTube as a platform. I’m going to guess that a lot of content faces the same constraint: the continued existence of the art is tied to the platform where it is hosted. I could probably post the album to a few other places to increase the odds of it outlasting YouTube as a platform, but I’m not sure that effort is worth my time. My guess about the future of online permanence is that Instagram and YouTube will continue to exist for as long as the modern internet persists as a technology.
It is times like these when I begin to wonder what will happen to the world wide web as pockets of private isolation creep up within the walls of applications. We are seeing a fragmentation of what was the open internet, be it the continued growth of dark pockets of the online world or just application-based islands. You can get to the front door of these parts of the internet, but they are not truly a part of an open internet. They are something else, and that something else is evolving right now before our eyes. We could very well see a change in the format of content in the next decade. Sure, hypertext has connected the world, but a metaverse will potentially be a video/image stream that is way beyond a text-based communication method. Keep in mind that this weblog barely contains any imagery and the primary method of communicating content is text. In a metaverse of rooms, zones, areas, or community spaces, it is entirely possible that everything will be immersive and that image and sound will define the method of communication.
Really the most advanced method of communication I have considered is either recording these missives as audio for a podcast or making a video version of a podcast that simply shows me reading the content. Either way that would be a one-way method of communication, whether via text dissemination, audio recording, or video recording. It would be nothing outside of an asynchronous method of communication. I might respond to a comment or a note that somebody provided, but it would not be within an immersive environment. It would be purely asynchronous in nature.
At the end of my writing session yesterday I accidentally sent out 307 tweets. Deleting every one of those by hand on Twitter as the rate-limited API spit them out was a little bit nerve-wracking. My expectation was that either the deduplication feature over at Twitter would catch this or the integration code on my side was written well enough not to post things modified using the bulk edit feature. Neither of those things held true, and the logic failed. That really did mean that a bunch of people who had alerts turned on received a lot of update notifications. Given that I have recently started using a VTech landline headset system to obfuscate my cellular connection and avoid notifications, I’m feeling a little bit of shame related to that glaring coding mistake.
Releasing those posts from private mode back to published brings the public archive to a complete status from 2020 to current. At some point, I’m going to bring back all the posts from the 3,000-word-a-day writing habit period of 2018, but I’m going to need to fix that integration with Twitter before making that update. The easiest way to fix it would be to simply go to the settings menu and disconnect Twitter. Right now the setting for “Sharing posts to your Twitter feed” is enabled. It would take just one click to disconnect it, and that would pretty much solve the problem, not via code, but by literally removing the potential for the problem to occur again. Maybe later this week that is what it will come to after some contemplation about the problem. I am really considering releasing the 153 posts that are currently set to private mode from that highly productive writing period.
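For what it is worth, the guard the integration was missing could be as simple as keeping a local ledger of post IDs that have already been syndicated, so that bulk edits or re-publishes never trigger a second tweet. Here is a minimal sketch of that idea; the file name, the helper functions, and the assumption that posts arrive as dictionaries with an "id" field are all hypothetical, not the actual integration code.

```python
import json
from pathlib import Path

# Hypothetical local ledger of post IDs that have already been tweeted.
SHARED_LOG = Path("shared_posts.json")

def load_shared_ids(path: Path = SHARED_LOG) -> set:
    """Return the set of post IDs that have already been syndicated."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()

def posts_to_share(all_posts: list, shared_ids: set) -> list:
    """Only posts never shared before qualify; edited or re-published
    posts already in the ledger are skipped entirely."""
    return [post for post in all_posts if post["id"] not in shared_ids]

def record_shared(post_id, path: Path = SHARED_LOG) -> None:
    """Append a post ID to the ledger after a successful share."""
    ids = load_shared_ids(path)
    ids.add(post_id)
    path.write_text(json.dumps(sorted(ids)))
```

With a check like this in front of the sharing call, flipping 307 posts from private back to published would produce zero tweets, because every one of those IDs would already be in the ledger.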
I have really spent a fair amount of time thinking about the nature of permanence and the written word recently. Until we start saving content to crystals (5D optical data storage), all of this writing and posting is going to be ephemeral at best. It is possible that my code on GitHub will be stored that way at some point, and the GPT-2 model trained on my writing corpus would fall into that storage process and be saved for posterity. However, just because content got saved to crystal and was potentially accessible for ages does not mean any interest in the content would exist. People might not boot up the Nels bot for dialogue and exchange. Most of the interest in complex language modeling right now is based on overwhelmingly large datasets vs. contained individual personality development.
To that end I was reading the article “The Pile: An 800GB Dataset of Diverse Text for Language Modeling” on the arXiv website, from Gao et al., 2020. That diverse collection includes 825 gigabytes of content that has functionally been pooled across all its sources with the authorship removed. That pooling strips individuality from the language model in favor of generalization. Future models might end up going the other direction and favoring personality over generalization, but that might end up being more isolated based on what I’m seeing so far in terms of language modeling.
On the brighter side of things, these experiences are focusing my research interests on that pivotal point of consideration between generalized and personality-specific language models. I have a sample IEEE paper format template saved as a Microsoft Word document, ready to house that future paper, on my desktop screen right now. It’s entirely possible that after hitting publish on this missive that is where my attention will be placed for the rest of the day.