Nels Lindahl — Functional Journal

A weblog created by Dr. Nels Lindahl featuring writings and thoughts…

Tag: ArXiv

  • Considering data permanence again

    At the end of my writing session yesterday I accidentally sent out 307 tweets. Deleting every one of those by hand on Twitter as the rate-limited API spit them out was a little bit nerve-racking. My expectation was that either the deduplication feature over at Twitter would catch this or the integration code on my side was written well enough not to post things modified using the bulk edit feature. Neither of those things held true and the logic failed. That really did mean that a bunch of people who had alerts turned on received a lot of update notifications. Given that I have recently started using a VTech landline headset system to obfuscate my cellular connection to avoid notifications, I’m feeling a little bit of shame related to that glaring coding mistake.

    Releasing those posts from private mode back to published makes the public archive complete from 2020 to the present. At some point, I’m going to bring back all the posts from the 3,000-word-a-day writing habit period of 2018, but I’m going to need to fix that integration with Twitter before making that update. The easiest way to fix that integration would be to simply go to the settings menu and disconnect Twitter. Right now the setting for “Sharing posts to your Twitter feed” is enabled. It would take just one click to disconnect it, and that would pretty much solve the problem. It would not solve it via code, though. It would solve it by literally removing the potential for the problem to occur again. Maybe later this week that is what it will come to after some contemplation about the problem. I am really considering releasing the 153 posts from that highly productive writing period that are currently set to private mode.
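    A code-level fix, as opposed to disconnecting the integration, could take the form of a guard that refuses to share anything touched by a bulk edit or already shared once. The following is only a minimal sketch under stated assumptions: `Post`, `should_share`, and `posts_to_share` are hypothetical names invented for illustration, and none of this reflects the actual WordPress or Twitter integration code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Post:
    """Hypothetical record of a status change on a post."""
    post_id: int
    old_status: str  # status before the edit, e.g. "private"
    new_status: str  # status after the edit, e.g. "publish"


def should_share(post: Post, already_shared: Set[int], bulk_edit: bool) -> bool:
    """Return True only for a first-time publish outside of a bulk edit."""
    if bulk_edit:
        # Assumption: bulk edits should never trigger sharing.
        return False
    if post.new_status != "publish" or post.old_status == "publish":
        # Only a transition into the published state counts.
        return False
    # Deduplicate on our side instead of relying on the remote service.
    return post.post_id not in already_shared


def posts_to_share(posts: Iterable[Post], already_shared: Set[int],
                   bulk_edit: bool) -> List[int]:
    """Collect the IDs that would be shared and record them as shared."""
    selected = []
    for post in posts:
        if should_share(post, already_shared, bulk_edit):
            selected.append(post.post_id)
            already_shared.add(post.post_id)
    return selected
```

    With a guard like this, flipping a few hundred posts from private to published in one bulk edit would select nothing for sharing, while a single newly published post would still go out once.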

    I have really spent a fair amount of time thinking about the nature of permanence and the written word recently. Until we start saving content to crystals (5D optical data storage) all of this writing and posting is going to be ephemeral at best. It is possible that my code on GitHub will be stored that way at some point and the GPT-2 model trained on my writing corpus would fall into that storage process and be saved for posterity. However, just because content got saved to crystal and was potentially accessible for ages does not mean any interest in the content would exist. People might not boot up the Nels bot for dialogue and exchange. Most of the interest in complex language modeling right now is based on overwhelmingly large datasets vs. contained individual personality development.

    To that end I was reading an article on arXiv called “The Pile: An 800GB Dataset of Diverse Text for Language Modeling” by Gao et al. (2020). That diverse collection of data includes 825 gigabytes of content which has functionally been stripped of its sources and authorship. This action has removed individuality from the language model in favor of generalization. Future models might end up going the other direction and favoring personality over generalization, but that might end up being more isolated based on what I’m seeing so far in terms of language modeling.

    On the brighter side of things, these experiences are focusing my research interests on that pivotal point of consideration between generalized and personality-specific language models. I have a sample IEEE paper format template saved as a Microsoft Word document ready to house that future paper on my desktop screen right now. It’s entirely possible that after hitting publish on this missive that is where my attention will be placed for the rest of the day.

  • Some productivity housekeeping

    Two days ago, I took action on my plan to do more peer reviews of academic journal submissions and reached out to editors from 3 journals: 1) Government Information Quarterly, 2) Journal of Public Administration Research and Theory, and 3) Public Administration Review. Now that I have built a pretty solid history of working with technology in practice it felt like a good idea to give back to the research community. It will also encourage me to read more articles in the public administration space. I read a ton of articles in the machine learning and artificial intelligence space every week and they are honestly easier to find as they are shared all over by the authors and other interested parties via Twitter thanks to arXiv and other pre-print services. I’m going to give it some thought and try to find a few other journals to support as well. I’m not entirely confident that a lot of peer review work will arrive from the journals mentioned above.

  • Thinking about writing and research

    Well, my testing of the Twitter thread creation feature in WordPress has gone well enough. I have used it for the last couple of days and the results were successful, but not in any way inspiring. The linking method they use to bring links from the post over to Twitter is wonky. They mechanically extract the link and share it explicitly within parentheses after the original spot where it occurred. So it looks sort of out of place compared to the way people share links in a modern embedded way on the internet. It’s an odd style choice, but then again Twitter itself is a forced minimum in a world of excess and overcommunication. Even the new expanded tweet limit of 280 characters just allows a couple of sentences to be created at a time. You can elect to communicate within that shortened utterance window, but it is not really how I prefer to exchange my thoughts and ideas.
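    The link handling described above can be approximated in a few lines. This is only a rough sketch of the observed behavior, not the WordPress implementation: `inline_links` and `to_thread` are hypothetical names, the regular expression assumes simple anchors with `href` as the first attribute, and the length check counts raw characters, which is a simplification of how Twitter actually measures a tweet.

```python
import re
import textwrap
from typing import List

TWEET_LIMIT = 280  # character limit mentioned in the post

# Simplified pattern: matches <a href="URL" ...>text</a> only.
ANCHOR = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>',
                    re.IGNORECASE | re.DOTALL)


def inline_links(html: str) -> str:
    """Replace each anchor with its text followed by the URL in parentheses."""
    return ANCHOR.sub(lambda m: f"{m.group(2)} ({m.group(1)})", html)


def to_thread(html: str, limit: int = TWEET_LIMIT) -> List[str]:
    """Split the rewritten text on whitespace into chunks of at most `limit` characters."""
    return textwrap.wrap(inline_links(html), width=limit, break_on_hyphens=False)
```

    Running `inline_links` on a sentence with one embedded link produces the link text followed by the bare URL in parentheses, which is the out-of-place look described above.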

    My preference is to engage in a more long-form exchange of the written word. That is probably why writing academic articles and weblog posts is where I spend a lot of my time. For the most part, either way you are granted the chance to put your thoughts, arguments, opinions, or logic down on the page for better or worse. Sure, in the academic writing space you can face delays and rounds of revisions, but that is the nature of things so ingrained in the academy that it took an entirely new method of online pre-print sharing to shatter. That happened because people want to openly exchange academic information and learn and develop rapidly in certain segments of the academy. That need to exchange and work toward common frameworks and applications of developing technology required a faster cycle of information exchange. Generally in the machine learning space, artificial intelligence domain, and other sciences it seems that publication on arXiv has been working for people. It is a far less gated way to get your words out into the open vs. waiting for the academic review process of some journals that take an awfully long time to evaluate content.

    A lot of research seems to be starting to form around the way pre-print sharing is changing the very nature of publishing in journals. I’ll provide more coverage of that later when I find an article worth sharing that concisely explains the current situation. That is probably a problem for more than one cup of coffee to solve. I’m only partially into my first cup of coffee and it is only slightly warm at the moment. Microwaving that cup of coffee seems like a lot of work right now. I’ll probably muddle on through creating the rest of this page of prose before going to work on the coffee problem. My initial goal for the day was to sit down and write a full page before switching over to editing and finishing the next installment of The Lindahl Letter for Friday, April 30, 2021. During the course of designing the trajectory of that writing effort 37 weeks were mapped out and content was drafted, reworked, and established for the first 14 weeks of publication. As of next Friday we will have traveled to the end of my generally drafted efforts. Starting in week 15, I will be entering a new phase: publication on Friday, and new content creation for the next week starting on Saturday. It will basically be a weekly content creation and publishing cycle going forward.

    It is entirely possible that a flourish of creativity will occur and enough content will be drafted to work ahead again, but that does not appear to be the case at the moment. Things seem to be lining up for a weekly content creation cycle. That is not a bad path forward, but it requires a lot more consistency in weekly writing than being weeks ahead of the point of publication. Generally the amount of tinkering and rework is about to decrease.