Nels Lindahl — Functional Journal

A weblog created by Dr. Nels Lindahl featuring writings and thoughts…

Category: Substack Posts

  • Knowledge graphs vs. vector databases

    Thank you for tuning in to this audio only podcast presentation. This is week 144 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is, “Knowledge graphs vs. vector databases.”

    Don’t panic, the Google Scholar searches are coming in fast and furious on this one [1]. We had a footnote in the first sentence today. Megan Tomlin, writing over at neo4j, had probably the best one-line definition of the difference, noting that knowledge graphs sit in the human-readable data camp while vector databases are more of a black box [2]. I actually think that eventually one super large knowledge graph will emerge and be the underpinning of all of this, but that has not happened yet, given that the largest one in existence, the one Google holds, will likely always remain proprietary.
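
    To make that one-line definition concrete, here is a toy Python sketch of the two retrieval styles. Everything in it is invented for illustration; the point is only that the graph query reads as facts while the vector query reads as geometry.

        # Knowledge graph side: explicit, human-readable triples.
        triples = {
            ("Neo4j", "is_a", "graph database"),
            ("Pinecone", "is_a", "vector database"),
        }
        print([s for (s, p, o) in triples if p == "is_a" and o == "graph database"])

        # Vector database side: opaque embeddings queried by nearest neighbor.
        import numpy as np

        embeddings = {"Neo4j": np.array([0.9, 0.1]), "Pinecone": np.array([0.2, 0.8])}
        query = np.array([0.85, 0.15])
        print(min(embeddings, key=lambda name: float(np.linalg.norm(embeddings[name] - query))))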

    Combining two LLMs… right now you could call them one after another, but I’m not finding an easy way to pool them into a single model. I wanted to just say to my computer, “use Bayesian pooling to combine the most popular LLMs from Hugging Face,” but that is not an available command at the moment. A lot of incompatible content is being generated in the vector database space. People are stacking LLMs, working in sequence, or making parallel calls to multiple models. What I was very curious about was how to go about merging LLMs, combining LLMs, actual model merges, ingestion of models, or even a method to merge transformers. I know that is a tall order, but it is one that would take so much already spent computing cost and move it from sunk to additive in terms of value.
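
    For what it’s worth, the “call them one after another” part is easy enough to sketch. Here is a minimal, hedged example that pools two small Hugging Face models by averaging their next-token distributions with uniform weights; actual Bayesian pooling would learn those weights, and the model names are just stand-ins that happen to share a vocabulary.

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        # Two small models that share the GPT-2 vocabulary; stand-ins for
        # whatever "most popular" models you actually wanted to pool.
        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        models = [
            AutoModelForCausalLM.from_pretrained("gpt2"),
            AutoModelForCausalLM.from_pretrained("distilgpt2"),
        ]

        prompt = "Knowledge graphs differ from vector databases because"
        ids = tokenizer(prompt, return_tensors="pt").input_ids

        for _ in range(30):  # greedy decoding over the pooled distribution
            with torch.no_grad():
                # Uniform-weight average of the next-token distributions;
                # Bayesian pooling would learn these weights instead.
                probs = torch.stack(
                    [m(ids).logits[:, -1, :].softmax(-1) for m in models]
                ).mean(0)
            next_id = probs.argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)

        print(tokenizer.decode(ids[0], skip_special_tokens=True))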

    A few papers exist on this, but they are not exactly solutions to this problem. 

    Jiang, D., Ren, X., & Lin, B. Y. (2023). LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561. https://arxiv.org/pdf/2306.02561.pdf (project page: https://yuchenlin.xyz/LLM-Blender/)

    Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., … & Wang, C. (2023). AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155. https://arxiv.org/pdf/2308.08155.pdf 

    Chan, C. M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., … & Liu, Z. (2023). ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201. https://arxiv.org/pdf/2308.07201.pdf

    Most of the academic discussions, and even cutting edge papers like AutoGen, are about orchestration of models instead of merging, combining, or ingesting many models into one. I did find a discussion on Reddit from earlier this year about how to merge the weights of transformers [3]. It’s interesting what things end up on Reddit. Sadly that subreddit is closed due to the dispute over third-party apps.
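
    For the curious, the naive version of what that Reddit thread discusses is a straight element-wise average of parameters. This is a sketch under big assumptions, not a recommendation: it only makes sense for two checkpoints with an identical architecture, such as two fine-tunes of the same base model, and the second model name and output directory here are placeholders.

        from transformers import AutoModelForCausalLM

        # Two checkpoints with identical architecture; in practice the second
        # would be a fine-tuned variant of the same base model.
        model_a = AutoModelForCausalLM.from_pretrained("gpt2")
        model_b = AutoModelForCausalLM.from_pretrained("gpt2")

        state_a, state_b = model_a.state_dict(), model_b.state_dict()
        merged_state = {
            # Average floating-point parameters; copy non-float buffers
            # (like attention masks) through unchanged.
            name: (state_a[name] + state_b[name]) / 2
            if state_a[name].is_floating_point()
            else state_a[name]
            for name in state_a
        }

        merged = AutoModelForCausalLM.from_pretrained("gpt2")
        merged.load_state_dict(merged_state)
        merged.save_pretrained("gpt2-merged")  # hypothetical output directory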

    Exploration into merging and combining Large Language Models (LLMs) is indeed at the frontier of machine learning research. While academic papers like “LLM-Blender” and “AutoGen” offer different perspectives, they primarily focus on ensembling and orchestration rather than true model merging or ingestion. The challenge lies in the inherent complexities and potential incompatibilities when attempting to merge these highly sophisticated models.

    The quest for effectively pooling LLMs into a single model or merging transformers is a journey intertwined with both theoretical and practical challenges. Bridging the gap between the human-readable data realm of knowledge graphs and the more opaque vector database space, as outlined in the beginning of this podcast, highlights the broader context in which these challenges reside. It also underscores the necessity for a multidisciplinary approach, engaging both academic researchers and the online tech community, to advance the state of the art in this domain.

    In the upcoming weeks, we will delve deeper into the community-driven solutions, and explore the potential of open-source projects in advancing the model merging discourse. Stay tuned to The Lindahl Letter for a thorough exploration of these engaging topics.

    Footnotes:

    [1] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C6&q=knowledge+graph+vector+database&btnG= 

    [2] https://neo4j.com/blog/knowledge-graph-vs-vectordb-for-retrieval-augmented-generation/ 

    [3] https://www.reddit.com/r/MachineLearning/comments/122fj05/is_it_possible_to_merge_transformers_d/ 

    What’s next for The Lindahl Letter? 

    • Week 145: Delphi method & Door-to-door canvassing
    • Week 146: Election simulations & Expert opinions
    • Week 147: Bayesian Models
    • Week 148: Running Auto-GPT on election models
    • Week 149: Modern Sentiment Analysis

    If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.

  • Synthetic social media analysis

    Thank you for tuning in to this audio only podcast presentation. This is week 143 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is, “Synthetic social media analysis.”

    After the adventures of last week, I started this week’s writing wanting to figure out what people were doing with LangChain and social media. People are both generating content for social media using LLMs and, oddly enough, repurposing content as well. We have to zoom out for just a second and consider the broader ecosystem of content. In the before-times, people who wanted to astroturf or content farm had some work to do within the content creation space. Now ChatGPT has opened the door and let the power of synthetic content creation loose. You can create personas and just have them generate endless streams of content. People can even download and run models trained for this purpose. It’s something I’m legitimately worried about for this next election cycle. Sometimes I wonder how much content within the modern social media spaces is created artificially. Measuring that is actually pretty difficult. It’s not like organically created content gets a special badge or recognition.

    For those of you interested in pulling insights on any topic with a plugin that works inside the OpenAI ChatGPT system, you could take a moment and install “The Yabble ChatGPT Plugin” [1]. Fair warning on this one: I had to reduce my 3 plugins down to just Yabble and be pretty explicit in my ChatGPT prompts to make it do some work. Sadly, I could not just log in to Yabble and had to book a demo with them to get access. Stay tuned on that one for more information on how that system works. I had started by searching out plugins to have ChatGPT analyze social media. That has become easier now with the announcement that ChatGPT can openly use Bing search [2].

    Outside of searching with OpenAI tooling like ChatGPT, Google made it pretty clear that what I was really looking for was marketing tools. Yeah, I went down the SEO assistant rabbit hole and it was shocking. So much content exists in this space that browsing it is like watching a very full ant farm. Figuring out where to jump in without getting scammed is a questionable proposition. Whole websites and ecosystems could be synthetically generated pretty quickly. It’s not exactly a one-click turnkey deployment, but it is getting close to that level of content farming.

    I was willing to assume that people going to the trouble of making actual plugins for ChatGPT within the OpenAI platform are probably more interesting and maybe building actual tooling. For those of you using ChatGPT with the Plus subscription, you just have to open a new chat, expand the plugin area, and scroll down to the plugin store to search for new ones…

    I also did some searches for marketing tools. I’m still struck by the sheer amount of content being created and marketed to people. The flooding of content has not yet become so overwhelming that nobody can navigate the internet anymore, but we are getting very close to the point where a flood of new content could simply overwhelm everybody and everything. It would be like the explosion of ML/AI papers over the last 5 years, but maybe 10x or 100x even that digital content boom [3].

    Footnotes:

    [1] https://www.yabble.com/chatgpt-plugin

    [2] https://www.reuters.com/technology/openai-says-chatgpt-can-now-browse-internet-2023-09-27/ 

    [3] https://towardsdatascience.com/neurips-conference-historical-data-analysis-e45f7641d232 

    What’s next for The Lindahl Letter? 

    • Week 144: Knowledge graphs vs. vector databases
    • Week 145: Delphi method & Door-to-door canvassing
    • Week 146: Election simulations & Expert opinions
    • Week 147: Bayesian Models
    • Week 148: Running Auto-GPT on election models

    If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.

  • Building generative AI chatbots

    Thank you for tuning in to this audio only podcast presentation. This is week 141 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is, “Building generative AI chatbots.”

    You can feel the winds of change blowing, and with them the potential of people building out election expert opinion chatbots. Maybe you want to know what they are probably going to use to underpin that sort of effort. If you were going out to build some generative AI chatbots yourself, you might very well use one of the 5 systems we are going to dig into today.

    • Voiceflow – This system may very well be the most prominent of the quick to market AI agent building platforms [1]. I have chatbots deployed to both Civic Honors and my main weblog powered by Voiceflow.
    • LangFlow – You are going to need to join the waitlist for this one to get going [2]. I’m still on the waitlist for this one… 
    • Botpress – Like Voiceflow, this system lets you pretty quickly jump into the building process of actual chatbot workflows [3]. To be fair, I was not able to build and deploy something into production within minutes, but you could do it pretty darn quickly if you had a sense of what you were trying to accomplish. I built something on Botpress and it was pretty easy to use. After logging in I clicked “answer questions from websites” to create a bot. I added both Civic Honors and my main Nels Lindahl domain, and it advised me that the knowledge upload was complete. Publishing the bot is not as low friction as the Voiceflow embedding launch point, but it was not super hard to work with once you find the share button.
    • FlowiseAI – This is the first system on the list that will require you to get out of your web browser, stretch a bit, and open the command line to get it installed with a rather simple “npm install -g flowise” command [4]. I watched some YouTube videos on how to install this one and it almost got me to flip over to Ubuntu Studio. Instead of switching operating systems I elected to just follow the regular Windows installation steps.
    • Stack AI – With this one you are right back into the browser and you are going to see a lot of options to start building new projects [5].

    All of these systems, built around a variety of generative AI models, work within the same general theory of building. A conversation is crafted between a user and some type of exchange with a knowledge base. For the most part the underlying LLM facilitates the conversational part of the equation while some type of knowledge base is used to gate, control, and drive the conversation based on something deeper than what the LLM would output alone. It’s an interesting building technique and one that would not have been possible just a couple of years ago, but the times have changed and here we are in this brave new world where people can build, deploy, and be running a generative AI chatbot in a few minutes. It requires some planning about what is being built, some type of knowledge base, and the willingness to learn the building parameters. None of that is a very high bar to pass. This is a low friction and somewhat high reward space for creating conversational interactions.
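
    Stripped of any particular platform, the pattern those systems wire up looks roughly like the sketch below. This is a minimal illustration with a toy keyword retriever and a stub standing in for the real model call; production systems swap in embeddings and an actual LLM API.

        # The generic pattern: retrieve from the knowledge base first, then
        # let the LLM phrase the answer. Both helpers are simplified stand-ins.
        def call_llm(prompt: str) -> str:
            # Stub for a real model call (OpenAI API, local model, etc.).
            return "[model response conditioned on]\n" + prompt

        def retrieve(question: str, knowledge_base: list[str], k: int = 3) -> list[str]:
            # Toy keyword-overlap scoring; real platforms use vector similarity.
            words = set(question.lower().split())
            ranked = sorted(
                knowledge_base,
                key=lambda doc: len(words & set(doc.lower().split())),
                reverse=True,
            )
            return ranked[:k]

        def answer(question: str, knowledge_base: list[str]) -> str:
            context = "\n".join(retrieve(question, knowledge_base))
            prompt = (
                "Answer using only the context below. If the context does not "
                "cover the question, say so.\n\n"
                f"Context:\n{context}\n\nQ: {question}\nA:"
            )
            return call_llm(prompt)

        print(answer("What is Civic Honors?", ["Civic Honors is a community project.", "Chatbots answer questions."]))

    The gating lives in the prompt: the model is told to stay inside the retrieved context, which is exactly how these platforms keep the conversation anchored to your knowledge base.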

    Messing around with all these different chatbot development systems made me think a little bit more about how LangChain is being used and what the underlying technology is ultimately capable of facilitating [6]. To that end I signed up for the LangSmith beta they are building [7]. Sadly enough “LangSmith is still in closed beta” so I’m waiting on access to that one as well. 

    During the course of this last week I have been learning more and more about how to build and deploy chatbots that take advantage of LLMs and other generative AI technologies. I’m pretty sure that adding agency to machine learning models is going to strap rocket boosters onto the next stage of technological deployment. Maybe you are thinking that is hyperbole… don’t worry or panic, but you are very soon going to be able to ask these agents to do something and they will be able to execute more and more complex actions. That is the essence of agency within the deployment of these chatbots. It’s a very big deal in terms of people doing basic task automation and it may very well introduce a distinct change to how business is conducted by radically increasing productivity.

    Footnotes:

    [1] https://www.voiceflow.com/ 

    [2] https://www.langflow.org/ 

    [3] https://botpress.com/ 

    [4] https://flowiseai.com/ 

    [5] https://www.stack-ai.com/ 

    [6] https://www.langchain.com/ 

    [7] https://www.langchain.com/langsmith 

    What’s next for The Lindahl Letter? 

    • Week 142: Learning LangChain
    • Week 143: Social media analysis
    • Week 144: Knowledge graphs vs. vector databases
    • Week 145: Delphi method & door to door canvasing
    • Week 146: Election simulations

    If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.

  • Proxy models for elections

    Thank you for tuning in to this audio only podcast presentation. This is week 140 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is, “Proxy models for elections.”

    Sometimes a simplified model of something is easier to work with. We dug into econometric models back in week 136, and they can introduce a high degree of complexity. Even within the world of econometrics you can find information about proxy models. Today we are digging into proxy models for elections. My search was rather direct. I was looking for a list of proxy models being used for elections [1]. I was trying to dig into election forecasting proxy models or maybe even some basic two-step models. I even zoomed in a bit to see if I could get targeted on machine learning election proxy models [2].

    After a little bit of searching around it seemed like a good idea to consider what it takes to generate a proxy model equation to represent something. Earlier I had considered what the chalk model of election prediction would look like using a simplified proxy of voter registration as an analog for voting prediction. I had really thought that would end up being a highly workable proxy, but it was not entirely accurate.

    Here are 3 papers I looked at this week:

    Hare, C., & Kutsuris, M. (2022). Measuring swing voters with a supervised machine learning ensemble. Political Analysis, 1-17. https://www.cambridge.org/core/services/aop-cambridge-core/content/view/145B1D6B0B2877FC454FBF446F9F1032/S1047198722000249a.pdf/measuring_swing_voters_with_a_supervised_machine_learning_ensemble.pdf 

    Zhou, Z., Serafino, M., Cohan, L., Caldarelli, G., & Makse, H. A. (2021). Why polls fail to predict elections. Journal of Big Data, 8(1), 1-28. https://link.springer.com/article/10.1186/s40537-021-00525-8 

    Jaidka, K., Ahmed, S., Skoric, M., & Hilbert, M. (2019). Predicting elections from social media: a three-country, three-method comparative study. Asian Journal of Communication, 29(3), 252-273. http://www.cse.griet.ac.in/pdfs/journals20-21/SC17.pdf 

    I spent some time messing around with OpenAI’s GPT-4 on this topic. That effort narrowed things down to a few proxy models that are typically used. The top 10 seemed to be the following: social media analysis, Google Trends, economic indicators, fundraising data, endorsement counts, voter registration data, early voting data, historical voting patterns, event-driven factors, and environmental factors. Combining all 10 proxy models into a single equation would result in a complex, multivariable model. Here’s a simplified representation of such a model:

    E = α1(S) + α2(G) + α3(Ec) + α4(F) + α5(En) + α6(VR) + α7(EV) + α8(H) + α9(Ed) + α10(Ef) + β

    Where:

    • E is the predicted election outcome.
    • α1, α2, …, α10 are coefficients that determine the weight or importance of each proxy model. These coefficients would be determined through regression analysis or other statistical methods based on historical data.
    • S represents social media analysis.
    • G represents Google Trends data.
    • Ec represents economic indicators.
    • F represents fundraising data.
    • En represents endorsement count.
    • VR represents voter registration data.
    • EV represents early voting data.
    • H represents historical voting patterns.
    • Ed represents event-driven models.
    • Ef represents environmental factors.
    • β is a constant term.

    This equation is a linear combination of the proxy models, but in reality, the relationship might be non-linear, interactive, or hierarchical. The coefficients would need to be determined empirically, and the model would need to be validated with out-of-sample data to ensure its predictive accuracy. Additionally, the model might need to be adjusted for specific elections, regions, or time periods. It would be interesting to try to pull together the data to test that type of complex multivariable model. Maybe later on we can create a model with some agency designed to complete that task. 
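
    As a quick illustration of how those coefficients could be estimated, here is a hedged sketch that fits the linear form above with ordinary least squares on purely synthetic stand-in data; real work would substitute measured proxy values and observed outcomes per election.

        import numpy as np
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(0)
        n_elections = 200  # synthetic sample size

        # Columns stand in for S, G, Ec, F, En, VR, EV, H, Ed, Ef.
        X = rng.normal(size=(n_elections, 10))
        true_alpha = rng.normal(size=10)
        beta = 0.5
        y = X @ true_alpha + beta + rng.normal(scale=0.1, size=n_elections)

        model = LinearRegression().fit(X, y)
        print("estimated alphas:", model.coef_.round(2))
        print("estimated beta:", round(float(model.intercept_), 2))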

    Footnotes:

    [1] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C6&q=election+proxy+models&btnG=

    [2] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C6&q=election+proxy+models+machine+learning&btnG=

    What’s next for The Lindahl Letter? 

    • Week 141: Building generative AI chatbots
    • Week 142: Learning LangChain
    • Week 143: Social media analysis
    • Week 144: Knowledge graphs vs. vector databases
    • Week 145: Delphi method

    If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.

  • Machine learning election models

    Thank you for tuning in to this audio only podcast presentation. This is week 139 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is, “Machine learning election models.”

    This might be the year that I finally finish that book about the intersection of technology and modernity. During the course of this post we will look at the intersection of machine learning and election models. That could very well be a thin slice of the intersection of technology and modernity at large, but that is the set of questions that brought us here today. It’s one of the things we have been chasing along this journey. Oh yes, a bunch of papers exist related to this week’s topic of machine learning and election models [1]. None of them are highly cited. A few of them are in the 20s in terms of citation count, which suggests the academic community surrounding this topic is rather limited. Maybe the papers are written but just have not arrived in the world of publication yet. Given that machine learning has an active preprint landscape, that is unlikely.

    That dearth of literature is not going to stop me from looking at them and sharing a few that stood out during the search. None of these papers approaches the subject from the generative AI side of things; they are using machine learning without any degree of agency. Obviously, I was engaging in this literature review to see if I could find examples of deployed models with some type of agency doing analysis within this space of election prediction models. My searching over the last few weeks has not yielded anything super interesting. I was looking for somebody in the academic space doing some type of work within generative AI constitutions and election models, or maybe even some work in the space of rolling sentiment analysis for targeted campaign understanding. That is probably an open area of research that will be filled at some point.

    Here are 4 articles:

    Grimmer, J., Roberts, M. E., & Stewart, B. M. (2021). Machine learning for social science: An agnostic approach. Annual Review of Political Science, 24, 395-419. https://www.annualreviews.org/doi/pdf/10.1146/annurev-polisci-053119-015921 

    Sucharitha, Y., Vijayalata, Y., & Prasad, V. K. (2021). Predicting election results from Twitter using machine learning algorithms. Recent Advances in Computer Science and Communications (Formerly: Recent Patents on Computer Science), 14(1), 246-256. http://www.cse.griet.ac.in/pdfs/journals20-21/SC17.pdf

    Miranda, E., Aryuni, M., Hariyanto, R., & Surya, E. S. (2019, August). Sentiment Analysis using Sentiwordnet and Machine Learning Approach (Indonesia general election opinion from the twitter content). In 2019 International conference on information management and technology (ICIMTech) (Vol. 1, pp. 62-67). IEEE. https://www.researchgate.net/publication/335945861_Sentiment_Analysis_using_Sentiwordnet_and_Machine_Learning_Approach_Indonesia_general_election_opinion_from_the_twitter_content 

    Zhang, M., Alvarez, R. M., & Levin, I. (2019). Election forensics: Using machine learning and synthetic data for possible election anomaly detection. PloS one, 14(10), e0223950. https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0223950&type=printable 

    My guess is that we are going to see a wave of ChatGPT-related articles about elections after the 2024 presidential cycle. It will probably be one of those waves of articles where none of them really stands out or makes any serious contribution to the academy.

    The door is opening to a new world of election prediction and understanding efforts thanks to recent changes in both model agency and generative AI models that can evaluate and summarize very complex things. How they are applied going forward will make the biggest difference in how the use cases play out. Those use cases, by the way, are going to become very visible as the 2024 election comes into focus. The interesting part of the whole equation will be when people bring custom knowledge bases to the process to help fuel interactions with machine learning algorithms and generative AI.

    It’s amazing to think how rapidly things can be built. The older models of software engineering are now more of a history lesson than a primer on building things with prompt-based AI. Andrew Ng illustrated the rapidly shrinking build times in a recent lecture. You have to really decide what you want to build and deploy and make it happen. Ferris Bueller once said, “Life moves pretty fast.” Now code generation is starting to move even faster! You need to stop and look around at what is possible, or you just might miss out on the generative AI revolution.

    You can see Andrew’s full video here: https://www.youtube.com/watch?v=5p248yoa3oE 

    Footnotes:

    [1] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C6&q=Machine+learning+election+models&btnG= 

    What’s next for The Lindahl Letter? 

    • Week 140: Proxy models for elections
    • Week 141: Building generative AI chatbots
    • Week 142: Learning LangChain
    • Week 143: Social media analysis
    • Week 144: Knowledge graphs vs. vector databases

    If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.

  • Election prediction markets & Time-series analysis

    Thank you for tuning in to this audio only podcast presentation. This is week 138 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is, “Prediction markets & Time-series analysis.”

    We have been digging into elections for a few weeks now. You knew this topic was going to show up. People love prediction markets. They are really a pooled reflection of sentiment about the likelihood of something occurring. Right now the scuttlebutt of the internet is about LK-99, a maybe debunked, maybe possible room-temperature superconductor; people are betting on whether or not it will be replicated before 2025 [1]. You can read the 22-page preprint about LK-99 on arXiv [2]. My favorite article about why this would be a big deal if it lands was from Dylan Matthews over at Vox [3]. Being able to advance the transmission power of electrical lines alone would make this a breakthrough.

    That brief example aside, people can really dial into the betting markets for elections, which right now are not getting nearly the same level of attention as LK-99; that is probably accurate in terms of the general scale of possible impact. You can pretty quickly get to all the posts that the team over at 538 has tagged “betting markets,” and that is an interesting thing to scroll through [4]. Beyond that you could dig into an article from The New York Times forecasting what will happen to prediction markets in the future [5].

    You know it was only a matter of time before we moved from popular culture coverage to the depths of Google Scholar [6].

    Snowberg, E., Wolfers, J., & Zitzewitz, E. (2007). Partisan impacts on the economy: evidence from prediction markets and close elections. The Quarterly Journal of Economics, 122(2), 807-829. https://www.nber.org/system/files/working_papers/w12073/w12073.pdf

    Arrow, K. J., Forsythe, R., Gorham, M., Hahn, R., Hanson, R., Ledyard, J. O., … & Zitzewitz, E. (2008). The promise of prediction markets. Science, 320(5878), 877-878. https://users.nber.org/~jwolfers/policy/StatementonPredictionMarkets.pdf

    Berg, J. E., Nelson, F. D., & Rietz, T. A. (2008). Prediction market accuracy in the long run. International Journal of Forecasting, 24(2), 285-300. https://www.biz.uiowa.edu/faculty/trietz/papers/long%20run%20accuracy.pdf 

    Wolfers, J., & Zitzewitz, E. (2004). Prediction markets. Journal of Economic Perspectives, 18(2), 107-126. https://pubs.aeaweb.org/doi/pdf/10.1257/0895330041371321

    Yeah, you could tell by the title that a little bit of content related to time-series analysis was coming your way. The papers being tracked within Google Scholar related to election time-series analysis were not highly cited and, to my extreme disappointment, are not openly shared as PDF documents [7]. For those of you who are regular readers, you know that I try really hard to only share links to open access documents and resources that anybody can consume along their lifelong learning journey. Sharing links to paywalls and articles inside a gated academic community is not really productive for general learning.

    Footnotes:

    [1] https://manifold.markets/QuantumObserver/will-the-lk99-room-temp-ambient-pre?r=RWxpZXplcll1ZGtvd3NreQ

    [2] https://arxiv.org/ftp/arxiv/papers/2307/2307.12008.pdf

    [3] https://www.vox.com/future-perfect/23816753/superconductor-room-temperature-lk99-quantum-fusion

    [4] https://fivethirtyeight.com/tag/betting-markets/ 

    [5] https://www.nytimes.com/2022/11/04/business/election-prediction-markets-midterms.html

    [6] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C6&q=election+prediction+markets&btnG= 

    [7] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C6&q=election+time+series+analysis&oq=election+time+series+an 

    What’s next for The Lindahl Letter? 

    • Week 139: Machine learning election models
    • Week 140: Proxy models for elections
    • Week 141: Election expert opinions
    • Week 142: Door-to-door canvassing

    If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.

  • All that bad data abounds

    Thank you for tuning in to this audio only podcast presentation. This is week 119 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is, “All that bad data abounds.”

    Flooding and astroturfing abound at the moment. Both were happening before the advent of large language models (LLMs), but they have increased in frequency now that bad actors are able to just open the floodgates for content. Building out large swaths of the internet designed purely for search engine placement and self-referential boosting has become so much easier recently. Sure, all that bad data abounded before this shift, but OpenAI, Google, Microsoft, and Facebook all recently sharing out chat services has changed the scale of the problem.

    It’s one of those things that is hard to put into words on a page. Working with one of the chat systems to make content seems to trivialize the writing process. My day starts with an hour of focused academic work. That time is the fulfilled promise of decades of training that included a lot of hard work to get to this point. I can focus on a topic and work toward understanding it. All of that requires my focus and attention on something for that hour. Sometimes on the weekends I spend a couple of hours doing the same thing on a very focused topic. Those chat models with their large language model backends produce content within seconds. It’s literally a 1:60 ratio for output. It takes me an hour to produce what one of them creates within a minute, including the time for the user to enter the prompt.

    Maybe I did not expect this type of interaction to affect me in this way. It has made me question everything about my writing output and what exactly is going to happen now. The door has been flung open to the creation of content. Central to that problem is the reality that the careful curation of content within academics and the publish-first curation of the media are going to get flooded. Both systems are going to get absolutely overloaded with submissions. Something has to give based on the amount of attention that exists. Nobody is minting any new capacity for attention, and the channels for grabbing that attention are relatively limited. The next couple of years are going to be a mad scramble toward some sort of equilibrium between the competing forces of content curation and flooding.

    This really is something that I’m concerned about on an ongoing basis. Do all the books, photos, articles, and paintings from the before times just end up with a higher value weighting going forward? Will this AI revolution have cheapened the next generation of information delivery in ways we will not fully get to appreciate until the wave has passed us and we can see the aftermath of that scenario? Those questions are at the heart of what I’m concerned about. Selfishly, they are questions about the value and purpose of my own current writing efforts. More broadly, they are questions about the value of writing within our civil society as we work toward the curation of sharable knowledge. We all work toward that perfect possible future either with purpose or without it. Knowledge is built on the shoulders of the giants who came before us, adding to our collective understanding of the world around us. Anyone with access and an adventurous spirit can pick up the advancement of some very complex efforts to enhance the academy’s knowledge on a topic.

    Maybe I’m worried that the degree of flooding will flatten information so much that the ability to move things forward will diminish. Sorting, seeking, and trying to distill value from an oversupply of newly minted information may well create that diminishing effect. We will move from intellectual overcrowding in the academy to an overwhelming sea of derivative content marching along beyond any ability to constrain or consume it. I’m going to stop with that last argument, as it may be the best way to sum this up.

    Links and thoughts:

    Top 5 Tweets of the week:

    https://twitter.com/NateSilver538/status/1650899579234140168

    https://twitter.com/verge/status/1651355922751422464

    What’s next for The Lindahl Letter? 

    • Week 120: That one with an obligatory AI trends post
    • Week 121: Considering an independent study applied AI syllabus
    • Week 122: Will AI be a platform or a service?
    • Week 123: Considering open source AI
    • Week 124: Profiling OpenAI 

    If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.

  • Twitter as a company probably would not happen today

    Thank you for tuning in to this audio only podcast presentation. This is week 108 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is, “Twitter as a company probably would not happen today.”

    Part of what made Twitter so interesting is the diversity of argument and the town hall nature of it being the first place things show up in the feed. I’m not sure any other company or social platform could attract the same mass of hyperactive content creators geared toward news and coverage of the moment. People are arguing, and I’m sure papers will soon arrive to describe the end of social media. This weekend I’m going to spend a bit of time reading a book by Robert Putnam of “Bowling Alone” fame called “The Upswing”.

    Putnam, R. D. (2000). Bowling alone: The collapse and revival of American community. Simon and Schuster.

    Putnam, R. D. (2020). The upswing: How America came together a century ago and how we can do it again. Simon and Schuster.

    It’s probably somewhere in the meta-analysis between social capital and social media that a compelling story exists about why Twitter as a company would not happen today. During the course of this analysis you are going to get two different lines of inquiry. First, I’ll consider the nature of Twitter and a few books related to it and Silicon Valley in general. Second, we will dig into some of the AI and sentiment analysis scholarly work related to that field of study to help keep the writing trajectory for the year on track.

    Books have arrived to tell the stories of what happened in Silicon Valley. A lot of unlikely things happened within the borders of the space described as Silicon Valley. Some of them will be a part of business courses for decades to come. It truly is interesting that so much creativity and output happened in such a relatively small area.

    Three of the books that I have enjoyed are listed below.

    Bilton, N. (2014). Hatching Twitter: A true story of money, power, friendship, and betrayal. Penguin.

    Frier, S. (2021). No filter: The inside story of Instagram. Simon and Schuster.

    Wiener, A. (2020). Uncanny valley: A memoir. MCD.

    You can zoom out a bit and grab some classic Silicon Valley reading like:

    Isaacson, W. (2014). The innovators: How a group of inventors, hackers, geniuses and geeks created the digital revolution. Simon and Schuster.

    A lot of scholars over the years have focused their attention on Twitter for a variety of purposes. You can imagine that my interest and the interest of those scholars overlap around the ideas of AI and sentiment analysis. Digital agents abound within the Twitter space and some of them are doing some type of sentiment analysis with what scholars are identifying as artificial intelligence. That second part of the equation makes me a little bit skeptical about the totality of the claims being made. We will jump right into the deep end of Google Scholar on this one anyway [1].

    Papers from a search for “Sentiment analysis Twitter artificial intelligence” [2]:

    Kouloumpis, E., Wilson, T., & Moore, J. (2011). Twitter sentiment analysis: The good the bad and the omg!. In Proceedings of the international AAAI conference on web and social media (Vol. 5, No. 1, pp. 538-541). https://ojs.aaai.org/index.php/ICWSM/article/download/14185/14034 

    Ghiassi, M., Skinner, J., & Zimbra, D. (2013). Twitter brand sentiment analysis: A hybrid system using n-gram analysis and dynamic artificial neural network. Expert Systems with applications, 40(16), 6266-6282. 

    Giachanou, A., & Crestani, F. (2016). Like it or not: A survey of Twitter sentiment analysis methods. ACM Computing Surveys (CSUR), 49(2), 1-41. https://arxiv.org/pdf/1601.06971.pdf

    Alsaeedi, A., & Khan, M. Z. (2019). A study on sentiment analysis techniques of Twitter data. International Journal of Advanced Computer Science and Applications, 10(2). https://www.researchgate.net/profile/Abdullah-Alsaeedi/publication/331411860_A_Study_on_Sentiment_Analysis_Techniques_of_Twitter_Data/links/5c78175ba6fdcc4715a3d664/A-Study-on-Sentiment-Analysis-Techniques-of-Twitter-Data.pdf 

    I had considered some evaluation of searches for both “opinion mining Twitter artificial intelligence” and “artificial intelligence analysis of public attitudes” [3][4]. It’s possible some papers from both of those searches show up later. Generally, all of that argument and content breaks down into two camps of intelligence gathering: advertising and general opinion mining geared at understanding sentiment. One divergent thread of research off of those two would be the efforts to identify fake or astroturfed content. You can imagine that flooding fake or astroturfed content could change the dynamic for either advertising or sentiment analysis. Advertising to a community of bots is a rather poor use of scarce resources.

    Links and thoughts:

    Top 5 Tweets of the week:

    https://twitter.com/nelslindahl/status/1620898468964478978

    Footnotes:

    [1] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C6&q=twitter+artificial+intelligence&oq=twitter+artif  

    [2] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C6&q=Sentiment+analysis+Twitter+artificial+intelligence&btnG= 

    [3] https://scholar.google.com/scholar?q=Opinion+mining+Twitter+artificial+intelligence&hl=en&as_sdt=0&as_vis=1&oi=scholart 

    [4] https://scholar.google.com/scholar?hl=en&as_sdt=0,6&qsp=4&q=artificial+intelligence+%22analysis+of+public+attitudes%22&qst=ir 

    What’s next for The Lindahl Letter? 

    • Week 109: Robots in the house
    • Week 110: Understanding knowledge graphs
    • Week 111: Natural language processing 
    • Week 112: Autonomous vehicles
    • Week 113: Structuring an introduction to AI ethics

    If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the year ahead.