Managing all those backup files

Today was a day where some old file sorting happened. On a regular basis the number of files I’m using is fairly limited. My backup has built up far more files over the years than I actually use, and I seriously wonder about that sometimes. This massive collection of files stored in the cloud just hangs out and does nothing really. I’m not entirely sure why I still keep it all and back it up with such care. One of the strategies that I have been considering is just setting up a new folder, putting only the files that really matter to me in it, and then at some point allowing the rest to vanish from existence. To be fair about the whole thing I’m guessing that some of those files are not necessary and probably should not have been saved in the first place. The degree to which my digital pack-rat-ness was effective is somewhat astonishing at this point.

Sadly, I’m not the only one with massive collections of archived documents in a variety of clouds. Something is going to happen to all these clouds as people end up abandoning their files over time. I have a plan for my main cloud account where without any action on my part for 90 days the files are shared out. I’m not sure exactly what the people who are slated to get access to these files will end up doing with them and it might just be an overwhelming pile of digital artifacts. Within the grand aggregate of cloud files I’m really wondering about how many of them are needed or if we have just created a reality where data center after data center is busy keeping records of nothing really. That is a thought related to both data permanence and necessity. While I cannot bring myself to just hit delete on all the files and move along, I’m sure that is the inevitable outcome unless we learn how to store files on crystals. At that point, it won’t matter how much data people want to store forever; it will be possible. The bigger question will be if anybody ever does anything with all those stored files on what I’m sure will be a mountain-like pile of data storage crystals.

One of the things I have started doing within my more academic side writing pursuits is finding ways to publish and store my work in places outside of my weblog or my cloud storage. That public type of sharing is an effort to help the writing stand the test of time in a better way. It is an attempt to achieve some type of data permanence based on making the content accessible. My weblog is scraped by the Wayback Machine which does mean that my writings here are generally backed up beyond the ways that I back them up. If you wanted to, you could scroll across the years and see the various weblog styles and other elements that go back a long time within that archive. That is one of the ways that the internet itself is backed up and accessible to people interested in that type of archive. I’m always curious about the freshness of a weblog and nostalgic browsing of things from 20 years ago does not really appeal to me at the moment.

A slow start to the week

Today it took about an hour to settle into a headspace where writing was going to happen. I’m not sure why it took so long to begin to accept the blank page. Sometimes that happens. My thoughts were all over the place and getting to a point where focusing was possible actually took that hour of wandering around the internet. Now I’m wondering a bit more about what things should be set as a priority this week. It feels like it might be a week where a lot of things can get done. That is probably good. Some time ago I set a 5 year writing plan for creating content and working down some different academic trajectories. That effort is still underway and my writing streak for publishing The Lindahl Letter as part of that remains unbroken. As of right now that effort will span at least 87 weeks based on the content in queue for distribution. One more post is written but has not been recorded. Unfortunately, for the last 7 days my voice has been wrecked from some kind of cold or allergies.

It has been a long time since something stopped me from being able to talk. On the brighter side of things it did open the door to more listening. Talking a ton was not really going to happen. I was able to keep moving along without any real disruption. During that time I even did some game planning and put together some changes in the overall plan from week 87 to 104. Sometimes it is really helpful to sit down and consider where things are going. Knowing your writing trajectory and the associated research trajectory are important parts of planning. You cannot work on everything at one time and figuring out where you want to put your attention is an important part of the equation. When you are going to plan out a few years of effort at a time then you want to make sure to get it right and cover the right content in the right order. My interests at the start focused in on learning about machine learning and conducting weekly research into topics to dive deeper. That has been ongoing for 87 weeks. I’m in a position now where based on that depth and breadth of knowledge I’m able to really start producing various types of research within those subjects.

I have learned how to do typesetting in Overleaf with LaTeX which at first was very frustrating, but after several hours of learning is now possible. Maybe working toward the production of a bunch of different journal type articles or research notes is the right way to go and we will see if that is what happens when my research pivots from machine learning to generally a study of artificial intelligence.

A few research trajectory notes

Today I finished working on the main content for week 88 of The Lindahl Letter. That one is a bridge piece between two sets of more academic side efforts. I went from working on introductory syllabus to starting to prepare a bit for the more advanced set of content. Initially, I had considered making the advanced versions a collection of research notes that were built around very specific and focused topics. That is entirely a path that might be taken after the 104th week. The packaging on the content instead will be put into another companion syllabus to allow an introductory look and a more advanced topic follow up for people looking for a bit more machine learning content. Functionally those two documents put together will be the summation of 104 weeks of my efforts in the machine learning space. It is the book end to my journey into really diving deep into machine learning and studying it every weekend and a lot of weekdays.

After two years of digging into the machine learning space I’m going to pivot over and focus on writing and studying artificial intelligence in general for year 3 of The Lindahl Letter. It should be a fun departure and hopefully it will mix things up a little bit with a broader collection of literature. A lot of people talk about deploying AI in the business world and almost all of that conjecture is entirely based on deploying a machine learning model into a production environment. When those same people deploy an actual AI product they will hopefully see the difference.

A few updates on word processing

I thought it would be fun to get some logos made for The Lindahl Letter. After updating the banner logo on Substack I realized that it broke the link for all previous banner logo posts. Fixing that mistake required updating about 40 posts to include the new banner logo one at a time. I’m guessing that the way the Substack database stores the background banner needs some type of update to prevent this type of previous image link breakdown. It should probably contain a warning at the very least that says if you update this image you are going to break all the older posts that reference the previous image. On the brighter side of that problem nobody really seemed to notice the broken links. I’m probably the only one that immediately goes and checks the Substack site after things are published to make sure nothing went wrong.

It took a couple weeks of working on making the switch from Google Docs over to Microsoft Word. Currently, in that journey I’m now doing ok working out of the desktop application for Microsoft Word. I have a document setup for writing daily content and one for Substack posts. Both of those documents can be accessed from the Office 365 interface as well if necessary. I just have had a really hard time adjusting to the online version of Microsoft Word. It’s just not as usable as Google Docs. All those recent articles about Google mining my writing were enough to get me to make the switch. We will see how long this technology shift lasts and I’ll provide some updates along the way. It’s entirely possible at some point I’ll just write academic articles in Overleaf and not use either of the word processing systems. I’m wondering how many academic writers just work out of the LaTeX editor. For that syllabus PDF creation effort, I created the content outside of Overleaf and just used it for typesetting of the content.

Most of the time my writing efforts are about creating something in one application and then moving it somewhere else for distribution. That in and of itself is an interesting and probably unnecessary process. I’m not sure exactly why I have not just moved to creating the content in the place where it will get published.

Machine learning approaches (ML syllabus edition 4/8)

Thank you for tuning in to this audio only podcast presentation. This is week 83 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is, “Machine learning approaches (ML syllabus edition 4/8).”

During the last lecture we jumped in and looked at 10 machine learning algorithms. This week the content contained within this lecture will cover reinforcement learning and 3 related types of learning from a machine learning perspective. Those 3 types are the general use case of supervised learning, unsupervised learning, and the super interesting semi-supervised learning. Like the model for consideration used in the last lecture I’ll cover the topics in general and provide links to papers covering the topic to allow people looking for a higher degree of depth to dive deeper into academic papers to achieve that goal. My general preference here is to find academic papers that are both readable and are generally available for you to actually read with very low friction. Within the machine learning and artificial intelligence space a lot of papers are generally available and that is great for literature reviews and generally for scholarly work and practitioners working to implement the technology. My perspective is a mix between those two worlds which could be defined as a pracademic view of things. All right; here we go.

Reinforcement learning – Welcome to the world of machine learning. This is probably the first approach you are going to learn about in your journey. That’s right, it’s time to consider for a brief moment the world of reinforcement learning. You are probably going to need to start to create some intelligent agents and you will want to figure out how to maximize the reward those agents could get. One method of achieving that result is called reinforcement learning. A lot of really great tutorials exist trying to explain this concept and one that I enjoyed was from Towards Data Science way back in 2018 [1]. The nuts and bolts of this one involve trial and error with an intelligent agent trying to learn from mistakes using a maximization of reward function to avoid going down paths that don’t offer greater reward. The key takeaway here is that during the course of executing a model or algorithm a maximization function based on reward has to be in place to literally reinforce maximization during learning. I’m sharing references and links to 4 academic papers about this topic to help you dig into reinforcement learning with a bit of depth if you feel so inclined. 

Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of artificial intelligence research, 4, 237-285. https://www.jair.org/index.php/jair/article/view/10166/24110 

Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning. https://login.cs.utexas.edu/sites/default/files/legacy_files/research/documents/1%20intro%20up%20to%20RL%3ATD.pdf 

Szepesvári, C. (2010). Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1), 1-103. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.308.549&rep=rep1&type=pdf 

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. https://arxiv.org/pdf/1312.5602.pdf 
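
To make the trial and error idea a little more concrete, here is a minimal sketch of tabular Q-learning, which is one common reinforcement learning method and not the only one. The corridor environment, the function names, and every parameter value are assumptions made up for illustration; none of them come from the papers above. The agent starts at the left end of a short corridor and only gets a reward when it reaches the right end.

```python
import random

def train_corridor_agent(n_states=5, episodes=200, alpha=0.5,
                         gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor (hypothetical environment).

    The agent starts in state 0 and receives a reward of 1 only when it
    reaches the last state. q[state][action] holds the learned estimates,
    where action 0 means "left" and action 1 means "right".
    """
    rng = random.Random(seed)
    moves = (-1, +1)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != n_states - 1:
            # Trial and error: explore occasionally, otherwise exploit.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + moves[action], 0), n_states - 1)
            reward = 1.0 if next_state == n_states - 1 else 0.0
            # Nudge the estimate toward reward plus discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

def greedy_policy(q):
    """Best known action for every non-terminal state."""
    return ["left" if left > right else "right" for left, right in q[:-1]]

q_table = train_corridor_agent()
print(greedy_policy(q_table))
```

The update line is where the reinforcement happens: actions that lead toward reward get larger estimates, so the greedy choice drifts toward the rewarding path over repeated episodes.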

Supervised learning – You knew it would only be a matter of time before we went out to some content from our friends over at IBM [2]. They describe a world where you have some labeled datasets and are training an algorithm to engage in classification or perhaps regression, but probably classification. In some ways the supervised element here is the labeling and guiding of the classification. Outside of somebody or a lot of people sitting and labeling training data, the supervision is not from somebody outright sitting and watching the machine learning model run step by step. Some ethical considerations need to be taken into account at this point. A lot of people have worked to engage in data labeling. A ton of services exist to help bring people together to help do this type of work. Back in 2018 Maximilian Gahntz published a piece in Towards Data Science that talked about the invisible workers that are doing all that labeling in large curated datasets [3]. Within the world of supervised learning being able to get high quality labeled data really impacts the ability to make solid models. It’s our ethical duty as researchers to consider what that work involves and who is doing that work. Another article in the MIT Technology Review back in 2020 covered the idea of how gig workers are powering a lot of this labeling [4]. The first academic article linked below with Saiph Savage as a co-author will cover the same topic and you should consider giving it a read to better understand how machine learning is built from dataset to model. After that article, the next two are general academic articles about predicting good probabilities and empirical comparisons to help ground your understanding of supervised learning.

Hara, K., Adams, A., Milland, K., Savage, S., Callison-Burch, C., & Bigham, J. P. (2018, April). A data-driven analysis of workers’ earnings on Amazon Mechanical Turk. In Proceedings of the 2018 CHI conference on human factors in computing systems (pp. 1-14). https://arxiv.org/pdf/1712.05796.pdf 

Niculescu-Mizil, A., & Caruana, R. (2005, August). Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning (pp. 625-632). https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.7135&rep=rep1&type=pdf 

Caruana, R., & Niculescu-Mizil, A. (2006, June). An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd international conference on Machine learning (pp. 161-168). http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf 
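
A tiny illustration may help show where the supervision actually lives. The sketch below is a nearest-centroid classifier, chosen only because it is short; it is not taken from the IBM page or the papers above, and the feature values and the "cat" and "dog" labels are made up. The only thing guiding the model is the list of human-provided labels.

```python
def fit_centroids(features, labels):
    """Average the feature vectors for each label.

    The labels are the supervision: without them there is nothing
    to tell the model which points belong together.
    """
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda y: squared_distance(centroids[y]))

# Hypothetical labeled training data: two made-up numeric features per example.
train_x = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
train_y = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = fit_centroids(train_x, train_y)
print(predict(model, [1.0, 1.0]), predict(model, [4.0, 4.0]))
```

If the labels in train_y are wrong or inconsistent, the centroids move and so do the predictions, which is one small way to see why label quality matters so much.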

Unsupervised learning – It’s a good thing that you were paying very close attention to the explanation of supervised learning above. Imagine that the humans or in some cases the vast collectives of humans labeling training sets just stopped doing that. Within the unsupervised learning world the classification within the machine learning problem space is going to be handled differently. Labeling and the creation of classification has to be a part of the modeling methodology. This topic always makes me think of the wonderful time capsule of a technology show about startups called Silicon Valley (2014 to 2019) that was broadcast by HBO. They had an algorithm explained at one point as being able to principally identify food as hot dog or not hot dog. That’s it; the model could only do the one task. It was not capable of correctly identifying all food as that is a really complex task. Trying to use unsupervised learning, for example based on tags and other information, to identify different types of food in photographs is something that people have certainly done. I’m only sharing one paper about this approach and it’s from 2001.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine learning, 42(1), 177-196. https://link.springer.com/content/pdf/10.1023/A:1007617005950.pdf 

Semi-supervised learning – All 3 of these different types of learning (supervised, unsupervised, and semi-supervised) are related. They are different methods of attacking a problem space related to learning as part of the broader landscape of machine learning. You can imagine that people wanted to try to create a hybrid model where a limited set of labeled data is used to help begin the modeling process. That is the essence of the process of building out a semi-supervised learning approach [5]. I’m sharing 3 different academic papers related to this topic that cover a literature review, a book about it, and the more advanced topic of pseudo labeling.

Zhu, X. J. (2005). Semi-supervised learning literature survey. https://minds.wisconsin.edu/bitstream/handle/1793/60444/TR1530.pdf?sequence=1 

Chapelle, O., Scholkopf, B., & Zien, A. (2009). Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks, 20(3), 542-542. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4787647 

Lee, D. H. (2013, June). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML (Vol. 3, No. 2, p. 896). https://www.kaggle.com/blobs/download/forum-message-attachment-files/746/pseudo_label_final.pdf 
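
The hybrid idea can be sketched in a few lines. What follows is a simplified self-training loop in the spirit of pseudo labeling, and it is an assumption-heavy toy: it works on 1-D numbers with exactly two classes and a nearest-mean rule, whereas the Lee (2013) paper applies the idea to deep neural networks. The function names, the margin parameter, and the data are all hypothetical.

```python
def class_means(points, labels):
    """Mean of the points belonging to each label."""
    means = {}
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        means[label] = sum(members) / len(members)
    return means

def pseudo_label(labeled_x, labeled_y, unlabeled_x, margin=1.0, rounds=5):
    """Self-training on 1-D data with exactly two classes.

    Each round fits class means on everything labeled so far, then adopts
    the predicted label for any unlabeled point whose distances to the two
    means differ by at least `margin` (a stand-in for model confidence).
    """
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(rounds):
        means = class_means(xs, ys)
        (a, mean_a), (b, mean_b) = sorted(means.items())
        remaining = []
        for p in pool:
            gap = abs(p - mean_a) - abs(p - mean_b)
            if abs(gap) >= margin:
                xs.append(p)
                ys.append(b if gap > 0 else a)  # closer to b when gap > 0
            else:
                remaining.append(p)  # not confident enough yet
        if len(remaining) == len(pool):
            break  # nothing new was labeled, so stop early
        pool = remaining
    return class_means(xs, ys), pool

# Two hand-labeled points and a handful of unlabeled ones (made-up data).
means, still_unlabeled = pseudo_label(
    [0.0, 10.0], ["low", "high"], [1.0, 2.0, 8.0, 9.0, 5.2]
)
print(means, still_unlabeled)
```

Only two points were labeled by hand, yet the final class means are shaped by six points. The ambiguous value in the middle is left unlabeled instead of being forced into a class.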

Conclusion – This lecture covered reinforcement learning along with supervised, unsupervised, and semi-supervised learning. You could spend a lot of time digging into academic articles and books related to these topics. Generally, I believe you will start to want to look at use cases and direct your attention to highly specific examples of applied machine learning at this point. Fortunately, a lot of those papers exist and you won’t be disappointed.

Links and thoughts:

“[ML News] This AI completes Wikipedia! Meta AI Sphere | Google Minerva | GPT-3 writes a paper”

Footnotes:

[1] https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292 

[2] https://www.ibm.com/cloud/learn/supervised-learning#toc-unsupervis-Fo3jDcmY 

[3] https://towardsdatascience.com/the-invisible-workers-of-the-ai-era-c83735481ba

[4] https://www.technologyreview.com/2020/12/11/1014081/ai-machine-learning-crowd-gig-worker-problem-amazon-mechanical-turk/

[5] https://towardsdatascience.com/supervised-learning-but-a-lot-better-semi-supervised-learning-a42dff534781 

Research Note:

You can find the files from the syllabus being built on GitHub. The latest version of the draft is being shared by exports when changes are being made. https://github.com/nelslindahlx/Introduction-to-machine-learning-syllabus-2022

What’s next for The Lindahl Letter?

  • Week 84: Neural networks (ML syllabus edition 5/8)
  • Week 85: Neuroscience (ML syllabus edition 6/8)
  • Week 86: Ethics, fairness, bias, and privacy (ML syllabus edition 7/8)
  • Week 87: MLOps (ML syllabus edition 8/8)
  • Week 88: The future of publishing

I’ll try to keep the what’s next list forward looking with at least five weeks of posts in planning or review. If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.

Considering a bit of considering

Right now, at this very moment, I’m working on Substack post 88 of 104 and starting to consider moving that effort out of Google Docs and over to Microsoft Word as well. At the moment, that move has not happened just yet as I’m still writing some of that content on my Chromebook and I’m not a big fan of the Office 365 online version of Microsoft Word. I really got used to using Google Docs and the experience it provides. Now I’m writing out of the desktop application while sitting at my main computer running Windows. That means that for the most part my writing has become something that happens at my desk in my office and not on the go or anywhere outside of my desk. Given that most of the writing I end up doing happens in the morning and is during my scheduled blocks of time that works out well enough. Right now, I’m pretty far ahead of the Friday publishing schedule again. Week 82 just went live and the next 5 are ready to go already. That means that the entire introduction to machine learning syllabus is now draft complete. I do plan on going back and reading it again from start to finish and doing any final edits.

Probably the principal thing that is keeping me from moving completely out of Google Docs and over to Microsoft Word is that the final version of The Lindahl Letter that gets published does not contain the Tweets of the week or the links to things that are included in the Substack newsletter. When I go back to format the content for final publication, I have been removing those two sections. That is certainly something that could be augmented moving forward where any links or content being shared is put into the main body of the post to avoid having to use those two links only sections or I could just include them in the final product as well. Based on the statistics I have available to me it does not appear like that content is really consumed very much by people. People tend to read the prose at the top of the post and are not opening the email to see what Tweets I have enjoyed the most that week. Maybe the reason those got included was purely indulgent on my part which is interesting as an aside to consider.

After finishing up that syllabus I’m interested in working on some more research note type efforts where I’m really digging into the relevant scholarly articles as well as covering topics within the machine learning space. That is the goal of my Substack efforts moving forward. Of course, I broke that trajectory with my first set of writing efforts from the week 88 content. I’m probably going to need to reconsider the topics listed from 88 to 104 to make sure that they are ones that could support solid research notes. I’m not sure if they will end up getting converted over to Overleaf and eventually published that way, but that would be the general idea of what needs to happen moving forward.

ML algorithms (ML syllabus edition 3/8)

Thank you for tuning in to this audio only podcast presentation. This is week 82 of The Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for The Lindahl Letter is, “ML algorithms (ML syllabus edition 3/8).”

Welcome to the lecture on ML algorithms. This topic was held until the 3rd installment of this series to allow a foundation for the concept of machine learning to develop. At some point, you are going to want to operationalize your knowledge of machine learning to do some things. For the vast majority of you one of these ML algorithms will be that something. Please take a step back and consider this very real scenario. Within the general scientific community getting different results every time you run the same experiment makes publishing difficult. That does not stop authors in the ML space. Replication and the process of verifying scientific results is often difficult or impossible without similar setups and the same datasets. Within the machine learning space where a variety of different ML algorithms exist that is a very normal outcome. Researchers certainly seem to have gotten very used to getting a variety of results. I’m not talking about using post theory science to publish based on allowing the findings to build knowledge instead of the other way around. You may very well get slightly different results every time one of these ML algorithms is invoked. You have been warned. Now let the adventure begin. 
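
One common source of those run to run differences is randomness in how data gets shuffled and split. The sketch below is a made-up toy experiment, not a real training pipeline: the function name, the noisy threshold rule, and the numbers are all assumptions. It shows that fixing the seed of the random number generator makes the run repeatable, while leaving it unset lets the result drift between runs.

```python
import random

def run_experiment(seed=None):
    """Toy 'experiment': noisy labels, a random split, and a fragile rule.

    With seed=None the generator is seeded from system entropy, so the
    score can change from run to run. With a fixed seed it cannot.
    """
    rng = random.Random(seed)
    data = [(x, x >= 50) for x in range(100)]
    # Flip roughly 10% of the labels to simulate labeling noise.
    noisy = [(x, label if rng.random() > 0.1 else not label) for x, label in data]
    rng.shuffle(noisy)
    train, test = noisy[:70], noisy[70:]
    positives = [x for x, label in train if label]
    threshold = min(positives, default=50)  # deliberately sensitive to noise
    correct = sum((x >= threshold) == label for x, label in test)
    return correct / len(test)

print(run_experiment(seed=7), run_experiment(seed=7))  # identical scores
print(run_experiment(), run_experiment())              # may differ
```

Seeding only removes one kind of variation. Differences in hardware, library versions, and datasets are outside what this sketch covers.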

One of the few Tweets that really made me think about the quality of ML research papers and the research patterns impacting quality was from Yaroslav Bulatov who works on the PyTorch team back on January 22, 2022. That tweet referenced a paper on ArXiv called, “Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers,” from 2021 [1]. 

That paper digs into the state of things where hundreds of optimization methods exist. It pulls together a really impressive list. The list itself was striking just in the volume of options available. My next thought was about just how many people are contributing to this highly overcrowded field of machine learning. That paper about deep learning optimizers covered a lot of ground and would be a good place to start digging around. We are going to approach this a little differently based on a look at the most common ones. 

Here are some (10) very common ML algorithms (this is not intended to be an exhaustive list):

  1. XGBoost
  2. Naive Bayes algorithm
  3. Linear regression
  4. Logistic regression
  5. Decision tree
  6. Support Vector Machine (SVM) algorithm
  7. K-nearest neighbors (KNN) algorithm
  8. K-means
  9. Random forest algorithm
  10. Diffusion

I’m going to talk about each of these algorithms briefly or this would be a very long lecture. We certainly could go all hands and spend several hours all in together in a state of irregular operations covering these topics, but that is not going to happen today. To make this a more detailed syllabus version of the lecture I’m going to include a few references to relevant papers you can get access to and read after each general introduction. My selected papers might not be the key paper or the most cited. Feel free to make suggestions if you feel a paper better represents the algorithm. I’m open to suggestions. 

XGBoost – Some people would argue with a great deal of passion that we could probably be one and done after introducing this ML algorithm. You can freely download the package for this one [2]. It has over 20,000 stars on GitHub and has been forked over 8,000 times [3]. People really seem to like this one and have used it to win competitions and generally get great results. Seriously, you will find references to XGBoost all over these days. It has gained a ton of attention and popularity. Not exactly to the level of being a pop culture reference, but within the machine learning community it is well known. The package is based on gradient boosting and provides parallel tree boosting (GBDT, GBM). This package generally builds a series of trees in sequence where each new tree works to correct the errors of the ones before it, and it includes regularization to help limit overfitting. You can read a paper from 2016 about it on arXiv called, “XGBoost: A Scalable Tree Boosting System” [4]. The bottom line on this one is that you get a lot of benefits from gradient boosting built into a software package that can get you moving quickly toward your goal of success.

Chen, T., & Guestrin, C. (2016, August). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794). https://dl.acm.org/doi/pdf/10.1145/2939672.2939785 

Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., & Chen, K. (2015). Xgboost: extreme gradient boosting. R package version 0.4-2, 1(4), 1-4. https://cran.microsoft.com/snapshot/2017-12-11/web/packages/xgboost/vignettes/xgboost.pdf 
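
To show the core boosting idea without the package, here is a minimal sketch of gradient boosting for squared error using one-split trees (stumps) on 1-D data. This is not XGBoost and does not use its API; it leaves out the regularization, second-order information, and parallelism that the real package provides. All names, parameters, and data below are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals.

    Returns (threshold, left_value, right_value). Assumes at least two
    distinct x values so both sides of every candidate split are non-empty.
    """
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Each round fits a stump to what the current model still gets wrong."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared error the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x < threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x < threshold else right for threshold, left, right in stumps
    )

# Made-up step-shaped data: low values on the left, high on the right.
model = boost([1, 2, 3, 4, 5, 6], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(round(boosted_predict(model, 2), 3), round(boosted_predict(model, 5), 3))
```

The learning rate shrinks each tree's contribution so that no single tree dominates, which is one of the simpler levers for keeping a boosted model from chasing noise.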

Naive Bayes algorithm – You knew I would have to have something Bayes related near the top of this list. This one is a type of classifier that evaluates the probability of each class given the features that were observed. The class with the highest probability will be considered the most likely class. It also assumes that the features are independent of each other given the class, which is where the naive part of the name comes from. I found a paper on this one that was cited about 4,146 times called, “An empirical study of the naive Bayes classifier” [5].

Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46). https://www.researchgate.net/profile/Irina-Rish/publication/228845263_An_Empirical_Study_of_the_Naive_Bayes_Classifier/links/00b7d52dc3ccd8d692000000/An-Empirical-Study-of-the-Naive-Bayes-Classifier.pdf 
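
Here is a minimal sketch of a naive Bayes classifier for categorical features, assuming add-one (Laplace) smoothing so that unseen feature values do not produce zero probabilities. The weather style data and the "out" and "in" labels are made up, and the function names are hypothetical. The independence assumption shows up as a plain sum of per-feature log probabilities.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count how often each class and each (class, feature, value) occurs."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    vocab = defaultdict(set)             # feature index -> values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab

def classify(model, row):
    """Return the class with the highest log posterior score."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Independence assumption: each feature contributes separately.
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(vocab[i])
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up data: (sky, temperature) -> where the day was spent.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "hot"), ("rainy", "cool")]
labels = ["out", "out", "in", "in", "out", "in"]

nb_model = train_naive_bayes(rows, labels)
print(classify(nb_model, ("sunny", "hot")), classify(nb_model, ("rainy", "cool")))
```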

Linear regression – This is the most basic algorithm and statistical technique in use here: a straight line is fit to the data so that the relationship between two things can be charted and used for prediction. A lot of the graphics you will see where points are plotted on a chart with a line running through the general middle of the distribution are potentially using some form of linear regression.

Forkuor, G., Hounkpatin, O. K., Welp, G., & Thiel, M. (2017). High resolution mapping of soil properties using remote sensing variables in south-western Burkina Faso: a comparison of machine learning and multiple linear regression models. PloS one, 12(1), e0170478. https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0170478&type=printable 

Maulud, D., & Abdulazeez, A. M. (2020). A review on linear regression comprehensive in machine learning. Journal of Applied Science and Technology Trends, 1(4), 140-147. https://jastt.org/index.php/jasttpath/article/view/57/20 
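
For the simplest case of one input and one output, the best fit line has a closed form under ordinary least squares. The sketch below assumes that setup; the function name and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input variable.

    slope = covariance(x, y) / variance(x)
    intercept = mean(y) - slope * mean(x)
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that sit exactly on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(slope, intercept)
print(slope * 6 + intercept)  # prediction for a new x of 6
```

Real data will not sit exactly on a line. The same formulas then give the line that minimizes the sum of squared vertical distances to the points.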

Logistic regression – This type of statistical model allows an algorithmic analysis of the probability of success or failure. You could model other binary type questions. The good folks over at IBM have an entire set of pages set up to run through how logistic regression could be a tool to help with decision making [6]. This model is everywhere in simple analysis of things when people are trying to work toward a single decision. 

Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., & Van Calster, B. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of clinical epidemiology, 110, 12-22. https://www.researchgate.net/profile/Ewout-Steyerberg/publication/331028284_A_systematic_review_shows_no_performance_benefit_of_machine_learning_over_logistic_regression_for_clinical_prediction_models/links/5c66bed192851c1c9de3251b/A-systematic-review-shows-no-performance-benefit-of-machine-learning-over-logistic-regression-for-clinical-prediction-models.pdf 

Dreiseitl, S., & Ohno-Machado, L. (2002). Logistic regression and artificial neural network classification models: a methodology review. Journal of biomedical informatics, 35(5-6), 352-359. https://core.ac.uk/download/pdf/82131402.pdf 
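
A minimal sketch of the success or failure idea follows, assuming a single input feature and plain batch gradient descent on the log loss. The hours studied and pass or fail data are made up, and the learning rate and epoch count are arbitrary choices for this toy case and not recommendations.

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # gradient of log loss w.r.t. the score
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

# Made-up data: hours studied and whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
for h in (1, 4.5, 8):
    print(h, round(sigmoid(w * h + b), 2))
```

The output is a probability and not a hard label, so turning it into a single decision means picking a cutoff such as 0.5.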

Decision tree – Imagine diagramming decisions and coming to a fork where you have to decide to go one way or the other. That is how decision trees work based on inputs and corresponding outputs. Normally you will have a bunch of interconnected forks in the road and together they form up a decision tree. A lot of really great explanations of this exist online. One of my favorite ones is from Towards Data Science and was published way back in 2017 [7].

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms (pp. 0-13). Technical report, Department of Computer Science, Oregon State University. https://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.38.2702 
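
The forks in the road can be learned from data. Below is a minimal sketch of a decision tree learner for numeric features that picks each fork by minimizing weighted Gini impurity, which is one common criterion among several. The function names, the depth limit, and the temperature and rain data are assumptions for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when every label in the group is the same."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively choose the (feature, threshold) fork with the lowest impurity."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority  # leaf
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows})[1:]:
            left = [i for i, row in enumerate(rows) if row[feature] < threshold]
            right = [i for i, row in enumerate(rows) if row[feature] >= threshold]
            score = sum(len(side) * gini([labels[i] for i in side])
                        for side in (left, right))
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)
    if best is None:
        return majority  # no split possible, every row is identical
    _, feature, threshold, left, right = best
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth),
    }

def tree_predict(tree, row):
    """Follow the forks until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Made-up data: (temperature, is_raining) -> what to do.
weather = [[20, 0], [22, 0], [21, 1], [5, 0], [4, 1], [25, 1]]
choice = ["walk", "walk", "stay", "stay", "stay", "stay"]

tree = build_tree(weather, choice)
print(tree_predict(tree, [23, 0]), tree_predict(tree, [23, 1]), tree_predict(tree, [3, 0]))
```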

Support Vector Machine (SVM) algorithm – You are going to need to imagine graphing out a bunch of data points then trying to come up with a line that separates them with a maximum margin [8]. 

Noble, W. S. (2006). What is a support vector machine?. Nature biotechnology, 24(12), 1565-1567. https://www.ifi.uzh.ch/dam/jcr:00000000-7f84-9c3b-ffff-ffffc550ec57/what_is_a_support_vector_machine.pdf 

Wang, L. (Ed.). (2005). Support vector machines: theory and applications (Vol. 177). Springer Science & Business Media. https://personal.ntu.edu.sg/elpwang/PDF_web/05_SVM_basic.pdf 

Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and their applications, 13(4), 18-28. https://www.ifi.uzh.ch/dam/jcr:00000000-7f84-9c3b-ffff-ffffbdb9a74e/SVM.pdf 

K-nearest neighbors (KNN) algorithm – Our friends over at IBM are sharing all sorts of knowledge online including a bit about the KNN algorithm [9]. Apparently, the best commentary explaining this one comes from Sebastian Raschka back in the fall of 2018 [10]. This one is pretty much what you would expect from a technique that looks at distance between neighboring points: a new point gets the label that is most common among its k closest labeled neighbors. 
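
The distance idea fits in a few lines. This is a minimal sketch assuming a made-up set of labeled 2D points and plain Euclidean distance. The name knn_predict and the data are illustrative and are not taken from the IBM page or the Raschka notes.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points: (coordinates, label).
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((6, 6), "b"), ((7, 6), "b"), ((6, 7), "b"),
]

print(knn_predict(train, (1.5, 1.5)))  # a
print(knn_predict(train, (6.5, 6.5)))  # b
```

Notice that nothing is really trained here. All of the work happens at prediction time when the distances get computed.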

Peterson, L. E. (2009). K-nearest neighbor. Scholarpedia, 4(2), 1883. http://scholarpedia.org/article/K-nearest_neighbor 

Zhang, M. L., & Zhou, Z. H. (2005, July). A k-nearest neighbor based algorithm for multi-label classification. In 2005 IEEE international conference on granular computing (Vol. 2, pp. 718-721). IEEE. https://www.researchgate.net/profile/Min-Ling-Zhang-2/publication/4196695_A_k-nearest_neighbor_based_algorithm_for_multi-label_classification/links/565d98f408ae1ef92982f866/A-k-nearest-neighbor-based-algorithm-for-multi-label-classification.pdf 

K-means – Some algorithms work to find clusters and K-means is one of those. You can use it to group unlabeled data into k clusters, with each point assigned to the nearest cluster center, which can be helpful when you have no labels to start from. 
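
A minimal sketch of the usual assign-then-update loop looks like this. It assumes made-up 2D points and starts the centers from the first k points, which is a naive choice made only to keep the example deterministic. Real implementations tend to use smarter initialization.

```python
import math

def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]  # naive start: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # leave a center alone if nothing was assigned to it
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical unlabeled points that fall into two obvious groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]

centroids, clusters = kmeans(points, k=2)
print(centroids)
```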

Sinaga, K. P., & Yang, M. S. (2020). Unsupervised K-means clustering algorithm. IEEE access, 8, 80716-80727. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9072123 

Random forest algorithm – Most of the jokes that have been told within the machine learning space relate to decision trees. The field is not full of a lot of jokes, but trees falling in a random forest are often included in that branch. People really liked the random forest algorithm for a time. You can imagine that a bunch of trees are created, each from a random sample of the data, to engage in prediction or classification. For classification, each tree in the forest casts a vote and the class with the most votes becomes the winner. This is great as it could find a result that was novel or unexpected based on the randomness. 
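
Here is a heavily simplified sketch of the bunch-of-trees idea. It assumes made-up 2D data and uses one-fork trees (decision stumps) instead of full decision trees to keep things short. Each stump gets a random bootstrap sample and a random feature, and the forest takes a majority vote. All of the names and settings are illustrative.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, feature):
    """Find the single threshold on one feature with the fewest errors."""
    fallback = majority([y for _, y in rows])
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else fallback
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label

def fit_forest(rows, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample using one random feature."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        forest.append(fit_stump(sample, rng.randrange(n_features)))
    return forest

def predict(forest, x):
    """Every stump votes and the most common label wins."""
    votes = Counter(
        left if x[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]

# Hypothetical data: (features, label) with two well separated classes.
rows = [((1, 2), 0), ((2, 1), 0), ((2, 3), 0), ((3, 2), 0), ((1, 1), 0), ((3, 3), 0),
        ((7, 8), 1), ((8, 7), 1), ((8, 9), 1), ((9, 8), 1), ((7, 7), 1), ((9, 9), 1)]

forest = fit_forest(rows)
print(predict(forest, (0, 0)), predict(forest, (10, 10)))
```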

Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25(2), 197-227. https://arxiv.org/pdf/1511.05741.pdf 

Diffusion – Previously I covered diffusion back in week 79 to try to figure out why it is becoming so popular. It is in no way as popular as XGBoost, but it has been gaining popularity. Over in the field of thermodynamics you could study gas molecules. Maybe you want to learn about how those gas molecules would diffuse from a high density to a low density area and you would also want to know how those gas molecules would reverse course. That is the basic theoretical part of the equation you need to absorb at the moment. Within the field of machine learning people have been building models that gradually add noise to the data and then learn how to reverse that process step by step. That is basically the diffusion process in a nutshell. You can imagine that doing this is computationally expensive. 
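
The noise-then-reverse idea can be sketched without any neural network at all. This example assumes a linear noise schedule with illustrative values (1000 steps, noise rates from 0.0001 to 0.02) and a made-up three-number data point. It shows the forward noising step and how the original comes back if the exact noise is known. In a real diffusion model a trained network has to estimate that noise, which is where the expensive computation comes from.

```python
import math
import random

def make_alpha_bars(steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal that survives after each step."""
    alpha_bars, running = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the data with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, noise)]
    return xt, noise

def remove_noise(xt, noise, alpha_bar):
    """Reverse the blend, assuming the noise is known exactly."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, noise)]

rng = random.Random(0)
alpha_bars = make_alpha_bars()
x0 = [0.5, -1.0, 2.0]  # hypothetical data point

xt, noise = add_noise(x0, alpha_bars[500], rng)
recovered = remove_noise(xt, noise, alpha_bars[500])
print(alpha_bars[0], alpha_bars[-1])
```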

Wei, Q., Jiang, Y., & Chen, J. Z. (2018). Machine-learning solver for modified diffusion equations. Physical Review E, 98(5), 053304. https://arxiv.org/pdf/1808.04519.pdf

Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 8780-8794. https://proceedings.neurips.cc/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf 

Wrapping this lecture up should be pretty straightforward. Feel free to dig into some of those papers if anything grabbed your attention this week. A lot of algorithms exist in the machine learning space. I tried to grab algorithms that are timeless and will always be relevant when considering where machine learning as a field is going. 

Links and thoughts:

“[ML News] BLOOM: 176B Open-Source | Chinese Brain-Scale Computer | Meta AI: No Language Left Behind”

“Is Intel ARC REALLY Canceled? – WAN Show July 29, 2022”

Top 5 Tweets of the week:

https://twitter.com/pierce/status/1553034275884244992

Footnotes:

[1] https://arxiv.org/pdf/2007.01547.pdf 

[2] https://xgboost.ai/ 

[3] https://github.com/dmlc/xgboost

[4] https://arxiv.org/pdf/1603.02754.pdf 

[5] https://www.cc.gatech.edu/home/isbell/classes/reading/papers/Rish.pdf 

[6] https://www.ibm.com/topics/logistic-regression 

[7] https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052 

[8] https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47 

[9] https://www.ibm.com/topics/knn 

[10] https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/02_knn_notes.pdf 

Research Note:

You can find the files from the syllabus being built on GitHub. The latest version of the draft is being shared by exports when changes are being made. https://github.com/nelslindahlx/Introduction-to-machine-learning-syllabus-2022

What’s next for The Lindahl Letter?

  • Week 83: Machine learning approaches (ML syllabus edition 4/8)
  • Week 84: Neural networks (ML syllabus edition 5/8)
  • Week 85: Neuroscience (ML syllabus edition 6/8)
  • Week 86: Ethics, fairness, bias, and privacy (ML syllabus edition 7/8)
  • Week 87: MLOps (ML syllabus edition 8/8)

I’ll try to keep the what’s next list forward looking with at least five weeks of posts in planning or review. If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. New editions arrive every Friday. Thank you and enjoy the week ahead.

Application notifications abound

My ecosystem of applications where I’m a daily active user is and has been dropping. I’m generally exhausted by and tired of the endless string of notifications that don’t really notify me about anything substantial. I still remember a time before the advent of smartphones. For the most part, I remember having a flip phone that did not do very much beyond being able to make calls and receive text messages. Nobody really sent pictures with those phones. The resolution of the cameras was terrible. At that time, the phone companies still charged you by the text message or limited people to a small monthly allotment. Things were very different from today.

I’m really considering returning to just not carrying a smartphone around. Much like sending an email and getting an asynchronous response, I could just let my smartphone always go to voicemail for the most part. Most phone calls are not really all that urgent anyway. These are the thoughts on my mind at the moment. I’ll probably be nostalgic for the grand experience of going to the mall at some point today. That is how things are shaping up. Right now, I have some music streaming on Pandora and I’m writing in the Microsoft Word desktop application. I did end up spending a few minutes fixing some of the navigation formatting at the start of that adventure, but now things are pretty much set up to be able to write on a daily basis out of this document. Working out of Microsoft Word has not been a smooth transition. It is nowhere near as good or useful as Google Docs. The design aesthetic and the usability are very different. It is very much like Microsoft Word is exactly the way I left it years ago and it has not improved very much over the years.

Right now, I should be deeply focused on bringing the most important concepts to the forefront of my thoughts as I begin the day. Instead of working toward some type of meaningful writing I’m stuck on the medium where the writing is going to occur. Most of my writing is still occurring at my main computer in my office. I’m using this Corsair K65 RGB mini wired keyboard. It’s 60% the size of a normal keyboard and does not have the number pad or arrows. Strangely enough I have been really happy with it after getting a Kensington wrist rest to support a slightly more comfortable typing experience. The mechanical keyboard works really well for typing at speed in my office while concentrating on the process of starting the day. To that end, the keyboard has been wonderful and while I have used a natural ergonomic split keyboard for years this one has worked out well enough.

Switching back to Microsoft Word

Given that Google is apparently data mining my efforts writing in a Google Doc every day I should probably shift over and work out of Microsoft Office 365. This word processing document that I’m working out of right now is actually a .DOCX file that just happens to be opened out of Google Docs. It would not take very much effort to move the files over to the Microsoft side of things.  

Hold on just a second here while I make the switch.  

I went out to Google Drive and downloaded the entire novels directory as a zip file. After extracting all those files into the downloads directory, I went ahead and loaded them into the Microsoft OneDrive backup folder for this computer. That took just a couple of minutes to accomplish and now for the first time in a long time I’m working out of Microsoft Word. I think the last time was last year when I was typesetting the manuscript for “The Lindahl Letter: On Machine Learning.” I used to work out of Microsoft Word as my primary word processing system for years. Gradually I made the switch to Google Docs and was pretty happy with it until the recent reports of the strangeness related to tracking people. I don’t know if Microsoft is even remotely interested in the things that happen in Microsoft Word. At some point here in a few minutes I’m going to open this document up using the online interface for Office 365 to see how that goes. 

Writing in a standalone application and not just a tab within Chrome is a stranger feeling than I expected this morning. Now that I have the Office 365 web interface pulled up, I remember why I abandoned this word processing interface and went over to work out of Google Docs full time. This interface is just visually clunky and unrefined by comparison. The design aesthetic is just so much better out of the Google Doc. I am going to give working out of this interface a try for a few days to see if my feelings about it change over time. Maybe I will get used to the experience and be able to handle working out of the Office 365 ecosystem for word processing.

Working from draft form to a final manuscript

I have been really focused on writing an introduction to machine learning syllabus to share with everybody over on my Substack newsletter. Most of my time and energy has gone into that effort. Right now I’m at the point where a draft exists and has been shared out. That is generally a great point in the process. For me it means that I need to let it breathe for a bit and then go back and rework and reread it a few days later. Picking it up with fresh eyes lets me catch the little things that otherwise seemed ok in the initial draft. During the course of that process I have learned how to make figures, tables, and generally use the LaTeX syntax. That was indeed a battle and I shared the files for others to be able to take a look at them if they wanted to see how I used the syntax. I ended up having to learn the whole thing from a bunch of tutorials on YouTube each time I wanted to do something new along the way. It was not until the last section of material that I had to learn how to make tables in LaTeX, which was shockingly complex compared to what I expected. You have to understand a bit about how the structure works to see how to modify it in practice. 

Part of learning the LaTeX syntax during my journey was learning to appreciate the Overleaf website and how it manages that type of content. At first, I was wondering why this was any different than using the Google Doc or Microsoft Word processing environments. It really is a bit different and it worked out well enough. It is worth the small cost to be able to use it and I can see where having collaborators and sharing a document is something that the platform helps facilitate in a deeply powerful way. Now that the basic draft process on that syllabus is complete it is time to really focus deeply on the “what’s next?” question. Within my research trajectory notes and upcoming research pages on the Weblog I have a few ideas of what I’m working toward creating. At the moment, I’m thinking that my work with machine learning literature reviews is not complete. I may work out a few deeper looks at some of the topics contained within the syllabus. I am able to format my research notes and literature reviews into LaTeX syntax PDF documents now. 

I read an article over at The Verge that Google is tracking what I’m doing in my Google Doc and that is not entirely surprising. I will say that during the course of writing in my Substack file, which is now drafted to week 87 of 104 planned writing sessions, the algorithm has gotten better at providing edit suggestions while I write and matches my phrasing better. That document about machine learning is really close to 100,000 words. Right now it is at a word count of 96,925. I’m guessing that in terms of purely original technical learning prose creation I’m on the deeper end of the documents they are analyzing. Somebody I’m sure has written something that is longer. They probably have a different writing schedule than I do and the overall feel and style is probably different.