Everyone is a Data Scientist (qualitatively), and real-life examples to prove it

Let me explain. I was reading a book about ‘Big Data’ and the internet that I casually picked up from a convenience store at the Austin airport.

The book, “Everybody Lies” by Seth Stephens-Davidowitz, delves into several really interesting topics, mostly revolving around data science, the internet and human psychology.

In one of the earlier chapters of the book, the author provides several examples of how people make everyday decisions based on their previous experiences. One of those examples involved the author’s grandmother.

He talks about how his grandmother helped him pick the right person to date. Every time he brought someone home for dinner or a get-together, most family members would give him the wrong advice, whereas his grandmother would accurately predict whether that person would be compatible with him or not (binary classification).

This intuition was entirely based on her familiarity with the author’s habits, preferences, attitude, and experiences with previous dates (training data).

While this is not a quantitative decision based on real data, it is a qualitative one based on experience and draws heavily from real data science concepts. This example, therefore, inspired me to map common real-life examples to actual data science techniques to perhaps help people understand these concepts more intuitively.

Here we go!

Creating different playlists on Spotify – Topic Modeling or Clustering

This first one really resonates with me personally, as I am meticulous about placing the songs that I like into meaningful categories. That way, I can just play songs from specific playlists depending on my mood without having to put too much effort into cherry-picking each song.

What I am essentially doing is topic modeling or clustering. I am listening to songs, picking out their different attributes (features) and, based on these attributes, placing songs with similar attributes into one group. As a result, we end up with several groups (topics/clusters/playlists) that have different characteristics.
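To make the playlist idea a bit more concrete, here is a minimal sketch of k-means clustering in plain Python. The song names and the two features (tempo and energy) are made up for illustration, and seeding the algorithm with the first and last song is just a shortcut to keep the result repeatable. It is not how Spotify or any real service groups music.

```python
import math

# Hypothetical songs described by two made-up features: (tempo in BPM, energy 0-1).
songs = {
    "slow ballad": (70, 0.20),
    "acoustic cover": (80, 0.30),
    "lo-fi beat": (75, 0.25),
    "dance hit": (128, 0.90),
    "workout anthem": (140, 0.95),
    "club remix": (125, 0.85),
}

def scale(point):
    # Put tempo on roughly the same 0-1 scale as energy so neither dominates.
    tempo, energy = point
    return (tempo / 200, energy)

def k_means(points, centroids, rounds=10):
    for _ in range(rounds):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

points = [scale(v) for v in songs.values()]
# Seed with the first and last song purely to keep the sketch deterministic.
centroids = k_means(points, [points[0], points[-1]])

# Each cluster becomes a playlist.
playlists = {}
for title, features in songs.items():
    p = scale(features)
    label = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
    playlists.setdefault(label, []).append(title)

print(playlists)
```

With these made-up numbers, the three mellow songs land in one playlist and the three high-energy songs in the other, which is exactly what I would have done by ear.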

Stirring the pot before tasting the food – Random Sampling

It is common for cooks to sample the food that they are making to ensure that everything tastes as expected. However, one cannot simply pick up anything that is in close proximity. The food needs to be mixed or stirred so that the different ingredients get mixed together properly.

That creates a more consistent distribution of ingredients everywhere, so a single spoonful is far more representative of the dish as a whole during the tasting.

Random sampling works the same way. To prevent biases such as convenience bias from creeping in, it is important to ensure that the sampling is done randomly, so that the sample generalizes better to the population (the food).
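Here is a small sketch of the difference between tasting from the top of an unstirred pot and tasting a random spoonful. The pot, its ingredients and the 70/30 split are all invented for the example, and the fixed seed is only there to make the run repeatable.

```python
import random

# A hypothetical unstirred pot: the salt has settled at the end of the list.
pot = ["broth"] * 700 + ["salt"] * 300
true_share = pot.count("salt") / len(pot)

def salt_share(spoonfuls):
    return spoonfuls.count("salt") / len(spoonfuls)

# Convenience sample: taste only what happens to be on top.
top_of_pot = pot[:200]

# Random sample: every spoonful is equally likely to be picked (the "stir").
rng = random.Random(42)  # fixed seed so the sketch is repeatable
stirred = rng.sample(pot, 200)

print(f"true share of salt: {true_share:.2f}")
print(f"unstirred estimate: {salt_share(top_of_pot):.2f}")
print(f"stirred estimate:   {salt_share(stirred):.2f}")
```

The unstirred estimate misses the salt entirely, while the random sample should land close to the true share.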

Picking fresh avocados at the grocery – Anomaly Detection

Most avocado lovers (such as me) need to feel the fruit physically at the grocery store in order to judge its freshness. It can’t be too hard or too ripe; it has to be just right. This perfect freshness essentially represents the expected value of avocados.

If an avocado is softer or harder than expected, it would be identified as an outlier or an anomaly. Hence, we would reject it and move on to the next one.
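One simple way to turn that squeeze test into numbers is a z-score check. The sketch below assumes we could measure firmness on some arbitrary scale (higher means harder). The readings and the threshold of 3 standard deviations are assumptions for illustration, not a recommendation.

```python
from statistics import mean, stdev

# Hypothetical firmness readings (arbitrary units) of avocados that were just right.
good_avocados = [5.0, 5.2, 4.8, 5.1, 4.9]

expected = mean(good_avocados)
spread = stdev(good_avocados)

def is_anomaly(firmness, threshold=3.0):
    # Flag anything more than `threshold` standard deviations from the expected value.
    z_score = (firmness - expected) / spread
    return abs(z_score) > threshold

for firmness in (5.1, 7.0, 3.0):
    verdict = "reject" if is_anomaly(firmness) else "keep"
    print(firmness, verdict)
```

A reading of 5.1 sits comfortably inside the normal range, while 7.0 (rock hard) and 3.0 (mushy) are both flagged.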

Deciding whether to go into the office physically or work remotely – Binary Classification

Once again, this example is fairly simple to understand. If you are like me and have an option to go into the office physically during quarantine, it’s a daily dilemma to figure out if you can work remotely or if you would need to make the effort to get ready to head to the office.

Here, one would need to look at the different aspects of work on that particular day such as:

Do I need to work with equipment or tools that are only available at the office?
Am I planning to eat lunch outside?
Do I need to ship an item at work?
Do I have to be home to receive an important shipment?

The above questions serve as features, and we use our experience from previous working days during the quarantine as the training data. Then we make a prediction every day.
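As a rough sketch, the four questions above can be encoded as yes/no features, with a k-nearest-neighbors vote over past days making the call. The past days and their labels below are invented, and k-nearest-neighbors is just one of many classifiers that could be used here.

```python
from collections import Counter

# Features, in order: needs_office_equipment, lunch_outside, item_to_ship, home_delivery.
# Label: 1 = go to the office, 0 = work remotely. All rows are made up.
past_days = [
    ((1, 0, 0, 0), 1),
    ((1, 1, 0, 0), 1),
    ((0, 0, 1, 0), 1),
    ((0, 1, 1, 0), 1),
    ((0, 0, 0, 1), 0),
    ((0, 0, 0, 0), 0),
    ((0, 1, 0, 1), 0),
    ((0, 1, 0, 0), 0),
]

def predict(today, k=3):
    # Hamming distance: how many answers differ between a past day and today.
    def distance(day):
        return sum(a != b for a, b in zip(day, today))

    # Take the k most similar past days and let them vote.
    nearest = sorted(past_days, key=lambda row: distance(row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(predict((1, 0, 1, 0)))  # need equipment and have an item to ship
print(predict((0, 0, 0, 1)))  # only waiting for a delivery at home
```

The first day looks most like past office days, so the prediction is 1 (go in). The second looks like past remote days, so the prediction is 0.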

Making cocktails for friends – A/B testing

Let’s say you are at a party and have been given the task of bartending. You decide to take a shot at making Moscow Mules for everyone but are not quite sure what the ingredients are or what steps to follow. You quickly look up a recipe and conjure up a batch of mules.

However, you also come across a second recipe that suggests adding more lime than the first one. So, out of curiosity, you go ahead and make another batch with the new recipe.

You then serve the two batches to different people and gauge their reactions. It turns out that, of the 25 people present at the party, 18 like the second batch while only 7 like the first. From this data, it becomes clear to you that the third batch should be the one with the extra lime. The key here is to change just one variable and observe the difference in results. This is the basis for A/B testing.

In a real-world quantitative analysis, bigger samples of data would need to be collected for each version of the solution, and statistical inference techniques such as hypothesis testing would be needed to determine whether the difference in results between the two batches is statistically significant.
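For the curious, here is what a very simple hypothesis test on the party numbers could look like. It is a sketch under a big assumption: I am treating each of the 25 guests as casting a single vote for the batch they preferred, and testing that against a 50/50 split with an exact two-sided binomial (sign) test. A proper A/B test with separate groups per version would call for a different test.

```python
from math import comb

def sign_test_p_value(wins, n):
    # Two-sided exact binomial test against a 50/50 split:
    # how likely is a result at least this lopsided if nobody had a real preference?
    tail = min(wins, n - wins)
    one_sided = sum(comb(n, k) for k in range(tail + 1)) / 2**n
    return min(1.0, 2 * one_sided)

p_value = sign_test_p_value(wins=18, n=25)
print(f"p-value: {p_value:.3f}")
```

The p-value comes out at roughly 0.043, only just under the conventional 0.05 cutoff, which shows why 25 guests is a thin sample to base a recipe decision on.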

Manually tagging friends in photos – Image classification

This is actually not the best example, since most apps that use pictures nowadays already tag people automatically based on facial features. Deep learning models such as Convolutional Neural Networks (CNNs) are widely used for image classification tasks like this one.

However, if we consider edge cases where the algorithm can’t recognize certain people, you could still manually tag them if you recognized them. Your intuition for correctly tagging people is based on your extensive knowledge of thousands of pictures of people’s faces (training data) stored in your brain.

Reading comments on a Twitter post – Sentiment Analysis

Oftentimes after posting something on Twitter, or any social media platform for that matter, it’s intriguing to see what people may have commented. An encouraging comment from a friend, a family member or even a complete stranger tends to lift your spirits, while a negative comment could make you wonder what in the post may have upset them.

In any case, we are quickly able to pick up cues from the comments as to what kind of emotion they are portraying, such as joy, anger, disagreement, amusement, curiosity etc.

Sentiment analysis, which is one of the most widely used applications of Natural Language Processing (NLP), does the same thing at scale by ingesting massive amounts of text, generally from online sources such as Twitter, Instagram, Facebook, Amazon reviews, Yahoo Finance chatrooms, etc.
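The simplest possible version of this is a word-counting scorer. The sketch below uses a tiny hand-made lexicon that I invented for the example. Real sentiment models learn from large labeled datasets and handle things like negation and sarcasm, which this toy version does not.

```python
import re

# A tiny hand-made lexicon (an assumption for this sketch, not a real resource).
LEXICON = {
    "love": 1, "great": 1, "awesome": 1, "congrats": 1,
    "hate": -1, "terrible": -1, "awful": -1, "disagree": -1,
}

def sentiment(comment):
    # Lowercase the comment, split it into words, and add up the word scores.
    words = re.findall(r"[a-z']+", comment.lower())
    score = sum(LEXICON.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for comment in ("Love this, congrats!", "Terrible take. I disagree.", "What time is it?"):
    print(comment, "->", sentiment(comment))
```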

Suggesting shows or movies to friends, family and colleagues – Recommendation Engine

This is possibly one of the most common examples out of all the ones listed in this post. People talk about movies and shows all the time, especially with streaming services such as Netflix, Hulu, Amazon Prime, HBO and Disney becoming increasingly popular.

Furthermore, friends tend to get regular recommendations from each other. The same goes for family members and even colleagues at work.

Today, a machine learning system called a Recommendation Engine is widely used to do the same in an automated manner across a wide range of products and services. It is deployed on streaming services, e-commerce platforms, social media platforms and in several other domains.

The collaborative filtering method of recommendation engines uses the behavior of a group of people with similar tastes (the automated equivalent of friends, family members, colleagues etc.) to suggest products, movies and services to a user. Content-based filtering, on the other hand, looks at the attributes of the items a user has already liked, purchased, viewed or clicked on, and recommends similar content.
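Here is a bare-bones sketch of the collaborative filtering idea: find out whose taste matches mine, then weight their ratings of shows I have not seen by how similar we are. The people, shows and ratings are all made up, and cosine similarity with a weighted sum is just one simple scoring choice among many.

```python
import math

# Hypothetical 1-5 ratings; a missing title means the person has not seen it.
ratings = {
    "me":    {"Show A": 5, "Show B": 4, "Show C": 1},
    "alex":  {"Show A": 5, "Show B": 5, "Show C": 1, "Show D": 5},
    "blake": {"Show A": 1, "Show B": 2, "Show C": 5, "Show E": 5},
}

def similarity(a, b):
    # Cosine similarity over the titles both people have rated.
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    dot = sum(ratings[a][t] * ratings[b][t] for t in shared)
    norm_a = math.sqrt(sum(ratings[a][t] ** 2 for t in shared))
    norm_b = math.sqrt(sum(ratings[b][t] ** 2 for t in shared))
    return dot / (norm_a * norm_b)

def recommend(user):
    # Score each unseen title by the similarity-weighted sum of other people's ratings.
    scores = {}
    for other in ratings:
        if other == user:
            continue
        weight = similarity(user, other)
        for title, rating in ratings[other].items():
            if title not in ratings[user]:
                scores[title] = scores.get(title, 0.0) + weight * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("me"))
```

Alex and I rate the shared shows almost identically, so Alex’s favorite unseen show ranks above Blake’s even though both gave their pick five stars.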

Checking IDs of people ordering drinks – Regression

This is an example of a bartender checking people’s IDs to ensure that they are of drinking age. During this process, if the customer is obviously above the age of 21, the bartender might not even ask for an ID. However, in cases where the customers look relatively young, a check might be required.

Regardless of the outcome, what the bartender intuitively does in this situation is try to estimate the customer’s age based on facial features, height, behavior, attire, the way they speak etc. And since they are trying to predict a continuous variable (age), it is essentially a regression problem.
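To sketch what that looks like as an actual regression, suppose all those visual cues were boiled down into a single made-up "apparent maturity" score from 0 to 10. The score, the past customers and the 5-year safety margin below are all assumptions for illustration. The code fits a straight line by ordinary least squares and uses the predicted age to decide whether to ask for ID.

```python
# Hypothetical past customers: (apparent-maturity score from 0 to 10, actual age on ID).
checked = [(2, 18), (3, 20), (4, 22), (6, 27), (8, 33), (9, 36)]

def fit_line(pairs):
    # Ordinary least squares for: age = intercept + slope * score.
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(checked)

def estimated_age(score):
    return intercept + slope * score

def should_check_id(score, legal_age=21, margin=5):
    # Ask for ID unless the estimate clears the legal age by a comfortable margin.
    return estimated_age(score) < legal_age + margin

for score in (3, 9):
    print(score, round(estimated_age(score), 1), should_check_id(score))
```

The output is a number (an estimated age) rather than a category, which is what makes this regression. The yes/no decision to check the ID is a separate rule applied on top of that estimate.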

So there it is – 10 examples of how we subconsciously use data science and machine learning on a daily basis. Let me know what other examples come to mind in the comments below. Hope you enjoyed the post!