Picking the right movie to watch: Part II (using the simplest decision tree model)

Following up on my previous attempt to model how best to pick a movie, this new iteration takes a very different approach.

In the previous project, I tried to predict the IMDB rating for a specific movie based on several factors, such as whether the movie featured an Oscar-winning actor, the genre of the movie and, most importantly, the Rotten Tomatoes score.

Picking the Rotten Tomatoes score as one of the predictor variables was a mistake. The correlation between this score and the dependent variable, the IMDB rating, was 0.86, which indicated a strong association between the two.

However, in real life, if we are trying to predict the rating for a movie, we have to assume that we don't already have another variable that essentially provides the same score; using one amounts to data leakage. Hence there was an inherent issue with the problem definition itself.

In addition, even though the model was fairly accurate, it was not precise. Some prediction results from that project are shown in the table above: the actual ratings of the movies used for prediction generally fell within the confidence interval of the predicted value, but the interval was rather wide.

New problem definition

Instead of predicting the IMDB rating, the new problem is to predict whether I will like a movie or not, based on the IMDB rating and possibly other variables.

Why is this a better solution? This way, I could trust my model to tell me whether I would like a given movie or not, which would allow me to pick a movie more reliably before I take the time to watch it. People (including me) spend countless hours browsing for movies on streaming services such as Netflix, Amazon Prime etc. The aim of this new model is to make that process less stressful and time-consuming.

In order to execute the solution to this problem, the biggest challenge would be to manually label each movie that I have watched with a ‘like’ or a ‘dislike’, which would then serve as the target variable that I am trying to predict using the IMDB rating.

The motivation behind this approach was to see whether the IMDB rating is a good indicator for me to pick a movie to watch. For me personally, it is far more important to avoid watching things that I dislike than to miss out on a potentially good movie. Hence, the priority for the model was to avoid false positives, i.e. movies predicted as a 'like' that I would actually dislike.

Data collection

Fortunately, IMDB makes its vast dataset of movies, documentaries, TV shows etc. available in a fairly structured manner. For this project, I worked with just two datasets from IMDB's database:

  • titles_basics.tsv: Contained header information for each movie title (identified by the ID variable 'tconst'), such as genre, runtime and primary title
  • data.tsv: Contained the average rating and number of votes for each title (again keyed by 'tconst')

Data preparation

1. Cleaning and wrangling

The entire dataset contained over 7 million rows which consumed more than 400MB of space. Therefore, it was important to explore the data and remove unwanted columns early on to reduce the size of the dataset for easier processing.

Missing values were removed and an initial timeframe of 1990 to 2019 was selected. However, the size of the dataset was still large, so I decided to scale things down further by selecting just the years 2008 to 2019. Since I was mostly familiar with movies from this timeframe, it made more sense to select those movies for modeling purposes.

Another important cleaning step was removing documentaries, short films and TV shows as well as genres such as sports.
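As an illustration, here is a minimal pandas sketch of this cleaning step. The file name follows the dataset listed above, while the column names ('titleType', 'startYear', 'genres') and the use of '\N' for missing values are assumptions based on the public IMDB dumps and may differ from what the project actually used.

```python
import pandas as pd

# Load the raw titles file; IMDB dumps are tab-separated and use \N for missing values.
basics = pd.read_csv("titles_basics.tsv", sep="\t", na_values="\\N", low_memory=False)

# Keep only the columns needed downstream to shrink the ~7M-row frame early.
basics = basics[["tconst", "titleType", "primaryTitle", "startYear", "runtimeMinutes", "genres"]]

# Drop rows with missing values and restrict to the 2008-2019 window.
basics = basics.dropna()
basics["startYear"] = basics["startYear"].astype(int)
basics = basics[basics["startYear"].between(2008, 2019)]

# Remove documentaries, shorts, TV titles and sports-related genres.
basics = basics[basics["titleType"] == "movie"]
basics = basics[~basics["genres"].str.contains("Documentary|Short|Sport", case=False)]
```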

2. Labelling

As mentioned earlier, each movie was labelled manually, since liking or disliking a movie is a personal choice. Because this manual process takes a considerable amount of time, only 130 movies were eventually selected for analysis, with the goal of adding more labelled data in the future.

The ‘like’ and ‘dislike’ variable was used as the target variable in a binary classification problem.

3. Combining datasets for analysis and modeling

All the datasets were combined in two steps:

  1. Merge the cleaned, wrangled and labelled datasets spanning different timeframes into one
  2. Merge the labelled dataset with data.tsv, which contains the rating and number of votes for each title, using the 'tconst' identifier as the join key

A snippet of the resulting combined dataset is shown below. As we can see, each movie title is accompanied by its ID, IMDB rating, whether I liked the movie or not, and other variables.
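A rough sketch of these two steps might look like the following, where `labelled_2008_2013` and `labelled_2014_2019` are hypothetical names for the labelled subsets produced earlier, and the column names in data.tsv are assumed to be 'tconst', 'averageRating' and 'numVotes'.

```python
import pandas as pd

# Step 1: stack the cleaned and labelled per-timeframe frames into one.
labelled = pd.concat([labelled_2008_2013, labelled_2014_2019], ignore_index=True)

# Step 2: join the ratings and vote counts on the shared 'tconst' identifier.
ratings = pd.read_csv("data.tsv", sep="\t", na_values="\\N")
movies = labelled.merge(ratings, on="tconst", how="inner")
```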

EDA and rule-based baseline model

1. EDA

The distribution of the IMDB ratings was slightly left-skewed with the bulk of the values falling between 7 and 7.5.


The distribution of my watched movies also showed more liked movies than disliked movies. The classes do not appear to be severely imbalanced, although the proportion of disliked movies is distinctly lower.
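For reference, a one-liner like the following reports the class proportions, assuming the combined frame is called `movies` and the manual labels live in a binary 'liked' column (both hypothetical names):

```python
# Proportion of liked (1) vs disliked (0) movies among the labelled titles.
print(movies["liked"].value_counts(normalize=True))
```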

2. Baseline rule-based model

A new boolean variable called 'high_rating' was created, indicating whether a movie's rating was above 7.5 or not. This is a critical step in the feature engineering process.

Then, I compared my manual like/dislike labels against the rule-based prediction (rating above 7.5 implies 'like'). This baseline rule-based model produced an accuracy of 59%, which was rather low, so I then tried to optimize this rather simple model further.
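A minimal sketch of this baseline, under the same assumed `movies` frame and 'liked' label column as before:

```python
# Flag movies whose IMDB rating is above 7.5 and treat that flag as a
# prediction of 'like', then score it against the manual labels.
movies["high_rating"] = (movies["averageRating"] > 7.5).astype(int)
baseline_accuracy = (movies["high_rating"] == movies["liked"]).mean()
print(f"Baseline rule-based accuracy: {baseline_accuracy:.2f}")
```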

3. Feature engineering

For the baseline model, the engineered variable simply flagged whether a particular movie had a rating higher than 7.5 or not. Depending on where this rating threshold is set, I was likely to get different modeling results, so I decided to sweep the threshold and observe the accuracy at each point, as sketched below.
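A simple way to perform such a sweep, again assuming the hypothetical `movies` frame and 'liked' column, is to loop over candidate thresholds and record the accuracy of the resulting rule:

```python
import numpy as np

# Sweep candidate thresholds and record the accuracy of the rule
# "predict 'like' when averageRating > threshold".
thresholds = np.arange(4.0, 9.0, 0.1)
accuracies = []
for t in thresholds:
    predicted_like = (movies["averageRating"] > t).astype(int)
    accuracies.append((predicted_like == movies["liked"]).mean())

best = thresholds[int(np.argmax(accuracies))]
print(f"Best threshold: {best:.1f}, accuracy: {max(accuracies):.2f}")
```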

Since I was trying to predict what movies I would like, I needed to find the optimal threshold to create a variable that more accurately predicts the likes and dislikes. Hence, this feature engineering step was particularly crucial.

The above chart visualizes accuracy scores across the various thresholds. To truly understand this chart, we need to think about overfitting and underfitting.

If we set our threshold to 4, essentially all movies are going to be labelled as 'like', in which case the accuracy will be roughly equal to the proportion of movies that I liked. That fraction is 80/130, which is about 0.615, and that is exactly what we see in the plot.

As the threshold increases, we start more accurately predicting the disliked movies so our overall accuracy increases. This basically means that our true negatives increase without reducing the true positives.

We reach the best accuracy score of 0.75 at a threshold of 6.4, giving us the best tradeoff between bias and variance. If we increase the threshold further, we would still accurately predict the disliked movies but would start misclassifying the liked movies, so our true positives would decrease and false negatives would increase.

Another graph that captures this tradeoff is shown below. The plot illustrates the difference between the thresholds of 7.5 and 7.1. The x-axis shows the actual like and dislike values, whereas the legend shows the predicted values.

  • Threshold of 7.5: Predicting disliked movies better but not predicting liked movies that well. More true negatives but also more false negatives.
  • Threshold of 7.1: Predicting liked movies better but compromising on disliked movies. More true positives but also more false positives.

For practical purposes, we choose a threshold of 7.1 as that is generally a better value than 6.4 to separate good from bad movies while keeping the accuracy at a respectable 70%.

4. Exploring the runtimeMinutes variable

The boxplot below shows a slightly higher median for liked movies. To be exact, the medians for liked and disliked movies are 115 and 107 minutes respectively. We could check whether this difference is statistically significant, but that will be covered in a later revision of the project. For now, we will include the variable in our base model.

5. Exploring the numVotes variable

From the boxplot and probability density curves shown below, it is clear that liked movies tend to have a higher number of votes.

The mean and median values shown in the table below reflect the observations in the graphs above: both statistics are higher for the liked movies, i.e. the movies I liked tend to attract more votes.

There appears to be a moderate correlation (0.632) between the number of votes and the average rating. It is possible that people are more excited to rate movies they really liked, which could introduce a bias when comparing votes and ratings. For now, we will also include this variable in our base model.
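This correlation can be checked directly on the assumed `movies` frame:

```python
# Pearson correlation between the number of votes and the average rating.
print(movies[["numVotes", "averageRating"]].corr())
```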

6. Exploring the rating threshold variables

Above, we have created three plots, each with a different rating threshold:

  • 7.5: Base threshold
  • 7.1: Practical threshold
  • 6.4: Best threshold based on data

The 7.5 threshold is good at predicting liked movies but poor at predicting disliked movies. The 7.1 threshold does equally well with liked movies and slightly better with disliked movies. The 6.4 threshold does really well with disliked movies but poorly with liked movies.

Based on these results, we choose 7.1 as the best threshold to use for our model.

Prediction using decision trees

We use a decision tree for this modeling process since it is interpretable, fast and simple for the features that we are using.

1. Feature selection

Note that we have only 2 variables:

  • numVotes
  • high_rating_7p1

We did not use runtime as it was not a great distinguisher of liked and disliked movies.

2. Pre-processing

The high_rating_7p1 variable was converted to integer values and the data was split into training and test sets.
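A sketch of this pre-processing step; the column names, the 75/25 split ratio and the random seed are assumptions, since the exact values used in the project are not stated:

```python
from sklearn.model_selection import train_test_split

# Cast the engineered boolean (rating above 7.1) to an integer feature.
movies["high_rating_7p1"] = (movies["averageRating"] > 7.1).astype(int)

X = movies[["numVotes", "high_rating_7p1"]]
y = movies["liked"]

# Split into training and test sets, keeping the like/dislike proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```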

3. Baseline decision trees model

The baseline decision tree model provided an accuracy score of 0.61 and a ROC AUC score of 0.57, which were not great.
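The baseline could be reproduced along the following lines, using scikit-learn's DecisionTreeClassifier with default settings (the exact settings used in the project are not documented) and the train/test split from the previous sketch:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Fit a default decision tree on the training split from the previous sketch.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = tree.predict(X_test)
y_prob = tree.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
```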

I diagnosed the model further and noticed that, as expected, it predicted the liked movies really well but did not model the disliked movies well.

Maybe that says a little about my movie choices. In effect, the model tends to categorize the good movies correctly but also sometimes categorizes bad movies as good ones.

4. Model with AverageRating variable

Using the original average rating variable instead of the high_rating_7p1 variable yielded lower accuracy and AUC scores of 0.564 and 0.538 respectively. This was expected, since the latter variable is a feature-engineered (categorized) version of the former, tuned for better prediction.

5. Optimization and final model

Using grid search for hyperparameter optimization, I observed an improvement in ROC AUC from 0.57 to 0.65, but the accuracy remained the same as in the baseline model. This is likely because the model was optimized with respect to AUC.
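A hedged sketch of this step; the parameter grid below is illustrative, since the grid actually searched in the project is not documented:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative parameter grid; these values are assumptions, not the ones
# actually searched in the project.
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 2, 5, 10],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # optimize with respect to AUC, as described above
    cv=5,
)
search.fit(X_train, y_train)
best_tree = search.best_estimator_
print(search.best_params_, search.best_score_)
```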

6. Diagnosing the final model

The confusion matrix above confirmed that the model was good at identifying liked movies (true positives) but poor at identifying disliked ones (true negatives).
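The matrix itself can be computed with scikit-learn, assuming the `best_tree` estimator from the grid-search sketch above:

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes (0 = dislike, 1 = like).
print(confusion_matrix(y_test, best_tree.predict(X_test)))
```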

At first glance, it might seem that I should lower the prediction threshold to reduce false negatives, i.e. good movies that the model tells me to skip. Lowering the threshold would indeed decrease false negatives and increase true positives, but it would also increase false positives. Since it is okay for me to miss out on a good movie, while what I really want to avoid is watching a bad one, minimizing false positives matters more than minimizing false negatives.

7. Grid search using accuracy as scoring metric

Interestingly, the best parameters using accuracy as the scoring metric were the same as when roc_auc was the metric, so this did not improve the model further and we stuck with the previous model.

8. Predicting unseen movies

Using the best model, I tried to predict the class for a few unseen movies. The average rating and number of votes for each of these movies are shown in the plot below.

The accuracy and ROC AUC scores came out to be 0.608 and 0.638 respectively. In comparison, the validation data for the model yielded 0.61 (accuracy) and 0.65 (ROC AUC). Based on this, the model performs on unseen data roughly as well as on the data it was validated on.
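Scoring unseen titles follows the same pattern as before; the feature values below are hypothetical placeholders, not the movies actually used in the project:

```python
import pandas as pd

# Hypothetical feature values for a few unseen movies; replace with real data.
unseen = pd.DataFrame({
    "numVotes": [250000, 40000, 900000],
    "high_rating_7p1": [1, 0, 1],
})
print(best_tree.predict(unseen))  # 1 = predicted 'like', 0 = predicted 'dislike'
```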