Picking the right movie to watch: Part I (using linear regression)

As a movie fanatic, it’s really important for me to pick something decent to watch. Over the years, mainly during my time in college, I went through a horde of movies. Some were recommendations from friends while others were known classics.

With more and more movies under my belt, I began developing a certain taste for dramatic and serious movies. You know, the ones that get nominated for the Academy Awards. Another important thing I noticed was that the films I tended to appreciate more were generally rated higher on IMDB, the Internet Movie Database.

So, from then on, IMDB was the first place I would go to check ratings before deciding whether or not to watch a movie. Why so much preparation just to watch a movie? Well, with so much information at our fingertips today, why not do the research to find the perfect one? Ever find yourself browsing endlessly through Netflix only to realize your dinner is already finished? Same thing.

It’s safe to say that the IMDB score is an important metric for me personally as well as for many others to help narrow down the gargantuan search.

However, is this rating a trusted source to measure the merit of a movie? Should people actually be looking at this metric when deciding on a movie to watch? Yes, whether a movie is good or not is completely down to personal taste. But we can at least determine if the IMDB rating holds any value by comparing it to other factors that surround a movie such as whether the movie won an Academy Award for Best Picture/Actor/Actress.

In order to achieve this, we need to explore how the IMDB rating is related to other factors. One way of doing this is to create a linear model that can predict both high and low ratings. By creating such a predictor, we can not only find the truly good movies but also discover the interaction between the IMDB rating and other variables during exploratory data analysis (EDA).

Dataset

For the analysis, we have a dataset containing 651 randomly sampled movies released before 2016, described by 32 variables or attributes containing information about each movie.

Since our objective is to create a prediction model with IMDB rating as the dependent variable, we need to carefully select the explanatory variables.

To make our task easier, we can quickly get rid of variables that add no valuable information to the modeling process such as the movie title or the URL for the movie.

Here is a complete list of all the variables that have been omitted: 

  • title: Unrelated to IMDB rating
  • title_type: Unrelated to IMDB rating
  • studio: Too many studios to provide a single significant contributor
  • thtr_rel_year: Unrelated to IMDB rating
  • thtr_rel_day: Unrelated to IMDB rating
  • dvd_rel_year: Unrelated to IMDB rating
  • dvd_rel_day: Unrelated to IMDB rating
  • director: Too many directors to provide a significant contributor
  • actor1: Too many unique values to provide a significant contributor
  • actor2: Too many unique values to provide a significant contributor
  • actor3: Too many unique values to provide a significant contributor
  • actor4: Too many unique values to provide a significant contributor
  • actor5: Too many unique values to provide a significant contributor
  • imdb_url: Unrelated to rating
  • rt_url: Unrelated to rating

Following this initial filtering, we are left with the following variables describing each movie:

  • Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Musical & Performing Arts, Mystery & Suspense, Other)
  • Runtime (in minutes)
  • MPAA rating (G, PG, PG-13, R, Unrated)
  • Rating on IMDB – target variable
  • Number of votes on IMDB
  • Critics score on Rotten Tomatoes
  • Audience score on Rotten Tomatoes
  • Whether or not the movie was nominated for a best picture Oscar
  • Whether or not the movie won a best picture Oscar
  • Whether or not one of the main actors in the movie ever won an Oscar
  • Whether or not one of the main actresses in the movie ever won an Oscar 
  • Whether or not the director of the movie ever won an Oscar
  • Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo 
  • Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
  • Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
  • Month the movie is released on DVD
  • Month the movie is released in theaters

Despite trimming down the dataset, we still have a sizeable number of variables left. However, there is still room to eliminate a few more, which we will do in the next section.

Exploratory Data Analysis

The quickest way to find the best variables to include in our model is to examine how each one relates to the target variable (the IMDB rating), also called the dependent variable.

An easy way to visualize all the relations is to create a correlation matrix which displays a scatter plot as well as the correlation between each variable and the IMDB rating, as shown below.
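As a rough sketch, such a matrix can be produced in base R. The data frame name `movies` is an assumption for illustration; the column names follow the variable list above.

```r
# Numerical variables only: scatter plot matrix plus pairwise correlations.
num_vars <- movies[, c("imdb_rating", "runtime", "imdb_num_votes",
                       "critics_score", "audience_score")]
pairs(num_vars)                      # scatter plot for every pair of variables
cor(num_vars, use = "complete.obs")  # correlation matrix
```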

Note that the plot only shows four explanatory variables since you can only determine the correlation value and the scatter plot for numerical variables. Yes, you can encode categorical variables to discrete numerical values but we will go with a simpler approach here.

Straight away, we can see that the IMDB rating is highly correlated with only two of the variables – the critics’ score and audience score on Rotten Tomatoes. This should not come as a great surprise as both IMDB and Rotten Tomatoes are quite likely to have similar ratings for movies with a few exceptions.

The high correlation values for these variables are reflected in the scatter plots, which show moderately strong, somewhat linear relationships with the IMDB rating variable.

Since runtime and the number of votes on IMDB do not appear to have a linear association with rating, those variables can be ignored.

Bivariate Analysis

During this sort of analysis, we generally explore the interaction between each independent variable and the target variable one by one.

First, we look at rating versus audience score, which, as we already know from the scatter matrix above, has a moderate, positive, linear relationship. Below is a closer look at the association.

There is a little bit of a fan-shape to the plot as the variance is higher at lower audience scores.

As the correlation values suggest, the critics score is correlated with the rating slightly less strongly than the audience score is, as indicated by the scatter plot below.

An important thing to note here is that audience score and critics score are fairly strongly correlated with each other (a value of 0.704), which can introduce collinearity into the model. In such a case, it is best to pick just one of the two variables, since the additional attribute may not improve the predictions but would make the model harder to interpret.
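As a quick one-line check (again assuming the data frame is named `movies`):

```r
# Correlation between the two Rotten Tomatoes scores (~0.704 in this dataset).
cor(movies$critics_score, movies$audience_score)
```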

Categorical Variables Analysis

Let’s take a look at the categorical variables now to see which ones look more promising.

Critics rating

One of the best ways to visualize a numerical variable against a categorical variable is with box plots, like the one shown below for rating versus critics rating.
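A minimal sketch of such a box plot in base R, under the same `movies` assumption:

```r
# IMDB rating broken down by the Rotten Tomatoes critics rating.
boxplot(imdb_rating ~ critics_rating, data = movies,
        xlab = "Critics rating (Rotten Tomatoes)", ylab = "IMDB rating")
```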

Here, we can see a trend where the rating is generally higher for movies rated “Certified Fresh” on Rotten Tomatoes. Movies given a “Fresh” rating tend to have a somewhat lower IMDB rating, while films dubbed “Rotten” sit lower still. We are starting to see a pattern here, albeit an expected one.

If we take a closer look at the distribution of critics ratings, we notice that out of the 651 observations, 307 are labelled “Rotten” on Rotten Tomatoes, 209 are marked “Fresh”, and just 135 carry a “Certified Fresh” rating. It is interesting to note that more than twice as many movies are categorized as “Rotten” as “Certified Fresh”, which suggests it is rather tough to get a good rating on Rotten Tomatoes. These counts are summarized in the bar chart below.

Another important point to note is that almost all of the “Certified Fresh” movies fall above a score of 7 on IMDB. If we only look at movies with an IMDB rating of 7 or more, 99 of them are “Certified Fresh” and 101 are “Fresh”, with only 24 cited as “Rotten”.

Considering movies with scores below 6 on IMDB, 161 of them are labeled “Rotten” while a meager two are marked “Certified Fresh”. A jump of just one point on IMDB can therefore correspond to a large disparity in the Rotten Tomatoes rating.
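These counts can be reproduced with simple `table()` calls, a sketch under the same assumptions as before:

```r
table(movies$critics_rating)                           # overall: Rotten 307, Fresh 209, Certified Fresh 135
table(movies$critics_rating[movies$imdb_rating >= 7])  # movies rated 7 or more on IMDB
table(movies$critics_rating[movies$imdb_rating < 6])   # movies rated below 6 on IMDB
```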

Audience Rating

This variable paints a similar picture to that of the critics rating feature when compared to the IMDB rating as shown in the boxplot below.


If we crunch the numbers for audience rating on Rotten Tomatoes, the distribution behaves slightly differently, as the number of positive or “Upright” ratings (376) is greater than the number of “Spilled” ratings (275). Once again, we break the distribution down into the 7-or-more and below-6 IMDB rating categories. For the former segment, 223 movies are labelled “Upright” while only 1 movie is dubbed “Spilled”. For the below-6 category, 164 are “Spilled” while 16 are “Upright”. The difference between the categories is even more apparent for this variable.

From the above analysis, it is evident that the critics are stricter with their ratings while the general public appears to be slightly more liberal with their voting. A fascinating thing to note is that the disparity in Rotten Tomatoes categories above an IMDB rating of 7 versus below 6 is strikingly greater for the audience rating than for the critics rating. The audience ratings therefore seem to agree with the IMDB scores more than the critics ratings do.

One reason for the above observation could be that the audience ratings are segmented into just two categories while the critics rating is distributed into three categories.

Best picture nomination

The plot below shows the relation between rating and whether the movie was nominated for an Oscar.

Expectedly, movies that were nominated generally have a higher IMDB rating. Hence, there is a clear distinction between these two categories.


Best picture win

This variable behaves very similarly to the best picture nominated attribute. Therefore it will be interesting to see whether these two variables might be correlated. As mentioned earlier, if they are collinear one of them may have to be dropped from consideration for the model.

Best director win, actor win and actress win

The plots below are for the best director, best actor and best actress win variables respectively. Note that these wins are not necessarily for the particular movies but for any win during their careers.

None of the above plots shows a distinct association between these variables and the IMDB rating, so we may end up ignoring some of them.

Theater release month, DVD release month

Neither of these variables has a strong relationship with the IMDB rating, so we can drop them both from consideration.

Genre

From the genre list, documentaries and musicals stand out a bit more than the others, fetching relatively higher average IMDB ratings, while horror and comedy sit relatively lower. Overall, though, it is hard to make out a distinct pattern, as there are simply too many levels with very small differences between them.


MPAA rating

This variable provides very little information about the rating, with no clear separation between its levels, so it is best to drop it.

Top 200 box office

This variable determines whether a movie made it to the top 200 at the box office. Again, there is a very small distinction between the two categories in relation to rating.

We are finally done with the EDA section.

Data Cleaning

Below are the variables omitted as a result of EDA and the corresponding reason to do so.

  • critics_score: Collinear with “audience_score” and removed to make the estimates more reliable. Note that “audience_score” is the one retained since it has a higher correlation with the IMDB rating, as seen in the EDA section.
  • runtime: There is a very small correlation between runtime and rating. Also, the scatter plot between the variables does not clearly depict a linear association.
  • num_votes: As seen in the EDA section, there is no clear linear relation between this variable and rating
  • genre: There are simply too many genres and it is unlikely that just one particular genre will contribute to a higher rating
  • thtr_rel_month: Unrelated to IMDB rating as shown in the EDA section
  • dvd_rel_month: Once again, unrelated to the rating variable.
  • mpaa_rating: Similar reason as “genre”
  • best_pic_nom: This variable is collinear with “best_pic_win” as seen in the EDA

We are now down to the eight best (according to my analysis) explanatory variables, and we will try to create a linear regression model using these attributes. Here they are:

  • critics_rating
  • audience_score
  • audience_rating
  • best_pic_win
  • best_actor_win
  • best_actress_win
  • best_dir_win
  • top200_box

Note that this section could go by several names, such as “pre-processing” or “feature selection”. I went with “data cleaning” since most of the work was dropping unwanted variables and selecting the ones we want in our model. Had the section involved other operations, a more fitting name could have been chosen.

Modeling

We start by including all eight explanatory variables that we narrowed the dataset down to and fit a multiple linear regression model with IMDB rating as the target variable.

This is what the model looks like:

IMDB Rating = B0 + B1(Critics Rating) + B2(Audience Score) + B3(Audience Rating) + B4(Best Pic Win) + B5(Best Actor Win) + B6(Best Actress Win) + B7(Best Director Win) + B8(Top 200 Box Office)
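Since the analysis was done in R, a rough sketch of fitting this full model looks like the following. The data frame name `movies` and the object name `full_model` are assumptions for illustration; the variable names follow the data cleaning section.

```r
# Full multiple linear regression model with all eight explanatory variables.
full_model <- lm(imdb_rating ~ critics_rating + audience_score + audience_rating +
                   best_pic_win + best_actor_win + best_actress_win +
                   best_dir_win + top200_box,
                 data = movies)
summary(full_model)  # coefficients, R-squared, and overall p-value
```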

Let’s take a look at the coefficients to see which variables are significant and how accurate the model is initially.

So, what can we infer from the results (above)? The R-squared value is 0.7808 while the adjusted R-squared is slightly lower at 0.7778, which is expected given the number of variables in the model, since adjusted R-squared penalizes every additional predictor.

With an overall p-value below 2.2e-16, the model is a statistically significant predictor of the rating at a significance level of 5%. The p-value is far lower than 0.05, indicating that the roughly 0.78 share of explained variance reflects a genuine relationship rather than chance.

Model Selection

Despite having a decent initial model, it would be good to see if we can optimize it further before we start making predictions.

At this point, there are several variables in the mix and perhaps we can reduce that number in an efficient way to see if the performance can be improved.

For that, we will use the Backwards Elimination Method (BEM) with p-values, since we want to retain only the most significant contributors to the model.

The process starts with the full model and the variable with the highest p-value gets removed or “eliminated”. This sequence continues until we are only left with variables that have a p-value less than 0.05.
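In R, this can be done by hand with `update()`, refitting after each removal and re-reading `summary()` for the next highest p-value. Here is a sketch of the four iterations summarized in the table below, assuming the `full_model` object from earlier:

```r
# Each step drops the predictor with the highest p-value in the previous fit.
m1 <- update(full_model, . ~ . - top200_box)     # iteration 1
m2 <- update(m1, . ~ . - best_pic_win)           # iteration 2
m3 <- update(m2, . ~ . - best_dir_win)           # iteration 3
m4 <- update(m3, . ~ . - best_actor_win)         # iteration 4
summary(m4)  # all remaining p-values are now below 0.05
```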

Iteration   Variable eliminated   p-value   Model R-squared   Model p-value
1           top200_box            0.9867    0.7778            2.2e-16
2           best_pic_win          0.6060    0.7781            2.2e-16
3           best_dir_win          0.2621    0.7784            2.2e-16
4           best_actor_win        0.0720    0.7790            2.2e-16

The table above displays the variables that are eliminated at each stage of the elimination process. Notice that the R-squared value sees very little variation throughout the process.

Here is the final model:

IMDB Rating = B0 + B1(Critics Rating) + B2(Audience Score) + B3(Audience Rating) + B4(Best Actress Win)
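Equivalently, the final model can be fit directly (a sketch, under the same naming assumptions as before):

```r
# Final model after backwards elimination.
final_model <- lm(imdb_rating ~ critics_rating + audience_score +
                    audience_rating + best_actress_win,
                  data = movies)
summary(final_model)
```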

Below are the model results.

From the table above, we can see that the model has an overall R-squared of 0.7771 with a p-value below 2.2e-16. At this point, the model is almost ready for prediction.

Notice that the critics_rating variable, which is categorical, shows two levels in the results table. The “Fresh” level has a p-value of 0.259, which is not significant, but we decided to stick with the variable as a whole since the “Rotten” level is significant with a p-value of less than 0.05.

To convince ourselves that this variable belongs in the model, we can use Analysis of Variance (ANOVA) to test the critics rating variable as a whole.

The table above shows the ANOVA results, from which we can see that critics rating as a whole is in fact significant, with a p-value below 2.2e-16 on the F-test. Hence we keep this variable.
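One way to run such a test in R is to compare the models with and without critics_rating; a sketch is below (the original analysis may instead have called `anova()` on the single fitted model, which tests each term sequentially):

```r
# Partial F-test for critics_rating as a whole.
reduced <- update(final_model, . ~ . - critics_rating)
anova(reduced, final_model)
```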

Model Diagnostics

Now that we have a more optimized model let’s take a look at how good (or bad) the model is.

To critique a linear regression model, there are four standard conditions to check: linear association, nearly normal residuals, constant variability of residuals, and independence of residuals. Below we go through each of these diagnostics in some detail.

  • Linear association between the numerical explanatory variables and the target variable. Since we have just one numerical variable in our model (audience score), we simply need to look at the scatter plot of audience score and the residuals.

From the above plot, the points appear to be randomly scattered around the zero-residual line with no obvious pattern, which supports the linearity condition.

  • Nearly normal residuals: The histogram below shows the distribution of the residuals, which is unimodal and centered around a mean of 0. There is a slight left skew, but the distribution is close to normal overall.

The normal probability plot of the residuals, shown below, magnifies this left skew, which is visible in the drop-off of the lower tail.


  • Constant variability of residuals: For this analysis, we use the residual plot of the model which shows that the data has a fan-shape and in fact does not have constant variability. There appears to be greater variability at lower values with the data being more sparse.

To further highlight the difference in variability, the plot below shows the absolute values of the residuals where the lower values once again have greater variability.

  • Independent residuals: Since the data (movies) were collected randomly (random sampling), it is fair to assume that the residuals are independent of each other as well.
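For reference, here is a minimal sketch of how these diagnostic plots might be produced in base R, assuming the `movies` data frame and the `final_model` object from earlier:

```r
res <- resid(final_model)

# 1. Linearity: residuals against the lone numerical predictor.
plot(movies$audience_score, res, xlab = "Audience score", ylab = "Residuals")
abline(h = 0, lty = 2)

# 2. Nearly normal residuals: histogram and normal probability plot.
hist(res, main = "Distribution of residuals")
qqnorm(res)
qqline(res)

# 3. Constant variability: residuals and |residuals| against fitted values.
plot(fitted(final_model), res)
abline(h = 0, lty = 2)
plot(fitted(final_model), abs(res))

# 4. Independence follows from the random sampling of the movies.
```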

Results

One final step remains before making our predictions. The table below shows the model coefficients as well as other statistics such as the standard error and p-value.

Variable                    Coefficient   Std Error   t value   P value
Intercept                   3.798         0.124293    30.565    < 2e-16
critics_rating (Fresh)      -0.065        0.058106    -1.130    0.259
critics_rating (Rotten)     -0.398        0.064938    -6.143    1.41e-09
audience_score              0.050         0.002122    23.798    < 2e-16
audience_rating (Upright)   -0.454        0.080966    -5.612    2.98e-08
best_actress_win (yes)      0.154         0.064321    2.399     0.0167

Just to recall how these coefficients work, let’s look at an example. For the audience rating variable, the interpretation is: all else held constant, an “Upright” audience rating is associated with an IMDB rating about 0.454 points lower than a “Spilled” one. Keep in mind that the sign of a coefficient has nothing to do with the significance of the variable.
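To make the arithmetic concrete, consider a hypothetical movie (the values are made up for illustration) with a “Fresh” critics rating, an audience score of 80, an “Upright” audience rating, and a lead actress who has won an Oscar. Plugging into the coefficients above:

Predicted rating = 3.798 − 0.065 + 0.05 × 80 − 0.454 + 0.154 ≈ 7.43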

Prediction

We are finally at the most exciting section. A handful of movies were picked to test the model. The selection was fairly random, although they are all movies that I have either watched or know of.

The table in the image above shows the movies that were picked for prediction along with the explanatory variables: Critics Rating, Audience Rating, Audience Score, and Best Actress Win.

Prediction was carried out using our multiple regression model which also included a confidence interval for the prediction as can be seen in the table.
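In R, this is a `predict()` call on the final model. Below is a sketch with a made-up movie; `interval = "prediction"` returns the interval for an individual new observation, which is likely what the table reports:

```r
# Hypothetical new movie; the values are invented for illustration.
new_movie <- data.frame(critics_rating   = "Certified Fresh",
                        audience_score   = 90,
                        audience_rating  = "Upright",
                        best_actress_win = "no")
predict(final_model, newdata = new_movie, interval = "prediction")
```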

From the table, it looks like most of the predictions are fairly close to the actual ratings. Dr. Strange, Up In The Air, Pearl Harbor, Magic Mike and Green Lantern are the closest predictions, while Gladiator, Dark Knight and Lord of the Rings are relatively far from the actual ratings. If we consider the confidence intervals, only Lord of the Rings falls outside its range.

Evaluation

In order to evaluate the predictions further, we compute the R-squared between the predicted results and the actual ratings. This gives us an R-squared of 0.448, which indicates that the model is not a particularly good predictor of the IMDB rating.
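A sketch of this computation is below. The `actual` and `predicted` vectors hold made-up placeholder values; in the real analysis they would contain the true IMDB ratings and the model predictions for the selected movies.

```r
# Hypothetical vectors: true ratings and model predictions for the test movies.
actual    <- c(7.6, 7.4, 6.1, 6.3, 5.5)   # made-up values for illustration
predicted <- c(7.3, 7.5, 6.4, 6.0, 6.2)   # made-up values for illustration

sse <- sum((actual - predicted)^2)        # residual sum of squares
sst <- sum((actual - mean(actual))^2)     # total sum of squares
1 - sse / sst                             # R-squared of the predictions
```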

Conclusion

In closing, it is clear that the final model is not the best. Note that this analysis was a rather simple and rough effort to create a movie predictor. There could be several ways to make this model better.

Here are several of them:

  • Techniques such as label encoding for categorical variables were not used. Label encoding converts the levels of a categorical variable into discrete numerical values, which can make them better suited for regression.
  • Scaling was not applied to any of the variables. For features measured in different units, scaling them to a common standard can help produce a more accurate model.
  • Other regression techniques such as Ridge, Lasso, K-nearest neighbors (KNN), or Decision Trees can be implemented for better results.
  • Evaluation techniques such as a train/test split, cross-validation, and grid search were not used; these can help optimize the model through repeated evaluation.
  • More analysis during the EDA section could have been done to better understand the data and remove redundant data points.

These are just some of the tricks that could yield a better model. I must admit that I am more adept with Python than with R, especially when it comes to data manipulation and machine learning algorithms.

Considering the possible improvements above, a similar analysis will be done in Python in a future post to see if we can improve on the work done here.