The target variable for this exercise was the IMDb rating, treated as a classification problem.
Typically, the rating ranges from 0.0 to 10.0, but because we focused on the Top 250, the minimum in the dataset was 8.0 and the maximum was 9.3, with a median of 8.3.
As a result, the target variable was binned into three ranges:
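A minimal sketch of how such binning could be done with `pandas` follows; the bin edges below are illustrative assumptions, since the exact cut points used to split the 8.0–9.3 range are not restated here:

```python
import pandas as pd

# Toy data standing in for the scraped Top 250 ratings.
df = pd.DataFrame({"imdb_rating": [8.0, 8.2, 8.5, 9.0, 9.3]})

# Hypothetical bin edges -- the actual cut points used to split the
# 8.0-9.3 range into three classes may differ.
bins = [7.9, 8.3, 8.7, 9.3]
df["rating_class"] = pd.cut(df["imdb_rating"], bins=bins,
                            labels=["low", "mid", "high"])
print(df)
```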
Features were extracted from the information available on each movie's IMDb page. The following were chosen based on their respective assumptions:
It is worth noting that, through initial exploration, Number of Votes appeared to be the feature most strongly correlated with the rating:
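That kind of check is a one-liner with a Pearson correlation matrix; a sketch below, where the column names are placeholders rather than the actual field names in the scraped dataset:

```python
import pandas as pd

# Toy frame standing in for the scraped Top 250 data; column names
# and values are assumptions for illustration only.
df = pd.DataFrame({
    "imdb_rating": [8.0, 8.2, 8.5, 9.0, 9.3],
    "num_votes":   [350_000, 420_000, 610_000, 1_100_000, 2_300_000],
    "duration":    [118, 102, 131, 148, 142],
    "year":        [1994, 2001, 1975, 2008, 1994],
})

# Rank each numeric feature by its correlation with the raw rating.
print(df.corr()["imdb_rating"].sort_values(ascending=False))
```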
For text-based features such as Stars, Keywords and Directors, each subset was relatively large: 584 individual Stars, 431 Keywords and 167 Directors. So although there were clear "winners" that were more likely to appear in the Top 250 list (e.g. only 5 directors had 7 movies in the list), the predictive power of these features was expected to be heavily diluted by their sheer number.
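One common way to turn such multi-valued text fields into model-ready columns is one-hot encoding, for example with scikit-learn's `MultiLabelBinarizer`. A sketch of the idea, not necessarily the exact encoding used here:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each movie carries a list of stars (and likewise keywords, directors).
stars = [
    ["Tom Skerritt", "Sigourney Weaver"],
    ["Tim Robbins", "Morgan Freeman"],
    ["Sigourney Weaver"],
]

mlb = MultiLabelBinarizer()
star_matrix = mlb.fit_transform(stars)  # shape: (n_movies, n_unique_stars)
print(mlb.classes_)   # 584 columns in the real dataset, 4 in this toy one
print(star_matrix)
```

Every unique name becomes its own 0/1 column, which is exactly why these three fields balloon the feature count and dilute any individual column's signal.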
Various tree-based methods were used to develop the model, including Decision Trees, Random Forests, Extra Trees, Gradient Boosting and Bagged Decision Trees.
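A sketch of how such a head-to-head comparison might look with scikit-learn and cross-validation; the stand-in data and default model settings are assumptions, not the tuned parameters actually used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier, ExtraTreesClassifier,
    GradientBoostingClassifier, BaggingClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with the same flavor as the real problem:
# many features, few samples, three classes.
X, y = make_classification(n_samples=250, n_features=1216,
                           n_informative=20, n_classes=3,
                           random_state=42)

models = {
    "Decision Tree":     DecisionTreeClassifier(random_state=42),
    "Random Forest":     RandomForestClassifier(random_state=42),
    "Extra Trees":       ExtraTreesClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Bagged Trees":      BaggingClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```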
Given concerns about the total number of resulting features, several feature selection approaches were tested as well, including using SelectKBest to identify the top features and manually excluding features (e.g. plot keywords were left out in one iteration).
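SelectKBest slots naturally in front of a classifier in a pipeline; a sketch below, where `k=100`, the `chi2` scoring function and the random toy data are illustrative choices rather than the values actually tested:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

# Mostly one-hot (0/1) features, as in the real dataset.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(250, 1216))
y = rng.integers(0, 3, size=250)

# Keep only the k highest-scoring features before fitting; k=100 is
# an illustrative value. chi2 suits non-negative one-hot columns.
pipe = make_pipeline(
    SelectKBest(score_func=chi2, k=100),
    GradientBoostingClassifier(random_state=42),
)
pipe.fit(X, y)
```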
The best-performing model was a Gradient Boosting classifier using all features, with an accuracy score of 0.79:
Although the accuracy score was strong, it was clear that the sample was weighted towards ratings in the lower third of the range. Ratings above 8.4 were not predicted accurately by the model.
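This is the classic imbalance trap: overall accuracy stays high while the rare classes get almost no correct predictions. A per-class report makes it visible immediately; a sketch with deliberately skewed toy labels (the 70/20/10 split is an assumption mimicking the skew, not the real class distribution):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Imbalanced toy labels mimicking the skew toward the lowest rating bin.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(250, 50))
y = rng.choice([0, 1, 2], size=250, p=[0.7, 0.2, 0.1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

gb = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# With uninformative features the model leans on the majority class,
# so the rare upper bins show near-zero recall even though overall
# accuracy stays close to the majority-class share.
print(classification_report(y_test, gb.predict(X_test), zero_division=0))
```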
Despite 1,216 total features, the final model was influenced mostly by one key feature, Number of Votes (as anticipated earlier); the top relative importances are listed below, and a sketch for extracting them follows the table:
| Feature | Relative Importance |
|---|---|
| Number of Votes | 0.208 |
| Duration | 0.041 |
| Genre: Sci | 0.038 |
| Genre: Fi | 0.037 |
| Keyword: Lake Tahoe | 0.026 |
| Keyword: Corrupt Politician | 0.025 |
| Star: Tom Skerritt | 0.025 |
| Year | 0.019 |
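A table like this can be read straight off a fitted model's `feature_importances_` attribute; a sketch below, where the toy fit and generic `feature_names` stand in for the real 1,216 encoded columns:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Toy fit; in the real pipeline feature_names would be the 1,216
# encoded column names (votes, duration, genres, keywords, stars, ...).
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(250, 10))
y = rng.integers(0, 3, size=250)
feature_names = [f"feature_{i}" for i in range(10)]

gb = GradientBoostingClassifier(random_state=42).fit(X, y)

# Rank features by their contribution to the fitted ensemble.
importances = (pd.Series(gb.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances.head(8))
```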
The fact that Number of Votes was the most salient factor could be a result of herd mentality: when a movie is already well rated and has a high volume of votes, people may be less likely to vote it down.
Although the first iteration of the model had a strong headline accuracy score, with 1,216 features and only 250 observations it was unsurprising that the per-class results were sub-par.
Most of the next steps should center on reducing the number of features and rebalancing the classes across the train/test split.
Further analysis could also refine our feature selection, for instance reducing plot keywords to a select few, which would shrink the number of features coming from this macro-feature and potentially increase the predictive power of plot keywords.
Upon refinement, additional features could also be added, such as scores from other sites like Rotten Tomatoes or Metacritic.