Film Recommendation Engine for ‘Life Altering Films’

I want to find films for people that will inspire them, challenge their view of the world, make them think and make them feel – ‘Life Altering Films’ (LAFs). People who seek LAFs typically run into a few pitfalls:

  1. They spend a lot of time searching through film websites, databases, review sites and recommendation sites, and watching trailers.
  2. Even with the most rigorous search and recommendation tools available, they often come back with a list of films, the vast majority of which are not LAFs – and some of which are not even ‘good’ at all.
  3. Given these barriers, many likely miss the opportunity to watch LAFs altogether. This is detrimental to both filmmakers and audiences: the filmmaker cannot spread their message, and the viewer is less likely to have their view of the world challenged.

Limitations to existing technologies

Recommendation engines have come a long way in the last decade. However, neither I nor my film-loving friends have yet found any that reliably find LAFs for us. I suspect recommendation engines suffer some common difficulties:

  1. They are driven by commercial incentives and will recommend films you are likely to be persuaded to watch, which may or may not be LAFs. This is a distinctly different goal to connecting people with their LAFs.
  2. They are not hand-crafted enough for the specific user. They tackle the problem at scale first, trying to produce as many recommendations for as many people as possible. User taste is far more subjective than common opinion suggests, which requires much more analysis at the individual level.
  3. It is difficult to collect enough previously rated films from new users from which to predict their LAFs. This rests on the assumption that more data typically trumps a slightly better-tuned recommendation engine.
  4. It is challenging to gather more user data by gamifying the experience – including feeding users the films they are most likely to have seen (rather than the vastness of all films of all time).

Technical approach

In my approach, the question isn’t ‘can we create a near-perfect model that will recommend accurately?’. It is merely ‘how much data (previously reviewed films) is required for an unrefined model, whose features are hand-crafted for the new user, to predict reasonably accurately?’

See my code here.

The technical approach to solving this problem is broken down in the following parts:

  • Feature Engineering
    • The datasets
    • Merging the datasets & cleaning the data
    • Creating surveys for new users
    • Generating profiles for new users
    • Creating feature tables
  • Creating Recommendations
    • Linear regression on film attributes – per new user
    • Linear regression on old user ratings – per new user
    • Finding old users with similar ratings to new users
  • Results
  • Future of the project
  • Summary

The data sets

Three publicly available data sets were used. They and their fields of interest are shown below:

  • MovieLens – historical user ratings (films previously rated by 1000s of other users)
  • OMDB – budget, revenue, runtime, job titles (Director, Producer, Editor etc.)
  • IMDB  – Actors (actor name, and # ranking in credits)

Across all 3 data sets, each film has a unique id, plus the film’s title and year of production. I explored all 3 data sets initially to see which aspects might be useful, and kept all 3 as each had unique information that could be used as features. The final list of features prepared was:

  • year
  • set of {historical user ratings}
  • set of {cast/crew from all job titles}
  • budget
  • revenue
  • runtime
  • num_ratings
  • average_rating

Merging the data sets & Cleaning the data

Given that none of the films had been linked across data sets, it was established that the best way to merge the 3 data sets was by the film’s title and year. The difficulty in merging arose from various incongruences in how the same film’s title and year were recorded across data sets. Some examples:

  1. ‘City of God (Cidade de Deus)’ in MovieLens and ‘City of God’ in OMDB
  2. ‘Birdman’ in MovieLens and OMDB and ‘Birdman or (The Unexpected Virtue of Ignorance)’ in IMDB
  3. ‘V for Vendetta’ is a 2006 movie in MovieLens and OMDB, but it’s a 2005 movie in IMDB
  4. ‘Usual Suspects, The (1995)’ in MovieLens is ‘The Usual Suspects’ (with year separated) in IMDB & OMDB
  5. OMDB had dates stored as dd-mm-yy and MovieLens & IMDB as (yyyy)

Some experiments were undertaken to minimise these incongruences; however, given the scope of the challenge of finding and modelling them all, not all were corrected (incongruences of types 4 & 5 were fixed). When all 3 data sets had been linked (despite some title and year issues remaining), there were 8832 films from which to extract the features. In the future I hope to improve the film title matching algorithms to obtain a greater pool of films.
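The fixes for incongruences of types 4 & 5 can be sketched roughly as below. This is a simplified stand-in for the actual code: the function and column names are illustrative, not taken from the repository.

```python
import re
import pandas as pd

def normalise_title(raw):
    """Normalise a MovieLens-style title such as 'Usual Suspects, The (1995)'
    into ('The Usual Suspects', 1995) so it can be matched across data sets."""
    title = raw.strip()
    year = None
    # Strip a trailing '(yyyy)' year if present
    match = re.search(r"\((\d{4})\)\s*$", title)
    if match:
        year = int(match.group(1))
        title = title[:match.start()].strip()
    # Move a trailing ', The' / ', A' / ', An' article to the front
    match = re.search(r",\s+(The|A|An)$", title)
    if match:
        title = f"{match.group(1)} {title[:match.start()]}"
    return title, year

# Merge two toy frames on the normalised (title, year) key
movielens = pd.DataFrame({"ml_title": ["Usual Suspects, The (1995)"],
                          "avg_rating": [8.9]})
omdb = pd.DataFrame({"title": ["The Usual Suspects"], "year": [1995],
                     "budget": [6_000_000]})
movielens[["title", "year"]] = movielens["ml_title"].apply(
    lambda t: pd.Series(normalise_title(t)))
movielens["year"] = movielens["year"].astype(int)
merged = movielens.merge(omdb, on=["title", "year"])
```

Incongruences of types 1–3 (alternative titles, differing years) would need fuzzier matching on top of this, which is exactly the remaining work mentioned above.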

Creating surveys for new users

In order to find LAFs for users, I needed to capture their previous preferences through a rating system. It would be infeasible to have each user rate 8832 different films: it would take a long time, and many of those films would likely not have been watched. To create a feasible list I experimented with some filtering parameters, and finally settled upon films that:

  • Were made in or after 1965
  • Were in English
  • Had at least 10 previous user ratings

I sorted this list by average historical user rating, then by number of ratings, and took the top 1000 films. Then, for each of the ~10 new users, I printed an Excel sheet with the top 500 films randomised, followed by the bottom 500 (of the top 1000) randomised. Each randomised sheet was then imported into a shared Google Sheet for each group of new users (i.e. friends, family etc.), with a separate tab per user. Here is an example for my own survey data.

See column ‘out of 10’ of sheet ‘Reviewed’ for my personal film ratings, where (blank), ‘-’ or ‘-1’ denotes a film not seen. It was important to randomise the film titles to remove bias: if, say, the films were sorted highest-rated first, films like The Shawshank Redemption, well known to rate highly, could inflate one’s ratings of those near the top of the list.
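The survey construction described above can be sketched as follows. This is a simplified stand-in for the actual code, with illustrative column names:

```python
import pandas as pd

def build_survey(films, n_top=1000, seed=0):
    """Filter the film pool to the survey criteria, rank by average rating
    then number of ratings, and shuffle the top and bottom halves of the
    best n_top films separately to remove ordering bias."""
    eligible = films[(films["year"] >= 1965)
                     & (films["language"] == "English")
                     & (films["num_ratings"] >= 10)]
    ranked = eligible.sort_values(["average_rating", "num_ratings"],
                                  ascending=False).head(n_top)
    half = len(ranked) // 2
    top = ranked.iloc[:half].sample(frac=1, random_state=seed)
    bottom = ranked.iloc[half:].sample(frac=1, random_state=seed)
    return pd.concat([top, bottom])
```

Shuffling each half separately keeps the better-known films early in the survey (so users recognise more of them) while still hiding the exact ranking.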

Connecting pandas to Google Sheets proved troublesome: it was slow and not fully documented anywhere for the current versions of pandas and Google authentication. I had to experiment with and combine a number of different resources and code examples to read new user data from Google Sheets. This was an important step, as it made the task much less burdensome for users, who would otherwise have to use Excel and save and email versions of complete/incomplete files. Once received, dataframes of each new user’s film ratings were read into a 2D dictionary of the format users[sheet_name][user_name].
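The arrangement into the 2D dictionary can be sketched as below. This is a simplified stand-in: here the records are passed in directly, whereas the real pipeline reads them from the Google Sheets API.

```python
from collections import defaultdict
import pandas as pd

def load_user_ratings(worksheets):
    """Arrange per-user rating records into the 2D dictionary
    users[sheet_name][user_name]. `worksheets` is an iterable of
    (sheet_name, user_name, records) tuples."""
    users = defaultdict(dict)
    for sheet_name, user_name, records in worksheets:
        df = pd.DataFrame(records)
        # (blank), '-' or '-1' all denote a film the user has not seen
        df["out of 10"] = pd.to_numeric(df["out of 10"], errors="coerce")
        users[sheet_name][user_name] = df[df["out of 10"].between(0, 10)]
    return dict(users)
```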

Generating profiles for new users

Going hand-in-hand with creating the LAF recommendations, I also wanted to create a profile for each user. The profile would show which aspects of films, such as director, screenplay or actors, are preferred or not preferred by that user. There were two reasons for doing this:

  1. Giving the user some information to better understand what drives their preferences for films
  2. Eventually using the data to drive more customised recommendations for the user

The first step was to create a model of the features (variables) to be used. This was important since the features came in all shapes and sizes: stored in different dataframes, continuous or discrete, spread over multiple columns, or with a vast number of instances (e.g. actors). A dictionary-of-features structure was created to account for these nuances, and then a function add_features(df, features) was written, which adds any combination of features (excluding historical user ratings) to a dataframe dynamically. Once the features were added, I wanted to visualise which were most prominent for each user, via the function graph_top_discrete_features(sheet_name, user_name, feature, max_disp_results, bad_films), which in turn calls sort_by_feature_count. Here are some of the results:



Intuitively, this will need more refinement, as these preferences may be misleading: they are merely features that are ‘likely’ to be preferred or not preferred by that user. For example, from the last graph, there are 4 films with Christian Bale that Dave has rated poorly. There may be another 3 or more films with Christian Bale (yet to be rated) that Dave would rate highly.
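The counting behind these graphs can be sketched as below – a much-simplified stand-in for graph_top_discrete_features / sort_by_feature_count, with illustrative column names and a list-valued feature column:

```python
import pandas as pd

def top_feature_counts(rated, feature, good_threshold=7,
                       bad_films=False, max_results=5):
    """Count how often each value of a discrete feature (e.g. actor or
    director) occurs among a user's good or bad films. Frequent names in
    highly rated films are 'likely' preferred -- a crude profile."""
    if bad_films:
        subset = rated[rated["rating"] < good_threshold]
    else:
        subset = rated[rated["rating"] >= good_threshold]
    # One row per (film, feature value), then tally the values
    return subset.explode(feature)[feature].value_counts().head(max_results)
```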

CREATING RECOMMENDATIONS – Linear regression on film attributes – per new user

I started attempting to create predictions by looking at some continuous features:

  • budget
  • revenue
  • runtime
  • num_ratings
  • average_rating

For each new user, I ran a linear regression model on each feature separately, then on all features together. For each model, the data was split into training and test sets for cross-validation. The function add_features, discussed above, was used to generate the results, which were added to this table and graphed below:



Unfortunately no significant results came out of this, as none of the models’ test R^2 scores (performance on future data) were even above 50%. This included my own data: I looked through over 6000 films to rate 657 of them, hence 657 data points. Perhaps with more data some patterns might have been found, but it is probably infeasible to ask users for more ratings. Results might also improve with Principal Component Analysis (PCA); however, based on my intuitive understanding, if PCA were to help we would expect to see some correlation among the single-feature models, which was not the case.

CREATING RECOMMENDATIONS – Linear regression on old user ratings – per new user

The next angle was to try to find linear relationships between each new user and historical users’ ratings. For this, I created a few functions to handle all new and old users’ ratings dynamically:

  • find_hist_users(sheet_name, user_name, only_match_good, max_num_users, min_common_films) – find all historical users who have rated the same films as the new user
  • create_hist_user_features(hist_users_with_new, max_features) – convert hist_users_with_new into a user features dataframe
  • pivot_and_keep_cols(df, values, index, columns, keep_cols = [], dropna = True) – make each userId a column name holding that historical user’s film ratings
  • run_linreg_hist_users(sheet_name, user_name, max_num_users, only_match_good, results) – run multiple linear regression models for each new user, with old users’ ratings as features

Using the above functions, I ran linear regression models for all new users against previous users, tweaking both:

  • The number of users to use as features (1, 2, 5, 10, 20, 50, 100, 200, 500), and
  • whether to match on good films only, defined as films with a rating of at least good_films_threshold = 7.
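The pivot-then-regress step can be sketched as below – a much-simplified stand-in for pivot_and_keep_cols plus run_linreg_hist_users, assuming scikit-learn for the regression and the column names userId, movieId and rating:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def regress_on_old_users(hist, new_ratings):
    """Pivot historical ratings so each old userId becomes a feature column,
    keep films the new user has also rated, and fit a linear model."""
    wide = hist.pivot_table(values="rating", index="movieId",
                            columns="userId")
    # Attach the new user's ratings and keep fully rated films only
    joined = wide.join(new_ratings.set_index("movieId")["rating"]).dropna()
    X = joined.drop(columns="rating")
    y = joined["rating"]
    model = LinearRegression().fit(X, y)
    return model, model.score(X, y)
```

When one old user’s ratings track the new user’s closely, the in-sample R^2 approaches 1 – which is also why the high scores above may simply be overfit on few common films.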

An excerpt of the results table looked like this:


The R^2 results for each are summarised below. Some were quite high (perhaps overfit), and others were wildly inaccurate. In all, the results, although inconsistent across models, gave some indication (in some of the models) that similar raters are out there. See excerpt below:


Upon analysing these results further, it became evident that new users who had rated more films had a substantially larger pool of old raters from which to find similar raters. Below is a graph of ‘amount of old users’ vs ‘amount of films in common’ for both the heaviest and the lightest new user raters:


As seen on the graphs, the sweet spot (the knee), having both a large number of old users and a large number of films in common, was around (50, 50) for the heaviest film rater, compared with (50, 5) for the lightest. The conclusion is that the more films a new user rates, the more opportunity there is to find data (films both users have rated) from which to build a model comparing the two.

When comparing the results of ‘Amount of old users’ vs ‘Amount of films in common’ across all new users, they follow a similar locus:


CREATING RECOMMENDATIONS – Finding old users with similar ratings to new users

A different, and perhaps more exploratory, alternative for recommending came from testing the theory that there exist users similar to the new users. To find similar historical (old) users, there had to be some kind of scoring system/metric, which first requires finding old users who have rated the same films as the new user. The algorithm created is described below:

  1. Find old users who have rated the same films
  2. For each old rater and each new rater, reduce the list to films both parties have rated.
  3. From this list, calculate the mean & standard deviation (std) for each of the old and new user.
  4. Compare the new (mean, std) with the old (mean, std), then order by the ratio difference in std, and then by the ratio difference in mean.
  5. Remove the old users whose difference in mean or std exceeds a certain tolerance (by default 10%).

This was achieved by a new function: find_similar_hist_users(sheet_name, user_name, max_num_users, only_match_good, tolerance, similar_users)
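The scoring steps can be sketched as below – a much-simplified stand-in for find_similar_hist_users, with a reduced signature and the column names userId, movieId and rating assumed:

```python
import pandas as pd

def score_similar_old_users(new_ratings, hist, tolerance=0.10, min_common=5):
    """Score old users by how closely their (mean, std) on commonly rated
    films matches the new user's, following the 5 steps above."""
    rows = []
    for user_id, group in hist.groupby("userId"):
        # Step 2: keep only films both parties have rated
        common = group.merge(new_ratings, on="movieId",
                             suffixes=("_old", "_new"))
        if len(common) < min_common:
            continue
        # Steps 3-4: compare means and stds as ratio differences
        mean_diff = abs(common["rating_new"].mean()
                        / common["rating_old"].mean() - 1)
        std_diff = abs(common["rating_new"].std()
                       / common["rating_old"].std() - 1)
        # Step 5: discard old users outside the tolerance
        if mean_diff <= tolerance and std_diff <= tolerance:
            rows.append((user_id, std_diff, mean_diff, len(common)))
    return (pd.DataFrame(rows, columns=["userId", "std_diff",
                                        "mean_diff", "n_common"])
            .sort_values(["std_diff", "mean_diff"]))
```

The per-old-user loop is what makes the real version expensive at scale, hence the memory and speed issues described next.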

This function was called in a loop over all new users. After various attempts, it was discovered that processing and memory constraints caused the program to crash. I optimised this by converting one of the 1.7GB dataframes into a 670MB dataframe, which helped slightly. Then, instead of calculating results for all users, I tested only on myself (since I had the most data). Additionally, I had to cap the maximum number of old users to search at 200. See the results below for users similar to myself. The ‘out of 10’ column denotes the new user’s rating (in this case mine), and the ‘userId’ column denotes the old user’s rating, per row index:


In the future, I’m looking to optimise the speed of this algorithm to increase its search capability, and to experiment with different scoring systems. Once more patterns emerge, I can use this information to define a classification problem for creating recommendations.


Once historical users were found, I gathered all of the films they had rated at least 8/10, removed the films I’d already seen, then printed the results. See the excerpt below:


On inspecting the list, there are some films I’d already seen but not yet rated, since they were not part of the survey. This effect would need to be built into the workflow for new users: they review films, then receive recommendations, some of which may be films they have already watched; those films can then form part of the next survey to enlarge the user’s predictor data set.
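This final recommendation step can be sketched as below, with the same assumed column names as before:

```python
import pandas as pd

def recommend_unseen(hist, similar_ids, seen_movie_ids, threshold=8):
    """Collect films the similar old users rated at least threshold/10,
    dropping anything the new user has already seen."""
    picks = hist[hist["userId"].isin(similar_ids)
                 & (hist["rating"] >= threshold)
                 & ~hist["movieId"].isin(seen_movie_ids)]
    # Rank by how many similar users endorsed each film, then by rating
    return (picks.groupby("movieId")["rating"]
            .agg(n_endorsers="count", avg_rating="mean")
            .sort_values(["n_endorsers", "avg_rating"], ascending=False))
```

Ranking by the number of endorsers (a choice I’ve added here for illustration) favours films that several similar raters loved over a single outlier rating.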

I’m looking forward to watching some of these to road test the system. After this, with a leap of faith or perhaps with more refinement, I shall send recommendations to my early users for feedback.

Future of the project

As mentioned in each of the sections, various areas can be optimised or explored differently – this several-month project is just the tip of the iceberg. Thankfully there is enough evidence to say that similar users likely exist, and that their previous ratings can be used to recommend films to new users at an individual level. Here is a summary of the proposed future experiments:

  • Running clustering algorithms to see which patterns exist
  • Creating a classification model, predicting good/bad films for users based on a threshold
  • Principal Component Analysis to improve accuracy of the models that have numerous features.
  • Creating a classification model, that has classes: 7, 8, 9, 10 for ratings out of 10
  • Creating a classification model that uses each new user’s preferred features (as graphed above, e.g. favourite directors/producers/screenplays etc.)
  • Ensemble methods to combine the best aspects of various models.
  • Creating a front-end to allow users to make advanced searches, recording their usage with Google Analytics to gain insight into which aspects of films they find most important.
  • Gamifying the experience for new users to increase the amount of new user data.
  • Improving the film title matching algorithm for the 3 data-sets to create a larger master data set.
  • Continuing the approach in the last section by improving the scoring system with old users.
  • Merging recommendations across new users, to get shared recommendations for a group.


Summary

Copious amounts of data have been churned into a useful structure that enables people to rate films, and a dent has been made in the goal of finding LAFs for new users. Many ideas for continued or new experiments have arisen from this work, and it will be interesting to see the project grow and who else might find it useful. Concrete examples of how users can currently use this information are the preferred/non-preferred features described in the generated user profiles (Director, Screenplay, Producer, Director of Photography, Editor, Original Music Composer, Executive Producer), and similarly the films recommended by the similar-historical-users scoring algorithm described above. I can produce a list of basic recommendations for each new user (as I have for myself).

Linear models are limited, and time-consuming to run. The main limitation is that linear relationships don’t always exist, regardless of the amount of data. In this instance, it could be that ratings (the predicted quantity) are not usefully thought of as continuous. Classification models using each new user’s preferred and non-preferred people (classes of features such as directors/actors/screenwriters) can be explored further, and similarly for old user ratings, using classes such as 7, 8, 9 and 10.

Not all features will be useful, and too many add complexity and processing cost; getting this balance right is a continued challenge. I’m confident that, with enough data and a suitable model or collection of models, new users’ LAFs can be found. The continued question is: how much new-user data is needed to find their LAFs?