By Manuel Garrido — Dec 5, 2015

A short introduction to Recommendation Systems

In this tutorial, we will dive into recommendation systems.

You might not know what recommendation systems are, but you see them everywhere on the internet.

Every time you shop on Amazon and see related products...

Or when Netflix recommends something interesting to watch...

The purpose of a recommendation system is to predict the rating that a user would give to an item they have not yet rated.

This prediction is produced by analyzing item characteristics, other user/item ratings, or both, and it is then used to provide personalized recommendations to users.

There are two main approaches to recommendation systems:

Content Filtering. Recommendations depend on item characteristics.
Collaborative Filtering. Recommendations depend on user-item ratings.

In this tutorial we will work with the MovieLens dataset. This dataset contains user-generated movie ratings from the website MovieLens (https://movielens.org/).

It contains multiple files, but this tutorial only uses movies.dat and ratings.dat.

First we download the dataset:

wget http://files.grouplens.org/datasets/movielens/ml-1m.zip  
unzip ml-1m.zip  
cd ml-1m/

Content Filtering

Here are the first rows of the movies.dat file. The file follows the format:

movieid::movietitle::movie genre(s)

head movies.dat

1::Toy Story (1995)::Animation|Children's|Comedy  
2::Jumanji (1995)::Adventure|Children's|Fantasy  
3::Grumpier Old Men (1995)::Comedy|Romance  
4::Waiting to Exhale (1995)::Comedy|Drama  
5::Father of the Bride Part II (1995)::Comedy  
6::Heat (1995)::Action|Crime|Thriller  
7::Sabrina (1995)::Comedy|Romance  
8::Tom and Huck (1995)::Adventure|Children's  
9::Sudden Death (1995)::Action  
10::GoldenEye (1995)::Action|Adventure|Thriller

Genres are separated by a pipe (|).
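To make the record format concrete, here is a minimal sketch that splits one line of movies.dat by hand. The helper name parse_movie_line is hypothetical and only for illustration; the tutorial itself loads the file with pandas.

```python
# Hypothetical helper, not part of the tutorial's pipeline.
def parse_movie_line(line):
    """Split one 'movieid::movietitle::genres' record into its three parts."""
    movie_id, title, genres = line.rstrip("\n").split("::", 2)
    # Genres are pipe-separated, so a second split yields a list of genre names.
    return int(movie_id), title, genres.split("|")


record = parse_movie_line("1::Toy Story (1995)::Animation|Children's|Comedy")
print(record)
```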

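Now we load the movies file:

import pandas as pd
import numpy as np
# sep='::' needs engine='python'; latin-1 is assumed here because the ml-1m files are not UTF-8
movies_df = pd.read_table('movies.dat', header=None, sep='::', engine='python', encoding='latin-1',
                          names=['movie_id', 'movie_title', 'movie_genre'])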

movies_df.head()

Out[]:

	movie_id	movie_title	movie_genre
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama

In order to be able to work with the movie_genre column, we need to transform it to what is called "dummy variables".

This is a way to convert a categorical variable (e.g. Animation, Comedy, Romance...), into multiple columns (one column named Action, one named Comedy, etc).

For each movie, these dummy columns have a value of 1 for the genres the movie belongs to and 0 for all the others.

# we convert the movie genres to a set of dummy variables 
movies_df = pd.concat([movies_df, movies_df.movie_genre.str.get_dummies(sep='|')], axis=1)  
movies_df.head()

Out[]:

	movie_id	movie_title	movie_genre	Adventure	Animation	Children's	Comedy	...
0	1	Toy Story (1995)	Animation|Children's|Comedy	0	1	1	1	...
1	2	Jumanji (1995)	Adventure|Children's|Fantasy	1	0	1	0	...
2	3	Grumpier Old Men (1995)	Comedy|Romance	0	0	0	1	...
3	4	Waiting to Exhale (1995)	Comedy|Drama	0	0	0	1	...
4	5	Father of the Bride Part II (1995)	Comedy	0	0	0	1	...

For example, the movie with an id of 1, Toy Story, belongs to the genres Animation, Children's and Comedy, so those three columns have a value of 1.

movie_categories = movies_df.columns[3:]  
movies_df.loc[0]

Out[]:

movie_id       1
movie_title    Toy Story (1995)
movie_genre    Animation|Children's|Comedy
Action         0
Adventure      0
Animation      1
Children's     1
Comedy         1
Crime          0
Documentary    0
Drama          0
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
Name: 0, dtype: object
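To see the dummy-variable conversion in isolation, here is a small sketch on a made-up three-movie Series (the data is hypothetical, not taken from MovieLens). It uses the same str.get_dummies call as above.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate str.get_dummies.
genres = pd.Series(["Comedy|Romance", "Action", "Action|Comedy"])

# One column per distinct genre, sorted alphabetically; 1 marks membership.
dummies = genres.str.get_dummies(sep="|")
print(dummies)
```

Each row ends up with a 1 in every genre column it belongs to, which is exactly the layout movies_df has after the concat step.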

Content filtering is a simple way to build a recommendation system. Here, items (in this example movies) are mapped to a set of features (genres).

To recommend an item to a user, that user first has to provide their preferences regarding those features.

So in this example, the user has to tell the system how much they like each movie genre.

Right now we have all the movies mapped into genres. We just need to create a user and map that user into those genres.

Let's create a user with a strong preference for action, adventure, fantasy and sci-fi movies.

from collections import OrderedDict

# Start with every genre at 0, in the same order as the dummy columns.
# The scoring code below pairs preference values with genre columns by position,
# so the key order has to match movie_categories.
user_preferences = OrderedDict((genre, 0) for genre in movie_categories)

user_preferences['Action'] = 5
user_preferences['Adventure'] = 5
user_preferences['Animation'] = 1
user_preferences["Children's"] = 1
user_preferences['Comedy'] = 3
user_preferences['Crime'] = 2
user_preferences['Documentary'] = 1
user_preferences['Drama'] = 1
user_preferences['Fantasy'] = 5
user_preferences['Film-Noir'] = 1
user_preferences['Horror'] = 2
user_preferences['Musical'] = 1
user_preferences['Mystery'] = 3
user_preferences['Romance'] = 1
user_preferences['Sci-Fi'] = 5
user_preferences['Thriller'] = 3
user_preferences['War'] = 2
user_preferences['Western'] = 1

Once we have the users' genre preferences and the movies mapped into genres, computing the score of a movie for a specific user only requires the dot product of that movie's genre vector with that user's preference vector.

#in production you would use np.dot instead of writing your own dot product function.
def dot_product(vector_1, vector_2):  
    return sum([ i*j for i,j in zip(vector_1, vector_2)])

def get_movie_score(movie_features, user_preferences):  
    return dot_product(movie_features, user_preferences)
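As the comment above mentions, np.dot does the same job. The sketch below checks that on a hypothetical three-genre example (the vectors are made up for illustration).

```python
import numpy as np

# Hypothetical 3-genre example: the movie belongs to the first and third genre.
movie_features = np.array([1, 0, 1])
user_prefs = np.array([5, 2, 1])

# Hand-written version, same logic as dot_product in the tutorial.
manual_score = sum(i * j for i, j in zip(movie_features, user_prefs))

# NumPy version: 1*5 + 0*2 + 1*1
numpy_score = np.dot(movie_features, user_prefs)
print(manual_score, numpy_score)
```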

Let's compute the score of the movie 'Toy Story' (a children's animation movie) for the sample user.

toy_story_features = movies_df.loc[0][movie_categories]  
toy_story_features

Action         0
Adventure      0
Animation      1
Children's     1
Comedy         1
Crime          0
Documentary    0
Drama          0
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
Name: 0, dtype: object

toy_story_user_predicted_score = dot_product(toy_story_features, user_preferences.values())  
toy_story_user_predicted_score

Out[]:

5

So for this user, Toy Story has a score of 5. That does not mean much by itself, but it lets us compare how good a recommendation Toy Story is relative to other movies.

Let's calculate the score for Die Hard (a thrilling action movie):

movies_df[movies_df.movie_title.str.contains('Die Hard')]

	movie_id	movie_title	movie_genre	Action	...
163	165	Die Hard: With a Vengeance (1995)	Action|Thriller	1	...
1023	1036	Die Hard (1988)	Action|Thriller	1	...
1349	1370	Die Hard 2 (1990)	Action|Thriller	1	...

die_hard_id = 1036  
die_hard_features = movies_df[movies_df.movie_id==die_hard_id][movie_categories]  
die_hard_features.T

Out[]:

	1023
Action	1
Adventure	0
Animation	0
Children's	0
Comedy	0
Crime	0
Documentary	0
Drama	0
Fantasy	0
Film-Noir	0
Horror	0
Musical	0
Mystery	0
Romance	0
Sci-Fi	0
Thriller	1
War	0
Western	0

Note: 1023 is the dataframe row index for Die Hard, not its movie id in the MovieLens dataset.

die_hard_user_predicted_score = dot_product(die_hard_features.values[0], user_preferences.values())  
die_hard_user_predicted_score

Out[]:

8

So we see that Die Hard gets a score of 8 versus 5 for Toy Story, so Die Hard would be recommended before Toy Story. That makes sense, given that this user's preferences are skewed towards action-packed movies.

Once we know how to calculate the score for one movie, providing movie recommendations for the user is as easy as calculating the score for all the movies and returning those with the highest scores.

def get_movie_recommendations(user_preferences, n_recommendations):
    # add a 'score' column to movies_df with the calculated score of each movie for the given user
    # (note that this modifies the global movies_df in place)
    movies_df['score'] = movies_df[movie_categories].apply(
        get_movie_score, args=(list(user_preferences.values()),), axis=1)
    return movies_df.sort_values(by=['score'], ascending=False)['movie_title'][:n_recommendations]

get_movie_recommendations(user_preferences, 10)

Out[]:

2253    Soldier (1998)
257     Star Wars: Episode IV - A New Hope (1977)
2036    Tron (1982)
1197    Army of Darkness (1993)
2559    Star Wars: Episode I - The Phantom Menace (1999)
1985    Honey, I Shrunk the Kids (1989)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
1111    Abyss, The (1989)
1848    Armageddon (1998)
2847    Total Recall (1990)
Name: movie_title, dtype: object

So the system recommends action-heavy and sci-fi movies. Neat!
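Scoring one row at a time with apply pairs preferences and genre columns purely by position, which is easy to get wrong. One possible alternative, sketched here on a hypothetical toy catalogue (toy_movies, genre_columns and prefs are made-up names), is to keep the preferences in a pandas Series and let DataFrame.dot align them with the columns by label.

```python
import pandas as pd

# Hypothetical toy catalogue with three genre columns.
toy_movies = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "Action": [1, 0, 1],
    "Comedy": [0, 1, 1],
    "Sci-Fi": [1, 0, 0],
})
genre_columns = ["Action", "Comedy", "Sci-Fi"]

# Preferences deliberately listed in a different order than the columns.
prefs = pd.Series({"Sci-Fi": 5, "Action": 4, "Comedy": 1})

# DataFrame.dot matches the Series index against the column labels,
# so the order of the preferences does not matter.
scores = toy_movies[genre_columns].dot(prefs)

ranked = toy_movies.assign(score=scores).sort_values("score", ascending=False)
print(ranked[["movie_title", "score"]])
```

This computes all scores in one matrix-vector product and leaves the original dataframe untouched, since assign returns a copy.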

Content Filtering makes recommending to a new user very easy. Users just have to express their preferences once. However, Content Filtering has some drawbacks:

Each item needs to be mapped into the feature space. That means that any time a new item is added, someone has to categorize it manually.
Recommendations are limited in scope. Items can only be described with the predefined features, so they can't be categorized by new ones.

So content filtering may be too simple an option nowadays, which leads us to...

Collaborative Filtering

Collaborative filtering is another way of predicting user-item scores. This time though, we will use the existing user-item scores to predict the missing ones.

The assumption is that users get value from recommendations based on other users with similar tastes.

For this example we will use the ratings.dat file. This file follows the format:

userid::movieid::rating::timestamp

head ratings.dat

1::1193::5::978300760  
1::661::3::978302109  
1::914::3::978301968  
1::3408::4::978300275  
1::2355::5::978824291  
1::1197::3::978302268  
1::1287::5::978302039  
1::2804::5::978300719  
1::594::4::978302268  
1::919::4::978301368

The MovieLens dataset provides us with a file that includes over 1 million movie ratings.

# sep='::' is a multi-character separator, so pandas needs engine='python'
ratings_df = pd.read_table('ratings.dat', header=None, sep='::', engine='python',
                           names=['user_id', 'movie_id', 'rating', 'timestamp'])

# we don't care about the time the rating was given
del ratings_df['timestamp']

# add movie_title next to movie_id for legibility
ratings_df = pd.merge(ratings_df, movies_df, on='movie_id')[['user_id', 'movie_title', 'movie_id','rating']]

ratings_df.head()

Out[]:

	user_id	movie_title	movie_id	rating
0	1	One Flew Over the Cuckoo's Nest (1975)	1193	5
1	2	One Flew Over the Cuckoo's Nest (1975)	1193	5
2	12	One Flew Over the Cuckoo's Nest (1975)	1193	4
3	15	One Flew Over the Cuckoo's Nest (1975)	1193	4
4	17	One Flew Over the Cuckoo's Nest (1975)	1193	5

The ratings can be seen as a matrix of users and movies, so we convert ratings_df to a matrix with one user per row and one movie per column. Movies a user has not rated are filled with 0.

ratings_mtx_df = ratings_df.pivot_table(values='rating', index='user_id', columns='movie_title')  
ratings_mtx_df.fillna(0, inplace=True)

movie_index = ratings_mtx_df.columns

ratings_mtx_df.head()

Out[]:

movie_title	$1,000,000 Duck (1971)	'Night Mother (1986)	'Til There Was You (1997)	...
user_id
1	0	0	0	...
2	0	0	0	...
3	0	5	0	...
4	0	0	1	...
5	0	0	0	...

We have a matrix of 6040 users and 3706 movies.
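The pivot step can be easier to follow on a tiny example. The sketch below uses a hypothetical four-rating dataframe (toy_ratings is a made-up name) and the same pivot_table call as above.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie_title": ["A", "B", "A", "C"],
    "rating": [5, 3, 4, 1],
})

# One row per user, one column per movie; missing ratings become 0.
toy_matrix = toy_ratings.pivot_table(
    values="rating", index="user_id", columns="movie_title"
).fillna(0)
print(toy_matrix)
```

Keep in mind that a 0 in this matrix means "no rating", not a rating of zero, and that most cells of the real matrix are zeros for that reason.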

To compute similarities between movies, one way is to calculate the correlation between each pair of movies and then use that correlation to find movies similar to those the user has liked.

An easy way of doing this in Python is the numpy.corrcoef function, which calculates the Pearson product-moment correlation coefficient (PMCC) between each pair of items.
The PMCC takes a value between -1 and 1 and measures the linear correlation (positive or negative) between two variables.

A correlation matrix is a matrix of shape m x m, where element Mij is the correlation between item i and item j.

corr_matrix = np.corrcoef(ratings_mtx_df.T)  
corr_matrix.shape

Out[]:

(3706, 3706)

Note: We use the transposed ratings matrix to calculate the correlation matrix so it gives back the correlation between movies (rows). If we used the ratings matrix without transposing it, np.corrcoef would return the correlation between users.
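The effect of transposing, described in the note above, can be checked on a small made-up ratings array (4 users by 3 movies, hypothetical values).

```python
import numpy as np

# Hypothetical ratings: 4 users (rows) x 3 movies (columns).
ratings = np.array([
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
    [2, 1, 4],
])

# np.corrcoef treats each ROW as one variable.
movie_corr = np.corrcoef(ratings.T)  # rows are movies -> 3 x 3 matrix
user_corr = np.corrcoef(ratings)     # rows are users  -> 4 x 4 matrix

print(movie_corr.shape, user_corr.shape)
print(movie_corr.round(2))
```

In this toy data the first two movies are rated alike and come out positively correlated, while the third is rated the opposite way and comes out negatively correlated with the first.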

Now, if we want to find movies similar to a specific movie, it is just a matter of returning those that have a high correlation coefficient with it.

favoured_movie_title = 'Toy Story (1995)'

favoured_movie_index = list(movie_index).index(favoured_movie_title)

P = corr_matrix[favoured_movie_index]


#only return those movies with a high correlation with Toy Story
list(movie_index[(P>0.4) & (P<1.0)])

Out[]:

['Aladdin (1992)',
 "Bug's Life, A (1998)",
 'Groundhog Day (1993)',
 'Lion King, The (1994)',
 'Toy Story 2 (1999)']
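A fixed threshold such as 0.4 can return many movies or none at all, depending on the title. An alternative, sketched here under the assumption that a fixed number of results is wanted, is to sort the correlation row and keep the top k. The helper top_k_similar and the toy data are hypothetical.

```python
import numpy as np

def top_k_similar(corr_row, titles, self_index, k=3):
    """Return the k titles with the highest correlation, skipping the movie itself."""
    order = np.argsort(corr_row)[::-1]              # indices from highest to lowest
    order = [i for i in order if i != self_index]   # drop the query movie
    return [titles[i] for i in order[:k]]


# Hypothetical correlation row for movie "A" against four movies.
titles = ["A", "B", "C", "D"]
corr_row = np.array([1.0, 0.2, 0.7, -0.1])

similar = top_k_similar(corr_row, titles, self_index=0, k=2)
print(similar)
```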

Now, to provide recommendations to a user, we take the list of movies that user has rated. Then we sum the correlations of those movies with all the other ones and return the movies sorted by that total correlation.

def get_movie_similarity(movie_title):  
    '''Returns correlation vector for a movie'''
    movie_idx = list(movie_index).index(movie_title)
    return corr_matrix[movie_idx]

def get_movie_recommendations(user_movies):  
    '''given the titles a user has rated, it returns all the other movies sorted by their summed correlation with them'''
    movie_similarities = np.zeros(corr_matrix.shape[0])
    for movie_title in user_movies:
        movie_similarities = movie_similarities + get_movie_similarity(movie_title)
    similarities_df = pd.DataFrame({
        'movie_title': movie_index,
        'sum_similarity': movie_similarities
        })
    similarities_df = similarities_df[~(similarities_df.movie_title.isin(user_movies))]
    similarities_df = similarities_df.sort_values(by=['sum_similarity'], ascending=False)
    return similarities_df

For example, let's select a user with a preference for kids' movies and some action movies.

sample_user = 21  
ratings_df[ratings_df.user_id==sample_user].sort_values(by=['rating'], ascending=False)

Out[]:

	user_id	movie_title	movie_id	rating
583304	21	Titan A.E. (2000)	3745	5
707307	21	Princess Mononoke, The (Mononoke Hime) (1997)	3000	5
70742	21	Star Wars: Episode VI - Return of the Jedi (1983)	1210	5
239644	21	South Park: Bigger, Longer and Uncut (1999)	2700	5
487530	21	Mad Max Beyond Thunderdome (1985)	3704	4
707652	21	Little Nemo: Adventures in Slumberland (1992)	2800	4
708015	21	Stop! Or My Mom Will Shoot (1992)	3268	3
706889	21	Brady Bunch Movie, The (1995)	585	3
623947	21	Iron Giant, The (1999)	2761	3
619784	21	Wild Wild West (1999)	2701	3
4211	21	Bug's Life, A (1998)	2355	3
368056	21	Akira (1988)	1274	3
226126	21	Who Framed Roger Rabbit? (1988)	2987	3
41633	21	Toy Story (1995)	1	3
34978	21	Aladdin (1992)	588	3
33432	21	Antz (1998)	2294	3
18917	21	Bambi (1942)	2018	1
612215	21	Devil's Advocate, The (1997)	1645	1
617656	21	Prince of Egypt, The (1998)	2394	1
440983	21	Pinocchio (1940)	596	1
707674	21	Messenger: The Story of Joan of Arc, The (1999)	3053	1
708194	21	House Party 2 (1991)	3774	1

Now we provide movie recommendations to the sample user by using their list of rated movies as input.

sample_user_movies = ratings_df[ratings_df.user_id==sample_user].movie_title.tolist()  
recommendations = get_movie_recommendations(sample_user_movies)

#We get the top 20 recommended movies
recommendations.movie_title.head(20)

Out[]:

1939    Lion King, The (1994)
324     Beauty and the Beast (1991)
1948    Little Mermaid, The (1989)
3055    Snow White and the Seven Dwarfs (1937)
647     Charlotte's Web (1973)
679     Cinderella (1950)
1002    Dumbo (1941)
301     Batman (1989)
3250    Sword in the Stone, The (1963)
303     Batman Returns (1992)
2252    Mulan (1998)
2924    Secret of NIMH, The (1982)
2808    Robin Hood (1973)
3026    Sleeping Beauty (1959)
1781    Jungle Book, The (1967)
260     Back to the Future Part III (1990)
259     Back to the Future Part II (1989)
2558    Peter Pan (1953)
2347    NeverEnding Story, The (1984)
97      Alice in Wonderland (1951)
Name: movie_title, dtype: object

So we see that the system recommends mostly kids' movies and some action movies. Neat!
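The function above counts every rated movie equally, including the ones the user rated 1. One possible variant, which is not part of this tutorial's method, weights each movie's correlation row by how far its rating sits above or below the user's mean rating. The sketch below uses a made-up 4 x 4 correlation matrix, and weighted_similarity is a hypothetical helper.

```python
import numpy as np

def weighted_similarity(corr, rated_idx, ratings):
    """Sum correlation rows weighted by mean-centred ratings."""
    ratings = np.asarray(ratings, dtype=float)
    weights = ratings - ratings.mean()       # liked movies > 0, disliked movies < 0
    scores = weights @ corr[rated_idx]       # one score per movie
    scores[rated_idx] = -np.inf              # never recommend already-rated movies
    return scores


# Hypothetical correlation matrix for four movies.
corr = np.array([
    [1.0, 0.8, -0.5, 0.1],
    [0.8, 1.0, -0.4, 0.2],
    [-0.5, -0.4, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
])

# The user loved movie 0 (rating 5) and disliked movie 2 (rating 1).
scores = weighted_similarity(corr, rated_idx=[0, 2], ratings=[5, 1])
best = int(np.argmax(scores))
print(scores, best)
```

Here the movie correlated with the liked title and anti-correlated with the disliked one ranks first, whereas a plain sum would treat both rated movies as positive signals.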

Collaborative filtering is a widely used approach to recommendation systems nowadays. It can recommend items without anyone having to describe them manually. Also, it is able to find recommendations based on hidden features that an expert wouldn't be able to find (for example, combinations of genres or actors).

However, it has one major drawback. Collaborative filtering cannot recommend items to a new user until they have rated some items. This is known as the cold start problem.

One way recommender systems overcome this issue is by using a hybrid of Content and Collaborative Filtering. That is, using collaborative filtering where possible and falling back on content filtering when necessary.
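A very simple form of such a hybrid could look like the sketch below. It is only an illustration: hybrid_recommend, the min_ratings threshold and the two stand-in recommenders are hypothetical, and real systems usually blend the two signals rather than switching between them.

```python
def hybrid_recommend(user_movies, collaborative_fn, content_fn, min_ratings=5):
    """Use collaborative filtering when enough ratings exist, else content filtering."""
    if len(user_movies) >= min_ratings:
        return collaborative_fn(user_movies)
    # Cold start: too few ratings, so fall back on stated genre preferences.
    return content_fn()


# Stand-in recommenders, for illustration only.
def collaborative_stub(user_movies):
    return ["collaborative pick"]

def content_stub():
    return ["content pick"]


new_user = hybrid_recommend([], collaborative_stub, content_stub)
active_user = hybrid_recommend(["a", "b", "c", "d", "e"], collaborative_stub, content_stub)
print(new_user, active_user)
```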
