Using pandas on the MovieLens dataset

UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk here.

This is part three of a three part introduction to pandas, a Python library for data analysis. The tutorial is primarily geared towards SQL users, but is useful for anyone wanting to get started with the library.

Part 1: Intro to pandas data structures

Part 2: Working with DataFrames

Part 3: Using pandas with the MovieLens dataset

In [61]:
chicago[chicago.department == "LAW"][:5]

Out[61]:
name title department salary dept_rank
21971 PATTON, STEPHEN R CORPORATION COUNSEL LAW 173664 1
6311 DARLING, LESLIE M FIRST ASST CORPORATION COUNSEL LAW 149160 2
17680 MARTINICO, JOSEPH P CHIEF LABOR NEGOTIATOR LAW 144036 3
22357 PETERS, LYNDA A CITY PROSECUTOR LAW 139932 4
31383 WONG JR, EDWARD J DEPUTY CORPORATION COUNSEL LAW 137076 5

We can now see where each employee ranks within their department based on salary.

Using pandas on the MovieLens dataset

To show pandas in a more "applied" sense, let's use it to answer some questions about the MovieLens dataset. Recall that we've already read our data into DataFrames and merged it.

In [62]:
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
encoding='latin-1')

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
encoding='latin-1')

# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(5),
encoding='latin-1')

# create one merged DataFrame
movie_ratings = pd.merge(movies, ratings)
lens = pd.merge(movie_ratings, users)


What are the 25 most rated movies?

In [63]:
most_rated = lens.groupby('title').size().sort_values(ascending=False)[:25]
most_rated

Out[63]:
title
Star Wars (1977)                             583
Contact (1997)                               509
Fargo (1996)                                 508
Return of the Jedi (1983)                    507
Liar Liar (1997)                             485
English Patient, The (1996)                  481
Scream (1996)                                478
Toy Story (1995)                             452
Air Force One (1997)                         431
Independence Day (ID4) (1996)                429
Raiders of the Lost Ark (1981)               420
Godfather, The (1972)                        413
Pulp Fiction (1994)                          394
Twelve Monkeys (1995)                        392
Silence of the Lambs, The (1991)             390
Jerry Maguire (1996)                         384
Chasing Amy (1997)                           379
Rock, The (1996)                             378
Empire Strikes Back, The (1980)              367
Star Trek: First Contact (1996)              365
Back to the Future (1985)                    350
Titanic (1997)                               350
Mission: Impossible (1996)                   344
Fugitive, The (1993)                         336
Indiana Jones and the Last Crusade (1989)    331
dtype: int64


There's a lot going on in the code above, but it's very idomatic. We're splitting the DataFrame into groups by movie title and applying the size method to get the count of records in each group. Then we order our results in descending order and limit the output to the top 25 using Python's slicing syntax.

In SQL, this would be equivalent to:

SELECT title, count(1)
FROM lens
GROUP BY title
ORDER BY 2 DESC
LIMIT 25;

Alternatively, pandas has a nifty value_counts method - yes, this is simpler - the goal above was to show a basic groupby example.

In [64]:
lens.title.value_counts()[:25]

Out[64]:
Star Wars (1977)                             583
Contact (1997)                               509
Fargo (1996)                                 508
Return of the Jedi (1983)                    507
Liar Liar (1997)                             485
English Patient, The (1996)                  481
Scream (1996)                                478
Toy Story (1995)                             452
Air Force One (1997)                         431
Independence Day (ID4) (1996)                429
Raiders of the Lost Ark (1981)               420
Godfather, The (1972)                        413
Pulp Fiction (1994)                          394
Twelve Monkeys (1995)                        392
Silence of the Lambs, The (1991)             390
Jerry Maguire (1996)                         384
Chasing Amy (1997)                           379
Rock, The (1996)                             378
Empire Strikes Back, The (1980)              367
Star Trek: First Contact (1996)              365
Titanic (1997)                               350
Back to the Future (1985)                    350
Mission: Impossible (1996)                   344
Fugitive, The (1993)                         336
Indiana Jones and the Last Crusade (1989)    331
Name: title, dtype: int64


Which movies are most highly rated?

In [65]:
movie_stats = lens.groupby('title').agg({'rating': [np.size, np.mean]})

Out[65]:
rating
size mean
title
'Til There Was You (1997) 9 2.333333
1-900 (1994) 5 2.600000
101 Dalmatians (1996) 109 2.908257
12 Angry Men (1957) 125 4.344000
187 (1997) 41 3.024390

We can use the agg method to pass a dictionary specifying the columns to aggregate (as keys) and a list of functions we'd like to apply.

Let's sort the resulting DataFrame so that we can see which movies have the highest average score.

In [66]:
# sort by rating average

Out[66]:
rating
size mean
title
They Made Me a Criminal (1939) 1 5
Marlene Dietrich: Shadow and Light (1996) 1 5
Saint of Fort Washington, The (1993) 2 5
Someone Else's America (1995) 1 5
Star Kid (1997) 3 5

Because movie_stats is a DataFrame, we use the sort method - only Series objects use order. Additionally, because our columns are now a MultiIndex, we need to pass in a tuple specifying how to sort.

The above movies are rated so rarely that we can't count them as quality films. Let's only look at movies that have been rated at least 100 times.

In [67]:
atleast_100 = movie_stats['rating']['size'] >= 100
movie_stats[atleast_100].sort_values([('rating', 'mean')], ascending=False)[:15]

Out[67]:
rating
size mean
title
Close Shave, A (1995) 112 4.491071
Schindler's List (1993) 298 4.466443
Wrong Trousers, The (1993) 118 4.466102
Casablanca (1942) 243 4.456790
Shawshank Redemption, The (1994) 283 4.445230
Rear Window (1954) 209 4.387560
Usual Suspects, The (1995) 267 4.385768
Star Wars (1977) 583 4.358491
12 Angry Men (1957) 125 4.344000
Citizen Kane (1941) 198 4.292929
To Kill a Mockingbird (1962) 219 4.292237
One Flew Over the Cuckoo's Nest (1975) 264 4.291667
Silence of the Lambs, The (1991) 390 4.289744
North by Northwest (1959) 179 4.284916
Godfather, The (1972) 413 4.283293

Those results look realistic. Notice that we used boolean indexing to filter our movie_stats frame.

We broke this question down into many parts, so here's the Python needed to get the 15 movies with the highest average rating, requiring that they had at least 100 ratings:

    movie_stats = lens.groupby('title').agg({'rating': [np.size, np.mean]})
atleast_100 = movie_stats['rating'].size >= 100
movie_stats[atleast_100].sort_values([('rating', 'mean')], ascending=False)[:15]

The SQL equivalent would be:

SELECT title, COUNT(1) size, AVG(rating) mean
FROM lens
GROUP BY title
HAVING COUNT(1) >= 100
ORDER BY 3 DESC
LIMIT 15;

Limiting our population going forward

Going forward, let's only look at the 50 most rated movies. Let's make a Series of movies that meet this threshold so we can use it for filtering later.

In [68]:
most_50 = lens.groupby('movie_id').size().sort_values(ascending=False)[:50]


The SQL to match this would be:

CREATE TABLE most_50 AS (
SELECT movie_id, COUNT(1)
FROM lens
GROUP BY movie_id
ORDER BY 2 DESC
LIMIT 50
);

This table would then allow us to use EXISTS, IN, or JOIN whenever we wanted to filter our results. Here's an example using EXISTS:

SELECT *
FROM lens
WHERE EXISTS (SELECT 1 FROM most_50 WHERE lens.movie_id = most_50.movie_id);

Which movies are most controversial amongst different ages?

Let's look at how these movies are viewed across different age groups. First, let's look at how age is distributed amongst our users.

In [69]:
users.age.plot.hist(bins=30)
plt.title("Distribution of users' ages")
plt.ylabel('count of users')
plt.xlabel('age');


pandas' integration with matplotlib makes basic graphing of Series/DataFrames trivial. In this case, just call hist on the column to produce a histogram. We can also use matplotlib.pyplot to customize our graph a bit (always label your axes).

Binning our users

I don't think it'd be very useful to compare individual ages - let's bin our users into age groups using pandas.cut.

In [70]:
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
lens['age_group'] = pd.cut(lens.age, range(0, 81, 10), right=False, labels=labels)
lens[['age', 'age_group']].drop_duplicates()[:10]

Out[70]:
age age_group
0 60 60-69
397 21 20-29
459 33 30-39
524 30 30-39
782 23 20-29
995 29 20-29
1229 26 20-29
1664 31 30-39
1942 24 20-29
2270 32 30-39

pandas.cut allows you to bin numeric data. In the above lines, we first created labels to name our bins, then split our users into eight bins of ten years (0-9, 10-19, 20-29, etc.). Our use of right=False told the function that we wanted the bins to be exclusive of the max age in the bin (e.g. a 30 year old user gets the 30s label).

Now we can now compare ratings across age groups.

In [71]:
lens.groupby('age_group').agg({'rating': [np.size, np.mean]})

Out[71]:
rating
size mean
age_group
0-9 43 3.767442
10-19 8181 3.486126
20-29 39535 3.467333
30-39 25696 3.554444
40-49 15021 3.591772
50-59 8704 3.635800
60-69 2623 3.648875
70-79 197 3.649746

Young users seem a bit more critical than other age groups. Let's look at how the 50 most rated movies are viewed across each age group. We can use the most_50 Series we created earlier for filtering.

In [72]:
lens.set_index('movie_id', inplace=True)

In [73]:
by_age = lens.loc[most_50.index].groupby(['title', 'age_group'])

Out[73]:
title                 age_group
Air Force One (1997)  10-19        3.647059
20-29        3.666667
30-39        3.570000
40-49        3.555556
50-59        3.750000
60-69        3.666667
70-79        3.666667
Alien (1979)          10-19        4.111111
20-29        4.026087
30-39        4.103448
40-49        3.833333
50-59        4.272727
60-69        3.500000
70-79        4.000000
Aliens (1986)         10-19        4.050000
Name: rating, dtype: float64


Notice that both the title and age group are indexes here, with the average rating value being a Series. This is going to produce a really long list of values.

Wouldn't it be nice to see the data as a table? Each title as a row, each age group as a column, and the average rating in each cell.

Behold! The magic of unstack!

In [80]:
by_age.rating.mean().unstack(1).fillna(0)[10:20]

Out[80]:
age_group 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79
title
E.T. the Extra-Terrestrial (1982) 0 3.680000 3.609091 3.806818 4.160000 4.368421 4.375000 0.000000
Empire Strikes Back, The (1980) 4 4.642857 4.311688 4.052083 4.100000 3.909091 4.250000 5.000000
English Patient, The (1996) 5 3.739130 3.571429 3.621849 3.634615 3.774648 3.904762 4.500000
Fargo (1996) 0 3.937500 4.010471 4.230769 4.294118 4.442308 4.000000 4.333333
Forrest Gump (1994) 5 4.047619 3.785714 3.861702 3.847826 4.000000 3.800000 0.000000
Fugitive, The (1993) 0 4.320000 3.969925 3.981481 4.190476 4.240000 3.666667 0.000000
Full Monty, The (1997) 0 3.421053 4.056818 3.933333 3.714286 4.146341 4.166667 3.500000
Godfather, The (1972) 0 4.400000 4.345070 4.412844 3.929412 4.463415 4.125000 0.000000
Groundhog Day (1993) 0 3.476190 3.798246 3.786667 3.851064 3.571429 3.571429 4.000000
Independence Day (ID4) (1996) 0 3.595238 3.291429 3.389381 3.718750 3.888889 2.750000 0.000000

unstack, well, unstacks the specified level of a MultiIndex (by default, groupby turns the grouped field into an index - since we grouped by two fields, it became a MultiIndex). We unstacked the second index (remember that Python uses 0-based indexes), and then filled in NULL values with 0.

If we would have used: python by_age.rating.mean().unstack(0).fillna(0) We would have had our age groups as rows and movie titles as columns.

Which movies do men and women most disagree on?

EDIT: I realized after writing this question that Wes McKinney basically went through the exact same question in his book. It's a good, yet simple example of pivot_table, so I'm going to leave it here. Seriously though, go buy the book.

Think about how you'd have to do this in SQL for a second. You'd have to use a combination of IF/CASE statements with aggregate functions in order to pivot your dataset. Your query would look something like this:

SELECT title, AVG(IF(sex = 'F', rating, NULL)), AVG(IF(sex = 'M', rating, NULL))
FROM lens
GROUP BY title;

Imagine how annoying it'd be if you had to do this on more than two columns.

DataFrame's have a pivot_table method that makes these kinds of operations much easier (and less verbose).

In [75]:
lens.reset_index('movie_id', inplace=True)

In [76]:
pivoted = lens.pivot_table(index=['movie_id', 'title'],
columns=['sex'],
values='rating',
fill_value=0)

Out[76]:
sex F M
movie_id title
1 Toy Story (1995) 3.789916 3.909910
2 GoldenEye (1995) 3.368421 3.178571
3 Four Rooms (1995) 2.687500 3.108108
4 Get Shorty (1995) 3.400000 3.591463
5 Copycat (1995) 3.772727 3.140625
In [77]:
pivoted['diff'] = pivoted.M - pivoted.F

Out[77]:
sex F M diff
movie_id title
1 Toy Story (1995) 3.789916 3.909910 0.119994
2 GoldenEye (1995) 3.368421 3.178571 -0.189850
3 Four Rooms (1995) 2.687500 3.108108 0.420608
4 Get Shorty (1995) 3.400000 3.591463 0.191463
5 Copycat (1995) 3.772727 3.140625 -0.632102
In [78]:
pivoted.reset_index('movie_id', inplace=True)

In [79]:
disagreements = pivoted[pivoted.movie_id.isin(most_50.index)]['diff']
disagreements.sort_values().plot(kind='barh', figsize=[9, 15])
plt.title('Male vs. Female Avg. Ratings\n(Difference > 0 = Favored by Men)')
plt.ylabel('Title')
plt.xlabel('Average Rating Difference');


Of course men like Terminator more than women. Independence Day though? Really?

Closing

This is the point where I finally wrap this tutorial up. Hopefully I've covered the basics well enough to pique your interest and help you get started with the library. If I've missed something critical, feel free to let me know on Twitter or in the comments - I'd love constructive feedback.