<h1>Prototyping a PDF Chatbot from Scratch</h1>
<p>Greg Reda · 2023-10-26</p>
<p>As part of my work on <a href="https://github.com/refstudio/refstudio">refstudio</a>, I spent some time prototyping a chatbot that could answer questions about a corpus of PDFs. Tools like <a href="https://github.com/langchain-ai/langchain">LangChain</a>, <a href="https://github.com/run-llama/llama_index">LlamaIndex</a>, <a href="https://github.com/deepset-ai/haystack">Haystack</a>, and others all have built-in abstractions to simplify this task, but I find that building a simplified version from scratch helps me understand the underlying concepts better.</p>
<p>A basic version of the PDF Chatbot requires two phases with the following steps:</p>
<ol>
<li>PDF Ingestion<ul>
<li>Convert PDFs to text</li>
<li>Chunk the text into smaller pieces</li>
<li>Optional: Generate embeddings for the text chunks</li>
<li>Persist the text chunks (or embeddings) in some way so that we can query them later</li>
</ul>
</li>
<li>Chatbot Interaction<ul>
<li>Take a question from the user</li>
<li>Retrieve the most similar text chunks related to the question<ul>
<li>If we did not create embeddings, we can use a ranking function like <a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25</a> to find the most similar text chunks</li>
<li>If we did create embeddings, we can use a nearest neighbors algorithm to find the most similar text chunks</li>
</ul>
</li>
<li>Include the most similar text chunks as "context" we provide to the LLM with our question (i.e. our prompt)</li>
<li>Return the LLM's response</li>
</ul>
</li>
</ol>
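<p>Strung together, the steps above fit in a short script. The following is a toy sketch, not the refstudio code: the chunker is a naive word-count splitter and the BM25 implementation is minimal, but it shows the no-embeddings path end to end, through prompt assembly.</p>

```python
import math
import re
from collections import Counter

def chunk_text(text: str, size: int = 40) -> list:
    """Naive fixed-size chunking by word count; real pipelines split on structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, chunks: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Okapi BM25 score of each chunk against the query."""
    docs = [tokenize(c) for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for d in docs if term in d)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy corpus standing in for text extracted from PDFs
chunks = [
    "The mitochondria is the powerhouse of the cell.",
    "BM25 is a bag-of-words ranking function used by search engines.",
    "Django is a web framework for perfectionists with deadlines.",
]
question = "What ranking function do search engines use?"
scores = bm25_scores(question, chunks)
best = chunks[scores.index(max(scores))]

# The retrieved chunk becomes the "context" portion of the prompt we send to the LLM
prompt = f"Context:\n{best}\n\nQuestion: {question}"
```

In a real ingestion pipeline the chunks would come from a PDF-to-text step and be persisted, but the retrieve-then-prompt shape stays the same.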
<p>While embeddings and a vector database are not strictly necessary for this task, I wanted to get a sense of the ergonomics in working with one, so I used this as an excuse to try out <a href="https://github.com/lancedb/lancedb">LanceDB</a>. LanceDB is an open-source, embedded vector database with the goal of simplifying retrieval, filtering, and management of embeddings. It's built on Apache Arrow, which I'm a big fan of. </p>
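<p>LanceDB handles persistence and indexing, but the query it answers — nearest neighbors over embedding vectors — is easy to see as brute force. A sketch with toy 2-d vectors standing in for real embeddings:</p>

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query_vec, table, k=2):
    """Brute-force k-nearest-neighbors: score every row, keep the top k."""
    return sorted(table, key=lambda row: cosine(query_vec, row["vector"]), reverse=True)[:k]

# Toy rows; real vectors would come from an embedding model
table = [
    {"vector": [1.0, 0.0], "text": "chunk about retrieval"},
    {"vector": [0.0, 1.0], "text": "chunk about cooking"},
    {"vector": [0.9, 0.1], "text": "another chunk about retrieval"},
]
hits = nearest([1.0, 0.05], table, k=2)
```

A vector database replaces the `sorted` scan with an approximate index so it scales past a few thousand rows, but the interface is the same: a query vector in, the k most similar rows out.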
<h3>Results</h3>
<p>You can find the code for this prototype in <a href="https://github.com/gjreda/scratch-pdf-bot">this github repo</a>.</p>
<p>Here's a quick demo of the chatbot in action:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/r4LAQbu3sd0?si=DarJiS8PYFJrLpKK" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h1>Django Command for FIT files</h1>
<p>Greg Reda · 2023-05-21</p>
<p>FIT - <a href="https://developer.garmin.com/fit/protocol/">Flexible and Interoperable Transfer</a> - is a protocol designed for storing and sharing data from fitness and health devices.</p>
<p>Since getting a <a href="https://coros.com/">Coros</a> running watch in July 2022, I've been exporting the FIT file data to Dropbox after every run.</p>
<p>Having all of this data lying around seemed like a good excuse for a toy project.<sup>1</sup> I haven't done much web programming in the last five years, so I'm building a little web app with Django.</p>
<h2>FIT data</h2>
<p>The data I'm most interested in are the Session, Lap, and Record types from each file.</p>
<p><code>Sessions</code> capture aggregated data about your run - things like total distance, average heart rate, average speed, etc.</p>
<p><code>Laps</code> capture aggregated data about a particular lap of your run. By default, my watch creates one lap every mile. The fields here are similar to sessions - average heart rate, average speed, etc.</p>
<p><code>Records</code> are the raw data about the run. My watch creates a new "record" every second of the run. It captures my latitude and longitude, as well as things like my heart rate, speed, cadence, estimated power output (watts), step length, etc.</p>
<p>To relate this data back to its source file, I've created one additional type called <code>Activity</code>. This contains the source filename and date, and also acts as a foreign key on the <code>Session</code>, <code>Lap</code>, and <code>Record</code> tables.</p>
<p>Mapping each of these to a django model looks like this (I've omitted many fields for the sake of conciseness):</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">django.db</span> <span class="kn">import</span> <span class="n">models</span>
<span class="k">class</span> <span class="nc">Activity</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="n">source_filename</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">FilePathField</span><span class="p">(</span><span class="n">unique</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">began_at</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">DateTimeField</span><span class="p">(</span><span class="n">null</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">Session</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="n">activity</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">ForeignKey</span><span class="p">(</span><span class="n">Activity</span><span class="p">,</span> <span class="n">on_delete</span><span class="o">=</span><span class="n">models</span><span class="o">.</span><span class="n">CASCADE</span><span class="p">)</span>
<span class="n">start_time</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">DateTimeField</span><span class="p">()</span>
<span class="n">total_elapsed_time</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">FloatField</span><span class="p">()</span>
<span class="n">avg_heart_rate</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">PositiveSmallIntegerField</span><span class="p">()</span>
<span class="c1"># ... many more fields</span>
<span class="k">class</span> <span class="nc">Lap</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="n">activity</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">ForeignKey</span><span class="p">(</span><span class="n">Activity</span><span class="p">,</span> <span class="n">on_delete</span><span class="o">=</span><span class="n">models</span><span class="o">.</span><span class="n">CASCADE</span><span class="p">)</span>
<span class="n">total_elapsed_time</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">FloatField</span><span class="p">()</span>
<span class="n">avg_heart_rate</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">PositiveSmallIntegerField</span><span class="p">()</span>
<span class="c1"># ... more fields omitted</span>
<span class="k">class</span> <span class="nc">Record</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="n">activity</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">ForeignKey</span><span class="p">(</span><span class="n">Activity</span><span class="p">,</span> <span class="n">on_delete</span><span class="o">=</span><span class="n">models</span><span class="o">.</span><span class="n">CASCADE</span><span class="p">)</span>
<span class="n">timestamp</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">DateTimeField</span><span class="p">()</span>
<span class="n">position_lat</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">255</span><span class="p">,</span> <span class="n">null</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">position_long</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">255</span><span class="p">,</span> <span class="n">null</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">heart_rate</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">PositiveSmallIntegerField</span><span class="p">(</span><span class="n">null</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># ... more fields omitted</span>
</code></pre></div>
<h2>Ingest command</h2>
<p>Django allows you to register <a href="https://django.readthedocs.io/en/stable/howto/custom-management-commands.html#module-django.core.management">custom commands</a> with your application that can be run via <code>manage.py</code>. This is useful for standalone scripts or ones you'll want to run regularly.</p>
<p>First, some helper functions to use within the command:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span><span class="p">,</span> <span class="n">Dict</span>
<span class="k">def</span> <span class="nf">convert_frame_to_dict</span><span class="p">(</span><span class="n">frame</span><span class="p">)</span> <span class="o">-></span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]:</span>
<span class="k">return</span> <span class="p">{</span><span class="n">field</span><span class="o">.</span><span class="n">name</span><span class="p">:</span> <span class="n">field</span><span class="o">.</span><span class="n">value</span>
<span class="k">for</span> <span class="n">field</span> <span class="ow">in</span> <span class="n">frame</span><span class="o">.</span><span class="n">fields</span><span class="p">}</span>
</code></pre></div>
<p>Data rows from the FIT file are message objects with a property containing the fields, and each field containing a name and value. <code>convert_frame_to_dict</code> converts these to dictionaries so they're easier to work with.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="k">def</span> <span class="nf">extract_datetime_from_filename</span><span class="p">(</span><span class="n">filename</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">Optional</span><span class="p">[</span><span class="n">datetime</span><span class="p">]:</span>
<span class="n">regex</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">"([0-9]+)\.fit"</span>
<span class="n">match</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">regex</span><span class="p">,</span> <span class="n">filename</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">match</span><span class="p">:</span>
<span class="k">return</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">dt</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">match</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="s2">"%Y%m</span><span class="si">%d</span><span class="s2">%H%M%S"</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">ValueError</span><span class="p">:</span>
<span class="k">return</span>
<span class="k">return</span> <span class="n">dt</span>
</code></pre></div>
<p>File names from Coros contain a timestamp marking when the run began (e.g. <code>Run20230520091606.fit</code>). <code>extract_datetime_from_filename</code> parses out that timestamp so it can be stored in the database.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">determine_files_for_ingest</span><span class="p">(</span><span class="n">filepaths</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Path</span><span class="p">])</span> <span class="o">-></span> <span class="n">List</span><span class="p">[</span><span class="n">Path</span><span class="p">]:</span>
<span class="sd">"""</span>
<span class="sd"> Given a list of filepaths, compare with DB to determine which ones should be ingested.</span>
<span class="sd"> Returns a list of filepaths in need of ingest.</span>
<span class="sd"> """</span>
<span class="c1"># get a list of all the files in the DB</span>
<span class="n">ingested_files</span> <span class="o">=</span> <span class="n">Activity</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">values_list</span><span class="p">(</span><span class="s2">"source_filename"</span><span class="p">,</span> <span class="n">flat</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">needs_ingest</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">fp</span> <span class="ow">in</span> <span class="n">filepaths</span><span class="p">:</span>
<span class="k">if</span> <span class="n">fp</span><span class="o">.</span><span class="n">name</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ingested_files</span><span class="p">:</span>
<span class="n">needs_ingest</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">fp</span><span class="p">)</span>
<span class="k">return</span> <span class="n">needs_ingest</span>
</code></pre></div>
<p>Since I will call this command after every run, <code>determine_files_for_ingest</code> compares the FIT file directory with what's already been loaded into the database.</p>
<p>With <a href="https://github.com/polyvertex/fitdecode">fitdecode</a> doing the heavy lifting, my custom command looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Command</span><span class="p">(</span><span class="n">BaseCommand</span><span class="p">):</span>
<span class="n">help</span> <span class="o">=</span> <span class="s2">"Loads FIT file(s) into the database"</span>
<span class="k">def</span> <span class="nf">add_arguments</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">parser</span><span class="p">):</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"fitfile_dir"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">handle</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">options</span><span class="p">):</span>
<span class="n">fitfile_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">options</span><span class="p">[</span><span class="s2">"fitfile_dir"</span><span class="p">])</span>
<span class="n">filepaths</span> <span class="o">=</span> <span class="n">fitfile_dir</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*Run*.fit"</span><span class="p">)</span>
<span class="c1"># filter out any files that are already in the DB</span>
<span class="n">needs_ingest</span> <span class="o">=</span> <span class="n">determine_files_for_ingest</span><span class="p">(</span><span class="n">filepaths</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">needs_ingest</span><span class="p">)</span><span class="si">}</span><span class="s2"> files to ingest"</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">needs_ingest</span><span class="p">:</span>
<span class="k">return</span>
<span class="c1"># extract date from filename so we can process in chronological order</span>
<span class="n">filepaths</span> <span class="o">=</span> <span class="p">{</span><span class="n">extract_datetime_from_filename</span><span class="p">(</span><span class="n">fp</span><span class="o">.</span><span class="n">name</span><span class="p">):</span> <span class="n">fp</span>
<span class="k">for</span> <span class="n">fp</span> <span class="ow">in</span> <span class="n">needs_ingest</span><span class="p">}</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">fp</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">filepaths</span><span class="o">.</span><span class="n">items</span><span class="p">()):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Loading </span><span class="si">{</span><span class="n">fp</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2"> to database"</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">process_fitfile</span><span class="p">(</span><span class="n">fp</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">ERROR</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Failed to load </span><span class="si">{</span><span class="n">fp</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2"> to database: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">continue</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">SUCCESS</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Successfully loaded </span><span class="si">{</span><span class="n">fp</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2"> to database"</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">process_fitfile</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">filepath</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
<span class="k">with</span> <span class="n">fitdecode</span><span class="o">.</span><span class="n">FitReader</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span> <span class="k">as</span> <span class="n">fit</span><span class="p">:</span>
<span class="n">activity</span> <span class="o">=</span> <span class="n">Activity</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
<span class="n">source_filename</span><span class="o">=</span><span class="n">filepath</span><span class="o">.</span><span class="n">name</span><span class="p">,</span>
<span class="n">began_at</span><span class="o">=</span><span class="n">extract_datetime_from_filename</span><span class="p">(</span><span class="n">filepath</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">for</span> <span class="n">frame</span> <span class="ow">in</span> <span class="n">fit</span><span class="p">:</span>
<span class="k">if</span> <span class="n">frame</span><span class="o">.</span><span class="n">frame_type</span> <span class="o">!=</span> <span class="n">fitdecode</span><span class="o">.</span><span class="n">FIT_FRAME_DATA</span><span class="p">:</span>
<span class="k">continue</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">convert_frame_to_dict</span><span class="p">(</span><span class="n">frame</span><span class="p">)</span>
<span class="k">if</span> <span class="n">frame</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">'session'</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">create_session</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">activity</span><span class="p">)</span>
<span class="k">if</span> <span class="n">frame</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">'lap'</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">create_lap</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">activity</span><span class="p">)</span>
<span class="k">if</span> <span class="n">frame</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">'record'</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">create_record</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">activity</span><span class="p">)</span>
</code></pre></div>
<p>I've omitted the code for <code>create_session</code>, <code>create_lap</code>, and <code>create_record</code>. Each just instantiates the appropriate model and calls its <code>save</code> method.</p>
<p>Running the command with <code>manage.py</code>:</p>
<div class="highlight"><pre><span></span><code>$ python manage.py ingest_fitfiles ../Apps/coros
</code></pre></div>
<p>The payoff? Now I can easily write SQL queries against my running data!</p>
<div class="highlight"><pre><span></span><code>$ sqlite3 db.sqlite3 < sql/weekly_totals.sql -table
+---------+---------------+-------+--------------+----------+
<span class="p">|</span> week <span class="p">|</span> hours_running <span class="p">|</span> miles <span class="p">|</span> feet_climbed <span class="p">|</span> calories <span class="p">|</span>
+---------+---------------+-------+--------------+----------+
<span class="p">|</span> <span class="m">2023</span>-14 <span class="p">|</span> <span class="m">3</span>.5 <span class="p">|</span> <span class="m">21</span>.8 <span class="p">|</span> <span class="m">883</span>.0 <span class="p">|</span> <span class="m">2408</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-15 <span class="p">|</span> <span class="m">3</span>.6 <span class="p">|</span> <span class="m">22</span>.2 <span class="p">|</span> <span class="m">1316</span>.0 <span class="p">|</span> <span class="m">2375</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-16 <span class="p">|</span> <span class="m">4</span>.3 <span class="p">|</span> <span class="m">28</span>.3 <span class="p">|</span> <span class="m">1486</span>.0 <span class="p">|</span> <span class="m">3102</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-17 <span class="p">|</span> <span class="m">4</span>.8 <span class="p">|</span> <span class="m">30</span>.6 <span class="p">|</span> <span class="m">2139</span>.0 <span class="p">|</span> <span class="m">3328</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-18 <span class="p">|</span> <span class="m">2</span>.2 <span class="p">|</span> <span class="m">14</span>.5 <span class="p">|</span> <span class="m">912</span>.0 <span class="p">|</span> <span class="m">1703</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-19 <span class="p">|</span> <span class="m">4</span>.8 <span class="p">|</span> <span class="m">31</span>.0 <span class="p">|</span> <span class="m">1686</span>.0 <span class="p">|</span> <span class="m">3626</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-20 <span class="p">|</span> <span class="m">5</span>.2 <span class="p">|</span> <span class="m">33</span>.6 <span class="p">|</span> <span class="m">1765</span>.0 <span class="p">|</span> <span class="m">3871</span>.0 <span class="p">|</span>
+---------+---------------+-------+--------------+----------+
</code></pre></div>
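<p>The contents of <code>weekly_totals.sql</code> aren't shown here, but a hypothetical, simplified version of that rollup can be tried against an in-memory database with Python's <code>sqlite3</code>. The schema and unit conversions below are illustrative, not the app's actual tables (Django would prefix the table name, and FIT stores time in seconds and distance in meters):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE session (start_time TEXT, total_elapsed_time REAL, total_distance REAL)"
)
# A few fake sessions: (date, seconds run, meters covered)
conn.executemany(
    "INSERT INTO session VALUES (?, ?, ?)",
    [("2023-05-15", 3600.0, 16093.4),
     ("2023-05-17", 1800.0, 8046.7),
     ("2023-05-22", 2700.0, 12070.0)],
)

# Roll sessions up by ISO-style week: hours run and miles covered
query = """
SELECT strftime('%Y-%W', start_time) AS week,
       ROUND(SUM(total_elapsed_time) / 3600.0, 1) AS hours_running,
       ROUND(SUM(total_distance) / 1609.34, 1) AS miles
FROM session
GROUP BY week
ORDER BY week
"""
for row in conn.execute(query):
    print(row)
```

The real query also sums climb and calories, but the shape is the same: a `strftime` bucket, a `GROUP BY`, and per-week aggregates.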
<p>Sure, Strava already does a lot of this, but where's the fun in that?</p>
<hr>
<ol>
<li>My history with side projects is one of abandonment.</li>
</ol>
<h1>Notes on using PyInstaller, poetry, and pyenv</h1>
<p>Greg Reda · 2023-05-18</p>
<p>In all my years of working in Python, I don't think I've ever had to create a standalone executable. But it finally happened.</p>
<p>It was a pretty seamless experience, but I did hit a minor hiccup, so I wanted to capture some notes for my future self (and others, too).</p>
<p>I was using <a href="https://github.com/pyenv/pyenv">pyenv</a> and <a href="https://python-poetry.org/">poetry</a> to manage my Python environments and dependencies, and <a href="https://pyinstaller.org/en/stable/">PyInstaller</a> to create the executable.</p>
<hr>
<p>Let's assume you have the following project setup:</p>
<div class="highlight"><pre><span></span><code>$ ls -la
total <span class="m">56</span>
drwxr-xr-x@ <span class="m">5</span> greg staff <span class="m">160</span> May <span class="m">18</span> <span class="m">17</span>:26 .
drwxr-xr-x@ <span class="m">5</span> greg staff <span class="m">160</span> May <span class="m">18</span> <span class="m">17</span>:23 ..
-rw-r--r--@ <span class="m">1</span> greg staff <span class="m">22220</span> May <span class="m">18</span> <span class="m">17</span>:28 poetry.lock
-rw-r--r--@ <span class="m">1</span> greg staff <span class="m">381</span> May <span class="m">18</span> <span class="m">17</span>:28 pyproject.toml
drwxr-xr-x@ <span class="m">3</span> greg staff <span class="m">96</span> May <span class="m">18</span> <span class="m">17</span>:24 src
</code></pre></div>
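<p>The <code>pyproject.toml</code> contents aren't shown; a minimal one matching this layout might look like the following (the project name and version pins are illustrative, not the actual file):</p>

```toml
[tool.poetry]
name = "scraper"
version = "0.1.0"
description = ""
authors = ["Greg Reda"]

[tool.poetry.dependencies]
python = "^3.9"
requests = "^2.28"
beautifulsoup4 = "^4.12"

[tool.poetry.group.dev.dependencies]
pyinstaller = "^5.10"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```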
<p>Let's also assume you have a simple Python script at <code>src/main.py</code> that you want to turn into an executable.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># src/main.py</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="k">def</span> <span class="nf">get_data</span><span class="p">():</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'https://www.google.com'</span><span class="p">)</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s1">'html.parser'</span><span class="p">)</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'title'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Hi, the title element of Google's webpage is: </span><span class="si">{</span><span class="n">title</span><span class="o">.</span><span class="n">text</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
<span class="n">get_data</span><span class="p">()</span>
</code></pre></div>
<p>You can create the executable via:</p>
<div class="highlight"><pre><span></span><code>$ poetry run pyinstaller src/main.py
</code></pre></div>
<p>This should create a <code>dist</code> directory with the executable and all the necessary libraries. You can run the executable via:</p>
<div class="highlight"><pre><span></span><code>$ ./dist/main/main
Hi, the title element of Google<span class="err">'</span>s webpage is: Google
</code></pre></div>
<p>But if you are using pyenv, you may run into the following error:</p>
<div class="highlight"><pre><span></span><code><span class="n">OSError</span><span class="o">:</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">library</span><span class="w"> </span><span class="n">not</span><span class="w"> </span><span class="n">found</span><span class="o">:</span><span class="w"> </span><span class="n">libpython3</span><span class="o">.</span><span class="mi">9</span><span class="o">.</span><span class="na">dylib</span><span class="o">,</span><span class="w"> </span><span class="n">Python</span><span class="o">,</span><span class="w"> </span><span class="n">libpython3</span><span class="o">.</span><span class="mi">9</span><span class="n">m</span><span class="o">.</span><span class="na">dylib</span><span class="o">,</span><span class="w"> </span><span class="o">.</span><span class="na">Python</span><span class="o">,</span><span class="w"> </span><span class="n">Python3</span><span class="w"></span>
<span class="w"> </span><span class="n">This</span><span class="w"> </span><span class="n">means</span><span class="w"> </span><span class="n">your</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">installation</span><span class="w"> </span><span class="n">does</span><span class="w"> </span><span class="n">not</span><span class="w"> </span><span class="n">come</span><span class="w"> </span><span class="k">with</span><span class="w"> </span><span class="n">proper</span><span class="w"> </span><span class="n">shared</span><span class="w"> </span><span class="n">library</span><span class="w"> </span><span class="n">files</span><span class="o">.</span><span class="w"></span>
<span class="w"> </span><span class="n">This</span><span class="w"> </span><span class="n">usually</span><span class="w"> </span><span class="n">happens</span><span class="w"> </span><span class="n">due</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">missing</span><span class="w"> </span><span class="n">development</span><span class="w"> </span><span class="kd">package</span><span class="o">,</span><span class="w"> </span><span class="n">or</span><span class="w"> </span><span class="n">unsuitable</span><span class="w"> </span><span class="n">build</span><span class="w"> </span><span class="n">parameters</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">installation</span><span class="o">.</span><span class="w"></span>
<span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">On</span><span class="w"> </span><span class="n">Debian</span><span class="o">/</span><span class="n">Ubuntu</span><span class="o">,</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">need</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">development</span><span class="w"> </span><span class="n">packages</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">apt</span><span class="o">-</span><span class="kd">get</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">python3</span><span class="o">-</span><span class="n">dev</span><span class="w"></span>
<span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">apt</span><span class="o">-</span><span class="kd">get</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">python</span><span class="o">-</span><span class="n">dev</span><span class="w"></span>
<span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">If</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">building</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">yourself</span><span class="o">,</span><span class="w"> </span><span class="n">rebuild</span><span class="w"> </span><span class="k">with</span><span class="w"> </span><span class="err">`</span><span class="o">--</span><span class="n">enable</span><span class="o">-</span><span class="n">shared</span><span class="err">`</span><span class="w"> </span><span class="o">(</span><span class="n">or</span><span class="o">,</span><span class="w"> </span><span class="err">`</span><span class="o">--</span><span class="n">enable</span><span class="o">-</span><span class="n">framework</span><span class="err">`</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">macOS</span><span class="o">)</span><span class="w"></span>
</code></pre></div>
<p>To fix this, you need to reinstall the appropriate version of Python with the <code>--enable-framework</code> flag on macOS (on Linux, use <code>--enable-shared</code> instead):</p>
<div class="highlight"><pre><span></span><code>env <span class="nv">PYTHON_CONFIGURE_OPTS</span><span class="o">=</span><span class="s2">"--enable-framework"</span> pyenv install <span class="m">3</span>.10.11
</code></pre></div>
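<p>To sanity-check the rebuild, you can ask the interpreter how it was configured. This is a general-purpose check, not specific to pyenv: a macOS framework build reports <code>PYTHONFRAMEWORK</code> as <code>Python</code>, while an <code>--enable-shared</code> build reports <code>Py_ENABLE_SHARED</code> as <code>1</code>; a plain static build reports neither.</p>

```python
# Print how this interpreter was built. PyInstaller needs either a
# framework build (macOS) or a shared libpython to locate the library.
import sysconfig

framework = sysconfig.get_config_var('PYTHONFRAMEWORK') or ''
shared = int(sysconfig.get_config_var('Py_ENABLE_SHARED') or 0)
print(f"framework={framework!r} shared={shared}")
```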
<p>You may also need to point your Poetry environment at the newly built Python version:</p>
<div class="highlight"><pre><span></span><code>$ pyenv <span class="nb">local</span> <span class="m">3</span>.10.11
$ poetry env use <span class="k">$(</span>which python<span class="k">)</span>
</code></pre></div>
<p>If you'd already installed your dependencies via poetry, you'll have to reinstall them:</p>
<div class="highlight"><pre><span></span><code>$ poetry install
</code></pre></div>
<p>Now you should be able to create the executable:</p>
<div class="highlight"><pre><span></span><code>$ poetry run pyinstaller src/main.py
</code></pre></div>
<p>Check that it worked:</p>
<div class="highlight"><pre><span></span><code>$ ls -la
total <span class="m">64</span>
drwxr-xr-x@ <span class="m">8</span> greg staff <span class="m">256</span> May <span class="m">18</span> <span class="m">17</span>:51 .
drwxr-xr-x@ <span class="m">5</span> greg staff <span class="m">160</span> May <span class="m">18</span> <span class="m">17</span>:23 ..
drwxr-xr-x@ <span class="m">3</span> greg staff <span class="m">96</span> May <span class="m">18</span> <span class="m">17</span>:51 build
drwxr-xr-x@ <span class="m">3</span> greg staff <span class="m">96</span> May <span class="m">18</span> <span class="m">17</span>:54 dist
-rw-r--r--@ <span class="m">1</span> greg staff <span class="m">889</span> May <span class="m">18</span> <span class="m">17</span>:54 main.spec
-rw-r--r--@ <span class="m">1</span> greg staff <span class="m">22220</span> May <span class="m">18</span> <span class="m">17</span>:28 poetry.lock
-rw-r--r--@ <span class="m">1</span> greg staff <span class="m">381</span> May <span class="m">18</span> <span class="m">17</span>:28 pyproject.toml
drwxr-xr-x@ <span class="m">3</span> greg staff <span class="m">96</span> May <span class="m">18</span> <span class="m">17</span>:24 src
</code></pre></div>
<p>And run the executable:</p>
<div class="highlight"><pre><span></span><code>$ ./dist/main/main
Hi, the title element of Google<span class="err">'</span>s webpage is: Google
</code></pre></div>
<p>The end.</p>Assorted bits: 2022-12-092022-12-09T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2022-12-09:/2022/12/09/assorted-bits-2022-12-09/<p>I’ve wanted to get in the habit of writing more, so I’m taking some inspiration from old school blogs and sharing some things I've recently enjoyed.</p>
<p>Enjoy your weekend!</p>
<h3>[Music] Spectrum by Max Cooper</h3>
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/track/4rw9xbxHWWRuihfAvQG3M2?utm_source=generator" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
<p>I'm very into synthesizer centric music. I'm also into music that creates a sonic …</p><p>I’ve wanted to get in the habit of writing more, so I’m taking some inspiration from old school blogs and sharing some things I've recently enjoyed.</p>
<p>Enjoy your weekend!</p>
<h3>[Music] Spectrum by Max Cooper</h3>
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/track/4rw9xbxHWWRuihfAvQG3M2?utm_source=generator" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
<p>I'm very into synthesizer-centric music. I'm also into music that creates a sonic landscape: that feeling that a piece of audio art is being built around you, and you get to sit back and take it all in. This song by <a href="https://maxcooper.net/">Max Cooper</a> nails that. Put on some good headphones, relax, and give it a listen. It made me want to stop what I was doing and make some music.</p>
<h3>[Reading] You Are More Than Just Your Job</h3>
<blockquote>
<p>If I could, I’d tell my younger self to resist letting what job I have (or don’t have) dominate my identity. - <a href="https://www.jopwell.com/thewell/posts/more-than-just-your-job">You Are More Than Just Your Job</a></p>
</blockquote>
<p>I appreciate and echo the sentiment that Jenna Discher shares in the above article. After my <a href="/2022/11/30/this-ones-for-me/">last few years of health surprises</a>, I've learned that allowing any singular piece of my identity to become overly dominant risks an identity crisis when there's an unexpected shock.</p>
<h3>[Reading] Get Numb Before You Get Good</h3>
<blockquote>
<p>Perhaps you’ve attempted to write a blog, or you’ve taken up knitting and are loath to post pictures of your first pair of socks online, or you dread your first pitch meeting and have over-prepped over the weekend. Or perhaps you were like I was: you’d gotten comfortable in your job, and without realising it you had neglected the fear of doing new things for a bit. - <a href="https://commoncog.com/get-numb-get-good/">Get Numb Before You Get Good</a></p>
</blockquote>
<p>Doing anything for the first time is difficult, and that difficulty can prevent us from trying or continuing. It's easy to have unreasonable expectations and be disappointed when our early attempts are not as good as we'd like (or as good as someone else's). I like Cedric Chin's advice to "get numb first" - to just focus on doing - and then worry about getting good.</p>This One's For Me2022-11-30T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2022-11-30:/2022/11/30/this-ones-for-me/<p>I had a heart attack earlier this year. A <a href="https://en.wikipedia.org/wiki/Left_anterior_descending_artery#Widow_maker">"widow maker" heart attack</a>. I fell into cardiac arrest and was clinically dead for a couple of minutes. I'm incredibly fortunate two nurses happened to be nearby. They defibrillated me and administered CPR until the ambulance arrived and took me to …</p><p>I had a heart attack earlier this year. A <a href="https://en.wikipedia.org/wiki/Left_anterior_descending_artery#Widow_maker">"widow maker" heart attack</a>. I fell into cardiac arrest and was clinically dead for a couple of minutes. I'm incredibly fortunate two nurses happened to be nearby. They defibrillated me and administered CPR until the ambulance arrived and took me to the hospital. Those nurses quite literally saved my life.</p>
<p>In 2019, I crashed my bike while descending a mountain road at 30mph. I went over the handlebars, cracked my helmet on the pavement, and slid for some time. A mother and daughter found me trying to get back on my bike and ride home. They kindly called an ambulance, then my wife, and waited with me. I don’t really remember the wait, ambulance ride, or hospital visit due to the concussion.<sup>1</sup> I've only ridden my bike twice since.</p>
<p>In 2018, I was diagnosed with <a href="https://en.wikipedia.org/wiki/Chronic_myelogenous_leukemia">chronic myelogenous leukemia</a> (CML). Cancer. Fortunately, it was the best possible kind of cancer: easily manageable with targeted medication (though as of now, incurable). Thanks to the advent of <a href="https://en.wikipedia.org/wiki/Bcr-Abl_tyrosine-kinase_inhibitor">Bcr-Abl tyrosine kinase inhibitors</a>, it shouldn’t affect my lifespan. Still, that word - cancer - carries a lot of weight.</p>
<p>These three events have derailed me over the last few years.<sup>2</sup></p>
<figure>
<img src="/images/cml-exercise-bike-in-hospital.jpg" alt="Riding the exercise bike during my CML hospital stay.">
<figcaption>Riding the exercise bike during my CML hospital stay.</figcaption>
</figure>
<p>When the CML diagnosis came, I was training for what would have been my fourth <a href="https://en.wikipedia.org/wiki/Century_ride">imperial century</a>. Cycling had long been my mental and physical outlet. I'd just begun to take it more seriously, including bike commuting and training throughout winter in Chicago. I was probably in the best shape of my life, both mentally and physically.</p>
<figure>
<img src="/images/bike-top-of-san-bruno.jpg" alt="Atop San Bruno Mountain. I crashed on the descent shortly after taking this picture.">
<figcaption>Atop San Bruno Mountain. I crashed on the descent shortly after taking this picture.</figcaption>
</figure>
<p>When I crashed my bike, I was working my way back into that fitness. My wife and I moved to San Francisco not long after my CML diagnosis. I couldn't have been more excited to do the type of riding I'd longed for: lush, beautiful scenery, mountains, and gravel trails. The crash scared me away from all of that.</p>
<figure>
<img src="/images/sf-skyline-from-fort-point.jpg" alt="Looking back at Alcatraz and the San Francisco skyline from Fort Point. I had the heart attack less than an hour later. I don't remember taking this picture.">
<figcaption>Looking back at Alcatraz and the San Francisco skyline from Fort Point. I had the heart attack about 30 minutes later. I don't remember taking this picture.</figcaption>
</figure>
<p>The heart attack came at the end of a 3.5-mile run at Crissy Field. Again, I was working my way back into that fitness I’d first lost when the CML diagnosis came, and then never fully gained back after being scared off my bike.</p>
<h3>Why am I writing this?</h3>
<p>To finally get it off my chest.</p>
<p>I've written countless drafts of this post over the years. Each has been heavily influenced by the period and mental state in which it was written. They've typically been a mix of dramatics, confusion, and depression. Eventually they reach some form of acceptance, but they all ramble in the same way.</p>
<p>I've never been able to figure out the point of publishing them. It felt like this wasn't the place. It wasn't "who I was." My online identity was that of a software engineer and data scientist. My writing focused on technical topics. My identity felt singular.</p>
<p>Still, I've found it hard to write much of anything over these last few years. I've wanted to, but it always felt as if there was some imaginary hurdle I couldn't clear. I felt mentally blocked. Those drafts were standing in my way.</p>
<p>The point hit me during a recent run through Golden Gate Park. That lush, beautiful scenery I'd dreamed of cycling through when we moved here.</p>
<p>Those unpublished drafts - and this published one - are for me. They've allowed me to let it all out. To process it. To begin moving forward.</p>
<p>Of course, lots of therapy helped too.</p>
<h3>Now</h3>
<p>I feel back, and in better shape than before. Both physically <em>and</em> mentally.</p>
<p>Prior to my heart attack, I couldn’t run a 5k without walking. I never even considered a 10k.<sup>3</sup> I'd never run more than 66km in a month.</p>
<p>Last week I ran a 5k in 26:18, an 8:30 per mile pace. This morning I ran a 10k in 57:18, a 9:13 pace. My last three months of running distances were 75, 89, and 111 kilometers. </p>
<p>Progress.</p>
<p>It's been a long road back, but I'm writing this to remind myself - and maybe you too - that goals are achieved by slow and steady progress. There are no quick fixes for physical and mental health. But you can work your way to where you want to be, little by little over time.</p>
<p>This post is for me. It's me allowing myself to be proud of the mental and physical work I've put in these last few years.</p>
<hr>
<ol>
<li>I only know what happened because the daughter Googled me, found this site and my email address, and checked on me via email.</li>
<li>Let’s not forget a global pandemic in 2020 and 2021.</li>
<li>And here I was thinking I was just out of shape.</li>
</ol>Reviving this space2022-11-18T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2022-11-18:/2022/11/18/reviving-this-space/<p>Despite regularly wanting to write more, I haven’t written anything in 18 months. I have countless drafts that I can’t seem to finish, or at least get into a place I feel comfortable publishing. I've mentally blocked myself from doing so.</p>
<p>I can think of plenty of reasons …</p><p>Despite regularly wanting to write more, I haven’t written anything in 18 months. I have countless drafts that I can’t seem to finish, or at least get into a place I feel comfortable publishing. I've mentally blocked myself from doing so.</p>
<p>I can think of plenty of reasons for this, some of which I'll elaborate on in the future, but I think the Twitter-ification of my brain has been one of them. I struggle to think as deeply as I used to.</p>
<p>I’ve also felt hampered by past “success” of some posts. I’ve felt an obligation to stick to the “technical post” theme. My Google Analytics tells me that’s what people come here for. My ego wants to give the people what they want. I need to drive engagement, to get more readers, to hit the front page of Hacker News.</p>
<p>My brain got out of whack. I cared about the wrong things.</p>
<p>I'd like to fix that.</p>
<p>To start, I'm removing Google Analytics from this site and I don't intend to replace it with anything.<sup>1</sup> It's done more harm than good for me.</p>
<p>I also intend to expand the scope of topics I write about. I have mostly written technical content around data science. Put another way, I have mostly written about my profession. My self-identity has been pretty one-dimensional. I'd like to break out of that.</p>
<p>I started this blog as a place to share things I'm working on. As I've gotten older I've found that I don't enjoy "working" outside of work as much as I used to.</p>
<p>I'll still do some of that, but I also want this to be a place where I organize my thoughts around topics I'm interested in. Writing is the means by which I do my best thinking.</p>
<hr>
<ol>
<li>Thanks for the idea, <a href="https://www.treycausey.com/writing.html">Trey</a>.</li>
</ol>Mocking an imported module-level function in Python2021-06-28T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2021-06-28:/2021/06/28/mocking-imported-module-function-python/<p>The other day I spent far too much time trying to figure out how to mock a module-level function that was being used inside of my class's method in Python. My googling didn't lead to obvious answers, so I figured it'd be good to document here for future reference.</p>
<p>Imagine …</p><p>The other day I spent far too much time trying to figure out how to mock a module-level function that was being used inside of my class's method in Python. My googling didn't lead to obvious answers, so I figured it'd be good to document here for future reference.</p>
<p>Imagine we have some module-level function like the following:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># file: project/some_module/functions.py</span>
<span class="k">def</span> <span class="nf">fetch_thing</span><span class="p">():</span>
    <span class="c1"># query some database</span>
    <span class="k">return</span> <span class="n">data</span>
</code></pre></div>
<p>And that we use it inside of a class within a different module:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># file: project/other_module/thing.py</span>
<span class="kn">from</span> <span class="nn">some_module.functions</span> <span class="kn">import</span> <span class="n">fetch_thing</span>
<span class="k">class</span> <span class="nc">Thing</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">data</span> <span class="o">=</span> <span class="n">fetch_thing</span><span class="p">()</span>
            <span class="k">return</span> <span class="n">data</span>
        <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">fail_gracefully</span><span class="p">()</span>
</code></pre></div>
<p>In this example, I want to test that a failure fetching from the db will fail gracefully, so I need to mock <code>fetch_thing</code> and have it raise an exception.</p>
<p>I kept trying to mock the function at its module path, like so:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">unittest.mock</span> <span class="kn">import</span> <span class="n">patch</span>
<span class="kn">from</span> <span class="nn">other_module.thing</span> <span class="kn">import</span> <span class="n">Thing</span>

<span class="n">thing</span> <span class="o">=</span> <span class="n">Thing</span><span class="p">()</span>
<span class="k">with</span> <span class="n">patch</span><span class="p">(</span><span class="s1">'some_module.functions.fetch_thing'</span><span class="p">)</span> <span class="k">as</span> <span class="n">mocked</span><span class="p">:</span>
    <span class="n">mocked</span><span class="o">.</span><span class="n">side_effect</span> <span class="o">=</span> <span class="ne">Exception</span><span class="p">(</span><span class="s1">'mocked error'</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">thing</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
</code></pre></div>
<p>But this isn't right. It turns out that you need to mock/patch the function <strong>within the module it's being imported into.</strong></p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">unittest.mock</span> <span class="kn">import</span> <span class="n">patch</span>
<span class="kn">from</span> <span class="nn">other_module.thing</span> <span class="kn">import</span> <span class="n">Thing</span>

<span class="n">thing</span> <span class="o">=</span> <span class="n">Thing</span><span class="p">()</span>
<span class="k">with</span> <span class="n">patch</span><span class="p">(</span><span class="s1">'other_module.thing.fetch_thing'</span><span class="p">)</span> <span class="k">as</span> <span class="n">mocked</span><span class="p">:</span>
    <span class="n">mocked</span><span class="o">.</span><span class="n">side_effect</span> <span class="o">=</span> <span class="ne">Exception</span><span class="p">(</span><span class="s1">'mocked error'</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">thing</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
</code></pre></div>
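<p>If you want to see this behavior in isolation, here's a self-contained sketch. The module names are hypothetical and the modules are built in memory, so it runs without any project layout:</p>

```python
# Demonstrates "patch where it's used" with two throwaway in-memory
# modules (hypothetical names, registered in sys.modules so that
# unittest.mock.patch can resolve them by string path).
import sys
import types
from unittest.mock import patch

# The defining module: like some_module/functions.py in the post.
functions = types.ModuleType('functions')
functions.fetch_thing = lambda: 'real data'
sys.modules['functions'] = functions

# The importing module: `from functions import fetch_thing` copies the
# function reference into this module's own namespace.
thing_module = types.ModuleType('thing')
thing_module.fetch_thing = functions.fetch_thing
sys.modules['thing'] = thing_module

class Thing:
    def run(self):
        try:
            return thing_module.fetch_thing()
        except Exception:
            return 'failed gracefully'

t = Thing()

# Patching the defining module leaves the imported reference untouched...
with patch('functions.fetch_thing', side_effect=Exception('boom')):
    print(t.run())  # prints: real data

# ...but patching the importing module's namespace works as intended.
with patch('thing.fetch_thing', side_effect=Exception('boom')):
    print(t.run())  # prints: failed gracefully
```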
<p>Note the very subtle difference in the string path. Because the function is imported into <code>other_module.thing</code>, where our class uses it, <strong>that</strong> is the namespace we need to patch.</p>Using Go and Twilio to monitor my email2020-12-11T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2020-12-11:/2020/12/11/using-go-and-twilio-to-monitor-my-email/<p>Sometimes I'm expecting an email and want to be notified shortly after receiving it. But I also don't want to stare at my inbox, something I don't particularly enjoy checking in the first place.</p>
<p>To illustrate an example, imagine you're browsing a niche site with limited edition goods, some of …</p><p>Sometimes I'm expecting an email and want to be notified shortly after receiving it. But I also don't want to stare at my inbox, something I don't particularly enjoy checking in the first place.</p>
<p>To illustrate an example, imagine you're browsing a niche site with limited edition goods, some of which you love but are out of stock (they're limited, after all). Each item has a helpful "Join waitlist" button, allowing you to provide your email address and receive an email once the item is back in stock. Great feature!</p>
<p>There are a couple key pieces to the above scenario though:</p>
<ol>
<li>The items are limited (supply).</li>
<li>There's a waitlist of unknown size (demand).</li>
</ol>
<p>In effect, we are being told that supply is fixed and demand is not. If demand is far greater than supply it's likely the item will go out of stock again shortly after the email goes out. This is because those who receive the email first will rush to purchase it, knowing that it's a limited item. How can you ensure you see the email shortly after it's sent?</p>
<p>One idea is to just turn on push notifications for all email, but this approach would have a lot of noise and little signal. I'd like to be notified when a <em>specific</em> email arrives, not when <em>any</em> email arrives.</p>
<p>I spend a lot of time in the Messages app texting with friends and family, so a service that sends me a text message would be great, since I'd see it sooner than an email.</p>
<p>Knowing <a href="https://developers.google.com/gmail/api">Gmail has an API</a> and <a href="https://www.twilio.com/referral/XCX3Mu">Twilio</a> would make the text messaging piece easy, this felt like a fun little problem to solve and a good excuse to try a new programming language. I opted for <a href="https://golang.org/">Go</a>.</p>
<h2>Why Go</h2>
<p>I've primarily worked in <a href="https://www.python.org/">Python</a> for the last decade. It's a language that I know and love deeply, and I especially appreciate its emphasis on readability and simplicity. It's a language that allows me to focus on the problem I am solving and doesn't get in the way.</p>
<p>But two common complaints that many Python users eventually have are the language's lack of static typing and that it is slow. While I've rarely found performance to truly be a bottleneck, I have gained an appreciation for statically typed, compiled languages.</p>
<p>Go was born at a time when Python adoption was on the rise thanks to the above qualities. While languages like Java and C++ allowed for more performant solutions, each came with more verbosity and complexity.</p>
<p>Go was designed with developer productivity as a primary concern. One of its creators, Rob Pike, <a href="https://commandcenter.blogspot.com/2012/06/less-is-exponentially-more.html">describes it best</a>:</p>
<blockquote>
<p>What you're given is a set of powerful but easy to understand, easy to use building blocks from which you can assemble—compose—a solution to your problem. It might not end up quite as fast or as sophisticated or as ideologically motivated as the solution you'd write in some of those other languages, but it'll almost certainly be easier to write, easier to read, easier to understand, easier to maintain, and maybe safer.</p>
<p>To put it another way, oversimplifying of course:</p>
<p>Python and Ruby programmers come to Go because they don't have to surrender much expressiveness, but gain performance and get to play with concurrency.</p>
</blockquote>
<p>This philosophy feels very <a href="https://stackoverflow.com/a/25011492/1419514">Pythonic</a> to me. It's the reason I opted to give Go a ... uh, go.</p>
<h2>Code</h2>
<p>A Google search of "golang gmail" brings up a <a href="https://developers.google.com/gmail/api/quickstart/go">quickstart</a> on using the Gmail API and Go to read your inbox labels. The vast majority of this code is authentication handling but it's also almost everything we need.</p>
<p>To search our inbox and send a text when the search has results, we'll add the following functions to the quickstart code:</p>
<ol>
<li><code>queryMessages</code>, which will call Gmail's <a href="https://developers.google.com/gmail/api/reference/rest/v1/users.messages/list"><code>users.messages.list</code></a> method to search a user's inbox and return any matching messages.</li>
<li><code>buildSMS</code>, which will create the message content to be sent via text/SMS message.</li>
<li><code>sendSMS</code>, which will use the <a href="https://www.twilio.com/docs/usage/api">Twilio REST API</a> to send the text message to a given phone number.</li>
</ol>
<h4>queryMessages</h4>
<ol>
<li>Takes inputs of a <a href="https://pkg.go.dev/google.golang.org/api/gmail/v1#Service">Gmail Service object</a>, a string denoting the user, and another string for the search <code>q</code> (e.g. "foo", "from:foo", etc.). Note the <code>*</code> symbol preceding a type indicates it is a <a href="https://en.wikipedia.org/wiki/Pointer_(computer_programming)">pointer</a>. Go allows objects to be passed by pointer, differing from Python's "<a href="https://docs.python.org/3/faq/programming.html#how-do-i-write-a-function-with-output-parameters-call-by-reference">pass by assignment</a>".</li>
<li>Using the <code>service</code> pointer, calls Gmail's <code>list</code> endpoint with the <code>q</code> parameter to find any messages matching the search. This is akin to using the search box within Gmail.</li>
<li>Does some logging and checks to ensure the API returns a valid response.</li>
<li>Returns an array of <a href="https://pkg.go.dev/google.golang.org/api/gmail/v1#Message"><code>Message</code></a> pointers.</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="kd">func</span><span class="w"> </span><span class="nx">queryMessages</span><span class="p">(</span><span class="nx">service</span><span class="w"> </span><span class="o">*</span><span class="nx">gmail</span><span class="p">.</span><span class="nx">Service</span><span class="p">,</span><span class="w"> </span><span class="nx">user</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">q</span><span class="w"> </span><span class="kt">string</span><span class="p">)</span><span class="w"> </span><span class="p">[]</span><span class="o">*</span><span class="nx">gmail</span><span class="p">.</span><span class="nx">Message</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"Searching for messages containing: %v"</span><span class="p">,</span><span class="w"> </span><span class="nx">q</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">response</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">service</span><span class="p">.</span><span class="nx">Users</span><span class="p">.</span><span class="nx">Messages</span><span class="p">.</span><span class="nx">List</span><span class="p">(</span><span class="nx">user</span><span class="p">).</span><span class="nx">Q</span><span class="p">(</span><span class="nx">q</span><span class="p">).</span><span class="nx">Do</span><span class="p">()</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Fatalf</span><span class="p">(</span><span class="s">"Unable to retrieve messages: %v"</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">HTTPStatusCode</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"Request returned status code: %v\n"</span><span class="p">,</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">HTTPStatusCode</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"Number of messages found: %v\n"</span><span class="p">,</span><span class="w"> </span><span class="nb">len</span><span class="p">(</span><span class="nx">response</span><span class="p">.</span><span class="nx">Messages</span><span class="p">))</span><span class="w"></span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">Messages</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<h4>buildSMS</h4>
<p><code>buildSMS</code> takes many of the same inputs as the previous function (there's definitely a nicer way to write this code), but also takes in the list of <code>Messages</code> the previous function returned, as well as whether each <code>Message</code> snippet should be included in the SMS message.</p>
<div class="highlight"><pre><span></span><code><span class="kd">func</span><span class="w"> </span><span class="nx">buildSMS</span><span class="p">(</span><span class="nx">service</span><span class="w"> </span><span class="o">*</span><span class="nx">gmail</span><span class="p">.</span><span class="nx">Service</span><span class="p">,</span><span class="w"> </span><span class="nx">user</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">messages</span><span class="w"> </span><span class="p">[]</span><span class="o">*</span><span class="nx">gmail</span><span class="p">.</span><span class="nx">Message</span><span class="p">,</span><span class="w"> </span><span class="nx">q</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">includeSnippets</span><span class="w"> </span><span class="kt">bool</span><span class="p">)</span><span class="w"> </span><span class="kt">string</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="kd">var</span><span class="w"> </span><span class="nx">sb</span><span class="w"> </span><span class="nx">strings</span><span class="p">.</span><span class="nx">Builder</span><span class="w"></span>
<span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nx">Fprintf</span><span class="p">(</span><span class="o">&</span><span class="nx">sb</span><span class="p">,</span><span class="w"> </span><span class="s">"Hi! You have %v emails matching your search of \"%v\"."</span><span class="p">,</span><span class="w"> </span><span class="nb">len</span><span class="p">(</span><span class="nx">messages</span><span class="p">),</span><span class="w"> </span><span class="nx">q</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nb">len</span><span class="p">(</span><span class="nx">messages</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="s">""</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">includeSnippets</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nx">Fprintf</span><span class="p">(</span><span class="o">&</span><span class="nx">sb</span><span class="p">,</span><span class="w"> </span><span class="s">" Here's what they look like.\n"</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nx">i</span><span class="p">,</span><span class="w"> </span><span class="nx">m</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="k">range</span><span class="w"> </span><span class="nx">messages</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"(%v) Fetching message %v\n"</span><span class="p">,</span><span class="w"> </span><span class="nx">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="nx">m</span><span class="p">.</span><span class="nx">Id</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">m</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">service</span><span class="p">.</span><span class="nx">Users</span><span class="p">.</span><span class="nx">Messages</span><span class="p">.</span><span class="nx">Get</span><span class="p">(</span><span class="nx">user</span><span class="p">,</span><span class="w"> </span><span class="nx">m</span><span class="p">.</span><span class="nx">Id</span><span class="p">).</span><span class="nx">Do</span><span class="p">()</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Fatalf</span><span class="p">(</span><span class="s">"Unable to retrieve message ID %v: %v"</span><span class="p">,</span><span class="w"> </span><span class="nx">m</span><span class="p">.</span><span class="nx">Id</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nx">Fprintf</span><span class="p">(</span><span class="o">&</span><span class="nx">sb</span><span class="p">,</span><span class="w"> </span><span class="s">"(%v) - %v\n"</span><span class="p">,</span><span class="w"> </span><span class="nx">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="nx">m</span><span class="p">.</span><span class="nx">Snippet</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nx">sb</span><span class="p">.</span><span class="nx">String</span><span class="p">()</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>Go's <a href="https://golang.org/pkg/strings/#Builder"><code>strings.Builder</code></a> is an in-memory buffer that strings can be written to directly, minimizing memory copying. Declaring <code>var sb strings.Builder</code> gives us a ready-to-use buffer, and each <code>Fprintf</code> to <code>&sb</code> appends directly to it. Calling <code>sb.String()</code> returns a string of whatever we've written to the <code>Builder</code>.</p>
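<p>To make that concrete, here's a minimal standalone sketch of the same pattern (the <code>summary</code> helper below is hypothetical, not part of the post's script):</p>

```go
package main

import (
	"fmt"
	"strings"
)

// summary builds a message incrementally in a strings.Builder,
// avoiding intermediate string concatenations.
func summary(count int, q string) string {
	var sb strings.Builder // the zero value is ready to use
	// Builder implements io.Writer, so Fprintf can append to it.
	fmt.Fprintf(&sb, "You have %d emails", count)
	fmt.Fprintf(&sb, " matching %q.", q)
	return sb.String() // one final copy out of the buffer
}

func main() {
	fmt.Println(summary(3, "hello"))
}
```

<p>Because <code>strings.Builder</code> implements <code>io.Writer</code>, <code>fmt.Fprintf(&sb, ...)</code> can append to it without creating intermediate strings along the way.</p>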
<h4>sendSMS</h4>
<p>Finally, we need to call the <a href="https://www.twilio.com/docs/sms">Twilio SMS API</a> to send our text. All that's needed is a POST request to the <code>/Messages.json</code> endpoint with our message data form-encoded in the request body.</p>
<div class="highlight"><pre><span></span><code><span class="kd">func</span><span class="w"> </span><span class="nx">sendSMS</span><span class="p">(</span><span class="nx">phoneNumber</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">message</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">config</span><span class="w"> </span><span class="o">*</span><span class="nx">Config</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">msgData</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">url</span><span class="p">.</span><span class="nx">Values</span><span class="p">{}</span><span class="w"></span>
<span class="w"> </span><span class="nx">msgData</span><span class="p">.</span><span class="nx">Set</span><span class="p">(</span><span class="s">"To"</span><span class="p">,</span><span class="w"> </span><span class="nx">phoneNumber</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">msgData</span><span class="p">.</span><span class="nx">Set</span><span class="p">(</span><span class="s">"From"</span><span class="p">,</span><span class="w"> </span><span class="nx">config</span><span class="p">.</span><span class="nx">Twilio</span><span class="p">.</span><span class="nx">PhoneNumber</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">msgData</span><span class="p">.</span><span class="nx">Set</span><span class="p">(</span><span class="s">"Body"</span><span class="p">,</span><span class="w"> </span><span class="nx">message</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">reader</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="o">*</span><span class="nx">strings</span><span class="p">.</span><span class="nx">NewReader</span><span class="p">(</span><span class="nx">msgData</span><span class="p">.</span><span class="nx">Encode</span><span class="p">())</span><span class="w"></span>
<span class="w"> </span><span class="nx">reqURL</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">config</span><span class="p">.</span><span class="nx">Twilio</span><span class="p">.</span><span class="nx">BaseURL</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"/Accounts/"</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nx">config</span><span class="p">.</span><span class="nx">Twilio</span><span class="p">.</span><span class="nx">AccountSID</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"/Messages.json"</span><span class="w"></span>
<span class="w"> </span><span class="nx">client</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="o">&</span><span class="nx">http</span><span class="p">.</span><span class="nx">Client</span><span class="p">{}</span><span class="w"></span>
<span class="w"> </span><span class="nx">req</span><span class="p">,</span><span class="w"> </span><span class="nx">_</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">http</span><span class="p">.</span><span class="nx">NewRequest</span><span class="p">(</span><span class="s">"POST"</span><span class="p">,</span><span class="w"> </span><span class="nx">reqURL</span><span class="p">,</span><span class="w"> </span><span class="o">&</span><span class="nx">reader</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">req</span><span class="p">.</span><span class="nx">SetBasicAuth</span><span class="p">(</span><span class="nx">config</span><span class="p">.</span><span class="nx">Twilio</span><span class="p">.</span><span class="nx">AccountSID</span><span class="p">,</span><span class="w"> </span><span class="nx">config</span><span class="p">.</span><span class="nx">Twilio</span><span class="p">.</span><span class="nx">AuthToken</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">req</span><span class="p">.</span><span class="nx">Header</span><span class="p">.</span><span class="nx">Add</span><span class="p">(</span><span class="s">"Accept"</span><span class="p">,</span><span class="w"> </span><span class="s">"application/json"</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">req</span><span class="p">.</span><span class="nx">Header</span><span class="p">.</span><span class="nx">Add</span><span class="p">(</span><span class="s">"Content-Type"</span><span class="p">,</span><span class="w"> </span><span class="s">"application/x-www-form-urlencoded"</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">response</span><span class="p">,</span><span class="w"> </span><span class="nx">_</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">client</span><span class="p">.</span><span class="nx">Do</span><span class="p">(</span><span class="nx">req</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="kd">var</span><span class="w"> </span><span class="nx">data</span><span class="w"> </span><span class="kd">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kd">interface</span><span class="p">{}</span><span class="w"></span>
<span class="w"> </span><span class="nx">decoder</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">json</span><span class="p">.</span><span class="nx">NewDecoder</span><span class="p">(</span><span class="nx">response</span><span class="p">.</span><span class="nx">Body</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">decoder</span><span class="p">.</span><span class="nx">Decode</span><span class="p">(</span><span class="o">&</span><span class="nx">data</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">StatusCode</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="o">&&</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">StatusCode</span><span class="w"> </span><span class="p"><</span><span class="w"> </span><span class="mi">300</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"Twilio message SID: %v"</span><span class="p">,</span><span class="w"> </span><span class="nx">data</span><span class="p">[</span><span class="s">"sid"</span><span class="p">])</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"Twilio returned status: %v"</span><span class="p">,</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">Status</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<h4>Putting it all together</h4>
<p>Putting all the necessary pieces together gives us <a href="https://github.com/gjreda/gmail-text-notifications/blob/master/main.go">this script</a>, which takes a search term (or terms) and a phone number as input.</p>
<div class="highlight"><pre><span></span><code>$ go build main.go
$ ./main -q hello -phone +131255555555
<span class="m">2020</span>/12/09 <span class="m">16</span>:28:20 Searching <span class="k">for</span> messages containing: hello
<span class="m">2020</span>/12/09 <span class="m">16</span>:28:21 Number of messages found: <span class="m">100</span>
<span class="m">2020</span>/12/09 <span class="m">16</span>:28:22 Twilio message SID: SM72f7e0080030412284dec3afab19489d
</code></pre></div>
<p><center>
<img src="/images/email-sms-message.jpg" alt="SMS letting me know I have emails matching the search" width="350px">
</center>
I found Go pretty nice to work with and intend to explore it more. It scratches the "statically typed, compiled language" itch I've had recently. I'm particularly intrigued by its concurrency patterns and plan to do some comparisons against Python + pandas for data pipeline tasks.</p>
<p>You can find the code for this project <a href="https://github.com/gjreda/gmail-text-notifications">on my Github</a>.</p>
<p><strong>Additional Reading:</strong></p>
<ul>
<li><a href="https://winterflower.github.io/2017/08/20/the-asterisk-and-the-ampersand/">the asterisk and the ampersand - a golang tale</a></li>
<li><a href="https://commandcenter.blogspot.com/2012/06/less-is-exponentially-more.html">Less is exponentially more</a></li>
</ul>
<h2>Deploying static sites with Github Actions</h2>
<p><em>2020-12-09 · Greg Reda</em></p>
<p>A while back I <a href="http://gregreda.com/2015/03/26/static-site-deployments/">wrote</a> about deploying my site using Github and Travis CI. But recently it seems <a href="https://news.ycombinator.com/item?id=25338983">Travis CI stopped being free for open source projects</a>.</p>
<p>If you're using a static site generator for your site and hosting it on S3, you can use <a href="https://docs.github.com/en/free-pro-team@latest/actions">Github Actions</a> to build and deploy your site on each commit (or PR, or whatever).</p>
<h2>Setup</h2>
<p>If you've already set up Travis CI to deploy your site to S3, switching to Github Actions won't be very difficult.</p>
<p>Actions are defined in YAML and need to live at a path of <code>.github/workflows</code> within your repo. We'll name ours <code>deploy.yml</code>, so its path will be <code>.github/workflows/deploy.yml</code>.</p>
<p>Before defining our workflow steps, we'll want to add any necessary secret passwords, keys, tokens, and such to our repo's <a href="https://docs.github.com/en/free-pro-team@latest/actions/reference/encrypted-secrets">encrypted secrets</a>. This lets the workflow access them while keeping them securely stored and visible only to those with access to the repo.</p>
<p>Since my site is hosted using S3 and Cloudfront, I'll need secrets for my AWS access keys.</p>
<p><img alt="My repo's Github Secrets page" src="/images/github-secrets.png"></p>
<p>Next, we'll create our <code>deploy.yml</code> file. Github kindly supplies <a href="https://github.com/actions/starter-workflows">starter workflows</a> in many languages, but since this site uses Pelican, a static site generator for Python, we'll use the <a href="https://docs.github.com/en/free-pro-team@latest/actions/guides/building-and-testing-python">Python starter workflow</a>.</p>
<div class="highlight"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">deploy</span><span class="w"></span>
<span class="nt">on</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">push</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">branches</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">master</span><span class="p p-Indicator">]</span><span class="w"></span>
<span class="nt">jobs</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">deploy</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">runs-on</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ubuntu-latest</span><span class="w"></span>
<span class="w"> </span><span class="nt">steps</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions/checkout@v2</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Set up Python</span><span class="w"></span>
<span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions/setup-python@v2</span><span class="w"></span>
<span class="w"> </span><span class="nt">with</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">python-version</span><span class="p">:</span><span class="w"> </span><span class="s">'2.7'</span><span class="w"></span>
</code></pre></div>
<p>Yes, I'm still using a very old version of Pelican with Python 2.7. I swear I use Python3 everywhere else.</p>
<p>Since we're deploying to S3, we'll need to add a step for configuring our AWS credentials using the <a href="https://github.com/marketplace/actions/configure-aws-credentials-action-for-github-actions">configure credentials action</a>. The <code>aws-region</code> should match whichever region your bucket is in.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Configure AWS credentials</span><span class="w"></span>
<span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">aws-actions/configure-aws-credentials@v1</span><span class="w"></span>
<span class="w"> </span><span class="nt">with</span><span class="p">:</span><span class="w"> </span>
<span class="w"> </span><span class="nt">aws-access-key-id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.AWS_ACCESS_KEY_ID }}</span><span class="w"></span>
<span class="w"> </span><span class="nt">aws-secret-access-key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.AWS_SECRET_ACCESS_KEY }}</span><span class="w"></span>
<span class="w"> </span><span class="nt">aws-region</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">us-east-1</span><span class="w"></span>
</code></pre></div>
<p>Because my website's repo uses <a href="https://git-scm.com/book/en/v2/Git-Tools-Submodules">git submodules</a>, I need to add another step for checking out and updating these submodules on each build.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Build submodules</span><span class="w"></span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span><span class="w"></span>
<span class="w"> </span><span class="no">sed -i 's/git@github.com:/https:\/\/github.com\//' .gitmodules</span><span class="w"></span>
<span class="w"> </span><span class="no">git submodule update --init --recursive</span><span class="w"></span>
</code></pre></div>
<p>We also need to <code>pip install</code> any dependencies from <code>requirements.txt</code>, like Pelican.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Install dependencies</span><span class="w"></span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span><span class="w"></span>
<span class="w"> </span><span class="no">sudo apt-get install -qq pandoc</span><span class="w"></span>
<span class="w"> </span><span class="no">python -m pip install --upgrade pip</span><span class="w"></span>
<span class="w"> </span><span class="no">pip install -r requirements.txt</span><span class="w"></span>
</code></pre></div>
<p>And finally, we can build our site, deploy it to S3, and invalidate the Cloudfront cache.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Build website</span><span class="w"></span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span><span class="w"></span>
<span class="w"> </span><span class="no">pelican content</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Deploy to S3</span><span class="w"></span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span><span class="w"></span>
<span class="w"> </span><span class="no">aws s3 sync output/. s3://www.gregreda.com --acl public-read</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Invalidate Cloudfront cache</span><span class="w"></span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span><span class="w"></span>
<span class="w"> </span><span class="no">aws configure set preview.cloudfront true</span><span class="w"></span>
<span class="w"> </span><span class="no">aws cloudfront create-invalidation --distribution-id ${{ secrets.AWS_CLOUDFRONT_DISTRIBUTION_ID }} --paths "/*"</span><span class="w"></span>
</code></pre></div>
<p>Putting it all together gives us <a href="https://github.com/gjreda/gregreda.com/blob/master/.github/workflows/deploy.yml">this YAML file</a>, which builds and deploys this website on every commit to <code>master</code>.</p>
<p>That's it. Continuous deployment for your S3 hosted website.</p>
<h2>newbird: a theme for pelican</h2>
<p><em>2020-11-25 · Greg Reda</em></p>
<p>In 2014, I <a href="https://github.com/gjreda/void">wrote a custom theme</a> for <a href="https://blog.getpelican.com/">Pelican</a>, the static site generator I use for this site.</p>
<p>At the time, there were few themes available and I wanted something that was fairly simple in its design, but also that I understood well enough to tweak as necessary. I opted to use <a href="http://getskeleton.com/">Skeleton</a> for the theme's general structure, but also added a <a href="https://github.com/gjreda/void/blob/master/static/css/void.css">fair amount of custom CSS</a> to get things the way I wanted.</p>
<p>But over time all that custom CSS became more of a pain than it was worth. I wanted something I could just drop in and have it look nice.</p>
<p>Yesterday I came across <a href="https://newcss.net/">new.css</a>, which I feel achieves its goal of sensible design and <a href="https://blog.usejournal.com/the-next-css-frontier-classless-5e66f3f25fdd">classless CSS</a>. It allowed me to quickly create a new Pelican theme with limited CSS-fiddling.</p>
<p>The new theme, which I've named <a href="https://github.com/gjreda/newbird-pelican-theme">newbird</a>, includes support for Google Analytics, <a href="https://developer.twitter.com/en/docs/twitter-for-websites/cards/overview/abouts-cards">Twitter Cards</a>, and <a href="https://developers.facebook.com/docs/sharing/webmasters/">Facebook Open Graph</a>. It also allows for articles to be written in Jupyter Notebooks thanks to Pelican's liquid tags plugin. Notably, I opted not to include any social sharing buttons in order to decrease clutter and page loads.</p>
<p>If you're interested in using newbird for your Pelican-based site, you can find it <a href="https://github.com/gjreda/newbird-pelican-theme">here</a>.</p>
<h2>Scraping pages behind login forms</h2>
<p><em>2020-11-17 · Greg Reda</em></p>
<p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python</a>, showing how to use multithreading to speed things up.</li>
<li><a href="http://www.gregreda.com/2020/11/17/scraping-pages-behind-login-forms/">Scraping Pages Behind Login Forms</a>, which shows how to log into sites using Python.</li>
</ol>
<hr>
<p>The other day a friend asked whether there was an easier way for them to get 1000+ Goodreads reviews without manually doing it one-by-one. It sounded like a fun little scraping project to me.</p>
<p>One small complexity was that the user's book reviews were not public, which meant you needed to log into Goodreads to access them. Thankfully, with a little understanding of how HTML forms work, Python's <a href="https://requests.readthedocs.io/en/master/">requests</a> library makes this doable with a few lines of code.</p>
<p>This post walks through how to tackle the problem. If you'd like to jump straight to the code, you can find it <a href="https://github.com/gjreda/goodreads-reviews">on my Github</a>.</p>
<p>While we'll use Goodreads here, the same concepts apply to most websites.</p>
<p>First, you'll need to dig into how the site's login forms work. I find the best way to do this is by finding the page that is solely for login. Here's an example from Goodreads:</p>
<p><img alt="example login page" src="/images/goodreads-login-page.png"></p>
<p>From there, you'll need to find the necessary details of the login form. While this will include some sort of username/email and password, it will likely also include a token and possibly other details.</p>
<p>The best way to find these details is to open your browser's developer tools on one of the input fields (like username/email) -- for example, by right-clicking the field and choosing Inspect. This will bring you to the code responsible for the form and let you find the required details.</p>
<p><img alt="example login form" src="/images/goodreads-login-form.png"></p>
<p>Using the screenshot above as an example, we can see the form requires some user input fields as well as some hidden fields:</p>
<ol>
<li>A hidden <code>utf8</code> field with a checkmark value. The checkmark is converted to its HTML character reference, <code>&#x2713;</code>, on submission.</li>
<li>A hidden <code>authenticity_token</code> with a provided value.</li>
<li>A <code>user[email]</code> which is input via the form.</li>
<li>A <code>user[password]</code> which is input via the form.</li>
<li>A hidden <code>n</code> field with a provided value.</li>
</ol>
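<p>Hidden inputs like these can be pulled out of the page with BeautifulSoup. Here's a minimal sketch against an inline HTML snippet modeled on the form above -- the token and <code>n</code> values are made up for illustration:</p>

```python
from bs4 import BeautifulSoup

# Illustrative HTML modeled on the login form; token and n values are fake
html = """
<form action="https://www.goodreads.com/user/sign_in" method="post">
  <input type="hidden" name="utf8" value="&#x2713;">
  <input type="hidden" name="authenticity_token" value="abc123==">
  <input type="email" name="user[email]">
  <input type="password" name="user[password]">
  <input type="hidden" name="n" value="674578">
</form>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every hidden input into a name -> value dict
hidden = {tag["name"]: tag["value"]
          for tag in soup.find_all("input", attrs={"type": "hidden"})}

print(hidden)  # the utf8 entity is decoded to a literal checkmark by the parser
```

<p>The user-supplied fields (email and password) have no <code>value</code> attribute to collect; we'll fill those in ourselves when building the POST payload.</p>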
<p>When you enter your email and password into the form and press login, the first line in the highlighted red box tells us that the form data is sent via an HTTP POST request to <code>https://www.goodreads.com/user/sign_in</code> (seen in the <code>method</code> and <code>action</code> fields, respectively). The user and password fields are then checked against the site's database to validate the information. Essentially, it's saying "Here are the credentials I was given. Is this a valid user?" If the credentials are valid, you are redirected to some page within the app (like the user's home page).</p>
<p>Once login is successful, a <a href="https://en.wikipedia.org/wiki/HTTP_cookie">cookie</a> is then stored in your browser's memory. Every time you access one of the site's pages, the site checks to make sure the cookie is valid and that you are allowed to access the page you are trying to reach.</p>
<p>To scrape data that is behind login forms, we'll need to replicate this behavior using the requests library. In particular, we'll need to use its <a href="https://requests.readthedocs.io/en/master/user/advanced/#session-objects">Session object</a>, which will capture and store any cookie information for us.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="n">LOGIN_URL</span> <span class="o">=</span> <span class="s2">"https://www.goodreads.com/user/sign_in"</span>
<span class="k">def</span> <span class="nf">get_authenticity_token</span><span class="p">(</span><span class="n">html</span><span class="p">):</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s2">"html.parser"</span><span class="p">)</span>
<span class="n">token</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'input'</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s1">'name'</span><span class="p">:</span> <span class="s1">'authenticity_token'</span><span class="p">})</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">token</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'could not find `authenticity_token` on login form'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">token</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'value'</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">get_login_n</span><span class="p">(</span><span class="n">html</span><span class="p">):</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s2">"html.parser"</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'input'</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s1">'name'</span><span class="p">:</span> <span class="s1">'n'</span><span class="p">})</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">n</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'could not find `n` on login form'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">n</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'value'</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
<span class="n">email</span> <span class="o">=</span> <span class="s2">"some@email.com"</span> <span class="c1"># login email</span>
<span class="n">password</span> <span class="o">=</span> <span class="s2">"somethingsecret"</span> <span class="c1"># login password</span>
<span class="n">payload</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'user[email]'</span><span class="p">:</span> <span class="n">email</span><span class="p">,</span>
<span class="s1">'user[password]'</span><span class="p">:</span> <span class="n">password</span><span class="p">,</span>
<span class="s1">'utf8'</span><span class="p">:</span> <span class="s1">'&#x2713;'</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">session</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span>
<span class="n">session</span><span class="o">.</span><span class="n">headers</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'User-Agent'</span><span class="p">:</span> <span class="p">(</span><span class="s1">'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '</span>
<span class="s1">'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'</span><span class="p">)}</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">LOGIN_URL</span><span class="p">)</span>
<span class="n">token</span> <span class="o">=</span> <span class="n">get_authenticity_token</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">get_login_n</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">payload</span><span class="o">.</span><span class="n">update</span><span class="p">({</span>
<span class="s1">'authenticity_token'</span><span class="p">:</span> <span class="n">token</span><span class="p">,</span>
<span class="s1">'n'</span><span class="p">:</span> <span class="n">n</span>
<span class="p">})</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"attempting to log in as </span><span class="si">{</span><span class="n">email</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">session</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">LOGIN_URL</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span> <span class="c1"># perform login</span>
</code></pre></div>
<p>If the POST request in the last line is successful, our session object should now contain a cookie that allows us to programmatically access the same pages our user normally has access to. We'll simply need to request these pages using <code>session.get</code> and can then proceed as I've <a href="/2013/03/03/web-scraping-101-with-python/">previously detailed</a>.</p>
<p>You can find the complete code for this post <a href="https://github.com/gjreda/goodreads-reviews">on my Github</a>.</p>Feature Engineering with Time Gaps2020-02-16T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2020-02-16:/2020/02/16/feature-engineering-with-time-gaps/<p>I tend to forget how to write certain blocks of code when I haven't written them in a while. Here's a common machine learning preprocessing task that falls into that category.</p>
<p>Imagine you have some event logs that capture an entity ID (user, store, ad, etc), timestamp, an event name …</p><p>I tend to forget how to write certain blocks of code when I haven't written them in a while. Here's a common machine learning preprocessing task that falls into that category.</p>
<p>Imagine you have some event logs that capture an entity ID (user, store, ad, etc), timestamp, an event name, and maybe some other details. The data looks something like this:</p>
<div class="highlight"><pre><span></span><code>userid timestamp event
789 2019-07-18 01:06:00 login
123 2019-07-19 08:30:00 login
789 2019-07-20 02:39:00 login
789 2019-07-20 08:15:00 login
456 2019-07-20 10:05:00 login
123 2019-07-20 14:40:00 login
123 2019-07-20 18:05:00 login
456 2019-07-21 21:11:00 login
789 2019-07-22 10:05:00 login
123 2019-07-23 09:18:00 login
789 2019-07-23 17:35:00 login
123 2019-07-25 16:49:00 login
789 2019-07-26 12:13:00 login
123 2019-07-27 19:56:00 login
</code></pre></div>
<p>For the sake of simplicity, let's say we want to build a model predicting whether or not a user will log in tomorrow. Our target is <code>y = bool(logins)</code>.</p>
<p>Three features we think will be informative are the user's previous logins, whether they logged in yesterday, and the number of days since their last login. We'll call these features <code>lifetime_logins</code>, <code>logins_yesterday</code>, and <code>days_since_last_login</code>.</p>
<p>Using <a href="https://pandas.pydata.org/">pandas</a>, we aggregate by user and date to get each user's daily count of logins.</p>
<div class="highlight"><pre><span></span><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_clipboard</span><span class="p">(</span><span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s1">'timestamp'</span><span class="p">])</span>
<span class="n">user_logins</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'timestamp'</span><span class="p">)</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'userid'</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">'D'</span><span class="p">)])</span>
<span class="o">.</span><span class="n">size</span><span class="p">()</span>
<span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="s1">'logins'</span><span class="p">))</span>
<span class="c1"># userid timestamp</span>
<span class="c1"># 123 2019-07-19 1</span>
<span class="c1"># 2019-07-20 2</span>
<span class="c1"># 2019-07-23 1</span>
<span class="c1"># 2019-07-25 1</span>
<span class="c1"># 2019-07-27 1</span>
<span class="c1"># 456 2019-07-20 1</span>
<span class="c1"># 2019-07-21 1</span>
<span class="c1"># 789 2019-07-18 1</span>
<span class="c1"># 2019-07-20 2</span>
<span class="c1"># 2019-07-22 1</span>
<span class="c1"># 2019-07-23 1</span>
<span class="c1"># 2019-07-26 1</span>
<span class="c1"># Name: logins, dtype: int64</span>
</code></pre></div>
<p>But we're missing critical information. This is when the brain fart happens.</p>
<p>Recall the structure of our logs. Notice they omit records for when the user had no activity. In order to create our features, we need to fill in time gaps for each user and then roll that information forward.</p>
<p>The goal of this post is to help me remember how to do this in the future.</p>
<h3>Filling Time Gaps</h3>
<p>First, we need to put each user on a continuous time scale.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># create a continuous DatetimeIndex at a daily level</span>
<span class="n">dates</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">timestamp</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span>
<span class="n">df</span><span class="o">.</span><span class="n">timestamp</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span>
<span class="n">freq</span><span class="o">=</span><span class="s1">'1D'</span><span class="p">)</span>
<span class="c1"># get unique set of user ids</span>
<span class="n">users</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'userid'</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>
<span class="c1"># create a MultiIndex that is the product (cross-join) of</span>
<span class="c1"># users and DatetimeIndexes</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">MultiIndex</span><span class="o">.</span><span class="n">from_product</span><span class="p">([</span><span class="n">users</span><span class="p">,</span> <span class="n">dates</span><span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s1">'userid'</span><span class="p">,</span> <span class="s1">'timestamp'</span><span class="p">])</span>
<span class="c1"># and reindex our `user_logins` counts by it</span>
<span class="n">user_logins</span> <span class="o">=</span> <span class="n">user_logins</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span>
<span class="c1"># userid timestamp</span>
<span class="c1"># 789 2019-07-18 1.0</span>
<span class="c1"># 2019-07-19 NaN</span>
<span class="c1"># 2019-07-20 2.0</span>
<span class="c1"># 2019-07-21 NaN</span>
<span class="c1"># 2019-07-22 1.0</span>
<span class="c1"># 2019-07-23 1.0</span>
<span class="c1"># 2019-07-24 NaN</span>
<span class="c1"># 2019-07-25 NaN</span>
<span class="c1"># 2019-07-26 1.0</span>
<span class="c1"># 2019-07-27 NaN</span>
</code></pre></div>
<p>This gives us a continuous daily time series for each user. You can see what this looks like for user 789 above.</p>
<p>An important thing to note is that <code>idx</code> will need to be on the same time scale as the current <code>DatetimeIndex</code> in <code>user_logins</code>. Because we aggregated at a daily level using <code>pd.Grouper(freq='D')</code>, the <code>MultiIndex</code> we are using to <code>reindex</code> should also be at a daily level.</p>
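<p>To see why the frequencies must match, consider a toy example (made-up data): if a count is keyed by a timestamp that isn't midnight-aligned, reindexing with a daily index finds no matching labels and silently drops it.</p>

```python
import pandas as pd

# A login count keyed at 08:00 -- not aligned to the daily (midnight) grid
s = pd.Series([1], index=pd.to_datetime(["2019-07-18 08:00"]), name="logins")

# A daily, midnight-aligned index
daily = pd.date_range("2019-07-18", "2019-07-19", freq="1D")

# No labels match, so every reindexed value is NaN and the count is lost
print(s.reindex(daily))
```

<p>Because we aggregated with <code>pd.Grouper(freq='D')</code> before reindexing, every timestamp is already floored to midnight and this problem doesn't arise.</p>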
<h3>Creating Features</h3>
<p>Now we're free to create our features. We can zero-fill days each user did not log in. We also need to convert our <code>user_logins</code> to a DataFrame, which allows us to create the new feature columns (e.g. <code>logins_yesterday</code>).</p>
<div class="highlight"><pre><span></span><code><span class="n">user_logins</span> <span class="o">=</span> <span class="n">user_logins</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">to_frame</span><span class="p">()</span>
<span class="n">user_logins</span><span class="p">[</span><span class="s1">'logins_yesterday'</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_logins</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)[</span><span class="s1">'logins'</span><span class="p">]</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># logins logins_yesterday</span>
<span class="c1"># userid timestamp</span>
<span class="c1"># 789 2019-07-18 1.0 NaN</span>
<span class="c1"># 2019-07-19 0.0 1.0</span>
<span class="c1"># 2019-07-20 2.0 0.0</span>
<span class="c1"># 2019-07-21 0.0 2.0</span>
<span class="c1"># 2019-07-22 1.0 0.0</span>
<span class="c1"># 123 2019-07-18 0.0 NaN</span>
<span class="c1"># 2019-07-19 1.0 0.0</span>
<span class="c1"># 2019-07-20 2.0 1.0</span>
<span class="c1"># 2019-07-21 0.0 2.0</span>
<span class="c1"># 2019-07-22 0.0 0.0</span>
<span class="c1"># 456 2019-07-18 0.0 NaN</span>
<span class="c1"># 2019-07-19 0.0 0.0</span>
<span class="c1"># 2019-07-20 1.0 0.0</span>
<span class="c1"># 2019-07-21 1.0 1.0</span>
<span class="c1"># 2019-07-22 0.0 1.0</span>
</code></pre></div>
<p>The <code>lifetime_logins</code> and <code>days_since_last_login</code> features need to be context dependent to avoid data leakage when training our model. Our features need to represent what would have been the correct values <em>at the time</em>. We can do this by rolling information forward with <code>shift</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">user_logins</span><span class="p">[</span><span class="s1">'lifetime_logins'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">user_logins</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)</span>
<span class="o">.</span><span class="n">logins</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
<span class="n">user_logins</span><span class="p">[</span><span class="s1">'days_since_last_login'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">user_logins</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)</span>
<span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'userid'</span><span class="p">,</span> <span class="s1">'logins'</span><span class="p">])</span>
<span class="o">.</span><span class="n">cumcount</span><span class="p">()</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="s1">'days_since_last_login'</span><span class="p">))</span>
<span class="c1"># logins logins_yesterday lifetime_logins days_since_last_login</span>
<span class="c1"># userid timestamp</span>
<span class="c1"># 789 2019-07-18 1.0 NaN NaN NaN</span>
<span class="c1"># 2019-07-19 0.0 1.0 1.0 0.0</span>
<span class="c1"># 2019-07-20 2.0 0.0 1.0 1.0</span>
<span class="c1"># 2019-07-21 0.0 2.0 3.0 0.0</span>
<span class="c1"># 2019-07-22 1.0 0.0 3.0 1.0</span>
<span class="c1"># 123 2019-07-18 0.0 NaN NaN NaN</span>
<span class="c1"># 2019-07-19 1.0 0.0 0.0 0.0</span>
<span class="c1"># 2019-07-20 2.0 1.0 1.0 0.0</span>
<span class="c1"># 2019-07-21 0.0 2.0 3.0 0.0</span>
<span class="c1"># 2019-07-22 0.0 0.0 3.0 1.0</span>
<span class="c1"># 456 2019-07-18 0.0 NaN NaN NaN</span>
<span class="c1"># 2019-07-19 0.0 0.0 0.0 0.0</span>
<span class="c1"># 2019-07-20 1.0 0.0 0.0 1.0</span>
<span class="c1"># 2019-07-21 1.0 1.0 1.0 0.0</span>
<span class="c1"># 2019-07-22 0.0 1.0 2.0 0.0</span>
</code></pre></div>
<p>This can also be extended to create rolling features: something like <code>logins_last_n_days</code> where <code>n = [7, 14, 21]</code>.</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">21</span><span class="p">]:</span>
<span class="n">col</span> <span class="o">=</span> <span class="s1">'logins_last_</span><span class="si">{}</span><span class="s1">_days'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="n">user_logins</span><span class="p">[</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">user_logins</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)</span>
<span class="o">.</span><span class="n">logins</span>
<span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">d</span><span class="p">:</span> <span class="n">d</span><span class="o">.</span><span class="n">rolling</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)))</span>
</code></pre></div>
<p>Hopefully you've found this post helpful. I know my future self will.</p>Lenny Dykstra, His Strike Zone, & Bayesian Stats2018-07-08T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2018-07-08:/2018/07/08/dykstra-strike-zone-bayesian-stats/<p>In 2015, former Major Leaguer Lenny Dykstra went on Colin Cowherd’s radio show and claimed that he used to hire private investigators to find dirt on umpires. The intention of doing so was to turn that dirt into a more favorable strike zone for himself. You can find the …</p><p>In 2015, former Major Leaguer Lenny Dykstra went on Colin Cowherd’s radio show and claimed that he used to hire private investigators to find dirt on umpires. The intention of doing so was to turn that dirt into a more favorable strike zone for himself. You can find the clip <a href="https://www.youtube.com/watch?v=fvhb_BjTDmk">here</a>.</p>
<blockquote>
<p>"It wasn't a coincidence I led the league in walks the next few years." - Lenny Dykstra</p>
</blockquote>
<p>Over at <a href="https://www.fangraphs.com/">Fangraphs</a>, Sheryl Ring wrote an <a href="https://www.fangraphs.com/blogs/did-lenny-dykstra-extort-umpires/">interesting article</a> exploring whether Dykstra's claims would amount to extortion in a legal sense. In order to do so, she needed to start by assuming that Dykstra's claims were truthful, though both Sheryl and Fangraphs commenters wondered whether there is any objective evidence that Dykstra benefitted.</p>
<p>That's the question I'd like to explore in this post. Did Lenny Dykstra benefit from a more favorable strike zone? What do his numbers say?</p>
<p>Since <a href="https://en.wikipedia.org/wiki/PITCHf/x">PITCHf/x</a> wasn't around when Dykstra played, we can't look directly at balls and strikes called against him. However, we can use his career numbers and some <a href="https://en.wikipedia.org/wiki/Bayesian_statistics">Bayesian statistics</a> to generate expected walk totals.</p>
<h3>Analysis</h3>
<p>On the show, Dykstra's statement about "leading the league in walks the next few years" gives us a clue as to when this might have started - 1993 - the only year he led the league in walks.</p>
<p>Up until 1993, Dykstra had walked 384 times in 3,667 plate appearances - good for a walk rate of 10.5%. In 1993 and 1994 though, his walk rates climbed to 16.7% and 17.6%, respectively. How likely were those numbers based on his career up until those points?</p>
<p>It's safe for us to assume that his "true" walk rate at that point was somewhere around 10.5% - this was his career BB% and we had a lot of data in support of it (3,667 PAs).</p>
<p>We can model this assumption about his "true" ability to draw a walk as a <a href="https://en.wikipedia.org/wiki/Beta_distribution">beta distribution</a> using his pre-1993 numbers as the parameters of our model. Note that a beta distribution is parameterized by α, which represents the number of successes of an event, and β, which represents the number of failures for the same event. </p>
<p>\begin{equation}
Beta(α, β)
\end{equation}
\begin{equation}
Beta(BB, PA - BB)
\end{equation}
\begin{equation}
Beta(384, 3283)
\end{equation}</p>
<p>In this case, α is Dykstra's total walks prior to 1993 and β is the number of times he did not draw a walk during that same period.</p>
<p><img alt="dykstra-beta-prior" src="/images/dykstra-beta-prior.png"></p>
<p>Using this beta, we can simulate the range of values we'd expect his 1993 walk total to fall within, based on his number of plate appearances from that season. This gives us an idea of both his expected 1993 BB% and his total walks.</p>
<p><img alt="dykstra-walk-sims-93" src="/images/dykstra-walk-sims-93.png"></p>
<p>On the left, we've simulated the range we'd have expected his 1993 BB% to fall within, based on his career numbers up until that point. Using this range, we can then obtain an expected distribution for his total walks, based on his total plate appearances in 1993 (shown right).</p>
<p>You'll note that the red lines indicate Dykstra's 1993 numbers, which fall well outside of our expected ranges, indicating that in none of our simulations did Dykstra match his 1993 numbers.</p>
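<p>A rough NumPy sketch of this simulation (not the code from the notebook linked at the end of the post): draw plausible walk rates from the Beta(384, 3283) prior, then draw a 1993 walk total for each from a binomial over his 773 plate appearances that season.</p>

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
n_sims = 100_000

# Draw plausible "true" walk rates from the Beta(BB, PA - BB) prior
# built from his pre-1993 numbers: 384 walks in 3,667 PAs
bb_rate = rng.beta(384, 3283, size=n_sims)

# For each simulated walk rate, draw a 1993 walk total over his 773 PAs
sim_walks = rng.binomial(773, bb_rate)

print(round(bb_rate.mean(), 3))   # close to his 10.5% career rate
print((sim_walks >= 129).mean())  # share of sims reaching his actual 129 walks
```

<p>With 100,000 draws, essentially none reach his actual 129 walks -- consistent with the charts above.</p>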
<p>Taking this approach a step further, we can update our beta distribution to include the 1993 season, allowing us to understand what we'd have expected in his 1994 season.</p>
<p>\begin{equation}
Beta(384 + 129, 3283 + (773 - 129))
\end{equation}
\begin{equation}
Beta(513, 3927)
\end{equation}</p>
<p><img alt="dykstra-walk-sims-94" src="/images/dykstra-walk-sims-94.png"></p>
<p>You'll note that the chart on the left includes our previous beta distribution in light blue, which is based on his career up until 1993. When incorporating his surprising 1993 walk numbers, our expected BB% shifts to the right, resulting in the purple distribution shown -- 1993 has given us new evidence to suggest Dykstra has a better eye.</p>
<p>Still, updating our model to include 1993 does not result in numbers we would have expected for 1994. In only 0.02% of our simulations did Dykstra achieve the 68 walks he produced in 1994.</p>
<p>Said differently, he probably did have some dirt on umpires, resulting in a more favorable strike zone.</p>
<p>It's reasonable to ask whether or not the league-wide walk rate changed around the 1993 and 1994 seasons. <a href="https://www.baseball-reference.com/pi/shareit/jlkOc">It didn't</a>, ultimately staying relatively constant throughout Dykstra's career.</p>
<p><img alt="mlb-bb-rate" src="/images/mlb-bb-rate.png"></p>
<p>While it’s possible that his eye improved mightily between the 1992 and 1993 seasons, it's highly unlikely. As the analysis above shows, his walk numbers fall well outside of what we would have expected.</p>
<p>You can find the code and data for this analysis <a href="https://github.com/gjreda/notebooks/tree/master/dykstra-walks">here</a>.</p>Hiring Data Scientists2018-02-04T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2018-02-04:/2018/02/04/hiring-data-scientists/<p>Chicago's a big city that feels small -- everyone seems only a degree or two
away from one another. This feels especially true within Chicago's tech and data
science communities.</p>
<p>As a result, I occasionally get asked about hiring data scientists.
Specifically, how do you vet, hire, and evaluate a data …</p><p>Chicago's a big city that feels small -- everyone seems only a degree or two
away from one another. This feels especially true within Chicago's tech and data
science communities.</p>
<p>As a result, I occasionally get asked about hiring data scientists.
Specifically, how do you vet, hire, and evaluate a data scientist if you don't
have existing experience (either personally or within your company)?</p>
<p>While I feel this is a hard problem -- and one I never have a great answer
to -- I figured sharing how I think about hiring might prove helpful to others.</p>
<h3>How I think about it</h3>
<p>My general belief is that so long as the candidate clears some
programming bar and some quantitative bar (neither of which should be too high),
the most important things for success are <em><strong>curiosity and skepticism</strong></em>.</p>
<p>The programming and quantitative bars should be based <em><strong>solely on examples of
work they'd do on the job</strong></em>.</p>
<p>I do this via a small take-home assignment that should take candidates a couple
of hours. The assignment asks variations of questions they are likely to encounter
in the role. A small dataset is provided, which the questions reference.</p>
<p>In my opinion, the programming and quantitative bars mostly come down to:</p>
<ul>
<li>
<p>Can they write code to do what they need to? Getting data out of a database, analysis at scale, automating some regular analysis, etc.</p>
</li>
<li>
<p>Do they think from a quantitative perspective? Do they think probabilistically?</p>
</li>
<li>
<p>Can they build a basic model and evaluate it? Do they understand statistics enough such that their analysis will not be <em>harmful</em>?</p>
<ul>
<li>My belief is that <em>no data</em> is preferable to <em>bad data</em>. With no data, you're forced to seek alternative forms of information (e.g. talking to users). With bad data, you risk drawing improper conclusions that lead you astray -- false confidence.</li>
</ul>
</li>
</ul>
<p>Assuming the candidate clears these bars, I believe curiosity and skepticism are the two most
important attributes for success.</p>
<p>If they are curious, they will continue to fill gaps in their knowledge, learn
new approaches to problems, and seek to continuously learn the business/product
side -- and how their work can add value to it. A data scientist that has a
tendency to go down rabbit holes can be a good thing if properly directed.</p>
<p>If they are skeptical, they'll refine everyone's
thought process by questioning things in a healthy way. They'll innately seek to
prove things believed to be true and they'll seek to answer questions
that arise -- be it via their own curiosity or others'. This skepticism also
acts as a check -- they'll seek alternative ways to prove and
test their own work, cautiously fearful of creating bad data that can lead to
improper action.</p>
<h3>My interview process</h3>
<ol>
<li>
<p>Phone screen with recruiter (30 mins)</p>
</li>
<li>
<p>Phone screen with me (30 mins)</p>
</li>
<li>
<p>Take-home assignment (~2 hours)</p>
</li>
<li>
<p>In-office interview (3 hours)</p>
</li>
</ol>
<p>The phone screens are really about feeling the person out, learning about what they're
looking for next, and digging into specifics about their past experience/work.</p>
<p>The take-home assignment acts as the "programming and statistical bar" with
respect to the job -- brief examples of questions or problems they might work on
in the role. We ask candidates to provide any code they wrote or charts
they created to answer all questions, even if it's exploratory in nature. We
also ask that they be prepared to discuss their work during interviews.</p>
<p>My in-office interviews are three one-hour interviews. Usually, two one-hour
interviews with myself and my team, and a joint interview with a
product manager and platform engineer. Time is <em>always</em> left for the candidate
to ask the interviewers questions.</p>
<p>I should note that depending on how tenured the candidate is and their existing
body of work, this process might change slightly. My approach tends to change when
candidates can point me to prior work (GitHub, side projects, blog posts).</p>
<p>If you’re looking for more details on an ideal hiring process for data scientists, <a href="http://treycausey.com/hiring_data_scientists.html">Trey Causey’s advice is excellent</a> and has influenced much of my thinking. Similarly, <a href="http://qethanm.cc/">Q McCallum</a>’s series on <a href="http://qethanm.cc/2018/01/23/common-mistakes-in-data-science-hiring-part-1/">common data science hiring mistakes</a> offers practical advice on determining whether you actually need a data scientist, and on how to hire and retain one. Finally, <a href="https://mpopov.com/">Mikhail Popov</a>'s piece on <a href="https://blog.wikimedia.org/2017/02/02/hiring-data-scientist/">Wikipedia's approach to data science hiring</a> is worth your time.</p>My Experience as a Freelance Data Scientist2017-01-07T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2017-01-07:/2017/01/07/freelance-data-science-experience/<p>Every so often, data scientists who are thinking about going off on their own will email me with questions about my year of freelancing (2015). In my most recent response, I was a little more detailed than usual, so I figured it'd make sense as a blog post too.</p>
<p>If …</p><p>Every so often, data scientists who are thinking about going off on their own will email me with questions about my year of freelancing (2015). In my most recent response, I was a little more detailed than usual, so I figured it'd make sense as a blog post too.</p>
<p>If my response comes across as negative, that's certainly not the intention -- being straight-forward about my experience is.</p>
<p>I learned a lot, it just wasn't for me. Working by yourself on short(ish)-term things can get old.</p>
<h3>How was your year of freelancing?</h3>
<p>Generally, it was good and I learned a lot.</p>
<p>My reason for setting out on my own was really about scratching an itch I've always had (and I suspect many of us have) - can I strike it out on my own?</p>
<p>The freedom was really nice and if you're able to find the work, you can likely work less than you would full-time while making more money. That said, it's certainly not for everyone.</p>
<h3>Why'd you stop?</h3>
<p>I didn't find it very rewarding in a non-monetary sense.</p>
<p>Freelancing/consulting doesn't really give you the luxury of thinking long-term about something like a product company does. Typically, a client hires you to do something, you do it, and then you're gone.</p>
<p>Thinking long-term and deeply through all the ways data / data science can be impactful upon a business or product is something I really enjoy -- "ohh, we can build a recommendations engine with this ... the search results we're displaying to the user here aren't great -- we can use this data to improve them, etc." I definitely enjoy more of a slant towards data scientist + product manager than I do data scientist + software engineer.</p>
<p>As an individual freelancer, landing this sort of "feature" work is very hard because:</p>
<ol>
<li>
<p>You're one person and typically these are not small projects. You only have so much capacity (24 hours/day), so it'd take more time for you to do it than it would a team.</p>
</li>
<li>
<p>Companies often want these things to be a "core competency" in that they do not want someone to build Big Important Thing and disappear. They are risk-averse.</p>
</li>
<li>
<p>You didn't strike out on your own to build Big Important Thing and then really just maintain that thing for one client in perpetuity (which would likely happen if the company allowed you to build it) -- you started freelancing because you (presumably) wanted some variety.</p>
</li>
</ol>
<p>Companies often have a Thing In Mind they want you to do -- or, they want to "buy" your time for some period (e.g. 80 hours over the next three months at $/hour -- a retainer).</p>
<p>When they have a Thing In Mind, it is much more likely to be dirty work that they do not feel is the best use of their existing team's time than it is to be something they need you, the consultant, for.</p>
<p>When on retainer, I found the experience to be similar, except it can be a bunch of ad-hoc tasks that come up ("can you pull this data for me") that you didn't know would be the case when you signed the contract.</p>
<p>This is all a long way of saying, in my experience, <em>a non-trivial portion of you has to be ok with being a mercenary</em> -- do the thing you're being paid to do and not worry about the rest.</p>
<p>I struggled with that internally and thus did not find the work very stimulating -- I like buying into something, giving it my all, and thinking about the various directions it can be taken.</p>
<h3>Tips or lessons learned?</h3>
<p>So many, but here are a few:</p>
<h4><a href="https://en.wikipedia.org/wiki/KISS_principle">Keep It Simple, Stupid</a></h4>
<p>This isn't specific to freelancing per se, but it was something freelancing emphasized.</p>
<p>I think data scientists (generally) have a bad habit of latching onto specific words a stakeholder says, while ignoring the other words in the request.<sup>1</sup> For example:</p>
<blockquote>
<p>"What's the optimal number of leads that a rep should get? We want to get directionally better."</p>
</blockquote>
<p>As data scientists, we hear "optimal number" and we start thinking about doing complex math and building models. We end up ignoring the most important part: "We want to get directionally better" -- our stakeholder is telling us "we don't know much about this right now -- help!"</p>
<p>We need to start simple -- maybe some basic exploratory work + charts -- and surface that back to our stakeholder, giving them the opportunity to say "cool, this is all I needed" or "this is good, but keep going." We need to allow our stakeholder to choose incremental progress and we should not assume they need the more complex (and time-intensive) solution.</p>
<h4>Try to get systems access before the project begins</h4>
<p>This probably isn't a high priority for the systems team at your client. Thus, if the process of getting you access to things (databases, vpn, etc.) starts the same day the project is set to start, the first day or two will wind up being a waste of time.</p>
<h4>Productized consulting</h4>
<p>Nail down exactly what you do or create. Have a fixed price for doing it. Don't deviate from that. Turn your "consulting" into a product.</p>
<p>For example, you'll build a churn model for $XX. My brother's company is a <a href="https://ethercycle.com/pricing/">good example of this</a>.</p>
<p>Try not to sell hours. Which leads me to ...</p>
<h4>Don't bill hourly</h4>
<p>Tracking hours sucks and also limits your margin. Try to sell daily or weekly rates (or productized consulting).</p>
<p>Better yet, if you have a well defined scope (ideal, but sometimes hard) and you know the amount of time the project will take you, then set a price on the project. The risk here is the project taking longer than you anticipated and now you're really just doing free work.</p>
<p>I was fearful of underestimating the amount of time something would take me, so I billed hourly. It wasn't fun though. Additionally, all of my clients ended up being retainers.</p>
<h3>My biggest fears involve health insurance. Do you have any good resources?</h3>
<p>Not really. I just used healthcare.gov and went with a BCBS PPO because I'm pretty risk-averse.</p>
<h3>In the data science pipeline, where did your services fall (e.g. Databases > Data Cleaning > Business Intelligence > Advanced Analytics)? Did you do everything?</h3>
<p>This was something I should have been better about -- I never really established what my services were. My thesis in going freelance was more about feeling there was a gap in the data consulting market.</p>
<p>My belief was that there was (in 2015 ... and probably still is) a population of companies trying to figure out how to utilize their data, who are not interested in bringing on a consulting firm ($$$), and don't necessarily know if they need a data scientist full-time yet. I felt uniquely positioned to fill that gap due to being GrubHub's first data hire and also having prior consulting experience (PwC, Datascope).</p>
<p>For the reasons mentioned in the second question, I'd classify most of the work I ended up doing as Business Intelligence, along with some product/marketing analytics work -- instrumentation, how to think about using the data, etc -- but never in a "building data-driven features/products" sense. No machine learning or similar.</p>
<hr class="small" id="footnotes">
<p>
1. I have a longer post about this in the works.</p>[Talk] Data-Informed vs Data-Driven2016-11-20T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2016-11-20:/2016/11/20/data-informed-pydata-video/<p>A while back (July 2015), I was fortunate to speak at PyData in Seattle about the pitfalls of <em>overreliance</em> on data. You can find the slides <a href="https://github.com/gjreda/pydata2015sea/blob/4a846a7d069601cc5f886e53863a17d7fd68f2a8/data-informed-vs-data-driven-with-notes.pdf">here</a>.</p>
<p>My talk centered around my belief that the term "data-driven," when taken at its face value, should not be something we strive for …</p><p>A while back (July 2015), I was fortunate to speak at PyData in Seattle about the pitfalls of <em>overreliance</em> on data. You can find the slides <a href="https://github.com/gjreda/pydata2015sea/blob/4a846a7d069601cc5f886e53863a17d7fd68f2a8/data-informed-vs-data-driven-with-notes.pdf">here</a>.</p>
<p>My talk centered around my belief that the term "data-driven," when taken at its face value, should not be something we strive for. Instead, we should seek to be "data-informed." Pedantic, I know, but for those that do not work within the field, I think the distinction is important.</p>
<p>Here's its abstract (pardon my snark):</p>
<blockquote>
<p>Companies can't stop gushing about how "data-driven" they are - how they're using "big data" and "data science" to synergize and streamline all the things. But being driven by data alone is a flawed approach. Instead, companies should seek to be "data-informed" - interweaving designers, UXers, and data scientists so that each side is able to perfectly complement one another.</p>
<p>This talk will discuss the importance of allowing data and user research to complement one another, in addition to the pitfalls of being driven by data alone (for instance, the cons of A/B testing).</p>
</blockquote>
<p>While the actual talk isn't one of my best - I sound like I'm reading from cards (I kind of was) - I'm still a big believer in the overall message.</p>
<p>We shouldn't be surprised that being data-informed is ultimately a better approach. Simply, we're just adding more information - quantitative <em>and</em> qualitative - to our existing dataset and weighing that information appropriately.</p>
<p>The best decisions make use of all relevant information, not a limited set, much like the best algorithms are those developed with the best data and features.</p>
<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/yHo3B3BbppM" frameborder="0" allowfullscreen></iframe>
</center>Asynchronous Scraping with Python2016-10-16T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2016-10-16:/2016/10/16/asynchronous-scraping-with-python/<p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python …</a></li></ol><p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python</a>, showing how to use multithreading to speed things up.</li>
<li><a href="http://www.gregreda.com/2020/11/17/scraping-pages-behind-login-forms/">Scraping Pages Behind Login Forms</a>, which shows how to log into sites using Python.</li>
</ol>
<hr>
<p>Previously, I've written about the <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">basics of scraping</a> and how you can <a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">find API calls</a> in order to fetch data that isn't easily downloadable.</p>
<p>For simplicity, the code in these posts has always been synchronous -- given a list of URLs, we process one, then the next, then the next, and so on. While this makes for code that's straight-forward, it can also be slow.</p>
<p>This doesn't have to be the case though. Scraping is often an example of code that is <a href="https://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a>. With some slight changes, our tasks can be done asynchronously, allowing us to process more than one URL at a time.</p>
<p>In version 3.2, Python introduced the <a href="https://docs.python.org/3/library/concurrent.futures.html"><code>concurrent.futures</code></a> module, which is a joy to use for parallelizing tasks like scraping. The rest of this post will show how we can use the module to make our previously synchronous code asynchronous.</p>
<h3>Parallelizing your tasks</h3>
<p>Imagine we have a list of several thousand URLs. In previous posts, we've always written something that looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">csv</span> <span class="kn">import</span> <span class="n">DictWriter</span>
<span class="n">URLS</span> <span class="o">=</span> <span class="p">[</span> <span class="o">...</span> <span class="p">]</span> <span class="c1"># thousands of urls for pages we'd like to parse</span>
<span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="c1"># our logic for parsing the page</span>
    <span class="k">return</span> <span class="n">data</span> <span class="c1"># probably a dict</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">URLS</span><span class="p">:</span> <span class="c1"># go through each url one by one</span>
    <span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">parse</span><span class="p">(</span><span class="n">url</span><span class="p">))</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'results.csv'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">writer</span> <span class="o">=</span> <span class="n">DictWriter</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">keys</span><span class="p">())</span>
    <span class="n">writer</span><span class="o">.</span><span class="n">writeheader</span><span class="p">()</span>
    <span class="n">writer</span><span class="o">.</span><span class="n">writerows</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
</code></pre></div>
<p>The above is an example of synchronous code -- we're looping through a list of URLs, processing one at a time. If the list of URLs is relatively small or we're not concerned about execution time, there's little reason to <a href="https://en.wikipedia.org/wiki/Task_parallelism">parallelize</a> these tasks -- we might as well keep things simple and wait it out.</p>
<p>However, sometimes we have a huge list of URLs -- at least several thousand -- and we can't wait hours for them to finish.</p>
<p>With <code>concurrent.futures</code>, we can work on multiple URLs at once by adding a <code>ProcessPoolExecutor</code> and making a slight change to how we fetch our results.</p>
<p>But first, a reminder: <em>if you're scraping, don't be a jerk</em>. Space out your requests appropriately and don't hammer the site (i.e. use <code>time.sleep</code> to wait briefly between each request and set <code>max_workers</code> to a small number). Being a jerk runs the risk of getting your IP address blocked -- good luck getting that data now.</p>
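<p>To make the politeness concrete, a rate-limited <code>parse</code> might look like the following sketch (my own illustration, not from the original post; the one-second delay and the <code>REQUEST_DELAY</code> name are assumed values you'd tune per site):</p>

```python
import time

REQUEST_DELAY = 1.0  # assumed: seconds to wait before each request; tune per site

def parse(url):
    time.sleep(REQUEST_DELAY)  # be polite: space out requests to the site
    # ... real page-fetching and parsing logic would go here ...
    return {'url': url}
```

Because the sleep lives inside <code>parse</code>, each worker in the pool paces its own requests, so a small <code>max_workers</code> bounds the total request rate.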
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">concurrent.futures</span> <span class="kn">import</span> <span class="n">ProcessPoolExecutor</span>
<span class="kn">import</span> <span class="nn">concurrent.futures</span>
<span class="n">URLS</span> <span class="o">=</span> <span class="p">[</span> <span class="o">...</span> <span class="p">]</span>
<span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="c1"># our logic for parsing the page</span>
    <span class="k">return</span> <span class="n">data</span> <span class="c1"># still probably a dict</span>
<span class="k">with</span> <span class="n">ProcessPoolExecutor</span><span class="p">(</span><span class="n">max_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="k">as</span> <span class="n">executor</span><span class="p">:</span>
    <span class="n">future_results</span> <span class="o">=</span> <span class="p">{</span><span class="n">executor</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">parse</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span> <span class="n">url</span> <span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">URLS</span><span class="p">}</span>
    <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">future</span> <span class="ow">in</span> <span class="n">concurrent</span><span class="o">.</span><span class="n">futures</span><span class="o">.</span><span class="n">as_completed</span><span class="p">(</span><span class="n">future_results</span><span class="p">):</span>
        <span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">future</span><span class="o">.</span><span class="n">result</span><span class="p">())</span>
</code></pre></div>
<p>In the above code, we're submitting tasks to the executor -- four workers -- each of which will execute the <code>parse</code> function against a URL. This execution does not happen immediately. For each submission, the executor returns an instance of a <code>Future</code>, which tells us that our task will be executed at some point in the ... well, future. The <code>as_completed</code> function watches our <code>future_results</code> for completion, upon which we'll be able to fetch each result via the <code>result</code> method.</p>
<p>My favorite part about this module is the clarity of its API -- tasks are <em>submitted</em> to an <em>executor</em>, which is made up of one or more workers, each of which is churning through our tasks. Because our tasks are executed asynchronously, we are not waiting for a given task's completion before submitting another -- we are doing so at-will, with completion happening in the <em>future</em>. Once completed, we can get the task's <em>result</em>.</p>
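<p>To see the submit/<code>as_completed</code> pattern end to end without hammering a real site, here's a self-contained toy version (my own sketch: it swaps in a <code>ThreadPoolExecutor</code>, which is typically sufficient for I/O-bound work like scraping, and a fake <code>parse</code> that just sleeps to simulate network latency):</p>

```python
import concurrent.futures
import time

def parse(url):
    # Stand-in for real page-fetching logic: sleep briefly to simulate I/O.
    time.sleep(0.1)
    return {'url': url, 'length': len(url)}

URLS = ['http://example.com/page/%d' % i for i in range(8)]

# Threads (rather than processes) are usually enough when tasks spend their
# time waiting on the network, and they avoid process-spawning overhead.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    future_results = {executor.submit(parse, url): url for url in URLS}
    results = [future.result()
               for future in concurrent.futures.as_completed(future_results)]

print(len(results))  # 8 -- one result per URL, in completion order
```

Note that <code>as_completed</code> yields futures as they finish, so <code>results</code> arrives in completion order, not submission order.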
<h3>Closing up</h3>
<p>With a few changes to your code and some <code>concurrent.futures</code> love, you no longer have to fetch those basketball stats one page at a time.</p>
<p>But don't be a jerk either.</p>Visualizing the 2015 NL Cy Young Race2015-11-19T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2015-11-19:/2015/11/19/nl-cyyoung-viz-2015/<p>This year's National League Cy Young race was pretty much a toss-up, with each of
<a href="http://www.baseball-reference.com/players/a/arrieja01.shtml">Jake Arrieta</a>, <a href="http://www.baseball-reference.com/players/g/greinza01.shtml">Zack Greinke</a>, and <a href="http://www.baseball-reference.com/players/k/kershcl01.shtml">Clayton Kershaw</a> putting up numbers
we haven't seen in a decade or more.</p>
<p>By now we know that Arrieta wins the award, but being the Cubs homer I am, I …</p><p>This year's National League Cy Young race was pretty much a toss-up, with each of
<a href="http://www.baseball-reference.com/players/a/arrieja01.shtml">Jake Arrieta</a>, <a href="http://www.baseball-reference.com/players/g/greinza01.shtml">Zack Greinke</a>, and <a href="http://www.baseball-reference.com/players/k/kershcl01.shtml">Clayton Kershaw</a> putting up numbers
we haven't seen in a decade or more.</p>
<p>By now we know that Arrieta wins the award, but being the Cubs homer I am, I
started digging into the data a few weeks ago in an attempt to show that Arrieta
<em>should</em> win the award. However, as is often the case when walking into an
analysis with preconceived notions of its findings, I was left unable to make my
case with a straight face.</p>
<p>Unable to confidently make the case that <em>any</em> of the contenders were more
deserving of the award than their peers, I decided to turn my work into an article highlighting the historic years each of them had. Unfortunately, the article
never wound up published, but you can still read it <a href="https://github.com/gjreda/cy-young-NL-2015/blob/master/README.md">here</a>,
though it's obviously outdated now.</p>
<p>Since I tend to use this site more for technical posts, it seemed like a good
idea to walk through a couple pieces of my work -- if you're interested in
everything, <a href="https://github.com/gjreda/cy-young-NL-2015">I've pushed it up to GitHub</a>.</p>
<h2>Preprocessing</h2>
<p>In order to show the stats I cared about and their progression throughout each
pitcher's season, I needed to do some preprocessing of the data. Specifically,
I needed to calculate a variety of statistics that are not included in the
game logs from <a href="http://www.baseball-reference.com">Baseball Reference</a>.</p>
<p>After loading the dataset and transforming the innings pitched (IP) field to a
numeric value, you'll see a fairly large section of code
<a href="https://github.com/gjreda/cy-young-NL-2015/blob/master/cy-young.ipynb">in the notebook</a>
that looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Partial innings are stored as 7.1 or 7.2 in the Baseball Reference data.</span>
<span class="c1"># Convert it to properly represent 1/3 or 2/3 of an inning</span>
<span class="c1"># (necessary for various rate calculations).</span>
<span class="k">def</span> <span class="nf">to_innings</span><span class="p">(</span><span class="n">IP</span><span class="p">):</span>
    <span class="n">full</span><span class="p">,</span> <span class="n">partial</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="nb">float</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">IP</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'.'</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">full</span> <span class="o">+</span> <span class="p">(</span><span class="n">partial</span> <span class="o">/</span> <span class="mf">3.</span><span class="p">)</span>
<span class="c1"># example: 7.1 --> 7.3333</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'IP'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">IP</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">to_innings</span><span class="p">)</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">IP</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'IPGame'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">rollingIP</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">Rk</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingER'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">ER</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingERA'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingER'</span><span class="p">]</span> <span class="o">/</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span> <span class="o">/</span> <span class="mf">9.</span><span class="p">)</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'strikeoutsPerIP'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'K/9'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span> <span class="o">/</span> <span class="mf">9.</span><span class="p">)</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'strikeoutsPerBF'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'hitsPerIP'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'hitsPerAB'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingWHIP'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span> <span class="o">/</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span>
<span class="c1"># opponents against</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'1B'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">H</span> <span class="o">-</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'2B'</span><span class="p">]</span> <span class="o">+</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'3B'</span><span class="p">]</span> <span class="o">+</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'HR'</span><span class="p">])</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'AVG'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'OBP'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">HBP</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span> \
<span class="o">/</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span>
<span class="n">arrieta</span><span class="o">.</span><span class="n">HBP</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'SLG'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'1B'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'2B'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span>
<span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'3B'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">*</span> <span class="mi">3</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'HR'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">*</span> <span class="mi">4</span><span class="p">))</span> \
<span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'OPS'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">OBP</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SLG</span>
<span class="c1"># rates</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'BABIP'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">HR</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span> \
<span class="o">/</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span>
<span class="n">arrieta</span><span class="o">.</span><span class="n">HR</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'HR%'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">HR</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'XBH%'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'2B'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'3B'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'HR'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'K%'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'SO'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'IP%'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span>
<span class="n">arrieta</span><span class="o">.</span><span class="n">HR</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span> \
<span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'GB%'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'GB'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> \
<span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span>
<span class="n">arrieta</span><span class="o">.</span><span class="n">HR</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span>
</code></pre></div>
<p>Here we're adding new, cumulative statistics to each pitcher's DataFrame (e.g.
we can easily say what Arrieta's ERA was after his fourth start, or what his
batting average on balls in play (BABIP) was in the second half of the season).</p>
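<p>The cumulative-rate pattern above generalizes: any per-start counting stats
can be turned into season-to-date rates with <code>cumsum</code>. A minimal
sketch with invented numbers (the column names mirror the ones above, but the
data is made up):</p>

```python
import pandas as pd

# hypothetical per-start lines: strikeouts (SO), hits (H), batters faced (BF)
starts = pd.DataFrame({
    'SO': [7, 10, 5],
    'H':  [6, 3, 8],
    'BF': [27, 30, 29],
})

# season-to-date strikeout rate after each start
starts['strikeoutsPerBF'] = starts.SO.cumsum() / starts.BF.cumsum()

# after the second start this is (7 + 10) / (27 + 30)
print(starts['strikeoutsPerBF'].round(3).tolist())  # → [0.259, 0.298, 0.256]
```

<p>Because each row divides one running total by another, every row is a
"season so far" rate rather than a single-game rate.</p>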
<h2>Visualizing their seasons</h2>
<p>Now that we have various statistics on a rolling basis, we need a way to
compare their performances throughout the season. Thankfully, this is a perfect
use case for <a href="https://en.wikipedia.org/wiki/Small_multiple">small multiples</a>,
which is a technique meant specifically for comparison.</p>
<p>To do so, we can create a dictionary where each pitcher is a key, and the value
is another dictionary containing that pitcher's DataFrame, as well as a color
and line style which we'll use in our plot. Then, we'll create a grid of empty
subplots, which will be populated by looping through our <code>PITCHERS</code> dictionary.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">OrderedDict</span>
<span class="n">PITCHERS</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'Arrieta'</span><span class="p">:</span> <span class="p">{</span><span class="s1">'df'</span><span class="p">:</span> <span class="n">arrieta</span><span class="p">,</span> <span class="s1">'color'</span><span class="p">:</span> <span class="n">ja</span><span class="p">,</span> <span class="s1">'style'</span><span class="p">:</span> <span class="s1">'-'</span><span class="p">},</span>
<span class="s1">'Greinke'</span><span class="p">:</span> <span class="p">{</span><span class="s1">'df'</span><span class="p">:</span> <span class="n">greinke</span><span class="p">,</span> <span class="s1">'color'</span><span class="p">:</span> <span class="n">zg</span><span class="p">,</span> <span class="s1">'style'</span><span class="p">:</span> <span class="s1">'-'</span><span class="p">},</span>
<span class="s1">'Kershaw'</span><span class="p">:</span> <span class="p">{</span><span class="s1">'df'</span><span class="p">:</span> <span class="n">kershaw</span><span class="p">,</span> <span class="s1">'color'</span><span class="p">:</span> <span class="n">kc</span><span class="p">,</span> <span class="s1">'style'</span><span class="p">:</span> <span class="s1">'--'</span><span class="p">}}</span>
<span class="n">PITCHERS</span> <span class="o">=</span> <span class="n">OrderedDict</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">PITCHERS</span><span class="o">.</span><span class="n">items</span><span class="p">()))</span>
<span class="n">stats</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'IP%'</span><span class="p">,</span> <span class="s1">'BABIP'</span><span class="p">,</span> <span class="s1">'XBH%'</span><span class="p">,</span> <span class="s1">'HR%'</span><span class="p">,</span> <span class="s1">'K%'</span><span class="p">]</span>
<span class="n">row_titles</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'</span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">row_title</span><span class="p">)</span> <span class="k">for</span> <span class="n">row_title</span> <span class="ow">in</span> <span class="n">PITCHERS</span><span class="o">.</span><span class="n">keys</span><span class="p">()]</span>
<span class="n">col_titles</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'</span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">col_title</span><span class="p">)</span> <span class="k">for</span> <span class="n">col_title</span> <span class="ow">in</span> <span class="n">stats</span><span class="p">]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span> <span class="n">nrows</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">PITCHERS</span><span class="p">),</span>
<span class="n">ncols</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">stats</span><span class="p">),</span> <span class="n">sharex</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">tight_layout</span><span class="p">(</span><span class="n">pad</span><span class="o">=</span><span class="mf">1.2</span><span class="p">,</span> <span class="n">h_pad</span><span class="o">=</span><span class="mf">1.5</span><span class="p">)</span> <span class="c1"># adjust layout spacing</span>
<span class="c1"># label each column with stat name</span>
<span class="k">for</span> <span class="n">ax</span><span class="p">,</span> <span class="n">col_title</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">col_titles</span><span class="p">):</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col_title</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="c1"># label each row with player name</span>
<span class="k">for</span> <span class="n">ax</span><span class="p">,</span> <span class="n">row_title</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">axes</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">row_titles</span><span class="p">):</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="n">row_title</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">labelpad</span><span class="o">=</span><span class="mi">40</span><span class="p">)</span>
<span class="c1"># create grid - one chart for each pitcher + stat combination</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">pitcher</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">PITCHERS</span><span class="o">.</span><span class="n">items</span><span class="p">()):</span>
<span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">stat</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">stats</span><span class="p">):</span>
<span class="n">title</span> <span class="o">=</span> <span class="s1">'</span><span class="si">{}</span><span class="s1">: </span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">stat</span><span class="p">)</span>
<span class="n">pitcher</span><span class="p">[</span><span class="s1">'df'</span><span class="p">][</span><span class="n">stat</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="n">pitcher</span><span class="p">[</span><span class="s1">'color'</span><span class="p">],</span>
<span class="n">linestyle</span><span class="o">=</span><span class="n">pitcher</span><span class="p">[</span><span class="s1">'style'</span><span class="p">])</span>
<span class="c1"># for ease of comparison, let's plot the other pitchers on the same chart</span>
<span class="c1"># but let's make them a light grey with the appropriate linestyle</span>
<span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">PITCHERS</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
<span class="k">if</span> <span class="n">k</span> <span class="o">!=</span> <span class="n">name</span><span class="p">:</span>
<span class="n">v</span><span class="p">[</span><span class="s1">'df'</span><span class="p">][</span><span class="n">stat</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s1">'grey'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span>
<span class="n">linestyle</span><span class="o">=</span><span class="n">v</span><span class="p">[</span><span class="s1">'style'</span><span class="p">])</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s1">'both'</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="s1">'major'</span><span class="p">,</span> <span class="n">labelsize</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">allstarbreak</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">'k'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">':'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">yaxis</span><span class="o">.</span><span class="n">set_major_locator</span><span class="p">(</span><span class="n">MaxNLocator</span><span class="p">(</span><span class="n">nbins</span><span class="o">=</span><span class="mi">4</span><span class="p">))</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.</span><span class="p">)</span> <span class="c1"># IP%</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">.500</span><span class="p">)</span> <span class="c1"># BABIP</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">.16</span><span class="p">)</span> <span class="c1"># XBH%</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">.04</span><span class="p">)</span> <span class="c1"># HR%</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">4</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">.36</span><span class="p">)</span> <span class="c1"># K%</span>
<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">'images/rates-comparison.png'</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s1">'tight'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">120</span><span class="p">)</span>
</code></pre></div>
<p>The resulting output is a 3 x 5 grid of charts, where each row corresponds to a
pitcher, and each column is a statistic.</p>
<p><img alt="2015 NL Cy Young Race" src="https://raw.githubusercontent.com/gjreda/cy-young-NL-2015/master/images/rates-comparison.png"></p>
<p>Again, this technique is meant for comparing different dimensions (people,
cities, departments, etc.) against one another.</p>
<p>For instance, looking down the left-most column, we can see that batters
put the ball in play (IP%) about equally often against Arrieta and Greinke,
but less often against Kershaw. Looking down the far-right column, we can see
why: Kershaw's stronger ability to strike hitters out (K%) meant fewer balls
in play against him.</p>
<h2>Comparing batted ball exit velocity</h2>
<p>With <a href="https://en.wikipedia.org/wiki/PITCHf/x">PITCHf/x</a> installed in every MLB
park, we can also look at data on every pitch thrown throughout
the season. <a href="https://baseballsavant.com">Baseball Savant</a> is a great source of this data.</p>
<p>Since it still wasn't clear who should win the award after looking at a variety
of stats, it seemed interesting to answer the most basic question: Which pitcher
was hit harder? We know <a href="http://fivethirtyeight.com/features/chase-utley-is-the-unluckiest-man-in-baseball/">there's a significant relationship</a>
between a batted ball's exit velocity and the likelihood that it winds up a hit, so
this should give us some indication of who was the more difficult pitcher to
face.</p>
<p><img alt="Exit Velocity Distribution By Pitcher" src="/images/bb-velocity-distributions.png"></p>
<p>Looking at the observed distributions of their batted ball exit velocities doesn't
tell us much: Arrieta's mean exit velocity was 85.0 MPH, Greinke's 88.4, and
Kershaw's 84.9. Those numbers are close enough that we shouldn't assume the
differences are statistically significant, so let's test that using the
<a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrap</a>.</p>
<p>With bootstrapping, we resample our observed dataset with replacement N times
(typically 1,000 or 10,000). Since we're interested in speaking about the "average"
batted ball exit velocity, we take the mean of each resample, resulting in an
approximation of the sampling distribution of the mean. From there, we can look at the
95% confidence intervals to test for significance.</p>
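<p>Before resampling the real pitch data, it's worth seeing the 95% interval
computation in isolation. A minimal sketch on synthetic data (the velocities
here are invented; <code>np.percentile</code> over the bootstrap means gives
the percentile interval):</p>

```python
import numpy as np

rng = np.random.default_rng(49)

# invented batted-ball velocities centered near 85 MPH
observed = rng.normal(loc=85.0, scale=12.0, size=500)

# bootstrap: resample with replacement, record each resample's mean
boot_means = [rng.choice(observed, size=len(observed), replace=True).mean()
              for _ in range(1000)]

# 95% percentile confidence interval for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(round(lo, 1), round(hi, 1))  # interval brackets the observed mean
```

<p>If two pitchers' intervals don't overlap, that's good evidence the difference
in their means isn't just sampling noise.</p>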
<div class="highlight"><pre><span></span><code><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">49</span><span class="p">)</span> <span class="c1"># set random seed for consistency</span>
<span class="c1"># only sample from pitches that were hit</span>
<span class="n">arrietaBBs</span> <span class="o">=</span> <span class="n">arrietaPitches</span><span class="p">[</span><span class="n">arrietaPitches</span><span class="o">.</span><span class="n">batted_ball_velocity</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">batted_ball_velocity</span>
<span class="n">greinkeBBs</span> <span class="o">=</span> <span class="n">greinkePitches</span><span class="p">[</span><span class="n">greinkePitches</span><span class="o">.</span><span class="n">batted_ball_velocity</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">batted_ball_velocity</span>
<span class="n">kershawBBs</span> <span class="o">=</span> <span class="n">kershawPitches</span><span class="p">[</span><span class="n">kershawPitches</span><span class="o">.</span><span class="n">batted_ball_velocity</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">batted_ball_velocity</span>
<span class="n">arrietaSamples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">greinkeSamples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">kershawSamples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># generate 1000 randomly sampled datasets for each pitcher</span>
<span class="c1"># each sampled dataset is the same length as our observed dataset</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
<span class="n">arrietaSamples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">arrietaBBs</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">arrietaBBs</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span>
<span class="n">greinkeSamples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">greinkeBBs</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">greinkeBBs</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span>
<span class="n">kershawSamples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">kershawBBs</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">kershawBBs</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span>
<span class="c1"># get the mean of each randomly sampled dataset</span>
<span class="n">arrietaMeans</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span> <span class="k">for</span> <span class="n">obs</span> <span class="ow">in</span> <span class="n">arrietaSamples</span><span class="p">]</span>
<span class="n">greinkeMeans</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span> <span class="k">for</span> <span class="n">obs</span> <span class="ow">in</span> <span class="n">greinkeSamples</span><span class="p">]</span>
<span class="n">kershawMeans</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span> <span class="k">for</span> <span class="n">obs</span> <span class="ow">in</span> <span class="n">kershawSamples</span><span class="p">]</span>
<span class="c1"># plot the distributions</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">arrietaMeans</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">.5</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Arrieta'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">ja</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">greinkeMeans</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">.6</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Greinke'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">zg</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">kershawMeans</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">.3</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Kershaw'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">kc</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">'best'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Avg. Batted Ball Velocity'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="s1">'right'</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="s1">'left'</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="s1">'top'</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">xaxis</span><span class="o">.</span><span class="n">set_ticks_position</span><span class="p">(</span><span class="s1">'bottom'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s1">'both'</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="s1">'major'</span><span class="p">,</span> <span class="n">labelsize</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">get_yaxis</span><span class="p">()</span><span class="o">.</span><span class="n">set_ticks</span><span class="p">([])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">'images/avg-batted-ball-velocity.png'</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s1">'tight'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">120</span><span class="p">);</span>
</code></pre></div>
<p><img alt="Batted Ball Exit Velocity" src="https://raw.githubusercontent.com/gjreda/cy-young-NL-2015/master/images/avg-batted-ball-velocity.png"></p>
<p>While the above chart doesn't explicitly show their 95% confidence intervals, it's pretty
clear that Greinke's mean exit velocity is significantly higher than Arrieta's and
Kershaw's -- allowing us to say that, on average, Greinke was hit harder
throughout the season than both Arrieta and Kershaw. We cannot confidently say
there was a difference in exit velocity when comparing Arrieta and Kershaw to
each other, though.</p>
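<p>To make those intervals explicit, one option is to take percentiles of the bootstrap means directly. A minimal sketch on synthetic data -- the variable names mirror the lists computed above, but the numbers here are made up for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins for the bootstrap means computed above
greinkeMeans = rng.normal(loc=89.5, scale=0.3, size=10_000)
kershawMeans = rng.normal(loc=87.8, scale=0.3, size=10_000)

def ci95(means):
    """95% bootstrap percentile interval from a collection of resampled means."""
    return np.percentile(means, [2.5, 97.5])

lo_g, hi_g = ci95(greinkeMeans)
lo_k, hi_k = ci95(kershawMeans)

# non-overlapping intervals support calling the difference significant
print((lo_g, hi_g), (lo_k, hi_k))
```

<p>If the two intervals don't overlap, the "Greinke was hit harder" claim holds up at the 95% level.</p>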
<p>The chart above is especially interesting in the context of our small
multiples charts.
In particular, Greinke had the lowest ERA, batting average on balls in play
(BABIP), and extra base hit rate (XBH%) of the three, <em>despite</em> allowing harder contact.
This suggests that Greinke received a bit more help from his defense than
Arrieta and Kershaw did.</p>
<p>If you're interested in more analysis on the season each of these three had,
<a href="https://twitter.com/DCameronFG">Dave Cameron</a> at <a href="http://www.fangraphs.com">FanGraphs</a> has an excellent write-up <a href="http://www.fangraphs.com/blogs/explaining-my-nl-cy-young-ballot/">explaining the rationale
behind his vote</a>.</p>
<hr>
<p>Hope you've enjoyed the post, and <a href="https://www.twitter.com/gjreda">let me know</a> if you have any questions.</p>Cohort Analysis with Python2015-08-23T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2015-08-23:/2015/08/23/cohort-analysis-with-python/<p>Despite having done it countless times, I regularly forget how to build a <a href="https://en.wikipedia.org/wiki/Cohort_analysis">cohort analysis</a> with Python and <a href="http://pandas.pydata.org/">pandas</a>. I’ve decided it’s a good idea to finally write it out - step by step - so I can refer back to this post later on. Hopefully others find it useful as well.</p>
<p>I’ll start by walking through what cohort analysis is and why it’s commonly used in startups and other growth businesses. Then, we’ll create one from a standard purchase dataset.</p>
<h2>What is cohort analysis?</h2>
<p>A cohort is a group of users who share something in common, be it their sign-up date, first purchase month, birth date, acquisition channel, etc. Cohort analysis is the method by which these groups are tracked over time, helping you spot trends, understand repeat behaviors (purchases, engagement, amount spent, etc.), and monitor your customer and revenue retention.</p>
<p>It’s common for cohorts to be created based on a customer’s first usage of the platform, where "usage" is dependent on your business’ key metrics. For Uber or Lyft, usage would be booking a trip through one of their apps. For GrubHub, it’s ordering some food. For AirBnB, it’s booking a stay.</p>
<p>With these companies, a purchase is at their core, be it taking a trip or ordering dinner — their revenues are tied to their users’ purchase behavior.</p>
<p>In others, a purchase is not central to the business model and the business is more interested in "engagement" with the platform. Facebook and Twitter are examples of this - are you visiting their sites every day? Are you performing some action on them - maybe a "like" on Facebook or a "favorite" on a tweet?<sup>1</sup></p>
<p>When building a cohort analysis, it’s important to consider the relationship between the event or interaction you’re tracking and its relationship to your business model.</p>
<h2>Why is it valuable?</h2>
<p>Cohort analysis can be helpful when it comes to understanding your business’ health and "stickiness" - the loyalty of your customers. Stickiness is critical since <a href="https://hbr.org/2014/10/the-value-of-keeping-the-right-customers/">it’s far cheaper and easier to keep a current customer than to acquire a new one</a>. For startups, it’s also a key indicator of <a href="https://en.wikipedia.org/wiki/Product/market_fit">product-market fit</a>.</p>
<p>Additionally, your product evolves over time. New features are added and removed, the design changes, etc. Observing individual groups over time is a starting point to understanding how these changes affect user behavior.</p>
<p>It’s also a good way to visualize your user retention/churn as well as formulating a basic understanding of their lifetime value.</p>
<h2>An example</h2>
<p>Imagine we have a dataset like the one below (you can find it <a href="http://dmanalytics.org/wp-content/uploads/2014/10/chapter-12-relay-foods.xlsx">here</a>):</p>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>OrderId</th>
<th>OrderDate</th>
<th>UserId</th>
<th>TotalCharges</th>
<th>CommonId</th>
<th>PupId</th>
<th>PickupDate</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>262</td>
<td>2009-01-11</td>
<td>47</td>
<td>50.67</td>
<td>TRQKD</td>
<td>2</td>
<td>2009-01-12</td>
</tr>
<tr>
<th>1</th>
<td>278</td>
<td>2009-01-20</td>
<td>47</td>
<td>26.60</td>
<td>4HH2S</td>
<td>3</td>
<td>2009-01-20</td>
</tr>
<tr>
<th>2</th>
<td>294</td>
<td>2009-02-03</td>
<td>47</td>
<td>38.71</td>
<td>3TRDC</td>
<td>2</td>
<td>2009-02-04</td>
</tr>
<tr>
<th>3</th>
<td>301</td>
<td>2009-02-06</td>
<td>47</td>
<td>53.38</td>
<td>NGAZJ</td>
<td>2</td>
<td>2009-02-09</td>
</tr>
<tr>
<th>4</th>
<td>302</td>
<td>2009-02-06</td>
<td>47</td>
<td>14.28</td>
<td>FFYHD</td>
<td>2</td>
<td>2009-02-09</td>
</tr>
</tbody>
</table>
<p>Pretty standard purchase data with IDs for the order and user, as well as the order date and purchase amount.</p>
<p>We want to go from the data above to something like this:</p>
<p><img alt="example cohort chart" src="/images/cohort-example.png"></p>
<p>Here’s how we get there.</p>
<h2>Code</h2>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span> <span class="k">as</span> <span class="nn">mpl</span>
<span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s1">'display.max_columns'</span><span class="p">,</span> <span class="mi">50</span><span class="p">)</span>
<span class="n">mpl</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">'lines.linewidth'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">'/Users/gjreda/Dropbox/datasets/relay-foods.xlsx'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>OrderId</th>
<th>OrderDate</th>
<th>UserId</th>
<th>TotalCharges</th>
<th>CommonId</th>
<th>PupId</th>
<th>PickupDate</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>262</td>
<td>2009-01-11</td>
<td>47</td>
<td>50.67</td>
<td>TRQKD</td>
<td>2</td>
<td>2009-01-12</td>
</tr>
<tr>
<th>1</th>
<td>278</td>
<td>2009-01-20</td>
<td>47</td>
<td>26.60</td>
<td>4HH2S</td>
<td>3</td>
<td>2009-01-20</td>
</tr>
<tr>
<th>2</th>
<td>294</td>
<td>2009-02-03</td>
<td>47</td>
<td>38.71</td>
<td>3TRDC</td>
<td>2</td>
<td>2009-02-04</td>
</tr>
</tbody>
</table>
<h3>1. Create a period column based on the OrderDate</h3>
<p>Since we're doing monthly cohorts, we'll be looking at the total monthly behavior of our users. Therefore, we don't want granular OrderDate data (right now).</p>
<div class="highlight"><pre><span></span><code><span class="n">df</span><span class="p">[</span><span class="s1">'OrderPeriod'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">OrderDate</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">'%Y-%m'</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
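<p>As an aside, the row-wise <code>apply</code> works, but pandas can also do this at the column level with the <code>.dt</code> accessor. A sketch of two equivalent approaches on toy data (assuming <code>OrderDate</code> is already a datetime column, as it is when read from Excel):</p>

```python
import pandas as pd

df = pd.DataFrame({'OrderDate': pd.to_datetime(['2009-01-11', '2009-02-03'])})

# vectorized datetime formatting instead of a Python-level apply
df['OrderPeriod'] = df['OrderDate'].dt.strftime('%Y-%m')

# or keep it as a real Period type, which sorts and groups naturally
df['OrderPeriodP'] = df['OrderDate'].dt.to_period('M')

print(df['OrderPeriod'].tolist())  # -> ['2009-01', '2009-02']
```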
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>OrderId</th>
<th>OrderDate</th>
<th>UserId</th>
<th>TotalCharges</th>
<th>CommonId</th>
<th>PupId</th>
<th>PickupDate</th>
<th>OrderPeriod</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>262</td>
<td>2009-01-11</td>
<td>47</td>
<td>50.67</td>
<td>TRQKD</td>
<td>2</td>
<td>2009-01-12</td>
<td>2009-01</td>
</tr>
<tr>
<th>1</th>
<td>278</td>
<td>2009-01-20</td>
<td>47</td>
<td>26.60</td>
<td>4HH2S</td>
<td>3</td>
<td>2009-01-20</td>
<td>2009-01</td>
</tr>
<tr>
<th>2</th>
<td>294</td>
<td>2009-02-03</td>
<td>47</td>
<td>38.71</td>
<td>3TRDC</td>
<td>2</td>
<td>2009-02-04</td>
<td>2009-02</td>
</tr>
<tr>
<th>3</th>
<td>301</td>
<td>2009-02-06</td>
<td>47</td>
<td>53.38</td>
<td>NGAZJ</td>
<td>2</td>
<td>2009-02-09</td>
<td>2009-02</td>
</tr>
<tr>
<th>4</th>
<td>302</td>
<td>2009-02-06</td>
<td>47</td>
<td>14.28</td>
<td>FFYHD</td>
<td>2</td>
<td>2009-02-09</td>
<td>2009-02</td>
</tr>
</tbody>
</table>
<h3>2. Determine the user's cohort group (based on their first order)</h3>
<p>Create a new column called <code>CohortGroup</code>, which is the year and month in which the user's first purchase occurred.</p>
<div class="highlight"><pre><span></span><code><span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'UserId'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'CohortGroup'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)[</span><span class="s1">'OrderDate'</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">'%Y-%m'</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>UserId</th>
<th>OrderId</th>
<th>OrderDate</th>
<th>TotalCharges</th>
<th>CommonId</th>
<th>PupId</th>
<th>PickupDate</th>
<th>OrderPeriod</th>
<th>CohortGroup</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>47</td>
<td>262</td>
<td>2009-01-11</td>
<td>50.67</td>
<td>TRQKD</td>
<td>2</td>
<td>2009-01-12</td>
<td>2009-01</td>
<td>2009-01</td>
</tr>
<tr>
<th>1</th>
<td>47</td>
<td>278</td>
<td>2009-01-20</td>
<td>26.60</td>
<td>4HH2S</td>
<td>3</td>
<td>2009-01-20</td>
<td>2009-01</td>
<td>2009-01</td>
</tr>
<tr>
<th>2</th>
<td>47</td>
<td>294</td>
<td>2009-02-03</td>
<td>38.71</td>
<td>3TRDC</td>
<td>2</td>
<td>2009-02-04</td>
<td>2009-02</td>
<td>2009-01</td>
</tr>
<tr>
<th>3</th>
<td>47</td>
<td>301</td>
<td>2009-02-06</td>
<td>53.38</td>
<td>NGAZJ</td>
<td>2</td>
<td>2009-02-09</td>
<td>2009-02</td>
<td>2009-01</td>
</tr>
<tr>
<th>4</th>
<td>47</td>
<td>302</td>
<td>2009-02-06</td>
<td>14.28</td>
<td>FFYHD</td>
<td>2</td>
<td>2009-02-09</td>
<td>2009-02</td>
<td>2009-01</td>
</tr>
</tbody>
</table>
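<p>The <code>set_index</code>/<code>reset_index</code> dance above leans on index alignment to broadcast each user's minimum order date back onto all of their rows. A <code>groupby().transform</code> does the same thing in one step; a sketch on toy data:</p>

```python
import pandas as pd

df = pd.DataFrame({
    'UserId': [47, 47, 48],
    'OrderDate': pd.to_datetime(['2009-01-11', '2009-02-03', '2009-03-01']),
})

# transform returns one value per row, aligned to the original index
df['CohortGroup'] = (df.groupby('UserId')['OrderDate']
                       .transform('min')
                       .dt.strftime('%Y-%m'))

print(df['CohortGroup'].tolist())  # -> ['2009-01', '2009-01', '2009-03']
```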
<h3>3. Rollup data by CohortGroup & OrderPeriod</h3>
<p>Since we're looking at monthly cohorts, we need to aggregate users, orders, and amount spent by the CohortGroup within the month (OrderPeriod).</p>
<div class="highlight"><pre><span></span><code><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'CohortGroup'</span><span class="p">,</span> <span class="s1">'OrderPeriod'</span><span class="p">])</span>
<span class="c1"># count the unique users, orders, and total revenue per Group + Period</span>
<span class="n">cohorts</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'UserId'</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="o">.</span><span class="n">nunique</span><span class="p">,</span>
<span class="s1">'OrderId'</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="o">.</span><span class="n">nunique</span><span class="p">,</span>
<span class="s1">'TotalCharges'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span>
<span class="c1"># make the column names more meaningful</span>
<span class="n">cohorts</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">'UserId'</span><span class="p">:</span> <span class="s1">'TotalUsers'</span><span class="p">,</span>
<span class="s1">'OrderId'</span><span class="p">:</span> <span class="s1">'TotalOrders'</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">cohorts</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>TotalOrders</th>
<th>TotalUsers</th>
<th>TotalCharges</th>
</tr>
<tr>
<th>CohortGroup</th>
<th>OrderPeriod</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="5" valign="top">2009-01</th>
<th>2009-01</th>
<td>30</td>
<td>22</td>
<td>1850.255</td>
</tr>
<tr>
<th>2009-02</th>
<td>25</td>
<td>8</td>
<td>1351.065</td>
</tr>
<tr>
<th>2009-03</th>
<td>26</td>
<td>10</td>
<td>1357.360</td>
</tr>
<tr>
<th>2009-04</th>
<td>28</td>
<td>9</td>
<td>1604.500</td>
</tr>
<tr>
<th>2009-05</th>
<td>26</td>
<td>10</td>
<td>1575.625</td>
</tr>
</tbody>
</table>
<h3>4. Label the CohortPeriod for each CohortGroup</h3>
<p>We want to look at how each cohort has behaved in the months following their first purchase, so we'll need to index each cohort to their first purchase month. For example, CohortPeriod = 1 will be the cohort's first month, CohortPeriod = 2 is their second, and so on.</p>
<p>This allows us to compare cohorts across various stages of their lifetime.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">cohort_period</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> Creates a `CohortPeriod` column, which is the Nth period based on the user's first purchase.</span>
<span class="sd"> Example</span>
<span class="sd"> -------</span>
<span class="sd"> Say you want to get the 3rd month for every user:</span>
<span class="sd">    df.sort_values(['UserId', 'OrderDate'], inplace=True)</span>
<span class="sd"> df = df.groupby('UserId').apply(cohort_period)</span>
<span class="sd"> df[df.CohortPeriod == 3]</span>
<span class="sd"> """</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'CohortPeriod'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">cohorts</span> <span class="o">=</span> <span class="n">cohorts</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">cohort_period</span><span class="p">)</span>
<span class="n">cohorts</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>TotalOrders</th>
<th>TotalUsers</th>
<th>TotalCharges</th>
<th>CohortPeriod</th>
</tr>
<tr>
<th>CohortGroup</th>
<th>OrderPeriod</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="5" valign="top">2009-01</th>
<th>2009-01</th>
<td>30</td>
<td>22</td>
<td>1850.255</td>
<td>1</td>
</tr>
<tr>
<th>2009-02</th>
<td>25</td>
<td>8</td>
<td>1351.065</td>
<td>2</td>
</tr>
<tr>
<th>2009-03</th>
<td>26</td>
<td>10</td>
<td>1357.360</td>
<td>3</td>
</tr>
<tr>
<th>2009-04</th>
<td>28</td>
<td>9</td>
<td>1604.500</td>
<td>4</td>
</tr>
<tr>
<th>2009-05</th>
<td>26</td>
<td>10</td>
<td>1575.625</td>
<td>5</td>
</tr>
</tbody>
</table>
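<p>As an aside, the custom function isn't strictly necessary: <code>cumcount</code> numbers the rows within each group starting at zero, so <code>groupby(level=0).cumcount() + 1</code> produces the same labels. A sketch on a toy frame shaped like <code>cohorts</code>:</p>

```python
import pandas as pd

cohorts = pd.DataFrame({
    'CohortGroup': ['2009-01', '2009-01', '2009-02'],
    'OrderPeriod': ['2009-01', '2009-02', '2009-02'],
    'TotalUsers': [22, 8, 15],
}).set_index(['CohortGroup', 'OrderPeriod'])

# cumcount numbers rows within each CohortGroup, starting at 0
cohorts['CohortPeriod'] = cohorts.groupby(level=0).cumcount() + 1

print(cohorts['CohortPeriod'].tolist())  # -> [1, 2, 1]
```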
<h3>5. Make sure we did all that right</h3>
<p>Let's test data points from the original DataFrame with their corresponding values in the new cohorts DataFrame to make sure all our data transformations worked as expected. As long as none of these raise an exception, we're good.</p>
<div class="highlight"><pre><span></span><code><span class="n">x</span> <span class="o">=</span> <span class="n">df</span><span class="p">[(</span><span class="n">df</span><span class="o">.</span><span class="n">CohortGroup</span> <span class="o">==</span> <span class="s1">'2009-01'</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">OrderPeriod</span> <span class="o">==</span> <span class="s1">'2009-01'</span><span class="p">)]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">cohorts</span><span class="o">.</span><span class="n">loc</span><span class="p">[(</span><span class="s1">'2009-01'</span><span class="p">,</span> <span class="s1">'2009-01'</span><span class="p">)]</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'UserId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">])</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'OrderId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalOrders'</span><span class="p">])</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">df</span><span class="p">[(</span><span class="n">df</span><span class="o">.</span><span class="n">CohortGroup</span> <span class="o">==</span> <span class="s1">'2009-01'</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">OrderPeriod</span> <span class="o">==</span> <span class="s1">'2009-09'</span><span class="p">)]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">cohorts</span><span class="o">.</span><span class="n">loc</span><span class="p">[(</span><span class="s1">'2009-01'</span><span class="p">,</span> <span class="s1">'2009-09'</span><span class="p">)]</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'UserId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">])</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'OrderId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalOrders'</span><span class="p">])</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">df</span><span class="p">[(</span><span class="n">df</span><span class="o">.</span><span class="n">CohortGroup</span> <span class="o">==</span> <span class="s1">'2009-05'</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">OrderPeriod</span> <span class="o">==</span> <span class="s1">'2009-09'</span><span class="p">)]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">cohorts</span><span class="o">.</span><span class="n">loc</span><span class="p">[(</span><span class="s1">'2009-05'</span><span class="p">,</span> <span class="s1">'2009-09'</span><span class="p">)]</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'UserId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">])</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'OrderId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalOrders'</span><span class="p">])</span>
</code></pre></div>
<h3>User Retention by Cohort Group</h3>
<p>We want to look at the percentage change of each CohortGroup over time -- not the absolute change.</p>
<p>To do this, we'll first need to create a pandas Series containing each CohortGroup and its size.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># reindex the DataFrame</span>
<span class="n">cohorts</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">cohorts</span><span class="o">.</span><span class="n">set_index</span><span class="p">([</span><span class="s1">'CohortGroup'</span><span class="p">,</span> <span class="s1">'CohortPeriod'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># create a Series holding the total size of each CohortGroup</span>
<span class="n">cohort_group_size</span> <span class="o">=</span> <span class="n">cohorts</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">]</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="n">cohort_group_size</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">CohortGroup</span>
<span class="go">2009-01 22</span>
<span class="go">2009-02 15</span>
<span class="go">2009-03 13</span>
<span class="go">2009-04 39</span>
<span class="go">2009-05 50</span>
<span class="go">Name: TotalUsers, dtype: int64</span>
</code></pre></div>
<p>Now, we'll need to divide the <code>TotalUsers</code> values in <code>cohorts</code> by <code>cohort_group_size</code>. Since DataFrame operations are performed based on the indices of the objects, we'll use <code>unstack</code> on our cohorts DataFrame to create a matrix where each column represents a CohortGroup and each row is the CohortPeriod corresponding to that group.</p>
<p>To illustrate what <code>unstack</code> does, recall the first five <code>TotalUsers</code> values:</p>
<div class="highlight"><pre><span></span><code><span class="n">cohorts</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">CohortGroup CohortPeriod</span>
<span class="go">2009-01 1 22</span>
<span class="go"> 2 8</span>
<span class="go"> 3 10</span>
<span class="go"> 4 9</span>
<span class="go"> 5 10</span>
<span class="go">Name: TotalUsers, dtype: int64</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">cohorts</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">]</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>CohortGroup</th>
<th>2009-01</th>
<th>2009-02</th>
<th>2009-03</th>
<th>2009-04</th>
<th>2009-05</th>
<th>2009-06</th>
<th>2009-07</th>
<th>2009-08</th>
<th>2009-09</th>
<th>2009-10</th>
<th>2009-11</th>
<th>2009-12</th>
<th>2010-01</th>
<th>2010-02</th>
<th>2010-03</th>
</tr>
<tr>
<th>CohortPeriod</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>22</td>
<td>15</td>
<td>13</td>
<td>39</td>
<td>50</td>
<td>32</td>
<td>50</td>
<td>31</td>
<td>37</td>
<td>54</td>
<td>130</td>
<td>65</td>
<td>95</td>
<td>100</td>
<td>24</td>
</tr>
<tr>
<th>2</th>
<td>8</td>
<td>3</td>
<td>4</td>
<td>13</td>
<td>13</td>
<td>15</td>
<td>23</td>
<td>11</td>
<td>15</td>
<td>17</td>
<td>32</td>
<td>17</td>
<td>50</td>
<td>19</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>10</td>
<td>5</td>
<td>5</td>
<td>10</td>
<td>12</td>
<td>9</td>
<td>13</td>
<td>9</td>
<td>14</td>
<td>12</td>
<td>26</td>
<td>18</td>
<td>26</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>9</td>
<td>1</td>
<td>4</td>
<td>13</td>
<td>5</td>
<td>6</td>
<td>10</td>
<td>7</td>
<td>8</td>
<td>13</td>
<td>29</td>
<td>7</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>10</td>
<td>4</td>
<td>1</td>
<td>6</td>
<td>4</td>
<td>7</td>
<td>11</td>
<td>6</td>
<td>13</td>
<td>13</td>
<td>13</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>Now, we can utilize broadcasting to divide each column by the corresponding <code>cohort_group_size</code>.</p>
<p>The resulting DataFrame, <code>user_retention</code>, contains the percentage of users from the cohort purchasing within the given period. For instance, 38.5% of users in the 2009-03 cohort purchased again in month 3 (which would be May 2009).</p>
<div class="highlight"><pre><span></span><code><span class="n">user_retention</span> <span class="o">=</span> <span class="n">cohorts</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">]</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">cohort_group_size</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">user_retention</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>CohortGroup</th>
<th>2009-01</th>
<th>2009-02</th>
<th>2009-03</th>
<th>2009-04</th>
<th>2009-05</th>
<th>2009-06</th>
<th>2009-07</th>
<th>2009-08</th>
<th>2009-09</th>
<th>2009-10</th>
<th>2009-11</th>
<th>2009-12</th>
<th>2010-01</th>
<th>2010-02</th>
<th>2010-03</th>
</tr>
<tr>
<th>CohortPeriod</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.00</td>
<td>1.00000</td>
<td>1.00</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.00</td>
<td>1</td>
</tr>
<tr>
<th>2</th>
<td>0.363636</td>
<td>0.200000</td>
<td>0.307692</td>
<td>0.333333</td>
<td>0.26</td>
<td>0.46875</td>
<td>0.46</td>
<td>0.354839</td>
<td>0.405405</td>
<td>0.314815</td>
<td>0.246154</td>
<td>0.261538</td>
<td>0.526316</td>
<td>0.19</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>0.454545</td>
<td>0.333333</td>
<td>0.384615</td>
<td>0.256410</td>
<td>0.24</td>
<td>0.28125</td>
<td>0.26</td>
<td>0.290323</td>
<td>0.378378</td>
<td>0.222222</td>
<td>0.200000</td>
<td>0.276923</td>
<td>0.273684</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>0.409091</td>
<td>0.066667</td>
<td>0.307692</td>
<td>0.333333</td>
<td>0.10</td>
<td>0.18750</td>
<td>0.20</td>
<td>0.225806</td>
<td>0.216216</td>
<td>0.240741</td>
<td>0.223077</td>
<td>0.107692</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>0.454545</td>
<td>0.266667</td>
<td>0.076923</td>
<td>0.153846</td>
<td>0.08</td>
<td>0.21875</td>
<td>0.22</td>
<td>0.193548</td>
<td>0.351351</td>
<td>0.240741</td>
<td>0.100000</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>6</th>
<td>0.363636</td>
<td>0.266667</td>
<td>0.153846</td>
<td>0.179487</td>
<td>0.12</td>
<td>0.15625</td>
<td>0.20</td>
<td>0.258065</td>
<td>0.243243</td>
<td>0.129630</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>0.363636</td>
<td>0.266667</td>
<td>0.153846</td>
<td>0.102564</td>
<td>0.06</td>
<td>0.09375</td>
<td>0.22</td>
<td>0.129032</td>
<td>0.216216</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>8</th>
<td>0.318182</td>
<td>0.333333</td>
<td>0.230769</td>
<td>0.153846</td>
<td>0.10</td>
<td>0.09375</td>
<td>0.14</td>
<td>0.129032</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>9</th>
<td>0.318182</td>
<td>0.333333</td>
<td>0.153846</td>
<td>0.051282</td>
<td>0.10</td>
<td>0.31250</td>
<td>0.14</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>10</th>
<td>0.318182</td>
<td>0.266667</td>
<td>0.076923</td>
<td>0.102564</td>
<td>0.08</td>
<td>0.09375</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
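<p>To make the broadcasting step concrete, here is a minimal, self-contained sketch of the same <code>divide</code>-with-<code>axis=1</code> pattern on toy data (the values mirror the first two cohorts above, but the variable names are illustrative, not the notebook's):</p>

```python
import pandas as pd

# A toy version of the unstacked matrix: each column is a CohortGroup,
# each row a CohortPeriod, each value a count of active users.
unstacked = pd.DataFrame({'2009-01': [22, 8, 10],
                         '2009-02': [15, 3, 5]},
                        index=pd.Index([1, 2, 3], name='CohortPeriod'))

# Each cohort's size is its period-1 user count.
sizes = pd.Series({'2009-01': 22, '2009-02': 15})

# axis=1 aligns the Series' index against the DataFrame's columns,
# so every column is divided element-wise by its cohort's size.
retention = unstacked.divide(sizes, axis=1)
print(retention)
```

<p>Period 1 comes out as 1.0 for every cohort, and period 2 for the 2009-01 cohort is 8/22 ≈ 0.36, matching the full table above.</p>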
<p>Finally, we can plot the cohorts over time in an effort to spot behavioral differences or similarities. Two common cohort charts are line graphs and heatmaps, both of which are shown below.</p>
<p>Notice that the first period of each cohort is 100% -- this is because our cohorts are based on each user's first purchase, meaning everyone in the cohort purchased in month 1.</p>
<div class="highlight"><pre><span></span><code><span class="n">user_retention</span><span class="p">[[</span><span class="s1">'2009-06'</span><span class="p">,</span> <span class="s1">'2009-07'</span><span class="p">,</span> <span class="s1">'2009-08'</span><span class="p">]]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Cohorts: User Retention'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">12.1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'</span><span class="si">% o</span><span class="s1">f Cohort Purchasing'</span><span class="p">);</span>
</code></pre></div>
<p><img alt="cohort retention curves" src="/images/cohort-example.png"></p>
<div class="highlight"><pre><span></span><code><span class="c1"># Creating heatmaps in matplotlib is more difficult than it should be.</span>
<span class="c1"># Thankfully, Seaborn makes them easy for us.</span>
<span class="c1"># http://stanford.edu/~mwaskom/software/seaborn/</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="nn">sns</span>
<span class="n">sns</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s1">'white'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Cohorts: User Retention'</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">user_retention</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">user_retention</span><span class="o">.</span><span class="n">T</span><span class="o">.</span><span class="n">isnull</span><span class="p">(),</span> <span class="n">annot</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">fmt</span><span class="o">=</span><span class="s1">'.0%'</span><span class="p">);</span>
</code></pre></div>
<p><img alt="cohort retention heatmap" src="/images/cohort-retention-heatmap.png"></p>
<p>Unsurprisingly, we can see from the above chart that fewer users tend to purchase as time goes on.</p>
<p>However, we can also see that the 2009-01 cohort is the strongest, which enables us to ask targeted questions about this cohort compared to others -- what other attributes (besides first purchase month) do these users share which might be causing them to stick around? How were the majority of these users acquired? Was there a specific marketing campaign that brought them in? Did they take advantage of a promotion at sign-up? The answers to these questions would inform future marketing and product efforts.</p>
<h2>Further work</h2>
<p>User retention is only one way of using cohorts to look at your business — we could have also looked at revenue retention. That is, the percentage of each cohort’s month 1 revenue returning in subsequent periods. User retention is important, but we shouldn’t lose sight of the revenue each cohort is bringing in (and how much of it is returning).</p>
<p>Hopefully you’ve found this post useful. If I’ve missed anything, <a href="https://twitter.com/gjreda">let me know</a>.</p>
<h2>Additional Resources</h2>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Cohort_analysis">Cohort Analysis</a> on Wikipedia</li>
<li><a href="http://christophjanz.blogspot.de/2012/05/know-your-user-cohorts.html">Know Your User Cohorts</a> by Christoph Janz</li>
<li><a href="http://avc.com/2009/10/the-cohort-analysis/">The Cohort Analysis</a> by Fred Wilson (Union Square Ventures)</li>
<li><a href="http://www.quora.com/What-exactly-is-cohort-analysis">What exactly is cohort analysis?</a> on Quora</li>
</ul>
<hr class="small" id="footnotes">
<p>
1. While a purchase might not be at the core of these businesses, they still might occur (e.g. "Buy" buttons on tweets are of value to Twitter, but users and engagement are what the platform is about).</p>Nonsensical beer reviews via Markov chains2015-03-30T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2015-03-30:/2015/03/30/beer-review-markov-chains/<p>I’ve had a bunch of beer reviews and ratings data sitting on my hard drive for about a year. For a beer nerd like me, that’s a pretty cool dataset, yet I’ve let it collect digital dust.</p>
<p>Fast forward to last week, where somehow I wound up in the Wikipedia Death Spiral. You know what I mean - you click a link to a Wikipedia article, that article takes you to a new one, then you’re on another, and another … we’ve all been there. And it’s kind of awesome.</p>
<p>Fast forward to last week, where somehow I wound up in the Wikipedia Death Spiral. You know what I mean - you click a link to a Wikipedia article, that article takes you to a new one, then you’re on another, and another … we’ve all been there. And it’s kind of awesome.</p>
<p>Well, the rabbit hole led me to <a href="http://en.wikipedia.org/wiki/Markov_chain">Markov chains</a>, which seemed like a good excuse to mess around with that beer review data.</p>
<h2>What are Markov chains?</h2>
<p>A Markov chain is a random process that transitions between various states, where the probability distribution of the “next state” depends only on the current state.</p>
<p>Imagine we have the following sequence of days, where S indicates it was sunny and R indicates it was rainy:</p>
<blockquote>
<p>S S R R S R S S R R R R S R S S S R</p>
</blockquote>
<p>Let’s pick a random beginning “state” - let’s just say it’s S (sunny). The next state is based <strong>only</strong> on the current state. Since our current state is S, we only need to look at observations immediately following a sunny day.</p>
<p>To illustrate, let’s look at the weather pattern again, this time putting the observations to be considered in bold.</p>
<blockquote>
<p>S <strong>S</strong> <strong>R</strong> R S <strong>R</strong> S <strong>S</strong> <strong>R</strong> R R R S <strong>R</strong> S <strong>S</strong> <strong>S</strong> <strong>R</strong></p>
</blockquote>
<p>Even though there are 18 observations, only nine need to be considered for the possible next state. Of the nine, four are S and five are R, giving us a 44% (4/9) chance of the next state being sunny and a 56% (5/9) chance of it being rainy.</p>
<p>Now, let’s assume our beginning state (S) transitioned to a second state of R (which it had a 56% chance of doing). Here are the states we need to consider for the possible third state:</p>
<blockquote>
<p>S S R <strong>R</strong> <strong>S</strong> R <strong>S</strong> S R <strong>R</strong> <strong>R</strong> <strong>R</strong> <strong>S</strong> R <strong>S</strong> S S R</p>
</blockquote>
<p>There’s an equal chance (4/8) the third state will be S or R.</p>
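<p>The tallying above can also be done mechanically. Here is a minimal sketch that counts first-order transitions from the same weather sequence:</p>

```python
from collections import Counter, defaultdict

sequence = 'S S R R S R S S R R R R S R S S S R'.split()

# For each state, count which state immediately follows it.
transitions = defaultdict(Counter)
for current, nxt in zip(sequence, sequence[1:]):
    transitions[current][nxt] += 1

# The distribution of the next state, given a current state of S.
total = sum(transitions['S'].values())
probs = {state: count / total for state, count in transitions['S'].items()}
print(probs)  # S: 4/9, R: 5/9
```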
<p>With a second-order Markov chain, the current state is two observations. Let’s assume a beginning state of SR and use the same weather sequence as above, again putting the possible next states in bold.</p>
<blockquote>
<p>S S R <strong>R</strong> S R <strong>S</strong> S R <strong>R</strong> R R S R <strong>S</strong> S S R</p>
</blockquote>
<p>This time there are only four observations to consider as possible “next states,” with an equal chance it’ll be S or R.</p>
<p>Let’s assume the “next state” picked is R. Now our current (second) state is RR - the S from our beginning state is forgotten. The following are possible third states:</p>
<blockquote>
<p>S S R R <strong>S</strong> R S S R R <strong>R</strong> <strong>R</strong> <strong>S</strong> R S S S R</p>
</blockquote>
<p>Again, there’s an equal chance of our third state being S or R.</p>
<p>We can continue picking “next states” and eventually we’ll have generated a random, yet probabilistic sequence of weather.</p>
<p>These same principles can be used to generate a sentence from text data - pick a random beginning state (word) from the text and then pick the next word based on the likelihood of it occurring, given the current word. A first-order Markov sentence would have a one word current state, a second-order would have a two word current state, … and so on.</p>
<p>The larger the corpus and the higher the order, the more sense these Markov generated sentences make. Good thing I have a lot of beer reviews.</p>
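<p>To make that concrete, here is a minimal second-order generator (a sketch of the technique, not the bot’s actual code):</p>

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each (word, word) pair to the list of words observed after it."""
    chain = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        chain[(a, b)].append(c)
    return chain

def generate(chain, length=15, seed=None):
    """Walk the chain from a random starting pair, one word at a time."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    for _ in range(length - 2):
        followers = chain.get(state)
        if not followers:  # dead end: this pair only appears at the corpus's tail
            break
        nxt = rng.choice(followers)
        out.append(nxt)
        state = (state[1], nxt)
    return ' '.join(out)

# A tiny stand-in corpus; the real bot trained on thousands of reviews.
corpus = ("pours a hazy golden color with a finger of white head "
          "and a hazy golden body with notes of citrus and pine").split()
print(generate(build_chain(corpus), seed=1))
```

<p>Because the pair (“hazy”, “golden”) is followed by both “color” and “body” in the corpus, the walk can splice the two phrasings together, which is exactly where the nonsense comes from.</p>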
<h2>The (mini) project</h2>
<p>This seemed ripe for a Twitter bot, so I created <a href="https://twitter.com/BeerSnobSays">BeerSnobSays</a>, which tweets nonsensical beer reviews generated via second-order Markov chains.</p>
<p>Not everything it tweets makes much sense:</p>
<blockquote>
<p>dissipates about a finger of head and some mild spice interwoven and even beer at a local Greek restaurant.</p>
<p>a big thumbs up though and there are plenty other choices that I was really no distinguishing characteristics that stand out.</p>
<p>those who are looking for a beer best characteristic of this beer into the hype and the lager style that is unwelcome.</p>
</blockquote>
<p>But some of it is pretty funny:</p>
<blockquote>
<p>off by itself, the taste of apple juice colored brew with a nice warming alcohol bathes your noodle in its dryness.</p>
<p>is almost like sour grains with a hint of booze in the finish, with sweet orange peels and pine sap.</p>
<p>a charred woodiness and smoke can run into pineapple, oranges and citrusy oils with a clean alcohol sting at the bottom of the recipe.</p>
<p>the berry aspect is evident but the tartness and dryness from the beer starts off surprisingly pleasant.</p>
</blockquote>
<p>I’m not sure if that last one’s from the bot or a famous poet.</p>
<p>You can <a href="https://twitter.com/gjreda">follow me</a> and <a href="https://twitter.com/BeerSnobSays">BeerSnobSays</a> on Twitter. You can also find the code for the bot <a href="https://github.com/gjreda/beer-snob-says">on GitHub</a>.</p>Using Travis & GitHub to deploy static sites2015-03-26T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2015-03-26:/2015/03/26/static-site-deployments/<p><strong>Update:</strong> As of December 2020, Travis CI has stopped being free for open source projects. If you've been using Travis to deploy your static site, I recommend migrating to Github Actions. I've written a post about how to do so <a href="/2020/12/09/deploying-static-sites-with-github-actions/">here</a>.</p>
<hr>
<p>I’m an unabashed supporter of “Keep It Simple, Stupid” solutions - it’s the reason I use <a href="http://docs.getpelican.com/en/3.5.0/">Pelican</a> for this website and host it on <a href="http://aws.amazon.com/s3/">S3</a>.</p>
<p>However, I haven’t been completely satisfied with the process of writing a new post or making changes to <a href="https://github.com/gjreda/void">my theme</a>. It’s felt repetitive - make a change, generate site, check change, regenerate site, and eventually push to S3. Due to the extra steps of generating and pushing, I never felt able to focus on just the change at hand.</p>
<p>I wanted to focus, but also maintain the flexibility of a static site.</p>
<p>Enter <a href="https://travis-ci.org">TravisCI</a>. Travis is a <a href="http://en.wikipedia.org/wiki/Continuous_integration">continuous integration</a> (CI) service that integrates with GitHub. Set up a <code>.travis.yml</code> file, check your code into GitHub, and Travis will build the project based on the steps laid out in your <code>.travis.yml</code>. A common use-case of CI is automatically running a test suite against each new commit to make sure a change didn’t break functionality of the app.</p>
<p>Since Pelican is just a Python application, and Travis has S3 integration, I’m now using it to regenerate and deploy my site every time I push a change to it on GitHub.</p>
<p>If you’re using Pelican (or any other static site generator) and hosting on S3, here’s how to set things up.</p>
<h2>Setup</h2>
<p>First, sign up for Travis - you’ll just need to log in with your GitHub account. Travis will then sync with your GitHub repos. Turn on the GitHub repo(s) you’ll be using it with. For me, it’s just my website.</p>
<p><img alt="travis-enabled-repo" src="/images/travis-enabled-repo.png"></p>
<p>Next, create a new <a href="http://aws.amazon.com/iam/">Identity & Access Management</a> (IAM) user on AWS for Travis. Make note of the security credentials - the Access Key ID and Secret Access Key. You’ll need these later.</p>
<p>Also, since this user will need to write files to S3, make sure it has the <em>AmazonS3FullAccess</em> policy. To do so, click on your new user in the IAM dashboard, click “Attach Policy” (in the Managed Policies section), select <em>AmazonS3FullAccess</em>, and attach. Done.</p>
<p><img alt="attach-s3-policy" src="/images/attach-s3-policy.png"></p>
<p>Now, you’ll need to add your AWS Access Key ID and Secret Access Key to your repo’s environment variables in Travis. These are needed in order to write your site’s files to S3.</p>
<p><img alt="travis-environment-variables" src="/images/travis-env-variables.png"></p>
<p>Lastly, you’ll need to add a <code>.travis.yml</code> file to the root of your project. This tells Travis how to build the application (in this case, a static site generator). Here’s what <a href="https://github.com/gjreda/gregreda.com/blob/master/.travis.yml">mine</a> looks like:</p>
<div class="highlight"><pre><span></span><code><span class="n">language</span><span class="o">:</span><span class="w"> </span><span class="n">python</span><span class="w"></span>
<span class="n">python</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"2.7"</span><span class="w"></span>
<span class="n">cache</span><span class="o">:</span><span class="w"> </span><span class="n">apt</span><span class="w"></span>
<span class="n">install</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"sudo apt-get install pandoc"</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"pip install -r requirements.txt"</span><span class="w"></span>
<span class="n">script</span><span class="o">:</span><span class="w"> </span><span class="s2">"pelican content/"</span><span class="w"></span>
<span class="n">deploy</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">provider</span><span class="o">:</span><span class="w"> </span><span class="n">s3</span><span class="w"></span>
<span class="w"> </span><span class="n">access_key_id</span><span class="o">:</span><span class="w"> </span><span class="n">$AWS_ACCESS_KEY</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="n">declared</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">Travis</span><span class="w"> </span><span class="n">repo</span><span class="w"> </span><span class="n">settings</span><span class="w"></span>
<span class="w"> </span><span class="n">secret_access_key</span><span class="o">:</span><span class="w"> </span><span class="n">$AWS_SECRET_KEY</span><span class="w"></span>
<span class="w"> </span><span class="n">bucket</span><span class="o">:</span><span class="w"> </span><span class="n">www</span><span class="o">.</span><span class="na">gregreda</span><span class="o">.</span><span class="na">com</span><span class="w"></span>
<span class="w"> </span><span class="n">endpoint</span><span class="o">:</span><span class="w"> </span><span class="n">www</span><span class="o">.</span><span class="na">gregreda</span><span class="o">.</span><span class="na">com</span><span class="o">.</span><span class="na">s3</span><span class="o">-</span><span class="n">website</span><span class="o">-</span><span class="n">us</span><span class="o">-</span><span class="n">east</span><span class="o">-</span><span class="mi">1</span><span class="o">.</span><span class="na">amazonaws</span><span class="o">.</span><span class="na">com</span><span class="w"></span>
<span class="w"> </span><span class="n">region</span><span class="o">:</span><span class="w"> </span><span class="n">us</span><span class="o">-</span><span class="n">east</span><span class="o">-</span><span class="mi">1</span><span class="w"></span>
<span class="w"> </span><span class="n">skip_cleanup</span><span class="o">:</span><span class="w"> </span><span class="kc">true</span><span class="w"></span>
<span class="w"> </span><span class="n">local</span><span class="o">-</span><span class="n">dir</span><span class="o">:</span><span class="w"> </span><span class="n">output</span><span class="w"></span>
<span class="w"> </span><span class="n">acl</span><span class="o">:</span><span class="w"> </span><span class="n">public_read</span><span class="w"></span>
<span class="w"> </span><span class="n">detect_encoding</span><span class="o">:</span><span class="w"> </span><span class="kc">true</span><span class="w"></span>
<span class="n">notifications</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">email</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">on_failure</span><span class="o">:</span><span class="w"> </span><span class="n">always</span><span class="w"></span>
</code></pre></div>
<p>Here’s a quick rundown:</p>
<ul>
<li><code>language</code> - The language in which the application is written. Since we’re using Pelican, it’s Python, but Travis supports a variety of languages. We also specify a version on the next line.</li>
<li><code>install</code> - This tells Travis any dependencies that need to be installed via apt-get. Some of my posts have IPython Notebook integration, which uses <a href="http://johnmacfarlane.net/pandoc/">pandoc</a>. I’m also using pip to install the required Python packages (like Pelican).</li>
<li><code>script</code> - Your build command. In this case, it’s just <code>pelican content/</code>, which generates the static site based off of what’s in the content directory. By default, Pelican writes the site to a local directory called <code>output</code>, which we need in the deploy step.</li>
<li><code>deploy</code> - Since Travis has <a href="http://docs.travis-ci.com/user/deployment/s3/">S3 deployment</a> built-in, all we need to do is tell it which directory (<code>local-dir</code>) to put where (your <code>bucket</code> and its related <code>endpoint</code> and <code>region</code>). Note that we’re also using our AWS keys - the variable names used here must match the names we provided in the environment variables section earlier.</li>
<li><code>notifications</code> - By default, Travis will email you the results of each build. I’ve configured mine to email only when a build fails, but there are other <a href="http://docs.travis-ci.com/user/notifications/">notification options</a> as well.</li>
</ul>
<p>The above is really just a subset of the functionality Travis provides - you can even declare scripts to be run before and after install, or before and after your deploy. Check out the <a href="http://docs.travis-ci.com/user/build-configuration/">build configuration</a> section of the docs if you’re interested in learning more.</p>
<p>Now, every time I push a commit to GitHub, Travis will clone my repo, <code>cd</code> to it, build, and deploy my site all based on what’s in my <code>.travis.yml</code> file. And I get to focus on writing.</p>
<p>Have questions? <a href="https://twitter.com/gjreda">Let me know</a>.</p>Web Scraping 201: finding the API2015-02-15T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2015-02-15:/2015/02/15/web-scraping-finding-the-api/<p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python</a>, showing how to use multithreading to speed things up.</li>
<li><a href="http://www.gregreda.com/2020/11/17/scraping-pages-behind-login-forms/">Scraping Pages Behind Login Forms</a>, which shows how to log into sites using Python.</li>
</ol>
<hr>
<p><strong>Update</strong>: Sorry folks, it looks like the NBA doesn't make shot log data accessible anymore. The same principles of this post still apply, but the particular example used is no longer functional. I do not intend to rewrite this post.</p>
<hr>
<p>Previously, I explained <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">how to scrape a page</a> where the data is rendered <em>server-side</em>. However, the increasing popularity of Javascript frameworks such as <a href="https://angularjs.org">AngularJS</a> coupled with <a href="http://en.wikipedia.org/wiki/Representational_state_transfer#Applied_to_web_services">RESTful APIs</a> means that fewer sites are generated server-side and are instead being rendered <em>client-side</em>.</p>
<p>In this post, I’ll give a brief overview of the differences between the two and show how to find the underlying API, allowing you to get the data you’re looking for.</p>
<h2>Server-side vs client-side</h2>
<p>Imagine we have a database of sports statistics and would like to build a web application on top of it (e.g. something like <a href="http://www.basketball-reference.com/">Basketball Reference</a>).</p>
<p>If we build our web app using a server-side framework like <a href="https://www.djangoproject.com/">Django</a> [1], something akin to the following happens each time a user visits a page.</p>
<ol>
<li>User’s browser sends a request to the server hosting our application.</li>
<li>Our server processes the request, checking to make sure the URL requested exists (amongst other things).</li>
<li>If the requested URL does not exist, send an error back to the user’s browser and direct them to a <a href="http://en.wikipedia.org/wiki/HTTP_404#Custom_error_pages">404 page</a>.</li>
<li>If the requested URL does exist, execute some code <em>on the server</em> which gets data from our database. Let’s say the user wants to see <a href="http://www.basketball-reference.com/players/w/walljo01/gamelog/2015/">John Wall’s game-by-game stats</a> for the 2014-15 NBA season. In this case, our Django/Python code queries the database and receives the data.</li>
<li>Our Django/Python code injects the data into our application’s <a href="http://en.wikipedia.org/wiki/Web_template_system">templates</a> to complete the HTML for the page.</li>
<li>Finally, the server sends the HTML to the user’s browser (a <em>response</em> to their <em>request</em>) and the page is displayed.</li>
</ol>
<p>To illustrate the last step, go to <a href="http://www.basketball-reference.com/players/w/walljo01/gamelog/2015/">John Wall’s game log</a> and <a href="view-source:http://www.basketball-reference.com/players/w/walljo01/gamelog/2015/">view the page source</a>. Ctrl+f or Cmd+f and search for “2014-10-29”. This is the first row of the game-by-game stats table. We know the page was created server-side because the data is present in the page source.</p>
<p>However, if the web application is built with a client-side framework like Angular, the process is slightly different. In this case, the server still sends the static content (the HTML, CSS, and Javascript), but the HTML is only a template - it doesn’t hold any data. Separately, the Javascript in the server response fetches the data from an API and uses it to create the page <em>client-side</em>.</p>
<p>To illustrate, view the source of <a href="http://stats.nba.com/player/#!/202322/tracking/shotslogs/">John Wall’s shot log</a> page on NBA.com - there’s no data to scrape! <a href="view-source:http://stats.nba.com/player/#!/202322/tracking/shotslogs/">See for yourself</a>. Ctrl+f or Cmd+f for “Was @”. Despite there being many instances of it in the shot log table, none are found in the page source.</p>
<p>If you’re thinking “Oh crap, I can’t scrape this data,” well, you’re in luck! Applications using an API are often <em>easier</em> to scrape - you just need to know how to find the API. Which means I should probably tell you how to do that.</p>
<h2>Finding the API</h2>
<p>With a client-side app, your browser is doing much of the work. And because your browser is what’s rendering the HTML, we can use it to see where the data is coming from using its built-in developer tools.</p>
<p>To illustrate, I’ll be using Chrome, but Firefox should be more or less the same (Internet Explorer users … you should switch to Chrome or Firefox and not look back).</p>
<p>To open Chrome’s Developer Tools, go to View -> Developer -> Developer Tools. In Firefox, it’s Tools -> Web Developer -> Toggle Tools. We’ll be using the Network tab, so click on that one. It should be empty.</p>
<p>Now, go to the page that has your data. In this case, it’s <a href="http://stats.nba.com/player/#!/202322/tracking/shotslogs/">John Wall’s shot logs</a>. If you’re already on the page, hit refresh. Your Network tab should look similar to this:</p>
<p><img alt="network tab example" src="/images/scraping-network-tab.png"></p>
<p>Next, click on the XHR filter. XHR is short for <a href="http://en.wikipedia.org/wiki/XMLHttpRequest">XMLHttpRequest</a> - this is the type of request used to fetch XML or JSON data. You should see a couple entries in this table (screenshot below). One of them is the API request that returns the data you’re looking for (in this case, John Wall’s shots).</p>
<p><img alt="XHR requests example" src="/images/scraping-xhr-tab.png"></p>
<p>At this point, you’ll need to explore a bit to determine which request is the one you want. For our example, the one starting with “playerdashptshotlog” sounds promising. Let’s click on it and view it in the Preview tab. Things should now look like this:</p>
<p><img alt="API response preview" src="/images/scraping-api-preview.png"></p>
<p>Bingo! That’s the API endpoint. We can use the Preview tab to explore the response.</p>
<p><img alt="API results preview" src="/images/scraping-api-results-preview.png"></p>
<p>You should see a couple of objects:</p>
<ol>
<li>The resource name - <em>playerdashptshotlog</em>.</li>
<li>The parameters (you might need to expand the resource section). These are the request parameters that were passed to the API. You can think of them like the <code>WHERE</code> clause of a SQL query. This request has parameters of <code>Season=2014-15</code> and <code>PlayerID=202322</code> (amongst others). Change the parameters in the URL and you’ll get different data (more on that in a bit).</li>
<li>The result sets. This is self-explanatory.</li>
<li>Within the result sets, you’ll find the headers and row set. Each object in the row set is essentially the result of a database query, while the headers tell you the column order. We can see that the first item in each row corresponds to the Game_ID, while the second is the Matchup.</li>
</ol>
<p>Now, if you go to the Headers tab, grab the request URL, and open it in a new browser tab, you’ll see the data we’re looking for (example below). Note that I'm using <a href="https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc?hl=en">JSONView</a>, which nicely formats JSON in your browser.</p>
<p><img alt="API response" src="/images/scraping-api-response.png"></p>
<p>To grab this data, we can use something like Python’s <a href="http://docs.python-requests.org/en/latest/">requests</a>. Here’s an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">requests</span>
<span class="n">shots_url</span> <span class="o">=</span> <span class="s1">'http://stats.nba.com/stats/playerdashptshotlog?'</span><span class="o">+</span> \
<span class="s1">'DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&'</span> <span class="o">+</span> \
<span class="s1">'Location=&Month=0&OpponentTeamID=0&Outcome=&Period=0&'</span> <span class="o">+</span> \
<span class="s1">'PlayerID=202322&Season=2014-15&SeasonSegment=&'</span> <span class="o">+</span> \
<span class="s1">'SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision='</span>
<span class="c1"># request the URL and parse the JSON</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">shots_url</span><span class="p">)</span>
<span class="n">response</span><span class="o">.</span><span class="n">raise_for_status</span><span class="p">()</span> <span class="c1"># raise exception if invalid response</span>
<span class="n">shots</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s1">'resultSets'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s1">'rowSet'</span><span class="p">]</span>
<span class="c1"># do whatever we want with the shots data</span>
<span class="n">do_things</span><span class="p">(</span><span class="n">shots</span><span class="p">)</span>
</code></pre></div>
<p>That’s it. Now you have the data and can get to work.</p>
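<p>Since the response pairs one <code>headers</code> list with many <code>rowSet</code> rows, it's often convenient to zip them together so each row becomes a dictionary keyed by column name. A small sketch - the headers and rows below are illustrative, not the API's full schema:</p>

```python
def rows_to_dicts(result_set):
    """Pair each row in rowSet with the result set's headers."""
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]

# Illustrative structure only -- mirrors the headers/rowSet shape above
example = {
    "headers": ["GAME_ID", "MATCHUP", "SHOT_NUMBER", "SHOT_MADE_FLAG"],
    "rowSet": [["0021400010", "WAS @ MIA", 1, 1],
               ["0021400010", "WAS @ MIA", 2, 0]],
}
records = rows_to_dicts(example)  # each record is now a dict keyed by column
```

From here, each shot can be accessed by column name (e.g. <code>records[0]["MATCHUP"]</code>) rather than by positional index.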
<p>Note that passing different parameter values to the API yields different results. For instance, change the Season parameter to 2013-14 - now you have John Wall’s shots for the 2013-14 season. Change the PlayerID to 201935 - now you have James Harden’s shots.</p>
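<p>Rather than concatenating that long query string by hand as above, you can let the standard library build it. A sketch - the parameter names come from the request we inspected, though the live API may also insist on the remaining empty parameters shown earlier:</p>

```python
from urllib.parse import urlencode  # urllib.urlencode on Python 2

def shot_log_url(player_id, season):
    """Build a playerdashptshotlog URL for a given player and season."""
    base = "http://stats.nba.com/stats/playerdashptshotlog"
    params = {"PlayerID": player_id, "Season": season,
              "SeasonType": "Regular Season", "LeagueID": "00", "TeamID": "0"}
    return base + "?" + urlencode(params)

url = shot_log_url("201935", "2013-14")  # James Harden, 2013-14 season
```

Swapping the arguments is all it takes to pull a different player or season.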
<p>Additionally, different APIs return different types of data. Some might send XML; others, JSON. Some might store the results in an array of arrays; others, an array of maps or dictionaries. Some might not return the column headers at all. Things vary between sites.</p>
<p>Ever been in a situation where you couldn't find the data you were looking for in the page source? Well, now you know how to find it.</p>
<p><em>Was there something I missed? Have questions? <a href="https://twitter.com/gjreda">Let me know</a>.</em></p>
<hr class="small">
<p>[1] Really this can be any server-side framework - Ruby on Rails, PHP’s Drupal or CodeIgniter, etc.</p>[Talk] Translating SQL to pandas2014-12-22T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2014-12-22:/2014/12/22/translating-sql-to-pandas-video/<p>A few weeks ago, I gave a <a href="http://pandas.pydata.org/">pandas</a> tutorial at <a href="http://pydata.org/nyc2014/">PyData NYC</a> titled "Translating SQL to pandas. And back." I don't remember why I put the "And back" in there - if you can translate things one way, you can translate them the other way, too.</p>
<p>Anyway, here's the abstract:</p>
<blockquote>
<p>SQL …</p></blockquote><p>A few weeks ago, I gave a <a href="http://pandas.pydata.org/">pandas</a> tutorial at <a href="http://pydata.org/nyc2014/">PyData NYC</a> titled "Translating SQL to pandas. And back." I don't remember why I put the "And back" in there - if you can translate things one way, you can translate them the other way, too.</p>
<p>Anyway, here's the abstract:</p>
<blockquote>
<p>SQL is still the bread-and-butter of the data world, and data analysts/scientists/engineers need to have some familiarity with it as the world runs on relational databases.</p>
<p>When first learning pandas (and coming from a database background), I found myself wanting to be able to compare equivalent pandas and SQL statements side-by-side, knowing that it would allow me to pick up the library quickly, but most importantly, apply it to my workflow.</p>
<p>This tutorial will provide an introduction to both syntaxes, allowing those inexperienced with either SQL or pandas to learn a bit of both, while also bridging the gap between the two, so that practitioners of one can learn the other from their perspective. Additionally, I'll discuss the tradeoffs between each and why one might be better suited for some tasks than the other.</p>
</blockquote>
<p>Having never been to a technical conference, much less given a talk at one, it was quite a new experience for me - and something I'd like to do again.</p>
<p>I highly recommend giving a talk at an event like PyData if you ever have the opportunity. And if you think you don't have anything interesting to say, or aren't experienced enough to give a tutorial, or are just plain nervous ... don't worry, I felt all those things too. You should do it anyway.</p>
<p>Below is the video of my talk. You can find the accompanying materials <a href="https://github.com/gjreda/pydata2014nyc">here</a>.</p>
<div class="center">
<iframe width="560" height="315" src="//www.youtube.com/embed/1uVWjdAbgBg" allowfullscreen></iframe>
</div>Scraping Craigslist for sold out concert tickets2014-07-27T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2014-07-27:/2014/07/27/scraping-craigslist-for-tickets/<p>Recently, I've been listening to a lot of lo-fi rock band, <a href="http://en.wikipedia.org/wiki/Cloud_Nothings">Cloud Nothings</a>. Their album, <a href="http://www.amazon.com/gp/product/B00HZJH97Q/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B00HZJH97Q&linkCode=as2&tag=gjreda-20&linkId=H7HYP35ZYKFAKH7H">Here & Nowhere Else</a>, has been <a href="http://www.metacritic.com/music/here-and-nowhere-else/cloud-nothings">critically lauded</a>, including <a href="http://pitchfork.com/reviews/albums/19075-cloud-nothings-here-and-nowhere-else/">garnering "Best New Music" from Pitchfork</a>. As a result, when they came to Chicago's tiny Lincoln Hall in May, tickets sold out in a hurry - well before …</p><p>Recently, I've been listening to a lot of lo-fi rock band, <a href="http://en.wikipedia.org/wiki/Cloud_Nothings">Cloud Nothings</a>. Their album, <a href="http://www.amazon.com/gp/product/B00HZJH97Q/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B00HZJH97Q&linkCode=as2&tag=gjreda-20&linkId=H7HYP35ZYKFAKH7H">Here & Nowhere Else</a>, has been <a href="http://www.metacritic.com/music/here-and-nowhere-else/cloud-nothings">critically lauded</a>, including <a href="http://pitchfork.com/reviews/albums/19075-cloud-nothings-here-and-nowhere-else/">garnering "Best New Music" from Pitchfork</a>. As a result, when they came to Chicago's tiny Lincoln Hall in May, tickets sold out in a hurry - well before I found out about the show. Desperately wanting to go, I started checking Craigslist every day or two for tickets.</p>
<p>Lincoln Hall only holds about 500 people, so Craigslist postings were few and far between. When a post did pop up, I always ended up seeing it a couple hours after it was posted and was too late - the tickets had been sold. Noticing that my frustration was beginning to grow, I figured it was time to automate my Craigslist searches for tickets.</p>
<p>If you search on Craigslist and look at the URL of the results page, you'll notice that it looks very similar to this:</p>
<p><img alt="Craigslist Search Results URL" src="/images/craigslist-search-results-url.png"></p>
<p>Note the section that says <code>query=this+is+my+search+term</code> - that's where your search term gets passed to the databases that back Craigslist (with spaces replaced by + signs). This means we can write code to automate any "for sale" search by hitting <code>http://<city>.craigslist.org/search/sss?query=<term></code> where <code><city></code> corresponds to the subdomain of your city's respective Craigslist and <code><term></code> is our search term.</p>
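<p>That substitution is easy to script. A quick sketch using the standard library, which also handles the space-to-plus replacement for us:</p>

```python
from urllib.parse import quote_plus  # urllib.quote_plus on Python 2

def search_url(city, term):
    """Build a Craigslist 'for sale' search URL for a city and search term."""
    return "http://{0}.craigslist.org/search/sss?query={1}".format(
        city, quote_plus(term))
```

For example, <code>search_url("chicago", "cloud nothings")</code> yields the kind of URL shown above.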
<p>For my use case, there were very few Craigslist results for each search of "Cloud Nothings" and none of them were spammy. I decided to write a script which would run every 10 minutes and send me a text message if any of the results were new. If I got a text, I could quickly head over to Craigslist, email the seller, and go back about my day. I was lucky that ticket brokers hadn't started putting "Cloud Nothings" in their spammy posts - if they had, this solution likely would not have worked - the text messages would have been more noise than signal.</p>
<p>Thankfully, it worked. I was able to get a ticket for face value two nights before the show.</p>
<p>In the sections below, I'll walk through the code behind it all. If you're unfamiliar with web scraping, I suggest reading my previous posts <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">here</a> and <a href="http://www.gregreda.com/2013/05/06/more-web-scraping-with-python/">here</a>.</p>
<h3>Code Walk-Through</h3>
<p>Most of the code's functionality is contained within the four functions below.</p>
<h4>parse_results</h4>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">parse_results</span><span class="p">(</span><span class="n">search_term</span><span class="p">):</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">search_term</span> <span class="o">=</span> <span class="n">search_term</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">' '</span><span class="p">,</span> <span class="s1">'+'</span><span class="p">)</span>
<span class="n">search_url</span> <span class="o">=</span> <span class="n">BASE_URL</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">search_term</span><span class="p">)</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">urlopen</span><span class="p">(</span><span class="n">search_url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
<span class="n">rows</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'div'</span><span class="p">,</span> <span class="s1">'content'</span><span class="p">)</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">'p'</span><span class="p">,</span> <span class="s1">'row'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">rows</span><span class="p">:</span>
<span class="n">url</span> <span class="o">=</span> <span class="s1">'http://chicago.craigslist.org'</span> <span class="o">+</span> <span class="n">row</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s1">'href'</span><span class="p">]</span>
<span class="n">create_date</span> <span class="o">=</span> <span class="n">row</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'span'</span><span class="p">,</span> <span class="s1">'date'</span><span class="p">)</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">row</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">'a'</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
<span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s1">'url'</span><span class="p">:</span> <span class="n">url</span><span class="p">,</span> <span class="s1">'create_date'</span><span class="p">:</span> <span class="n">create_date</span><span class="p">,</span> <span class="s1">'title'</span><span class="p">:</span> <span class="n">title</span><span class="p">})</span>
<span class="k">return</span> <span class="n">results</span>
</code></pre></div>
<p>The above function takes a <code>search_term</code>, which is used to execute a search on Craigslist. It returns a list of dictionaries, where each dictionary represents a post found within the search results.</p>
<p>Note the global <code>BASE_URL</code> variable - this is the search results URL mentioned earlier. Here, we're injecting our search term into the section of the URL that had <code>query=<term></code>.</p>
<p>The majority of this function utilizes <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> to parse the HTML of Craigslist's search results page. For each post in the search results, we store the URL of the post, its creation date, and its title.</p>
<p>In the next function, we'll write these results to a CSV file, which we'll later use to check whether or not there are "new" posts.</p>
<h4>write_results</h4>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">write_results</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
<span class="sd">"""Writes list of dictionaries to file."""</span>
<span class="n">fields</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'results.csv'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">dw</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictWriter</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">fields</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s1">'|'</span><span class="p">)</span>
<span class="n">dw</span><span class="o">.</span><span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">dw</span><span class="o">.</span><span class="n">fieldnames</span><span class="p">)</span>
<span class="n">dw</span><span class="o">.</span><span class="n">writerows</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
</code></pre></div>
<p>As mentioned above, <code>write_results</code> takes a list of dictionaries and writes them to a CSV file called <code>results.csv</code>. Each line of the file will store a post's title, create date, and URL.</p>
<p>You can think of this file similarly to how you might think of a database - we're storing information that we'll need to refer to later on. Since we aren't storing much data, there's really no need to use something like SQLite, MySQL or any other datastore - a text file works just fine for our use case. I'm a big proponent of <a href="http://en.wikipedia.org/wiki/KISS_principle">KISS methodology</a> (Keep It Simple, Stupid).</p>
<h4>has_new_records</h4>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">has_new_records</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
<span class="n">current_posts</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">results</span><span class="p">]</span>
<span class="n">fields</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="s1">'results.csv'</span><span class="p">):</span>
<span class="k">return</span> <span class="kc">True</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'results.csv'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictReader</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">fields</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s1">'|'</span><span class="p">)</span>
<span class="n">seen_posts</span> <span class="o">=</span> <span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">]</span>
<span class="n">is_new</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">for</span> <span class="n">post</span> <span class="ow">in</span> <span class="n">current_posts</span><span class="p">:</span>
<span class="k">if</span> <span class="n">post</span> <span class="ow">in</span> <span class="n">seen_posts</span><span class="p">:</span>
<span class="k">pass</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">is_new</span> <span class="o">=</span> <span class="kc">True</span>
<span class="k">return</span> <span class="n">is_new</span>
</code></pre></div>
<p>This function determines whether or not any of the posts are new (not present in the results from the last time our code was run).</p>
<p>It takes a list of dictionaries (exactly the same as the one <code>parse_results</code> returns) and checks it against the CSV file we created with the <code>write_results</code> function. Since a URL can only point to one post, we can consider it a <a href="http://en.wikipedia.org/wiki/Unique_key">unique key</a> to check against.</p>
<p>If any of the URLs in results are not found within the CSV file, this function will return <code>True</code>, which we'll use as a trigger to sending off a text message as notification.</p>
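<p>Because the URL is a unique key, the same check can be written as a set difference, which additionally tells you <em>which</em> posts are new rather than just whether any are:</p>

```python
def new_posts(current_urls, seen_urls):
    """URLs present in this run's results but absent from the last run."""
    return set(current_urls) - set(seen_urls)
```

An empty set means nothing new; a non-empty one could go straight into the notification message.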
<h4>send_text</h4>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">send_text</span><span class="p">(</span><span class="n">phone_number</span><span class="p">,</span> <span class="n">msg</span><span class="p">):</span>
<span class="n">fromaddr</span> <span class="o">=</span> <span class="s2">"Craigslist Checker"</span>
<span class="n">toaddrs</span> <span class="o">=</span> <span class="n">phone_number</span> <span class="o">+</span> <span class="s2">"@txt.att.net"</span>
<span class="n">msg</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"From: </span><span class="si">{0}</span><span class="se">\r\n</span><span class="s2">To: </span><span class="si">{1}</span><span class="se">\r\n\r\n</span><span class="si">{2}</span><span class="s2">"</span><span class="p">)</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">fromaddr</span><span class="p">,</span> <span class="n">toaddrs</span><span class="p">,</span> <span class="n">msg</span><span class="p">)</span>
<span class="n">server</span> <span class="o">=</span> <span class="n">smtplib</span><span class="o">.</span><span class="n">SMTP</span><span class="p">(</span><span class="s1">'smtp.gmail.com:587'</span><span class="p">)</span>
<span class="n">server</span><span class="o">.</span><span class="n">starttls</span><span class="p">()</span>
<span class="n">server</span><span class="o">.</span><span class="n">login</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">email</span><span class="p">[</span><span class="s1">'username'</span><span class="p">],</span> <span class="n">config</span><span class="o">.</span><span class="n">email</span><span class="p">[</span><span class="s1">'password'</span><span class="p">])</span>
<span class="n">server</span><span class="o">.</span><span class="n">sendmail</span><span class="p">(</span><span class="n">fromaddr</span><span class="p">,</span> <span class="n">toaddrs</span><span class="p">,</span> <span class="n">msg</span><span class="p">)</span>
<span class="n">server</span><span class="o">.</span><span class="n">quit</span><span class="p">()</span>
</code></pre></div>
<p><code>send_text</code> requires two parameters - the first being the 10-digit phone number that will receive the SMS message, and the second being the content of the message.</p>
<p>This function makes use of the <a href="http://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol">Simple Mail Transfer Protocol</a> (or SMTP) as well as AT&T's email-to-SMS gateway (notice the <code>@txt.att.net</code>). This allows us to use a GMail account to send the text message.</p>
<p>Note that if you are not a GMail user or do not use AT&T for your cell phone service, you'll need to make some changes to this function. You can find a list of other email-to-SMS gateways <a href="http://www.emailtextmessages.com/">here</a>.</p>
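<p>One way to make the carrier swappable is a small lookup table. The gateway addresses below are common US ones, but verify them against your carrier - they change over time:</p>

```python
# Common US email-to-SMS gateways -- verify with your carrier, since
# these addresses change over time.
SMS_GATEWAYS = {
    "att": "txt.att.net",
    "verizon": "vtext.com",
    "tmobile": "tmomail.net",
    "sprint": "messaging.sprintpcs.com",
}

def sms_address(phone_number, carrier):
    """Email address that a carrier delivers as a text message."""
    return phone_number + "@" + SMS_GATEWAYS[carrier]
```

Then <code>send_text</code> could build its <code>toaddrs</code> with <code>sms_address(phone_number, "att")</code> instead of hardcoding the gateway.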
<p>Since this function uses my GMail credentials, I've stored them in a separate Python file which I am referencing when I call <code>config.email['username']</code> and <code>config.email['password']</code>. You can find the config setup <a href="https://github.com/gjreda/craigslist-checker/blob/master/config.py">here</a>. Just make sure you don't accidentally check in your GMail credentials if you're putting this on GitHub.</p>
<h4>Putting it all together</h4>
<p>You can take a look at the final script <a href="https://github.com/gjreda/craigslist-checker/blob/master/craigslist.py">here</a>. Feel free to use it however you'd like. Deploying it is as simple as spinning up a micro EC2 instance and setting up a cronjob to run the script as often as you'd like.</p>
<p>Did you like this post? Was there something I missed? <a href="https://twitter.com/gjreda">Let me know on Twitter</a>.</p>Principles of good data analysis2014-03-23T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2014-03-23:/2014/03/23/principles-of-good-data-analysis/<p>Data analysis is hard.</p>
<p>What makes it hard is the intuitive aspect of it - knowing the direction you want to take based on the limited information you have at the moment. Additionally, it's communicating the results and showing <em>why</em> your analysis is right that makes this all the more difficult …</p><p>Data analysis is hard.</p>
<p>What makes it hard is the intuitive aspect of it - knowing the direction you want to take based on the limited information you have at the moment. Additionally, it's communicating the results and showing <em>why</em> your analysis is right that makes this all the more difficult - doing it deeply, at scale, and in a <em>consistent</em> fashion.</p>
<p>Having been a part of many of these deep-dive analyses, I've noticed some "principles" that I've found useful to follow throughout.</p>
<h4>Know your approach</h4>
<p>Before you begin the analysis, know the questions you're trying to answer and what you're trying to accomplish - don't fall into an analytical rabbit hole. Additionally, you should know some basic things about your potential data - what data sources are available to answer the questions? How is that data structured? Is it in a database? CSVs? Third-party APIs? What tools will you be able to use for the analysis?</p>
<p>Your approach will likely change throughout, but it's helpful to start with a plan and adjust.</p>
<h4>Know how the data was generated</h4>
<p>Once you've settled on your approach and data sources, you need to make sure you understand how the data was generated or captured, especially if you are using your own company's data.</p>
<p>For instance, let's say you're a data scientist at Amazon and you're doing some analysis on orders. Let's assume there's a table somewhere in the Amazon world called "orders" that stores data about an order. Does this table store incomplete orders? What is the interaction on Amazon.com that creates a new record in this table? If I start an order and do not <em>fully</em> complete the payment flow, will a record have been written to this table? What <em>exactly</em> does each field in the table mean?</p>
<p>You need to know this level of detail in order to have confidence in your analysis - your audience will ask these questions.</p>
<h4>Profile your data</h4>
<p>Once you're confident you're looking at the right data, you need to develop some familiarity with it. Not only will this allow you to gain a basic understanding of what you're looking at, but it also allows you to gain a certain level of comfort that things are still "right" later on in the analysis.</p>
<p>For example, I was once helping a friend analyze a fairly large time series dataset (~10GB). The results of the analysis didn't intuitively jive with me - something felt off. When digging deeper into the analysis, I decided to plot the events by date and noticed we had two days without any data - that shouldn't have been the case.</p>
<p>Profiling your data early on helps to ensure your work throughout the analysis - you'll notice sooner when something is "off."</p>
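<p>Gaps like the one I mentioned above are easy to catch programmatically. A sketch that flags any calendar days with no events at all:</p>

```python
from datetime import date, timedelta

def missing_days(event_dates):
    """Days between the first and last event that have no data at all."""
    seen = set(event_dates)
    day, last = min(seen), max(seen)
    gaps = []
    while day <= last:
        if day not in seen:
            gaps.append(day)  # a day inside the range with zero events
        day += timedelta(days=1)
    return gaps
```

Running a check like this at the start of an analysis would have surfaced those two missing days immediately.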
<h4>Facet all the things</h4>
<p>I'm increasingly convinced that <a href="http://en.wikipedia.org/wiki/Simpson's_paradox">Simpson's Paradox</a> is one of the most important things for anyone working with data to understand. In cases of Simpson's paradox, a trend appearing in different groups of data disappears when the groups are combined and looked at in aggregate. It illustrates the importance of looking at your data by multiple dimensions.</p>
<p>As an example, take a look at the below table.</p>
<p><img alt="Simpson's paradox (combined)" src="/images/simpsons-paradox-combined.png"></p>
<p>The above table shows admission rates for men and women into the University of California, Berkeley's graduate programs for the fall of 1973. Based on the above numbers, the University was sued for an alleged bias against women. However, when faceting the data by sex AND department, we see women were actually admitted into many departments' graduate programs at a rate higher than men.</p>
<p><img alt="Simpson's paradox (splits)" src="/images/simpsons-paradox-split.png"></p>
<p>This is probably the most infamous case of Simpson's paradox. The folks over at Berkeley's VUDLab have put together a <a href="http://vudlab.com/simpsons/">fantastic visualization</a> allowing you to explore the data further.</p>
<p>When going through your data, do so with Simpson's paradox in mind. It's extremely important to understand how aggregate statistics can be misleading and why looking at your data from multiple facets is necessary.</p>
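<p>A toy numeric example (made-up numbers, not the actual Berkeley data) makes the mechanism concrete: women are admitted at a higher rate in every department, yet lose in aggregate because they applied more heavily to the harder department:</p>

```python
# (admitted, applied) by department and sex -- illustrative numbers only
admissions = {
    ("easy", "men"):   (80, 100),
    ("easy", "women"): (18, 20),
    ("hard", "men"):   (4, 20),
    ("hard", "women"): (25, 100),
}

def rate(*cells):
    """Overall admission rate across one or more (admitted, applied) cells."""
    admitted = sum(a for a, n in cells)
    applied = sum(n for a, n in cells)
    return admitted / float(applied)

# Within each department, women win: 0.90 vs 0.80 and 0.25 vs 0.20.
# In aggregate, men win: 84/120 = 0.70 vs 43/120 = 0.36.
```

The aggregate numbers aren't wrong - they're just answering a different question than the per-department ones.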
<h4>Be skeptical</h4>
<p>In addition to profiling and faceting your data, you <em>need</em> to be skeptical throughout your analysis. If something doesn't look or feel right, it probably isn't. Pore through your data to make sure nothing unexpected is going on, and if there <em>is</em> something unexpected, make sure you understand why it's occurring and are comfortable with it before you proceed.</p>
<p>I'd argue that no data is better than incorrect data in most cases. Make sure the base layer of your analysis is correct.</p>
<h4>Think like a trial lawyer</h4>
<p>A good trial attorney will prepare their case while also considering how the opposition might respond. When the opposition presents a new piece of evidence or testimony, our attorney will (hopefully) have prepared for that very piece, allowing them to counter in a meaningful way.</p>
<p>Much like a good trial attorney, you need to think ahead and consider the audience of your analysis and the questions they might ask. Preparing appropriately for those questions will lend credibility to your work. No one likes to hear "I'm not sure, I didn't look at that" and you don't want to be caught flat-footed.</p>
<h4>Clarify your assumptions</h4>
<p>It's unlikely that your data is perfect, and it probably doesn't capture everything you need to complete a thorough and exhaustive analysis - you'll need to make some assumptions throughout your work. These need to be explicitly stated when you're sharing results.</p>
<p>Additionally, your stakeholders are crucial in helping you determine your assumptions. You should be working with them and other domain experts to ensure your assumptions are logical and unbiased.</p>
<h4>Check your work</h4>
<p>It seems obvious, but people just don't check their work sometimes. Understandably, there are deadlines, quick turnarounds, and last minute requests; however, I can assure you that your audience would rather your results be correct than quick.</p>
<p>I find it useful to regularly check the basic statistics of the data (sums, counts, etc.) throughout an analysis in order to make sure nothing is lost along the way - essentially creating a trail of breadcrumbs I can follow backwards in case something doesn't seem right later on.</p>
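<p>A minimal sketch of that breadcrumb habit, using hypothetical order data - record the base totals up front, then assert they survive each transformation:</p>

```python
import pandas as pd

# Hypothetical orders data; the checks, not the data, are the point.
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'region':   ['east', 'east', 'west', 'west'],
    'revenue':  [100.0, 250.0, 75.0, 300.0],
})

# Breadcrumbs: capture base statistics before transforming anything.
total_rows = len(orders)
total_revenue = orders.revenue.sum()

# ... joins, filters, aggregations ...
by_region = orders.groupby('region').revenue.sum()

# If either assertion fails, something was lost (or duplicated) along the way.
assert by_region.sum() == total_revenue
assert orders.order_id.nunique() == total_rows
```

<p>When an assertion fails mid-analysis, you know exactly which step to walk back to.</p>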
<h4>Communicate</h4>
<p>Lastly, the whole process should be a conversation with stakeholders - don't work in a silo. It's possible your audience isn't necessarily concerned with decimal point accuracy - maybe they just want to understand directional impact.</p>
<p>In the end, remember that data analysis is most often about <em>solving a problem</em> and that problem has stakeholders - you should be working <em>with</em> them to answer the questions that are most important; not necessarily those that are most interesting. Interesting doesn't always mean "valuable."</p>Finding the midpoint of film releases2014-01-23T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2014-01-23:/2014/01/23/film-releases-midpoint/<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<script src="http://d3js.org/d3.v3.min.js"></script>
<style>
.chart {
font: 12px sans-serif;
}
.axis path,
.axis line {
fill: none;
stroke: #000;
shape-rendering: crispEdges;
}
.x.axis path {
/*display: none;*/
}
.line {
fill: none;
stroke: steelblue;
stroke-width: 1.5px;
}
.bar {
fill: steelblue;
}
.overlay {
fill: none;
pointer-events: all;
}
.focus circle {
fill: none;
stroke: steelblue;
}
.mouseover-text {
color: black;
/*font-weight: bold;*/
font-size: 14px;
}
</style>
<blockquote>
<p>"We're talking about Thunderdome. It's from before you were born."</p>
<p>"Most movies are from before I was born."</p>
</blockquote>
<p>That statement spurred a pretty interesting question: <em>what's the date where that statement is no longer true?</em> Put another way, <em>what date in history has an equal number of films made before and after it?</em></p>
<p>My birthday, November 4, 1985, <em>felt</em> like a relatively safe date, but really, no one had a clue if what I said was true, including me. Guesses by about a dozen coworkers included dates from September 1963 all the way to September 2001.</p>
<p>Knowing that <a href="http://imdb.com">IMDB</a> makes their <a href="http://www.imdb.com/interfaces">data publicly available</a>, I decided to find the actual date. Using the most current <a href="ftp://ftp.fu-berlin.de/pub/misc/movies/database/release-dates.list.gz">releases.list file</a> (1/17/14 at the time of writing), I made the following assumptions:</p>
<ol>
<li>
<p>Only films count. The release-dates.list also includes TV shows and video games, which were excluded. Movies that went straight to video do count.</p>
</li>
<li>
<p>Films with a release date in the future do not count.</p>
</li>
<li>
<p>If the film was released multiple times (different release dates for different countries), use the earliest release date.</p>
</li>
<li>
<p>If only a release month and year were provided, assume the 15th of that month.</p>
</li>
<li>
<p>If only a release year was provided, assume <a href="http://en.wikipedia.org/wiki/July_2">July 2nd</a> of that year.</p>
</li>
</ol>
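<p>With the dates parsed under those assumptions, finding the midpoint is just taking the median release date - the date with an equal number of films on either side. A minimal sketch, using a hypothetical handful of dates rather than the full IMDB list:</p>

```python
from datetime import date
import statistics

# Hypothetical release dates; in practice these come from parsing
# release-dates.list with the assumptions above.
releases = [date(1942, 11, 26), date(1977, 5, 25), date(1985, 7, 3),
            date(1999, 3, 31), date(2008, 7, 18), date(2012, 7, 20),
            date(2013, 11, 22)]

# Convert dates to ordinals so we can take a median, then convert back.
ordinals = sorted(d.toordinal() for d in releases)
midpoint = date.fromordinal(int(statistics.median(ordinals)))
```

<p>For an odd number of films the median is an exact date; for an even number, <code>statistics.median</code> averages the two middle ordinals, hence the <code>int()</code> truncation.</p>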
<p>The result?</p>
<div id="vis" class="chart"></div>
<p>May 15, 2002.</p>
<p>Of course, given the current rate at which films are being made, this analysis is already out of date.</p>
<p>For those interested, you can find the code <a href="https://github.com/gjreda/movie-release-timeline">here</a>.</p>
<script>
var path = "/data/movie-releases.tsv"
// dynamically generate chart width
var parentWidth = $("#content").innerWidth();
var margin = {top: 20, right: 50, bottom: 20, left: 50},
width = parentWidth - margin.left - margin.right,
height = (parentWidth/2.0) - margin.top - margin.bottom;
var monthNames = [ "Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
"July", "Aug.", "Sep.", "Oct.", "Nov.", "Dec." ];
var parseDate = d3.time.format("%Y-%m").parse,
bisectDate = d3.bisector(function(d) { return d.release_date; }).left,
formatPercent = function(d) { return (d * 100).toFixed(2) + "%"; },
formatDate = function(d) { return monthNames[d.getMonth()] + " " + d.getFullYear(); };
var x = d3.time.scale().range([0, width]);
var y = d3.scale.linear().range([height, 0]);
var xAxis = d3.svg.axis().scale(x).orient("bottom");
var yAxis = d3.svg.axis().scale(y).orient("left");
var line = d3.svg.line()
.x(function(d) { return x(d.release_date); })
.y(function(d) { return y(d.cumulative); });
var svg = d3.select("#vis").append("svg")
.attr("width", width + margin.left + margin.right)
.attr("height", height + margin.top + margin.bottom)
.append("g")
.attr("transform", "translate(" + margin.left + "," + margin.top + ")");
d3.tsv(path, function(error, data) {
data.forEach(function(d) {
d.release_date = parseDate(d.release_date);
d.percentage = +d.percentage;
d.cumulative = +d.cumulative;
});
x.domain(d3.extent(data, function(d) { return d.release_date; }));
y.domain(d3.extent(data, function(d) { return d.cumulative; }));
svg.append("g")
.attr("class", "x axis")
.attr("transform", "translate(0," + height + ")")
.call(xAxis);
svg.append("g")
.attr("class", "y axis")
.call(yAxis)
.append("text")
.attr("transform", "rotate(-90)")
.attr("y", 6)
.attr("dy", ".71em")
.style("text-anchor", "end")
.text("% of total films");
svg.append("path")
.datum(data)
.attr("class", "line")
.attr("d", line);
// mouseover labels
var focus = svg.append("g")
.attr("class", "focus")
.style("display", "none");
focus.append("circle")
.attr("r", 4.5);
focus.append("text")
.attr("x", 9)
.attr("dy", ".35em");
svg.append("rect")
.attr("class", "overlay")
.attr("width", width)
.attr("height", height)
.on("mouseover", function() { focus.style("display", null); })
.on("mouseout", function() { focus.style("display", "none"); })
.on("mousemove", mousemove);
var textArea = svg.append("text")
.attr("class", "mouseover-text")
.attr("x", width - 125)
.attr("y", height - 10)
.on("mouseover", function() { focus.style("display", null); })
.on("mouseout", function() { focus.style("display", "none"); })
.on("mousemove", mousemove);
function mousemove() {
var x0 = x.invert(d3.mouse(this)[0]),
i = bisectDate(data, x0, 1),
d0 = data[i - 1],
d1 = data[i],
d = x0 - d0.release_date > d1.release_date - x0 ? d1 : d0;
focus.attr("transform", "translate(" + x(d.release_date) + "," + y(d.cumulative) + ")");
textArea.text(formatDate(d.release_date) + ": " + formatPercent(d.cumulative));
}
});
</script>3-pointers after offensive rebounds2013-12-26T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-12-26:/2013/12/26/three-pointers-after-offensive-rebounds/<p>I love college basketball. A lot. My beloved Marquette Golden Eagles are probably the only sports team that can really put me in agony.</p>
<p>Last season, I was watching a <a href="https://www.espn.com/mens-college-basketball/recap?gameId=330730269">nail-biter against Notre Dame</a>. With 3:36 left, Notre Dame's Jack Cooley grabbed an offensive board and promptly passed out to Pat Connaughton for a successful three-pointer. It was a dagger.</p>
<p>Last season, I was watching a <a href="https://www.espn.com/mens-college-basketball/recap?gameId=330730269">nail-biter against Notre Dame</a>. With 3:36 left, Notre Dame's Jack Cooley grabbed an offensive board and promptly passed out to Pat Connaughton for a successful three-pointer. It was a dagger.</p>
<p>One thing stuck out to me though - after the shot, <a href="https://en.wikipedia.org/wiki/Jay_Bilas">Jay Bilas</a>, who was calling the game with <a href="https://en.wikipedia.org/wiki/Bill_Raftery">Bill Raftery</a> and <a href="https://en.wikipedia.org/wiki/Sean_McDonough">Sean McDonough</a>, stated that the best time to attempt a 3-pointer is after an offensive rebound.</p>
<p>Intuitively, this statement makes sense - the defensive front line is crashing the boards in hopes of getting a rebound to end the offensive possession, while the defensive backcourt is out on the wings looking for an outlet pass from their teammates, likely leaving their offensive counterparts unguarded.</p>
<p>I've never seen any data that indicates whether three pointers are more successful after an offensive rebound though, much less whether it's the best time to shoot one. It seemed like something worth investigating.</p>
<p>In the following analysis, we'll try to determine whether there is a material difference between "normal" 3P% (those not shot after an offensive rebound) and 3P% when the shot was preceded by an offensive rebound.</p>
<p>I'll be going step by step through data collection, munging, and analysis. If you're just interested in the answer, skip to the last section.</p>
<h3>Getting the data</h3>
<p>ESPN has <a href="https://www.espn.com/mens-college-basketball/playbyplay?gameId=330620221">play-by-play data</a> for almost every NCAA Division I game. I've written a python script that will collect all of this data for a given date range. You can find the script <a href="https://gist.github.com/gjreda/7175267">here</a>. If you're unfamiliar with web scraping, <a href="/2013/03/03/web-scraping-101-with-python/">check out the tutorial</a> I wrote previously.</p>
<h3>Analysis</h3>
<p>Now let's start our analysis using <a href="https://pandas.pydata.org/">pandas</a>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="c1"># read PSVs into DataFrame</span>
<span class="n">games</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">files</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">'*.psv'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'|'</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'game_id'</span><span class="p">]</span> <span class="o">=</span> <span class="n">f</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'.psv'</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span>
<span class="n">games</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Read </span><span class="si">{0}</span><span class="s1"> games'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">games</span><span class="p">)))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">Read 2931 games</span>
</code></pre></div>
<p>To start, we need to find all incidents of a three pointer immediately after an offensive rebound.</p>
<p>This format is kind of crappy though - since events are in separate columns for the home and away teams, we'd have to write logic to check against each column. Let's munge our data into a slightly different format - one column for <code>team</code>, which will indicate home/away, and another for <code>event</code>, which will store the description of what occurred.</p>
<div class="highlight"><pre><span></span><code><span class="n">games_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span><span class="n">games</span><span class="p">)</span>
<span class="c1"># add event_id to maintain event order</span>
<span class="c1"># we can use the index since pandas defaults to the Nth row of the file</span>
<span class="n">games_df</span><span class="p">[</span><span class="s1">'event_id'</span><span class="p">]</span> <span class="o">=</span> <span class="n">games_df</span><span class="o">.</span><span class="n">index</span>
<span class="c1"># melt data into one column for home/away and another for event</span>
<span class="c1"># maintain play order by sorting on event_id</span>
<span class="n">melted</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">melt</span><span class="p">(</span><span class="n">games_df</span><span class="p">,</span> <span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s1">'event_id'</span><span class="p">,</span> <span class="s1">'game_id'</span><span class="p">,</span> <span class="s1">'time'</span><span class="p">,</span> <span class="s1">'score'</span><span class="p">],</span>
<span class="n">var_name</span><span class="o">=</span><span class="s1">'team'</span><span class="p">,</span> <span class="n">value_name</span><span class="o">=</span><span class="s1">'event'</span><span class="p">)</span>
<span class="n">melted</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="s1">'event_id'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># drop rows with NaN events - an event only belongs to one team</span>
<span class="n">melted</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">notnull</span><span class="p">()]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">melted</span><span class="p">[</span><span class="mi">10</span><span class="p">:</span><span class="mi">15</span><span class="p">])</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> event_id game_id time score team event</span>
<span class="go">917522 10 330010066 19:11 0-0 home Melvin Ejim missed Free Throw.</span>
<span class="go">11 11 330010066 19:11 0-0 away Jeremiah Kreisberg Defensive Rebound.</span>
<span class="go">12 12 330010066 18:37 0-0 away Justin Sears missed Jumper.</span>
<span class="go">917525 13 330010066 18:37 0-0 home Percy Gibson Defensive Rebound.</span>
<span class="go">917526 14 330010066 18:31 0-0 home Chris Babb missed Three Point Jumper.</span>
</code></pre></div>
<p>We need to know whether the three pointers were missed or made - let's write a function called <code>get_shot_result</code> to extract the shot result from the <code>event</code> column. We can apply it to every row that contains a three pointer, storing the results in a new column called <code>shot_result</code>.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># label whether three pointers were made or missed</span>
<span class="n">get_shot_result</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">'(made|missed)'</span><span class="p">,</span> <span class="n">x</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">shot3</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Three Point'</span><span class="p">)</span>
<span class="n">melted</span><span class="p">[</span><span class="s1">'shot_result'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">shot3</span><span class="p">]</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">get_shot_result</span><span class="p">)</span>
</code></pre></div>
<p>Now let's write a function to label the events that meet our criteria - a three point attempt that was preceded by an offensive rebound. We can use <code>shift(1)</code> to reference the event on the previous row.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">criteria</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="sd">"""Labels if the three pointer was preceded by an offensive rebound."""</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'after_oreb'</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">df</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Three Point'</span><span class="p">))</span> <span class="o">&</span> \
<span class="n">df</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Offensive Rebound'</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">after_oreb</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="kc">False</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">melted</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'game_id'</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">criteria</span><span class="p">)</span>
<span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">notnull</span><span class="p">()]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>event_id</th>
<th>game_id</th>
<th>time</th>
<th>score</th>
<th>team</th>
<th>event</th>
<th>shot_result</th>
<th>after_oreb</th>
</tr>
</thead>
<tbody>
<tr>
<th>2 </th>
<td> 2</td>
<td> 330010066</td>
<td> 19:31</td>
<td> 0-0</td>
<td> away</td>
<td> Austin Morgan missed Three Point Jumper.</td>
<td> missed</td>
<td> False</td>
</tr>
<tr>
<th>917518</th>
<td> 6</td>
<td> 330010066</td>
<td> 19:14</td>
<td> 0-0</td>
<td> home</td>
<td> Will Clyburn missed Three Point Jumper.</td>
<td> missed</td>
<td> True</td>
</tr>
<tr>
<th>917526</th>
<td> 14</td>
<td> 330010066</td>
<td> 18:31</td>
<td> 0-0</td>
<td> home</td>
<td> Chris Babb missed Three Point Jumper.</td>
<td> missed</td>
<td> False</td>
</tr>
</tbody>
</table>
<p>Finally, we can calculate the 3P% for our groups and plot the results.</p>
<div class="highlight"><pre><span></span><code><span class="n">threes</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">notnull</span><span class="p">()]</span>
<span class="n">attempts</span> <span class="o">=</span> <span class="n">threes</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'shot_result'</span><span class="p">,</span> <span class="s1">'after_oreb'</span><span class="p">])</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">attempts</span><span class="p">[</span><span class="s1">'perc'</span><span class="p">]</span> <span class="o">=</span> <span class="n">attempts</span><span class="o">.</span><span class="n">made</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">attempts</span><span class="o">.</span><span class="n">made</span> <span class="o">+</span> <span class="n">attempts</span><span class="o">.</span><span class="n">missed</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">attempts</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">shot_result made missed perc</span>
<span class="go">after_oreb </span>
<span class="go">False 33244 63608 0.343245</span>
<span class="go">True 2861 5341 0.348817</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">attempts</span><span class="o">.</span><span class="n">index</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'No'</span><span class="p">,</span> <span class="s1">'Yes'</span><span class="p">]</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">[</span><span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span>
<span class="n">attempts</span><span class="o">.</span><span class="n">perc</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">'bar'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'3P%'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'After Offensive Rebound?'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">);</span>
</code></pre></div>
<p><img alt="3P% after offensive rebound vs not" src="/images/three-point-percent-after-oreb.png"></p>
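<p>As a quick check on whether the small gap above could simply be chance, we can run a two-proportion z-test on the counts printed earlier. A minimal sketch using only the standard library (scipy would give the p-value directly):</p>

```python
import math

# Counts from the table above: (made, missed) for each group.
made_no, missed_no = 33244, 63608    # not after an offensive rebound
made_yes, missed_yes = 2861, 5341    # after an offensive rebound

n_no, n_yes = made_no + missed_no, made_yes + missed_yes
p_no, p_yes = made_no / n_no, made_yes / n_yes

# Pooled two-proportion z-test.
pooled = (made_no + made_yes) / (n_no + n_yes)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_no + 1 / n_yes))
z = (p_yes - p_no) / se

# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
```

<p>The z statistic comes out around 1.0 with a two-sided p-value near 0.31 - well within what we'd expect from chance alone.</p>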
<h3>Digging deeper</h3>
<p>At the most basic level, the difference is negligible. Not all post-offensive rebound three pointers are created equally though. Let's investigate whether three pointers shot shortly after the rebound - specifically, within seven seconds of the offensive rebound - are more successful.</p>
<p>To do so, we'll need to munge our data a bit more in order to calculate the seconds elapsed between rebound and shot attempt.</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="p">[</span><span class="s1">'minutes'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">time</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">':'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">melted</span><span class="p">[</span><span class="s1">'seconds'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">time</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">':'</span><span class="p">)[</span><span class="mi">1</span><span class="p">]))</span>
</code></pre></div>
<p>Notice below that time-outs and end of periods are duplicated within the data (this is because they originally appeared as both a home and away event).</p>
<div class="highlight"><pre><span></span><code><span class="n">duped_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="s1">'event_id'</span><span class="p">,</span> <span class="s1">'time'</span><span class="p">,</span> <span class="s1">'event'</span><span class="p">]</span>
<span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">duplicated</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="n">duped_cols</span><span class="p">)][:</span><span class="mi">3</span><span class="p">]</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>event_id</th>
<th>game_id</th>
<th>time</th>
<th>score</th>
<th>team</th>
<th>event</th>
<th>shot_result</th>
<th>after_oreb</th>
<th>minutes</th>
<th>seconds</th>
</tr>
</thead>
<tbody>
<tr>
<th>917546</th>
<td> 34</td>
<td> 330010066</td>
<td> 15:22</td>
<td> Official TV Timeout.</td>
<td> home</td>
<td> Official TV Timeout.</td>
<td> NaN</td>
<td> False</td>
<td> 15</td>
<td> 22</td>
</tr>
<tr>
<th>917559</th>
<td> 47</td>
<td> 330010066</td>
<td> 13:40</td>
<td> Yale Full Timeout.</td>
<td> home</td>
<td> Yale Full Timeout.</td>
<td> NaN</td>
<td> False</td>
<td> 13</td>
<td> 40</td>
</tr>
<tr>
<th>917575</th>
<td> 63</td>
<td> 330010066</td>
<td> 11:53</td>
<td> Official TV Timeout.</td>
<td> home</td>
<td> Official TV Timeout.</td>
<td> NaN</td>
<td> False</td>
<td> 11</td>
<td> 53</td>
</tr>
</tbody>
</table>
<p>Let's get rid of them so we can easily label events within each period (keeping them in will throw our function off a bit).</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="s1">'event_id'</span><span class="p">,</span> <span class="s1">'event'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<p>Now we can label each period based on when the <code>End of ...</code> event appears - events before it are the first half; events after, the second half. To do so, we can use <code>cumsum</code>, which will treat <code>True</code> values as 1. This means we can just shift our results down a row and take the running total.</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="p">[</span><span class="s1">'period_end'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'End of'</span><span class="p">))</span>
<span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">period_end</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>event_id</th>
<th>game_id</th>
<th>time</th>
<th>score</th>
<th>team</th>
<th>event</th>
<th>shot_result</th>
<th>after_oreb</th>
<th>minutes</th>
<th>seconds</th>
<th>period_end</th>
</tr>
</thead>
<tbody>
<tr>
<th>171</th>
<td> 171</td>
<td> 330010066</td>
<td> 0:00</td>
<td> End of the 1st Half.</td>
<td> away</td>
<td> End of the 1st Half.</td>
<td> NaN</td>
<td> False</td>
<td> 0</td>
<td> 0</td>
<td> True</td>
</tr>
<tr>
<th>480</th>
<td> 123</td>
<td> 330010120</td>
<td> 0:00</td>
<td> End of the 1st Half.</td>
<td> away</td>
<td> End of the 1st Half.</td>
<td> NaN</td>
<td> False</td>
<td> 0</td>
<td> 0</td>
<td> True</td>
</tr>
<tr>
<th>770</th>
<td> 142</td>
<td> 330010228</td>
<td> 0:00</td>
<td> End of the 1st Half.</td>
<td> away</td>
<td> End of the 1st Half.</td>
<td> NaN</td>
<td> False</td>
<td> 0</td>
<td> 0</td>
<td> True</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="n">calculate_period</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">melted</span><span class="p">[</span><span class="s1">'period'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'game_id'</span><span class="p">)</span><span class="o">.</span><span class="n">period_end</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">calculate_period</span><span class="p">)</span>
</code></pre></div>
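<p>To see why the <code>shift(1).cumsum()</code> chain yields period numbers, here's a minimal sketch with a hand-made flag column (hypothetical values, with an explicit float cast so the NaN handling is clear):</p>

```python
import pandas as pd

# 1.0 marks an end-of-period event; every event *after* such a flag
# belongs to the next period, hence the shift before the cumsum.
flags = pd.Series([0, 0, 1, 0, 1, 0], dtype=float)
period = flags.shift(1).cumsum().fillna(0) + 1
print(period.tolist())  # -> [1.0, 1.0, 1.0, 2.0, 2.0, 3.0]
```

The shift pushes each flag down one row, the cumulative sum counts how many period boundaries precede each event, and <code>fillna(0)</code> handles the first row, which has no predecessor.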
<p>Now we can use each game's maximum period to calculate its total length in minutes.</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># 40min regulation game + (# periods - 2 halves) * 5min OTs</span>
<span class="n">gametime</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mi">40</span> <span class="o">+</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="mi">5</span>
<span class="n">melted</span><span class="p">[</span><span class="s1">'gametime'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">period</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">gametime</span><span class="p">)</span>
<span class="n">melted</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<p>Setting game_id as the index was necessary here because pandas aligns on the index when assigning a result back: the one-value-per-game Series produced by the groupby is matched to each game's rows automatically.</p>
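<p>A toy example of that alignment behavior (made-up data, not from the play-by-play set): assigning a one-row-per-group aggregate back to the full DataFrame broadcasts it to every row sharing that index value.</p>

```python
import pandas as pd

# Two events for game 'a', one for game 'b'.
df = pd.DataFrame({'game_id': ['a', 'a', 'b'],
                   'period': [1, 2, 3]}).set_index('game_id')

# The aggregate has one row per game_id; assigning it back aligns
# on the index, repeating each game's value across its rows.
df['max_period'] = df.groupby(level=0).period.max()
print(df.max_period.tolist())  # -> [2, 2, 3]
```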
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'gametime'</span><span class="p">)</span><span class="o">.</span><span class="n">game_id</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">gametime</span>
<span class="mi">35</span> <span class="mi">54</span>
<span class="mi">40</span> <span class="mi">2365</span>
<span class="mi">45</span> <span class="mi">442</span>
<span class="mi">50</span> <span class="mi">61</span>
<span class="mi">55</span> <span class="mi">7</span>
<span class="mi">60</span> <span class="mi">1</span>
<span class="mi">65</span> <span class="mi">1</span>
<span class="n">dtype</span><span class="p">:</span> <span class="n">int64</span>
</code></pre></div>
<p>Notice above that some games have a total gametime of 35 minutes. All college basketball games are at least 40 minutes, so something is off.</p>
<p>It turns out there are some inconsistencies in ESPN's play-by-play data - a couple games do not have the "End of the 1st Half." event. This throws off our <code>period</code> and <code>gametime</code> calculations. Such is life when dealing with scraped data though.</p>
<p>Let's keep things simple and just assume they were normal, non-OT games.</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">gametime</span> <span class="o">==</span> <span class="mi">35</span><span class="p">,</span> <span class="s1">'gametime'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">40</span>
</code></pre></div>
<p>Now let's normalize the event times to seconds left in the game. This will allow us to see how much time elapsed between the offensive rebound and three point attempt.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">clock_to_secs_left</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="sd">"""Calculates the total seconds left in the game."""</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'secs_left'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">nan</span>
<span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">period</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">'secs_left'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">minutes</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1200</span> <span class="o">+</span> <span class="n">df</span><span class="o">.</span><span class="n">seconds</span>
<span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">period</span> <span class="o">></span> <span class="mi">1</span><span class="p">,</span> <span class="s1">'secs_left'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">minutes</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span> <span class="o">+</span> <span class="n">df</span><span class="o">.</span><span class="n">seconds</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">clock_to_secs_left</span><span class="p">(</span><span class="n">melted</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">melted</span><span class="p">[[</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="s1">'time'</span><span class="p">,</span> <span class="s1">'event'</span><span class="p">,</span> <span class="s1">'period'</span><span class="p">,</span> <span class="s1">'secs_left'</span><span class="p">]][:</span><span class="mi">5</span><span class="p">])</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> game_id time event period secs_left</span>
<span class="go">0 330010066 19:47 Chris Babb Turnover. 1 2387</span>
<span class="go">1 330010066 19:45 Austin Morgan Steal. 1 2385</span>
<span class="go">2 330010066 19:31 Austin Morgan missed Three Point Jumper. 1 2371</span>
<span class="go">3 330010066 19:31 Korie Lucious Defensive Rebound. 1 2371</span>
<span class="go">4 330010066 19:21 Korie Lucious missed Jumper. 1 2361</span>
</code></pre></div>
<p>We can finally see how much time elapsed between offensive rebound and three point attempt.</p>
<p>We'll create a new field which will store the seconds elapsed since the previous event. Then we'll create a new DataFrame called <code>threes_after_orebs</code> which will hold three point attempts that were shot within seven seconds of an offensive rebound.</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="p">[</span><span class="s1">'secs_elapsed'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">secs_left</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">melted</span><span class="o">.</span><span class="n">secs_left</span>
<span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">melted</span><span class="o">.</span><span class="n">secs_elapsed</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">melted</span><span class="o">.</span><span class="n">secs_elapsed</span> <span class="o"><=</span> <span class="mi">7</span><span class="p">)</span>
<span class="n">threes_after_orebs</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">after_oreb</span> <span class="o">&</span> <span class="n">mask</span><span class="p">]</span>
</code></pre></div>
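<p>For intuition on the <code>shift(1)</code> difference, here's a tiny example with made-up clock values: each event's elapsed time is the previous event's clock reading minus its own.</p>

```python
import pandas as pd

# Seconds left on the clock at four consecutive events of one game.
secs_left = pd.Series([2387, 2385, 2371, 2361])
secs_elapsed = secs_left.shift(1) - secs_left  # first event has no predecessor -> NaN
print(secs_elapsed.tolist())
```

The first value is NaN, which is why the mask above requires <code>secs_elapsed >= 0</code> before comparing against the seven-second cutoff.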
<p>Finally, let's group our data by the seconds elapsed and shot result to get the three point percentage for each bucket, which we can plot.</p>
<div class="highlight"><pre><span></span><code><span class="n">grouped</span> <span class="o">=</span> <span class="n">threes_after_orebs</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'shot_result'</span><span class="p">,</span> <span class="s1">'secs_elapsed'</span><span class="p">])</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
<span class="n">grouped</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">grouped</span><span class="p">[</span><span class="s1">'attempts'</span><span class="p">]</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">made</span> <span class="o">+</span> <span class="n">grouped</span><span class="o">.</span><span class="n">missed</span>
<span class="n">grouped</span><span class="p">[</span><span class="s1">'percentage'</span><span class="p">]</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">made</span> <span class="o">/</span> <span class="n">grouped</span><span class="o">.</span><span class="n">attempts</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">threes</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">t</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="s1">'made'</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="s1">'made'</span><span class="p">]</span> <span class="o">+</span> <span class="n">t</span><span class="p">[</span><span class="s1">'missed'</span><span class="p">])</span>
<span class="n">figsize</span><span class="p">(</span><span class="mf">12.5</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">grouped</span><span class="o">.</span><span class="n">percentage</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'O-Reb 3P%'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">'#377EB8'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hlines</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'"Normal" 3P%'</span><span class="p">,</span> <span class="n">linestyles</span><span class="o">=</span><span class="s1">'--'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Seconds Since Offensive Rebound'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">8</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'3-Point Percentage'</span><span class="p">,</span> <span class="n">labelpad</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">'lower right'</span><span class="p">);</span>
</code></pre></div>
<p><img alt="3P% vs seconds since offensive rebound" src="/images/three-pointers-secs-since-oreb.png"></p>
<p>It looks like there might be some truth to Jay Bilas' statement that the best time to shoot a three pointer is after an offensive rebound. However, we can go one step further and simulate our way to a numeric value of how correct he was.</p>
<h3>Simulations</h3>
<p>To start, we'll create two Series based on the <code>melted</code> DataFrame that we have been using throughout the analysis. One Series, which we'll call <code>normal</code>, will hold the results of three pointers that were normally attempted - that is, they were not shot immediately after an offensive rebound. The other Series, <code>after</code>, will contain the results of those shot after an offensive rebound. <code>True</code> will be used to indicate a made basket.</p>
<div class="highlight"><pre><span></span><code><span class="n">convert</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="kc">True</span> <span class="k">if</span> <span class="n">x</span> <span class="o">==</span> <span class="s1">'made'</span> <span class="k">else</span> <span class="kc">False</span>
<span class="n">normal_criteria</span> <span class="o">=</span> <span class="p">(</span><span class="n">melted</span><span class="o">.</span><span class="n">after_oreb</span> <span class="o">==</span> <span class="kc">False</span><span class="p">)</span> <span class="o">&</span> <span class="n">melted</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">notnull</span><span class="p">()</span>
<span class="n">normal</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">normal_criteria</span><span class="p">]</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">convert</span><span class="p">)</span>
<span class="n">after_criteria</span> <span class="o">=</span> <span class="p">(</span><span class="n">melted</span><span class="o">.</span><span class="n">after_oreb</span><span class="p">)</span> <span class="o">&</span> <span class="n">melted</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">notnull</span><span class="p">()</span> <span class="o">&</span> \
<span class="p">(</span><span class="n">melted</span><span class="o">.</span><span class="n">secs_elapsed</span> <span class="o"><=</span> <span class="mi">7</span><span class="p">)</span>
<span class="n">after</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">after_criteria</span><span class="p">]</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">convert</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"After O-Reb 3P%:"</span><span class="p">,</span> <span class="n">after</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Sample Size:"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">after</span><span class="p">))</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"All other 3P%:"</span><span class="p">,</span> <span class="n">normal</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Sample Size:"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">normal</span><span class="p">))</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Absolute difference: </span><span class="si">%.4f</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">after</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="o">-</span> <span class="n">normal</span><span class="o">.</span><span class="n">mean</span><span class="p">()))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">After O-Reb 3P%: 0.350169109357</span>
<span class="go">Sample Size: 4435</span>
<span class="go">All other 3P%: 0.343264007931</span>
<span class="go">Sample Size: 96838</span>
<span class="go">Absolute difference: 0.0069</span>
</code></pre></div>
<p>While we have data for 2,932 games, it turns out that three pointers shot within seven seconds of an offensive rebound aren't very common - it only occurred 4,435 times, while "normal" three pointers were shot 96,838 times.</p>
<p>The much smaller population means that we're more uncertain about the "true" success rate of those after offensive rebounds.</p>
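<p>A quick way to quantify that uncertainty (using the percentages and sample sizes printed above) is the binomial standard error, <code>sqrt(p * (1 - p) / n)</code> — roughly 0.7 percentage points for the post-rebound sample versus 0.15 for the rest:</p>

```python
import math

# Binomial standard errors from the printed percentages and sample sizes.
se_after = math.sqrt(0.3502 * (1 - 0.3502) / 4435)
se_normal = math.sqrt(0.3433 * (1 - 0.3433) / 96838)
print(round(se_after, 4), round(se_normal, 4))  # -> 0.0072 0.0015
```

With a standard error comparable in size to the observed 0.0069 difference, we can't simply eyeball whether the gap is real.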
<p>Using <a href="http://pymc-devs.github.io/pymc/">pymc</a>, we can run simulations to determine how likely it is that three pointers after offensive rebounds really are easier, while taking this uncertainty into account.</p>
<p>We'll first assume a uniform distribution between 30% and 40% for both "normal" three pointers and those after offensive rebounds. This means that we believe the "true" success rate of each to be somewhere between 30-40%. This seems reasonable given that the 3P% for all of NCAA Division I basketball during the 2012-2013 season was 33.89% [source].</p>
<p>We then generate observations using our <code>normal</code> and <code>after</code> Series. Each observation is a Bernoulli trial, meaning the outcome is binary - the three pointer was either made (True) or missed (False).</p>
<p>We'll then run 20,000 simulations using our existing data and assumptions.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pymc</span> <span class="k">as</span> <span class="nn">pm</span>
<span class="c1"># no chance 3P% is out of this range</span>
<span class="n">p_normal</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Uniform</span><span class="p">(</span><span class="s2">"p_normal"</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">)</span>
<span class="n">p_after</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Uniform</span><span class="p">(</span><span class="s2">"p_after"</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">)</span>
<span class="nd">@pm</span><span class="o">.</span><span class="n">deterministic</span>
<span class="k">def</span> <span class="nf">delta</span><span class="p">(</span><span class="n">p_normal</span><span class="o">=</span><span class="n">p_normal</span><span class="p">,</span> <span class="n">p_after</span><span class="o">=</span><span class="n">p_after</span><span class="p">):</span>
<span class="k">return</span> <span class="n">p_after</span> <span class="o">-</span> <span class="n">p_normal</span>
<span class="c1"># scraped observations</span>
<span class="n">obs_normal</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="s2">"obs_normal"</span><span class="p">,</span> <span class="n">p_normal</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="n">normal</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">obs_after</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="s2">"obs_after"</span><span class="p">,</span> <span class="n">p_after</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="n">after</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">MCMC</span><span class="p">([</span><span class="n">p_normal</span><span class="p">,</span> <span class="n">p_after</span><span class="p">,</span> <span class="n">delta</span><span class="p">,</span> <span class="n">obs_normal</span><span class="p">,</span> <span class="n">obs_after</span><span class="p">])</span>
<span class="n">m</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">20000</span><span class="p">)</span>
<span class="n">p_normal_samples</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s2">"p_normal"</span><span class="p">)[:]</span>
<span class="n">p_after_samples</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s2">"p_after"</span><span class="p">)[:]</span>
<span class="n">delta_samples</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s2">"delta"</span><span class="p">)[:]</span>
</code></pre></div>
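<p>As a sanity check on the MCMC output, the same comparison can be done with conjugate Beta posteriors and plain numpy, no sampler required. The made/missed counts below are reconstructed from the percentages and sample sizes printed earlier, and a flat Beta(1, 1) prior stands in for the uniform one:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Counts reconstructed from the printed output: ~35.02% of 4,435
# attempts after an offensive rebound, ~34.33% of 96,838 otherwise.
made_after, n_after = 1553, 4435
made_normal, n_normal = 33241, 96838

# Beta(1, 1) prior + binomial likelihood -> Beta posterior draws.
p_after = rng.beta(1 + made_after, 1 + n_after - made_after, 20000)
p_normal = rng.beta(1 + made_normal, 1 + n_normal - made_normal, 20000)

# Fraction of draws in which the post-rebound rate is higher.
print((p_after > p_normal).mean())
```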
<p>Finally, we can plot the results of our simulations.</p>
<div class="highlight"><pre><span></span><code><span class="n">figsize</span><span class="p">(</span><span class="mf">12.5</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">311</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.401</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">p_normal_samples</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'stepfilled'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">normed</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s1">'#E41A1C'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'3P% "Normal"'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">normal</span><span class="o">.</span><span class="n">mean</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">,</span> <span class="n">linestyles</span><span class="o">=</span><span class="s1">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'True "Normal" 3P%'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">312</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.401</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">p_after_samples</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'stepfilled'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">normed</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s1">'#4DAF4A'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'3P% After Off. Reb.'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">after</span><span class="o">.</span><span class="n">mean</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">,</span> <span class="n">linestyles</span><span class="o">=</span><span class="s1">'--'</span><span class="p">,</span>
<span class="n">label</span><span class="o">=</span><span class="s1">'True 3P% After Off. Reb.'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">313</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="o">-</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.051</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">delta_samples</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'stepfilled'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">normed</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s1">'#377EB8'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Delta'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">,</span> <span class="n">linestyles</span><span class="o">=</span><span class="s1">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'$H_0$ (No difference)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">);</span>
</code></pre></div>
<p><img alt="3P% simulations" src="/images/three-pointer-simulations.png"></p>
<p>Notice the much narrower distribution at the top? This is because we had so many observations for "normal" three pointers. There wasn't much uncertainty! The second distribution is quite different: its greater width indicates a greater level of uncertainty about three pointers attempted after offensive rebounds, due to the much smaller sample size.</p>
<p>The final distribution shows the difference between "normal" and "after" three pointers in our simulations. Much more often than not, "after" three pointers had a higher 3P% in the simulations.</p>
<p>In what percentage of simulations was "after" better than "normal" though?</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="s2">"3P% after offensive rebounds was more successful "</span>
      <span class="s2">"in {0:.1f}% of simulations"</span><span class="o">.</span><span class="n">format</span><span class="p">((</span><span class="n">delta_samples</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span><span class="p">))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">3P% after offensive rebounds was more successful in 83.1% of simulations</span>
</code></pre></div>
<p>Looks like Jay Bilas is largely right about this one. While the absolute difference in 3P% is small, threes after an offensive rebound were more successful in about 83% of our simulations.</p>
<p>Was there something I missed or got wrong? I'd love to hear from you.</p>Using pandas on the MovieLens dataset2013-10-26T03:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-10-26:/2013/10/26/using-pandas-on-the-movielens-dataset/<p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><em>This is part three of a three part introduction to <a href="http://pandas.pydata.org">pandas</a>, a Python library for data analysis. The tutorial is primarily …</em></p><p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><em>This is part three of a three part introduction to <a href="http://pandas.pydata.org">pandas</a>, a Python library for data analysis. The tutorial is primarily geared towards SQL users, but is useful for anyone wanting to get started with the library.</em></p>
<ul>
<li><a href="/2013/10/26/intro-to-pandas-data-structures/">Part 1: Intro to pandas data structures</a>, covers the basics of the library's two main data structures - Series and DataFrames.</li>
<li><a href="/2013/10/26/working-with-pandas-dataframes/">Part 2: Working with DataFrames</a>, dives a bit deeper into the functionality of DataFrames. It shows how to inspect, select, filter, merge, combine, and group your data.</li>
<li><a href="/2013/10/26/using-pandas-on-the-movielens-dataset/">Part 3: Using pandas with the MovieLens dataset</a>, applies the learnings of the first two parts in order to answer a few basic analysis questions about the MovieLens ratings data.</li>
</ul>
<h2>Using pandas on the MovieLens dataset</h2>
<p>To show pandas in a more "applied" sense, let's use it to answer some questions about the <a href="https://grouplens.org/datasets/movielens/">MovieLens</a> dataset. Recall that we've already read our data into DataFrames and merged it.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># pass in column names for each CSV</span>
<span class="n">u_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'user_id'</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">,</span> <span class="s1">'sex'</span><span class="p">,</span> <span class="s1">'occupation'</span><span class="p">,</span> <span class="s1">'zip_code'</span><span class="p">]</span>
<span class="n">users</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.user'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'|'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">u_cols</span><span class="p">,</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
<span class="n">r_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'user_id'</span><span class="p">,</span> <span class="s1">'movie_id'</span><span class="p">,</span> <span class="s1">'rating'</span><span class="p">,</span> <span class="s1">'unix_timestamp'</span><span class="p">]</span>
<span class="n">ratings</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.data'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">r_cols</span><span class="p">,</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
<span class="c1"># the movies file contains columns indicating the movie's genres</span>
<span class="c1"># let's only load the first five columns of the file with usecols</span>
<span class="n">m_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="s1">'title'</span><span class="p">,</span> <span class="s1">'release_date'</span><span class="p">,</span> <span class="s1">'video_release_date'</span><span class="p">,</span> <span class="s1">'imdb_url'</span><span class="p">]</span>
<span class="n">movies</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.item'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'|'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">m_cols</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
<span class="c1"># create one merged DataFrame</span>
<span class="n">movie_ratings</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">movies</span><span class="p">,</span> <span class="n">ratings</span><span class="p">)</span>
<span class="n">lens</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">movie_ratings</span><span class="p">,</span> <span class="n">users</span><span class="p">)</span>
</code></pre></div>
<h3>What are the 25 most rated movies?</h3>
<div class="highlight"><pre><span></span><code><span class="n">most_rated</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'title'</span><span class="p">)</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">25</span><span class="p">]</span>
<span class="n">most_rated</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">title</span>
<span class="go">Star Wars (1977) 583</span>
<span class="go">Contact (1997) 509</span>
<span class="go">Fargo (1996) 508</span>
<span class="go">Return of the Jedi (1983) 507</span>
<span class="go">Liar Liar (1997) 485</span>
<span class="go">English Patient, The (1996) 481</span>
<span class="go">Scream (1996) 478</span>
<span class="go">Toy Story (1995) 452</span>
<span class="go">Air Force One (1997) 431</span>
<span class="go">Independence Day (ID4) (1996) 429</span>
<span class="go">Raiders of the Lost Ark (1981) 420</span>
<span class="go">Godfather, The (1972) 413</span>
<span class="go">Pulp Fiction (1994) 394</span>
<span class="go">Twelve Monkeys (1995) 392</span>
<span class="go">Silence of the Lambs, The (1991) 390</span>
<span class="go">Jerry Maguire (1996) 384</span>
<span class="go">Chasing Amy (1997) 379</span>
<span class="go">Rock, The (1996) 378</span>
<span class="go">Empire Strikes Back, The (1980) 367</span>
<span class="go">Star Trek: First Contact (1996) 365</span>
<span class="go">Back to the Future (1985) 350</span>
<span class="go">Titanic (1997) 350</span>
<span class="go">Mission: Impossible (1996) 344</span>
<span class="go">Fugitive, The (1993) 336</span>
<span class="go">Indiana Jones and the Last Crusade (1989) 331</span>
<span class="go">dtype: int64</span>
</code></pre></div>
<p>There's a lot going on in the code above, but it's very idiomatic. We're splitting the DataFrame into groups by movie title and applying the <code>size</code> method to get the count of records in each group. Then we order our results in descending order and limit the output to the top 25 using Python's slicing syntax.</p>
<p>In SQL, this would be equivalent to:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">lens</span><span class="w"></span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">title</span><span class="w"></span>
<span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="w"></span>
<span class="k">LIMIT</span><span class="w"> </span><span class="mi">25</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>Alternatively, pandas has a nifty <code>value_counts</code> method. Yes, it's simpler, but the goal above was to show a basic <code>groupby</code> example.</p>
<div class="highlight"><pre><span></span><code><span class="n">lens</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()[:</span><span class="mi">25</span><span class="p">]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">Star Wars (1977) 583</span>
<span class="go">Contact (1997) 509</span>
<span class="go">Fargo (1996) 508</span>
<span class="go">Return of the Jedi (1983) 507</span>
<span class="go">Liar Liar (1997) 485</span>
<span class="go">English Patient, The (1996) 481</span>
<span class="go">Scream (1996) 478</span>
<span class="go">Toy Story (1995) 452</span>
<span class="go">Air Force One (1997) 431</span>
<span class="go">Independence Day (ID4) (1996) 429</span>
<span class="go">Raiders of the Lost Ark (1981) 420</span>
<span class="go">Godfather, The (1972) 413</span>
<span class="go">Pulp Fiction (1994) 394</span>
<span class="go">Twelve Monkeys (1995) 392</span>
<span class="go">Silence of the Lambs, The (1991) 390</span>
<span class="go">Jerry Maguire (1996) 384</span>
<span class="go">Chasing Amy (1997) 379</span>
<span class="go">Rock, The (1996) 378</span>
<span class="go">Empire Strikes Back, The (1980) 367</span>
<span class="go">Star Trek: First Contact (1996) 365</span>
<span class="go">Titanic (1997) 350</span>
<span class="go">Back to the Future (1985) 350</span>
<span class="go">Mission: Impossible (1996) 344</span>
<span class="go">Fugitive, The (1993) 336</span>
<span class="go">Indiana Jones and the Last Crusade (1989) 331</span>
<span class="go">Name: title, dtype: int64</span>
</code></pre></div>
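<p>To convince yourself the two approaches agree, here's a small sketch on a made-up frame (the titles below are just placeholders):</p>

```python
import pandas as pd

# a tiny stand-in for the lens DataFrame
toy = pd.DataFrame({'title': ['Fargo (1996)', 'Contact (1997)', 'Fargo (1996)']})

# groupby/size, then order descending -- same shape as the longer example above
via_groupby = toy.groupby('title').size().sort_values(ascending=False)

# value_counts does the grouping, counting, and sorting in one call
via_counts = toy.title.value_counts()
```

<p>Both produce the same counts in the same order; <code>value_counts</code> just saves the typing.</p>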
<h3>Which movies are most highly rated?</h3>
<div class="highlight"><pre><span></span><code><span class="n">movie_stats</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'title'</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'rating'</span><span class="p">:</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">]})</span>
<span class="n">movie_stats</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="2" halign="left">rating</th>
</tr>
<tr>
<th></th>
<th>size</th>
<th>mean</th>
</tr>
<tr>
<th>title</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>'Til There Was You (1997)</th>
<td>9</td>
<td>2.333333</td>
</tr>
<tr>
<th>1-900 (1994)</th>
<td>5</td>
<td>2.600000</td>
</tr>
<tr>
<th>101 Dalmatians (1996)</th>
<td>109</td>
<td>2.908257</td>
</tr>
<tr>
<th>12 Angry Men (1957)</th>
<td>125</td>
<td>4.344000</td>
</tr>
<tr>
<th>187 (1997)</th>
<td>41</td>
<td>3.024390</td>
</tr>
</tbody>
</table>
<p>We can use the <code>agg</code> method to pass a dictionary specifying the columns to aggregate (as keys) and a list of functions we'd like to apply.</p>
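<p>As a side note, newer pandas versions also support "named aggregation", which produces flat column names instead of a MultiIndex. A minimal sketch, using an invented miniature of the <code>lens</code> frame:</p>

```python
import pandas as pd

# a made-up miniature of the lens DataFrame
toy = pd.DataFrame({'title':  ['Fargo (1996)', 'Fargo (1996)', 'Contact (1997)'],
                    'rating': [5, 4, 3]})

# dict-of-lists style, as above: yields MultiIndex columns
stats = toy.groupby('title').agg({'rating': ['size', 'mean']})

# named aggregation (pandas >= 0.25): yields flat 'size' and 'mean' columns
flat = toy.groupby('title').agg(size=('rating', 'size'),
                                mean=('rating', 'mean'))
```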
<p>Let's sort the resulting DataFrame so that we can see which movies have the highest average score.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># sort by rating average</span>
<span class="n">movie_stats</span><span class="o">.</span><span class="n">sort_values</span><span class="p">([(</span><span class="s1">'rating'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">)],</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="2" halign="left">rating</th>
</tr>
<tr>
<th></th>
<th>size</th>
<th>mean</th>
</tr>
<tr>
<th>title</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>They Made Me a Criminal (1939)</th>
<td>1</td>
<td>5</td>
</tr>
<tr>
<th>Marlene Dietrich: Shadow and Light (1996)</th>
<td>1</td>
<td>5</td>
</tr>
<tr>
<th>Saint of Fort Washington, The (1993)</th>
<td>2</td>
<td>5</td>
</tr>
<tr>
<th>Someone Else's America (1995)</th>
<td>1</td>
<td>5</td>
</tr>
<tr>
<th>Star Kid (1997)</th>
<td>3</td>
<td>5</td>
</tr>
</tbody>
</table>
<p>We sort with the <code>sort_values</code> method, which works on both Series and DataFrames (it replaced the older <code>sort</code> and <code>order</code> methods). Additionally, because our columns are now a <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced">MultiIndex</a>, we need to pass in a tuple specifying how to sort.</p>
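<p>The tuple requirement is easy to see on a toy frame with the same column structure (the data here is invented):</p>

```python
import pandas as pd

toy = pd.DataFrame({'title':  ['A', 'A', 'B'],
                    'rating': [2, 4, 5]})
stats = toy.groupby('title').agg({'rating': ['size', 'mean']})

# the column key is the full tuple ('rating', 'mean'), not just 'mean'
by_mean = stats.sort_values(by=[('rating', 'mean')], ascending=False)
```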
<p>The above movies are rated so rarely that we can't count them as quality films. Let's only look at movies that have been rated at least 100 times.</p>
<div class="highlight"><pre><span></span><code><span class="n">atleast_100</span> <span class="o">=</span> <span class="n">movie_stats</span><span class="p">[</span><span class="s1">'rating'</span><span class="p">][</span><span class="s1">'size'</span><span class="p">]</span> <span class="o">>=</span> <span class="mi">100</span>
<span class="n">movie_stats</span><span class="p">[</span><span class="n">atleast_100</span><span class="p">]</span><span class="o">.</span><span class="n">sort_values</span><span class="p">([(</span><span class="s1">'rating'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">)],</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">15</span><span class="p">]</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="2" halign="left">rating</th>
</tr>
<tr>
<th></th>
<th>size</th>
<th>mean</th>
</tr>
<tr>
<th>title</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Close Shave, A (1995)</th>
<td>112</td>
<td>4.491071</td>
</tr>
<tr>
<th>Schindler's List (1993)</th>
<td>298</td>
<td>4.466443</td>
</tr>
<tr>
<th>Wrong Trousers, The (1993)</th>
<td>118</td>
<td>4.466102</td>
</tr>
<tr>
<th>Casablanca (1942)</th>
<td>243</td>
<td>4.456790</td>
</tr>
<tr>
<th>Shawshank Redemption, The (1994)</th>
<td>283</td>
<td>4.445230</td>
</tr>
<tr>
<th>Rear Window (1954)</th>
<td>209</td>
<td>4.387560</td>
</tr>
<tr>
<th>Usual Suspects, The (1995)</th>
<td>267</td>
<td>4.385768</td>
</tr>
<tr>
<th>Star Wars (1977)</th>
<td>583</td>
<td>4.358491</td>
</tr>
<tr>
<th>12 Angry Men (1957)</th>
<td>125</td>
<td>4.344000</td>
</tr>
<tr>
<th>Citizen Kane (1941)</th>
<td>198</td>
<td>4.292929</td>
</tr>
<tr>
<th>To Kill a Mockingbird (1962)</th>
<td>219</td>
<td>4.292237</td>
</tr>
<tr>
<th>One Flew Over the Cuckoo's Nest (1975)</th>
<td>264</td>
<td>4.291667</td>
</tr>
<tr>
<th>Silence of the Lambs, The (1991)</th>
<td>390</td>
<td>4.289744</td>
</tr>
<tr>
<th>North by Northwest (1959)</th>
<td>179</td>
<td>4.284916</td>
</tr>
<tr>
<th>Godfather, The (1972)</th>
<td>413</td>
<td>4.283293</td>
</tr>
</tbody>
</table>
<p>Those results look realistic. Notice that we used boolean indexing to filter our <code>movie_stats</code> frame.</p>
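<p>Boolean indexing itself is simple: comparing a Series to a value produces a Series of True/False, which you can then use to select rows. A quick sketch with made-up counts:</p>

```python
import pandas as pd

counts = pd.Series([583, 9, 125],
                   index=['Star Wars (1977)',
                          "'Til There Was You (1997)",
                          '12 Angry Men (1957)'])

mask = counts >= 100     # boolean Series, aligned on the index
popular = counts[mask]   # keeps only the rows where mask is True
```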
<p>We broke this question down into many parts, so here's the Python needed to get the 15 movies with the highest average rating, requiring that they had at least 100 ratings:</p>
<div class="highlight"><pre><span></span><code><span class="n">movie_stats</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'title'</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'rating'</span><span class="p">:</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">]})</span>
<span class="n">atleast_100</span> <span class="o">=</span> <span class="n">movie_stats</span><span class="p">[</span><span class="s1">'rating'</span><span class="p">][</span><span class="s1">'size'</span><span class="p">]</span> <span class="o">>=</span> <span class="mi">100</span>
<span class="n">movie_stats</span><span class="p">[</span><span class="n">atleast_100</span><span class="p">]</span><span class="o">.</span><span class="n">sort_values</span><span class="p">([(</span><span class="s1">'rating'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">)],</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">15</span><span class="p">]</span>
</code></pre></div>
<p>The SQL equivalent would be:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">size</span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span><span class="w"> </span><span class="n">mean</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">lens</span><span class="w"></span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">title</span><span class="w"></span>
<span class="k">HAVING</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="mi">100</span><span class="w"></span>
<span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="k">DESC</span><span class="w"></span>
<span class="k">LIMIT</span><span class="w"> </span><span class="mi">15</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<h3>Limiting our population going forward</h3>
<p>Going forward, let's only look at the 50 most rated movies. Let's make a Series of movies that meet this threshold so we can use it for filtering later.</p>
<div class="highlight"><pre><span></span><code><span class="n">most_50</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'movie_id'</span><span class="p">)</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">50</span><span class="p">]</span>
</code></pre></div>
<p>The SQL to match this would be:</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">most_50</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w"></span>
<span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">movie_id</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">lens</span><span class="w"></span>
<span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">movie_id</span><span class="w"></span>
<span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="w"></span>
<span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">50</span><span class="w"></span>
<span class="p">);</span><span class="w"></span>
</code></pre></div>
<p>This table would then allow us to use EXISTS, IN, or JOIN whenever we wanted to filter our results. Here's an example using EXISTS:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">lens</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="p">(</span><span class="k">SELECT</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">most_50</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">lens</span><span class="p">.</span><span class="n">movie_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">most_50</span><span class="p">.</span><span class="n">movie_id</span><span class="p">);</span><span class="w"></span>
</code></pre></div>
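<p>On the pandas side, the equivalent filter is <code>isin</code> against the index of our <code>most_50</code>-style Series. A sketch on invented data (two movies instead of fifty):</p>

```python
import pandas as pd

lens = pd.DataFrame({'movie_id': [1, 2, 3, 1, 2],
                     'rating':   [5, 3, 4, 4, 2]})

# the 2 most-rated movie_ids, analogous to most_50 above
most_2 = lens.groupby('movie_id').size().sort_values(ascending=False)[:2]

# pandas equivalent of the SQL EXISTS / IN filter
filtered = lens[lens.movie_id.isin(most_2.index)]
```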
<h3>Which movies are most controversial amongst different ages?</h3>
<p>Let's look at how these movies are viewed across different age groups. First, let's look at how age is distributed amongst our users.</p>
<div class="highlight"><pre><span></span><code><span class="n">users</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"Distribution of users' ages"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'count of users'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'age'</span><span class="p">);</span>
</code></pre></div>
<p><img alt="Distribution of user ages" src="/images/pandas-movielens-age-histogram.png"></p>
<p>pandas' integration with <a href="https://matplotlib.org/index.html">matplotlib</a> makes basic graphing of Series/DataFrames trivial. In this case, calling <code>plot.hist</code> on the column produces a histogram. We can also use <a href="https://matplotlib.org/stable/tutorials/introductory/pyplot.html">matplotlib.pyplot</a> to customize our graph a bit (always label your axes).</p>
<h3>Binning our users</h3>
<p>I don't think it'd be very useful to compare individual ages - let's bin our users into age groups using <code>pandas.cut</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'0-9'</span><span class="p">,</span> <span class="s1">'10-19'</span><span class="p">,</span> <span class="s1">'20-29'</span><span class="p">,</span> <span class="s1">'30-39'</span><span class="p">,</span> <span class="s1">'40-49'</span><span class="p">,</span> <span class="s1">'50-59'</span><span class="p">,</span> <span class="s1">'60-69'</span><span class="p">,</span> <span class="s1">'70-79'</span><span class="p">]</span>
<span class="n">lens</span><span class="p">[</span><span class="s1">'age_group'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">lens</span><span class="o">.</span><span class="n">age</span><span class="p">,</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">81</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="n">right</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">)</span>
<span class="n">lens</span><span class="p">[[</span><span class="s1">'age'</span><span class="p">,</span> <span class="s1">'age_group'</span><span class="p">]]</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">()[:</span><span class="mi">10</span><span class="p">]</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>age</th>
<th>age_group</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>60</td>
<td>60-69</td>
</tr>
<tr>
<th>397</th>
<td>21</td>
<td>20-29</td>
</tr>
<tr>
<th>459</th>
<td>33</td>
<td>30-39</td>
</tr>
<tr>
<th>524</th>
<td>30</td>
<td>30-39</td>
</tr>
<tr>
<th>782</th>
<td>23</td>
<td>20-29</td>
</tr>
<tr>
<th>995</th>
<td>29</td>
<td>20-29</td>
</tr>
<tr>
<th>1229</th>
<td>26</td>
<td>20-29</td>
</tr>
<tr>
<th>1664</th>
<td>31</td>
<td>30-39</td>
</tr>
<tr>
<th>1942</th>
<td>24</td>
<td>20-29</td>
</tr>
<tr>
<th>2270</th>
<td>32</td>
<td>30-39</td>
</tr>
</tbody>
</table>
<p><code>pandas.cut</code> allows you to bin numeric data. In the above lines, we first created labels to name our bins, then split our users into eight bins of ten years (0-9, 10-19, 20-29, etc.). Our use of <code>right=False</code> told the function to make each bin inclusive of its lower edge and exclusive of its upper edge (e.g. a 30 year old user falls in the 30-39 bin, not 20-29).</p>
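<p>The edge behavior of <code>right=False</code> is worth verifying for yourself on a handful of ages (the ages below are arbitrary):</p>

```python
import pandas as pd

labels = ['0-9', '10-19', '20-29', '30-39', '40-49']
ages = pd.Series([9, 10, 30, 39, 40])

# right=False: each bin includes its lower edge and excludes its upper edge
groups = pd.cut(ages, range(0, 51, 10), right=False, labels=labels)
```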
<p>Now we can compare ratings across age groups.</p>
<div class="highlight"><pre><span></span><code><span class="n">lens</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'age_group'</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'rating'</span><span class="p">:</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">]})</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="2" halign="left">rating</th>
</tr>
<tr>
<th></th>
<th>size</th>
<th>mean</th>
</tr>
<tr>
<th>age_group</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>0-9</th>
<td>43</td>
<td>3.767442</td>
</tr>
<tr>
<th>10-19</th>
<td>8181</td>
<td>3.486126</td>
</tr>
<tr>
<th>20-29</th>
<td>39535</td>
<td>3.467333</td>
</tr>
<tr>
<th>30-39</th>
<td>25696</td>
<td>3.554444</td>
</tr>
<tr>
<th>40-49</th>
<td>15021</td>
<td>3.591772</td>
</tr>
<tr>
<th>50-59</th>
<td>8704</td>
<td>3.635800</td>
</tr>
<tr>
<th>60-69</th>
<td>2623</td>
<td>3.648875</td>
</tr>
<tr>
<th>70-79</th>
<td>197</td>
<td>3.649746</td>
</tr>
</tbody>
</table>
<p>Young users seem a bit more critical than other age groups. Let's look at how the 50 most rated movies are viewed across each age group. We can use the <code>most_50</code> Series we created earlier for filtering.</p>
<div class="highlight"><pre><span></span><code><span class="n">lens</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">by_age</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">most_50</span><span class="o">.</span><span class="n">index</span><span class="p">]</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'title'</span><span class="p">,</span> <span class="s1">'age_group'</span><span class="p">])</span>
<span class="n">by_age</span><span class="o">.</span><span class="n">rating</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">15</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">title age_group</span>
<span class="go">Air Force One (1997) 10-19 3.647059</span>
<span class="go"> 20-29 3.666667</span>
<span class="go"> 30-39 3.570000</span>
<span class="go"> 40-49 3.555556</span>
<span class="go"> 50-59 3.750000</span>
<span class="go"> 60-69 3.666667</span>
<span class="go"> 70-79 3.666667</span>
<span class="go">Alien (1979) 10-19 4.111111</span>
<span class="go"> 20-29 4.026087</span>
<span class="go"> 30-39 4.103448</span>
<span class="go"> 40-49 3.833333</span>
<span class="go"> 50-59 4.272727</span>
<span class="go"> 60-69 3.500000</span>
<span class="go"> 70-79 4.000000</span>
<span class="go">Aliens (1986) 10-19 4.050000</span>
<span class="go">Name: rating, dtype: float64</span>
</code></pre></div>
<p>Notice that both the title and age group are indexes here, with the average rating value being a Series. This is going to produce a really long list of values.</p>
<p>Wouldn't it be nice to see the data as a table? Each title as a row, each age group as a column, and the average rating in each cell.</p>
<p>Behold! The magic of <code>unstack</code>!</p>
<div class="highlight"><pre><span></span><code><span class="n">by_age</span><span class="o">.</span><span class="n">rating</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)[</span><span class="mi">10</span><span class="p">:</span><span class="mi">20</span><span class="p">]</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>age_group</th>
<th>0-9</th>
<th>10-19</th>
<th>20-29</th>
<th>30-39</th>
<th>40-49</th>
<th>50-59</th>
<th>60-69</th>
<th>70-79</th>
</tr>
<tr>
<th>title</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>E.T. the Extra-Terrestrial (1982)</th>
<td>0</td>
<td>3.680000</td>
<td>3.609091</td>
<td>3.806818</td>
<td>4.160000</td>
<td>4.368421</td>
<td>4.375000</td>
<td>0.000000</td>
</tr>
<tr>
<th>Empire Strikes Back, The (1980)</th>
<td>4</td>
<td>4.642857</td>
<td>4.311688</td>
<td>4.052083</td>
<td>4.100000</td>
<td>3.909091</td>
<td>4.250000</td>
<td>5.000000</td>
</tr>
<tr>
<th>English Patient, The (1996)</th>
<td>5</td>
<td>3.739130</td>
<td>3.571429</td>
<td>3.621849</td>
<td>3.634615</td>
<td>3.774648</td>
<td>3.904762</td>
<td>4.500000</td>
</tr>
<tr>
<th>Fargo (1996)</th>
<td>0</td>
<td>3.937500</td>
<td>4.010471</td>
<td>4.230769</td>
<td>4.294118</td>
<td>4.442308</td>
<td>4.000000</td>
<td>4.333333</td>
</tr>
<tr>
<th>Forrest Gump (1994)</th>
<td>5</td>
<td>4.047619</td>
<td>3.785714</td>
<td>3.861702</td>
<td>3.847826</td>
<td>4.000000</td>
<td>3.800000</td>
<td>0.000000</td>
</tr>
<tr>
<th>Fugitive, The (1993)</th>
<td>0</td>
<td>4.320000</td>
<td>3.969925</td>
<td>3.981481</td>
<td>4.190476</td>
<td>4.240000</td>
<td>3.666667</td>
<td>0.000000</td>
</tr>
<tr>
<th>Full Monty, The (1997)</th>
<td>0</td>
<td>3.421053</td>
<td>4.056818</td>
<td>3.933333</td>
<td>3.714286</td>
<td>4.146341</td>
<td>4.166667</td>
<td>3.500000</td>
</tr>
<tr>
<th>Godfather, The (1972)</th>
<td>0</td>
<td>4.400000</td>
<td>4.345070</td>
<td>4.412844</td>
<td>3.929412</td>
<td>4.463415</td>
<td>4.125000</td>
<td>0.000000</td>
</tr>
<tr>
<th>Groundhog Day (1993)</th>
<td>0</td>
<td>3.476190</td>
<td>3.798246</td>
<td>3.786667</td>
<td>3.851064</td>
<td>3.571429</td>
<td>3.571429</td>
<td>4.000000</td>
</tr>
<tr>
<th>Independence Day (ID4) (1996)</th>
<td>0</td>
<td>3.595238</td>
<td>3.291429</td>
<td>3.389381</td>
<td>3.718750</td>
<td>3.888889</td>
<td>2.750000</td>
<td>0.000000</td>
</tr>
</tbody>
</table>
<p><code>unstack</code>, well, unstacks the specified level of a MultiIndex (by default, <code>groupby</code> turns the grouped fields into an index - since we grouped by two fields, it became a MultiIndex). We unstacked the second level (remember that Python uses 0-based indexes), and then filled in missing (NaN) values with 0.</p>
<p>If we had used:</p>
<div class="highlight"><pre><span></span><code><span class="n">by_age</span><span class="o">.</span><span class="n">rating</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div>
<p>We would have had our age groups as rows and movie titles as columns.</p>
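<p>To make the level argument concrete, here's a tiny self-contained sketch (toy numbers, not the MovieLens data) showing how <code>unstack(1)</code> and <code>unstack(0)</code> pivot different levels of the MultiIndex into columns:</p>

```python
import pandas as pd

# a small two-level grouped Series, analogous to by_age.rating.mean()
ratings = pd.DataFrame({
    'title': ['Alien', 'Alien', 'Fargo', 'Fargo'],
    'age_group': ['10-19', '20-29', '10-19', '20-29'],
    'rating': [4.1, 4.0, 3.9, 4.2],
})
means = ratings.groupby(['title', 'age_group']).rating.mean()

# unstack(1): age groups become columns, titles stay as rows
wide = means.unstack(1)
print(wide)

# unstack(0): titles become columns, age groups stay as rows
tall = means.unstack(0)
print(tall)
```

<p>Either way, the same values end up in the cells - only which level becomes the column axis changes.</p>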
<h3>Which movies do men and women most disagree on?</h3>
<p><em>EDIT: I realized after writing this question that Wes McKinney basically went through the exact same question in his book. It's a good, yet simple example of pivot_table, so I'm going to leave it here. Seriously though, <a href="https://www.amazon.com/Python-Data-Analysis-Wrangling-Jupyter/dp/109810403X/ref=sr_1_1">go buy the book</a>.</em></p>
<p>Think about how you'd have to do this in SQL for a second. You'd have to use a combination of IF/CASE statements with aggregate functions in order to pivot your dataset. Your query would look something like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="k">IF</span><span class="p">(</span><span class="n">sex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'F'</span><span class="p">,</span><span class="w"> </span><span class="n">rating</span><span class="p">,</span><span class="w"> </span><span class="k">NULL</span><span class="p">)),</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="k">IF</span><span class="p">(</span><span class="n">sex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'M'</span><span class="p">,</span><span class="w"> </span><span class="n">rating</span><span class="p">,</span><span class="w"> </span><span class="k">NULL</span><span class="p">))</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">lens</span><span class="w"></span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">title</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>Imagine how annoying it'd be if you had to do this on more than two columns.</p>
<p>DataFrames have a <code>pivot_table</code> method that makes these kinds of operations much easier (and less verbose).</p>
<div class="highlight"><pre><span></span><code><span class="n">lens</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">pivoted</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="s1">'title'</span><span class="p">],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">'sex'</span><span class="p">],</span>
<span class="n">values</span><span class="o">=</span><span class="s1">'rating'</span><span class="p">,</span>
<span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">pivoted</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>sex</th>
<th>F</th>
<th>M</th>
</tr>
<tr>
<th>movie_id</th>
<th>title</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<th>Toy Story (1995)</th>
<td>3.789916</td>
<td>3.909910</td>
</tr>
<tr>
<th>2</th>
<th>GoldenEye (1995)</th>
<td>3.368421</td>
<td>3.178571</td>
</tr>
<tr>
<th>3</th>
<th>Four Rooms (1995)</th>
<td>2.687500</td>
<td>3.108108</td>
</tr>
<tr>
<th>4</th>
<th>Get Shorty (1995)</th>
<td>3.400000</td>
<td>3.591463</td>
</tr>
<tr>
<th>5</th>
<th>Copycat (1995)</th>
<td>3.772727</td>
<td>3.140625</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="n">pivoted</span><span class="p">[</span><span class="s1">'diff'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pivoted</span><span class="o">.</span><span class="n">M</span> <span class="o">-</span> <span class="n">pivoted</span><span class="o">.</span><span class="n">F</span>
<span class="n">pivoted</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>sex</th>
<th>F</th>
<th>M</th>
<th>diff</th>
</tr>
<tr>
<th>movie_id</th>
<th>title</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<th>Toy Story (1995)</th>
<td>3.789916</td>
<td>3.909910</td>
<td>0.119994</td>
</tr>
<tr>
<th>2</th>
<th>GoldenEye (1995)</th>
<td>3.368421</td>
<td>3.178571</td>
<td>-0.189850</td>
</tr>
<tr>
<th>3</th>
<th>Four Rooms (1995)</th>
<td>2.687500</td>
<td>3.108108</td>
<td>0.420608</td>
</tr>
<tr>
<th>4</th>
<th>Get Shorty (1995)</th>
<td>3.400000</td>
<td>3.591463</td>
<td>0.191463</td>
</tr>
<tr>
<th>5</th>
<th>Copycat (1995)</th>
<td>3.772727</td>
<td>3.140625</td>
<td>-0.632102</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="n">pivoted</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">disagreements</span> <span class="o">=</span> <span class="n">pivoted</span><span class="p">[</span><span class="n">pivoted</span><span class="o">.</span><span class="n">movie_id</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">most_50</span><span class="o">.</span><span class="n">index</span><span class="p">)][</span><span class="s1">'diff'</span><span class="p">]</span>
<span class="n">disagreements</span><span class="o">.</span><span class="n">sort_values</span><span class="p">()</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">'barh'</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">[</span><span class="mi">9</span><span class="p">,</span> <span class="mi">15</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Male vs. Female Avg. Ratings</span><span class="se">\n</span><span class="s1">(Difference > 0 = Favored by Men)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'Title'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Average Rating Difference'</span><span class="p">);</span>
</code></pre></div>
<p><img alt="bar chart of rating difference between men and women" src="/images/pandas-movielens-rating-differences.png"></p>
<p>Of course men like Terminator more than women. Independence Day though? Really?</p>
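<p>If you'd rather read the extremes off as a list than from the chart, sorting the <code>diff</code> column directly works too. A self-contained sketch with made-up numbers (not the actual MovieLens averages):</p>

```python
import pandas as pd

# hypothetical diff values (men's average minus women's average)
diff = pd.Series({
    'Dumb & Dumber (1994)': 0.5,
    'Terminator, The (1984)': 0.4,
    'Independence Day (ID4) (1996)': 0.2,
    'Jurassic Park (1993)': -0.3,
})

# most positive = favored by men; most negative = favored by women
print(diff.sort_values(ascending=False).head(2))
print(diff.sort_values().head(2))
```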
<h3>Additional Resources</h3>
<ul>
<li><a href="https://pandas.pydata.org/pandas-docs/stable/">pandas documentation</a></li>
<li><a href="https://pyvideo.org/search?models=videos.video&q=pandas">pandas videos from PyCon</a></li>
<li><a href="http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/">pandas and Python top 10</a></li>
<li><a href="https://tomaugspurger.github.io/modern-1-intro.html">Tom Augspurger's Modern pandas series</a></li>
<li><a href="https://www.youtube.com/watch?v=otCriSKVV_8&ab_channel=PyData">Video</a> from Tom's pandas tutorial at PyData Seattle 2015</li>
</ul>
<h3>Closing</h3>
<p>This is the point where I finally wrap this tutorial up. Hopefully I've covered the basics well enough to pique your interest and help you get started with the library. If I've missed something critical, feel free to <a href="https://twitter.com/gjreda">let me know on Twitter</a> or in the comments - I'd love constructive feedback.</p>Working with DataFrames2013-10-26T02:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-10-26:/2013/10/26/working-with-pandas-dataframes/<p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><em>This is part two of a three part introduction to <a href="http://pandas.pydata.org">pandas</a>, a Python library for data analysis. The tutorial is primarily …</em></p><p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><em>This is part two of a three part introduction to <a href="http://pandas.pydata.org">pandas</a>, a Python library for data analysis. The tutorial is primarily geared towards SQL users, but is useful for anyone wanting to get started with the library.</em></p>
<ul>
<li><a href="/2013/10/26/intro-to-pandas-data-structures/">Part 1: Intro to pandas data structures</a>, covers the basics of the library's two main data structures - Series and DataFrames.</li>
<li><a href="/2013/10/26/working-with-pandas-dataframes/">Part 2: Working with DataFrames</a>, dives a bit deeper into the functionality of DataFrames. It shows how to inspect, select, filter, merge, combine, and group your data.</li>
<li><a href="/2013/10/26/using-pandas-on-the-movielens-dataset/">Part 3: Using pandas with the MovieLens dataset</a>, applies the learnings of the first two parts in order to answer a few basic analysis questions about the MovieLens ratings data.</li>
</ul>
<h2>Working with DataFrames</h2>
<p>Now that we can get data into a DataFrame, we can finally start working with them. pandas has an abundance of functionality, far too much for me to cover in this introduction. I'd encourage anyone interested in diving deeper into the library to check out its <a href="https://pandas.pydata.org/pandas-docs/stable/">excellent documentation</a>. Or just use Google - there are a lot of Stack Overflow questions and blog posts covering specifics of the library.</p>
<p>We'll be using the <a href="https://grouplens.org/datasets/movielens/">MovieLens</a> dataset in many examples going forward. The dataset contains 100,000 ratings made by 943 users on 1,682 movies.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># pass in column names for each CSV</span>
<span class="n">u_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'user_id'</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">,</span> <span class="s1">'sex'</span><span class="p">,</span> <span class="s1">'occupation'</span><span class="p">,</span> <span class="s1">'zip_code'</span><span class="p">]</span>
<span class="n">users</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.user'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'|'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">u_cols</span><span class="p">,</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
<span class="n">r_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'user_id'</span><span class="p">,</span> <span class="s1">'movie_id'</span><span class="p">,</span> <span class="s1">'rating'</span><span class="p">,</span> <span class="s1">'unix_timestamp'</span><span class="p">]</span>
<span class="n">ratings</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.data'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">r_cols</span><span class="p">,</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
<span class="c1"># the movies file contains columns indicating the movie's genres</span>
<span class="c1"># let's only load the first five columns of the file with usecols</span>
<span class="n">m_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="s1">'title'</span><span class="p">,</span> <span class="s1">'release_date'</span><span class="p">,</span> <span class="s1">'video_release_date'</span><span class="p">,</span> <span class="s1">'imdb_url'</span><span class="p">]</span>
<span class="n">movies</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.item'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'|'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">m_cols</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
</code></pre></div>
<h3>Inspection</h3>
<div class="highlight"><pre><span></span><code><span class="n">movies</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"><class 'pandas.core.frame.DataFrame'></span>
<span class="go">Int64Index: 1682 entries, 0 to 1681</span>
<span class="go">Data columns (total 5 columns):</span>
<span class="go">movie_id 1682 non-null int64</span>
<span class="go">title 1682 non-null object</span>
<span class="go">release_date 1681 non-null object</span>
<span class="go">video_release_date 0 non-null float64</span>
<span class="go">imdb_url 1679 non-null object</span>
<span class="go">dtypes: float64(1), int64(1), object(3)</span>
<span class="go">memory usage: 78.8+ KB</span>
</code></pre></div>
<p>The output tells a few things about our DataFrame.</p>
<ol>
<li>It's obviously an instance of a DataFrame.</li>
<li>Each row was assigned an index of 0 to N-1, where N is the number of rows in the DataFrame. pandas will do this by default if an index is not specified. Don't worry, this can be changed later.</li>
<li>There are 1,682 rows (every row must have an index).</li>
<li>Our dataset has five total columns, one of which isn't populated at all (video_release_date) and two that are missing some values (release_date and imdb_url).</li>
<li>The <code>dtypes</code> line summarizes how many columns hold each datatype; it is not necessarily in the same order as the columns listed above it. Use the <code>dtypes</code> attribute to see the datatype of each individual column.</li>
<li>The last line shows an approximate amount of RAM used to hold the DataFrame. See the <code>memory_usage</code> method for a per-column breakdown.</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="n">movies</span><span class="o">.</span><span class="n">dtypes</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">movie_id int64</span>
<span class="go">title object</span>
<span class="go">release_date object</span>
<span class="go">video_release_date float64</span>
<span class="go">imdb_url object</span>
<span class="go">dtype: object</span>
</code></pre></div>
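<p>That memory estimate can be broken down per column with <code>memory_usage</code>. A quick sketch on a toy frame - passing <code>deep=True</code> also counts the Python strings behind object columns, so its numbers are more accurate (and usually larger):</p>

```python
import pandas as pd

df = pd.DataFrame({'movie_id': [1, 2, 3],
                   'title': ['Toy Story (1995)', 'GoldenEye (1995)',
                             'Four Rooms (1995)']})

# bytes used by the index and each column
print(df.memory_usage())
print(df.memory_usage(deep=True))
```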
<p>DataFrames also have a <code>describe</code> method, which is great for seeing basic statistics about the dataset's numeric columns. Be careful though, since this will return information on all columns of a numeric datatype.</p>
<div class="highlight"><pre><span></span><code><span class="n">users</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>user_id</th>
<th>age</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>943.000000</td>
<td>943.000000</td>
</tr>
<tr>
<th>mean</th>
<td>472.000000</td>
<td>34.051962</td>
</tr>
<tr>
<th>std</th>
<td>272.364951</td>
<td>12.192740</td>
</tr>
<tr>
<th>min</th>
<td>1.000000</td>
<td>7.000000</td>
</tr>
<tr>
<th>25%</th>
<td>236.500000</td>
<td>25.000000</td>
</tr>
<tr>
<th>50%</th>
<td>472.000000</td>
<td>31.000000</td>
</tr>
<tr>
<th>75%</th>
<td>707.500000</td>
<td>43.000000</td>
</tr>
<tr>
<th>max</th>
<td>943.000000</td>
<td>73.000000</td>
</tr>
</tbody>
</table>
<p>Notice user_id was included since it's numeric. Since this is an ID value, the stats for it don't really matter.</p>
<p>We can quickly see the average age of our users is just above 34 years old, with the youngest being 7 and the oldest being 73. The median age is 31, with the youngest quartile of users being 25 or younger, and the oldest quartile being at least 43.</p>
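<p>If you only care about one column's statistics, you can call <code>describe</code> on a single Series instead, which sidesteps the meaningless ID stats. A small sketch with made-up users:</p>

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3, 4],
                      'age': [24, 53, 23, 33]})

# describe just the age Series, skipping user_id entirely
stats = users['age'].describe()
print(stats)
```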
<p>You've probably noticed that I've used the <code>head</code> method regularly throughout this post - by default, <code>head</code> displays the first five records of the dataset, while <code>tail</code> displays the last five.</p>
<div class="highlight"><pre><span></span><code><span class="n">movies</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>movie_id</th>
<th>title</th>
<th>release_date</th>
<th>video_release_date</th>
<th>imdb_url</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>Toy Story (1995)</td>
<td>01-Jan-1995</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Toy%20Story%2...</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>GoldenEye (1995)</td>
<td>01-Jan-1995</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?GoldenEye%20(...</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>Four Rooms (1995)</td>
<td>01-Jan-1995</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Four%20Rooms%...</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>Get Shorty (1995)</td>
<td>01-Jan-1995</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Get%20Shorty%...</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>Copycat (1995)</td>
<td>01-Jan-1995</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Copycat%20(1995)</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="n">movies</span><span class="o">.</span><span class="n">tail</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>movie_id</th>
<th>title</th>
<th>release_date</th>
<th>video_release_date</th>
<th>imdb_url</th>
</tr>
</thead>
<tbody>
<tr>
<th>1679</th>
<td>1680</td>
<td>Sliding Doors (1998)</td>
<td>01-Jan-1998</td>
<td>NaN</td>
<td>http://us.imdb.com/Title?Sliding+Doors+(1998)</td>
</tr>
<tr>
<th>1680</th>
<td>1681</td>
<td>You So Crazy (1994)</td>
<td>01-Jan-1994</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?You%20So%20Cr...</td>
</tr>
<tr>
<th>1681</th>
<td>1682</td>
<td>Scream of Stone (Schrei aus Stein) (1991)</td>
<td>08-Mar-1996</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Schrei%20aus%...</td>
</tr>
</tbody>
</table>
<p>Alternatively, Python's regular <a href="https://docs.python.org/release/2.3.5/whatsnew/section-slices.html">slicing</a> syntax works as well.</p>
<div class="highlight"><pre><span></span><code><span class="n">movies</span><span class="p">[</span><span class="mi">20</span><span class="p">:</span><span class="mi">22</span><span class="p">]</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>movie_id</th>
<th>title</th>
<th>release_date</th>
<th>video_release_date</th>
<th>imdb_url</th>
</tr>
</thead>
<tbody>
<tr>
<th>20</th>
<td>21</td>
<td>Muppet Treasure Island (1996)</td>
<td>16-Feb-1996</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Muppet%20Trea...</td>
</tr>
<tr>
<th>21</th>
<td>22</td>
<td>Braveheart (1995)</td>
<td>16-Feb-1996</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Braveheart%20...</td>
</tr>
</tbody>
</table>
<h3>Selecting</h3>
<p>You can think of a DataFrame as a group of Series that share an index (the row labels), keyed by the column headers. This makes it easy to select specific columns.</p>
<p>Selecting a single column from the DataFrame will return a Series object.</p>
<div class="highlight"><pre><span></span><code><span class="n">users</span><span class="p">[</span><span class="s1">'occupation'</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">0 technician</span>
<span class="go">1 other</span>
<span class="go">2 writer</span>
<span class="go">3 technician</span>
<span class="go">4 other</span>
<span class="go">Name: occupation, dtype: object</span>
</code></pre></div>
<p>To select multiple columns, simply pass a list of column names to the DataFrame; the output will be a DataFrame as well.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="p">[[</span><span class="s1">'age'</span><span class="p">,</span> <span class="s1">'zip_code'</span><span class="p">]]</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="c1"># can also store in a variable to use later</span>
<span class="n">columns_you_want</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'occupation'</span><span class="p">,</span> <span class="s1">'sex'</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="p">[</span><span class="n">columns_you_want</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> age zip_code</span>
<span class="go">0 24 85711</span>
<span class="go">1 53 94043</span>
<span class="go">2 23 32067</span>
<span class="go">3 24 43537</span>
<span class="go">4 33 15213</span>
<span class="go"> occupation sex</span>
<span class="go">0 technician M</span>
<span class="go">1 other F</span>
<span class="go">2 writer M</span>
<span class="go">3 technician M</span>
<span class="go">4 other F</span>
</code></pre></div>
<p>Row selection can be done multiple ways, but selecting by an individual index or by boolean indexing is typically easiest.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># users older than 25</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="p">[</span><span class="n">users</span><span class="o">.</span><span class="n">age</span> <span class="o">></span> <span class="mi">25</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="c1"># users aged 40 AND male</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="p">[(</span><span class="n">users</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="mi">40</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">sex</span> <span class="o">==</span> <span class="s1">'M'</span><span class="p">)]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="c1"># users younger than 30 OR female</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="p">[(</span><span class="n">users</span><span class="o">.</span><span class="n">sex</span> <span class="o">==</span> <span class="s1">'F'</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">age</span> <span class="o"><</span> <span class="mi">30</span><span class="p">)]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> user_id age sex occupation zip_code</span>
<span class="go">1 2 53 F other 94043</span>
<span class="go">4 5 33 F other 15213</span>
<span class="go">5 6 42 M executive 98101</span>
<span class="go"> user_id age sex occupation zip_code</span>
<span class="go">18 19 40 M librarian 02138</span>
<span class="go">82 83 40 M other 44133</span>
<span class="go">115 116 40 M healthcare 97232</span>
<span class="go"> user_id age sex occupation zip_code</span>
<span class="go">0 1 24 M technician 85711</span>
<span class="go">1 2 53 F other 94043</span>
<span class="go">2 3 23 M writer 32067</span>
</code></pre></div>
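<p>Selecting by an individual index usually means <code>loc</code> (label-based) or <code>iloc</code> (position-based). A minimal sketch with a toy frame whose index labels deliberately differ from the row positions:</p>

```python
import pandas as pd

# index labels (10, 20, 30) differ from positions (0, 1, 2)
users = pd.DataFrame({'user_id': [1, 2, 3],
                      'age': [24, 53, 23],
                      'sex': ['M', 'F', 'M']},
                     index=[10, 20, 30])

print(users.loc[20])   # row whose index *label* is 20
print(users.iloc[0])   # first row by *position*, regardless of label
```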
<p>Since our index is kind of meaningless right now, let's set it to the user_id using the <code>set_index</code> method. By default, <code>set_index</code> returns a new DataFrame, so you'll have to specify if you'd like the changes to occur in place.</p>
<p>This has confused me in the past, so look carefully at the code and output below.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'user_id'</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">^^^ I didn't actually change the DataFrame. ^^^</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="n">with_new_index</span> <span class="o">=</span> <span class="n">users</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'user_id'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">with_new_index</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">^^^ set_index actually returns a new DataFrame. ^^^</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> age sex occupation zip_code</span>
<span class="go">user_id </span>
<span class="go">1 24 M technician 85711</span>
<span class="go">2 53 F other 94043</span>
<span class="go">3 23 M writer 32067</span>
<span class="go">4 24 M technician 43537</span>
<span class="go">5 33 F other 15213</span>
<span class="go"> user_id age sex occupation zip_code</span>
<span class="go">0 1 24 M technician 85711</span>
<span class="go">1 2 53 F other 94043</span>
<span class="go">2 3 23 M writer 32067</span>
<span class="go">3 4 24 M technician 43537</span>
<span class="go">4 5 33 F other 15213</span>
<span class="go">^^^ I didn't actually change the DataFrame. ^^^</span>
<span class="go"> age sex occupation zip_code</span>
<span class="go">user_id </span>
<span class="go">1 24 M technician 85711</span>
<span class="go">2 53 F other 94043</span>
<span class="go">3 23 M writer 32067</span>
<span class="go">4 24 M technician 43537</span>
<span class="go">5 33 F other 15213</span>
<span class="go">^^^ set_index actually returns a new DataFrame. ^^^</span>
</code></pre></div>
<p>If you want to modify your existing DataFrame, use the <code>inplace</code> parameter. Most DataFrame methods return a new DataFrame, while offering an <code>inplace</code> parameter. Note that the <code>inplace</code> version might not actually be any more efficient (in terms of speed or memory usage) than the regular version.</p>
<div class="highlight"><pre><span></span><code><span class="n">users</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'user_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">users</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>age</th>
<th>sex</th>
<th>occupation</th>
<th>zip_code</th>
</tr>
<tr>
<th>user_id</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>24</td>
<td>M</td>
<td>technician</td>
<td>85711</td>
</tr>
<tr>
<th>2</th>
<td>53</td>
<td>F</td>
<td>other</td>
<td>94043</td>
</tr>
<tr>
<th>3</th>
<td>23</td>
<td>M</td>
<td>writer</td>
<td>32067</td>
</tr>
<tr>
<th>4</th>
<td>24</td>
<td>M</td>
<td>technician</td>
<td>43537</td>
</tr>
<tr>
<th>5</th>
<td>33</td>
<td>F</td>
<td>other</td>
<td>15213</td>
</tr>
</tbody>
</table>
<p>Notice that we've lost the default pandas 0-based index and moved the user_id into its place. We can select rows by position using the <code>iloc</code> method.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">99</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">iloc</span><span class="p">[[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">300</span><span class="p">]])</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">age 36</span>
<span class="go">sex M</span>
<span class="go">occupation executive</span>
<span class="go">zip_code 90254</span>
<span class="go">Name: 100, dtype: object</span>
<span class="go"> age sex occupation zip_code</span>
<span class="go">user_id </span>
<span class="go">2 53 F other 94043</span>
<span class="go">51 28 M educator 16509</span>
<span class="go">301 24 M student 55439</span>
</code></pre></div>
<p>And we can select rows by label with the <code>loc</code> method.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="mi">100</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">51</span><span class="p">,</span> <span class="mi">301</span><span class="p">]])</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">age 36</span>
<span class="go">sex M</span>
<span class="go">occupation executive</span>
<span class="go">zip_code 90254</span>
<span class="go">Name: 100, dtype: object</span>
<span class="go"> age sex occupation zip_code</span>
<span class="go">user_id </span>
<span class="go">2 53 F other 94043</span>
<span class="go">51 28 M educator 16509</span>
<span class="go">301 24 M student 55439</span>
</code></pre></div>
<p>If we realize later that we liked the old pandas default index, we can just <code>reset_index</code>. The same rules for <code>inplace</code> apply.</p>
<div class="highlight"><pre><span></span><code><span class="n">users</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">users</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>user_id</th>
<th>age</th>
<th>sex</th>
<th>occupation</th>
<th>zip_code</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>24</td>
<td>M</td>
<td>technician</td>
<td>85711</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>53</td>
<td>F</td>
<td>other</td>
<td>94043</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>23</td>
<td>M</td>
<td>writer</td>
<td>32067</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>24</td>
<td>M</td>
<td>technician</td>
<td>43537</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>33</td>
<td>F</td>
<td>other</td>
<td>15213</td>
</tr>
</tbody>
</table>
<p>The simplified rules of indexing are:</p>
<ul>
<li>Use <code>loc</code> for label-based indexing</li>
<li>Use <code>iloc</code> for positional indexing</li>
</ul>
<p>I've found that I can usually get by with boolean indexing, <code>loc</code>, and <code>iloc</code>, but pandas has a whole host of <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html">other ways to do selection</a>.</p>
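<p>The distinction between the two matters most when the integer labels don't line up with the positions. A minimal sketch (a hypothetical three-row frame, not the MovieLens data):</p>

```python
import pandas as pd

# A frame whose integer labels deliberately differ from their positions
df = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[10, 20, 30])

print(df.loc[10])   # label-based: the row labeled 10 (happens to be first)
print(df.iloc[0])   # position-based: the first row (also labeled 10)
print(df.iloc[2])   # third row, which loc would reach via df.loc[30]
```

With the default 0-based index, <code>loc</code> and <code>iloc</code> happen to agree, which is why the difference is easy to miss until you set a custom index.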
<h3>Joining</h3>
<p>Throughout an analysis, we'll often need to merge/join datasets as data is typically stored in a <a href="https://en.wikipedia.org/wiki/Relational_database">relational</a> manner.</p>
<p>Our MovieLens data is a good example of this - a rating requires both a user and a movie, and the datasets are linked together by a key - in this case, the user_id and movie_id. It's possible for a user to be associated with zero or many ratings and movies. Likewise, a movie can be rated zero or many times, by a number of different users.</p>
<p>Like SQL's JOIN clause, <code>pandas.merge</code> allows two DataFrames to be joined on one or more keys. The function provides a series of parameters (<code>on, left_on, right_on, left_index, right_index</code>) allowing you to specify the columns or indexes on which to join.</p>
<p>By default, <code>pandas.merge</code> operates as an inner join, which can be changed using the <code>how</code> parameter.</p>
<p>From the function's docstring:</p>
<blockquote>
<p>how : {'left', 'right', 'outer', 'inner'}, default 'inner'
- left: use only keys from left frame (SQL: left outer join)
- right: use only keys from right frame (SQL: right outer join)
- outer: use union of keys from both frames (SQL: full outer join)
- inner: use intersection of keys from both frames (SQL: inner join)</p>
</blockquote>
<p>Below are some examples of what each looks like.</p>
<div class="highlight"><pre><span></span><code><span class="n">left_frame</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'key'</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span>
<span class="s1">'left_value'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="s1">'c'</span><span class="p">,</span> <span class="s1">'d'</span><span class="p">,</span> <span class="s1">'e'</span><span class="p">]})</span>
<span class="n">right_frame</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'key'</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">7</span><span class="p">),</span>
<span class="s1">'right_value'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'f'</span><span class="p">,</span> <span class="s1">'g'</span><span class="p">,</span> <span class="s1">'h'</span><span class="p">,</span> <span class="s1">'i'</span><span class="p">,</span> <span class="s1">'j'</span><span class="p">]})</span>
<span class="nb">print</span><span class="p">(</span><span class="n">left_frame</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">right_frame</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> key left_value</span>
<span class="go">0 0 a</span>
<span class="go">1 1 b</span>
<span class="go">2 2 c</span>
<span class="go">3 3 d</span>
<span class="go">4 4 e</span>
<span class="go"> key right_value</span>
<span class="go">0 2 f</span>
<span class="go">1 3 g</span>
<span class="go">2 4 h</span>
<span class="go">3 5 i</span>
<span class="go">4 6 j</span>
</code></pre></div>
<h4>inner join (default)</h4>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s1">'key'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'inner'</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2</td>
<td>c</td>
<td>f</td>
</tr>
<tr>
<th>1</th>
<td>3</td>
<td>d</td>
<td>g</td>
</tr>
<tr>
<th>2</th>
<td>4</td>
<td>e</td>
<td>h</td>
</tr>
</tbody>
</table>
<p>We lose values from both frames since certain keys do not match up. The SQL equivalent is:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="p">,</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="n">left_value</span><span class="p">,</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="n">right_value</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">left_frame</span><span class="w"></span>
<span class="k">INNER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">right_frame</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>Had our key columns not been named the same, we could have used the <code>left_on</code> and <code>right_on</code> parameters to specify which fields to join from each frame.</p>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="s1">'left_key'</span><span class="p">,</span> <span class="n">right_on</span><span class="o">=</span><span class="s1">'right_key'</span><span class="p">)</span>
</code></pre></div>
<p>Alternatively, if our keys were indexes, we could use the <code>left_index</code> or <code>right_index</code> parameters, which accept a True/False value. You can mix and match columns and indexes like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="s1">'key'</span><span class="p">,</span> <span class="n">right_index</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<h4>left outer join</h4>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s1">'key'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'left'</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>a</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>b</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>c</td>
<td>f</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>d</td>
<td>g</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>e</td>
<td>h</td>
</tr>
</tbody>
</table>
<p>We keep everything from the left frame, pulling in the value from the right frame where the keys match up. Where the keys do not match, right_value is NULL (shown as NaN).</p>
<p>SQL Equivalent:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="p">,</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="n">left_value</span><span class="p">,</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="n">right_value</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">left_frame</span><span class="w"></span>
<span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">right_frame</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<h4>right outer join</h4>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s1">'key'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'right'</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2</td>
<td>c</td>
<td>f</td>
</tr>
<tr>
<th>1</th>
<td>3</td>
<td>d</td>
<td>g</td>
</tr>
<tr>
<th>2</th>
<td>4</td>
<td>e</td>
<td>h</td>
</tr>
<tr>
<th>3</th>
<td>5</td>
<td>NaN</td>
<td>i</td>
</tr>
<tr>
<th>4</th>
<td>6</td>
<td>NaN</td>
<td>j</td>
</tr>
</tbody>
</table>
<p>This time we've kept everything from the right frame with the left_value being NULL where the right frame's key did not find a match.</p>
<p>SQL Equivalent:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">,</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="n">left_value</span><span class="p">,</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="n">right_value</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">left_frame</span><span class="w"></span>
<span class="k">RIGHT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">right_frame</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<h4>full outer join</h4>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s1">'key'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'outer'</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>a</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>b</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>c</td>
<td>f</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>d</td>
<td>g</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>e</td>
<td>h</td>
</tr>
<tr>
<th>5</th>
<td>5</td>
<td>NaN</td>
<td>i</td>
</tr>
<tr>
<th>6</th>
<td>6</td>
<td>NaN</td>
<td>j</td>
</tr>
</tbody>
</table>
<p>We've kept everything from both frames, regardless of whether or not there was a match on both sides. Where there was not a match, the values corresponding to that key are NULL.</p>
<p>SQL Equivalent (though some databases, such as MySQL, don't support FULL OUTER JOIN):</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">IFNULL</span><span class="p">(</span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="p">,</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">)</span><span class="w"> </span><span class="k">key</span><span class="w"></span>
<span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="n">left_value</span><span class="p">,</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="n">right_value</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">left_frame</span><span class="w"></span>
<span class="k">FULL</span><span class="w"> </span><span class="k">OUTER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">right_frame</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
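<p>When a join doesn't produce what you expect, it helps to see which side each row came from. A quick sketch using <code>merge</code>'s <code>indicator</code> parameter, which tags every row in an extra <code>_merge</code> column:</p>

```python
import pandas as pd

left_frame = pd.DataFrame({'key': range(5), 'left_value': list('abcde')})
right_frame = pd.DataFrame({'key': range(2, 7), 'right_value': list('fghij')})

# indicator=True adds a _merge column: 'left_only', 'right_only', or 'both'
merged = pd.merge(left_frame, right_frame, on='key', how='outer', indicator=True)
print(merged)
print(merged['_merge'].value_counts())
```

This is essentially a full outer join annotated with each row's provenance, which makes spotting key mismatches much faster than eyeballing NaNs.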
<h3>Combining</h3>
<p>pandas also provides a way to combine DataFrames along an axis - <code>pandas.concat</code>. While the function is roughly equivalent to SQL's UNION ALL clause (duplicate rows are kept, not removed), there's a lot more that can be done with it.</p>
<p><code>pandas.concat</code> takes a list of Series or DataFrames and returns a Series or DataFrame of the concatenated objects. Note that because the function takes a list, you can combine many objects at once.</p>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">])</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>a</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>b</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>c</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>d</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>e</td>
<td>NaN</td>
</tr>
<tr>
<th>0</th>
<td>2</td>
<td>NaN</td>
<td>f</td>
</tr>
<tr>
<th>1</th>
<td>3</td>
<td>NaN</td>
<td>g</td>
</tr>
<tr>
<th>2</th>
<td>4</td>
<td>NaN</td>
<td>h</td>
</tr>
<tr>
<th>3</th>
<td>5</td>
<td>NaN</td>
<td>i</td>
</tr>
<tr>
<th>4</th>
<td>6</td>
<td>NaN</td>
<td>j</td>
</tr>
</tbody>
</table>
<p>By default, the function will vertically append the objects to one another, combining columns with the same name. We can see above that values not matching up will be NULL.</p>
<p>Additionally, objects can be concatenated side-by-side using the function's axis parameter.</p>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>key</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>a</td>
<td>2</td>
<td>f</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>b</td>
<td>3</td>
<td>g</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>c</td>
<td>4</td>
<td>h</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>d</td>
<td>5</td>
<td>i</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>e</td>
<td>6</td>
<td>j</td>
</tr>
</tbody>
</table>
<p><code>pandas.concat</code> can be used in a variety of ways; however, I've typically only used it to combine Series/DataFrames into one unified object. The <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concatenating-objects">documentation</a> has some examples on the ways it can be used.</p>
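<p>Two <code>concat</code> parameters worth knowing (a sketch using frames that mirror the ones above): <code>ignore_index</code> rebuilds a clean 0-based index instead of keeping the duplicated ones you can see in the first example, and <code>keys</code> labels each piece so you can tell where a row came from.</p>

```python
import pandas as pd

left_frame = pd.DataFrame({'key': range(5), 'left_value': list('abcde')})
right_frame = pd.DataFrame({'key': range(2, 7), 'right_value': list('fghij')})

# ignore_index=True discards the original (duplicated) indexes
stacked = pd.concat([left_frame, right_frame], ignore_index=True)
print(stacked.index.tolist())  # 0 through 9, no repeats

# keys=... builds a hierarchical index labeling each source frame
labeled = pd.concat([left_frame, right_frame], keys=['left', 'right'])
print(labeled.loc['right'])  # just the rows that came from right_frame
```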
<h3>Grouping</h3>
<p>Grouping in pandas took some time for me to grasp, but it's pretty awesome once it clicks.</p>
<p>pandas <code>groupby</code> method draws largely from the <a href="https://vita.had.co.nz/papers/plyr.html">split-apply-combine strategy for data analysis</a>. If you're not familiar with this methodology, I highly suggest you read up on it. It does a great job of illustrating how to properly think through a data problem, which I feel is more important than any technical skill a data analyst/scientist can possess.</p>
<p>When approaching a data analysis problem, you'll often break it apart into manageable pieces, perform some operations on each of the pieces, and then put everything back together again (this is the gist of the split-apply-combine strategy). pandas <code>groupby</code> is great for these problems (R users should check out the <a href="http://plyr.had.co.nz/">plyr</a> and <a href="https://github.com/tidyverse/dplyr">dplyr</a> packages).</p>
<p>If you've ever used SQL's GROUP BY or an Excel Pivot Table, you've thought with this mindset, probably without realizing it.</p>
<p>Assume we have a DataFrame and want to get the average for each group - visually, the split-apply-combine method looks like this:
<img src="http://i.imgur.com/yjNkiwL.png" alt="The split-apply-combine strategy"> (gratuitously borrowed from <a href="http://courses.had.co.nz/12-oscon/">Hadley Wickham's Data Science in R slides</a>)</p>
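<p>To make the three steps concrete, here's a toy sketch (hypothetical data, not the salary data we'll load below) doing split, apply, and combine by hand, then letting <code>groupby</code> do all three at once:</p>

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                   'value': [1, 3, 2, 4, 6]})

# split: break the frame into one piece per group
pieces = {name: piece for name, piece in df.groupby('group')}
# apply: operate on each piece independently
means = {name: piece['value'].mean() for name, piece in pieces.items()}
# combine: put the per-group results back into a single object
combined = pd.Series(means)
print(combined)

# groupby does all three steps in one expression
print(df.groupby('group')['value'].mean())
```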
<p>The City of Chicago is kind enough to publish all city employee salaries to its open data portal. Let's go through some basic <code>groupby</code> examples using this data.</p>
<div class="highlight"><pre><span></span><code>!head -n <span class="m">3</span> city-of-chicago-salaries.csv
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">Name,Position Title,Department,Employee Annual Salary</span>
<span class="go">"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$85512.00</span>
<span class="go">"AARON, JEFFERY M",POLICE OFFICER,POLICE,$75372.00</span>
</code></pre></div>
<p>Since the data contains a dollar sign in each salary, pandas will read the field in as strings. We can use the <code>converters</code> parameter to change this when reading in the file.</p>
<blockquote>
<p>converters : dict, optional
- Dict of functions for converting values in certain columns. Keys can either be integers or column labels</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="n">headers</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'name'</span><span class="p">,</span> <span class="s1">'title'</span><span class="p">,</span> <span class="s1">'department'</span><span class="p">,</span> <span class="s1">'salary'</span><span class="p">]</span>
<span class="n">chicago</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'city-of-chicago-salaries.csv'</span><span class="p">,</span>
<span class="n">header</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">names</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span>
<span class="n">converters</span><span class="o">=</span><span class="p">{</span><span class="s1">'salary'</span><span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'$'</span><span class="p">,</span> <span class="s1">''</span><span class="p">))})</span>
<span class="n">chicago</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>title</th>
<th>department</th>
<th>salary</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>AARON, ELVIA J</td>
<td>WATER RATE TAKER</td>
<td>WATER MGMNT</td>
<td>85512</td>
</tr>
<tr>
<th>1</th>
<td>AARON, JEFFERY M</td>
<td>POLICE OFFICER</td>
<td>POLICE</td>
<td>75372</td>
</tr>
<tr>
<th>2</th>
<td>AARON, KIMBERLEI R</td>
<td>CHIEF CONTRACT EXPEDITER</td>
<td>GENERAL SERVICES</td>
<td>80916</td>
</tr>
<tr>
<th>3</th>
<td>ABAD JR, VICENTE M</td>
<td>CIVIL ENGINEER IV</td>
<td>WATER MGMNT</td>
<td>99648</td>
</tr>
<tr>
<th>4</th>
<td>ABBATACOLA, ROBERT J</td>
<td>ELECTRICAL MECHANIC</td>
<td>AVIATION</td>
<td>89440</td>
</tr>
</tbody>
</table>
<p>pandas <code>groupby</code> returns a DataFrameGroupBy object which has a variety of methods, many of which are similar to standard SQL aggregate functions.</p>
<div class="highlight"><pre><span></span><code><span class="n">by_dept</span> <span class="o">=</span> <span class="n">chicago</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'department'</span><span class="p">)</span>
<span class="n">by_dept</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"><pandas.core.groupby.DataFrameGroupBy object at 0x1128ca1d0></span>
</code></pre></div>
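<p>Grouped objects are also iterable, yielding <code>(name, group)</code> pairs - handy for spot-checking a grouping before aggregating. A self-contained sketch with a tiny made-up frame (so it runs without the salary CSV):</p>

```python
import pandas as pd

df = pd.DataFrame({'department': ['POLICE', 'POLICE', 'FIRE'],
                   'salary': [75372.0, 80916.0, 99648.0]})

for name, group in df.groupby('department'):
    # name is the group key; group is the sub-DataFrame for that key
    print(name, len(group))
```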
<p>Calling <code>count</code> returns the total number of NOT NULL values within each column. If we were interested in the total number of records in each group, we could use <code>size</code>.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">by_dept</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">head</span><span class="p">())</span> <span class="c1"># NOT NULL records within each column</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">by_dept</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">tail</span><span class="p">())</span> <span class="c1"># total records for each department</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> name title salary</span>
<span class="go">department </span>
<span class="go">ADMIN HEARNG 42 42 42</span>
<span class="go">ANIMAL CONTRL 61 61 61</span>
<span class="go">AVIATION 1218 1218 1218</span>
<span class="go">BOARD OF ELECTION 110 110 110</span>
<span class="go">BOARD OF ETHICS 9 9 9</span>
<span class="go">department</span>
<span class="go">PUBLIC LIBRARY 926</span>
<span class="go">STREETS & SAN 2070</span>
<span class="go">TRANSPORTN 1168</span>
<span class="go">TREASURER 25</span>
<span class="go">WATER MGMNT 1857</span>
<span class="go">dtype: int64</span>
</code></pre></div>
<p>Summation can be done via <code>sum</code>, averaging by <code>mean</code>, etc. (if it's a SQL function, chances are it exists in pandas). Oh, and there's median too, something not available in most databases.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">by_dept</span><span class="o">.</span><span class="n">sum</span><span class="p">()[</span><span class="mi">20</span><span class="p">:</span><span class="mi">25</span><span class="p">])</span> <span class="c1"># total salaries of each department</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">by_dept</span><span class="o">.</span><span class="n">mean</span><span class="p">()[</span><span class="mi">20</span><span class="p">:</span><span class="mi">25</span><span class="p">])</span> <span class="c1"># average salary of each department</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">by_dept</span><span class="o">.</span><span class="n">median</span><span class="p">()[</span><span class="mi">20</span><span class="p">:</span><span class="mi">25</span><span class="p">])</span> <span class="c1"># take that, RDBMS!</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> salary</span>
<span class="go">department </span>
<span class="go">HUMAN RESOURCES 4850928.0</span>
<span class="go">INSPECTOR GEN 4035150.0</span>
<span class="go">IPRA 7006128.0</span>
<span class="go">LAW 31883920.2</span>
<span class="go">LICENSE APPL COMM 65436.0</span>
<span class="go"> salary</span>
<span class="go">department </span>
<span class="go">HUMAN RESOURCES 71337.176471</span>
<span class="go">INSPECTOR GEN 80703.000000</span>
<span class="go">IPRA 82425.035294</span>
<span class="go">LAW 70853.156000</span>
<span class="go">LICENSE APPL COMM 65436.000000</span>
<span class="go"> salary</span>
<span class="go">department </span>
<span class="go">HUMAN RESOURCES 68496</span>
<span class="go">INSPECTOR GEN 76116</span>
<span class="go">IPRA 82524</span>
<span class="go">LAW 66492</span>
<span class="go">LICENSE APPL COMM 65436</span>
</code></pre></div>
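<p>Rather than calling <code>sum</code>, <code>mean</code>, and <code>median</code> separately, all three can be computed in a single pass with <code>agg</code>. A minimal sketch, using a toy DataFrame standing in for the Chicago salary data (the figures here are made up for illustration):</p>

```python
import pandas as pd

# toy stand-in for the Chicago salary data used above
chicago = pd.DataFrame({
    'department': ['LAW', 'LAW', 'IPRA', 'IPRA'],
    'salary': [70000, 66492, 82524, 82326],
})

# one groupby, three aggregates at once
summary = chicago.groupby('department')['salary'].agg(['sum', 'mean', 'median'])
print(summary)
```

<p>This returns a DataFrame with one column per aggregate, which is often more convenient than three separate calls.</p>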
<p>Operations can also be done on an individual Series within a grouped object. Say we were curious about the five departments with the most distinct titles - the pandas equivalent to:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">department</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="k">DISTINCT</span><span class="w"> </span><span class="n">title</span><span class="p">)</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">chicago</span><span class="w"></span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">department</span><span class="w"></span>
<span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="w"></span>
<span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>pandas is a lot less verbose here ...</p>
<div class="highlight"><pre><span></span><code><span class="n">by_dept</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">5</span><span class="p">]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">department</span>
<span class="go">WATER MGMNT 153</span>
<span class="go">TRANSPORTN 150</span>
<span class="go">POLICE 130</span>
<span class="go">AVIATION 125</span>
<span class="go">HEALTH 118</span>
<span class="go">Name: title, dtype: int64</span>
</code></pre></div>
<h3>split-apply-combine</h3>
<p>The real power of <code>groupby</code> comes from its split-apply-combine ability.</p>
<p>What if we wanted to see the highest paid employee within each department? Given our current dataset, we'd have to do something like this in SQL:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">chicago</span><span class="w"> </span><span class="k">c</span><span class="w"></span>
<span class="k">INNER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="p">(</span><span class="w"></span>
<span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">department</span><span class="p">,</span><span class="w"> </span><span class="k">max</span><span class="p">(</span><span class="n">salary</span><span class="p">)</span><span class="w"> </span><span class="n">max_salary</span><span class="w"></span>
<span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">chicago</span><span class="w"></span>
<span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">department</span><span class="w"></span>
<span class="p">)</span><span class="w"> </span><span class="n">m</span><span class="w"></span>
<span class="k">ON</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">department</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">department</span><span class="w"></span>
<span class="k">AND</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">salary</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">max_salary</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>This would give you the highest paid person in each department, but it would return multiple rows for a department if several people tied for its highest salary.</p>
<p>Alternatively, you could alter the table, add a column, and then write an update statement to populate that column. However, that's not always an option.</p>
<p><em>Note: This would be a lot easier in PostgreSQL, T-SQL, and possibly Oracle due to the existence of partition/window/analytic functions. I've chosen to use MySQL syntax throughout this tutorial because of its popularity. Unfortunately, MySQL doesn't have similar functions.</em></p>
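<p>For completeness, the SQL self-join above also has a direct pandas analogue: <code>transform</code> broadcasts a group-level aggregate back to every row, so each salary can be compared to its department's maximum. A sketch on a small made-up DataFrame (like the SQL version, ties would still return multiple rows):</p>

```python
import pandas as pd

# hypothetical stand-in for the Chicago salary DataFrame
chicago = pd.DataFrame({
    'name': ['PATTON', 'SMITH', 'SANTIAGO'],
    'department': ['LAW', 'LAW', 'FIRE'],
    'salary': [173664, 90000, 202728],
})

# transform('max') returns a Series aligned row-for-row with the original
max_salary = chicago.groupby('department')['salary'].transform('max')
top = chicago[chicago['salary'] == max_salary]
print(top)
```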
<p>Using <code>groupby</code> we can define a function (which we'll call <code>ranker</code>) that will label each record from 1 to N, where N is the number of employees within the department. We can then call <code>apply</code> to, well, apply that function to each group (in this case, each department).</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">ranker</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="sd">"""Assigns a rank to each employee based on salary, with 1 being the highest paid.</span>
<span class="sd"> Assumes the data is DESC sorted."""</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'dept_rank'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">chicago</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">'salary'</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">chicago</span> <span class="o">=</span> <span class="n">chicago</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'department'</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">ranker</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">chicago</span><span class="p">[</span><span class="n">chicago</span><span class="o">.</span><span class="n">dept_rank</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">7</span><span class="p">))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> name title department \</span>
<span class="go">18039 MC CARTHY, GARRY F SUPERINTENDENT OF POLICE POLICE </span>
<span class="go">8004 EMANUEL, RAHM MAYOR MAYOR'S OFFICE </span>
<span class="go">25588 SANTIAGO, JOSE A FIRE COMMISSIONER FIRE </span>
<span class="go">763 ANDOLINO, ROSEMARIE S COMMISSIONER OF AVIATION AVIATION </span>
<span class="go">4697 CHOUCAIR, BECHARA N COMMISSIONER OF HEALTH HEALTH </span>
<span class="go">21971 PATTON, STEPHEN R CORPORATION COUNSEL LAW </span>
<span class="go">12635 HOLT, ALEXANDRA D BUDGET DIR BUDGET & MGMT </span>
<span class="go"> salary dept_rank </span>
<span class="go">18039 260004 1 </span>
<span class="go">8004 216210 1 </span>
<span class="go">25588 202728 1 </span>
<span class="go">763 186576 1 </span>
<span class="go">4697 177156 1 </span>
<span class="go">21971 173664 1 </span>
<span class="go">12635 169992 1 </span>
</code></pre></div>
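<p>As an aside, the same ranking can be done without a custom function using <code>rank</code> on the grouped Series. This is an alternative to the <code>ranker</code> approach above, not a replacement for it; a sketch on toy data:</p>

```python
import pandas as pd

# toy stand-in for the Chicago salary data
chicago = pd.DataFrame({
    'name': ['MC CARTHY', 'JONES', 'SANTIAGO'],
    'department': ['POLICE', 'POLICE', 'FIRE'],
    'salary': [260004, 85000, 202728],
})

# method='first' breaks salary ties by row order, mirroring ranker's behavior
chicago['dept_rank'] = (chicago.groupby('department')['salary']
                        .rank(method='first', ascending=False)
                        .astype(int))
top = chicago[chicago.dept_rank == 1]
print(top)
```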
<p><em>Move onto part three, <a href="/2013/10/26/using-pandas-on-the-movielens-dataset/">using pandas with the MovieLens dataset</a>.</em></p>Intro to pandas data structures2013-10-26T01:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-10-26:/2013/10/26/intro-to-pandas-data-structures/<p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><a href="/2013/01/23/translating-sql-to-pandas-part1/">A while back I claimed</a> I was going to write a couple of posts on translating <a href="http://pandas.pydata.org">pandas</a> to SQL. I never …</p><p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><a href="/2013/01/23/translating-sql-to-pandas-part1/">A while back I claimed</a> I was going to write a couple of posts on translating <a href="http://pandas.pydata.org">pandas</a> to SQL. I never followed up. However, the other week a couple of coworkers expressed their interest in learning a bit more about it - this seemed like a good reason to revisit the topic.
What follows is a fairly thorough introduction to the library. I chose to break it into three parts as I felt it was too long and daunting as one.</p>
<ul>
<li><a href="/2013/10/26/intro-to-pandas-data-structures/">Part 1: Intro to pandas data structures</a>, covers the basics of the library's two main data structures - Series and DataFrames.</li>
<li><a href="/2013/10/26/working-with-pandas-dataframes/">Part 2: Working with DataFrames</a>, dives a bit deeper into the functionality of DataFrames. It shows how to inspect, select, filter, merge, combine, and group your data.</li>
<li><a href="/2013/10/26/using-pandas-on-the-movielens-dataset/">Part 3: Using pandas with the MovieLens dataset</a>, applies the learnings of the first two parts in order to answer a few basic analysis questions about the MovieLens ratings data.</li>
</ul>
<p>If you'd like to follow along, you can find the necessary CSV files <a href="https://github.com/gjreda/gregreda.com/tree/master/content/notebooks/data">here</a> and the MovieLens dataset <a href="http://files.grouplens.org/datasets/movielens/ml-100k.zip">here</a>.
My goal for this tutorial is to teach the basics of pandas by comparing and contrasting its syntax with SQL. Since all of my coworkers are familiar with SQL, I feel this is the best way to provide a context that can be easily understood by the intended audience.
If you're interested in learning more about the library, pandas author <a href="https://twitter.com/wesmckinn">Wes McKinney</a> has written <a href="http://www.amazon.com/gp/product/1449319793/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=1449319793&linkCode=as2&tag=gjreda-20&linkId=MCGW4C4NOBRVV5OC">Python for Data Analysis</a>, which covers it in much greater detail.</p>
<h3>What is it?</h3>
<p><a href="http://pandas.pydata.org/">pandas</a> is an open source <a href="http://www.python.org/">Python</a> library for data analysis. Python has always been great for prepping and munging data, but it's never been great for analysis - you'd usually end up using <a href="http://www.r-project.org/">R</a> or loading it into a database and using SQL (or worse, Excel). pandas makes Python great for analysis.</p>
<h2>Data Structures</h2>
<p>pandas introduces two new data structures to Python - <a href="http://pandas.pydata.org/pandas-docs/dev/dsintro.html#series">Series</a> and <a href="http://pandas.pydata.org/pandas-docs/dev/dsintro.html#dataframe">DataFrame</a>, both of which are built on top of <a href="http://www.numpy.org/">NumPy</a> (this means it's fast).</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s1">'max_columns'</span><span class="p">,</span> <span class="mi">50</span><span class="p">)</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
</code></pre></div>
<h3>Series</h3>
<p>A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># create a Series with an arbitrary list</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">7</span><span class="p">,</span> <span class="s1">'Heisenberg'</span><span class="p">,</span> <span class="mf">3.14</span><span class="p">,</span> <span class="o">-</span><span class="mi">1789710578</span><span class="p">,</span> <span class="s1">'Happy Eating!'</span><span class="p">])</span>
<span class="n">s</span>
</code></pre></div>
<pre>
0 7
1 Heisenberg
2 3.14
3 -1789710578
4 Happy Eating!
dtype: object
</pre>
<p>Alternatively, you can specify an index to use when creating the Series.</p>
<div class="highlight"><pre><span></span><code><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">7</span><span class="p">,</span> <span class="s1">'Heisenberg'</span><span class="p">,</span> <span class="mf">3.14</span><span class="p">,</span> <span class="o">-</span><span class="mi">1789710578</span><span class="p">,</span> <span class="s1">'Happy Eating!'</span><span class="p">],</span>
<span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'Z'</span><span class="p">,</span> <span class="s1">'C'</span><span class="p">,</span> <span class="s1">'Y'</span><span class="p">,</span> <span class="s1">'E'</span><span class="p">])</span>
<span class="n">s</span>
</code></pre></div>
<pre>
A 7
Z Heisenberg
C 3.14
Y -1789710578
E Happy Eating!
dtype: object
</pre>
<p>The Series constructor can convert a dictionary as well, using the keys of the dictionary as its index.</p>
<div class="highlight"><pre><span></span><code><span class="n">d</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'Chicago'</span><span class="p">:</span> <span class="mi">1000</span><span class="p">,</span> <span class="s1">'New York'</span><span class="p">:</span> <span class="mi">1300</span><span class="p">,</span> <span class="s1">'Portland'</span><span class="p">:</span> <span class="mi">900</span><span class="p">,</span> <span class="s1">'San Francisco'</span><span class="p">:</span> <span class="mi">1100</span><span class="p">,</span>
<span class="s1">'Austin'</span><span class="p">:</span> <span class="mi">450</span><span class="p">,</span> <span class="s1">'Boston'</span><span class="p">:</span> <span class="kc">None</span><span class="p">}</span>
<span class="n">cities</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>
<span class="n">cities</span>
</code></pre></div>
<pre>
Austin 450
Boston NaN
Chicago 1000
New York 1300
Portland 900
San Francisco 1100
dtype: float64
</pre>
<p>You can use the index to select specific items from the Series ...</p>
<div class="highlight"><pre><span></span><code><span class="n">cities</span><span class="p">[</span><span class="s1">'Chicago'</span><span class="p">]</span>
</code></pre></div>
<pre>
1000.0
</pre>
<div class="highlight"><pre><span></span><code><span class="n">cities</span><span class="p">[[</span><span class="s1">'Chicago'</span><span class="p">,</span> <span class="s1">'Portland'</span><span class="p">,</span> <span class="s1">'San Francisco'</span><span class="p">]]</span>
</code></pre></div>
<pre>
Chicago 1000
Portland 900
San Francisco 1100
dtype: float64
</pre>
<p>Or you can use boolean indexing for selection.</p>
<div class="highlight"><pre><span></span><code><span class="n">cities</span><span class="p">[</span><span class="n">cities</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">]</span>
</code></pre></div>
<pre>
Austin 450
Portland 900
dtype: float64
</pre>
<p>That last one might look a little odd, so let's make it clearer - <code>cities &lt; 1000</code> returns a Series of True/False values, which we then pass to our <code>cities</code> Series, returning the items where the value is True.</p>
<div class="highlight"><pre><span></span><code><span class="n">less_than_1000</span> <span class="o">=</span> <span class="n">cities</span> <span class="o"><</span> <span class="mi">1000</span>
<span class="nb">print</span><span class="p">(</span><span class="n">less_than_1000</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[</span><span class="n">less_than_1000</span><span class="p">])</span>
</code></pre></div>
<pre>
Austin True
Boston False
Chicago False
New York False
Portland True
San Francisco False
dtype: bool
Austin 450
Portland 900
dtype: float64
</pre>
<p>You can also change the values in a Series on the fly.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># changing based on the index</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Old value:'</span><span class="p">,</span> <span class="n">cities</span><span class="p">[</span><span class="s1">'Chicago'</span><span class="p">])</span>
<span class="n">cities</span><span class="p">[</span><span class="s1">'Chicago'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1400</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'New value:'</span><span class="p">,</span> <span class="n">cities</span><span class="p">[</span><span class="s1">'Chicago'</span><span class="p">])</span>
</code></pre></div>
<pre>
Old value: 1000.0
New value: 1400.0
</pre>
<div class="highlight"><pre><span></span><code><span class="c1"># changing values using boolean logic</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[</span><span class="n">cities</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="n">cities</span><span class="p">[</span><span class="n">cities</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">]</span> <span class="o">=</span> <span class="mi">750</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[</span><span class="n">cities</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">])</span>
</code></pre></div>
<pre>
Austin 450
Portland 900
dtype: float64
Austin 750
Portland 750
dtype: float64
</pre>
<p>What if you aren't sure whether an item is in the Series? You can check using idiomatic Python.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="s1">'Seattle'</span> <span class="ow">in</span> <span class="n">cities</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'San Francisco'</span> <span class="ow">in</span> <span class="n">cities</span><span class="p">)</span>
</code></pre></div>
<pre>
False
True
</pre>
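<p>Relatedly, <code>Series.get</code> lets you look up a label without risking a <code>KeyError</code>, returning <code>None</code> (or a default you supply) when the label is missing. A small sketch:</p>

```python
import pandas as pd

cities = pd.Series({'Chicago': 1400, 'Portland': 750})

print(cities.get('Chicago'))       # 1400
print(cities.get('Seattle'))       # None - missing label, no KeyError
print(cities.get('Seattle', 0))    # 0 - caller-supplied default
```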
<p>Mathematical operations can be done using scalars and functions.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># divide city values by 3</span>
<span class="n">cities</span> <span class="o">/</span> <span class="mi">3</span>
</code></pre></div>
<pre>
Austin 250.000000
Boston NaN
Chicago 466.666667
New York 433.333333
Portland 250.000000
San Francisco 366.666667
dtype: float64
</pre>
<div class="highlight"><pre><span></span><code><span class="c1"># square city values</span>
<span class="n">np</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="n">cities</span><span class="p">)</span>
</code></pre></div>
<pre>
Austin 562500
Boston NaN
Chicago 1960000
New York 1690000
Portland 562500
San Francisco 1210000
dtype: float64
</pre>
<p>You can add two Series together, which returns a union of the two Series with the addition occurring on the shared index values. Index labels present in only one of the Series will produce a NULL/NaN (not a number) value.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[[</span><span class="s1">'Chicago'</span><span class="p">,</span> <span class="s1">'New York'</span><span class="p">,</span> <span class="s1">'Portland'</span><span class="p">]])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[[</span><span class="s1">'Austin'</span><span class="p">,</span> <span class="s1">'New York'</span><span class="p">]])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[[</span><span class="s1">'Chicago'</span><span class="p">,</span> <span class="s1">'New York'</span><span class="p">,</span> <span class="s1">'Portland'</span><span class="p">]]</span> <span class="o">+</span> <span class="n">cities</span><span class="p">[[</span><span class="s1">'Austin'</span><span class="p">,</span> <span class="s1">'New York'</span><span class="p">]])</span>
</code></pre></div>
<pre>
Chicago 1400
New York 1300
Portland 750
dtype: float64
Austin 750
New York 1300
dtype: float64
Austin NaN
Chicago NaN
New York 2600
Portland NaN
dtype: float64
</pre>
<p>Notice that because Austin, Chicago, and Portland were not found in both Series, they were returned with NULL/NaN values.</p>
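<p>If those NaNs are unwanted, <code>Series.add</code> accepts a <code>fill_value</code> parameter that treats labels missing from one side as a given value before adding. A sketch using the same city figures:</p>

```python
import pandas as pd

s1 = pd.Series({'Chicago': 1400, 'New York': 1300, 'Portland': 750})
s2 = pd.Series({'Austin': 750, 'New York': 1300})

# labels missing from one side are treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```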
<p>NULL checking can be performed with <code>isnull</code> and <code>notnull</code>.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># returns a boolean series indicating which values aren't NULL</span>
<span class="n">cities</span><span class="o">.</span><span class="n">notnull</span><span class="p">()</span>
</code></pre></div>
<pre>
Austin True
Boston False
Chicago True
New York True
Portland True
San Francisco True
dtype: bool
</pre>
<div class="highlight"><pre><span></span><code><span class="c1"># use boolean logic to grab the NULL cities</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="o">.</span><span class="n">isnull</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[</span><span class="n">cities</span><span class="o">.</span><span class="n">isnull</span><span class="p">()])</span>
</code></pre></div>
<pre>
Austin False
Boston True
Chicago False
New York False
Portland False
San Francisco False
dtype: bool
Boston NaN
dtype: float64
</pre>
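<p>Once you've found the NULLs, two common follow-ups are dropping them or filling them in, via <code>dropna</code> and <code>fillna</code>. A brief sketch:</p>

```python
import pandas as pd

cities = pd.Series({'Austin': 750.0, 'Boston': None, 'Chicago': 1400.0})

print(cities.dropna())      # remove the NULL entries
print(cities.fillna(0))     # or replace them with a default value
```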
<h2>DataFrame</h2>
<p>A DataFrame is a tabular data structure comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can also think of a DataFrame as a group of Series objects that share an index (the column names).
For the rest of the tutorial, we'll be primarily working with DataFrames.</p>
<h3>Reading Data</h3>
<p>To create a DataFrame out of common Python data structures, we can pass a dictionary of lists to the DataFrame constructor.</p>
<p>Using the <code>columns</code> parameter allows us to tell the constructor how we'd like the columns ordered. By default, the DataFrame constructor will order the columns alphabetically (though this isn't the case when reading from a file - more on that next).</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'year'</span><span class="p">:</span> <span class="p">[</span><span class="mi">2010</span><span class="p">,</span> <span class="mi">2011</span><span class="p">,</span> <span class="mi">2012</span><span class="p">,</span> <span class="mi">2011</span><span class="p">,</span> <span class="mi">2012</span><span class="p">,</span> <span class="mi">2010</span><span class="p">,</span> <span class="mi">2011</span><span class="p">,</span> <span class="mi">2012</span><span class="p">],</span>
<span class="s1">'team'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'Bears'</span><span class="p">,</span> <span class="s1">'Bears'</span><span class="p">,</span> <span class="s1">'Bears'</span><span class="p">,</span> <span class="s1">'Packers'</span><span class="p">,</span> <span class="s1">'Packers'</span><span class="p">,</span> <span class="s1">'Lions'</span><span class="p">,</span> <span class="s1">'Lions'</span><span class="p">,</span> <span class="s1">'Lions'</span><span class="p">],</span>
<span class="s1">'wins'</span><span class="p">:</span> <span class="p">[</span><span class="mi">11</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
<span class="s1">'losses'</span><span class="p">:</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">12</span><span class="p">]}</span>
<span class="n">football</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">'year'</span><span class="p">,</span> <span class="s1">'team'</span><span class="p">,</span> <span class="s1">'wins'</span><span class="p">,</span> <span class="s1">'losses'</span><span class="p">])</span>
<span class="n">football</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>team</th>
<th>wins</th>
<th>losses</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2010</td>
<td>Bears</td>
<td>11</td>
<td>5</td>
</tr>
<tr>
<th>1</th>
<td>2011</td>
<td>Bears</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<th>2</th>
<td>2012</td>
<td>Bears</td>
<td>10</td>
<td>6</td>
</tr>
<tr>
<th>3</th>
<td>2011</td>
<td>Packers</td>
<td>15</td>
<td>1</td>
</tr>
<tr>
<th>4</th>
<td>2012</td>
<td>Packers</td>
<td>11</td>
<td>5</td>
</tr>
<tr>
<th>5</th>
<td>2010</td>
<td>Lions</td>
<td>6</td>
<td>10</td>
</tr>
<tr>
<th>6</th>
<td>2011</td>
<td>Lions</td>
<td>10</td>
<td>6</td>
</tr>
<tr>
<th>7</th>
<td>2012</td>
<td>Lions</td>
<td>4</td>
<td>12</td>
</tr>
</tbody>
</table>
<p>Much more often, you'll have a dataset you want to read into a DataFrame. Let's go through several common ways of doing so.</p>
<h4>CSV</h4>
<p>Reading a CSV is as simple as calling the <code>read_csv</code> function. By default, it expects the column separator to be a comma, but you can change that using the <code>sep</code> parameter.</p>
<div class="highlight"><pre><span></span><code>%cd ~/Dropbox/tutorials/pandas/
</code></pre></div>
<pre>
/Users/gjreda/Dropbox (Personal)/tutorials/pandas
</pre>
<div class="highlight"><pre><span></span><code><span class="c1"># Source: baseball-reference.com/players/r/riverma01.shtml</span>
!head -n <span class="m">5</span> mariano-rivera.csv
</code></pre></div>
<pre>
Year,Age,Tm,Lg,W,L,W-L%,ERA,G,GS,GF,CG,SHO,SV,IP,H,R,ER,HR,BB,IBB,SO,HBP,BK,WP,BF,ERA+,WHIP,H/9,HR/9,BB/9,SO/9,SO/BB,Awards
1995,25,NYY,AL,5,3,.625,5.51,19,10,2,0,0,0,67.0,71,43,41,11,30,0,51,2,1,0,301,84,1.507,9.5,1.5,4.0,6.9,1.70,
1996,26,NYY,AL,8,3,.727,2.09,61,0,14,0,0,5,107.2,73,25,25,1,34,3,130,2,0,1,425,240,0.994,6.1,0.1,2.8,10.9,3.82,CYA-3MVP-12
1997,27,NYY,AL,6,4,.600,1.88,66,0,56,0,0,43,71.2,65,17,15,5,20,6,68,0,0,2,301,239,1.186,8.2,0.6,2.5,8.5,3.40,ASMVP-25
1998,28,NYY,AL,3,0,1.000,1.91,54,0,49,0,0,36,61.1,48,13,13,3,17,1,36,1,0,0,246,233,1.060,7.0,0.4,2.5,5.3,2.12,
</pre>
<div class="highlight"><pre><span></span><code><span class="n">from_csv</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'mariano-rivera.csv'</span><span class="p">)</span>
<span class="n">from_csv</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Year</th>
<th>Age</th>
<th>Tm</th>
<th>Lg</th>
<th>W</th>
<th>L</th>
<th>W-L%</th>
<th>ERA</th>
<th>G</th>
<th>GS</th>
<th>GF</th>
<th>CG</th>
<th>SHO</th>
<th>SV</th>
<th>IP</th>
<th>H</th>
<th>R</th>
<th>ER</th>
<th>HR</th>
<th>BB</th>
<th>IBB</th>
<th>SO</th>
<th>HBP</th>
<th>BK</th>
<th>WP</th>
<th>BF</th>
<th>ERA+</th>
<th>WHIP</th>
<th>H/9</th>
<th>HR/9</th>
<th>BB/9</th>
<th>SO/9</th>
<th>SO/BB</th>
<th>Awards</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1995</td>
<td>25</td>
<td>NYY</td>
<td>AL</td>
<td>5</td>
<td>3</td>
<td>0.625</td>
<td>5.51</td>
<td>19</td>
<td>10</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>67.0</td>
<td>71</td>
<td>43</td>
<td>41</td>
<td>11</td>
<td>30</td>
<td>0</td>
<td>51</td>
<td>2</td>
<td>1</td>
<td>0</td>
<td>301</td>
<td>84</td>
<td>1.507</td>
<td>9.5</td>
<td>1.5</td>
<td>4.0</td>
<td>6.9</td>
<td>1.70</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>1996</td>
<td>26</td>
<td>NYY</td>
<td>AL</td>
<td>8</td>
<td>3</td>
<td>0.727</td>
<td>2.09</td>
<td>61</td>
<td>0</td>
<td>14</td>
<td>0</td>
<td>0</td>
<td>5</td>
<td>107.2</td>
<td>73</td>
<td>25</td>
<td>25</td>
<td>1</td>
<td>34</td>
<td>3</td>
<td>130</td>
<td>2</td>
<td>0</td>
<td>1</td>
<td>425</td>
<td>240</td>
<td>0.994</td>
<td>6.1</td>
<td>0.1</td>
<td>2.8</td>
<td>10.9</td>
<td>3.82</td>
<td>CYA-3MVP-12</td>
</tr>
<tr>
<th>2</th>
<td>1997</td>
<td>27</td>
<td>NYY</td>
<td>AL</td>
<td>6</td>
<td>4</td>
<td>0.600</td>
<td>1.88</td>
<td>66</td>
<td>0</td>
<td>56</td>
<td>0</td>
<td>0</td>
<td>43</td>
<td>71.2</td>
<td>65</td>
<td>17</td>
<td>15</td>
<td>5</td>
<td>20</td>
<td>6</td>
<td>68</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>301</td>
<td>239</td>
<td>1.186</td>
<td>8.2</td>
<td>0.6</td>
<td>2.5</td>
<td>8.5</td>
<td>3.40</td>
<td>ASMVP-25</td>
</tr>
<tr>
<th>3</th>
<td>1998</td>
<td>28</td>
<td>NYY</td>
<td>AL</td>
<td>3</td>
<td>0</td>
<td>1.000</td>
<td>1.91</td>
<td>54</td>
<td>0</td>
<td>49</td>
<td>0</td>
<td>0</td>
<td>36</td>
<td>61.1</td>
<td>48</td>
<td>13</td>
<td>13</td>
<td>3</td>
<td>17</td>
<td>1</td>
<td>36</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>246</td>
<td>233</td>
<td>1.060</td>
<td>7.0</td>
<td>0.4</td>
<td>2.5</td>
<td>5.3</td>
<td>2.12</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>1999</td>
<td>29</td>
<td>NYY</td>
<td>AL</td>
<td>4</td>
<td>3</td>
<td>0.571</td>
<td>1.83</td>
<td>66</td>
<td>0</td>
<td>63</td>
<td>0</td>
<td>0</td>
<td>45</td>
<td>69.0</td>
<td>43</td>
<td>15</td>
<td>14</td>
<td>2</td>
<td>18</td>
<td>3</td>
<td>52</td>
<td>3</td>
<td>1</td>
<td>2</td>
<td>268</td>
<td>257</td>
<td>0.884</td>
<td>5.6</td>
<td>0.3</td>
<td>2.3</td>
<td>6.8</td>
<td>2.89</td>
<td>ASCYA-3MVP-14</td>
</tr>
</tbody>
</table>
<p>Our file had headers, which the function inferred upon reading in the file. Had we wanted to be more explicit, we could have passed <code>header=None</code> to the function along with a list of column names to use:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Source: pro-football-reference.com/players/M/MannPe00/touchdowns/passing/2012/</span>
!head -n <span class="m">5</span> peyton-passing-TDs-2012.csv
</code></pre></div>
<pre>
1,1,2012-09-09,DEN,,PIT,W 31-19,3,71,Demaryius Thomas,Trail 7-13,Lead 14-13*
2,1,2012-09-09,DEN,,PIT,W 31-19,4,1,Jacob Tamme,Trail 14-19,Lead 22-19*
3,2,2012-09-17,DEN,@,ATL,L 21-27,2,17,Demaryius Thomas,Trail 0-20,Trail 7-20
4,3,2012-09-23,DEN,,HOU,L 25-31,4,38,Brandon Stokley,Trail 11-31,Trail 18-31
5,3,2012-09-23,DEN,,HOU,L 25-31,4,6,Joel Dreessen,Trail 18-31,Trail 25-31
</pre>
<div class="highlight"><pre><span></span><code><span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'num'</span><span class="p">,</span> <span class="s1">'game'</span><span class="p">,</span> <span class="s1">'date'</span><span class="p">,</span> <span class="s1">'team'</span><span class="p">,</span> <span class="s1">'home_away'</span><span class="p">,</span> <span class="s1">'opponent'</span><span class="p">,</span>
<span class="s1">'result'</span><span class="p">,</span> <span class="s1">'quarter'</span><span class="p">,</span> <span class="s1">'distance'</span><span class="p">,</span> <span class="s1">'receiver'</span><span class="p">,</span> <span class="s1">'score_before'</span><span class="p">,</span>
<span class="s1">'score_after'</span><span class="p">]</span>
<span class="n">no_headers</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'peyton-passing-TDs-2012.csv'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">','</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
<span class="n">names</span><span class="o">=</span><span class="n">cols</span><span class="p">)</span>
<span class="n">no_headers</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>num</th>
<th>game</th>
<th>date</th>
<th>team</th>
<th>home_away</th>
<th>opponent</th>
<th>result</th>
<th>quarter</th>
<th>distance</th>
<th>receiver</th>
<th>score_before</th>
<th>score_after</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>1</td>
<td>2012-09-09</td>
<td>DEN</td>
<td>NaN</td>
<td>PIT</td>
<td>W 31-19</td>
<td>3</td>
<td>71</td>
<td>Demaryius Thomas</td>
<td>Trail 7-13</td>
<td>Lead 14-13*</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>1</td>
<td>2012-09-09</td>
<td>DEN</td>
<td>NaN</td>
<td>PIT</td>
<td>W 31-19</td>
<td>4</td>
<td>1</td>
<td>Jacob Tamme</td>
<td>Trail 14-19</td>
<td>Lead 22-19*</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>2</td>
<td>2012-09-17</td>
<td>DEN</td>
<td>@</td>
<td>ATL</td>
<td>L 21-27</td>
<td>2</td>
<td>17</td>
<td>Demaryius Thomas</td>
<td>Trail 0-20</td>
<td>Trail 7-20</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>3</td>
<td>2012-09-23</td>
<td>DEN</td>
<td>NaN</td>
<td>HOU</td>
<td>L 25-31</td>
<td>4</td>
<td>38</td>
<td>Brandon Stokley</td>
<td>Trail 11-31</td>
<td>Trail 18-31</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>3</td>
<td>2012-09-23</td>
<td>DEN</td>
<td>NaN</td>
<td>HOU</td>
<td>L 25-31</td>
<td>4</td>
<td>6</td>
<td>Joel Dreessen</td>
<td>Trail 18-31</td>
<td>Trail 25-31</td>
</tr>
</tbody>
</table>
<p>pandas' various reader functions have many parameters allowing you to do things like skipping lines of the file, parsing dates, or specifying how to handle NA/NULL datapoints.</p>
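<p>For example, here's a quick sketch of a few of those parameters in action. The CSV contents below are made up for illustration; the parameter names are real <code>read_csv</code> options.</p>

```python
import io
import pandas as pd

# A small CSV with a junk line before the header, a date column,
# and a custom missing-value marker. (Invented data for illustration.)
raw = io.StringIO(
    "# exported 2013-01-19\n"
    "tow_date,make,color\n"
    "01/19/2013,FORD,RED\n"
    "01/20/2013,HONDA,N/A\n"
)

df = pd.read_csv(
    raw,
    skiprows=1,                # skip the comment line above the header
    parse_dates=["tow_date"],  # parse the column into datetime64 values
    na_values=["N/A"],         # treat "N/A" as missing (NaN)
)

print(df.dtypes)
print(df["color"].isnull().sum())
```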
<p>There's also a set of writer functions for writing to a variety of formats (CSVs, HTML tables, JSON). They function exactly as you'd expect and are typically called <code>to_format</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">my_dataframe</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">'path_to_file.csv'</span><span class="p">)</span>
</code></pre></div>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html">Take a look at the IO documentation</a> to familiarize yourself with file reading/writing functionality.</p>
<h4>Excel</h4>
<p>Know who hates VBA? Me. I bet you do, too. Thankfully, pandas allows you to read and write Excel files, so you can easily read from Excel, write your code in Python, and then write back out to Excel - no need for VBA.</p>
<p>Reading Excel files requires the <a href="https://pypi.org/project/xlrd/">xlrd</a> library. You can install it via pip (<code>pip install xlrd</code>).</p>
<p>Let's first write a DataFrame to Excel.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># this is the DataFrame we created from a dictionary earlier</span>
<span class="n">football</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>team</th>
<th>wins</th>
<th>losses</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2010</td>
<td>Bears</td>
<td>11</td>
<td>5</td>
</tr>
<tr>
<th>1</th>
<td>2011</td>
<td>Bears</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<th>2</th>
<td>2012</td>
<td>Bears</td>
<td>10</td>
<td>6</td>
</tr>
<tr>
<th>3</th>
<td>2011</td>
<td>Packers</td>
<td>15</td>
<td>1</td>
</tr>
<tr>
<th>4</th>
<td>2012</td>
<td>Packers</td>
<td>11</td>
<td>5</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="c1"># since our index on the football DataFrame is meaningless, let's not write it</span>
<span class="n">football</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s1">'football.xlsx'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>!ls -l *.xlsx
</code></pre></div>
<pre>
-rw-r--r--@ 1 gjreda staff 5665 Mar 26 17:58 football.xlsx
</pre>
<div class="highlight"><pre><span></span><code><span class="c1"># delete the DataFrame</span>
<span class="k">del</span> <span class="n">football</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="c1"># read from Excel</span>
<span class="n">football</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">'football.xlsx'</span><span class="p">,</span> <span class="s1">'Sheet1'</span><span class="p">)</span>
<span class="n">football</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>team</th>
<th>wins</th>
<th>losses</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2010</td>
<td>Bears</td>
<td>11</td>
<td>5</td>
</tr>
<tr>
<th>1</th>
<td>2011</td>
<td>Bears</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<th>2</th>
<td>2012</td>
<td>Bears</td>
<td>10</td>
<td>6</td>
</tr>
<tr>
<th>3</th>
<td>2011</td>
<td>Packers</td>
<td>15</td>
<td>1</td>
</tr>
<tr>
<th>4</th>
<td>2012</td>
<td>Packers</td>
<td>11</td>
<td>5</td>
</tr>
<tr>
<th>5</th>
<td>2010</td>
<td>Lions</td>
<td>6</td>
<td>10</td>
</tr>
<tr>
<th>6</th>
<td>2011</td>
<td>Lions</td>
<td>10</td>
<td>6</td>
</tr>
<tr>
<th>7</th>
<td>2012</td>
<td>Lions</td>
<td>4</td>
<td>12</td>
</tr>
</tbody>
</table>
<h4>Database</h4>
<p>pandas also has some support for <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html">reading/writing DataFrames directly from/to a database</a>. You'll typically just need to pass a connection object or SQLAlchemy engine to the <code>read_sql</code> or <code>to_sql</code> functions within the <code>pandas.io</code> module.</p>
<p>Note that <code>to_sql</code> executes as a series of INSERT INTO statements and thus trades speed for simplicity. If you're writing a large DataFrame to a database, it might be quicker to write the DataFrame to a CSV and load it directly using the database's bulk import facility.</p>
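<p>As a small sketch of the round trip, here's <code>to_sql</code> and <code>read_sql</code> against a throwaway in-memory SQLite database (the table name and rows are invented, not the towed-vehicles data used below):</p>

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

df = pd.DataFrame({
    "make": ["FORD", "HONDA", "FORD"],
    "color": ["RED", "GRN", "BLK"],
})

# to_sql issues INSERT statements row by row - fine for small frames
df.to_sql("towed", conn, index=False)

fords = pd.read_sql("SELECT * FROM towed WHERE make = 'FORD';", conn)
print(len(fords))
```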
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pandas.io</span> <span class="kn">import</span> <span class="n">sql</span>
<span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s1">'/Users/gjreda/Dropbox/gregreda.com/_code/towed'</span><span class="p">)</span>
<span class="n">query</span> <span class="o">=</span> <span class="s2">"SELECT * FROM towed WHERE make = 'FORD';"</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">sql</span><span class="o">.</span><span class="n">read_sql</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">con</span><span class="o">=</span><span class="n">conn</span><span class="p">)</span>
<span class="n">results</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>tow_date</th>
<th>make</th>
<th>style</th>
<th>model</th>
<th>color</th>
<th>plate</th>
<th>state</th>
<th>towed_address</th>
<th>phone</th>
<th>inventory</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>01/19/2013</td>
<td>FORD</td>
<td>LL</td>
<td></td>
<td>RED</td>
<td>N786361</td>
<td>IL</td>
<td>400 E. Lower Wacker</td>
<td>(312) 744-7550</td>
<td>877040</td>
</tr>
<tr>
<th>1</th>
<td>01/19/2013</td>
<td>FORD</td>
<td>4D</td>
<td></td>
<td>GRN</td>
<td>L307211</td>
<td>IL</td>
<td>701 N. Sacramento</td>
<td>(773) 265-7605</td>
<td>6738005</td>
</tr>
<tr>
<th>2</th>
<td>01/19/2013</td>
<td>FORD</td>
<td>4D</td>
<td></td>
<td>GRY</td>
<td>P576738</td>
<td>IL</td>
<td>701 N. Sacramento</td>
<td>(773) 265-7605</td>
<td>6738001</td>
</tr>
<tr>
<th>3</th>
<td>01/19/2013</td>
<td>FORD</td>
<td>LL</td>
<td></td>
<td>BLK</td>
<td>N155890</td>
<td>IL</td>
<td>10300 S. Doty</td>
<td>(773) 568-8495</td>
<td>2699210</td>
</tr>
<tr>
<th>4</th>
<td>01/19/2013</td>
<td>FORD</td>
<td>LL</td>
<td></td>
<td>TAN</td>
<td>H953638</td>
<td>IL</td>
<td>10300 S. Doty</td>
<td>(773) 568-8495</td>
<td>2699209</td>
</tr>
</tbody>
</table>
<h4>Clipboard</h4>
<p>While the results of a query can be read directly into a DataFrame, I prefer to read the results directly from the clipboard. I'm often tweaking queries in my SQL client (<a href="http://www.sequelpro.com/">Sequel Pro</a>), so I would rather see the results before reading them into pandas. Once I'm confident I have the data I want, I'll read it into a DataFrame.</p>
<p>This works just as well with any type of delimited data you've copied to your clipboard. The function does a good job of inferring the delimiter, but you can also use the <code>sep</code> parameter to be explicit.</p>
<p><a href="https://www.baseball-reference.com/players/a/aaronha01.shtml">Hank Aaron</a></p>
<p><img src="http://i.imgur.com/xiySJ2e.png" alt="hank-aaron-stats-screenshot"></p>
<div class="highlight"><pre><span></span><code><span class="n">hank</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_clipboard</span><span class="p">()</span>
<span class="n">hank</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Year</th>
<th>Age</th>
<th>Tm</th>
<th>Lg</th>
<th>G</th>
<th>PA</th>
<th>AB</th>
<th>R</th>
<th>H</th>
<th>2B</th>
<th>3B</th>
<th>HR</th>
<th>RBI</th>
<th>SB</th>
<th>CS</th>
<th>BB</th>
<th>SO</th>
<th>BA</th>
<th>OBP</th>
<th>SLG</th>
<th>OPS</th>
<th>OPS+</th>
<th>TB</th>
<th>GDP</th>
<th>HBP</th>
<th>SH</th>
<th>SF</th>
<th>IBB</th>
<th>Pos</th>
<th>Awards</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1954</td>
<td>20</td>
<td>MLN</td>
<td>NL</td>
<td>122</td>
<td>509</td>
<td>468</td>
<td>58</td>
<td>131</td>
<td>27</td>
<td>6</td>
<td>13</td>
<td>69</td>
<td>2</td>
<td>2</td>
<td>28</td>
<td>39</td>
<td>0.280</td>
<td>0.322</td>
<td>0.447</td>
<td>0.769</td>
<td>104</td>
<td>209</td>
<td>13</td>
<td>3</td>
<td>6</td>
<td>4</td>
<td>NaN</td>
<td>*79</td>
<td>RoY-4</td>
</tr>
<tr>
<th>1</th>
<td>1955 ★</td>
<td>21</td>
<td>MLN</td>
<td>NL</td>
<td>153</td>
<td>665</td>
<td>602</td>
<td>105</td>
<td>189</td>
<td>37</td>
<td>9</td>
<td>27</td>
<td>106</td>
<td>3</td>
<td>1</td>
<td>49</td>
<td>61</td>
<td>0.314</td>
<td>0.366</td>
<td>0.540</td>
<td>0.906</td>
<td>141</td>
<td>325</td>
<td>20</td>
<td>3</td>
<td>7</td>
<td>4</td>
<td>5</td>
<td>*974</td>
<td>AS,MVP-9</td>
</tr>
<tr>
<th>2</th>
<td>1956 ★</td>
<td>22</td>
<td>MLN</td>
<td>NL</td>
<td>153</td>
<td>660</td>
<td>609</td>
<td>106</td>
<td>200</td>
<td>34</td>
<td>14</td>
<td>26</td>
<td>92</td>
<td>2</td>
<td>4</td>
<td>37</td>
<td>54</td>
<td>0.328</td>
<td>0.365</td>
<td>0.558</td>
<td>0.923</td>
<td>151</td>
<td>340</td>
<td>21</td>
<td>2</td>
<td>5</td>
<td>7</td>
<td>6</td>
<td>*9</td>
<td>AS,MVP-3</td>
</tr>
<tr>
<th>3</th>
<td>1957 ★</td>
<td>23</td>
<td>MLN</td>
<td>NL</td>
<td>151</td>
<td>675</td>
<td>615</td>
<td>118</td>
<td>198</td>
<td>27</td>
<td>6</td>
<td>44</td>
<td>132</td>
<td>1</td>
<td>1</td>
<td>57</td>
<td>58</td>
<td>0.322</td>
<td>0.378</td>
<td>0.600</td>
<td>0.978</td>
<td>166</td>
<td>369</td>
<td>13</td>
<td>0</td>
<td>0</td>
<td>3</td>
<td>15</td>
<td>*98</td>
<td>AS,MVP-1</td>
</tr>
<tr>
<th>4</th>
<td>1958 ★</td>
<td>24</td>
<td>MLN</td>
<td>NL</td>
<td>153</td>
<td>664</td>
<td>601</td>
<td>109</td>
<td>196</td>
<td>34</td>
<td>4</td>
<td>30</td>
<td>95</td>
<td>4</td>
<td>1</td>
<td>59</td>
<td>49</td>
<td>0.326</td>
<td>0.386</td>
<td>0.546</td>
<td>0.931</td>
<td>152</td>
<td>328</td>
<td>21</td>
<td>1</td>
<td>0</td>
<td>3</td>
<td>16</td>
<td>*98</td>
<td>AS,MVP-3,GG</td>
</tr>
</tbody>
</table>
<h4>URL</h4>
<p>With <code>read_table</code>, we can also read directly from a URL.</p>
<p>Let's use the <a href="https://raw.githubusercontent.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv">best sandwiches data</a> that I <a href="/2013/05/06/more-web-scraping-with-python/">wrote about scraping</a> a while back.</p>
<div class="highlight"><pre><span></span><code><span class="n">url</span> <span class="o">=</span> <span class="s1">'https://raw.github.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv'</span>
<span class="c1"># fetch the text from the URL and read it into a DataFrame</span>
<span class="n">from_url</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">)</span>
<span class="n">from_url</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>rank</th>
<th>sandwich</th>
<th>restaurant</th>
<th>description</th>
<th>price</th>
<th>address</th>
<th>city</th>
<th>phone</th>
<th>website</th>
<th>full_address</th>
<th>formatted_address</th>
<th>lat</th>
<th>lng</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>BLT</td>
<td>Old Oak Tap</td>
<td>The B is applewood smoked&mdash;nice and snapp...</td>
<td>$10</td>
<td>2109 W. Chicago Ave.</td>
<td>Chicago</td>
<td>773-772-0406</td>
<td>theoldoaktap.com</td>
<td>2109 W. Chicago Ave., Chicago</td>
<td>2109 West Chicago Avenue, Chicago, IL 60622, USA</td>
<td>41.895734</td>
<td>-87.679960</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>Fried Bologna</td>
<td>Au Cheval</td>
<td>Thought your bologna-eating days had retired w...</td>
<td>$9</td>
<td>800 W. Randolph St.</td>
<td>Chicago</td>
<td>312-929-4580</td>
<td>aucheval.tumblr.com</td>
<td>800 W. Randolph St., Chicago</td>
<td>800 West Randolph Street, Chicago, IL 60607, USA</td>
<td>41.884672</td>
<td>-87.647754</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>Woodland Mushroom</td>
<td>Xoco</td>
<td>Leave it to Rick Bayless and crew to come up w...</td>
<td>$9.50.</td>
<td>445 N. Clark St.</td>
<td>Chicago</td>
<td>312-334-3688</td>
<td>rickbayless.com</td>
<td>445 N. Clark St., Chicago</td>
<td>445 North Clark Street, Chicago, IL 60654, USA</td>
<td>41.890602</td>
<td>-87.630925</td>
</tr>
</tbody>
</table>
<p><em>Move onto the next section, which covers <a href="/2013/10/26/working-with-pandas-dataframes/">working with DataFrames</a>.</em></p>New theme for Pelican2013-10-24T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-10-24:/2013/10/24/new-theme-for-pelican/<p>I spent some time last weekend making minor changes to this site. Specifically:</p>
<ol>
<li>New typography - headers are <a href="http://www.google.com/fonts/specimen/Droid+Serif">Droid Serif</a>, while everything else is <a href="http://www.google.com/fonts/specimen/Droid+Sans">Droid Sans</a>. Fonts are also a bit bigger (I think it's easier to read).</li>
<li>Added <a href="http://jakevdp.github.io/">Jake Vanderplas'</a> <a href="https://github.com/getpelican/pelican-plugins/tree/master/liquid_tags">liquid tags plugin</a> for Pelican, which allows for easy embedding …</li></ol><p>I spent some time last weekend making minor changes to this site. Specifically:</p>
<ol>
<li>New typography - headers are <a href="http://www.google.com/fonts/specimen/Droid+Serif">Droid Serif</a>, while everything else is <a href="http://www.google.com/fonts/specimen/Droid+Sans">Droid Sans</a>. Fonts are also a bit bigger (I think it's easier to read).</li>
<li>Added <a href="http://jakevdp.github.io/">Jake Vanderplas'</a> <a href="https://github.com/getpelican/pelican-plugins/tree/master/liquid_tags">liquid tags plugin</a> for Pelican, which allows for easy embedding of <a href="http://ipython.org/notebook.html">IPython Notebooks</a>.</li>
</ol>
<p>I plan on continuing to tweak things over time, but I'm pretty happy with the way it looks right now.</p>
<p><a href="https://github.com/gjreda/notebook-simpler">Check it out on GitHub</a> and feel free to use it. It's mobile-friendly, too.</p>Useful Unix commands for data science2013-07-15T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-07-15:/2013/07/15/unix-commands-for-data-science/<p>Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column.</p>
<p>How would you do it?</p>
<p>Writing a script in <a href="http://www.python.org/">python</a>/<a href="http://www.ruby-lang.org/">ruby</a>/<a href="http://www.perl.org/">perl</a>/whatever would probably take a …</p><p>Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column.</p>
<p>How would you do it?</p>
<p>Writing a script in <a href="http://www.python.org/">python</a>/<a href="http://www.ruby-lang.org/">ruby</a>/<a href="http://www.perl.org/">perl</a>/whatever would probably take a few minutes and then even more time for the script to actually complete. A <a href="http://en.wikipedia.org/wiki/Database">database</a> and <a href="http://en.wikipedia.org/wiki/SQL">SQL</a> would be fairly quick, but then you'd have to load the data, which is kind of a pain.</p>
<p>Thankfully, the <a href="http://en.wikipedia.org/wiki/List_of_Unix_utilities">Unix utilities</a> exist and they're awesome.</p>
<p>To get the sum of a column in a huge text file, we can easily use <a href="http://en.wikipedia.org/wiki/AWK_(programming_language)">awk</a>. And we won't even need to read the entire file into memory.</p>
<p>Let's assume our data, which we'll call <em>data.csv</em>, is pipe-delimited ( | ), and we want to sum the fourth column of the file.</p>
<div class="highlight"><pre><span></span><code><span class="n">cat</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">awk</span> <span class="o">-</span><span class="n">F</span> <span class="s2">"|"</span> <span class="s1">'{ sum += $4 } END { printf "</span><span class="si">%.2f</span><span class="se">\n</span><span class="s1">", sum }'</span>
</code></pre></div>
<p>The above line says:</p>
<ol>
<li>Use the <a href="http://en.wikipedia.org/wiki/Cat_(Unix)">cat</a> command to stream (print) the contents of the file to <a href="http://en.wikipedia.org/wiki/Standard_streams">stdout</a>.</li>
<li><a href="http://en.wikipedia.org/wiki/Pipeline_(Unix)">Pipe</a> the streaming contents from our cat command to the next one - awk. </li>
<li>
<p>With <a href="http://en.wikipedia.org/wiki/AWK_(programming_language)">awk</a>:</p>
<ol>
<li>Set the field separator to the pipe character (-F "|"). Note that this has nothing to do with our pipeline in point #2.</li>
<li>Increment the variable <em>sum</em> with the value in the fourth column ($4). Since we used a pipeline in point #2, the contents of each line are being streamed to this statement.</li>
<li>Once the stream is done, print out the value of <em>sum</em>, using <a href="http://www.gnu.org/software/gawk/manual/html_node/Printf-Examples.html">printf</a> to format the value with two decimal places.</li>
</ol>
</li>
</ol>
<p>It took less than two minutes to run on the entire file - much faster than other options and written in a lot fewer characters.</p>
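<p>For comparison, a rough Python equivalent of that streaming sum looks like this - it also reads one line at a time, but takes noticeably more typing (the sample data here is invented):</p>

```python
import csv
import io

def sum_column(lines, col=3, delimiter="|"):
    """Stream delimited rows and total one column, one line at a time."""
    total = 0.0
    for row in csv.reader(lines, delimiter=delimiter):
        total += float(row[col])
    return total

# In practice `lines` would be an open file handle; a StringIO stands in here.
data = io.StringIO("a|b|c|1.5\nd|e|f|2.5\n")
print(f"{sum_column(data):.2f}")  # 4.00
```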
<p><a href="http://www.hilarymason.com">Hilary Mason</a> and <a href="http://www.columbia.edu/~chw2/">Chris Wiggins</a> wrote over at the <a href="http://www.dataists.com/">dataists blog</a> about the importance of any <a href="http://www.dataists.com/2010/09/a-taxonomy-of-data-science/">data scientist being familiar with the command line</a>, and I couldn't agree with them more. The command line is essential to my daily work, so I wanted to share some of the commands I've found most useful.</p>
<p>For those who are a bit newer to the command line than the rest of this post assumes, Hilary previously wrote a <a href="http://www.hilarymason.com/articles/intro-to-the-linux-command-line/">nice introduction to it</a>.</p>
<h3>Other commands</h3>
<h4><a href="http://en.wikipedia.org/wiki/Head_(Unix)">head</a> & <a href="http://en.wikipedia.org/wiki/Tail_(Unix)">tail</a></h4>
<p>Sometimes you just need to inspect the structure of a huge file. That's where <a href="http://en.wikipedia.org/wiki/Head_(Unix)">head</a> and <a href="http://en.wikipedia.org/wiki/Tail_(Unix)">tail</a> come in. Head prints the first ten lines of a file, while tail prints the last ten lines. Optionally, you can pass the <em>-n N</em> parameter to change the number of lines displayed.</p>
<div class="highlight"><pre><span></span><code><span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># time|away|score|home</span>
<span class="c1"># 20:00||0-0|Jump Ball won by Virginia Commonwealt.</span>
<span class="c1"># 19:45||0-0|Juvonte Reddic Turnover.</span>
<span class="n">tail</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># 0:14|Trey Davis Turnover.|62-71|</span>
<span class="c1"># 0:14||62-71|Briante Weber Steal.</span>
<span class="c1"># 0:00|End Game|End Game|End Game</span>
</code></pre></div>
<h4><a href="http://en.wikipedia.org/wiki/Wc_(Unix)">wc</a> (word count)</h4>
<p>By default, <a href="http://en.wikipedia.org/wiki/Wc_(Unix)">wc</a> will quickly tell you how many lines, words, and bytes are in a file. If you're looking for just the line count, you can pass the <em>-l</em> parameter in.</p>
<p>I use it most often to verify record counts between files or database tables throughout an analysis.</p>
<div class="highlight"><pre><span></span><code><span class="n">wc</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># 377 1697 17129 data.csv</span>
<span class="n">wc</span> <span class="o">-</span><span class="n">l</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># 377 data.csv</span>
</code></pre></div>
<h4><a href="http://en.wikipedia.org/wiki/Grep">grep</a></h4>
<p><a href="http://en.wikipedia.org/wiki/Grep">Grep</a> allows you to search through plain text files using <a href="http://en.wikipedia.org/wiki/Regular_expression">regular expressions</a>. I tend to <a href="http://regex.info/blog/2006-09-15/247">avoid regular expressions</a> when possible, but still find grep to be invaluable when searching through log files for a particular event.</p>
<p>There's an assortment of extra parameters you can use with grep, but the ones I tend to use the most are <em>-i</em> (ignore case), <em>-r</em> (recursively search directories), <em>-B N</em> (N lines before), <em>-A N</em> (N lines after).</p>
<div class="highlight"><pre><span></span><code><span class="n">grep</span> <span class="o">-</span><span class="n">i</span> <span class="o">-</span><span class="n">B</span> <span class="mi">1</span> <span class="o">-</span><span class="n">A</span> <span class="mi">1</span> <span class="n">steal</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># 17:25||2-4|Darius Theus Turnover.</span>
<span class="c1"># 17:25|Terrell Vinson Steal.|2-4|</span>
<span class="c1"># 17:18|Chaz Williams made Layup. Assisted by Terrell Vinson.|4-4|</span>
</code></pre></div>
<h4><a href="http://en.wikipedia.org/wiki/Sed">sed</a></h4>
<p><a href="http://en.wikipedia.org/wiki/Sed">Sed</a> is similar to <a href="http://en.wikipedia.org/wiki/Grep">grep</a> and <a href="http://en.wikipedia.org/wiki/AWK_(programming_language)">awk</a> in many ways, however I find that I most often use it when needing to do some find and replace magic on a very large file. The usual occurrence is when I've received a CSV file that was generated on Windows and my <a href="http://stackoverflow.com/questions/6373888/converting-newline-formatting-from-mac-to-windows">Mac isn't able to handle the carriage return</a> properly.</p>
<div class="highlight"><pre><span></span><code><span class="n">grep</span> <span class="n">Block</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span>
<span class="c1"># 16:43||5-4|Juvonte Reddic Block.</span>
<span class="c1"># 15:37||7-6|Troy Daniels Block.</span>
<span class="c1"># 14:05|Raphiael Putney Block.|11-8|</span>
<span class="n">sed</span> <span class="o">-</span><span class="n">e</span> <span class="s1">'s/Block/Rejection/g'</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">></span> <span class="n">rejection</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># replace all instances of the word 'Block' in data.csv with 'Rejection'</span>
<span class="c1"># stream the results to a new file called rejection.csv</span>
<span class="n">grep</span> <span class="n">Rejection</span> <span class="n">rejection</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span>
<span class="c1"># 16:43||5-4|Juvonte Reddic Rejection.</span>
<span class="c1"># 15:37||7-6|Troy Daniels Rejection.</span>
<span class="c1"># 14:05|Raphiael Putney Rejection.|11-8|</span>
</code></pre></div>
<h4><a href="http://en.wikipedia.org/wiki/Sort_(Unix)">sort</a> & <a href="http://en.wikipedia.org/wiki/Uniq">uniq</a></h4>
<p><a href="http://en.wikipedia.org/wiki/Sort_(Unix)">Sort</a> outputs the lines of a file in order based on a column key using the <em>-k</em> parameter. If a key isn't specified, sort treats each entire line as the key and sorts lexicographically. The <em>-n</em> and <em>-r</em> parameters allow you to sort numerically and in reverse order, respectively.</p>
<div class="highlight"><pre><span></span><code><span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># time|away|score|home</span>
<span class="c1"># 20:00||0-0|Jump Ball won by Virginia Commonwealt.</span>
<span class="c1"># 19:45||0-0|Juvonte Reddic Turnover.</span>
<span class="c1"># 19:45|Chaz Williams Steal.|0-0|</span>
<span class="c1"># 19:39|Sampson Carter missed Layup.|0-0|</span>
<span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">sort</span>
<span class="c1"># 19:39|Sampson Carter missed Layup.|0-0|</span>
<span class="c1"># 19:45|Chaz Williams Steal.|0-0|</span>
<span class="c1"># 19:45||0-0|Juvonte Reddic Turnover.</span>
<span class="c1"># 20:00||0-0|Jump Ball won by Virginia Commonwealt.</span>
<span class="c1"># time|away|score|home</span>
<span class="c1"># columns separated by '|', sort on column 2 (-k2), case insensitive (-f)</span>
<span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">sort</span> <span class="o">-</span><span class="n">f</span> <span class="o">-</span><span class="n">t</span><span class="s1">'|'</span> <span class="o">-</span><span class="n">k2</span>
<span class="c1"># time|away|score|home</span>
<span class="c1"># 19:45|Chaz Williams Steal.|0-0|</span>
<span class="c1"># 19:39|Sampson Carter missed Layup.|0-0|</span>
<span class="c1"># 20:00||0-0|Jump Ball won by Virginia Commonwealt.</span>
<span class="c1"># 19:45||0-0|Juvonte Reddic Turnover.</span>
</code></pre></div>
<p>Sometimes you want to check for duplicate records in a large text file - that's when <a href="http://en.wikipedia.org/wiki/Uniq">uniq</a> comes in handy. By using the <em>-c</em> parameter, uniq will output the count of occurrences along with the line. You can also use the <em>-d</em> and <em>-u</em> parameters to output only duplicated or unique records. Note that uniq only compares <em>adjacent</em> lines, which is why each example below sorts the file first.</p>
<div class="highlight"><pre><span></span><code><span class="n">sort</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">uniq</span> <span class="o">-</span><span class="n">c</span> <span class="o">|</span> <span class="n">sort</span> <span class="o">-</span><span class="n">nr</span> <span class="o">|</span> <span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">7</span>
<span class="c1"># 2 8:47|Maxie Esho missed Layup.|46-54|</span>
<span class="c1"># 2 8:47|Maxie Esho Offensive Rebound.|46-54|</span>
<span class="c1"># 2 7:38|Trey Davis missed Free Throw.|51-56|</span>
<span class="c1"># 2 12:12||16-11|Rob Brandenberg missed Free Throw.</span>
<span class="c1"># 1 time|away|score|home</span>
<span class="c1"># 1 9:51||20-11|Juvonte Reddic Steal.</span>
<span class="n">sort</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">uniq</span> <span class="o">-</span><span class="n">d</span>
<span class="c1"># 12:12||16-11|Rob Brandenberg missed Free Throw.</span>
<span class="c1"># 7:38|Trey Davis missed Free Throw.|51-56|</span>
<span class="c1"># 8:47|Maxie Esho Offensive Rebound.|46-54|</span>
<span class="c1"># 8:47|Maxie Esho missed Layup.|46-54|</span>
<span class="n">sort</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">uniq</span> <span class="o">-</span><span class="n">u</span> <span class="o">|</span> <span class="n">wc</span> <span class="o">-</span><span class="n">l</span>
<span class="c1"># 369 (unique lines)</span>
</code></pre></div>
<p>While it's sometimes difficult to remember all of the parameters for the Unix commands, getting familiar with them has been beneficial to my productivity and allowed me to avoid many headaches when working with large text files.</p>
<p>Hopefully you'll find them as useful as I have.</p>
<p><em>Additional Resources:</em></p>
<ul>
<li><a href="http://www.drbunsen.org/explorations-in-unix/">Explorations in Unix</a> by <a href="http://www.drbunsen.org/">Seth Brown</a></li>
<li><a href="http://www.ceri.memphis.edu/computer/docs/unix/bshell.htm">An Introduction to the Unix Shell</a></li>
<li><a href="http://blog.comsysto.com/2013/04/25/data-analysis-with-the-unix-shell/">Data Analysis with the Unix Shell</a></li>
<li><a href="http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html">7 Command Line Tools for Data Science</a></li>
</ul>How random is JavaScript's Math.random()?2013-06-30T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-06-30:/2013/06/30/testing-javascripts-random-function/<p>A few weeks back, I was talking with my friend <a href="http://mollybierman.tumblr.com">Molly</a> about personal domains and realized that her nickname, Bierface, was available. The exchange basically went like this:</p>
<blockquote>
<p>Me: I should buy bierface.com and just put up a ridiculous picture of you.</p>
<p>Molly: You would have to do a …</p></blockquote><p>A few weeks back, I was talking with my friend <a href="http://mollybierman.tumblr.com">Molly</a> about personal domains and realized that her nickname, Bierface, was available. The exchange basically went like this:</p>
<blockquote>
<p>Me: I should buy bierface.com and just put up a ridiculous picture of you.</p>
<p>Molly: You would have to do a slideshow. Too many gems.</p>
</blockquote>
<p><a href="http://www.bierface.com">So I did just that</a>, switching randomly between 14 pictures every time the page is loaded. The laughs from it have been well worth the $10 spent purchasing the domain.</p>
<p>She started to question the randomness though. Here's what the code that loads each image looks like:</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">a</span> <span class="na">href</span><span class="o">=</span><span class="s">"http://mollybierman.tumblr.com"</span><span class="p">></span>
<span class="p"><</span><span class="nt">img</span> <span class="na">id</span><span class="o">=</span><span class="s">"bierface"</span> <span class="na">src</span><span class="o">=</span><span class="s">""</span><span class="p">/></span>
<span class="p"></</span><span class="nt">a</span><span class="p">></span>
<span class="p"><</span><span class="nt">script</span> <span class="na">type</span><span class="o">=</span><span class="s">"text/javascript"</span><span class="p">></span><span class="w"></span>
<span class="w"> </span><span class="kd">var</span><span class="w"> </span><span class="nx">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Math</span><span class="p">.</span><span class="nx">ceil</span><span class="p">(</span><span class="nb">Math</span><span class="p">.</span><span class="nx">random</span><span class="p">()</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">14</span><span class="p">);</span><span class="w"></span>
<span class="w"> </span><span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="s2">"bierface"</span><span class="p">).</span><span class="nx">src</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"./pictures/"</span><span class="o">+</span><span class="nx">n</span><span class="o">+</span><span class="s2">".jpg"</span><span class="p">;</span><span class="w"></span>
<span class="p"></</span><span class="nt">script</span><span class="p">></span>
</code></pre></div>
<p>All we're doing is creating an empty <em><code><img></code></em> element, and then changing the src attribute of that element via JavaScript. The first line of JavaScript uses a combination of <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/ceil">Math.ceil()</a> and <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/random">Math.random()</a> to get a random integer between 1 and 14 (the images are named 1.jpg through 14.jpg). The second line uses that integer to create a file path and tells our <em><code><img></code></em> element to use that path as the src for the image.</p>
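<p><em>An aside that isn't in the original post:</em> because <em>Math.random()</em> returns a float in the half-open interval [0, 1), <em>Math.ceil()</em> can in principle yield 0 (when the generator returns exactly 0). The floor-plus-one idiom avoids that edge case entirely. A rough Python sketch of both mappings:</p>

```python
import math
import random

def ceil_style(k=14):
    # Mirrors Math.ceil(Math.random() * 14): random() is in [0, 1),
    # so this lands in 1..k except in the rare case random() == 0.0,
    # which maps to 0.
    return math.ceil(random.random() * k)

def floor_style(k=14):
    # floor(x * k) + 1 always lands in 1..k, one value per equal-width slice.
    return math.floor(random.random() * k) + 1

print(all(1 <= floor_style() <= 14 for _ in range(1000)))  # -> True
```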
<p>Since the image is loaded by your web client, this seemed like a great opportunity to learn the very basics of grabbing client-side data - I could write some code to repeatedly get which image was loaded in order to determine how random <em>Math.random()</em> truly is.</p>
<h4>The Setup</h4>
<p>We're going to be using <a href="http://jeanphix.me/Ghost.py/">Ghost.py</a> to simulate a <a href="http://en.wikipedia.org/wiki/WebKit">WebKit</a> client. Ghost.py requires <a href="http://en.wikipedia.org/wiki/PyQt">PyQt</a> or <a href="http://en.wikipedia.org/wiki/PySide">PySide</a>, so you'll want to grab one of those, too. I'm on OS X 10.8.2 and using PySide 1.1.0 for Python 2.7, which you can get <a href="http://qt-project.org/wiki/PySide_Binaries_MacOSX">here</a>. You'll also need to grab Qt 4.7, which you can find <a href="http://packages.kitware.com/item/3736">here</a>.</p>
<h4>The Code</h4>
<p>With a little Python and Ghost.py, we can simulate a browser, allowing us to execute JavaScript telling us which image was loaded. We can also use <a href="http://matplotlib.org/">matplotlib</a> to plot the distribution.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">ghost</span> <span class="kn">import</span> <span class="n">Ghost</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">ghost</span> <span class="o">=</span> <span class="n">Ghost</span><span class="p">()</span>
<span class="c1"># JavaScript to grab the src file name for the image loaded</span>
<span class="n">js</span> <span class="o">=</span> <span class="s2">"document.getElementById('bierface').src.substr(33);"</span>
<span class="c1"># initialize zero'd out dictionary to hold image counts</span>
<span class="c1"># this way we can draw a nice, empty, base plot before we have actual values</span>
<span class="n">counts</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">15</span><span class="p">),</span> <span class="p">[</span><span class="mi">0</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">15</span><span class="p">)]))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">xrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1002</span><span class="p">):</span>
    <span class="c1"># draw empty plot on first pass</span>
    <span class="k">if</span> <span class="n">i</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">page</span><span class="p">,</span> <span class="n">page_resources</span> <span class="o">=</span> <span class="n">ghost</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">'http://www.bierface.com'</span><span class="p">)</span>
        <span class="n">image</span> <span class="o">=</span> <span class="n">ghost</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">js</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">image</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">image</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'.'</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># grab just the image number</span>
        <span class="n">counts</span><span class="p">[</span><span class="n">image</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">15</span><span class="p">),</span> <span class="n">counts</span><span class="o">.</span><span class="n">values</span><span class="p">(),</span> <span class="n">align</span><span class="o">=</span><span class="s1">'center'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">15</span><span class="p">),</span> <span class="n">counts</span><span class="o">.</span><span class="n">keys</span><span class="p">())</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Image'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'# of times shown'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'n = </span><span class="si">{0}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">zfill</span><span class="p">(</span><span class="mi">4</span><span class="p">)))</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">()</span>
    <span class="n">path</span> <span class="o">=</span> <span class="s1">'</span><span class="si">{0}</span><span class="s1">/images/</span><span class="si">{1}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">getcwd</span><span class="p">(),</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">.</span><span class="n">zfill</span><span class="p">(</span><span class="mi">4</span><span class="p">))</span>
    <span class="n">save</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">close</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">os</span><span class="o">.</span><span class="n">system</span><span class="p">(</span><span class="s1">'ffmpeg -f image2 -r 10 -i images/</span><span class="si">%04d</span><span class="s1">.png -s 480x360 random.avi'</span><span class="p">)</span>
</code></pre></div>
<p>Let's walk through the code:</p>
<ol>
<li>Load our libraries and create an instance of the Ghost class.</li>
<li>Store the JavaScript we'll need to execute in order to grab the image file name into a variable named <em>js</em>.</li>
<li>The comment should explain this one - we're initializing a zero'd out dictionary called <em>counts</em> so that our first plot doesn't have an x-axis with just one value. Each key of the dictionary will correspond to one of the images.</li>
<li>The <a href="http://docs.python.org/2/reference/compound_stmts.html#for">for loop</a> is used to run 1,000 simulations. My <a href="http://docs.python.org/2/library/functions.html#xrange">xrange</a> usage is a little wacky because I'm using it to title and name the plots - typically <em>xrange</em> starts with 0 and runs up <em>until</em> the number specified (e.g. 1,001 will be the last loop, not 1,002).</li>
<li>
<p>This is the section that grabs which image was loaded by simulating a WebKit client with Ghost.py. This section does not get run on the first pass since we want to start with an empty plot.</p>
<ol>
<li>Load bierface.com into our <em>page</em> variable.</li>
<li>Execute the JavaScript mentioned in #2 and store it in the <em>image</em> variable. Remember that this will be a string.</li>
<li>Split the <em>image</em> string so that we just grab the image number loaded.</li>
<li>Update our dictionary of counts for the given <em>image</em>.</li>
</ol>
</li>
<li>
<p>Here we're using <a href="http://matplotlib.org/api/pyplot_api.html">matplotlib.pyplot</a> to draw a bar chart. Thanks to <a href="http://www.jesshamrick.com/">Jess Hamrick</a> for some awesome <a href="http://www.jesshamrick.com/2012/09/03/saving-figures-from-pyplot/">plot-saving boilerplate</a>, which I'm using behind the <em>save</em> function.</p>
</li>
<li>Finally, use <a href="https://en.wikipedia.org/wiki/FFmpeg">ffmpeg</a> to stitch our plots together into a video.</li>
</ol>
<h4>The Results</h4>
<p><em>Math.random()</em> is pretty random (though #7 is the clear loser in the video below). It's easy to think it's not when working with a small sample size, but it's clear the numbers start to even out as the sample size increases.</p>
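<p><em>A possible extension, not part of the original experiment:</em> one way to put a number on "pretty random" is a chi-squared goodness-of-fit test against a uniform distribution over the 14 images. This sketch uses only Python's standard library, with simulated draws standing in for the scraped counts:</p>

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
n_draws, k = 1000, 14
counts = [0] * k
for _ in range(n_draws):
    counts[random.randrange(k)] += 1  # stand-in for the scraped image counts

# Chi-squared statistic against a uniform expectation
expected = n_draws / k
chi2 = sum((c - expected) ** 2 / expected for c in counts)

# With 13 degrees of freedom, the 5% critical value is roughly 22.36;
# a statistic below that is consistent with a uniform generator.
print(round(chi2, 2))
```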
<p><center><iframe width="480" height="360" src="//www.youtube.com/embed/y-tRXCyBk4w" frameborder="0" allowfullscreen></iframe></center></p>Join vs Exists vs In (SQL)2013-06-03T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-06-03:/2013/06/03/join-vs-exists-vs-in/<p>Last weekend, I came across <a href="http://en.wikipedia.org/wiki/Jeff_Atwood">Jeff Atwood</a>'s excellent <a href="http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html">visual explanation of SQL joins</a> on Hacker News.</p>
<p>It reminded me of teaching SQL to the incoming batch of <a href="http://www.pwc.com/us/en/forensic-services/technology-solutions.jhtml">PwC FTS</a> associates a few years ago. Not many of them had prior programming experience, much less SQL exposure, so it was …</p><p>Last weekend, I came across <a href="http://en.wikipedia.org/wiki/Jeff_Atwood">Jeff Atwood</a>'s excellent <a href="http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html">visual explanation of SQL joins</a> on Hacker News.</p>
<p>It reminded me of teaching SQL to the incoming batch of <a href="http://www.pwc.com/us/en/forensic-services/technology-solutions.jhtml">PwC FTS</a> associates a few years ago. Not many of them had prior programming experience, much less SQL exposure, so it was a fun week that tested how well we instructors could teach the topic.</p>
<p>Most of them intuitively picked up on how the IN clause worked, but struggled with EXISTS and JOINs initially. An explanation that always seemed to help illustrate the concept was to show that often you can write the exact same query using an IN, EXISTS, or a JOIN.</p>
<p>As an example, let's assume the following two tables, which we'll call <em>tableA</em> and <em>tableB</em>.</p>
<div class="highlight"><pre><span></span><code><span class="n">id</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="n">title</span><span class="w"></span>
<span class="o">--</span><span class="w"> </span><span class="o">----</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="o">----</span><span class="w"></span>
<span class="mh">1</span><span class="w"> </span><span class="n">Kenny</span><span class="w"> </span><span class="mh">1</span><span class="w"> </span><span class="n">Analyst</span><span class="w"></span>
<span class="mh">1</span><span class="w"> </span><span class="n">Rob</span><span class="w"> </span><span class="mh">2</span><span class="w"> </span><span class="n">Sales</span><span class="w"></span>
<span class="mh">4</span><span class="w"> </span><span class="n">Molly</span><span class="w"> </span><span class="mh">3</span><span class="w"> </span><span class="n">Manager</span><span class="w"></span>
<span class="mh">1</span><span class="w"> </span><span class="n">Greg</span><span class="w"></span>
<span class="mh">2</span><span class="w"> </span><span class="n">John</span><span class="w"></span>
</code></pre></div>
<p>If we wanted to get everyone that's an Analyst, we could do the following:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tableA</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">tableA</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="k">SELECT</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">tableB</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Analyst'</span><span class="p">);</span><span class="w"></span>
<span class="c1">-- Returns 3 records - Kenny, Rob, and Greg</span>
</code></pre></div>
<p>For those not very familiar with SQL, this should be relatively easy to understand. We have written a <a href="http://en.wikipedia.org/wiki/Correlated_subquery">subquery</a> that will get the <em>id</em> for the <em>Analyst</em> title in <em>tableB</em>. Using IN, we can then grab all of the employees from <em>tableA</em> who have that title.</p>
<p>While IN statements are fairly intuitive, they're often less efficient than the same query written as a JOIN or EXISTS statement would be.</p>
<p>To produce the same results as above, we can do the following:</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- EXISTS</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tableA</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="p">(</span><span class="k">SELECT</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">tableB</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Analyst'</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">tableA</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="p">);</span><span class="w"></span>
<span class="c1">-- JOIN (INNER is the default when only JOIN is specified)</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tableA</span><span class="w"></span>
<span class="k">JOIN</span><span class="w"> </span><span class="n">tableB</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">tableA</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Analyst'</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>In many cases, EXISTS or JOIN will be more efficient (and faster) than an IN statement. Why?</p>
<p>When using an IN combined with a subquery, the database may process <em>the entire subquery</em> first, materialize its results, and then match the outer query against them based on the relationship specified for the IN.</p>
<p>With an EXISTS or a JOIN, the database can check the relationship row by row and stop as soon as it finds a match. That said, many modern query optimizers rewrite IN subqueries as semi-joins, in which case all three forms perform the same; when they do differ, EXISTS or JOIN usually wins unless the table in the subquery is <em>very</em> small, so it's worth benchmarking.</p>
<p>Furthermore, writing the query as a JOIN gives us some additional flexibility to easily return all of the employees if we'd like, or to even check for employees who do not have a title (orphan records).</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- Return employees and display their title</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tableA</span><span class="w"></span>
<span class="k">JOIN</span><span class="w"> </span><span class="n">tableB</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">tableA</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="p">;</span><span class="w"></span>
<span class="c1">-- 1 Kenny 1 Analyst</span>
<span class="c1">-- 1 Rob 1 Analyst</span>
<span class="c1">-- 1 Greg 1 Analyst</span>
<span class="c1">-- 2 John 2 Sales</span>
<span class="c1">-- Which employees do not have a title?</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tableA</span><span class="w"></span>
<span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">tableB</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">tableA</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="p">;</span><span class="w"></span>
<span class="c1">-- 4 Molly NULL NULL</span>
</code></pre></div>
<p>In the first query above, Molly falls out because she does not have a title. If we wanted her to appear in the record set, we could simply change the JOIN to a LEFT JOIN, and she would appear with NULL data from <em>tableB</em>.</p>
<p>If you have many IN statements littered throughout your code, you should compare the performance of these queries against an EXISTS or JOIN version of the same query - you'll likely see performance gains.</p>
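<p>As a quick sanity check - sketched here with Python's built-in <em>sqlite3</em>, not something from the original post - all three forms return the same employees for the sample tables above:</p>

```python
import sqlite3

# Build the post's sample tables in an in-memory database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER, name TEXT);
    CREATE TABLE tableB (id INTEGER, title TEXT);
    INSERT INTO tableA VALUES (1,'Kenny'),(1,'Rob'),(4,'Molly'),(1,'Greg'),(2,'John');
    INSERT INTO tableB VALUES (1,'Analyst'),(2,'Sales'),(3,'Manager');
""")

in_q = """SELECT name FROM tableA
          WHERE id IN (SELECT id FROM tableB WHERE title = 'Analyst')"""
exists_q = """SELECT name FROM tableA WHERE EXISTS
              (SELECT 1 FROM tableB WHERE title = 'Analyst' AND tableA.id = tableB.id)"""
join_q = """SELECT name FROM tableA JOIN tableB ON tableA.id = tableB.id
            WHERE tableB.title = 'Analyst'"""

# Sort each result set so row order doesn't matter in the comparison
results = [sorted(r[0] for r in conn.execute(q)) for q in (in_q, exists_q, join_q)]
print(results[0] == results[1] == results[2])  # -> True
```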
<p>I hope this illustrated some of the subtle differences between INs, EXISTS, and JOINs. Questions and feedback in the comments are appreciated.</p>More web scraping with Python (and a map)2013-04-29T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-04-29:/2013/04/29/more-web-scraping-with-python/<p><em>This is a follow-up to my <a href="/2013/03/03/web-scraping-101-with-python/" title="Web Scraping 101 with Python">previous post</a> about web scraping with Python</em>.</p>
<p>Previously, I wrote a basic intro to scraping data off of websites. Since I wanted to keep the intro fairly simple, I didn't cover storing the data. In this post, I'll cover the basics of writing the …</p><p><em>This is a follow-up to my <a href="/2013/03/03/web-scraping-101-with-python/" title="Web Scraping 101 with Python">previous post</a> about web scraping with Python</em>.</p>
<p>Previously, I wrote a basic intro to scraping data off of websites. Since I wanted to keep the intro fairly simple, I didn't cover storing the data. In this post, I'll cover the basics of writing the scraped data to a flat file and then take things a bit further from there.</p>
<p>Last time, we used the Chicago Reader's Best of 2011 list, but let's change it up a bit this time and scrape a different site. Why? Because scrapers break, so we might as well practice a little bit more by scraping something different.</p>
<p>In this post, we're going to use the data from <a href="http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/">Chicago Magazine's Best Sandwiches list</a> because ... who doesn't like sandwiches?</p>
<p>If you're new to scraping, it might be a good idea to go back and read my <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/" title="Web Scraping 101 with Python">previous post</a> as a refresher as I don't intend to be methodical in this one.</p>
<h4>Finding the data</h4>
<p>Looking at the list, it's clear everything is in a fairly standard format - each of the sandwiches in the list gets a <em><code><div class="sammy"></code></em> and each div holds a bit more information - specifically, the rank, sandwich name, location, and a URL to a detailed page about each entry.</p>
<p><img alt="Delicious sammy divs" src="/images/sammy-divs.png"></p>
<p>Clicking through a few of the sammy links, we can see that each sandwich also gets a detailed page that includes the sandwich's name, rank, description, and price along with the restaurant's name, address, phone number, and website. Each of these details is contained within <em><code><div id="sandwich"></code></em>, which will make them very easy to get at.</p>
<p><img alt="Sandwich details HTML" src="/images/sammy-details.png"></p>
<h4>Package choices</h4>
<p>We'll again be using the <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> and <a href="http://docs.python.org/2/library/urllib2.html">urllib2</a> libraries. Last time around, the choice of these two libraries generated some discussion in the post's <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/#disqus_thread">comments section</a>, on <a href="http://www.reddit.com/r/Python/comments/19lnth/web_scraping_101_with_python_and_beautifulsoup/">Reddit</a>, and <a href="https://news.ycombinator.com/item?id=5353347">Hacker News</a>.</p>
<p>The reason I use <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> is because I've found it to be very easy to use and understand, but YMMV. It's been around for a very long time (since 2004) and is certainly in the tool belt of many. That said, Python has a vast ecosystem with a lot of scraping libraries and ones like <a href="http://scrapy.org/">Scrapy</a> and <a href="http://pythonhosted.org/pyquery/">PyQuery</a> (amongst many others) are worth a look.</p>
<p><a href="http://docs.python.org/2/library/urllib2.html">Urllib2</a> is <em>one</em> of Python's URL handling packages within its standard library. Because the standard library has <a href="http://docs.python.org/2/library/urllib.html">urllib</a> and <a href="http://docs.python.org/2/library/urllib2.html">urllib2</a>, it has at times been confusing to know which is the one you're actually looking for. On top of that, <a href="http://kennethreitz.org/">Kenneth Reitz</a>'s fantastic <a href="http://docs.python-requests.org/en/latest/">requests</a> library exists, which really simplifies dealing with HTTP.</p>
<p>In this example, and in the previous one, I use urllib2 simply because I <em>only</em> need the <a href="http://docs.python.org/2/library/urllib2.html#urllib2.urlopen">urlopen</a> function. If this scraper were more complex, I would likely use <a href="http://docs.python-requests.org/en/latest/">requests</a>, but I think using a third party library is a bit of overkill for this very simple use case.</p>
<h4>Getting the data</h4>
<p>Our code this time is going to be very similar to what it was <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/" title="Web Scraping 101 with Python">in the previous post</a>, save for a few minor changes. Since the details pages have the data we're looking for, let's get all of their URLs from the initial list page, and then process each details page. We're also going to write all of the data to a tab-delimited file using Python's <a href="http://docs.python.org/2/library/csv.html">csv</a> module.</p>
<p>Last time around, we wrote our code as a set of functions, which I think helps the code's readability since it makes clear what each piece of the code is doing. This time around, we're just going to write a short script since this is really a one-off thing - once we have our data written to a CSV, we don't really have a use for this code anymore.</p>
<p>Our script will do the following:</p>
<ol>
<li>Load our libraries</li>
<li>Read our <em>base_url</em> into a BeautifulSoup object, grab all <em><code><div class="sammy"></code></em> sections, and then from each section, grab our sammy details URL.</li>
<li>Open up a file named <em>src-best-sandwiches.tsv</em> for writing. We'll write to this file using Python's <a href="http://docs.python.org/2/library/csv.html#csv.writer">csv.writer</a> object and separate the fields by a tab (\t). We'll also pass in a list of field names so that our file has a header row.</li>
<li>Loop through all of our sammy details URLs, grabbing each piece of information we're interested in, and writing that data to our <em>src-best-sandwiches.tsv</em> file.</li>
</ol>
<div class="highlight"><pre><code>from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

base_url = ("http://www.chicagomag.com/Chicago-Magazine/"
            "November-2012/Best-Sandwiches-Chicago/")

soup = BeautifulSoup(urlopen(base_url).read())
sammies = soup.find_all("div", "sammy")
sammy_urls = [div.a["href"] for div in sammies]

with open("data/src-best-sandwiches.tsv", "w") as f:
    fieldnames = ("rank", "sandwich", "restaurant", "description", "price",
                  "address", "phone", "website")
    output = csv.writer(f, delimiter="\t")
    output.writerow(fieldnames)

    for url in sammy_urls:
        url = url.replace("http://www.chicagomag.com", "")  # inconsistent URL
        page = urlopen("http://www.chicagomag.com{0}".format(url))
        soup = BeautifulSoup(page.read()).find("div", {"id": "sandwich"})
        rank = soup.find("div", {"id": "sandRank"}).encode_contents().strip()
        sandwich = soup.h1.encode_contents().strip().split("<br/>")[0]
        restaurant = soup.h1.span.encode_contents()
        description = soup.p.encode_contents().strip()
        addy = soup.find("p", "addy").em.encode_contents().split(",")[0].strip()
        price = addy.partition(" ")[0].strip()
        address = addy.partition(" ")[2].strip()
        phone = soup.find("p", "addy").em.encode_contents().split(",")[1].strip()
        if soup.find("p", "addy").em.a:
            website = soup.find("p", "addy").em.a.encode_contents()
        else:
            website = ""
        output.writerow([rank, sandwich, restaurant, description, price,
                         address, phone, website])

print "Done writing file"
</code></pre></div>
<p>While our scraper does a good job of getting all of the sandwiches and restaurants, a couple of restaurants had "multiple locations" listed as their address. If we need this data, we'll have to find another way to get it (like checking each restaurant's website and manually adding their locations to our dataset). We'll also need to manually fix some oddities that wound up in our data due to some inconsistent HTML on the other end (addresses and URLs winding up in the phone numbers column).</p>
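<p>A quick sanity check makes those manual fixes easier to find. As a rough sketch (in Python 3, with made-up sample rows mimicking our TSV), we can flag any row whose phone field doesn't look like a phone number:</p>

```python
import re

# Hypothetical sample rows mimicking the scraped TSV -- the second one
# shows the kind of oddity described above (a URL in the phone column).
rows = [
    {"restaurant": "Old Oak Tap", "phone": "773-772-0406"},
    {"restaurant": "Some Deli", "phone": "http://example.com"},
]

# Digits plus common separators, at least seven characters long.
PHONE_RE = re.compile(r"^[\d\-\.\(\) ]{7,}$")

flagged = [r["restaurant"] for r in rows if not PHONE_RE.match(r["phone"])]
print(flagged)
```

<p>Rows that get flagged can then be fixed by hand, which is usually faster than trying to handle every HTML inconsistency in the scraper itself.</p>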
<p>We're now left with a file full of data about Chicago Magazine's fifty best sandwiches. Sure, it's nice to have the data structured neatly in a flat file, but that's not all that interesting.</p>
<p>Collecting and hoarding data isn't of use to anyone - it's a waste of a potentially very valuable resource unless it's taken a step further. In some cases, this means a thorough analysis in search of patterns and trends, surfacing relationships we did not necessarily expect, and utilizing that information to better our decision-making. Data should be used to inform. In other cases, even a very basic visualization of the data can be of use.</p>
<p>Since we have addresses for each restaurant, this seems like a great time to make a map, but first, geocoding!</p>
<h4>Geocoding</h4>
<p>We're going to make our map using the <a href="https://developers.google.com/maps/">Google Maps API</a>, but in order to do so, we're first going to need to geocode our addresses to a set of lat/long points. Don't worry, I've taken the time to manually fill in the blanks on those "multiple locations" restaurants (you can grab the new file from my <a href="https://github.com/gjreda/best-sandwiches">GitHub repo</a> - it's called <em>best-sandwiches.tsv</em>).</p>
<p>To do so, we'll just write a short Python script which hits the <a href="https://developers.google.com/maps/documentation/geocoding/">Google Geocoding API</a>. Our script will do the following:</p>
<ol>
<li>Read our <em>best-sandwiches.tsv</em> file using the CSV module's <a href="http://docs.python.org/2/library/csv.html#csv.DictReader">DictReader</a> class, which reads each line of the file into its own dictionary object.</li>
<li>For each address, make a call to the Google Geocoding API, which will return a JSON response full of details about that address.</li>
<li>Using the <a href="http://docs.python.org/2/library/csv.html#csv.DictWriter">DictWriter</a> class, write a new file with our data along with the formatted address, lat, and long that we got back from the geocoder.</li>
</ol>
<div class="highlight"><pre><code>from urllib2 import urlopen
import csv
import json
from time import sleep

def geocode(address):
    url = ("http://maps.googleapis.com/maps/api/geocode/json?"
           "sensor=false&address={0}".format(address.replace(" ", "+")))
    return json.loads(urlopen(url).read())

with open("data/best-sandwiches.tsv", "r") as f:
    reader = csv.DictReader(f, delimiter="\t")

    with open("data/best-sandwiches-geocode.tsv", "w") as w:
        fields = ["rank", "sandwich", "restaurant", "description", "price",
                  "address", "city", "phone", "website", "full_address",
                  "formatted_address", "lat", "lng"]
        writer = csv.DictWriter(w, fieldnames=fields, delimiter="\t")
        writer.writeheader()

        for line in reader:
            print "Geocoding: {0}".format(line["full_address"])
            response = geocode(line["full_address"])
            if response["status"] == u"OK":
                results = response.get("results")[0]
                line["formatted_address"] = results["formatted_address"]
                line["lat"] = results["geometry"]["location"]["lat"]
                line["lng"] = results["geometry"]["location"]["lng"]
            else:
                line["formatted_address"] = ""
                line["lat"] = ""
                line["lng"] = ""
            sleep(1)
            writer.writerow(line)

print "Done writing file"
</code></pre></div>
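<p>One fragile spot worth noting: replacing spaces with "+" handles the common case, but addresses containing characters like "&", "#", or non-ASCII text would corrupt the query string. Python's standard library has a proper encoder for this; a sketch of the Python 3 version (the address here is made up):</p>

```python
from urllib.parse import quote_plus

# quote_plus encodes spaces as '+' and percent-encodes everything else
# that isn't safe to put in a query string.
address = "1120 W Grand Ave, Chicago & environs #2"
print(quote_plus(address))
```

<p>In Python 2, the equivalent function lives at <em>urllib.quote_plus</em>.</p>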
<p>Our file now has everything we need to make our map, which we're able to do with some basic HTML, CSS, JavaScript, and a little <a href="https://developers.google.com/fusiontables/">Google Fusion Tables</a> magic.</p>
<h4>Mapping</h4>
<p>While we could write another Python script to turn our flat file data into KML for mapping, it's much, much easier to use Google Fusion Tables. However, one important note with the Fusion Tables approach is that the underlying data must be within a <em>public</em> Fusion Table. Since our data is scraped from a publicly accessible website, that's not an issue here.</p>
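<p>For the curious, the KML route wouldn't be much code either. A minimal sketch (Python 3, standard library only, with a made-up in-memory row standing in for our geocoded TSV) that turns lat/lng rows into KML placemarks:</p>

```python
import csv
import io

KML_TEMPLATE = ('<?xml version="1.0" encoding="UTF-8"?>'
                '<kml xmlns="http://www.opengis.net/kml/2.2">'
                '<Document>{placemarks}</Document></kml>')

# Note KML wants coordinates in lng,lat order.
PLACEMARK = ("<Placemark><name>{name}</name>"
             "<Point><coordinates>{lng},{lat},0</coordinates></Point>"
             "</Placemark>")

# A tiny in-memory stand-in for best-sandwiches-geocode.tsv
tsv = "restaurant\tlat\tlng\nOld Oak Tap\t41.8902\t-87.6667\n"
reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")

placemarks = "".join(
    PLACEMARK.format(name=row["restaurant"], lat=row["lat"], lng=row["lng"])
    for row in reader
)
kml = KML_TEMPLATE.format(placemarks=placemarks)
print(kml)
```

<p>A real version would also escape any XML-special characters in the names, which is part of why Fusion Tables is the easier path here.</p>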
<p>If you don't see Fusion Tables as an option in your Google Drive account, you'll need to "connect more apps" and add it from there.</p>
<p><img alt="Adding Fusion Tables" src="/images/add-fusion-tables.png"></p>
<p>Once you've added the app, create a new Fusion Table from the delimited file on your computer (our <em>best-sandwiches-geocode.tsv</em>).</p>
<p><img alt="Loading to Fusion Tables" src="/images/loading-to-fusion-tables.png"></p>
<p>After you've finished your upload process, you should now have a spreadsheet-like table with the data in it. You'll notice that some of the columns are highlighted in yellow - this means that Fusion Tables recognizes them as locations. Our lat and lng columns should be all the way at the right - hover over the lat column header and select <em>change</em> from the drop-down. This should display a prompt showing that the column type is a two-column location made up of our lat and lng.</p>
<p>This is probably where I should point out that we could have also used Fusion Tables to geocode our data, but writing a script in Python seemed like more fun to me.</p>
<p><img alt="Lat Lng column type" src="/images/lat-lng-column-type.png"></p>
<p>Now that we have our data successfully in the Fusion Table, we can use a combination of HTML, CSS, some JavaScript, and the Fusion Tables API to serve up a map (you could also just click the map tab in Fusion Tables to see an embedded map of the data, but that's not as fun). We can even style the map with the <a href="http://gmaps-samples-v3.googlecode.com/svn/trunk/styledmaps/wizard/index.html">Google Maps Style Wizard</a>.</p>
<p>Head over to my <a href="https://github.com/gjreda/best-sandwiches">GitHub repo</a> to see the HTML, CSS, and JavaScript used to create the map (along with the rest of the code and data used throughout this post). I've done my best to comment the <em>best-sandwiches.html</em> file to indicate what each piece is doing. I've also used HTML5's geolocation capabilities so that fellow Chicagoans can easily see which sandwiches are near them (it displays pretty nicely on a mobile browser, too).</p>
<p>You can check out the awesome map we made <a href="http://www.gregreda.com/best-sandwiches.html">here</a>. Note that if you aren't in Chicago and let your browser know your location, you likely won't see any of the data - you'll have to scroll over to Chicago.</p>
<p>Hopefully you found this post fun and informative. Was there something I didn't cover? Let me know in the comments.</p>Write online about what you love2013-03-16T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-03-16:/2013/03/16/why-you-should-write-online/<p>The other week, I wrote a <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">very basic intro to web scraping with Python</a>. Some friends knew that I had experience scraping data and they wanted to learn, so I figured it would be a great opportunity to write something publicly and test how well I could explain it.</p>
<p>I'll …</p><p>The other week, I wrote a <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">very basic intro to web scraping with Python</a>. Some friends knew that I had experience scraping data and they wanted to learn, so I figured it would be a great opportunity to write something publicly and test how well I could explain it.</p>
<p>I'll be extending that scraping post a bit more in the future, but first I wanted to write about how the week and a half since I posted it has gone - or, explain why I think you should write online about what you love.</p>
<h4>How it started</h4>
<p>Shortly after finishing the post and feeling fairly satisfied with the way it turned out, I posted it and e-mailed a link to three of my friends - two of which were the ones who asked me to teach them. One of them, <a href="https://twitter.com/kennylong">Kenny</a>, immediately messaged me, read through the post, and said I should share it on Twitter. <a href="https://twitter.com/gjreda/status/308337050065727489">So I did</a>. To me, any feedback was better than no feedback, so I posted it <a href="http://www.reddit.com/r/Python/comments/19lnth/web_scraping_101_with_python_and_beautifulsoup/">in the Python subreddit</a> too.</p>
<p>I was really just hoping some people would see it and let me know what they thought.</p>
<p>Turns out, quite a few more people than my 92 followers (at the time) have seen it in the week and a half since. About 32,000 more.</p>
<p>It was pretty exhilarating to watch something I wrote be shared in real-time. Many of the <a href="https://twitter.com/siah/status/308719789524799488">data</a> <a href="https://twitter.com/treycausey/status/308342790180458496">nerds</a> that I admire and follow on Twitter were sharing it. Hell, even Philadelphia's Chief Data Officer <a href="https://twitter.com/mheadd/status/308576308810637312">shared it</a>. It was a ton of fun to watch and read both (fairly) positive and constructive comments about it on /r/python. It immediately made me want to write the post you're currently reading, which I started working on two days later.</p>
<h4>A week later</h4>
<p>Sitting around the following Sunday, making minor CSS tweaks to this site and finishing up the previously mentioned post, I decided to check my Google Analytics to see what the final traffic from /r/python and Twitter looked like. Surprisingly, the real-time section of Analytics showed 250+ on the site. What? How?!</p>
<p>That's when I realized it wound up at the top of <a href="https://news.ycombinator.com/item?id=5353347">Hacker News</a>. And then <a href="http://www.reddit.com/r/programming/comments/1a20lf/web_scraping_101_with_python/">/r/programming</a>. Traffic went through the roof.</p>
<p><img alt="Hacker News, /r/programming, and lots of Twitter sharing" src="/images/more-traffic-20130313.png"></p>
<p>And again, the comments were positive and constructive.</p>
<h4>Lesson learned</h4>
<p>And this leads me to why you should write about things you're passionate about online. When you're truly passionate about something, you spend a lot of time thinking and learning about it - you try to make it a part of your life. You try to become a reputable source on the topic (or even an expert). It can be something as broad as beer, personal finance, or film; or as niche as stand-up comedy, vegan baking, or <a href="http://en.wikipedia.org/wiki/Emo#Underground_popularity:_mid-1990s">90s midwest emo bands</a> (guilty). It doesn't matter what it is as long as <em>you</em> love it.</p>
<p>Like me, you're likely ecstatic when you find someone you're able to get nerdy with about something you love. You truly enjoy the topic and are always looking for ways to <em>learn more</em> and <em>teach others</em> about it (or just banter).</p>
<p>That's why you should share your knowledge about whatever the field may be. <em>It doesn't matter whether 10 or 10,000 people see what you've shared</em>. There are people who are interested but might not know where to start. And that's the best way to reinforce how well we know something - by teaching it to others. You'll be prompted with questions you hadn't thought about before, which will only further your own curiosity. You're forced to explain concepts in simple terms that anyone can understand - you become a better teacher and communicator. Sometimes, someone else crazily passionate about the same topic will even come along and teach you a thing or two.</p>
<p>We all have a thirst for knowledge in some form. The internet's a magnificent place to test our existing knowledge by teaching others and learning more throughout the process thanks to feedback from those with differing experiences.</p>
<p>Put your passions out there. More often than not, you'll be amazed at what you get back.</p>Web Scraping 101 with Python2013-03-03T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-03-03:/2013/03/03/web-scraping-101-with-python/<p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python …</a></li></ol><p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python</a>, showing how to use multithreading to speed things up.</li>
<li><a href="http://www.gregreda.com/2020/11/17/scraping-pages-behind-login-forms/">Scraping Pages Behind Login Forms</a>, which shows how to log into sites using Python.</li>
</ol>
<hr>
<p>Yea, yea, I know I said I was going to <a href="http://www.gregreda.com/2013/01/23/translating-sql-to-pandas-part1/">write more</a> on <a href="http://pandas.pydata.org">pandas</a>, but recently I've had a couple friends ask me if I could teach them how to scrape data. While they said they were able to find a ton of resources online, all assumed some level of knowledge already. Here's my attempt at assuming a very minimal knowledge of programming.</p>
<h4>Getting Setup</h4>
<p>We're going to be using Python 2.7, <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, and <a href="http://lxml.de/">lxml</a>. If you don't already have Python 2.7, you'll want to download the proper version for your OS <a href="http://python.org/download/releases/2.7.3/">here</a>.</p>
<p>To check if you have Python 2.7 on OSX, open up <a href="http://en.wikipedia.org/wiki/Terminal_(OS_X)">Terminal</a> and type <em>python --version</em>. You should see something like this:</p>
<p><img alt="What Terminal should look like" src="/images/python-version.png"></p>
<p>Next, you'll need to install <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>. If you're on OSX, you'll already have <a href="http://pypi.python.org/pypi/setuptools">setuptools</a> installed. Let's use it to install <a href="http://www.pip-installer.org/en/latest/">pip</a> and use that for package management instead.</p>
<p>In Terminal, run <em>sudo easy_install pip</em>. You'll be prompted for your password - type it in and let it run. Once that's done, again in Terminal, <em>sudo pip install BeautifulSoup4</em>. Finally, you'll need to <a href="http://lxml.de/installation.html">install lxml</a>.</p>
<h4>A few scraping rules</h4>
<p>Now that we have the packages we need, we can start scraping. But first, a couple of rules.</p>
<ol>
<li>You should check a site's terms and conditions before you scrape it. It's the site's data, and it likely has rules governing how you can use it.</li>
<li>Be nice - A computer will send web requests much quicker than a user can. Make sure you space out your requests a bit so that you don't hammer the site's server.</li>
<li>Scrapers break - Sites change their layout all the time. If that happens, be prepared to rewrite your code.</li>
<li>Web pages are inconsistent - There's sometimes some manual clean up that has to happen even after you've gotten your data.</li>
</ol>
<h4>Finding your data</h4>
<p>For this example, we're going to use the <a href="http://www.chicagoreader.com/chicago/best-of-chicago-2011/BestOf?oid=4100483">Chicago Reader's Best of 2011</a> list. Why? Because I think it's a great example of terrible data presentation on the web. Go ahead and browse it for a bit.</p>
<p>All you want to see is a list of the category, winner, and maybe the runners-up, right? But you have to continuously click link upon link, slowly navigating your way through the list.</p>
<p>Hopefully in your clicking you noticed the important thing though - all the pages are structured the same.</p>
<h4>Planning your code</h4>
<p>In looking at the <a href="http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228">Food and Drink</a> section of the Best of 2011 list, we see that each category is a link. Each of those links leads to a page with the winner, maybe some information about the winner (like an address), and the runners-up. It's probably a good idea to break these things into separate functions in our code.</p>
<p>To start, we need to take a look at the HTML that displays these categories. If you're in Chrome or Firefox, highlight "Readers' Poll Winners", right-click, and select Inspect Element.</p>
<p><img alt="Inspect Element" src="/images/inspect-element.png"></p>
<p>This opens up the browser's Developer Tools (in Firefox, you might now have to click the HTML button on the right side of the developer pane to fully show it). Now we'll be able to see the page layout. The browser has brought us directly to the piece of HTML that's used to display the "Readers' Poll Winners" <em><code><dt></code></em> element.</p>
<p><img alt="Inspect Element some more" src="/images/inspect-element-more.png"></p>
<p>This seems to be the area of code where there's going to be some consistency in how the category links are displayed. See that <em><code><dl class="boccat"></code></em> just above our "Readers' Poll Winners" line? If you mouse over that line in your browser's dev tools, you'll notice that it highlights the <strong>entire section</strong> of category links we want. And every category link is within a <em><code><dd></code></em> element. Perfect! Let's get all of them.</p>
<p><img alt="Inspect Element mouse over" src="/images/inspect-element-mouseover.png"></p>
<h4>Our first function - getting the category links</h4>
<p>Now that we know the <em><code><dl class="boccat"></code></em> section holds all the links we want, let's write some code to find that section, and then grab all of the links within the <em><code><dd></code></em> elements of that section.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">from</span> <span class="nn">urllib2</span> <span class="kn">import</span> <span class="n">urlopen</span>
<span class="n">BASE_URL</span> <span class="o">=</span> <span class="s2">"http://www.chicagoreader.com"</span>
<span class="k">def</span> <span class="nf">get_category_links</span><span class="p">(</span><span class="n">section_url</span><span class="p">):</span>
<span class="n">html</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">section_url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s2">"lxml"</span><span class="p">)</span>
<span class="n">boccat</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s2">"dl"</span><span class="p">,</span> <span class="s2">"boccat"</span><span class="p">)</span>
<span class="n">category_links</span> <span class="o">=</span> <span class="p">[</span><span class="n">BASE_URL</span> <span class="o">+</span> <span class="n">dd</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s2">"href"</span><span class="p">]</span> <span class="k">for</span> <span class="n">dd</span> <span class="ow">in</span> <span class="n">boccat</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s2">"dd"</span><span class="p">)]</span>
<span class="k">return</span> <span class="n">category_links</span>
</code></pre></div>
<p>Hopefully this code is relatively easy to follow, but if not, here's what we're doing:</p>
<ol>
<li>Loading the urlopen function from the urllib2 library into our local <a href="http://en.wikipedia.org/wiki/Namespace_(computer_science)">namespace</a>.</li>
<li>Loading the BeautifulSoup class from the bs4 (BeautifulSoup4) library into our local namespace.</li>
<li>Setting a variable named <em>BASE_URL</em> to "http://www.chicagoreader.com". We do this because the links used through the site are relative - meaning they do not include the base domain. In order to store our links properly, we need to concatenate the base domain with each relative link.</li>
<li>Define a function named <em>get_category_links</em>.<ol>
<li>The function requires a parameter of <em>section_url</em>. In this example, we're going to use the <a href="http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228">Food and Drink</a> section of the BOC list, however we could use a different section URL - for instance, the <a href="http://www.chicagoreader.com/chicago/best-of-chicago-2011-city-life/BestOf?oid=4106233">City Life</a> section's URL. We're able to create just one generic function because each section page is structured the same.</li>
<li>Open the section_url and read it into the <em>html</em> object.</li>
<li>Create an object called <em>soup</em> based on the BeautifulSoup class. The <em>soup</em> object is an <a href="http://en.wikipedia.org/wiki/Instance_(computer_science)">instance</a> of the BeautifulSoup class. It is initialized with the html object and parsed with <a href="http://lxml.de/">lxml</a>.</li>
<li>In our BeautifulSoup instance (which we called <em>soup</em>), find the <em><code><dl></code></em> element with a class of "boccat" and store that section in a variable called <em>boccat</em>.</li>
<li>This is a <a href="http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions">list comprehension</a>. For every <em><code><dd></code></em> element found within our <em>boccat</em> variable, we're getting the href of its <em><code><a></code></em> element (our category links) and concatenating on our <em>BASE_URL</em> to make it a complete link. All of these links are being stored in a list called <em>category_links</em>. You could also write this line with a <a href="http://docs.python.org/2/tutorial/controlflow.html#for-statements">for loop</a>, but I prefer a list comprehension here because of its simplicity.</li>
<li>Finally, our function returns the <em>category_links</em> list that we created on the previous line.</li>
</ol>
</li>
</ol>
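<p>If you'd like to sanity-check this extraction logic without a network connection or BeautifulSoup, here's a rough sketch of the same idea using only the standard library's html.parser module (the Python 3 spelling; in 2.7 it lived in the HTMLParser module). The snippet of markup below is made up to mimic the structure we saw in the dev tools, not copied from the Reader's site:</p>

```python
from html.parser import HTMLParser

BASE_URL = "http://www.chicagoreader.com"

class CategoryLinkParser(HTMLParser):
    """Collects hrefs from <a> tags inside <dd> elements of <dl class="boccat">."""
    def __init__(self):
        super().__init__()
        self.in_boccat = False
        self.in_dd = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "dl" and attrs.get("class") == "boccat":
            self.in_boccat = True
        elif self.in_boccat and tag == "dd":
            self.in_dd = True
        elif self.in_boccat and self.in_dd and tag == "a" and "href" in attrs:
            # Relative link, so prepend the base domain as before.
            self.links.append(BASE_URL + attrs["href"])

    def handle_endtag(self, tag):
        if tag == "dd":
            self.in_dd = False
        elif tag == "dl":
            self.in_boccat = False

# Made-up snippet mimicking the structure we found with Inspect Element.
snippet = """
<dl class="boccat">
  <dt>Readers' Poll Winners</dt>
  <dd><a href="/chicago/best-chef/BestOf?oid=4088191">Best Chef</a></dd>
  <dd><a href="/chicago/best-bang-for-your-buck/BestOf?oid=4088018">Best Bang for Your Buck</a></dd>
</dl>
"""
parser = CategoryLinkParser()
parser.feed(snippet)
```

<p>BeautifulSoup does the same traversal with far less bookkeeping, which is exactly why we're using it for the real scraper.</p>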
<h4>Our second function - getting the category, winner, and runners-up</h4>
<p>Now that we have our list of category links, we'd better start going through them to get our winners and runners-up. Let's figure out which elements contain the parts we care about.</p>
<p>If we look at the <a href="http://www.chicagoreader.com/chicago/best-chef/BestOf?oid=4088191">Best Chef</a> category, we can see that our category is in <em><code><h1 class="headline"></code></em>. Shortly after that, we find our winner and runners-up stored in <em><code><h2 class="boc1"></code></em> and <em><code><h2 class="boc2"></code></em>, respectively.</p>
<p><img alt="Finding our winners and runners-up" src="/images/winners-and-runners-up.png"></p>
<p>Let's write some code to get all of them.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_category_winner</span><span class="p">(</span><span class="n">category_url</span><span class="p">):</span>
<span class="n">html</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">category_url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s2">"lxml"</span><span class="p">)</span>
<span class="n">category</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s2">"h1"</span><span class="p">,</span> <span class="s2">"headline"</span><span class="p">)</span><span class="o">.</span><span class="n">string</span>
<span class="n">winner</span> <span class="o">=</span> <span class="p">[</span><span class="n">h2</span><span class="o">.</span><span class="n">string</span> <span class="k">for</span> <span class="n">h2</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s2">"h2"</span><span class="p">,</span> <span class="s2">"boc1"</span><span class="p">)]</span>
<span class="n">runners_up</span> <span class="o">=</span> <span class="p">[</span><span class="n">h2</span><span class="o">.</span><span class="n">string</span> <span class="k">for</span> <span class="n">h2</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s2">"h2"</span><span class="p">,</span> <span class="s2">"boc2"</span><span class="p">)]</span>
<span class="k">return</span> <span class="p">{</span><span class="s2">"category"</span><span class="p">:</span> <span class="n">category</span><span class="p">,</span>
<span class="s2">"category_url"</span><span class="p">:</span> <span class="n">category_url</span><span class="p">,</span>
<span class="s2">"winner"</span><span class="p">:</span> <span class="n">winner</span><span class="p">,</span>
<span class="s2">"runners_up"</span><span class="p">:</span> <span class="n">runners_up</span><span class="p">}</span>
</code></pre></div>
<p>It's very similar to our last function, but let's walk through it anyway.</p>
<ol>
<li>Define a function called <em>get_category_winner</em>. It requires a <em>category_url</em>.</li>
<li>Lines two and three are actually exactly the same as before - we'll come back to this in the next section.</li>
<li>Find the string within the <em><code><h1 class="headline"></code></em> element and store it in a variable named category.</li>
<li>Another list comprehension - store the string within every <em><code><h2 class="boc1"></code></em> element in a list called <em>winner</em>. But shouldn't there be only one winner? You'd think that, but some have multiple (e.g. <a href="http://www.chicagoreader.com/chicago/best-bang-for-your-buck/BestOf?oid=4088018">Best Bang for your Buck</a>).</li>
<li>Same as the previous line, but this time we're getting the runners-up.</li>
<li>Finally, return a <a href="http://docs.python.org/2/tutorial/datastructures.html#dictionaries">dictionary</a> with our data.</li>
</ol>
<h4>DRY - Don't Repeat Yourself</h4>
<p>As mentioned in the previous section, lines two and three of our second function mirror lines in our first function.</p>
<p>Imagine a scenario where we want to change the parser we're passing into our BeautifulSoup instance (in this case, lxml). With the way we've currently written our code, we'd have to make that change in two places. Now imagine you've written many more functions to scrape this data - maybe one to get addresses and another to get <a href="http://www.chicagoreader.com/chicago/best-new-food-truckfood/BestOf?oid=4101387">paragraphs of text about the winner</a> - you've likely repeated those same two lines of code in these functions and you now have to remember to make changes in four different places. That's not ideal.</p>
<p>A good principle in writing code is <a href="http://en.wikipedia.org/wiki/Don't_repeat_yourself">DRY - Don't Repeat Yourself</a>. When you notice that you've written the same lines of code a couple times throughout your script, it's probably a good idea to step back and think if there's a better way to structure that piece.</p>
<p>In our case, we're going to write another function to simply process a URL and return a BeautifulSoup instance. We can then call this function in our other functions instead of duplicating our logic.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">make_soup</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="n">html</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="k">return</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s2">"lxml"</span><span class="p">)</span>
</code></pre></div>
<p>We'll have to change our other functions a bit now, but it's pretty minor - we just need to replace our duplicated lines with the following:</p>
<div class="highlight"><pre><span></span><code><span class="n">soup</span> <span class="o">=</span> <span class="n">make_soup</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="c1"># where url is the url we're passing into the original function</span>
</code></pre></div>
<h4>Putting it all together</h4>
<p>Now that we have our main functions written, we can write a script to output the data however we'd like. Want to write to a CSV file? Check out Python's <a href="http://docs.python.org/2/library/csv.html#csv.DictWriter">DictWriter</a> class. Storing the data in a database? Check out the <a href="http://docs.python.org/2/library/sqlite3.html">sqlite3</a> or <a href="http://wiki.python.org/moin/DatabaseInterfaces">other various database libraries</a>. While both tasks are somewhat outside of my intentions for this post, if there's interest, let me know in the comments and I'd be happy to write more.</p>
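<p>As a quick sketch of the CSV route: assuming you've collected the dictionaries returned by get_category_winner into a list, DictWriter can write them out like this. The sample row below is illustrative (the winner name is invented), and the winner/runners-up lists are joined so each fits in a single CSV cell:</p>

```python
import csv
import io

# One illustrative row in the shape get_category_winner returns.
results = [{
    "category": "Best Chef",
    "category_url": "http://www.chicagoreader.com/chicago/best-chef/BestOf?oid=4088191",
    "winner": ["Sample Winner"],
    "runners_up": ["Runner-up A", "Runner-up B"],
}]

buffer = io.StringIO()  # swap in open("winners.csv", "w") to write a real file
writer = csv.DictWriter(buffer, fieldnames=["category", "category_url", "winner", "runners_up"])
writer.writeheader()
for row in results:
    # Flatten the lists so each value occupies a single cell.
    writer.writerow({**row,
                     "winner": "; ".join(row["winner"]),
                     "runners_up": "; ".join(row["runners_up"])})

output = buffer.getvalue()
```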
<p>Hopefully you found this post useful. I've put a final example script in <a href="http://bit.ly/13yd9ng">this gist</a>.</p>Translating SQL to Pandas, Part 12013-01-23T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-01-23:/2013/01/23/translating-sql-to-pandas-part1/<p><em>I wrote a three part pandas tutorial for SQL users that you can find <a href="http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/">here</a></em>.</p>
<p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p>For some reason, I've always found SQL to be a much more intuitive tool for exploring a tabular dataset than I have other languages (namely R and Python).</p>
<p>If you know SQL well, you can do a whole lot with it, and since data is often in a relational database anyway, it usually makes sense to stick with it. I find that my workflow often includes writing a lot of queries in SQL (using <a href="http://www.sequelpro.com/">Sequel Pro</a>) to get the data the way I want it, reading it into R (with <a href="http://www.rstudio.com/">RStudio</a>), and then maybe a bit more exploration, modeling, and visualization (with <a href="http://ggplot2.org/">ggplot2</a>).</p>
<p>Not too long ago though, I came across <a href="http://blog.wesmckinney.com/">Wes McKinney</a>'s <a href="http://pandas.pydata.org">pandas</a> package and my interest was immediately piqued. Pandas adds a bunch of functionality to Python, but most importantly, it allows for a DataFrame data structure - much like a database table or R's data frame.</p>
<p>Given the great things I've been reading about pandas lately, I wanted to make a conscious effort to play around with it. Instead of my typical workflow being a couple disjointed steps with SQL + R + (sometimes) Python, my thought is that it might make sense to have pandas work its way in and take over the R work. While I probably won't be able to completely give up R (too much ggplot2 love over here), I get bored if I'm not learning something new, so pandas it is.</p>
<p>I intend to document the process a bit - hopefully a couple posts illustrating the differences between SQL and pandas (and maybe some R too).</p>
<p>Throughout the rest of this post, we're going to be working with data from the <a href="https://data.cityofchicago.org">City of Chicago's open data</a> - specifically the <a href="https://data.cityofchicago.org/Transportation/Towed-Vehicles/ygr5-vcbg">Towed Vehicles data</a>.</p>
<h4>Loading the data</h4>
<h5>Using SQLite</h5>
<p>To be able to use SQL with this dataset, we'd first have to create the table. Using <a href="http://www.sqlite.org/">SQLite</a> syntax, we'd run the following:</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="p">(</span><span class="w"></span>
<span class="w"> </span><span class="n">tow_date</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">make</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">plate</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="k">state</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">towed_address</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">phone</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">inventory</span><span class="w"> </span><span class="nb">text</span><span class="w"></span>
<span class="p">);</span><span class="w"></span>
</code></pre></div>
<p>Because SQLite <a href="http://www.sqlite.org/datatype3.html">uses a very generic type system</a>, we don't get the strict data types that we would in most other databases (such as MySQL and PostgreSQL); therefore, all of our data is going to be stored as text. In other databases, we'd store tow_date as a date or datetime field.</p>
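<p>You can see this directly from Python's built-in sqlite3 module: even a value that's conceptually a date comes back from a text column as plain text. A small sketch with a made-up value:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE towed (tow_date text)")
# Conceptually a date, but SQLite's text affinity stores it as a plain string.
conn.execute("INSERT INTO towed VALUES ('01/08/2013')")
stored_type = conn.execute("SELECT typeof(tow_date) FROM towed").fetchone()[0]
```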
<p>Before we read the data into SQLite, we need to tell the database that the fields are separated by a comma. Then we can use the import command to read the file into our table.</p>
<div class="highlight"><pre><span></span><code><span class="p">.</span><span class="n">separator</span><span class="w"> </span><span class="s1">','</span><span class="w"></span>
<span class="p">.</span><span class="n">import</span><span class="w"> </span><span class="p">.</span><span class="o">/</span><span class="n">Towed_Vehicles</span><span class="p">.</span><span class="n">csv</span><span class="w"> </span><span class="n">towed</span><span class="w"></span>
</code></pre></div>
<p>Note that the downloaded CSV contains two header rows, so we'll need to delete those rows from our table.</p>
<div class="highlight"><pre><span></span><code><span class="k">DELETE</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">tow_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Tow Date'</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>We should have 5,068 records in our table now (note: the City of Chicago regularly updates this dataset, so you might get a different number).</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="p">;</span><span class="w"> </span><span class="c1">-- 5068</span>
</code></pre></div>
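<p>The .separator and .import commands above are specific to the sqlite3 command-line shell. If you'd rather do the load from Python, the standard library's csv and sqlite3 modules can replicate it. In this sketch, the two data rows are fabricated placeholders (not real city records), and a single header row stands in for the file's two, purely to keep things self-contained:</p>

```python
import csv
import io
import sqlite3

# Stand-in for Towed_Vehicles.csv: one header row plus two made-up records.
raw = io.StringIO(
    "tow_date,make,style,model,color,plate,state,towed_address,phone,inventory\n"
    "01/08/2013,FORD,4D,,BLK,0000000,IL,123 Example St,(000) 000-0000,1\n"
    "01/08/2013,KIA,4D,,RED,0000000,TX,123 Example St,(000) 000-0000,2\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE towed (
    tow_date text, make text, style text, model text, color text,
    plate text, state text, towed_address text, phone text, inventory text
)""")

reader = csv.reader(raw)
next(reader)  # skip the header row up front instead of DELETEing it afterwards
conn.executemany("INSERT INTO towed VALUES (?,?,?,?,?,?,?,?,?,?)", reader)

row_count = conn.execute("SELECT COUNT(*) FROM towed").fetchone()[0]
```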
<h5>Using Python + pandas</h5>
<p>Let's do the same with <a href="http://pandas.pydata.org">pandas</a> now.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">col_names</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"tow_date"</span><span class="p">,</span> <span class="s2">"make"</span><span class="p">,</span> <span class="s2">"style"</span><span class="p">,</span> <span class="s2">"model"</span><span class="p">,</span> <span class="s2">"color"</span><span class="p">,</span> <span class="s2">"plate"</span><span class="p">,</span> <span class="s2">"state"</span><span class="p">,</span>
<span class="s2">"towed_address"</span><span class="p">,</span> <span class="s2">"phone"</span><span class="p">,</span> <span class="s2">"inventory"</span><span class="p">]</span>
<span class="n">towed</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"Towed_Vehicles.csv"</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">col_names</span><span class="p">,</span>
<span class="n">skiprows</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s2">"tow_date"</span><span class="p">])</span>
</code></pre></div>
<p>The read_csv function in pandas actually allowed us to skip the two header rows and translate the tow_date field to a datetime field.</p>
<p>Let's check our count just to make sure.</p>
<div class="highlight"><pre><span></span><code><span class="nb">len</span><span class="p">(</span><span class="n">towed</span><span class="p">)</span> <span class="c1"># 5068</span>
</code></pre></div>
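<p>To see skiprows and parse_dates at work without downloading the real file, here's a sketch against a tiny in-memory CSV (requires pandas; the rows are made up, and the doubled header line mimics the city's file):</p>

```python
import io
import pandas as pd

csv_text = (
    "Tow Date,Make\n"   # the downloaded file has two header rows...
    "Tow Date,Make\n"   # ...which skiprows=2 discards
    "01/08/2013,FORD\n"
    "01/10/2013,KIA\n"
)

sample = pd.read_csv(io.StringIO(csv_text), header=None, names=["tow_date", "make"],
                     skiprows=2, parse_dates=["tow_date"])
```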
<h4>Selecting data</h4>
<h5>SQL</h5>
<p>Selecting data with SQL is fairly intuitive - just SELECT the columns you want FROM the particular table you're interested in. You can also take advantage of the LIMIT clause to only see a subset of your data.</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- Return every column for every record in the towed table</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="p">;</span><span class="w"></span>
<span class="c1">-- Return the tow_date, make, style, model, and color for every record in the towed table</span>
<span class="k">SELECT</span><span class="w"> </span><span class="n">tow_date</span><span class="p">,</span><span class="w"> </span><span class="n">make</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="p">,</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="p">;</span><span class="w"></span>
<span class="c1">-- Return every column for the first five records of the towed table</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="p">;</span><span class="w"></span>
<span class="c1">-- Return every column in the towed table - start at the fifth record and show the next ten</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w"> </span><span class="c1">-- records 5-14</span>
</code></pre></div>
<p>Additionally, you can throw a WHERE or ORDER BY (or both) into your queries for proper filtering and ordering of the data returned:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'TX'</span><span class="p">;</span><span class="w"> </span><span class="c1">-- Only towed vehicles from Texas</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">make</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'KIA'</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="k">state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'TX'</span><span class="p">;</span><span class="w"> </span><span class="c1">-- KIAs with Texas plates</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">make</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'KIA'</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">color</span><span class="p">;</span><span class="w"> </span><span class="c1">-- All KIAs ordered by color (A to Z)</span>
</code></pre></div>
<h5>Python + pandas</h5>
<p>Let's do some of the same, but this time let's use pandas:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># show only the make column for all records</span>
<span class="n">towed</span><span class="p">[</span><span class="s2">"make"</span><span class="p">]</span>
<span class="c1"># tow_date, make, style, model, and color for the first ten records</span>
<span class="n">towed</span><span class="p">[[</span><span class="s2">"tow_date"</span><span class="p">,</span> <span class="s2">"make"</span><span class="p">,</span> <span class="s2">"style"</span><span class="p">,</span> <span class="s2">"model"</span><span class="p">,</span> <span class="s2">"color"</span><span class="p">]][:</span><span class="mi">10</span><span class="p">]</span>
<span class="n">towed</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span> <span class="c1"># first five rows (alternatively, you could use towed.head())</span>
</code></pre></div>
<p>Because pandas is built on top of <a href="http://www.numpy.org/">NumPy</a>, we're able to use <a href="http://pandas.pydata.org/pandas-docs/dev/indexing.html#boolean-indexing">boolean indexing</a>. Since we're going to replicate similar statements to the ones we did in SQL, we know we're going to need towed cars from TX made by KIA.</p>
<div class="highlight"><pre><span></span><code><span class="n">towed</span><span class="p">[</span><span class="n">towed</span><span class="p">[</span><span class="s2">"state"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"TX"</span><span class="p">]</span> <span class="c1"># all columns and records where the car was from TX</span>
<span class="n">towed</span><span class="p">[(</span><span class="n">towed</span><span class="p">[</span><span class="s2">"state"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"TX"</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">towed</span><span class="p">[</span><span class="s2">"make"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"KIA"</span><span class="p">)]</span> <span class="c1"># made by KIA AND from TX</span>
<span class="n">towed</span><span class="p">[(</span><span class="n">towed</span><span class="p">[</span><span class="s2">"state"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"MA"</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">towed</span><span class="p">[</span><span class="s2">"make"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"JAGU"</span><span class="p">)]</span> <span class="c1"># made by Jaguar OR from MA</span>
<span class="n">towed</span><span class="p">[</span><span class="n">towed</span><span class="p">[</span><span class="s2">"make"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"KIA"</span><span class="p">]</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s2">"color"</span><span class="p">)</span> <span class="c1"># made by KIA, ordered by color (A to Z)</span>
</code></pre></div>
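<p>One caveat for anyone following along on a newer pandas: the DataFrame.sort method used above was later removed from the library, with sort_values as its replacement. Here's the same filtering and ordering against a toy frame (the rows are made up; requires pandas):</p>

```python
import pandas as pd

# Toy frame standing in for the towed data (made-up rows, not the real dataset).
df = pd.DataFrame({
    "make": ["KIA", "FORD", "KIA", "JAGU"],
    "state": ["TX", "TX", "MA", "IL"],
    "color": ["RED", "BLU", "GRN", "BLK"],
})

tx_kias = df[(df["state"] == "TX") & (df["make"] == "KIA")]  # KIAs with Texas plates
# In modern pandas, DataFrame.sort is gone; sort_values does the same job:
kias_by_color = df[df["make"] == "KIA"].sort_values("color")
```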
<h4>Conclusion, Part 1</h4>
<p>This was obviously a very basic start, but there are a lot of good things about pandas - it's certainly concise and readable. Plus, since it works well with the various science + math packages (<a href="http://www.scipy.org">SciPy</a>, <a href="http://www.numpy.org/">NumPy</a>, <a href="http://matplotlib.org/">Matplotlib</a>, <a href="http://statsmodels.sourceforge.net/">statsmodels</a>, etc.), there's the potential to work almost entirely in one language for analysis tasks.</p>
<p>I plan on covering aggregate functions, pivots, and maybe some matplotlib in my next post.</p>Hello World2013-01-22T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-01-22:/2013/01/22/hello-world/<p>So I finally got around to putting something of my own up.</p>
<p>My intentions are mainly to use this space as a way to document mini projects that I'm working on, so plan on it being pretty programming, data, visualization, and statistics heavy - I get bored if I'm not learning something new or doing something I find challenging. That said, I'm known to go on tangents about music and beer (and whatever else I feel like ranting about at the time).</p>
<p>There's also a high likelihood that I'll constantly be tweaking the layout of the site - hopefully the four people reading won't mind.</p>