Asynchronous Scraping with Python

Want to work together? My team at Sprout Social is looking for a Senior Data Scientist.

Previously, I've written about the basics of scraping and how you can find API calls in order to fetch data that isn't easily downloadable.

For simplicity, the code in these posts has always been synchronous -- given a list of URLs, we process one, then the next, then the next, and so on. While this makes for code that's straight-forward, it can also be slow.

This doesn't have to be the case though. Scraping is often an example of code that is embarrassingly parallel. With some slight changes, our tasks can be done asynchronously, allowing us to process more than one URL at a time.

In version 3.2, Python introduced the concurrent.futures module, which is a joy to use for parallelizing tasks like scraping. The rest of this post will show how we can use the module to make our previously synchronous code asynchronous.

Parallelizing your tasks

Imagine we have a list of several thousand URLs. In previous posts, we've always written something that looks like this:

from csv import DictWriter

URLS = [ ... ]  # thousands of urls for pages we'd like to parse

def parse(url):
    # our logic for parsing the page
    return data  # probably a dict

results = []
for url in URLS:  # go through each url one by one

with open('results.csv', 'w') as f:
    writer = DictWriter(f, fieldnames=results[0].keys())

The above is an example of synchronous code -- we're looping through a list of URLs, processing one at a time. If the list of URLs is relatively small or we're not concerned about execution time, there's little reason to parallelize these tasks -- we might as well keep things simple and wait it out.

However, sometimes we have a huge list of URLs -- at least several thousand -- and we can't wait hours for them to finish.

With concurrent.futures, we can work on multiple URLs at once by adding a ProcessPoolExecutor and making a slight change to how we fetch our results.

But first, a reminder: if you're scraping, don't be a jerk. Space out your requests appropriately and don't hammer the site (i.e. use time.sleep to wait briefly between each request and set max_workers to a small number). Being a jerk runs the risk of getting your IP address blocked -- good luck getting that data now.

from concurrent.futures import ProcessPoolExecutor
import concurrent.futures

URLS = [ ... ]

def parse(url):
    # our logic for parsing the page
    return data  # still probably a dict

with ProcessPoolExecutor(max_workers=4) as executor:
    future_results = {executor.submit(parse, url): url for url in URLS}

    results = []
    for future in concurrent.futures.as_completed(future_results):

In the above code, we're submitting tasks to the executor -- four workers -- each of which will execute the parse function against a URL. This execution does not happen immediately. For each submission, the executor returns an instance of a Future, which tells us that our task will be executed at some point in the ... well, future. The as_completed function watches our future_results for completion, upon which we'll be able to fetch each result via the result method.

My favorite part about this module is the clarity of its API -- tasks are submitted to an executor, which is made up of one or more workers, each of which is churning through our tasks. Because our tasks are executed asynchronously, we are not waiting for a given task's completion before submitting another -- we are doing so at-will, with completion happening in the future. Once completed, we can get the task's result.

Closing up

With a few changes to your code and some concurrent.futures love, you no longer have to fetch those basketball stats one page at a time.

But don't be a jerk either.