<h1>Prototyping a PDF Chatbot from Scratch</h1>
<p>Greg Reda · 2023-10-26</p>
<p>As part of my work on <a href="https://github.com/refstudio/refstudio">refstudio</a>, I spent some time prototyping a chatbot that could answer questions about a corpus of PDFs. Tools like <a href="https://github.com/langchain-ai/langchain">LangChain</a>, <a href="https://github.com/run-llama/llama_index">LlamaIndex</a>, <a href="https://github.com/deepset-ai/haystack">Haystack</a>, and others all have built-in abstractions to simplify this task, but I find that building a simplified version from scratch helps me understand the underlying concepts better.</p>
<p>A basic version of the PDF Chatbot requires two phases with the following steps:</p>
<ol>
<li>PDF Ingestion<ul>
<li>Convert PDFs to text</li>
<li>Chunk the text into smaller pieces</li>
<li>Optional: Generate embeddings for the text chunks</li>
<li>Persist the text chunks (or embeddings) in some way so that we can query them later</li>
</ul>
</li>
<li>Chatbot Interaction<ul>
<li>Take a question from the user</li>
<li>Retrieve the most similar text chunks related to the question<ul>
<li>If we did not create embeddings, we can use a ranking function like <a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25</a> to find the most similar text chunks</li>
<li>If we did create embeddings, we can use a nearest neighbors algorithm to find the most similar text chunks</li>
</ul>
</li>
<li>Include the most similar text chunks as "context" we provide to the LLM with our question (i.e. our prompt)</li>
<li>Return the LLM's response</li>
</ul>
</li>
</ol>
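<p>Strung together, the steps above fit in a short script. The following is a toy sketch, not the refstudio code: the chunker is a naive word-count splitter and the BM25 implementation is minimal, but it shows the no-embeddings path end to end, through prompt assembly.</p>

```python
import math
import re
from collections import Counter

def chunk_text(text: str, size: int = 40) -> list:
    """Naive fixed-size chunking by word count; real pipelines split on structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, chunks: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Okapi BM25 score of each chunk against the query."""
    docs = [tokenize(c) for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for d in docs if term in d)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy corpus standing in for text extracted from PDFs
chunks = [
    "The mitochondria is the powerhouse of the cell.",
    "BM25 is a bag-of-words ranking function used by search engines.",
    "Django is a web framework for perfectionists with deadlines.",
]
question = "What ranking function do search engines use?"
scores = bm25_scores(question, chunks)
best = chunks[scores.index(max(scores))]

# The retrieved chunk becomes the "context" portion of the prompt we send to the LLM
prompt = f"Context:\n{best}\n\nQuestion: {question}"
```

In a real ingestion pipeline the chunks would come from a PDF-to-text step and be persisted, but the retrieve-then-prompt shape stays the same.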
<p>While embeddings and a vector database are not strictly necessary for this task, I wanted to get a sense of the ergonomics in working with one, so I used this as an excuse to try out <a href="https://github.com/lancedb/lancedb">LanceDB</a>. LanceDB is an open-source, embedded vector database with the goal of simplifying retrieval, filtering, and management of embeddings. It's built on Apache Arrow, which I'm a big fan of. </p>
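<p>LanceDB handles persistence and indexing, but the query it answers — nearest neighbors over embedding vectors — is easy to see as brute force. A sketch with toy 2-d vectors standing in for real embeddings:</p>

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query_vec, table, k=2):
    """Brute-force k-nearest-neighbors: score every row, keep the top k."""
    return sorted(table, key=lambda row: cosine(query_vec, row["vector"]), reverse=True)[:k]

# Toy rows; real vectors would come from an embedding model
table = [
    {"vector": [1.0, 0.0], "text": "chunk about retrieval"},
    {"vector": [0.0, 1.0], "text": "chunk about cooking"},
    {"vector": [0.9, 0.1], "text": "another chunk about retrieval"},
]
hits = nearest([1.0, 0.05], table, k=2)
```

A vector database replaces the `sorted` scan with an approximate index so it scales past a few thousand rows, but the interface is the same: a query vector in, the k most similar rows out.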
<h3>Results</h3>
<p>You can find the code for this prototype in <a href="https://github.com/gjreda/scratch-pdf-bot">this github repo</a>.</p>
<p>Here's a quick demo of the chatbot in action:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/r4LAQbu3sd0?si=DarJiS8PYFJrLpKK" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h1>Django Command for FIT files</h1>
<p>Greg Reda · 2023-05-21</p>
<p>FIT - <a href="https://developer.garmin.com/fit/protocol/">Flexible and Interoperable Transfer</a> - is a protocol designed for storing and sharing data from fitness and health devices.</p>
<p>Since getting a <a href="https://coros.com/">Coros</a> running watch in July 2022, I've been exporting the FIT file data to Dropbox after every run.</p>
<p>Having all of this data lying around seemed like a good excuse for a toy project.<sup>1</sup> I haven't done much web programming in the last five years, so I'm building a little web app with Django.</p>
<h2>FIT data</h2>
<p>The data I'm most interested in are the Session, Lap, and Record types from each file.</p>
<p><code>Sessions</code> capture aggregated data about your run - things like total distance, average heart rate, average speed, etc.</p>
<p><code>Laps</code> capture aggregated data about a particular lap of your run. By default, my watch creates one lap every mile. The fields here are similar to sessions - average heart rate, average speed, etc.</p>
<p><code>Records</code> are the raw data about the run. My watch creates a new "record" every second of the run. It captures my latitude and longitude, as well as things like my heart rate, speed, cadence, estimated power output (watts), step length, etc.</p>
<p>To relate this data back to its source file, I've created one additional type called <code>Activity</code>. This contains the source filename and date, and also acts as a foreign key on the <code>Session</code>, <code>Lap</code>, and <code>Record</code> tables.</p>
<p>Mapping each of these to a django model looks like this (I've omitted many fields for the sake of conciseness):</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">django.db</span> <span class="kn">import</span> <span class="n">models</span>
<span class="k">class</span> <span class="nc">Activity</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="n">source_filename</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">FilePathField</span><span class="p">(</span><span class="n">unique</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">began_at</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">DateTimeField</span><span class="p">(</span><span class="n">null</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">Session</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="n">activity</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">ForeignKey</span><span class="p">(</span><span class="n">Activity</span><span class="p">,</span> <span class="n">on_delete</span><span class="o">=</span><span class="n">models</span><span class="o">.</span><span class="n">CASCADE</span><span class="p">)</span>
<span class="n">start_time</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">DateTimeField</span><span class="p">()</span>
<span class="n">total_elapsed_time</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">FloatField</span><span class="p">()</span>
<span class="n">avg_heart_rate</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">PositiveSmallIntegerField</span><span class="p">()</span>
<span class="c1"># ... many more fields</span>
<span class="k">class</span> <span class="nc">Lap</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="n">activity</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">ForeignKey</span><span class="p">(</span><span class="n">Activity</span><span class="p">,</span> <span class="n">on_delete</span><span class="o">=</span><span class="n">models</span><span class="o">.</span><span class="n">CASCADE</span><span class="p">)</span>
<span class="n">total_elapsed_time</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">FloatField</span><span class="p">()</span>
<span class="n">avg_heart_rate</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">PositiveSmallIntegerField</span><span class="p">()</span>
<span class="c1"># ... more fields omitted</span>
<span class="k">class</span> <span class="nc">Record</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="n">activity</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">ForeignKey</span><span class="p">(</span><span class="n">Activity</span><span class="p">,</span> <span class="n">on_delete</span><span class="o">=</span><span class="n">models</span><span class="o">.</span><span class="n">CASCADE</span><span class="p">)</span>
<span class="n">timestamp</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">DateTimeField</span><span class="p">()</span>
<span class="n">position_lat</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">255</span><span class="p">,</span> <span class="n">null</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">position_long</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">255</span><span class="p">,</span> <span class="n">null</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">heart_rate</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">PositiveSmallIntegerField</span><span class="p">(</span><span class="n">null</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># ... more fields omitted</span>
</code></pre></div>
<h2>Ingest command</h2>
<p>Django allows you to register <a href="https://django.readthedocs.io/en/stable/howto/custom-management-commands.html#module-django.core.management">custom commands</a> with your application that can be run via <code>manage.py</code>. This is useful for standalone scripts or ones you'll want to run regularly.</p>
<p>First, some helper functions to use within the command:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span><span class="p">,</span> <span class="n">Dict</span>
<span class="k">def</span> <span class="nf">convert_frame_to_dict</span><span class="p">(</span><span class="n">frame</span><span class="p">)</span> <span class="o">-></span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]:</span>
<span class="k">return</span> <span class="p">{</span><span class="n">field</span><span class="o">.</span><span class="n">name</span><span class="p">:</span> <span class="n">field</span><span class="o">.</span><span class="n">value</span>
<span class="k">for</span> <span class="n">field</span> <span class="ow">in</span> <span class="n">frame</span><span class="o">.</span><span class="n">fields</span><span class="p">}</span>
</code></pre></div>
<p>Data rows from the FIT file are message objects with a property containing the fields, and each field containing a name and value. <code>convert_frame_to_dict</code> converts these to dictionaries so they're easier to work with.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="k">def</span> <span class="nf">extract_datetime_from_filename</span><span class="p">(</span><span class="n">filename</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">Optional</span><span class="p">[</span><span class="n">datetime</span><span class="p">]:</span>
<span class="n">regex</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">"([0-9]+)\.fit"</span>
<span class="n">match</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">regex</span><span class="p">,</span> <span class="n">filename</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">match</span><span class="p">:</span>
<span class="k">return</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">dt</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">match</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="s2">"%Y%m</span><span class="si">%d</span><span class="s2">%H%M%S"</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">ValueError</span><span class="p">:</span>
<span class="k">return</span>
<span class="k">return</span> <span class="n">dt</span>
</code></pre></div>
<p>File names from Coros contain a timestamp marking when the run began (e.g. <code>Run20230520091606.fit</code>). <code>extract_datetime_from_filename</code> parses out that timestamp so it can be stored in the database.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">determine_files_for_ingest</span><span class="p">(</span><span class="n">filepaths</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Path</span><span class="p">])</span> <span class="o">-></span> <span class="n">List</span><span class="p">[</span><span class="n">Path</span><span class="p">]:</span>
<span class="sd">"""</span>
<span class="sd"> Given a list of filepaths, compare with DB to determine which ones should be ingested.</span>
<span class="sd"> Returns a list of filepaths in need of ingest.</span>
<span class="sd"> """</span>
<span class="c1"># get a list of all the files in the DB</span>
<span class="n">ingested_files</span> <span class="o">=</span> <span class="n">Activity</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">values_list</span><span class="p">(</span><span class="s2">"source_filename"</span><span class="p">,</span> <span class="n">flat</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">needs_ingest</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">fp</span> <span class="ow">in</span> <span class="n">filepaths</span><span class="p">:</span>
<span class="k">if</span> <span class="n">fp</span><span class="o">.</span><span class="n">name</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ingested_files</span><span class="p">:</span>
<span class="n">needs_ingest</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">fp</span><span class="p">)</span>
<span class="k">return</span> <span class="n">needs_ingest</span>
</code></pre></div>
<p>Since I will call this command after every run, <code>determine_files_for_ingest</code> compares the FIT file directory with what's already been loaded into the database.</p>
<p>With <a href="https://github.com/polyvertex/fitdecode">fitdecode</a> doing the heavy lifting, my custom command looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Command</span><span class="p">(</span><span class="n">BaseCommand</span><span class="p">):</span>
<span class="n">help</span> <span class="o">=</span> <span class="s2">"Loads FIT file(s) into the database"</span>
<span class="k">def</span> <span class="nf">add_arguments</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">parser</span><span class="p">):</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"fitfile_dir"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">handle</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">options</span><span class="p">):</span>
<span class="n">fitfile_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">options</span><span class="p">[</span><span class="s2">"fitfile_dir"</span><span class="p">])</span>
<span class="n">filepaths</span> <span class="o">=</span> <span class="n">fitfile_dir</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"*Run*.fit"</span><span class="p">)</span>
<span class="c1"># filter out any files that are already in the DB</span>
<span class="n">needs_ingest</span> <span class="o">=</span> <span class="n">determine_files_for_ingest</span><span class="p">(</span><span class="n">filepaths</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">needs_ingest</span><span class="p">)</span><span class="si">}</span><span class="s2"> files to ingest"</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">needs_ingest</span><span class="p">:</span>
<span class="k">return</span>
<span class="c1"># extract date from filename so we can process in chronological order</span>
<span class="n">filepaths</span> <span class="o">=</span> <span class="p">{</span><span class="n">extract_datetime_from_filename</span><span class="p">(</span><span class="n">fp</span><span class="o">.</span><span class="n">name</span><span class="p">):</span> <span class="n">fp</span>
<span class="k">for</span> <span class="n">fp</span> <span class="ow">in</span> <span class="n">needs_ingest</span><span class="p">}</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">fp</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">filepaths</span><span class="o">.</span><span class="n">items</span><span class="p">()):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Loading </span><span class="si">{</span><span class="n">fp</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2"> to database"</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">process_fitfile</span><span class="p">(</span><span class="n">fp</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">ERROR</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Failed to load </span><span class="si">{</span><span class="n">fp</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2"> to database: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">continue</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">SUCCESS</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Successfully loaded </span><span class="si">{</span><span class="n">fp</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2"> to database"</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">process_fitfile</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">filepath</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
<span class="k">with</span> <span class="n">fitdecode</span><span class="o">.</span><span class="n">FitReader</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span> <span class="k">as</span> <span class="n">fit</span><span class="p">:</span>
<span class="n">activity</span> <span class="o">=</span> <span class="n">Activity</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
<span class="n">source_filename</span><span class="o">=</span><span class="n">filepath</span><span class="o">.</span><span class="n">name</span><span class="p">,</span>
<span class="n">began_at</span><span class="o">=</span><span class="n">extract_datetime_from_filename</span><span class="p">(</span><span class="n">filepath</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">for</span> <span class="n">frame</span> <span class="ow">in</span> <span class="n">fit</span><span class="p">:</span>
<span class="k">if</span> <span class="n">frame</span><span class="o">.</span><span class="n">frame_type</span> <span class="o">!=</span> <span class="n">fitdecode</span><span class="o">.</span><span class="n">FIT_FRAME_DATA</span><span class="p">:</span>
<span class="k">continue</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">convert_frame_to_dict</span><span class="p">(</span><span class="n">frame</span><span class="p">)</span>
<span class="k">if</span> <span class="n">frame</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">'session'</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">create_session</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">activity</span><span class="p">)</span>
<span class="k">if</span> <span class="n">frame</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">'lap'</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">create_lap</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">activity</span><span class="p">)</span>
<span class="k">if</span> <span class="n">frame</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="s1">'record'</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">create_record</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">activity</span><span class="p">)</span>
</code></pre></div>
<p>I've omitted the code for <code>create_session</code>, <code>create_lap</code>, and <code>create_record</code>. Each just instantiates the appropriate model and calls its <code>save</code> method.</p>
<p>Running the command with <code>manage.py</code>:</p>
<div class="highlight"><pre><span></span><code>$ python manage.py ingest_fitfiles ../Apps/coros
</code></pre></div>
<p>The payoff? Now I can easily write SQL queries against my running data!</p>
<div class="highlight"><pre><span></span><code>$ sqlite3 db.sqlite3 < sql/weekly_totals.sql -table
+---------+---------------+-------+--------------+----------+
<span class="p">|</span> week <span class="p">|</span> hours_running <span class="p">|</span> miles <span class="p">|</span> feet_climbed <span class="p">|</span> calories <span class="p">|</span>
+---------+---------------+-------+--------------+----------+
<span class="p">|</span> <span class="m">2023</span>-14 <span class="p">|</span> <span class="m">3</span>.5 <span class="p">|</span> <span class="m">21</span>.8 <span class="p">|</span> <span class="m">883</span>.0 <span class="p">|</span> <span class="m">2408</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-15 <span class="p">|</span> <span class="m">3</span>.6 <span class="p">|</span> <span class="m">22</span>.2 <span class="p">|</span> <span class="m">1316</span>.0 <span class="p">|</span> <span class="m">2375</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-16 <span class="p">|</span> <span class="m">4</span>.3 <span class="p">|</span> <span class="m">28</span>.3 <span class="p">|</span> <span class="m">1486</span>.0 <span class="p">|</span> <span class="m">3102</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-17 <span class="p">|</span> <span class="m">4</span>.8 <span class="p">|</span> <span class="m">30</span>.6 <span class="p">|</span> <span class="m">2139</span>.0 <span class="p">|</span> <span class="m">3328</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-18 <span class="p">|</span> <span class="m">2</span>.2 <span class="p">|</span> <span class="m">14</span>.5 <span class="p">|</span> <span class="m">912</span>.0 <span class="p">|</span> <span class="m">1703</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-19 <span class="p">|</span> <span class="m">4</span>.8 <span class="p">|</span> <span class="m">31</span>.0 <span class="p">|</span> <span class="m">1686</span>.0 <span class="p">|</span> <span class="m">3626</span>.0 <span class="p">|</span>
<span class="p">|</span> <span class="m">2023</span>-20 <span class="p">|</span> <span class="m">5</span>.2 <span class="p">|</span> <span class="m">33</span>.6 <span class="p">|</span> <span class="m">1765</span>.0 <span class="p">|</span> <span class="m">3871</span>.0 <span class="p">|</span>
+---------+---------------+-------+--------------+----------+
</code></pre></div>
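<p>The contents of <code>weekly_totals.sql</code> aren't shown here, but a hypothetical, simplified version of that rollup can be tried against an in-memory database with Python's <code>sqlite3</code>. The schema and unit conversions below are illustrative, not the app's actual tables (Django would prefix the table name, and FIT stores time in seconds and distance in meters):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE session (start_time TEXT, total_elapsed_time REAL, total_distance REAL)"
)
# A few fake sessions: (date, seconds run, meters covered)
conn.executemany(
    "INSERT INTO session VALUES (?, ?, ?)",
    [("2023-05-15", 3600.0, 16093.4),
     ("2023-05-17", 1800.0, 8046.7),
     ("2023-05-22", 2700.0, 12070.0)],
)

# Roll sessions up by ISO-style week: hours run and miles covered
query = """
SELECT strftime('%Y-%W', start_time) AS week,
       ROUND(SUM(total_elapsed_time) / 3600.0, 1) AS hours_running,
       ROUND(SUM(total_distance) / 1609.34, 1) AS miles
FROM session
GROUP BY week
ORDER BY week
"""
for row in conn.execute(query):
    print(row)
```

The real query also sums climb and calories, but the shape is the same: a `strftime` bucket, a `GROUP BY`, and per-week aggregates.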
<p>Sure, Strava already does a lot of this, but where's the fun in that?</p>
<hr>
<ol>
<li>My history with side projects is one of abandonment.</li>
</ol>
<h1>Notes on using PyInstaller, poetry, and pyenv</h1>
<p>Greg Reda · 2023-05-18</p>
<p>In all my years of working in Python, I don't think I've ever had to create a standalone executable. But it finally happened.</p>
<p>It was a pretty seamless experience, but I did hit a minor hiccup, so I wanted to capture some notes for my future self (and others, too).</p>
<p>I was using <a href="https://github.com/pyenv/pyenv">pyenv</a> and <a href="https://python-poetry.org/">poetry</a> to manage my Python environments and dependencies, and <a href="https://pyinstaller.org/en/stable/">PyInstaller</a> to create the executable.</p>
<hr>
<p>Let's assume you have the following project setup:</p>
<div class="highlight"><pre><span></span><code>$ ls -la
total <span class="m">56</span>
drwxr-xr-x@ <span class="m">5</span> greg staff <span class="m">160</span> May <span class="m">18</span> <span class="m">17</span>:26 .
drwxr-xr-x@ <span class="m">5</span> greg staff <span class="m">160</span> May <span class="m">18</span> <span class="m">17</span>:23 ..
-rw-r--r--@ <span class="m">1</span> greg staff <span class="m">22220</span> May <span class="m">18</span> <span class="m">17</span>:28 poetry.lock
-rw-r--r--@ <span class="m">1</span> greg staff <span class="m">381</span> May <span class="m">18</span> <span class="m">17</span>:28 pyproject.toml
drwxr-xr-x@ <span class="m">3</span> greg staff <span class="m">96</span> May <span class="m">18</span> <span class="m">17</span>:24 src
</code></pre></div>
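<p>The <code>pyproject.toml</code> contents aren't shown; a minimal one matching this layout might look like the following (the project name and version pins are illustrative, not the actual file):</p>

```toml
[tool.poetry]
name = "scraper"
version = "0.1.0"
description = ""
authors = ["Greg Reda"]

[tool.poetry.dependencies]
python = "^3.9"
requests = "^2.28"
beautifulsoup4 = "^4.12"

[tool.poetry.group.dev.dependencies]
pyinstaller = "^5.10"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```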
<p>Let's also assume you have a simple Python script at <code>src/main.py</code> that you want to turn into an executable.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># src/main.py</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="k">def</span> <span class="nf">get_data</span><span class="p">():</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'https://www.google.com'</span><span class="p">)</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s1">'html.parser'</span><span class="p">)</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'title'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Hi, the title element of Google's webpage is: </span><span class="si">{</span><span class="n">title</span><span class="o">.</span><span class="n">text</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
<span class="n">get_data</span><span class="p">()</span>
</code></pre></div>
<p>You can create the executable via:</p>
<div class="highlight"><pre><span></span><code>$ poetry run pyinstaller src/main.py
</code></pre></div>
<p>This should create a <code>dist</code> directory with the executable and all the necessary libraries. You can run the executable via:</p>
<div class="highlight"><pre><span></span><code>$ ./dist/main/main
Hi, the title element of Google<span class="err">'</span>s webpage is: Google
</code></pre></div>
<p>But if you are using pyenv, you may run into the following error:</p>
<div class="highlight"><pre><span></span><code><span class="n">OSError</span><span class="o">:</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">library</span><span class="w"> </span><span class="n">not</span><span class="w"> </span><span class="n">found</span><span class="o">:</span><span class="w"> </span><span class="n">libpython3</span><span class="o">.</span><span class="mi">9</span><span class="o">.</span><span class="na">dylib</span><span class="o">,</span><span class="w"> </span><span class="n">Python</span><span class="o">,</span><span class="w"> </span><span class="n">libpython3</span><span class="o">.</span><span class="mi">9</span><span class="n">m</span><span class="o">.</span><span class="na">dylib</span><span class="o">,</span><span class="w"> </span><span class="o">.</span><span class="na">Python</span><span class="o">,</span><span class="w"> </span><span class="n">Python3</span><span class="w"></span>
<span class="w"> </span><span class="n">This</span><span class="w"> </span><span class="n">means</span><span class="w"> </span><span class="n">your</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">installation</span><span class="w"> </span><span class="n">does</span><span class="w"> </span><span class="n">not</span><span class="w"> </span><span class="n">come</span><span class="w"> </span><span class="k">with</span><span class="w"> </span><span class="n">proper</span><span class="w"> </span><span class="n">shared</span><span class="w"> </span><span class="n">library</span><span class="w"> </span><span class="n">files</span><span class="o">.</span><span class="w"></span>
<span class="w"> </span><span class="n">This</span><span class="w"> </span><span class="n">usually</span><span class="w"> </span><span class="n">happens</span><span class="w"> </span><span class="n">due</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">missing</span><span class="w"> </span><span class="n">development</span><span class="w"> </span><span class="kd">package</span><span class="o">,</span><span class="w"> </span><span class="n">or</span><span class="w"> </span><span class="n">unsuitable</span><span class="w"> </span><span class="n">build</span><span class="w"> </span><span class="n">parameters</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">installation</span><span class="o">.</span><span class="w"></span>
<span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">On</span><span class="w"> </span><span class="n">Debian</span><span class="o">/</span><span class="n">Ubuntu</span><span class="o">,</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">need</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">development</span><span class="w"> </span><span class="n">packages</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">apt</span><span class="o">-</span><span class="kd">get</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">python3</span><span class="o">-</span><span class="n">dev</span><span class="w"></span>
<span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">apt</span><span class="o">-</span><span class="kd">get</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">python</span><span class="o">-</span><span class="n">dev</span><span class="w"></span>
<span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">If</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">building</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">yourself</span><span class="o">,</span><span class="w"> </span><span class="n">rebuild</span><span class="w"> </span><span class="k">with</span><span class="w"> </span><span class="err">`</span><span class="o">--</span><span class="n">enable</span><span class="o">-</span><span class="n">shared</span><span class="err">`</span><span class="w"> </span><span class="o">(</span><span class="n">or</span><span class="o">,</span><span class="w"> </span><span class="err">`</span><span class="o">--</span><span class="n">enable</span><span class="o">-</span><span class="n">framework</span><span class="err">`</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">macOS</span><span class="o">)</span><span class="w"></span>
</code></pre></div>
<p>To fix this, you need to reinstall the appropriate version of Python with the <code>--enable-framework</code> flag on macOS (on Linux, use <code>--enable-shared</code> instead):</p>
<div class="highlight"><pre><span></span><code>env <span class="nv">PYTHON_CONFIGURE_OPTS</span><span class="o">=</span><span class="s2">"--enable-framework"</span> pyenv install <span class="m">3</span>.10.11
</code></pre></div>
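<p>To sanity-check the rebuild, you can ask the interpreter how it was configured. This is a general-purpose check, not specific to pyenv: a macOS framework build reports <code>PYTHONFRAMEWORK</code> as <code>Python</code>, while an <code>--enable-shared</code> build reports <code>Py_ENABLE_SHARED</code> as <code>1</code>; a plain static build reports neither.</p>

```python
# Print how this interpreter was built. PyInstaller needs either a
# framework build (macOS) or a shared libpython to locate the library.
import sysconfig

framework = sysconfig.get_config_var('PYTHONFRAMEWORK') or ''
shared = int(sysconfig.get_config_var('Py_ENABLE_SHARED') or 0)
print(f"framework={framework!r} shared={shared}")
```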
<p>You may also need to point your Poetry environment at the newly built Python version:</p>
<div class="highlight"><pre><span></span><code>$ pyenv <span class="nb">local</span> <span class="m">3</span>.10.11
$ poetry env use <span class="k">$(</span>which python<span class="k">)</span>
</code></pre></div>
<p>If you'd already installed your dependencies via poetry, you'll have to reinstall them:</p>
<div class="highlight"><pre><span></span><code>$ poetry install
</code></pre></div>
<p>Now you should be able to create the executable:</p>
<div class="highlight"><pre><span></span><code>$ poetry run pyinstaller src/main.py
</code></pre></div>
<p>Check that it worked:</p>
<div class="highlight"><pre><span></span><code>$ ls -la
total <span class="m">64</span>
drwxr-xr-x@ <span class="m">8</span> greg staff <span class="m">256</span> May <span class="m">18</span> <span class="m">17</span>:51 .
drwxr-xr-x@ <span class="m">5</span> greg staff <span class="m">160</span> May <span class="m">18</span> <span class="m">17</span>:23 ..
drwxr-xr-x@ <span class="m">3</span> greg staff <span class="m">96</span> May <span class="m">18</span> <span class="m">17</span>:51 build
drwxr-xr-x@ <span class="m">3</span> greg staff <span class="m">96</span> May <span class="m">18</span> <span class="m">17</span>:54 dist
-rw-r--r--@ <span class="m">1</span> greg staff <span class="m">889</span> May <span class="m">18</span> <span class="m">17</span>:54 main.spec
-rw-r--r--@ <span class="m">1</span> greg staff <span class="m">22220</span> May <span class="m">18</span> <span class="m">17</span>:28 poetry.lock
-rw-r--r--@ <span class="m">1</span> greg staff <span class="m">381</span> May <span class="m">18</span> <span class="m">17</span>:28 pyproject.toml
drwxr-xr-x@ <span class="m">3</span> greg staff <span class="m">96</span> May <span class="m">18</span> <span class="m">17</span>:24 src
</code></pre></div>
<p>And run the executable:</p>
<div class="highlight"><pre><span></span><code>$ ./dist/main/main
Hi, the title element of Google<span class="err">'</span>s webpage is: Google
</code></pre></div>
<p>The end.</p>Assorted bits: 2022-12-092022-12-09T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2022-12-09:/2022/12/09/assorted-bits-2022-12-09/<p>I’ve wanted to get in the habit of writing more, so I’m taking some inspiration from old school blogs and sharing some things I've recently enjoyed.</p>
<p>Enjoy your weekend!</p>
<h3>[Music] Spectrum by Max Cooper</h3>
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/track/4rw9xbxHWWRuihfAvQG3M2?utm_source=generator" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
<p>I'm very into synthesizer centric music. I'm also into music that creates a sonic …</p><p>I’ve wanted to get in the habit of writing more, so I’m taking some inspiration from old school blogs and sharing some things I've recently enjoyed.</p>
<p>Enjoy your weekend!</p>
<h3>[Music] Spectrum by Max Cooper</h3>
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/track/4rw9xbxHWWRuihfAvQG3M2?utm_source=generator" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
<p>I'm very into synthesizer-centric music. I'm also into music that creates a sonic landscape: that feeling that a piece of audio art is being built around you, and you get to sit back and take it all in. This song by <a href="https://maxcooper.net/">Max Cooper</a> nails that. Put on some good headphones, relax, and give it a listen. It made me want to stop what I was doing and make some music.</p>
<h3>[Reading] You Are More Than Just Your Job</h3>
<blockquote>
<p>If I could, I’d tell my younger self to resist letting what job I have (or don’t have) dominate my identity. - <a href="https://www.jopwell.com/thewell/posts/more-than-just-your-job">You Are More Than Just Your Job</a></p>
</blockquote>
<p>I appreciate and echo the sentiment that Jenna Discher shares in the above article. After my <a href="/2022/11/30/this-ones-for-me/">last few years of health surprises</a>, I've learned that allowing any singular piece of my identity to become overly dominant risks an identity crisis when there's an unexpected shock.</p>
<h3>[Reading] Get Numb Before You Get Good</h3>
<blockquote>
<p>Perhaps you’ve attempted to write a blog, or you’ve taken up knitting and are loath to post pictures of your first pair of socks online, or you dread your first pitch meeting and have over-prepped over the weekend. Or perhaps you were like I was: you’d gotten comfortable in your job, and without realising it you had neglected the fear of doing new things for a bit. - <a href="https://commoncog.com/get-numb-get-good/">Get Numb Before You Get Good</a></p>
</blockquote>
<p>Doing anything for the first time is difficult, and that difficulty can prevent us from trying or continuing. It's easy to have unreasonable expectations and be disappointed when our early attempts are not as good as we'd like (or as good as someone else's). I like Cedric Chin's advice to "get numb first" - to just focus on doing - and then worry about getting good.</p>This One's For Me2022-11-30T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2022-11-30:/2022/11/30/this-ones-for-me/<p>I had a heart attack earlier this year. A <a href="https://en.wikipedia.org/wiki/Left_anterior_descending_artery#Widow_maker">"widow maker" heart attack</a>. I fell into cardiac arrest and was clinically dead for a couple of minutes. I'm incredibly fortunate two nurses happened to be nearby. They defibrillated me and administered CPR until the ambulance arrived and took me to …</p><p>I had a heart attack earlier this year. A <a href="https://en.wikipedia.org/wiki/Left_anterior_descending_artery#Widow_maker">"widow maker" heart attack</a>. I fell into cardiac arrest and was clinically dead for a couple of minutes. I'm incredibly fortunate two nurses happened to be nearby. They defibrillated me and administered CPR until the ambulance arrived and took me to the hospital. Those nurses quite literally saved my life.</p>
<p>In 2019, I crashed my bike while descending a mountain road at 30mph. I went over the handlebars, cracked my helmet on the pavement, and slid for some time. A mother and daughter found me trying to get back on my bike and ride home. They kindly called an ambulance, then my wife, and waited with me. I don’t really remember the wait, ambulance ride, or hospital visit due to the concussion.<sup>1</sup> I've only ridden my bike twice since.</p>
<p>In 2018, I was diagnosed with <a href="https://en.wikipedia.org/wiki/Chronic_myelogenous_leukemia">chronic myelogenous leukemia</a> (CML). Cancer. Fortunately, it was the best possible kind of cancer: easily manageable with targeted medication (though as of now, incurable). Thanks to the advent of <a href="https://en.wikipedia.org/wiki/Bcr-Abl_tyrosine-kinase_inhibitor">Bcr-Abl tyrosine kinase inhibitors</a>, it shouldn’t affect my lifespan. Still, that word - cancer - carries a lot of weight.</p>
<p>These three events have derailed me over the last few years.<sup>2</sup></p>
<figure>
<img src="/images/cml-exercise-bike-in-hospital.jpg" alt="Riding the exercise bike during my CML hospital stay.">
<figcaption>Riding the exercise bike during my CML hospital stay.</figcaption>
</figure>
<p>When the CML diagnosis came, I was training for what would have been my fourth <a href="https://en.wikipedia.org/wiki/Century_ride">imperial century</a>. Cycling had long been my mental and physical outlet. I'd just begun to take it more seriously, including bike commuting and training throughout winter in Chicago. I was probably in the best shape of my life, both mentally and physically.</p>
<figure>
<img src="/images/bike-top-of-san-bruno.jpg" alt="Atop San Bruno Mountain. I crashed on the descent shortly after taking this picture.">
<figcaption>Atop San Bruno Mountain. I crashed on the descent shortly after taking this picture.</figcaption>
</figure>
<p>When I crashed my bike, I was working my way back into that fitness. My wife and I moved to San Francisco not long after my CML diagnosis. I couldn't have been more excited to do the type of riding I'd longed for: lush, beautiful scenery, mountains, and gravel trails. The crash scared me away from all of that.</p>
<figure>
<img src="/images/sf-skyline-from-fort-point.jpg" alt="Looking back at Alcatraz and the San Francisco skyline from Fort Point. I had the heart attack less than an hour later. I don't remember taking this picture.">
<figcaption>Looking back at Alcatraz and the San Francisco skyline from Fort Point. I had the heart attack about 30 minutes later. I don't remember taking this picture.</figcaption>
</figure>
<p>The heart attack came at the end of a 3.5-mile run at Crissy Field. Again, I was working my way back into that fitness I’d first lost when the CML diagnosis came, and then never fully gained back after being scared off my bike.</p>
<h3>Why am I writing this?</h3>
<p>To finally get it off my chest.</p>
<p>I've written countless drafts of this post over the years. Each has been heavily influenced by the period and mental state in which it was written. They've typically been a mix of dramatics, confusion, and depression. Eventually they reach some form of acceptance, but they all ramble in the same way.</p>
<p>I've never been able to figure out the point of publishing them. It felt like this wasn't the place. It wasn't "who I was." My online identity was that of a software engineer and data scientist. My writing focused on technical topics. My identity felt singular.</p>
<p>Still, I've found it hard to write much of anything over these last few years. I've wanted to, but it always felt as if there was some imaginary hurdle I couldn't clear. I felt mentally blocked. Those drafts were standing in my way.</p>
<p>The point hit me during a recent run through Golden Gate Park. That lush, beautiful scenery I'd dreamed of cycling through when we moved here.</p>
<p>Those unpublished drafts - and this published one - are for me. They've allowed me to let it all out. To process it. To begin moving forward.</p>
<p>Of course, lots of therapy helped too.</p>
<h3>Now</h3>
<p>I feel back, and in better shape than before. Both physically <em>and</em> mentally.</p>
<p>Prior to my heart attack, I couldn’t run a 5k without walking. I never even considered a 10k.<sup>3</sup> I'd never run more than 66km in a month.</p>
<p>Last week I ran a 5k in 26:18, an 8:30 per mile pace. This morning I ran a 10k in 57:18, a 9:13 pace. My last three months of running distances were 75, 89, and 111 kilometers. </p>
<p>Progress.</p>
<p>It's been a long road back, but I'm writing this to remind myself - and maybe you too - that goals are achieved by slow and steady progress. There are no quick fixes for physical and mental health. But you can work your way to where you want to be, little by little over time.</p>
<p>This post is for me. It's me allowing myself to be proud of the mental and physical work I've put in these last few years.</p>
<hr>
<ol>
<li>I only know what happened because the daughter Googled me, found this site and my email address, and checked on me via email.</li>
<li>Let’s not forget a global pandemic in 2020 and 2021.</li>
<li>And here I was thinking I was just out of shape.</li>
</ol>Reviving this space2022-11-18T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2022-11-18:/2022/11/18/reviving-this-space/<p>Despite regularly wanting to write more, I haven’t written anything in 18 months. I have countless drafts that I can’t seem to finish, or at least get into a place I feel comfortable publishing. I've mentally blocked myself from doing so.</p>
<p>I can think of plenty of reasons …</p><p>Despite regularly wanting to write more, I haven’t written anything in 18 months. I have countless drafts that I can’t seem to finish, or at least get into a place I feel comfortable publishing. I've mentally blocked myself from doing so.</p>
<p>I can think of plenty of reasons for this, some of which I'll elaborate on in the future, but I think the Twitter-ification of my brain has been one of them. I struggle to think as deeply as I used to.</p>
<p>I’ve also felt hampered by past “success” of some posts. I’ve felt an obligation to stick to the “technical post” theme. My Google Analytics tells me that’s what people come here for. My ego wants to give the people what they want. I need to drive engagement, to get more readers, to hit the front page of Hacker News.</p>
<p>My brain got out of whack. I cared about the wrong things.</p>
<p>I'd like to fix that.</p>
<p>To start, I'm removing Google Analytics from this site and I don't intend to replace it with anything.<sup>1</sup> It's done more harm than good for me.</p>
<p>I also intend to expand the scope of topics I write about. I have mostly written technical content around data science. Put another way, I have mostly written about my profession. My self-identity has been pretty one-dimensional. I'd like to break out of that.</p>
<p>I started this blog as a place to share things I'm working on. As I've gotten older I've found that I don't enjoy "working" outside of work as much as I used to.</p>
<p>I'll still do some of that, but I also want this to be a place where I organize my thoughts around topics I'm interested in. Writing is the means by which I do my best thinking.</p>
<hr>
<ol>
<li>Thanks for the idea, <a href="https://www.treycausey.com/writing.html">Trey</a>.</li>
</ol>Mocking an imported module-level function in Python2021-06-28T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2021-06-28:/2021/06/28/mocking-imported-module-function-python/<p>The other day I spent far too much time trying to figure out how to mock a module-level function that was being used inside of my class's method in Python. My googling didn't lead to obvious answers, so I figured it'd be good to document here for future reference.</p>
<p>Imagine …</p><p>The other day I spent far too much time trying to figure out how to mock a module-level function that was being used inside of my class's method in Python. My googling didn't lead to obvious answers, so I figured it'd be good to document here for future reference.</p>
<p>Imagine we have some module-level function like the following:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># file: project/some_module/functions.py</span>
<span class="k">def</span> <span class="nf">fetch_thing</span><span class="p">():</span>
    <span class="c1"># query some database</span>
    <span class="k">return</span> <span class="n">data</span>
</code></pre></div>
<p>And that we use it inside of a class within a different module:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># file: project/other_module/thing.py</span>
<span class="kn">from</span> <span class="nn">some_module.functions</span> <span class="kn">import</span> <span class="n">fetch_thing</span>
<span class="k">class</span> <span class="nc">Thing</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">data</span> <span class="o">=</span> <span class="n">fetch_thing</span><span class="p">()</span>
            <span class="k">return</span> <span class="n">data</span>
        <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">fail_gracefully</span><span class="p">()</span>
</code></pre></div>
<p>In this example, I want to test that a failure fetching from the db will fail gracefully, so I need to mock <code>fetch_thing</code> and have it raise an exception.</p>
<p>I kept trying to mock the function at its module path, like so:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">unittest.mock</span> <span class="kn">import</span> <span class="n">patch</span>
<span class="kn">from</span> <span class="nn">other_module.thing</span> <span class="kn">import</span> <span class="n">Thing</span>

<span class="n">thing</span> <span class="o">=</span> <span class="n">Thing</span><span class="p">()</span>
<span class="k">with</span> <span class="n">patch</span><span class="p">(</span><span class="s1">'some_module.functions.fetch_thing'</span><span class="p">)</span> <span class="k">as</span> <span class="n">mocked</span><span class="p">:</span>
    <span class="n">mocked</span><span class="o">.</span><span class="n">side_effect</span> <span class="o">=</span> <span class="ne">Exception</span><span class="p">(</span><span class="s1">'mocked error'</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">thing</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
</code></pre></div>
<p>But this isn't right. It turns out that you need to mock/patch the function <strong>within the module it's being imported into.</strong></p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">unittest.mock</span> <span class="kn">import</span> <span class="n">patch</span>
<span class="kn">from</span> <span class="nn">other_module.thing</span> <span class="kn">import</span> <span class="n">Thing</span>

<span class="n">thing</span> <span class="o">=</span> <span class="n">Thing</span><span class="p">()</span>
<span class="k">with</span> <span class="n">patch</span><span class="p">(</span><span class="s1">'other_module.thing.fetch_thing'</span><span class="p">)</span> <span class="k">as</span> <span class="n">mocked</span><span class="p">:</span>
    <span class="n">mocked</span><span class="o">.</span><span class="n">side_effect</span> <span class="o">=</span> <span class="ne">Exception</span><span class="p">(</span><span class="s1">'mocked error'</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">thing</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
</code></pre></div>
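<p>If you want to see this behavior in isolation, here's a self-contained sketch. The module names are hypothetical and the modules are built in memory, so it runs without any project layout:</p>

```python
# Demonstrates "patch where it's used" with two throwaway in-memory
# modules (hypothetical names, registered in sys.modules so that
# unittest.mock.patch can resolve them by string path).
import sys
import types
from unittest.mock import patch

# The defining module: like some_module/functions.py in the post.
functions = types.ModuleType('functions')
functions.fetch_thing = lambda: 'real data'
sys.modules['functions'] = functions

# The importing module: `from functions import fetch_thing` copies the
# function reference into this module's own namespace.
thing_module = types.ModuleType('thing')
thing_module.fetch_thing = functions.fetch_thing
sys.modules['thing'] = thing_module

class Thing:
    def run(self):
        try:
            return thing_module.fetch_thing()
        except Exception:
            return 'failed gracefully'

t = Thing()

# Patching the defining module leaves the imported reference untouched...
with patch('functions.fetch_thing', side_effect=Exception('boom')):
    print(t.run())  # prints: real data

# ...but patching the importing module's namespace works as intended.
with patch('thing.fetch_thing', side_effect=Exception('boom')):
    print(t.run())  # prints: failed gracefully
```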
<p>Note the very subtle difference in the string path. Because the function is imported into <code>other_module.thing</code>, where our class uses it, <strong>that</strong> is the namespace we need to patch.</p>Using Go and Twilio to monitor my email2020-12-11T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2020-12-11:/2020/12/11/using-go-and-twilio-to-monitor-my-email/<p>Sometimes I'm expecting an email and want to be notified shortly after receiving it. But I also don't want to stare at my inbox, something I don't particularly enjoy checking in the first place.</p>
<p>To illustrate an example, imagine you're browsing a niche site with limited edition goods, some of …</p><p>Sometimes I'm expecting an email and want to be notified shortly after receiving it. But I also don't want to stare at my inbox, something I don't particularly enjoy checking in the first place.</p>
<p>To illustrate an example, imagine you're browsing a niche site with limited edition goods, some of which you love but are out of stock (they're limited, after all). Each item has a helpful "Join waitlist" button, allowing you to provide your email address and receive an email once the item is back in stock. Great feature!</p>
<p>There are a couple key pieces to the above scenario though:</p>
<ol>
<li>The items are limited (supply).</li>
<li>There's a waitlist of unknown size (demand).</li>
</ol>
<p>In effect, we are being told that supply is fixed and demand is not. If demand is far greater than supply it's likely the item will go out of stock again shortly after the email goes out. This is because those who receive the email first will rush to purchase it, knowing that it's a limited item. How can you ensure you see the email shortly after it's sent?</p>
<p>One idea is to just turn on push notifications for all email, but this approach would have a lot of noise and little signal. I'd like to be notified when a <em>specific</em> email arrives, not when <em>any</em> email arrives.</p>
<p>I spend a lot of time in the Messages app texting with friends and family, so a service that sends me a text message would be great, since I'd see it sooner than an email.</p>
<p>Knowing <a href="https://developers.google.com/gmail/api">Gmail has an API</a> and <a href="https://www.twilio.com/referral/XCX3Mu">Twilio</a> would make the text messaging piece easy, this felt like a fun little problem to solve and a good excuse to try a new programming language. I opted for <a href="https://golang.org/">Go</a>.</p>
<h2>Why Go</h2>
<p>I've primarily worked in <a href="https://www.python.org/">Python</a> for the last decade. It's a language that I know and love deeply, and I especially appreciate its emphasis on readability and simplicity. It's a language that allows me to focus on the problem I am solving and doesn't get in the way.</p>
<p>But two common complaints that many Python users eventually have are the language's lack of static typing and that it is slow. While I've rarely found performance to truly be a bottleneck, I have gained an appreciation for statically typed, compiled languages.</p>
<p>Go was born at a time when Python adoption was on the rise thanks to the above qualities. While languages like Java and C++ allowed for more performant solutions, each came with more verbosity and complexity.</p>
<p>Go was designed with developer productivity as a primary concern. One of its creators, Rob Pike, <a href="https://commandcenter.blogspot.com/2012/06/less-is-exponentially-more.html">describes it best</a>:</p>
<blockquote>
<p>What you're given is a set of powerful but easy to understand, easy to use building blocks from which you can assemble—compose—a solution to your problem. It might not end up quite as fast or as sophisticated or as ideologically motivated as the solution you'd write in some of those other languages, but it'll almost certainly be easier to write, easier to read, easier to understand, easier to maintain, and maybe safer.</p>
<p>To put it another way, oversimplifying of course:</p>
<p>Python and Ruby programmers come to Go because they don't have to surrender much expressiveness, but gain performance and get to play with concurrency.</p>
</blockquote>
<p>This philosophy feels very <a href="https://stackoverflow.com/a/25011492/1419514">Pythonic</a> to me. It's the reason I opted to give Go a ... uh, go.</p>
<h2>Code</h2>
<p>A Google search of "golang gmail" brings up a <a href="https://developers.google.com/gmail/api/quickstart/go">quickstart</a> on using the Gmail API and Go to read your inbox labels. The vast majority of this code is authentication handling but it's also almost everything we need.</p>
<p>To search our inbox and send a text when the search has results, we'll add the following functions to the quickstart code:</p>
<ol>
<li><code>queryMessages</code>, which will call Gmail's <a href="https://developers.google.com/gmail/api/reference/rest/v1/users.messages/list"><code>users.messages.list</code></a> method to search a user's inbox and return any matching messages.</li>
<li><code>buildSMS</code>, which will create the message content to be sent via text/SMS message.</li>
<li><code>sendSMS</code>, which will use the <a href="https://www.twilio.com/docs/usage/api">Twilio REST API</a> to send the text message to a given phone number.</li>
</ol>
<h4>queryMessages</h4>
<ol>
<li>Takes inputs of a <a href="https://pkg.go.dev/google.golang.org/api/gmail/v1#Service">Gmail Service object</a>, a string denoting the user, and another string for the search <code>q</code> (e.g. "foo", "from:foo", etc.). Note the <code>*</code> symbol preceding a type indicates it is a <a href="https://en.wikipedia.org/wiki/Pointer_(computer_programming)">pointer</a>. Go allows objects to be passed by pointer, differing from Python's "<a href="https://docs.python.org/3/faq/programming.html#how-do-i-write-a-function-with-output-parameters-call-by-reference">pass by assignment</a>".</li>
<li>Using the <code>service</code> pointer, calls Gmail's <code>list</code> endpoint with the <code>q</code> parameter to find any messages matching the search. This is akin to using the search box within Gmail.</li>
<li>Does some logging and checks to ensure the API returns a valid response.</li>
<li>Returns an array of <a href="https://pkg.go.dev/google.golang.org/api/gmail/v1#Message"><code>Message</code></a> pointers.</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="kd">func</span><span class="w"> </span><span class="nx">queryMessages</span><span class="p">(</span><span class="nx">service</span><span class="w"> </span><span class="o">*</span><span class="nx">gmail</span><span class="p">.</span><span class="nx">Service</span><span class="p">,</span><span class="w"> </span><span class="nx">user</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">q</span><span class="w"> </span><span class="kt">string</span><span class="p">)</span><span class="w"> </span><span class="p">[]</span><span class="o">*</span><span class="nx">gmail</span><span class="p">.</span><span class="nx">Message</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"Searching for messages containing: %v"</span><span class="p">,</span><span class="w"> </span><span class="nx">q</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">response</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">service</span><span class="p">.</span><span class="nx">Users</span><span class="p">.</span><span class="nx">Messages</span><span class="p">.</span><span class="nx">List</span><span class="p">(</span><span class="nx">user</span><span class="p">).</span><span class="nx">Q</span><span class="p">(</span><span class="nx">q</span><span class="p">).</span><span class="nx">Do</span><span class="p">()</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Fatalf</span><span class="p">(</span><span class="s">"Unable to retrieve messages: %v"</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">HTTPStatusCode</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"Request returned status code: %v\n"</span><span class="p">,</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">HTTPStatusCode</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"Number of messages found: %v\n"</span><span class="p">,</span><span class="w"> </span><span class="nb">len</span><span class="p">(</span><span class="nx">response</span><span class="p">.</span><span class="nx">Messages</span><span class="p">))</span><span class="w"></span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">Messages</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<h4>buildSMS</h4>
<p><code>buildSMS</code> takes many of the same inputs as the previous function (there's definitely a nicer way to write this code), but also takes in the list of <code>Messages</code> the previous function returned, as well as whether each <code>Message</code> snippet should be included in the SMS message.</p>
<div class="highlight"><pre><span></span><code><span class="kd">func</span><span class="w"> </span><span class="nx">buildSMS</span><span class="p">(</span><span class="nx">service</span><span class="w"> </span><span class="o">*</span><span class="nx">gmail</span><span class="p">.</span><span class="nx">Service</span><span class="p">,</span><span class="w"> </span><span class="nx">user</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">messages</span><span class="w"> </span><span class="p">[]</span><span class="o">*</span><span class="nx">gmail</span><span class="p">.</span><span class="nx">Message</span><span class="p">,</span><span class="w"> </span><span class="nx">q</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">includeSnippets</span><span class="w"> </span><span class="kt">bool</span><span class="p">)</span><span class="w"> </span><span class="kt">string</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="kd">var</span><span class="w"> </span><span class="nx">sb</span><span class="w"> </span><span class="nx">strings</span><span class="p">.</span><span class="nx">Builder</span><span class="w"></span>
<span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nx">Fprintf</span><span class="p">(</span><span class="o">&</span><span class="nx">sb</span><span class="p">,</span><span class="w"> </span><span class="s">"Hi! You have %v emails matching your search of \"%v\"."</span><span class="p">,</span><span class="w"> </span><span class="nb">len</span><span class="p">(</span><span class="nx">messages</span><span class="p">),</span><span class="w"> </span><span class="nx">q</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nb">len</span><span class="p">(</span><span class="nx">messages</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="s">""</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">includeSnippets</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nx">Fprintf</span><span class="p">(</span><span class="o">&</span><span class="nx">sb</span><span class="p">,</span><span class="w"> </span><span class="s">" Here's what they look like.\n"</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nx">i</span><span class="p">,</span><span class="w"> </span><span class="nx">m</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="k">range</span><span class="w"> </span><span class="nx">messages</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"(%v) Fetching message %v\n"</span><span class="p">,</span><span class="w"> </span><span class="nx">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="nx">m</span><span class="p">.</span><span class="nx">Id</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">m</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">service</span><span class="p">.</span><span class="nx">Users</span><span class="p">.</span><span class="nx">Messages</span><span class="p">.</span><span class="nx">Get</span><span class="p">(</span><span class="nx">user</span><span class="p">,</span><span class="w"> </span><span class="nx">m</span><span class="p">.</span><span class="nx">Id</span><span class="p">).</span><span class="nx">Do</span><span class="p">()</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Fatalf</span><span class="p">(</span><span class="s">"Unable to retrieve message ID %v: %v"</span><span class="p">,</span><span class="w"> </span><span class="nx">m</span><span class="p">.</span><span class="nx">Id</span><span class="p">,</span><span class="w"> </span><span class="nx">err</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="nx">fmt</span><span class="p">.</span><span class="nx">Fprintf</span><span class="p">(</span><span class="o">&</span><span class="nx">sb</span><span class="p">,</span><span class="w"> </span><span class="s">"(%v) - %v\n"</span><span class="p">,</span><span class="w"> </span><span class="nx">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="nx">m</span><span class="p">.</span><span class="nx">Snippet</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="nx">sb</span><span class="p">.</span><span class="nx">String</span><span class="p">()</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>Go's <a href="https://golang.org/pkg/strings/#Builder"><code>strings.Builder</code></a> is an in-memory buffer that strings can be written to directly, minimizing memory copying. Declaring <code>var sb strings.Builder</code> gives us a ready-to-use buffer, and each <code>Fprintf</code> to <code>&sb</code> appends directly to it. Calling <code>sb.String()</code> returns a string of whatever we've written to the <code>Builder</code>.</p>
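<p>To make that concrete, here's a minimal standalone sketch of the same pattern (the <code>summary</code> helper below is hypothetical, not part of the post's script):</p>

```go
package main

import (
	"fmt"
	"strings"
)

// summary builds a message incrementally in a strings.Builder,
// avoiding intermediate string concatenations.
func summary(count int, q string) string {
	var sb strings.Builder // the zero value is ready to use
	// Builder implements io.Writer, so Fprintf can append to it.
	fmt.Fprintf(&sb, "You have %d emails", count)
	fmt.Fprintf(&sb, " matching %q.", q)
	return sb.String() // one final copy out of the buffer
}

func main() {
	fmt.Println(summary(3, "hello"))
}
```

<p>Because <code>strings.Builder</code> implements <code>io.Writer</code>, <code>fmt.Fprintf(&sb, ...)</code> can append to it without creating intermediate strings along the way.</p>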
<h4>sendSMS</h4>
<p>Finally, we need to call the <a href="https://www.twilio.com/docs/sms">Twilio SMS API</a> to send our text. All that's needed is a POST request to the <code>/Messages.json</code> endpoint with our message data form-encoded in the request body.</p>
<div class="highlight"><pre><span></span><code><span class="kd">func</span><span class="w"> </span><span class="nx">sendSMS</span><span class="p">(</span><span class="nx">phoneNumber</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">message</span><span class="w"> </span><span class="kt">string</span><span class="p">,</span><span class="w"> </span><span class="nx">config</span><span class="w"> </span><span class="o">*</span><span class="nx">Config</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">msgData</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">url</span><span class="p">.</span><span class="nx">Values</span><span class="p">{}</span><span class="w"></span>
<span class="w"> </span><span class="nx">msgData</span><span class="p">.</span><span class="nx">Set</span><span class="p">(</span><span class="s">"To"</span><span class="p">,</span><span class="w"> </span><span class="nx">phoneNumber</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">msgData</span><span class="p">.</span><span class="nx">Set</span><span class="p">(</span><span class="s">"From"</span><span class="p">,</span><span class="w"> </span><span class="nx">config</span><span class="p">.</span><span class="nx">Twilio</span><span class="p">.</span><span class="nx">PhoneNumber</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">msgData</span><span class="p">.</span><span class="nx">Set</span><span class="p">(</span><span class="s">"Body"</span><span class="p">,</span><span class="w"> </span><span class="nx">message</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">reader</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="o">*</span><span class="nx">strings</span><span class="p">.</span><span class="nx">NewReader</span><span class="p">(</span><span class="nx">msgData</span><span class="p">.</span><span class="nx">Encode</span><span class="p">())</span><span class="w"></span>
<span class="w"> </span><span class="nx">reqURL</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">config</span><span class="p">.</span><span class="nx">Twilio</span><span class="p">.</span><span class="nx">BaseURL</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"/Accounts/"</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nx">config</span><span class="p">.</span><span class="nx">Twilio</span><span class="p">.</span><span class="nx">AccountSID</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">"/Messages.json"</span><span class="w"></span>
<span class="w"> </span><span class="nx">client</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="o">&</span><span class="nx">http</span><span class="p">.</span><span class="nx">Client</span><span class="p">{}</span><span class="w"></span>
<span class="w"> </span><span class="nx">req</span><span class="p">,</span><span class="w"> </span><span class="nx">_</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">http</span><span class="p">.</span><span class="nx">NewRequest</span><span class="p">(</span><span class="s">"POST"</span><span class="p">,</span><span class="w"> </span><span class="nx">reqURL</span><span class="p">,</span><span class="w"> </span><span class="o">&</span><span class="nx">reader</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">req</span><span class="p">.</span><span class="nx">SetBasicAuth</span><span class="p">(</span><span class="nx">config</span><span class="p">.</span><span class="nx">Twilio</span><span class="p">.</span><span class="nx">AccountSID</span><span class="p">,</span><span class="w"> </span><span class="nx">config</span><span class="p">.</span><span class="nx">Twilio</span><span class="p">.</span><span class="nx">AuthToken</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">req</span><span class="p">.</span><span class="nx">Header</span><span class="p">.</span><span class="nx">Add</span><span class="p">(</span><span class="s">"Accept"</span><span class="p">,</span><span class="w"> </span><span class="s">"application/json"</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">req</span><span class="p">.</span><span class="nx">Header</span><span class="p">.</span><span class="nx">Add</span><span class="p">(</span><span class="s">"Content-Type"</span><span class="p">,</span><span class="w"> </span><span class="s">"application/x-www-form-urlencoded"</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">response</span><span class="p">,</span><span class="w"> </span><span class="nx">_</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">client</span><span class="p">.</span><span class="nx">Do</span><span class="p">(</span><span class="nx">req</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="kd">var</span><span class="w"> </span><span class="nx">data</span><span class="w"> </span><span class="kd">map</span><span class="p">[</span><span class="kt">string</span><span class="p">]</span><span class="kd">interface</span><span class="p">{}</span><span class="w"></span>
<span class="w"> </span><span class="nx">decoder</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">json</span><span class="p">.</span><span class="nx">NewDecoder</span><span class="p">(</span><span class="nx">response</span><span class="p">.</span><span class="nx">Body</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="nx">decoder</span><span class="p">.</span><span class="nx">Decode</span><span class="p">(</span><span class="o">&</span><span class="nx">data</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">StatusCode</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="o">&&</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">StatusCode</span><span class="w"> </span><span class="p"><</span><span class="w"> </span><span class="mi">300</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="nx">err</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">nil</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"Twilio message SID: %v"</span><span class="p">,</span><span class="w"> </span><span class="nx">data</span><span class="p">[</span><span class="s">"sid"</span><span class="p">])</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nx">log</span><span class="p">.</span><span class="nx">Printf</span><span class="p">(</span><span class="s">"Twilio returned status: %v"</span><span class="p">,</span><span class="w"> </span><span class="nx">response</span><span class="p">.</span><span class="nx">Status</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<h4>Putting it all together</h4>
<p>Putting all the necessary pieces together gives us <a href="https://github.com/gjreda/gmail-text-notifications/blob/master/main.go">this script</a>, which takes a search term (or terms) and a phone number as input.</p>
<div class="highlight"><pre><span></span><code>$ go build main.go
$ ./main -q hello -phone +131255555555
<span class="m">2020</span>/12/09 <span class="m">16</span>:28:20 Searching <span class="k">for</span> messages containing: hello
<span class="m">2020</span>/12/09 <span class="m">16</span>:28:21 Number of messages found: <span class="m">100</span>
<span class="m">2020</span>/12/09 <span class="m">16</span>:28:22 Twilio message SID: SM72f7e0080030412284dec3afab19489d
</code></pre></div>
<p><center>
<img src="/images/email-sms-message.jpg" alt="SMS letting me know I have emails matching the search" width="350px">
</center>
I found Go pretty nice to work with and intend to explore it more. It scratches the "statically typed, compiled language" itch I've had recently. I'm particularly intrigued by its concurrency patterns and plan to do some comparisons against Python + pandas for data pipeline tasks.</p>
<p>You can find the code for this project <a href="https://github.com/gjreda/gmail-text-notifications">on my Github</a>.</p>
<p><strong>Additional Reading:</strong></p>
<ul>
<li><a href="https://winterflower.github.io/2017/08/20/the-asterisk-and-the-ampersand/">the asterisk and the ampersand - a golang tale</a></li>
<li><a href="https://commandcenter.blogspot.com/2012/06/less-is-exponentially-more.html">Less is exponentially more</a></li>
</ul>
<h2>Deploying static sites with Github Actions</h2>
<p><em>2020-12-09 · Greg Reda</em></p>
<p>A while back I <a href="http://gregreda.com/2015/03/26/static-site-deployments/">wrote</a> about deploying my site using Github and Travis CI. But recently it seems <a href="https://news.ycombinator.com/item?id=25338983">Travis CI stopped being free for open source projects</a>.</p>
<p>If you're using a static site generator for your site and hosting it on S3, you can use <a href="https://docs.github.com/en/free-pro-team@latest/actions">Github Actions</a> to build and deploy your site on each commit (or PR, or whatever).</p>
<h2>Setup</h2>
<p>If you've already set up Travis CI to deploy your site to S3, switching to Github Actions won't be very difficult.</p>
<p>Actions are defined in YAML and need to live at a path of <code>.github/workflows</code> within your repo. We'll name ours <code>deploy.yml</code>, so its path will be <code>.github/workflows/deploy.yml</code>.</p>
<p>Before defining our workflow steps, we'll want to add any necessary secret passwords, keys, tokens, and such to our repo's <a href="https://docs.github.com/en/free-pro-team@latest/actions/reference/encrypted-secrets">encrypted secrets</a>. This lets the workflow access them while keeping them securely stored and visible only to those with access to the repo.</p>
<p>Since my site is hosted using S3 and Cloudfront, I'll need secrets for my AWS access keys.</p>
<p><img alt="My repo's Github Secrets page" src="/images/github-secrets.png"></p>
<p>Next, we'll create our <code>deploy.yml</code> file. Github kindly supplies <a href="https://github.com/actions/starter-workflows">starter workflows</a> in many languages, but since this site uses Pelican, a static site generator for Python, we'll use the <a href="https://docs.github.com/en/free-pro-team@latest/actions/guides/building-and-testing-python">Python starter workflow</a>.</p>
<div class="highlight"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">deploy</span><span class="w"></span>
<span class="nt">on</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">push</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">branches</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">master</span><span class="p p-Indicator">]</span><span class="w"></span>
<span class="nt">jobs</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">deploy</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">runs-on</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ubuntu-latest</span><span class="w"></span>
<span class="w"> </span><span class="nt">steps</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions/checkout@v2</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Set up Python</span><span class="w"></span>
<span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions/setup-python@v2</span><span class="w"></span>
<span class="w"> </span><span class="nt">with</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">python-version</span><span class="p">:</span><span class="w"> </span><span class="s">'2.7'</span><span class="w"></span>
</code></pre></div>
<p>Yes, I'm still using a very old version of Pelican with Python 2.7. I swear I use Python3 everywhere else.</p>
<p>Since we're deploying to S3, we'll need to add a step for configuring our AWS credentials using the <a href="https://github.com/marketplace/actions/configure-aws-credentials-action-for-github-actions">configure credentials action</a>. The <code>aws-region</code> should match whichever region your bucket is in.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Configure AWS credentials</span><span class="w"></span>
<span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">aws-actions/configure-aws-credentials@v1</span><span class="w"></span>
<span class="w"> </span><span class="nt">with</span><span class="p">:</span><span class="w"> </span>
<span class="w"> </span><span class="nt">aws-access-key-id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.AWS_ACCESS_KEY_ID }}</span><span class="w"></span>
<span class="w"> </span><span class="nt">aws-secret-access-key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.AWS_SECRET_ACCESS_KEY }}</span><span class="w"></span>
<span class="w"> </span><span class="nt">aws-region</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">us-east-1</span><span class="w"></span>
</code></pre></div>
<p>Because my website's repo uses <a href="https://git-scm.com/book/en/v2/Git-Tools-Submodules">git submodules</a>, I need to add another step for checking out and updating these submodules on each build.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Build submodules</span><span class="w"></span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span><span class="w"></span>
<span class="w"> </span><span class="no">sed -i 's/git@github.com:/https:\/\/github.com\//' .gitmodules</span><span class="w"></span>
<span class="w"> </span><span class="no">git submodule update --init --recursive</span><span class="w"></span>
</code></pre></div>
<p>We also need to <code>pip install</code> any dependencies from <code>requirements.txt</code>, like Pelican.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Install dependencies</span><span class="w"></span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span><span class="w"></span>
<span class="w"> </span><span class="no">sudo apt-get install -qq pandoc</span><span class="w"></span>
<span class="w"> </span><span class="no">python -m pip install --upgrade pip</span><span class="w"></span>
<span class="w"> </span><span class="no">pip install -r requirements.txt</span><span class="w"></span>
</code></pre></div>
<p>And finally, we can build our site, deploy it to S3, and invalidate the Cloudfront cache.</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Build website</span><span class="w"></span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span><span class="w"></span>
<span class="w"> </span><span class="no">pelican content</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Deploy to S3</span><span class="w"></span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span><span class="w"></span>
<span class="w"> </span><span class="no">aws s3 sync output/. s3://www.gregreda.com --acl public-read</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Invalidate Cloudfront cache</span><span class="w"></span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span><span class="w"></span>
<span class="w"> </span><span class="no">aws configure set preview.cloudfront true</span><span class="w"></span>
<span class="w"> </span><span class="no">aws cloudfront create-invalidation --distribution-id ${{ secrets.AWS_CLOUDFRONT_DISTRIBUTION_ID }} --paths "/*"</span><span class="w"></span>
</code></pre></div>
<p>Putting it all together gives us <a href="https://github.com/gjreda/gregreda.com/blob/master/.github/workflows/deploy.yml">this YAML file</a>, which builds and deploys this website on every commit to <code>master</code>.</p>
<p>That's it. Continuous deployment for your S3 hosted website.</p>
<h2>newbird: a theme for pelican</h2>
<p><em>2020-11-25 · Greg Reda</em></p>
<p>In 2014, I <a href="https://github.com/gjreda/void">wrote a custom theme</a> for <a href="https://blog.getpelican.com/">Pelican</a>, the static site generator I use for this site.</p>
<p>At the time, there were few themes available and I wanted something that was fairly simple in its design, but also that I understood well enough to tweak as necessary. I opted to use <a href="http://getskeleton.com/">Skeleton</a> for the theme's general structure, but also added a <a href="https://github.com/gjreda/void/blob/master/static/css/void.css">fair amount of custom CSS</a> to get things the way I wanted.</p>
<p>But over time all that custom CSS became more of a pain than it was worth. I wanted something I could just drop in and have it look nice.</p>
<p>Yesterday I came across <a href="https://newcss.net/">new.css</a>, which I feel achieves its goal of sensible design and <a href="https://blog.usejournal.com/the-next-css-frontier-classless-5e66f3f25fdd">classless CSS</a>. It allowed me to quickly create a new Pelican theme with limited CSS-fiddling.</p>
<p>The new theme, which I've named <a href="https://github.com/gjreda/newbird-pelican-theme">newbird</a>, includes support for Google Analytics, <a href="https://developer.twitter.com/en/docs/twitter-for-websites/cards/overview/abouts-cards">Twitter Cards</a>, and <a href="https://developers.facebook.com/docs/sharing/webmasters/">Facebook Open Graph</a>. It also allows for articles to be written in Jupyter Notebooks thanks to Pelican's liquid tags plugin. Notably, I opted not to include any social sharing buttons in order to decrease clutter and page loads.</p>
<p>If you're interested in using newbird for your Pelican-based site, you can find it <a href="https://github.com/gjreda/newbird-pelican-theme">here</a>.</p>
<h2>Scraping pages behind login forms</h2>
<p><em>2020-11-17 · Greg Reda</em></p>
<p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python</a>, showing how to use multithreading to speed things up.</li>
<li><a href="http://www.gregreda.com/2020/11/17/scraping-pages-behind-login-forms/">Scraping Pages Behind Login Forms</a>, which shows how to log into sites using Python.</li>
</ol>
<hr>
<p>The other day a friend asked whether there was an easier way for them to get 1000+ Goodreads reviews without manually doing it one-by-one. It sounded like a fun little scraping project to me.</p>
<p>One small complexity was that the user's book reviews were not public, which meant you needed to log into Goodreads to access them. Thankfully, with a little understanding of how HTML forms work, Python's <a href="https://requests.readthedocs.io/en/master/">requests</a> library makes this doable with a few lines of code.</p>
<p>This post walks through how to tackle the problem. If you'd like to jump straight to the code, you can find it <a href="https://github.com/gjreda/goodreads-reviews">on my Github</a>.</p>
<p>While we'll use Goodreads here, the same concepts apply to most websites.</p>
<p>First, you'll need to dig into how the site's login forms work. I find the best way to do this is by finding the page that is solely for login. Here's an example from Goodreads:</p>
<p><img alt="example login page" src="/images/goodreads-login-page.png"></p>
<p>From there, you'll need to find the necessary details of the login form. While this will include some sort of username/email and password, it will likely also include a token and possibly other details.</p>
<p>The best way to find these details is to open your browser's developer tools on one of the input fields (like username/email) -- for example, by right-clicking the field and choosing Inspect. This will bring you to the code responsible for the form and let you find the required details.</p>
<p><img alt="example login form" src="/images/goodreads-login-form.png"></p>
<p>Using the screenshot above as an example, we can see the form requires some user input fields as well as some hidden fields:</p>
<ol>
<li>A hidden <code>utf8</code> field with a checkmark value. The checkmark is converted to its HTML character reference, <code>&#x2713;</code>, on submission.</li>
<li>A hidden <code>authenticity_token</code> with a provided value.</li>
<li>A <code>user[email]</code> which is input via the form.</li>
<li>A <code>user[password]</code> which is input via the form.</li>
<li>A hidden <code>n</code> field with a provided value.</li>
</ol>
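<p>Hidden inputs like these can be pulled out of the page with BeautifulSoup. Here's a minimal sketch against an inline HTML snippet modeled on the form above -- the token and <code>n</code> values are made up for illustration:</p>

```python
from bs4 import BeautifulSoup

# Illustrative HTML modeled on the login form; token and n values are fake
html = """
<form action="https://www.goodreads.com/user/sign_in" method="post">
  <input type="hidden" name="utf8" value="&#x2713;">
  <input type="hidden" name="authenticity_token" value="abc123==">
  <input type="email" name="user[email]">
  <input type="password" name="user[password]">
  <input type="hidden" name="n" value="674578">
</form>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every hidden input into a name -> value dict
hidden = {tag["name"]: tag["value"]
          for tag in soup.find_all("input", attrs={"type": "hidden"})}

print(hidden)  # the utf8 entity is decoded to a literal checkmark by the parser
```

<p>The user-supplied fields (email and password) have no <code>value</code> attribute to collect; we'll fill those in ourselves when building the POST payload.</p>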
<p>When you enter your email and password into the form and press login, the first line in the highlighted red box tells us that the form data is sent via an HTTP POST request to <code>https://www.goodreads.com/user/sign_in</code> (seen in the <code>method</code> and <code>action</code> fields, respectively). The user and password fields are then checked against the site's database to validate the information. Essentially, it's saying "Here are the credentials I was given. Is this a valid user?" If the credentials are valid, you are redirected to some page within the app (like the user's home page).</p>
<p>Once login is successful, a <a href="https://en.wikipedia.org/wiki/HTTP_cookie">cookie</a> is then stored in your browser's memory. Every time you access one of the site's pages, the site checks to make sure the cookie is valid and that you are allowed to access the page you are trying to reach.</p>
<p>To scrape data that is behind login forms, we'll need to replicate this behavior using the requests library. In particular, we'll need to use its <a href="https://requests.readthedocs.io/en/master/user/advanced/#session-objects">Session object</a>, which will capture and store any cookie information for us.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="n">LOGIN_URL</span> <span class="o">=</span> <span class="s2">"https://www.goodreads.com/user/sign_in"</span>
<span class="k">def</span> <span class="nf">get_authenticity_token</span><span class="p">(</span><span class="n">html</span><span class="p">):</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s2">"html.parser"</span><span class="p">)</span>
<span class="n">token</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'input'</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s1">'name'</span><span class="p">:</span> <span class="s1">'authenticity_token'</span><span class="p">})</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">token</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'could not find `authenticity_token` on login form'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">token</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'value'</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">get_login_n</span><span class="p">(</span><span class="n">html</span><span class="p">):</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s2">"html.parser"</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'input'</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s1">'name'</span><span class="p">:</span> <span class="s1">'n'</span><span class="p">})</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">n</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'could not find `n` on login form'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">n</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'value'</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
<span class="n">email</span> <span class="o">=</span> <span class="s2">"some@email.com"</span> <span class="c1"># login email</span>
<span class="n">password</span> <span class="o">=</span> <span class="s2">"somethingsecret"</span> <span class="c1"># login password</span>
<span class="n">payload</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'user[email]'</span><span class="p">:</span> <span class="n">email</span><span class="p">,</span>
<span class="s1">'user[password]'</span><span class="p">:</span> <span class="n">password</span><span class="p">,</span>
<span class="s1">'utf8'</span><span class="p">:</span> <span class="s1">'&#x2713;'</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">session</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span>
<span class="n">session</span><span class="o">.</span><span class="n">headers</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'User-Agent'</span><span class="p">:</span> <span class="p">(</span><span class="s1">'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '</span>
<span class="s1">'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'</span><span class="p">)}</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">LOGIN_URL</span><span class="p">)</span>
<span class="n">token</span> <span class="o">=</span> <span class="n">get_authenticity_token</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">get_login_n</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">payload</span><span class="o">.</span><span class="n">update</span><span class="p">({</span>
<span class="s1">'authenticity_token'</span><span class="p">:</span> <span class="n">token</span><span class="p">,</span>
<span class="s1">'n'</span><span class="p">:</span> <span class="n">n</span>
<span class="p">})</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"attempting to log in as </span><span class="si">{</span><span class="n">email</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">session</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">LOGIN_URL</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span> <span class="c1"># perform login</span>
</code></pre></div>
<p>If the POST request in the last line is successful, our session object should now contain a cookie that allows us to programmatically access the same pages our user normally has access to. We'll simply need to request these pages using <code>session.get</code> and can then proceed as I've <a href="/2013/03/03/web-scraping-101-with-python/">previously detailed</a>.</p>
<p>You can find the complete code for this post <a href="https://github.com/gjreda/goodreads-reviews">on my Github</a>.</p>Feature Engineering with Time Gaps2020-02-16T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2020-02-16:/2020/02/16/feature-engineering-with-time-gaps/<p>I tend to forget how to write certain blocks of code when I haven't written them in a while. Here's a common machine learning preprocessing task that falls into that category.</p>
<p>Imagine you have some event logs that capture an entity ID (user, store, ad, etc), timestamp, an event name …</p><p>I tend to forget how to write certain blocks of code when I haven't written them in a while. Here's a common machine learning preprocessing task that falls into that category.</p>
<p>Imagine you have some event logs that capture an entity ID (user, store, ad, etc), timestamp, an event name, and maybe some other details. The data looks something like this:</p>
<div class="highlight"><pre><span></span><code>userid timestamp event
789 2019-07-18 01:06:00 login
123 2019-07-19 08:30:00 login
789 2019-07-20 02:39:00 login
789 2019-07-20 08:15:00 login
456 2019-07-20 10:05:00 login
123 2019-07-20 14:40:00 login
123 2019-07-20 18:05:00 login
456 2019-07-21 21:11:00 login
789 2019-07-22 10:05:00 login
123 2019-07-23 09:18:00 login
789 2019-07-23 17:35:00 login
123 2019-07-25 16:49:00 login
789 2019-07-26 12:13:00 login
123 2019-07-27 19:56:00 login
</code></pre></div>
<p>For the sake of simplicity, let's say we want to build a model predicting whether or not a user will log in tomorrow. Our target is <code>y = bool(logins)</code>.</p>
<p>Three features we think will be informative are the user's previous logins, whether they logged in yesterday, and the number of days since their last login. We'll call these features <code>lifetime_logins</code>, <code>logins_yesterday</code>, and <code>days_since_last_login</code>.</p>
<p>Using <a href="https://pandas.pydata.org/">pandas</a>, we aggregate by user and date to get each user's daily count of logins.</p>
<div class="highlight"><pre><span></span><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_clipboard</span><span class="p">(</span><span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s1">'timestamp'</span><span class="p">])</span>
<span class="n">user_logins</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'timestamp'</span><span class="p">)</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'userid'</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">'D'</span><span class="p">)])</span>
<span class="o">.</span><span class="n">size</span><span class="p">()</span>
<span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="s1">'logins'</span><span class="p">))</span>
<span class="c1"># userid timestamp</span>
<span class="c1"># 123 2019-07-19 1</span>
<span class="c1"># 2019-07-20 2</span>
<span class="c1"># 2019-07-23 1</span>
<span class="c1"># 2019-07-25 1</span>
<span class="c1"># 2019-07-27 1</span>
<span class="c1"># 456 2019-07-20 1</span>
<span class="c1"># 2019-07-21 1</span>
<span class="c1"># 789 2019-07-18 1</span>
<span class="c1"># 2019-07-20 2</span>
<span class="c1"># 2019-07-22 1</span>
<span class="c1"># 2019-07-23 1</span>
<span class="c1"># 2019-07-26 1</span>
<span class="c1"># Name: logins, dtype: int64</span>
</code></pre></div>
<p>But we're missing critical information. This is when the brain fart happens.</p>
<p>Recall the structure of our logs. Notice they omit records for when the user had no activity. In order to create our features, we need to fill in time gaps for each user and then roll that information forward.</p>
<p>The goal of this post is to help me remember how to do this in the future.</p>
<h3>Filling Time Gaps</h3>
<p>First, we need to put each user on a continuous time scale.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># create a continuous DatetimeIndex at a daily level</span>
<span class="n">dates</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">timestamp</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span>
<span class="n">df</span><span class="o">.</span><span class="n">timestamp</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">date</span><span class="p">(),</span>
<span class="n">freq</span><span class="o">=</span><span class="s1">'1D'</span><span class="p">)</span>
<span class="c1"># get unique set of user ids</span>
<span class="n">users</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'userid'</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>
<span class="c1"># create a MultiIndex that is the product (cross-join) of</span>
<span class="c1"># users and DatetimeIndexes</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">MultiIndex</span><span class="o">.</span><span class="n">from_product</span><span class="p">([</span><span class="n">users</span><span class="p">,</span> <span class="n">dates</span><span class="p">],</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s1">'userid'</span><span class="p">,</span> <span class="s1">'timestamp'</span><span class="p">])</span>
<span class="c1"># and reindex our `user_logins` counts by it</span>
<span class="n">user_logins</span> <span class="o">=</span> <span class="n">user_logins</span><span class="o">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span>
<span class="c1"># userid timestamp</span>
<span class="c1"># 789 2019-07-18 1.0</span>
<span class="c1"># 2019-07-19 NaN</span>
<span class="c1"># 2019-07-20 2.0</span>
<span class="c1"># 2019-07-21 NaN</span>
<span class="c1"># 2019-07-22 1.0</span>
<span class="c1"># 2019-07-23 1.0</span>
<span class="c1"># 2019-07-24 NaN</span>
<span class="c1"># 2019-07-25 NaN</span>
<span class="c1"># 2019-07-26 1.0</span>
<span class="c1"># 2019-07-27 NaN</span>
</code></pre></div>
<p>This gives us a continuous daily time series for each user. You can see what this looks like for user 789 above.</p>
<p>An important thing to note is that <code>idx</code> will need to be on the same time scale as the current <code>DatetimeIndex</code> in <code>user_logins</code>. Because we aggregated at a daily level using <code>pd.Grouper(freq='D')</code>, the <code>MultiIndex</code> we are using to <code>reindex</code> should also be at a daily level.</p>
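<p>To see why the frequencies must match, consider a toy example (made-up data): if a count is keyed by a timestamp that isn't midnight-aligned, reindexing with a daily index finds no matching labels and silently drops it.</p>

```python
import pandas as pd

# A login count keyed at 08:00 -- not aligned to the daily (midnight) grid
s = pd.Series([1], index=pd.to_datetime(["2019-07-18 08:00"]), name="logins")

# A daily, midnight-aligned index
daily = pd.date_range("2019-07-18", "2019-07-19", freq="1D")

# No labels match, so every reindexed value is NaN and the count is lost
print(s.reindex(daily))
```

<p>Because we aggregated with <code>pd.Grouper(freq='D')</code> before reindexing, every timestamp is already floored to midnight and this problem doesn't arise.</p>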
<h3>Creating Features</h3>
<p>Now we're free to create our features. We can zero-fill days each user did not log in. We also need to convert our <code>user_logins</code> to a DataFrame, which allows us to create the new feature columns (e.g. <code>logins_yesterday</code>).</p>
<div class="highlight"><pre><span></span><code><span class="n">user_logins</span> <span class="o">=</span> <span class="n">user_logins</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">to_frame</span><span class="p">()</span>
<span class="n">user_logins</span><span class="p">[</span><span class="s1">'logins_yesterday'</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_logins</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)[</span><span class="s1">'logins'</span><span class="p">]</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># logins logins_yesterday</span>
<span class="c1"># userid timestamp</span>
<span class="c1"># 789 2019-07-18 1.0 NaN</span>
<span class="c1"># 2019-07-19 0.0 1.0</span>
<span class="c1"># 2019-07-20 2.0 0.0</span>
<span class="c1"># 2019-07-21 0.0 2.0</span>
<span class="c1"># 2019-07-22 1.0 0.0</span>
<span class="c1"># 123 2019-07-18 0.0 NaN</span>
<span class="c1"># 2019-07-19 1.0 0.0</span>
<span class="c1"># 2019-07-20 2.0 1.0</span>
<span class="c1"># 2019-07-21 0.0 2.0</span>
<span class="c1"># 2019-07-22 0.0 0.0</span>
<span class="c1"># 456 2019-07-18 0.0 NaN</span>
<span class="c1"># 2019-07-19 0.0 0.0</span>
<span class="c1"># 2019-07-20 1.0 0.0</span>
<span class="c1"># 2019-07-21 1.0 1.0</span>
<span class="c1"># 2019-07-22 0.0 1.0</span>
</code></pre></div>
<p>The <code>lifetime_logins</code> and <code>days_since_last_login</code> features need to be context dependent to avoid data leakage when training our model. Our features need to represent what would have been the correct values <em>at the time</em>. We can do this by rolling information forward with <code>shift</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">user_logins</span><span class="p">[</span><span class="s1">'lifetime_logins'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">user_logins</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)</span>
<span class="o">.</span><span class="n">logins</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
<span class="n">user_logins</span><span class="p">[</span><span class="s1">'days_since_last_login'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">user_logins</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)</span>
<span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'userid'</span><span class="p">,</span> <span class="s1">'logins'</span><span class="p">])</span>
<span class="o">.</span><span class="n">cumcount</span><span class="p">()</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="s1">'days_since_last_login'</span><span class="p">))</span>
<span class="c1"># logins logins_yesterday lifetime_logins days_since_last_login</span>
<span class="c1"># userid timestamp</span>
<span class="c1"># 789 2019-07-18 1.0 NaN NaN NaN</span>
<span class="c1"># 2019-07-19 0.0 1.0 1.0 0.0</span>
<span class="c1"># 2019-07-20 2.0 0.0 1.0 1.0</span>
<span class="c1"># 2019-07-21 0.0 2.0 3.0 0.0</span>
<span class="c1"># 2019-07-22 1.0 0.0 3.0 1.0</span>
<span class="c1"># 123 2019-07-18 0.0 NaN NaN NaN</span>
<span class="c1"># 2019-07-19 1.0 0.0 0.0 0.0</span>
<span class="c1"># 2019-07-20 2.0 1.0 1.0 0.0</span>
<span class="c1"># 2019-07-21 0.0 2.0 3.0 0.0</span>
<span class="c1"># 2019-07-22 0.0 0.0 3.0 1.0</span>
<span class="c1"># 456 2019-07-18 0.0 NaN NaN NaN</span>
<span class="c1"># 2019-07-19 0.0 0.0 0.0 0.0</span>
<span class="c1"># 2019-07-20 1.0 0.0 0.0 1.0</span>
<span class="c1"># 2019-07-21 1.0 1.0 1.0 0.0</span>
<span class="c1"># 2019-07-22 0.0 1.0 2.0 0.0</span>
</code></pre></div>
<p>This can also be extended to create rolling features: something like <code>logins_last_n_days</code> where <code>n = [7, 14, 21]</code>.</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">21</span><span class="p">]:</span>
<span class="n">col</span> <span class="o">=</span> <span class="s1">'logins_last_</span><span class="si">{}</span><span class="s1">_days'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="n">user_logins</span><span class="p">[</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">user_logins</span>
<span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'userid'</span><span class="p">)</span>
<span class="o">.</span><span class="n">logins</span>
<span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">d</span><span class="p">:</span> <span class="n">d</span><span class="o">.</span><span class="n">rolling</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)))</span>
</code></pre></div>
<p>Hopefully you've found this post helpful. I know my future self will.</p>Lenny Dykstra, His Strike Zone, & Bayesian Stats2018-07-08T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2018-07-08:/2018/07/08/dykstra-strike-zone-bayesian-stats/<p>In 2015, former Major Leaguer Lenny Dykstra went on Colin Cowherd’s radio show and claimed that he used to hire private investigators to find dirt on umpires. The intention of doing so was to turn that dirt into a more favorable strike zone for himself. You can find the …</p><p>In 2015, former Major Leaguer Lenny Dykstra went on Colin Cowherd’s radio show and claimed that he used to hire private investigators to find dirt on umpires. The intention of doing so was to turn that dirt into a more favorable strike zone for himself. You can find the clip <a href="https://www.youtube.com/watch?v=fvhb_BjTDmk">here</a>.</p>
<blockquote>
<p>"It wasn't a coincidence I led the league in walks the next few years." - Lenny Dykstra</p>
</blockquote>
<p>Over at <a href="https://www.fangraphs.com/">Fangraphs</a>, Sheryl Ring wrote an <a href="https://www.fangraphs.com/blogs/did-lenny-dykstra-extort-umpires/">interesting article</a> exploring whether Dykstra's claims would amount to extortion in a legal sense. In order to do so, she needed to start by assuming that Dykstra's claims were truthful, though both Sheryl and Fangraphs commenters wondered whether there is any objective evidence that Dykstra benefitted.</p>
<p>That's the question I'd like to explore in this post. Did Lenny Dykstra benefit from a more favorable strike zone? What do his numbers say?</p>
<p>Since <a href="https://en.wikipedia.org/wiki/PITCHf/x">PITCHf/x</a> wasn't around when Dykstra played, we can't look directly at balls and strikes called against him. However, we can use his career numbers and some <a href="https://en.wikipedia.org/wiki/Bayesian_statistics">Bayesian statistics</a> to generate expected walk totals.</p>
<h3>Analysis</h3>
<p>On the show, Dykstra's statement about "leading the league in walks the next few years" gives us a clue as to when this might have started - 1993 - the only year he led the league in walks.</p>
<p>Up until 1993, Dykstra had walked 384 times in 3,667 plate appearances - good for a walk rate of 10.5%. In 1993 and 1994 though, his walk rates climbed to 16.7% and 17.6%, respectively. How likely were those numbers based on his career up until those points?</p>
<p>It's safe for us to assume that his "true" walk rate at that point was somewhere around 10.5% - this was his career BB% and we had a lot of data in support of it (3,667 PAs).</p>
<p>We can model this assumption about his "true" ability to draw a walk as a <a href="https://en.wikipedia.org/wiki/Beta_distribution">beta distribution</a> using his pre-1993 numbers as the parameters of our model. Note that a beta distribution is parameterized by α, which represents the number of successes of an event, and β, which represents the number of failures for the same event. </p>
<p>\begin{equation}
Beta(α, β)
\end{equation}
\begin{equation}
Beta(BB, PA - BB)
\end{equation}
\begin{equation}
Beta(384, 3283)
\end{equation}</p>
<p>In this case, α is Dykstra's total walks prior to 1993 and β is the number of times he did not draw a walk during that same period.</p>
<p><img alt="dykstra-beta-prior" src="/images/dykstra-beta-prior.png"></p>
<p>Using this beta, we can simulate the range of values we'd expect his 1993 walk total to fall within, based on his number of plate appearances from that season. This gives us an idea of both his expected 1993 BB% and his total walks.</p>
<p><img alt="dykstra-walk-sims-93" src="/images/dykstra-walk-sims-93.png"></p>
<p>On the left, we've simulated the range we'd have expected his 1993 BB% to fall within, based on his career numbers up until that point. Using this range, we can then obtain an expected distribution for his total walks, based on his total plate appearances in 1993 (shown right).</p>
<p>You'll note that the red lines indicate Dykstra's 1993 numbers, which fall well outside of our expected ranges, indicating that in none of our simulations did Dykstra match his 1993 numbers.</p>
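<p>A rough NumPy sketch of this simulation (not the code from the notebook linked at the end of the post): draw plausible walk rates from the Beta(384, 3283) prior, then draw a 1993 walk total for each from a binomial over his 773 plate appearances that season.</p>

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
n_sims = 100_000

# Draw plausible "true" walk rates from the Beta(BB, PA - BB) prior
# built from his pre-1993 numbers: 384 walks in 3,667 PAs
bb_rate = rng.beta(384, 3283, size=n_sims)

# For each simulated walk rate, draw a 1993 walk total over his 773 PAs
sim_walks = rng.binomial(773, bb_rate)

print(round(bb_rate.mean(), 3))   # close to his 10.5% career rate
print((sim_walks >= 129).mean())  # share of sims reaching his actual 129 walks
```

<p>With 100,000 draws, essentially none reach his actual 129 walks -- consistent with the charts above.</p>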
<p>Taking this approach a step further, we can update our beta distribution to include the 1993 season, allowing us to understand what we'd have expected in his 1994 season.</p>
<p>\begin{equation}
Beta(384 + 129, 3283 + (773 - 129))
\end{equation}
\begin{equation}
Beta(513, 3927)
\end{equation}</p>
<p><img alt="dykstra-walk-sims-94" src="/images/dykstra-walk-sims-94.png"></p>
<p>You'll note that the chart on the left includes our previous beta distribution in light blue, which is based on his career up until 1993. When incorporating his surprising 1993 walk numbers, our expected BB% shifts to the right, resulting in the purple distribution shown -- 1993 has given us new evidence to suggest Dykstra has a better eye.</p>
<p>Still, updating our model to include 1993 does not result in numbers we would have expected for 1994. In only 0.02% of our simulations did Dykstra achieve the 68 walks he produced in 1994.</p>
<p>Said differently, he probably did have some dirt on umpires, resulting in a more favorable strike zone.</p>
<p>It's reasonable to ask whether or not the league-wide walk rate changed around the 1993 and 1994 seasons. <a href="https://www.baseball-reference.com/pi/shareit/jlkOc">It didn't</a>, ultimately staying relatively constant throughout Dykstra's career.</p>
<p><img alt="mlb-bb-rate" src="/images/mlb-bb-rate.png"></p>
<p>While it’s possible that his eye improved mightily between the 1992 and 1993 seasons, it's highly unlikely. As the analysis above shows, his walk numbers fall well outside of what we would have expected.</p>
<p>You can find the code and data for this analysis <a href="https://github.com/gjreda/notebooks/tree/master/dykstra-walks">here</a>.</p>Hiring Data Scientists2018-02-04T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2018-02-04:/2018/02/04/hiring-data-scientists/<p>Chicago's a big city that feels small -- everyone seems only a degree or two
away from one another. This feels especially true within Chicago's tech and data
science communities.</p>
<p>As a result, I occasionally get asked about hiring data scientists.
Specifically, how do you vet, hire, and evaluate a data …</p><p>Chicago's a big city that feels small -- everyone seems only a degree or two
away from one another. This feels especially true within Chicago's tech and data
science communities.</p>
<p>As a result, I occasionally get asked about hiring data scientists.
Specifically, how do you vet, hire, and evaluate a data scientist if you don't
have existing experience (either personally or within your company)?</p>
<p>While I feel this is a hard problem -- and one I never have a great answer
to -- I figured sharing how I think about hiring might prove helpful to others.</p>
<h3>How I think about it</h3>
<p>My general belief is that so long as the candidate clears some
programming bar and some quantitative bar (neither of which should be too high),
the most important things for success are <em><strong>curiosity and skepticism</strong></em>.</p>
<p>The programming and quantitative bars should be based <em><strong>solely on examples of
work they'd do on the job</strong></em>.</p>
<p>I do this via a small take-home assignment that should take candidates a couple
of hours. The assignment asks variations of questions they are likely to encounter
in the role. A small dataset is provided, which the questions reference.</p>
<p>In my opinion, the programming and quantitative bars mostly come down to:</p>
<ul>
<li>
<p>Can they write code to do what they need to? Getting data out of a database, analysis at scale, automating some regular analysis, etc.</p>
</li>
<li>
<p>Do they think from a quantitative perspective? Do they think probabilistically?</p>
</li>
<li>
<p>Can they build a basic model and evaluate it? Do they understand statistics enough such that their analysis will not be <em>harmful</em>?</p>
<ul>
<li>My belief is that <em>no data</em> is preferable to <em>bad data</em>. With no data, you're forced to seek alternative forms of information (e.g. talking to users). With bad data, you risk drawing improper conclusions that lead you astray -- false confidence.</li>
</ul>
</li>
</ul>
<p>Assuming the candidate clears these bars, I believe curiosity and skepticism are the two most
important attributes for success.</p>
<p>If they are curious, they will continue to fill gaps in their knowledge, learn
new approaches to problems, and seek to continuously learn the business/product
side -- and how their work can add value to it. A data scientist that has a
tendency to go down rabbit holes can be a good thing if properly directed.</p>
<p>If they are skeptical, they'll refine everyone's
thought process by questioning things in a healthy way. They'll innately seek to
prove things believed to be true and they'll seek to answer questions
that arise -- be it via their own curiosity or others'. This skepticism also
acts as a check -- they'll seek alternative ways to prove and
test their own work, cautiously fearful of creating bad data that can lead to
improper action.</p>
<h3>My interview process</h3>
<ol>
<li>
<p>Phone screen with recruiter (30 mins)</p>
</li>
<li>
<p>Phone screen with me (30 mins)</p>
</li>
<li>
<p>Take-home assignment (~2 hours)</p>
</li>
<li>
<p>In-office interview (3 hours)</p>
</li>
</ol>
<p>The phone screens are really about feeling the person out, learning about what they're
looking for next, and digging into specifics about their past experience/work.</p>
<p>The take-home assignment acts as the "programming and statistical bar" with
respect to the job -- brief examples of questions or problems they might work on
in the role. We ask candidates to provide any code they wrote or charts
they created to answer all questions, even if it's exploratory in nature. We
also ask that they be prepared to discuss their work during interviews.</p>
<p>My in-office interviews are three one-hour interviews. Usually, two one-hour
interviews with myself and my team, and a joint interview with a
product manager and platform engineer. Time is <em>always</em> left for the candidate
to ask the interviewers questions.</p>
<p>I should note that depending on how tenured the candidate is and their existing
body of work, this process might change slightly. My approach tends to change when
candidates can point me to prior work (GitHub, side projects, blog posts).</p>
<p>If you’re looking for more details on an ideal hiring process for data scientists, <a href="http://treycausey.com/hiring_data_scientists.html">Trey Causey’s advice is excellent</a> and has influenced much of my thinking. Similarly, <a href="http://qethanm.cc/">Q McCallum</a>’s series on <a href="http://qethanm.cc/2018/01/23/common-mistakes-in-data-science-hiring-part-1/">common data science hiring mistakes</a> offers practical advice on determining whether you actually need a data scientist, and on how to hire and retain one. Finally, <a href="https://mpopov.com/">Mikhail Popov</a>'s piece on <a href="https://blog.wikimedia.org/2017/02/02/hiring-data-scientist/">Wikipedia's approach to data science hiring</a> is worth your time.</p>My Experience as a Freelance Data Scientist2017-01-07T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2017-01-07:/2017/01/07/freelance-data-science-experience/<p>Every so often, data scientists who are thinking about going off on their own will email me with questions about my year of freelancing (2015). In my most recent response, I was a little more detailed than usual, so I figured it'd make sense as a blog post too.</p>
<p>If …</p><p>Every so often, data scientists who are thinking about going off on their own will email me with questions about my year of freelancing (2015). In my most recent response, I was a little more detailed than usual, so I figured it'd make sense as a blog post too.</p>
<p>If my response comes across as negative, that's certainly not the intention -- being straight-forward about my experience is.</p>
<p>I learned a lot, it just wasn't for me. Working by yourself on short(ish)-term things can get old.</p>
<h3>How was your year of freelancing?</h3>
<p>Generally, it was good and I learned a lot.</p>
<p>My reason for setting out on my own was really about scratching an itch I've always had (and I suspect many of us have) - can I strike it out on my own?</p>
<p>The freedom was really nice and if you're able to find the work, you can likely work less than you would full-time while making more money. That said, it's certainly not for everyone.</p>
<h3>Why'd you stop?</h3>
<p>I didn't find it very rewarding in a non-monetary sense.</p>
<p>Freelancing/consulting doesn't really give you the luxury of thinking long-term about something like a product company does. Typically, a client hires you to do something, you do it, and then you're gone.</p>
<p>Thinking long-term and deeply through all the ways data / data science can be impactful upon a business or product is something I really enjoy -- "ohh, we can build a recommendations engine with this ... the search results we're displaying to the user here aren't great -- we can use this data to improve them, etc." I definitely enjoy more of a slant towards data scientist + product manager than I do data scientist + software engineer.</p>
<p>As an individual freelancer, landing this sort of "feature" work is very hard because:</p>
<ol>
<li>
<p>You're one person and typically these are not small projects. You only have so much capacity (24 hours/day), so it'd take more time for you to do it than it would a team.</p>
</li>
<li>
<p>Companies often want these things to be a "core competency" in that they do not want someone to build Big Important Thing and disappear. They are risk-averse.</p>
</li>
<li>
<p>You didn't strike out on your own to build Big Important Thing and then really just maintain that thing for one client in perpetuity (which would likely happen if the company allowed you to build it) -- you started freelancing because you (presumably) wanted some variety.</p>
</li>
</ol>
<p>Companies often have a Thing In Mind they want you to do -- or, they want to "buy" your time for some period (e.g. 80 hours over the next three months at $/hour -- a retainer).</p>
<p>When they have a Thing In Mind, it is much more likely to be dirty work that they do not feel is the best use of their existing team's time than it is to be something they need you, the consultant, for.</p>
<p>When on retainer, I found the experience to be similar, except it can be a bunch of ad-hoc tasks that come up ("can you pull this data for me") that you didn't know would be the case when you signed the contract.</p>
<p>This is all a long way of saying, in my experience, <em>a non-trivial portion of you has to be ok with being a mercenary</em> -- do the thing you're being paid to do and not worry about the rest.</p>
<p>I struggled with that internally and thus did not find the work very stimulating -- I like buying into something, giving it my all, and thinking about the various directions it can be taken.</p>
<h3>Tips or lessons learned?</h3>
<p>So many, but here are a few:</p>
<h4><a href="https://en.wikipedia.org/wiki/KISS_principle">Keep It Simple, Stupid</a></h4>
<p>This isn't specific to freelancing per se, but it was something freelancing emphasized.</p>
<p>I think data scientists (generally) have a bad habit of latching onto specific words a stakeholder says, while ignoring the other words in the request.<sup>1</sup> For example:</p>
<blockquote>
<p>"What's the optimal number of leads that a rep should get? We want to get directionally better."</p>
</blockquote>
<p>As data scientists, we hear "optimal number" and we start thinking about doing complex math and building models. We end up ignoring the most important part: "We want to get directionally better" -- our stakeholder is telling us "we don't know much about this right now -- help!"</p>
<p>We need to start simple -- maybe some basic exploratory work + charts -- and surface that back to our stakeholder, giving them the opportunity to say "cool, this is all I needed" or "this is good, but keep going." We need to allow our stakeholder to choose incremental progress and we should not assume they need the more complex (and time-intensive) solution.</p>
<h4>Try to get systems access before the project begins</h4>
<p>This probably isn't a high priority for the systems team at your client. Thus, if the process of getting you access to things (databases, vpn, etc.) starts the same day the project is set to start, the first day or two will wind up being a waste of time.</p>
<h4>Productized consulting</h4>
<p>Nail down exactly what you do or create. Have a fixed price for doing it. Don't deviate from that. Turn your "consulting" into a product.</p>
<p>For example, you'll build a churn model for $XX. My brother's company is a <a href="https://ethercycle.com/pricing/">good example of this</a>.</p>
<p>Try not to sell hours. Which leads me to ...</p>
<h4>Don't bill hourly</h4>
<p>Tracking hours sucks and also limits your margin. Try to sell daily or weekly rates (or productized consulting).</p>
<p>Better yet, if you have a well defined scope (ideal, but sometimes hard) and you know the amount of time the project will take you, then set a price on the project. The risk here is the project taking longer than you anticipated and now you're really just doing free work.</p>
<p>I was fearful of underestimating the amount of time something would take me, so I billed hourly. It wasn't fun though. Additionally, all of my clients ended up being retainers.</p>
<h3>My biggest fears involve health insurance. Do you have any good resources?</h3>
<p>Not really. I just used healthcare.gov and went with a BCBS PPO because I'm pretty risk-averse.</p>
<h3>In the data science pipeline, where did your services fall (e.g. Databases > Data Cleaning > Business Intelligence > Advanced Analytics)? Did you do everything?</h3>
<p>This was something I should have been better about -- I never really established what my services were. My thesis in going freelance was more about feeling there was a gap in the data consulting market.</p>
<p>My belief was that there was (in 2015 ... and probably still is) a population of companies trying to figure out how to utilize their data, who are not interested in bringing on a consulting firm ($$$), and don't necessarily know if they need a data scientist full-time yet. I felt uniquely positioned to fill that gap due to being GrubHub's first data hire and also having prior consulting experience (PwC, Datascope).</p>
<p>For the reasons mentioned in the second question, I'd classify most of the work I ended up doing as Business Intelligence, along with some product/marketing analytics work -- instrumentation, how to think about using the data, etc -- but never in a "building data-driven features/products" sense. No machine learning or similar.</p>
<hr class="small" id="footnotes">
<p>
1. I have a longer post about this in the works.</p>[Talk] Data-Informed vs Data-Driven2016-11-20T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2016-11-20:/2016/11/20/data-informed-pydata-video/<p>A while back (July 2015), I was fortunate to speak at PyData in Seattle about the pitfalls of <em>overreliance</em> on data. You can find the slides <a href="https://github.com/gjreda/pydata2015sea/blob/4a846a7d069601cc5f886e53863a17d7fd68f2a8/data-informed-vs-data-driven-with-notes.pdf">here</a>.</p>
<p>My talk centered around my belief that the term "data-driven," when taken at its face value, should not be something we strive for …</p><p>A while back (July 2015), I was fortunate to speak at PyData in Seattle about the pitfalls of <em>overreliance</em> on data. You can find the slides <a href="https://github.com/gjreda/pydata2015sea/blob/4a846a7d069601cc5f886e53863a17d7fd68f2a8/data-informed-vs-data-driven-with-notes.pdf">here</a>.</p>
<p>My talk centered around my belief that the term "data-driven," when taken at its face value, should not be something we strive for. Instead, we should seek to be "data-informed." Pedantic, I know, but for those that do not work within the field, I think the distinction is important.</p>
<p>Here's its abstract (pardon my snark):</p>
<blockquote>
<p>Companies can't stop gushing about how "data-driven" they are - how they're using "big data" and "data science" to synergize and streamline all the things. But being driven by data alone is a flawed approach. Instead, companies should seek to be "data-informed" - interweaving designers, UXers, and data scientists so that each side is able to perfectly complement one another.</p>
<p>This talk will discuss the importance of allowing data and user research to complement one another, in addition to the pitfalls of being driven by data alone (for instance, the cons of A/B testing).</p>
</blockquote>
<p>While the actual talk isn't one of my best - I sound like I'm reading from cards (I kind of was) - I'm still a big believer in the overall message.</p>
<p>We shouldn't be surprised that being data-informed is ultimately a better approach. Simply, we're just adding more information - quantitative <em>and</em> qualitative - to our existing dataset and weighing that information appropriately.</p>
<p>The best decisions make use of all relevant information, not a limited set, much like the best algorithms are those developed with the best data and features.</p>
<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/yHo3B3BbppM" frameborder="0" allowfullscreen></iframe>
</center>Asynchronous Scraping with Python2016-10-16T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2016-10-16:/2016/10/16/asynchronous-scraping-with-python/<p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python …</a></li></ol><p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python</a>, showing how to use multithreading to speed things up.</li>
<li><a href="http://www.gregreda.com/2020/11/17/scraping-pages-behind-login-forms/">Scraping Pages Behind Login Forms</a>, which shows how to log into sites using Python.</li>
</ol>
<hr>
<p>Previously, I've written about the <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">basics of scraping</a> and how you can <a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">find API calls</a> in order to fetch data that isn't easily downloadable.</p>
<p>For simplicity, the code in these posts has always been synchronous -- given a list of URLs, we process one, then the next, then the next, and so on. While this makes for code that's straight-forward, it can also be slow.</p>
<p>This doesn't have to be the case though. Scraping is often an example of code that is <a href="https://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a>. With some slight changes, our tasks can be done asynchronously, allowing us to process more than one URL at a time.</p>
<p>In version 3.2, Python introduced the <a href="https://docs.python.org/3/library/concurrent.futures.html"><code>concurrent.futures</code></a> module, which is a joy to use for parallelizing tasks like scraping. The rest of this post will show how we can use the module to make our previously synchronous code asynchronous.</p>
<h3>Parallelizing your tasks</h3>
<p>Imagine we have a list of several thousand URLs. In previous posts, we've always written something that looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">csv</span> <span class="kn">import</span> <span class="n">DictWriter</span>
<span class="n">URLS</span> <span class="o">=</span> <span class="p">[</span> <span class="o">...</span> <span class="p">]</span> <span class="c1"># thousands of urls for pages we'd like to parse</span>
<span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="c1"># our logic for parsing the page</span>
    <span class="k">return</span> <span class="n">data</span> <span class="c1"># probably a dict</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">URLS</span><span class="p">:</span> <span class="c1"># go through each url one by one</span>
    <span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">parse</span><span class="p">(</span><span class="n">url</span><span class="p">))</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'results.csv'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">writer</span> <span class="o">=</span> <span class="n">DictWriter</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">keys</span><span class="p">())</span>
    <span class="n">writer</span><span class="o">.</span><span class="n">writeheader</span><span class="p">()</span>
    <span class="n">writer</span><span class="o">.</span><span class="n">writerows</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
</code></pre></div>
<p>The above is an example of synchronous code -- we're looping through a list of URLs, processing one at a time. If the list of URLs is relatively small or we're not concerned about execution time, there's little reason to <a href="https://en.wikipedia.org/wiki/Task_parallelism">parallelize</a> these tasks -- we might as well keep things simple and wait it out.</p>
<p>However, sometimes we have a huge list of URLs -- at least several thousand -- and we can't wait hours for them to finish.</p>
<p>With <code>concurrent.futures</code>, we can work on multiple URLs at once by adding a <code>ProcessPoolExecutor</code> and making a slight change to how we fetch our results.</p>
<p>But first, a reminder: <em>if you're scraping, don't be a jerk</em>. Space out your requests appropriately and don't hammer the site (i.e. use <code>time.sleep</code> to wait briefly between each request and set <code>max_workers</code> to a small number). Being a jerk runs the risk of getting your IP address blocked -- good luck getting that data now.</p>
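<p>To make the politeness concrete, a rate-limited <code>parse</code> might look like the following sketch (my own illustration, not from the original post; the one-second delay and the <code>REQUEST_DELAY</code> name are assumed values you'd tune per site):</p>

```python
import time

REQUEST_DELAY = 1.0  # assumed: seconds to wait before each request; tune per site

def parse(url):
    time.sleep(REQUEST_DELAY)  # be polite: space out requests to the site
    # ... real page-fetching and parsing logic would go here ...
    return {'url': url}
```

Because the sleep lives inside <code>parse</code>, each worker in the pool paces its own requests, so a small <code>max_workers</code> bounds the total request rate.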
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">concurrent.futures</span> <span class="kn">import</span> <span class="n">ProcessPoolExecutor</span>
<span class="kn">import</span> <span class="nn">concurrent.futures</span>
<span class="n">URLS</span> <span class="o">=</span> <span class="p">[</span> <span class="o">...</span> <span class="p">]</span>
<span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
    <span class="c1"># our logic for parsing the page</span>
    <span class="k">return</span> <span class="n">data</span> <span class="c1"># still probably a dict</span>
<span class="k">with</span> <span class="n">ProcessPoolExecutor</span><span class="p">(</span><span class="n">max_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="k">as</span> <span class="n">executor</span><span class="p">:</span>
    <span class="n">future_results</span> <span class="o">=</span> <span class="p">{</span><span class="n">executor</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">parse</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span> <span class="n">url</span> <span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">URLS</span><span class="p">}</span>
    <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">future</span> <span class="ow">in</span> <span class="n">concurrent</span><span class="o">.</span><span class="n">futures</span><span class="o">.</span><span class="n">as_completed</span><span class="p">(</span><span class="n">future_results</span><span class="p">):</span>
        <span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">future</span><span class="o">.</span><span class="n">result</span><span class="p">())</span>
</code></pre></div>
<p>In the above code, we're submitting tasks to the executor -- four workers -- each of which will execute the <code>parse</code> function against a URL. This execution does not happen immediately. For each submission, the executor returns an instance of a <code>Future</code>, which tells us that our task will be executed at some point in the ... well, future. The <code>as_completed</code> function watches our <code>future_results</code> for completion, upon which we'll be able to fetch each result via the <code>result</code> method.</p>
<p>My favorite part about this module is the clarity of its API -- tasks are <em>submitted</em> to an <em>executor</em>, which is made up of one or more workers, each of which is churning through our tasks. Because our tasks are executed asynchronously, we are not waiting for a given task's completion before submitting another -- we are doing so at-will, with completion happening in the <em>future</em>. Once completed, we can get the task's <em>result</em>.</p>
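<p>To see the submit/<code>as_completed</code> pattern end to end without hammering a real site, here's a self-contained toy version (my own sketch: it swaps in a <code>ThreadPoolExecutor</code>, which is typically sufficient for I/O-bound work like scraping, and a fake <code>parse</code> that just sleeps to simulate network latency):</p>

```python
import concurrent.futures
import time

def parse(url):
    # Stand-in for real page-fetching logic: sleep briefly to simulate I/O.
    time.sleep(0.1)
    return {'url': url, 'length': len(url)}

URLS = ['http://example.com/page/%d' % i for i in range(8)]

# Threads (rather than processes) are usually enough when tasks spend their
# time waiting on the network, and they avoid process-spawning overhead.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    future_results = {executor.submit(parse, url): url for url in URLS}
    results = [future.result()
               for future in concurrent.futures.as_completed(future_results)]

print(len(results))  # 8 -- one result per URL, in completion order
```

Note that <code>as_completed</code> yields futures as they finish, so <code>results</code> arrives in completion order, not submission order.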
<h3>Closing up</h3>
<p>With a few changes to your code and some <code>concurrent.futures</code> love, you no longer have to fetch those basketball stats one page at a time.</p>
<p>But don't be a jerk either.</p>Visualizing the 2015 NL Cy Young Race2015-11-19T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2015-11-19:/2015/11/19/nl-cyyoung-viz-2015/<p>This year's National League Cy Young race was pretty much a toss-up, with each of
<a href="http://www.baseball-reference.com/players/a/arrieja01.shtml">Jake Arrieta</a>, <a href="http://www.baseball-reference.com/players/g/greinza01.shtml">Zack Greinke</a>, and <a href="http://www.baseball-reference.com/players/k/kershcl01.shtml">Clayton Kershaw</a> putting up numbers
we haven't seen in a decade or more.</p>
<p>By now we know that Arrieta wins the award, but being the Cubs homer I am, I …</p><p>This year's National League Cy Young race was pretty much a toss-up, with each of
<a href="http://www.baseball-reference.com/players/a/arrieja01.shtml">Jake Arrieta</a>, <a href="http://www.baseball-reference.com/players/g/greinza01.shtml">Zack Greinke</a>, and <a href="http://www.baseball-reference.com/players/k/kershcl01.shtml">Clayton Kershaw</a> putting up numbers
we haven't seen in a decade or more.</p>
<p>By now we know that Arrieta wins the award, but being the Cubs homer I am, I
started digging into the data a few weeks ago in an attempt to show that Arrieta
<em>should</em> win the award. However, as is often the case when walking into an
analysis with preconceived notions of its findings, I was left unable to make my
case with a straight face.</p>
<p>Unable to confidently make the case that <em>any</em> of the contenders were more
deserving of the award than their peers, I decided to turn my work into an article highlighting the historic years each of them had. Unfortunately, the article
never wound up published, but you can still read it <a href="https://github.com/gjreda/cy-young-NL-2015/blob/master/README.md">here</a>,
though it's obviously outdated now.</p>
<p>Since I tend to use this site more for technical posts, it seemed like a good
idea to walk through a couple pieces of my work -- if you're interested in
everything, <a href="https://github.com/gjreda/cy-young-NL-2015">I've pushed it up to GitHub</a>.</p>
<h2>Preprocessing</h2>
<p>In order to show the stats I cared about and their progression throughout each
pitcher's season, I needed to do some preprocessing of the data. Specifically,
I needed to calculate a variety of statistics that are not included in the
game logs from <a href="http://www.baseball-reference.com">Baseball Reference</a>.</p>
<p>After loading the dataset and transforming the innings pitched (IP) field to a
numeric value, you'll see a fairly large section of code
<a href="https://github.com/gjreda/cy-young-NL-2015/blob/master/cy-young.ipynb">in the notebook</a>
that looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Partial innings are stored as 7.1 or 7.2 in the Baseball Reference data.</span>
<span class="c1"># Convert it to properly represent 1/3 or 2/3 of an inning</span>
<span class="c1"># (necessary for various rate calculations).</span>
<span class="k">def</span> <span class="nf">to_innings</span><span class="p">(</span><span class="n">IP</span><span class="p">):</span>
    <span class="n">full</span><span class="p">,</span> <span class="n">partial</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="nb">float</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">IP</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'.'</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">full</span> <span class="o">+</span> <span class="p">(</span><span class="n">partial</span> <span class="o">/</span> <span class="mf">3.</span><span class="p">)</span>
<span class="c1"># example: 7.1 --> 7.3333</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'IP'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">IP</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">to_innings</span><span class="p">)</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">IP</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'IPGame'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">rollingIP</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">Rk</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingER'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">ER</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingERA'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingER'</span><span class="p">]</span> <span class="o">/</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span> <span class="o">/</span> <span class="mf">9.</span><span class="p">)</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'strikeoutsPerIP'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'K/9'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span> <span class="o">/</span> <span class="mf">9.</span><span class="p">)</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'strikeoutsPerBF'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'hitsPerIP'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'hitsPerAB'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingWHIP'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span> <span class="o">/</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'rollingIP'</span><span class="p">]</span>
<span class="c1"># opponents against</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'1B'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">H</span> <span class="o">-</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'2B'</span><span class="p">]</span> <span class="o">+</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'3B'</span><span class="p">]</span> <span class="o">+</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'HR'</span><span class="p">])</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'AVG'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'OBP'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">HBP</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span> \
<span class="o">/</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span>
<span class="n">arrieta</span><span class="o">.</span><span class="n">HBP</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'SLG'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'1B'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'2B'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span>
<span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'3B'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">*</span> <span class="mi">3</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'HR'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">*</span> <span class="mi">4</span><span class="p">))</span> \
<span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'OPS'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">OBP</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SLG</span>
<span class="c1"># rates</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'BABIP'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">H</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">HR</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span> \
<span class="o">/</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span>
<span class="n">arrieta</span><span class="o">.</span><span class="n">HR</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'HR%'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">HR</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'XBH%'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="p">[</span><span class="s1">'2B'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'3B'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'HR'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'K%'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'SO'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'IP%'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span>
<span class="n">arrieta</span><span class="o">.</span><span class="n">HR</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span> \
<span class="o">/</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">BF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
<span class="n">arrieta</span><span class="p">[</span><span class="s1">'GB%'</span><span class="p">]</span> <span class="o">=</span> <span class="n">arrieta</span><span class="p">[</span><span class="s1">'GB'</span><span class="p">]</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> \
<span class="p">(</span><span class="n">arrieta</span><span class="o">.</span><span class="n">AB</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SO</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">-</span>
<span class="n">arrieta</span><span class="o">.</span><span class="n">HR</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">+</span> <span class="n">arrieta</span><span class="o">.</span><span class="n">SF</span><span class="o">.</span><span class="n">cumsum</span><span class="p">())</span>
</code></pre></div>
<p>Here we're adding new, cumulative statistics to each pitcher's DataFrame (e.g.
we can easily say what Arrieta's ERA was after his fourth start, or what his
batting average on balls in play (BABIP) was in the second half of the season).</p>
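<p>The cumulative-rate pattern above generalizes: any per-start counting stats
can be turned into season-to-date rates with <code>cumsum</code>. A minimal
sketch with invented numbers (the column names mirror the ones above, but the
data is made up):</p>

```python
import pandas as pd

# hypothetical per-start lines: strikeouts (SO), hits (H), batters faced (BF)
starts = pd.DataFrame({
    'SO': [7, 10, 5],
    'H':  [6, 3, 8],
    'BF': [27, 30, 29],
})

# season-to-date strikeout rate after each start
starts['strikeoutsPerBF'] = starts.SO.cumsum() / starts.BF.cumsum()

# after the second start this is (7 + 10) / (27 + 30)
print(starts['strikeoutsPerBF'].round(3).tolist())  # → [0.259, 0.298, 0.256]
```

<p>Because each row divides one running total by another, every row is a
"season so far" rate rather than a single-game rate.</p>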
<h2>Visualizing their seasons</h2>
<p>Now that we have various statistics on a rolling basis, we need a way to
compare their performances throughout the season. Thankfully, this is a perfect
use case for <a href="https://en.wikipedia.org/wiki/Small_multiple">small multiples</a>,
which is a technique meant specifically for comparison.</p>
<p>To do so, we can create a dictionary where each pitcher is a key, and the value
is another dictionary containing that pitcher's DataFrame, as well as a color
and line style which we'll use in our plot. Then, we'll create a grid of empty
subplots, which will be populated by looping through our <code>PITCHERS</code> dictionary.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">OrderedDict</span>
<span class="n">PITCHERS</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'Arrieta'</span><span class="p">:</span> <span class="p">{</span><span class="s1">'df'</span><span class="p">:</span> <span class="n">arrieta</span><span class="p">,</span> <span class="s1">'color'</span><span class="p">:</span> <span class="n">ja</span><span class="p">,</span> <span class="s1">'style'</span><span class="p">:</span> <span class="s1">'-'</span><span class="p">},</span>
<span class="s1">'Greinke'</span><span class="p">:</span> <span class="p">{</span><span class="s1">'df'</span><span class="p">:</span> <span class="n">greinke</span><span class="p">,</span> <span class="s1">'color'</span><span class="p">:</span> <span class="n">zg</span><span class="p">,</span> <span class="s1">'style'</span><span class="p">:</span> <span class="s1">'-'</span><span class="p">},</span>
<span class="s1">'Kershaw'</span><span class="p">:</span> <span class="p">{</span><span class="s1">'df'</span><span class="p">:</span> <span class="n">kershaw</span><span class="p">,</span> <span class="s1">'color'</span><span class="p">:</span> <span class="n">kc</span><span class="p">,</span> <span class="s1">'style'</span><span class="p">:</span> <span class="s1">'--'</span><span class="p">}}</span>
<span class="n">PITCHERS</span> <span class="o">=</span> <span class="n">OrderedDict</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">PITCHERS</span><span class="o">.</span><span class="n">items</span><span class="p">()))</span>
<span class="n">stats</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'IP%'</span><span class="p">,</span> <span class="s1">'BABIP'</span><span class="p">,</span> <span class="s1">'XBH%'</span><span class="p">,</span> <span class="s1">'HR%'</span><span class="p">,</span> <span class="s1">'K%'</span><span class="p">]</span>
<span class="n">row_titles</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'</span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">row_title</span><span class="p">)</span> <span class="k">for</span> <span class="n">row_title</span> <span class="ow">in</span> <span class="n">PITCHERS</span><span class="o">.</span><span class="n">keys</span><span class="p">()]</span>
<span class="n">col_titles</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'</span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">col_title</span><span class="p">)</span> <span class="k">for</span> <span class="n">col_title</span> <span class="ow">in</span> <span class="n">stats</span><span class="p">]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span> <span class="n">nrows</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">PITCHERS</span><span class="p">),</span>
<span class="n">ncols</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">stats</span><span class="p">),</span> <span class="n">sharex</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">fig</span><span class="o">.</span><span class="n">tight_layout</span><span class="p">(</span><span class="n">pad</span><span class="o">=</span><span class="mf">1.2</span><span class="p">,</span> <span class="n">h_pad</span><span class="o">=</span><span class="mf">1.5</span><span class="p">)</span> <span class="c1"># adjust layout spacing</span>
<span class="c1"># label each column with stat name</span>
<span class="k">for</span> <span class="n">ax</span><span class="p">,</span> <span class="n">col_title</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">col_titles</span><span class="p">):</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="n">col_title</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="c1"># label each row with player name</span>
<span class="k">for</span> <span class="n">ax</span><span class="p">,</span> <span class="n">row_title</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">axes</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">row_titles</span><span class="p">):</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="n">row_title</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">labelpad</span><span class="o">=</span><span class="mi">40</span><span class="p">)</span>
<span class="c1"># create grid - one chart for each pitcher + stat combination</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">pitcher</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">PITCHERS</span><span class="o">.</span><span class="n">items</span><span class="p">()):</span>
<span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">stat</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">stats</span><span class="p">):</span>
<span class="n">title</span> <span class="o">=</span> <span class="s1">'</span><span class="si">{}</span><span class="s1">: </span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">stat</span><span class="p">)</span>
<span class="n">pitcher</span><span class="p">[</span><span class="s1">'df'</span><span class="p">][</span><span class="n">stat</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="n">pitcher</span><span class="p">[</span><span class="s1">'color'</span><span class="p">],</span>
<span class="n">linestyle</span><span class="o">=</span><span class="n">pitcher</span><span class="p">[</span><span class="s1">'style'</span><span class="p">])</span>
<span class="c1"># for ease of comparison, let's plot the other pitchers on the same chart</span>
<span class="c1"># but let's make them a light grey with the appropriate linestyle</span>
<span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">PITCHERS</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
<span class="k">if</span> <span class="n">k</span> <span class="o">!=</span> <span class="n">name</span><span class="p">:</span>
<span class="n">v</span><span class="p">[</span><span class="s1">'df'</span><span class="p">][</span><span class="n">stat</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s1">'grey'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.4</span><span class="p">,</span>
<span class="n">linestyle</span><span class="o">=</span><span class="n">v</span><span class="p">[</span><span class="s1">'style'</span><span class="p">])</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s1">'both'</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="s1">'major'</span><span class="p">,</span> <span class="n">labelsize</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">allstarbreak</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">'k'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">':'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">]</span><span class="o">.</span><span class="n">yaxis</span><span class="o">.</span><span class="n">set_major_locator</span><span class="p">(</span><span class="n">MaxNLocator</span><span class="p">(</span><span class="n">nbins</span><span class="o">=</span><span class="mi">4</span><span class="p">))</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.</span><span class="p">)</span> <span class="c1"># IP%</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">.500</span><span class="p">)</span> <span class="c1"># BABIP</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">.16</span><span class="p">)</span> <span class="c1"># XBH%</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">.04</span><span class="p">)</span> <span class="c1"># HR%</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">4</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">.36</span><span class="p">)</span> <span class="c1"># K%</span>
<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">'images/rates-comparison.png'</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s1">'tight'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">120</span><span class="p">)</span>
</code></pre></div>
<p>The resulting output is a 3 x 5 grid of charts, where each row corresponds to a
pitcher, and each column is a statistic.</p>
<p><img alt="2015 NL Cy Young Race" src="https://raw.githubusercontent.com/gjreda/cy-young-NL-2015/master/images/rates-comparison.png"></p>
<p>Again, this technique is meant for comparing different dimensions (people,
cities, departments, etc.) against one another.</p>
<p>For instance, looking down the left-most column, we can see that batters
put the ball in play (IP%) about equally often against Arrieta and Greinke,
but less often against Kershaw. Looking down the far-right column, we can see
why: Kershaw's stronger ability to strike hitters out (K%) meant fewer balls
in play against him.</p>
<h2>Comparing batted ball exit velocity</h2>
<p>With <a href="https://en.wikipedia.org/wiki/PITCHf/x">PITCHf/x</a> installed in every MLB
park, we can also look at data on every pitch thrown throughout
the season. <a href="https://baseballsavant.com">Baseball Savant</a> is a great source of this data.</p>
<p>Since it still wasn't clear who should win the award after looking at a variety
of stats, it seemed interesting to answer the most basic question: Which pitcher
was hit harder? We know <a href="http://fivethirtyeight.com/features/chase-utley-is-the-unluckiest-man-in-baseball/">there's a significant relationship</a>
between a batted ball's exit velocity and the likelihood that it winds up a hit, so
this should give us some indication of who was the more difficult pitcher to
face.</p>
<p><img alt="Exit Velocity Distribution By Pitcher" src="/images/bb-velocity-distributions.png"></p>
<p>Looking at the observed distributions of their batted ball exit velocities doesn't
tell us much: Arrieta's mean exit velocity was 85.0 MPH, Greinke's 88.4, and
Kershaw's 84.9. Those numbers are close enough that we shouldn't assume the
differences are statistically significant, so let's test that using the
<a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrap</a>.</p>
<p>With bootstrapping, we resample our observed dataset with replacement N times
(typically 1,000 or 10,000). Since we're interested in speaking about the "average"
batted ball exit velocity, we take the mean of each resample, resulting in an
approximation of the sampling distribution of the mean. From there, we can look at the
95% confidence intervals to test for significance.</p>
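<p>Before resampling the real pitch data, it's worth seeing the 95% interval
computation in isolation. A minimal sketch on synthetic data (the velocities
here are invented; <code>np.percentile</code> over the bootstrap means gives
the percentile interval):</p>

```python
import numpy as np

rng = np.random.default_rng(49)

# invented batted-ball velocities centered near 85 MPH
observed = rng.normal(loc=85.0, scale=12.0, size=500)

# bootstrap: resample with replacement, record each resample's mean
boot_means = [rng.choice(observed, size=len(observed), replace=True).mean()
              for _ in range(1000)]

# 95% percentile confidence interval for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(round(lo, 1), round(hi, 1))  # interval brackets the observed mean
```

<p>If two pitchers' intervals don't overlap, that's good evidence the difference
in their means isn't just sampling noise.</p>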
<div class="highlight"><pre><span></span><code><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">49</span><span class="p">)</span> <span class="c1"># set random seed for consistency</span>
<span class="c1"># only sample from pitches that were hit</span>
<span class="n">arrietaBBs</span> <span class="o">=</span> <span class="n">arrietaPitches</span><span class="p">[</span><span class="n">arrietaPitches</span><span class="o">.</span><span class="n">batted_ball_velocity</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">batted_ball_velocity</span>
<span class="n">greinkeBBs</span> <span class="o">=</span> <span class="n">greinkePitches</span><span class="p">[</span><span class="n">greinkePitches</span><span class="o">.</span><span class="n">batted_ball_velocity</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">batted_ball_velocity</span>
<span class="n">kershawBBs</span> <span class="o">=</span> <span class="n">kershawPitches</span><span class="p">[</span><span class="n">kershawPitches</span><span class="o">.</span><span class="n">batted_ball_velocity</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">batted_ball_velocity</span>
<span class="n">arrietaSamples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">greinkeSamples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">kershawSamples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># generate 1000 randomly sampled datasets for each pitcher</span>
<span class="c1"># each sampled dataset is the same length as our observed dataset</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
<span class="n">arrietaSamples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">arrietaBBs</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">arrietaBBs</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span>
<span class="n">greinkeSamples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">greinkeBBs</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">greinkeBBs</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span>
<span class="n">kershawSamples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">kershawBBs</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">kershawBBs</span><span class="p">),</span> <span class="n">replace</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span>
<span class="c1"># get the mean of each randomly sampled dataset</span>
<span class="n">arrietaMeans</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span> <span class="k">for</span> <span class="n">obs</span> <span class="ow">in</span> <span class="n">arrietaSamples</span><span class="p">]</span>
<span class="n">greinkeMeans</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span> <span class="k">for</span> <span class="n">obs</span> <span class="ow">in</span> <span class="n">greinkeSamples</span><span class="p">]</span>
<span class="n">kershawMeans</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span> <span class="k">for</span> <span class="n">obs</span> <span class="ow">in</span> <span class="n">kershawSamples</span><span class="p">]</span>
<span class="c1"># plot the distributions</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">arrietaMeans</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">.5</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Arrieta'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">ja</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">greinkeMeans</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">.6</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Greinke'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">zg</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">kershawMeans</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">.3</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Kershaw'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">kc</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">'best'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Avg. Batted Ball Velocity'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="s1">'right'</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="s1">'left'</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">spines</span><span class="p">[</span><span class="s1">'top'</span><span class="p">]</span><span class="o">.</span><span class="n">set_visible</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">xaxis</span><span class="o">.</span><span class="n">set_ticks_position</span><span class="p">(</span><span class="s1">'bottom'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s1">'both'</span><span class="p">,</span> <span class="n">which</span><span class="o">=</span><span class="s1">'major'</span><span class="p">,</span> <span class="n">labelsize</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">get_yaxis</span><span class="p">()</span><span class="o">.</span><span class="n">set_ticks</span><span class="p">([])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s1">'images/avg-batted-ball-velocity.png'</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s1">'tight'</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">120</span><span class="p">);</span>
</code></pre></div>
<p><img alt="Batted Ball Exit Velocity" src="https://raw.githubusercontent.com/gjreda/cy-young-NL-2015/master/images/avg-batted-ball-velocity.png"></p>
<p>While the above chart doesn't explicitly show their 95% confidence intervals, it's pretty
clear that Greinke's mean exit velocity is significantly higher than Arrieta's and
Kershaw's -- allowing us to say that, on average, Greinke was hit harder
throughout the season than both Arrieta and Kershaw. We cannot confidently say
there was a difference in exit velocity when comparing Arrieta and Kershaw to
each other, though.</p>
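<p>To make those intervals explicit, one option is to take percentiles of the bootstrap means directly. A minimal sketch on synthetic data -- the variable names mirror the lists computed above, but the numbers here are made up for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins for the bootstrap means computed above
greinkeMeans = rng.normal(loc=89.5, scale=0.3, size=10_000)
kershawMeans = rng.normal(loc=87.8, scale=0.3, size=10_000)

def ci95(means):
    """95% bootstrap percentile interval from a collection of resampled means."""
    return np.percentile(means, [2.5, 97.5])

lo_g, hi_g = ci95(greinkeMeans)
lo_k, hi_k = ci95(kershawMeans)

# non-overlapping intervals support calling the difference significant
print((lo_g, hi_g), (lo_k, hi_k))
```

<p>If the two intervals don't overlap, the "Greinke was hit harder" claim holds up at the 95% level.</p>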
<p>The chart above is especially interesting in the context of our small
multiples charts.
In particular, Greinke had the lowest ERA, batting average on balls in play
(BABIP), and extra base hit rate (XBH%) of the three, <em>despite</em> allowing harder contact.
This suggests that Greinke received a bit more help from his defense than
Arrieta and Kershaw did.</p>
<p>If you're interested in more analysis on the season each of these three had,
<a href="https://twitter.com/DCameronFG">Dave Cameron</a> at <a href="http://www.fangraphs.com">FanGraphs</a> has an excellent write-up <a href="http://www.fangraphs.com/blogs/explaining-my-nl-cy-young-ballot/">explaining the rationale
behind his vote</a>.</p>
<hr>
<p>Hope you've enjoyed the post, and <a href="https://www.twitter.com/gjreda">let me know</a> if you have any questions.</p>Cohort Analysis with Python2015-08-23T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2015-08-23:/2015/08/23/cohort-analysis-with-python/<p>Despite having done it countless times, I regularly forget how to build a <a href="https://en.wikipedia.org/wiki/Cohort_analysis">cohort analysis</a> with Python and <a href="http://pandas.pydata.org/">pandas</a>. I’ve decided it’s a good idea to finally write it out - step by step - so I can refer back to this post later on. Hopefully others find it useful as well.</p>
<p>I’ll start by walking through what cohort analysis is and why it’s commonly used in startups and other growth businesses. Then, we’ll create one from a standard purchase dataset.</p>
<h2>What is cohort analysis?</h2>
<p>A cohort is a group of users who share something in common, be it their sign-up date, first purchase month, birth date, acquisition channel, etc. Cohort analysis is the method by which these groups are tracked over time, helping you spot trends, understand repeat behaviors (purchases, engagement, amount spent, etc.), and monitor your customer and revenue retention.</p>
<p>It’s common for cohorts to be created based on a customer’s first usage of the platform, where "usage" is dependent on your business’ key metrics. For Uber or Lyft, usage would be booking a trip through one of their apps. For GrubHub, it’s ordering some food. For AirBnB, it’s booking a stay.</p>
<p>With these companies, a purchase is at their core, be it taking a trip or ordering dinner — their revenues are tied to their users’ purchase behavior.</p>
<p>In others, a purchase is not central to the business model and the business is more interested in "engagement" with the platform. Facebook and Twitter are examples of this - are you visiting their sites every day? Are you performing some action on them - maybe a "like" on Facebook or a "favorite" on a tweet?<sup>1</sup></p>
<p>When building a cohort analysis, it’s important to consider the relationship between the event or interaction you’re tracking and its relationship to your business model.</p>
<h2>Why is it valuable?</h2>
<p>Cohort analysis can be helpful when it comes to understanding your business’ health and "stickiness" - the loyalty of your customers. Stickiness is critical since <a href="https://hbr.org/2014/10/the-value-of-keeping-the-right-customers/">it’s far cheaper and easier to keep a current customer than to acquire a new one</a>. For startups, it’s also a key indicator of <a href="https://en.wikipedia.org/wiki/Product/market_fit">product-market fit</a>.</p>
<p>Additionally, your product evolves over time. New features are added and removed, the design changes, etc. Observing individual groups over time is a starting point to understanding how these changes affect user behavior.</p>
<p>It’s also a good way to visualize your user retention/churn as well as formulating a basic understanding of their lifetime value.</p>
<h2>An example</h2>
<p>Imagine we have a dataset like the one below (you can find it <a href="http://dmanalytics.org/wp-content/uploads/2014/10/chapter-12-relay-foods.xlsx">here</a>):</p>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>OrderId</th>
<th>OrderDate</th>
<th>UserId</th>
<th>TotalCharges</th>
<th>CommonId</th>
<th>PupId</th>
<th>PickupDate</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>262</td>
<td>2009-01-11</td>
<td>47</td>
<td>50.67</td>
<td>TRQKD</td>
<td>2</td>
<td>2009-01-12</td>
</tr>
<tr>
<th>1</th>
<td>278</td>
<td>2009-01-20</td>
<td>47</td>
<td>26.60</td>
<td>4HH2S</td>
<td>3</td>
<td>2009-01-20</td>
</tr>
<tr>
<th>2</th>
<td>294</td>
<td>2009-02-03</td>
<td>47</td>
<td>38.71</td>
<td>3TRDC</td>
<td>2</td>
<td>2009-02-04</td>
</tr>
<tr>
<th>3</th>
<td>301</td>
<td>2009-02-06</td>
<td>47</td>
<td>53.38</td>
<td>NGAZJ</td>
<td>2</td>
<td>2009-02-09</td>
</tr>
<tr>
<th>4</th>
<td>302</td>
<td>2009-02-06</td>
<td>47</td>
<td>14.28</td>
<td>FFYHD</td>
<td>2</td>
<td>2009-02-09</td>
</tr>
</tbody>
</table>
<p>Pretty standard purchase data with IDs for the order and user, as well as the order date and purchase amount.</p>
<p>We want to go from the data above to something like this:</p>
<p><img alt="example cohort chart" src="/images/cohort-example.png"></p>
<p>Here’s how we get there.</p>
<h2>Code</h2>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib</span> <span class="k">as</span> <span class="nn">mpl</span>
<span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s1">'display.max_columns'</span><span class="p">,</span> <span class="mi">50</span><span class="p">)</span>
<span class="n">mpl</span><span class="o">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s1">'lines.linewidth'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">'/Users/gjreda/Dropbox/datasets/relay-foods.xlsx'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>OrderId</th>
<th>OrderDate</th>
<th>UserId</th>
<th>TotalCharges</th>
<th>CommonId</th>
<th>PupId</th>
<th>PickupDate</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>262</td>
<td>2009-01-11</td>
<td>47</td>
<td>50.67</td>
<td>TRQKD</td>
<td>2</td>
<td>2009-01-12</td>
</tr>
<tr>
<th>1</th>
<td>278</td>
<td>2009-01-20</td>
<td>47</td>
<td>26.60</td>
<td>4HH2S</td>
<td>3</td>
<td>2009-01-20</td>
</tr>
<tr>
<th>2</th>
<td>294</td>
<td>2009-02-03</td>
<td>47</td>
<td>38.71</td>
<td>3TRDC</td>
<td>2</td>
<td>2009-02-04</td>
</tr>
</tbody>
</table>
<h3>1. Create a period column based on the OrderDate</h3>
<p>Since we're doing monthly cohorts, we'll be looking at the total monthly behavior of our users. Therefore, we don't want granular OrderDate data (right now).</p>
<div class="highlight"><pre><span></span><code><span class="n">df</span><span class="p">[</span><span class="s1">'OrderPeriod'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">OrderDate</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">'%Y-%m'</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
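<p>As an aside, the row-wise <code>apply</code> works, but pandas can also do this at the column level with the <code>.dt</code> accessor. A sketch of two equivalent approaches on toy data (assuming <code>OrderDate</code> is already a datetime column, as it is when read from Excel):</p>

```python
import pandas as pd

df = pd.DataFrame({'OrderDate': pd.to_datetime(['2009-01-11', '2009-02-03'])})

# vectorized datetime formatting instead of a Python-level apply
df['OrderPeriod'] = df['OrderDate'].dt.strftime('%Y-%m')

# or keep it as a real Period type, which sorts and groups naturally
df['OrderPeriodP'] = df['OrderDate'].dt.to_period('M')

print(df['OrderPeriod'].tolist())  # -> ['2009-01', '2009-02']
```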
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>OrderId</th>
<th>OrderDate</th>
<th>UserId</th>
<th>TotalCharges</th>
<th>CommonId</th>
<th>PupId</th>
<th>PickupDate</th>
<th>OrderPeriod</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>262</td>
<td>2009-01-11</td>
<td>47</td>
<td>50.67</td>
<td>TRQKD</td>
<td>2</td>
<td>2009-01-12</td>
<td>2009-01</td>
</tr>
<tr>
<th>1</th>
<td>278</td>
<td>2009-01-20</td>
<td>47</td>
<td>26.60</td>
<td>4HH2S</td>
<td>3</td>
<td>2009-01-20</td>
<td>2009-01</td>
</tr>
<tr>
<th>2</th>
<td>294</td>
<td>2009-02-03</td>
<td>47</td>
<td>38.71</td>
<td>3TRDC</td>
<td>2</td>
<td>2009-02-04</td>
<td>2009-02</td>
</tr>
<tr>
<th>3</th>
<td>301</td>
<td>2009-02-06</td>
<td>47</td>
<td>53.38</td>
<td>NGAZJ</td>
<td>2</td>
<td>2009-02-09</td>
<td>2009-02</td>
</tr>
<tr>
<th>4</th>
<td>302</td>
<td>2009-02-06</td>
<td>47</td>
<td>14.28</td>
<td>FFYHD</td>
<td>2</td>
<td>2009-02-09</td>
<td>2009-02</td>
</tr>
</tbody>
</table>
<h3>2. Determine the user's cohort group (based on their first order)</h3>
<p>Create a new column called <code>CohortGroup</code>, which is the year and month in which the user's first purchase occurred.</p>
<div class="highlight"><pre><span></span><code><span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'UserId'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'CohortGroup'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)[</span><span class="s1">'OrderDate'</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">'%Y-%m'</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>UserId</th>
<th>OrderId</th>
<th>OrderDate</th>
<th>TotalCharges</th>
<th>CommonId</th>
<th>PupId</th>
<th>PickupDate</th>
<th>OrderPeriod</th>
<th>CohortGroup</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>47</td>
<td>262</td>
<td>2009-01-11</td>
<td>50.67</td>
<td>TRQKD</td>
<td>2</td>
<td>2009-01-12</td>
<td>2009-01</td>
<td>2009-01</td>
</tr>
<tr>
<th>1</th>
<td>47</td>
<td>278</td>
<td>2009-01-20</td>
<td>26.60</td>
<td>4HH2S</td>
<td>3</td>
<td>2009-01-20</td>
<td>2009-01</td>
<td>2009-01</td>
</tr>
<tr>
<th>2</th>
<td>47</td>
<td>294</td>
<td>2009-02-03</td>
<td>38.71</td>
<td>3TRDC</td>
<td>2</td>
<td>2009-02-04</td>
<td>2009-02</td>
<td>2009-01</td>
</tr>
<tr>
<th>3</th>
<td>47</td>
<td>301</td>
<td>2009-02-06</td>
<td>53.38</td>
<td>NGAZJ</td>
<td>2</td>
<td>2009-02-09</td>
<td>2009-02</td>
<td>2009-01</td>
</tr>
<tr>
<th>4</th>
<td>47</td>
<td>302</td>
<td>2009-02-06</td>
<td>14.28</td>
<td>FFYHD</td>
<td>2</td>
<td>2009-02-09</td>
<td>2009-02</td>
<td>2009-01</td>
</tr>
</tbody>
</table>
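<p>The <code>set_index</code>/<code>reset_index</code> dance above leans on index alignment to broadcast each user's minimum order date back onto all of their rows. A <code>groupby().transform</code> does the same thing in one step; a sketch on toy data:</p>

```python
import pandas as pd

df = pd.DataFrame({
    'UserId': [47, 47, 48],
    'OrderDate': pd.to_datetime(['2009-01-11', '2009-02-03', '2009-03-01']),
})

# transform returns one value per row, aligned to the original index
df['CohortGroup'] = (df.groupby('UserId')['OrderDate']
                       .transform('min')
                       .dt.strftime('%Y-%m'))

print(df['CohortGroup'].tolist())  # -> ['2009-01', '2009-01', '2009-03']
```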
<h3>3. Rollup data by CohortGroup & OrderPeriod</h3>
<p>Since we're looking at monthly cohorts, we need to aggregate users, orders, and amount spent by the CohortGroup within the month (OrderPeriod).</p>
<div class="highlight"><pre><span></span><code><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'CohortGroup'</span><span class="p">,</span> <span class="s1">'OrderPeriod'</span><span class="p">])</span>
<span class="c1"># count the unique users, orders, and total revenue per Group + Period</span>
<span class="n">cohorts</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'UserId'</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="o">.</span><span class="n">nunique</span><span class="p">,</span>
<span class="s1">'OrderId'</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="o">.</span><span class="n">nunique</span><span class="p">,</span>
<span class="s1">'TotalCharges'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span>
<span class="c1"># make the column names more meaningful</span>
<span class="n">cohorts</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">'UserId'</span><span class="p">:</span> <span class="s1">'TotalUsers'</span><span class="p">,</span>
<span class="s1">'OrderId'</span><span class="p">:</span> <span class="s1">'TotalOrders'</span><span class="p">},</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">cohorts</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>TotalOrders</th>
<th>TotalUsers</th>
<th>TotalCharges</th>
</tr>
<tr>
<th>CohortGroup</th>
<th>OrderPeriod</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="5" valign="top">2009-01</th>
<th>2009-01</th>
<td>30</td>
<td>22</td>
<td>1850.255</td>
</tr>
<tr>
<th>2009-02</th>
<td>25</td>
<td>8</td>
<td>1351.065</td>
</tr>
<tr>
<th>2009-03</th>
<td>26</td>
<td>10</td>
<td>1357.360</td>
</tr>
<tr>
<th>2009-04</th>
<td>28</td>
<td>9</td>
<td>1604.500</td>
</tr>
<tr>
<th>2009-05</th>
<td>26</td>
<td>10</td>
<td>1575.625</td>
</tr>
</tbody>
</table>
<h3>4. Label the CohortPeriod for each CohortGroup</h3>
<p>We want to look at how each cohort has behaved in the months following their first purchase, so we'll need to index each cohort to their first purchase month. For example, CohortPeriod = 1 will be the cohort's first month, CohortPeriod = 2 is their second, and so on.</p>
<p>This allows us to compare cohorts across various stages of their lifetime.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">cohort_period</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> Creates a `CohortPeriod` column, which is the Nth period based on the user's first purchase.</span>
<span class="sd"> Example</span>
<span class="sd"> -------</span>
<span class="sd"> Say you want to get the 3rd month for every user:</span>
<span class="sd">    df.sort_values(['UserId', 'OrderDate'], inplace=True)</span>
<span class="sd"> df = df.groupby('UserId').apply(cohort_period)</span>
<span class="sd"> df[df.CohortPeriod == 3]</span>
<span class="sd"> """</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'CohortPeriod'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">cohorts</span> <span class="o">=</span> <span class="n">cohorts</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">cohort_period</span><span class="p">)</span>
<span class="n">cohorts</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>TotalOrders</th>
<th>TotalUsers</th>
<th>TotalCharges</th>
<th>CohortPeriod</th>
</tr>
<tr>
<th>CohortGroup</th>
<th>OrderPeriod</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="5" valign="top">2009-01</th>
<th>2009-01</th>
<td>30</td>
<td>22</td>
<td>1850.255</td>
<td>1</td>
</tr>
<tr>
<th>2009-02</th>
<td>25</td>
<td>8</td>
<td>1351.065</td>
<td>2</td>
</tr>
<tr>
<th>2009-03</th>
<td>26</td>
<td>10</td>
<td>1357.360</td>
<td>3</td>
</tr>
<tr>
<th>2009-04</th>
<td>28</td>
<td>9</td>
<td>1604.500</td>
<td>4</td>
</tr>
<tr>
<th>2009-05</th>
<td>26</td>
<td>10</td>
<td>1575.625</td>
<td>5</td>
</tr>
</tbody>
</table>
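<p>As an aside, the custom function isn't strictly necessary: <code>cumcount</code> numbers the rows within each group starting at zero, so <code>groupby(level=0).cumcount() + 1</code> produces the same labels. A sketch on a toy frame shaped like <code>cohorts</code>:</p>

```python
import pandas as pd

cohorts = pd.DataFrame({
    'CohortGroup': ['2009-01', '2009-01', '2009-02'],
    'OrderPeriod': ['2009-01', '2009-02', '2009-02'],
    'TotalUsers': [22, 8, 15],
}).set_index(['CohortGroup', 'OrderPeriod'])

# cumcount numbers rows within each CohortGroup, starting at 0
cohorts['CohortPeriod'] = cohorts.groupby(level=0).cumcount() + 1

print(cohorts['CohortPeriod'].tolist())  # -> [1, 2, 1]
```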
<h3>5. Make sure we did all that right</h3>
<p>Let's test data points from the original DataFrame with their corresponding values in the new cohorts DataFrame to make sure all our data transformations worked as expected. As long as none of these raise an exception, we're good.</p>
<div class="highlight"><pre><span></span><code><span class="n">x</span> <span class="o">=</span> <span class="n">df</span><span class="p">[(</span><span class="n">df</span><span class="o">.</span><span class="n">CohortGroup</span> <span class="o">==</span> <span class="s1">'2009-01'</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">OrderPeriod</span> <span class="o">==</span> <span class="s1">'2009-01'</span><span class="p">)]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">cohorts</span><span class="o">.</span><span class="n">loc</span><span class="p">[(</span><span class="s1">'2009-01'</span><span class="p">,</span> <span class="s1">'2009-01'</span><span class="p">)]</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'UserId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">])</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'OrderId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalOrders'</span><span class="p">])</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">df</span><span class="p">[(</span><span class="n">df</span><span class="o">.</span><span class="n">CohortGroup</span> <span class="o">==</span> <span class="s1">'2009-01'</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">OrderPeriod</span> <span class="o">==</span> <span class="s1">'2009-09'</span><span class="p">)]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">cohorts</span><span class="o">.</span><span class="n">loc</span><span class="p">[(</span><span class="s1">'2009-01'</span><span class="p">,</span> <span class="s1">'2009-09'</span><span class="p">)]</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'UserId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">])</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'OrderId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalOrders'</span><span class="p">])</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">df</span><span class="p">[(</span><span class="n">df</span><span class="o">.</span><span class="n">CohortGroup</span> <span class="o">==</span> <span class="s1">'2009-05'</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">OrderPeriod</span> <span class="o">==</span> <span class="s1">'2009-09'</span><span class="p">)]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">cohorts</span><span class="o">.</span><span class="n">loc</span><span class="p">[(</span><span class="s1">'2009-05'</span><span class="p">,</span> <span class="s1">'2009-09'</span><span class="p">)]</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'UserId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">])</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalCharges'</span><span class="p">]</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span>
<span class="k">assert</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'OrderId'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="s1">'TotalOrders'</span><span class="p">])</span>
</code></pre></div>
<h3>User Retention by Cohort Group</h3>
<p>We want to look at the percentage change of each CohortGroup over time -- not the absolute change.</p>
<p>To do this, we'll first need to create a pandas Series containing each CohortGroup and its size.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># reindex the DataFrame</span>
<span class="n">cohorts</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">cohorts</span><span class="o">.</span><span class="n">set_index</span><span class="p">([</span><span class="s1">'CohortGroup'</span><span class="p">,</span> <span class="s1">'CohortPeriod'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># create a Series holding the total size of each CohortGroup</span>
<span class="n">cohort_group_size</span> <span class="o">=</span> <span class="n">cohorts</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">]</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="n">cohort_group_size</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">CohortGroup</span>
<span class="go">2009-01 22</span>
<span class="go">2009-02 15</span>
<span class="go">2009-03 13</span>
<span class="go">2009-04 39</span>
<span class="go">2009-05 50</span>
<span class="go">Name: TotalUsers, dtype: int64</span>
</code></pre></div>
<p>Now, we'll need to divide the <code>TotalUsers</code> values in <code>cohorts</code> by <code>cohort_group_size</code>. Since DataFrame operations are performed based on the indices of the objects, we'll use <code>unstack</code> on our cohorts DataFrame to create a matrix where each column represents a CohortGroup and each row is the CohortPeriod corresponding to that group.</p>
<p>To illustrate what <code>unstack</code> does, recall the first five <code>TotalUsers</code> values:</p>
<div class="highlight"><pre><span></span><code><span class="n">cohorts</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">CohortGroup CohortPeriod</span>
<span class="go">2009-01 1 22</span>
<span class="go"> 2 8</span>
<span class="go"> 3 10</span>
<span class="go"> 4 9</span>
<span class="go"> 5 10</span>
<span class="go">Name: TotalUsers, dtype: int64</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">cohorts</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">]</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>CohortGroup</th>
<th>2009-01</th>
<th>2009-02</th>
<th>2009-03</th>
<th>2009-04</th>
<th>2009-05</th>
<th>2009-06</th>
<th>2009-07</th>
<th>2009-08</th>
<th>2009-09</th>
<th>2009-10</th>
<th>2009-11</th>
<th>2009-12</th>
<th>2010-01</th>
<th>2010-02</th>
<th>2010-03</th>
</tr>
<tr>
<th>CohortPeriod</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>22</td>
<td>15</td>
<td>13</td>
<td>39</td>
<td>50</td>
<td>32</td>
<td>50</td>
<td>31</td>
<td>37</td>
<td>54</td>
<td>130</td>
<td>65</td>
<td>95</td>
<td>100</td>
<td>24</td>
</tr>
<tr>
<th>2</th>
<td>8</td>
<td>3</td>
<td>4</td>
<td>13</td>
<td>13</td>
<td>15</td>
<td>23</td>
<td>11</td>
<td>15</td>
<td>17</td>
<td>32</td>
<td>17</td>
<td>50</td>
<td>19</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>10</td>
<td>5</td>
<td>5</td>
<td>10</td>
<td>12</td>
<td>9</td>
<td>13</td>
<td>9</td>
<td>14</td>
<td>12</td>
<td>26</td>
<td>18</td>
<td>26</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>9</td>
<td>1</td>
<td>4</td>
<td>13</td>
<td>5</td>
<td>6</td>
<td>10</td>
<td>7</td>
<td>8</td>
<td>13</td>
<td>29</td>
<td>7</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>10</td>
<td>4</td>
<td>1</td>
<td>6</td>
<td>4</td>
<td>7</td>
<td>11</td>
<td>6</td>
<td>13</td>
<td>13</td>
<td>13</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>Now, we can utilize broadcasting to divide each column by the corresponding <code>cohort_group_size</code>.</p>
<p>The resulting DataFrame, <code>user_retention</code>, contains the percentage of users from the cohort purchasing within the given period. For instance, 38.5% of users in the 2009-03 cohort purchased again in month 3 (which would be May 2009).</p>
<div class="highlight"><pre><span></span><code><span class="n">user_retention</span> <span class="o">=</span> <span class="n">cohorts</span><span class="p">[</span><span class="s1">'TotalUsers'</span><span class="p">]</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">cohort_group_size</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">user_retention</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>CohortGroup</th>
<th>2009-01</th>
<th>2009-02</th>
<th>2009-03</th>
<th>2009-04</th>
<th>2009-05</th>
<th>2009-06</th>
<th>2009-07</th>
<th>2009-08</th>
<th>2009-09</th>
<th>2009-10</th>
<th>2009-11</th>
<th>2009-12</th>
<th>2010-01</th>
<th>2010-02</th>
<th>2010-03</th>
</tr>
<tr>
<th>CohortPeriod</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.00</td>
<td>1.00000</td>
<td>1.00</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.00</td>
<td>1</td>
</tr>
<tr>
<th>2</th>
<td>0.363636</td>
<td>0.200000</td>
<td>0.307692</td>
<td>0.333333</td>
<td>0.26</td>
<td>0.46875</td>
<td>0.46</td>
<td>0.354839</td>
<td>0.405405</td>
<td>0.314815</td>
<td>0.246154</td>
<td>0.261538</td>
<td>0.526316</td>
<td>0.19</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>0.454545</td>
<td>0.333333</td>
<td>0.384615</td>
<td>0.256410</td>
<td>0.24</td>
<td>0.28125</td>
<td>0.26</td>
<td>0.290323</td>
<td>0.378378</td>
<td>0.222222</td>
<td>0.200000</td>
<td>0.276923</td>
<td>0.273684</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>0.409091</td>
<td>0.066667</td>
<td>0.307692</td>
<td>0.333333</td>
<td>0.10</td>
<td>0.18750</td>
<td>0.20</td>
<td>0.225806</td>
<td>0.216216</td>
<td>0.240741</td>
<td>0.223077</td>
<td>0.107692</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>0.454545</td>
<td>0.266667</td>
<td>0.076923</td>
<td>0.153846</td>
<td>0.08</td>
<td>0.21875</td>
<td>0.22</td>
<td>0.193548</td>
<td>0.351351</td>
<td>0.240741</td>
<td>0.100000</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>6</th>
<td>0.363636</td>
<td>0.266667</td>
<td>0.153846</td>
<td>0.179487</td>
<td>0.12</td>
<td>0.15625</td>
<td>0.20</td>
<td>0.258065</td>
<td>0.243243</td>
<td>0.129630</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>0.363636</td>
<td>0.266667</td>
<td>0.153846</td>
<td>0.102564</td>
<td>0.06</td>
<td>0.09375</td>
<td>0.22</td>
<td>0.129032</td>
<td>0.216216</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>8</th>
<td>0.318182</td>
<td>0.333333</td>
<td>0.230769</td>
<td>0.153846</td>
<td>0.10</td>
<td>0.09375</td>
<td>0.14</td>
<td>0.129032</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>9</th>
<td>0.318182</td>
<td>0.333333</td>
<td>0.153846</td>
<td>0.051282</td>
<td>0.10</td>
<td>0.31250</td>
<td>0.14</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>10</th>
<td>0.318182</td>
<td>0.266667</td>
<td>0.076923</td>
<td>0.102564</td>
<td>0.08</td>
<td>0.09375</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
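<p>To make the broadcasting step concrete, here is a minimal, self-contained sketch of the same <code>divide</code>-with-<code>axis=1</code> pattern on toy data (the values mirror the first two cohorts above, but the variable names are illustrative, not the notebook's):</p>

```python
import pandas as pd

# A toy version of the unstacked matrix: each column is a CohortGroup,
# each row a CohortPeriod, each value a count of active users.
unstacked = pd.DataFrame({'2009-01': [22, 8, 10],
                         '2009-02': [15, 3, 5]},
                        index=pd.Index([1, 2, 3], name='CohortPeriod'))

# Each cohort's size is its period-1 user count.
sizes = pd.Series({'2009-01': 22, '2009-02': 15})

# axis=1 aligns the Series' index against the DataFrame's columns,
# so every column is divided element-wise by its cohort's size.
retention = unstacked.divide(sizes, axis=1)
print(retention)
```

<p>Period 1 comes out as 1.0 for every cohort, and period 2 for the 2009-01 cohort is 8/22 ≈ 0.36, matching the full table above.</p>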
<p>Finally, we can plot the cohorts over time in an effort to spot behavioral differences or similarities. Two common cohort charts are line graphs and heatmaps, both of which are shown below.</p>
<p>Notice that the first period of each cohort is 100% -- this is because our cohorts are based on each user's first purchase, meaning everyone in the cohort purchased in month 1.</p>
<div class="highlight"><pre><span></span><code><span class="n">user_retention</span><span class="p">[[</span><span class="s1">'2009-06'</span><span class="p">,</span> <span class="s1">'2009-07'</span><span class="p">,</span> <span class="s1">'2009-08'</span><span class="p">]]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Cohorts: User Retention'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">12.1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'</span><span class="si">% o</span><span class="s1">f Cohort Purchasing'</span><span class="p">);</span>
</code></pre></div>
<p><img alt="cohort retention curves" src="/images/cohort-example.png"></p>
<div class="highlight"><pre><span></span><code><span class="c1"># Creating heatmaps in matplotlib is more difficult than it should be.</span>
<span class="c1"># Thankfully, Seaborn makes them easy for us.</span>
<span class="c1"># http://stanford.edu/~mwaskom/software/seaborn/</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="nn">sns</span>
<span class="n">sns</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">style</span><span class="o">=</span><span class="s1">'white'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Cohorts: User Retention'</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">user_retention</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">user_retention</span><span class="o">.</span><span class="n">T</span><span class="o">.</span><span class="n">isnull</span><span class="p">(),</span> <span class="n">annot</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">fmt</span><span class="o">=</span><span class="s1">'.0%'</span><span class="p">);</span>
</code></pre></div>
<p><img alt="cohort retention heatmap" src="/images/cohort-retention-heatmap.png"></p>
<p>Unsurprisingly, we can see from the above chart that fewer users tend to purchase as time goes on.</p>
<p>However, we can also see that the 2009-01 cohort is the strongest, which enables us to ask targeted questions about this cohort compared to others -- what other attributes (besides first purchase month) do these users share which might be causing them to stick around? How were the majority of these users acquired? Was there a specific marketing campaign that brought them in? Did they take advantage of a promotion at sign-up? The answers to these questions would inform future marketing and product efforts.</p>
<h2>Further work</h2>
<p>User retention is only one way of using cohorts to look at your business — we could have also looked at revenue retention. That is, the percentage of each cohort’s month 1 revenue returning in subsequent periods. User retention is important, but we shouldn’t lose sight of the revenue each cohort is bringing in (and how much of it is returning).</p>
<p>Hopefully you’ve found this post useful. If I’ve missed anything, <a href="https://twitter.com/gjreda">let me know</a>.</p>
<h2>Additional Resources</h2>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Cohort_analysis">Cohort Analysis</a> on Wikipedia</li>
<li><a href="http://christophjanz.blogspot.de/2012/05/know-your-user-cohorts.html">Know Your User Cohorts</a> by Christoph Janz</li>
<li><a href="http://avc.com/2009/10/the-cohort-analysis/">The Cohort Analysis</a> by Fred Wilson (Union Square Ventures)</li>
<li><a href="http://www.quora.com/What-exactly-is-cohort-analysis">What exactly is cohort analysis?</a> on Quora</li>
</ul>
<hr class="small" id="footnotes">
<p>
1. While a purchase might not be at the core of these businesses, they still might occur (e.g. "Buy" buttons on tweets are of value to Twitter, but users and engagement are what the platform is about).</p>Nonsensical beer reviews via Markov chains2015-03-30T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2015-03-30:/2015/03/30/beer-review-markov-chains/<p>I’ve had a bunch of beer reviews and ratings data sitting on my hard drive for about a year. For a beer nerd like me, that’s a pretty cool dataset, yet I’ve let it collect digital dust.</p>
<p>Fast forward to last week, where somehow I wound up in the Wikipedia Death Spiral. You know what I mean - you click a link to a Wikipedia article, that article takes you to a new one, then you’re on another, and another … we’ve all been there. And it’s kind of awesome.</p>
<p>Fast forward to last week, where somehow I wound up in the Wikipedia Death Spiral. You know what I mean - you click a link to a Wikipedia article, that article takes you to a new one, then you’re on another, and another … we’ve all been there. And it’s kind of awesome.</p>
<p>Well, the rabbit hole led me to <a href="http://en.wikipedia.org/wiki/Markov_chain">Markov chains</a>, which seemed like a good excuse to mess around with that beer review data.</p>
<h2>What are Markov chains?</h2>
<p>A Markov chain is a random process that transitions between various states, where the probability distribution of the “next state” depends only on the current state.</p>
<p>Imagine we have the following sequence of days, where S indicates it was sunny and R indicates it was rainy:</p>
<blockquote>
<p>S S R R S R S S R R R R S R S S S R</p>
</blockquote>
<p>Let’s pick a random beginning “state” - let’s just say it’s S (sunny). The next state is based <strong>only</strong> on the current state. Since our current state is S, we only need to look at observations immediately following a sunny day.</p>
<p>To illustrate, let’s look at the weather pattern again, this time putting the observations to be considered in bold.</p>
<blockquote>
<p>S <strong>S</strong> <strong>R</strong> R S <strong>R</strong> S <strong>S</strong> <strong>R</strong> R R R S <strong>R</strong> S <strong>S</strong> <strong>S</strong> <strong>R</strong></p>
</blockquote>
<p>Even though there are 18 observations, only nine need to be considered for the possible next state. Of the nine, four are S and five are R, giving us a 44% (4/9) chance of the next state being sunny and a 56% (5/9) chance of it being rainy.</p>
<p>Now, let’s assume our beginning state (S) transitioned to a second state of R (which it had a 56% chance of doing). Here are the states we need to consider for the possible third state:</p>
<blockquote>
<p>S S R <strong>R</strong> <strong>S</strong> R <strong>S</strong> S R <strong>R</strong> <strong>R</strong> <strong>R</strong> <strong>S</strong> R <strong>S</strong> S S R</p>
</blockquote>
<p>There’s an equal chance (4/8) the third state will be S or R.</p>
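<p>The tallying above can also be done mechanically. Here is a minimal sketch that counts first-order transitions from the same weather sequence:</p>

```python
from collections import Counter, defaultdict

sequence = 'S S R R S R S S R R R R S R S S S R'.split()

# For each state, count which state immediately follows it.
transitions = defaultdict(Counter)
for current, nxt in zip(sequence, sequence[1:]):
    transitions[current][nxt] += 1

# The distribution of the next state, given a current state of S.
total = sum(transitions['S'].values())
probs = {state: count / total for state, count in transitions['S'].items()}
print(probs)  # S: 4/9, R: 5/9
```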
<p>With a second-order Markov chain, the current state is two observations. Let’s assume a beginning state of SR and use the same weather sequence as above, again putting the possible next states in bold.</p>
<blockquote>
<p>S S R <strong>R</strong> S R <strong>S</strong> S R <strong>R</strong> R R S R <strong>S</strong> S S R</p>
</blockquote>
<p>This time there are only four observations to consider as possible “next states,” with an equal chance it’ll be S or R.</p>
<p>Let’s assume the “next state” picked is R. Now our current (second) state is RR - the S from our beginning state is forgotten. The following are possible third states:</p>
<blockquote>
<p>S S R R <strong>S</strong> R S S R R <strong>R</strong> <strong>R</strong> <strong>S</strong> R S S S R</p>
</blockquote>
<p>Again, there’s an equal chance of our third state being S or R.</p>
<p>We can continue picking “next states” and eventually we’ll have generated a random, yet probabilistic sequence of weather.</p>
<p>These same principles can be used to generate a sentence from text data - pick a random beginning state (word) from the text and then pick the next word based on the likelihood of it occurring, given the current word. A first-order Markov sentence would have a one word current state, a second-order would have a two word current state, … and so on.</p>
<p>The larger the corpus and the higher the order, the more sense these Markov generated sentences make. Good thing I have a lot of beer reviews.</p>
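<p>To make that concrete, here is a minimal second-order generator (a sketch of the technique, not the bot’s actual code):</p>

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each (word, word) pair to the list of words observed after it."""
    chain = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        chain[(a, b)].append(c)
    return chain

def generate(chain, length=15, seed=None):
    """Walk the chain from a random starting pair, one word at a time."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    for _ in range(length - 2):
        followers = chain.get(state)
        if not followers:  # dead end: this pair only appears at the corpus's tail
            break
        nxt = rng.choice(followers)
        out.append(nxt)
        state = (state[1], nxt)
    return ' '.join(out)

# A tiny stand-in corpus; the real bot trained on thousands of reviews.
corpus = ("pours a hazy golden color with a finger of white head "
          "and a hazy golden body with notes of citrus and pine").split()
print(generate(build_chain(corpus), seed=1))
```

<p>Because the pair (“hazy”, “golden”) is followed by both “color” and “body” in the corpus, the walk can splice the two phrasings together, which is exactly where the nonsense comes from.</p>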
<h2>The (mini) project</h2>
<p>This seemed ripe for a Twitter bot, so I created <a href="https://twitter.com/BeerSnobSays">BeerSnobSays</a>, which tweets nonsensical beer reviews generated via second-order Markov chains.</p>
<p>Not everything it tweets makes much sense:</p>
<blockquote>
<p>dissipates about a finger of head and some mild spice interwoven and even beer at a local Greek restaurant.</p>
<p>a big thumbs up though and there are plenty other choices that I was really no distinguishing characteristics that stand out.</p>
<p>those who are looking for a beer best characteristic of this beer into the hype and the lager style that is unwelcome.</p>
</blockquote>
<p>But some of it is pretty funny:</p>
<blockquote>
<p>off by itself, the taste of apple juice colored brew with a nice warming alcohol bathes your noodle in its dryness.</p>
<p>is almost like sour grains with a hint of booze in the finish, with sweet orange peels and pine sap.</p>
<p>a charred woodiness and smoke can run into pineapple, oranges and citrusy oils with a clean alcohol sting at the bottom of the recipe.</p>
<p>the berry aspect is evident but the tartness and dryness from the beer starts off surprisingly pleasant.</p>
</blockquote>
<p>I’m not sure if that last one’s from the bot or a famous poet.</p>
<p>You can <a href="https://twitter.com/gjreda">follow me</a> and <a href="https://twitter.com/BeerSnobSays">BeerSnobSays</a> on Twitter. You can also find the code for the bot <a href="https://github.com/gjreda/beer-snob-says">on GitHub</a>.</p>Using Travis & GitHub to deploy static sites2015-03-26T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2015-03-26:/2015/03/26/static-site-deployments/<p><strong>Update:</strong> As of December 2020, Travis CI has stopped being free for open source projects. If you've been using Travis to deploy your static site, I recommend migrating to Github Actions. I've written a post about how to do so <a href="/2020/12/09/deploying-static-sites-with-github-actions/">here</a>.</p>
<hr>
<p>I’m an unabashed supporter of “Keep It Simple, Stupid” solutions - it’s the reason I use <a href="http://docs.getpelican.com/en/3.5.0/">Pelican</a> for this website and host it on <a href="http://aws.amazon.com/s3/">S3</a>.</p>
<p>However, I haven’t been completely satisfied with the process of writing a new post or making changes to <a href="https://github.com/gjreda/void">my theme</a>. It’s felt repetitive - make a change, generate site, check change, regenerate site, and eventually push to S3. Due to the extra steps of generating and pushing, I never felt able to focus on just the change at hand.</p>
<p>I wanted to focus, but also maintain the flexibility of a static site.</p>
<p>Enter <a href="https://travis-ci.org">TravisCI</a>. Travis is a <a href="http://en.wikipedia.org/wiki/Continuous_integration">continuous integration</a> (CI) service that integrates with GitHub. Set up a <code>.travis.yml</code> file, check your code into GitHub, and Travis will build the project based on the steps laid out in your <code>.travis.yml</code>. A common use-case of CI is automatically running a test suite against each new commit to make sure a change didn’t break functionality of the app.</p>
<p>Since Pelican is just a Python application, and Travis has S3 integration, I’m now using it to regenerate and deploy my site every time I push a change to it on GitHub.</p>
<p>If you’re using Pelican (or any other static site generator) and hosting on S3, here’s how to set things up.</p>
<h2>Setup</h2>
<p>First, sign up for Travis - you’ll just need to log in with your GitHub account. Travis will then sync with your GitHub repos. Turn on the GitHub repo(s) you’ll be using it with. For me, it’s just my website.</p>
<p><img alt="travis-enabled-repo" src="/images/travis-enabled-repo.png"></p>
<p>Next, create a new <a href="http://aws.amazon.com/iam/">Identity & Access Management</a> (IAM) user on AWS for Travis. Make note of the security credentials - the Access Key ID and Secret Access Key. You’ll need these later.</p>
<p>Also, since this user will need to write files to S3, make sure it has the <em>AmazonS3FullAccess</em> policy. To do so, click on your new user in the IAM dashboard, click “Attach Policy” (in the Managed Policies section), select <em>AmazonS3FullAccess</em>, and attach. Done.</p>
<p><img alt="attach-s3-policy" src="/images/attach-s3-policy.png"></p>
<p>Now, you’ll need to add your AWS Access Key ID and Secret Access Key to your repo’s environment variables in Travis. These are needed in order to write your site’s files to S3.</p>
<p><img alt="travis-environment-variables" src="/images/travis-env-variables.png"></p>
<p>Lastly, you’ll need to add a <code>.travis.yml</code> file to the root of your project. This tells Travis how to build the application (in this case, a static site generator). Here’s what <a href="https://github.com/gjreda/gregreda.com/blob/master/.travis.yml">mine</a> looks like:</p>
<div class="highlight"><pre><span></span><code><span class="n">language</span><span class="o">:</span><span class="w"> </span><span class="n">python</span><span class="w"></span>
<span class="n">python</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"2.7"</span><span class="w"></span>
<span class="n">cache</span><span class="o">:</span><span class="w"> </span><span class="n">apt</span><span class="w"></span>
<span class="n">install</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"sudo apt-get install pandoc"</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"pip install -r requirements.txt"</span><span class="w"></span>
<span class="n">script</span><span class="o">:</span><span class="w"> </span><span class="s2">"pelican content/"</span><span class="w"></span>
<span class="n">deploy</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">provider</span><span class="o">:</span><span class="w"> </span><span class="n">s3</span><span class="w"></span>
<span class="w"> </span><span class="n">access_key_id</span><span class="o">:</span><span class="w"> </span><span class="n">$AWS_ACCESS_KEY</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="n">declared</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">Travis</span><span class="w"> </span><span class="n">repo</span><span class="w"> </span><span class="n">settings</span><span class="w"></span>
<span class="w"> </span><span class="n">secret_access_key</span><span class="o">:</span><span class="w"> </span><span class="n">$AWS_SECRET_KEY</span><span class="w"></span>
<span class="w"> </span><span class="n">bucket</span><span class="o">:</span><span class="w"> </span><span class="n">www</span><span class="o">.</span><span class="na">gregreda</span><span class="o">.</span><span class="na">com</span><span class="w"></span>
<span class="w"> </span><span class="n">endpoint</span><span class="o">:</span><span class="w"> </span><span class="n">www</span><span class="o">.</span><span class="na">gregreda</span><span class="o">.</span><span class="na">com</span><span class="o">.</span><span class="na">s3</span><span class="o">-</span><span class="n">website</span><span class="o">-</span><span class="n">us</span><span class="o">-</span><span class="n">east</span><span class="o">-</span><span class="mi">1</span><span class="o">.</span><span class="na">amazonaws</span><span class="o">.</span><span class="na">com</span><span class="w"></span>
<span class="w"> </span><span class="n">region</span><span class="o">:</span><span class="w"> </span><span class="n">us</span><span class="o">-</span><span class="n">east</span><span class="o">-</span><span class="mi">1</span><span class="w"></span>
<span class="w"> </span><span class="n">skip_cleanup</span><span class="o">:</span><span class="w"> </span><span class="kc">true</span><span class="w"></span>
<span class="w"> </span><span class="n">local</span><span class="o">-</span><span class="n">dir</span><span class="o">:</span><span class="w"> </span><span class="n">output</span><span class="w"></span>
<span class="w"> </span><span class="n">acl</span><span class="o">:</span><span class="w"> </span><span class="n">public_read</span><span class="w"></span>
<span class="w"> </span><span class="n">detect_encoding</span><span class="o">:</span><span class="w"> </span><span class="kc">true</span><span class="w"></span>
<span class="n">notifications</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">email</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">on_failure</span><span class="o">:</span><span class="w"> </span><span class="n">always</span><span class="w"></span>
</code></pre></div>
<p>Here’s a quick rundown:</p>
<ul>
<li><code>language</code> - The language in which the application is written. Since we’re using Pelican, it’s Python, but Travis supports a variety of languages. We also specify a version on the next line.</li>
<li><code>install</code> - This tells Travis any dependencies that need to be installed via apt-get. Some of my posts have IPython Notebook integration, which uses <a href="http://johnmacfarlane.net/pandoc/">pandoc</a>. I’m also using pip to install the required Python packages (like Pelican).</li>
<li><code>script</code> - Your build command. In this case, it’s just <code>pelican content/</code>, which generates the static site based off of what’s in the content directory. By default, Pelican writes the site to a local directory called <code>output</code>, which we need in the deploy step.</li>
<li><code>deploy</code> - Since Travis has <a href="http://docs.travis-ci.com/user/deployment/s3/">S3 deployment</a> built-in, all we need to do is tell it which directory (<code>local-dir</code>) to put where (your <code>bucket</code> and its related <code>endpoint</code> and <code>region</code>). Note that we’re also using our AWS keys - the variable names used here must match the names we provided in the environment variables section earlier.</li>
<li><code>notifications</code> - By default, Travis will email you the results of each build. I’ve configured mine to email only when a build fails, but there are other <a href="http://docs.travis-ci.com/user/notifications/">notification options</a> as well.</li>
</ul>
<p>The above is really just a subset of the functionality Travis provides - you can even declare scripts to be run before and after install, or before and after your deploy. Check out the <a href="http://docs.travis-ci.com/user/build-configuration/">build configuration</a> section of the docs if you’re interested in learning more.</p>
<p>Now, every time I push a commit to GitHub, Travis will clone my repo, <code>cd</code> to it, build, and deploy my site all based on what’s in my <code>.travis.yml</code> file. And I get to focus on writing.</p>
<p>Have questions? <a href="https://twitter.com/gjreda">Let me know</a>.</p>Web Scraping 201: finding the API2015-02-15T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2015-02-15:/2015/02/15/web-scraping-finding-the-api/<p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python</a>, showing how to use multithreading to speed things up.</li>
<li><a href="http://www.gregreda.com/2020/11/17/scraping-pages-behind-login-forms/">Scraping Pages Behind Login Forms</a>, which shows how to log into sites using Python.</li>
</ol>
<hr>
<p><strong>Update</strong>: Sorry folks, it looks like the NBA doesn't make shot log data accessible anymore. The same principles of this post still apply, but the particular example used is no longer functional. I do not intend to rewrite this post.</p>
<hr>
<p>Previously, I explained <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">how to scrape a page</a> where the data is rendered <em>server-side</em>. However, the increasing popularity of Javascript frameworks such as <a href="https://angularjs.org">AngularJS</a> coupled with <a href="http://en.wikipedia.org/wiki/Representational_state_transfer#Applied_to_web_services">RESTful APIs</a> means that fewer sites are generated server-side and are instead being rendered <em>client-side</em>.</p>
<p>In this post, I’ll give a brief overview of the differences between the two and show how to find the underlying API, allowing you to get the data you’re looking for.</p>
<h2>Server-side vs client-side</h2>
<p>Imagine we have a database of sports statistics and would like to build a web application on top of it (e.g. something like <a href="http://www.basketball-reference.com/">Basketball Reference</a>).</p>
<p>If we build our web app using a server-side framework like <a href="https://www.djangoproject.com/">Django</a> [1], something akin to the following happens each time a user visits a page.</p>
<ol>
<li>User’s browser sends a request to the server hosting our application.</li>
<li>Our server processes the request, checking to make sure the URL requested exists (amongst other things).</li>
<li>If the requested URL does not exist, send an error back to the user’s browser and direct them to a <a href="http://en.wikipedia.org/wiki/HTTP_404#Custom_error_pages">404 page</a>.</li>
<li>If the requested URL does exist, execute some code <em>on the server</em> which gets data from our database. Let’s say the user wants to see <a href="http://www.basketball-reference.com/players/w/walljo01/gamelog/2015/">John Wall’s game-by-game stats</a> for the 2014-15 NBA season. In this case, our Django/Python code queries the database and receives the data.</li>
<li>Our Django/Python code injects the data into our application’s <a href="http://en.wikipedia.org/wiki/Web_template_system">templates</a> to complete the HTML for the page.</li>
<li>Finally, the server sends the HTML to the user’s browser (a <em>response</em> to their <em>request</em>) and the page is displayed.</li>
</ol>
<p>To illustrate the last step, go to <a href="http://www.basketball-reference.com/players/w/walljo01/gamelog/2015/">John Wall’s game log</a> and <a href="view-source:http://www.basketball-reference.com/players/w/walljo01/gamelog/2015/">view the page source</a>. Ctrl+f or Cmd+f and search for “2014-10-29”. This is the first row of the game-by-game stats table. We know the page was created server-side because the data is present in the page source.</p>
<p>However, if the web application is built with a client-side framework like Angular, the process is slightly different. In this case, the server still sends the static content (the HTML, CSS, and Javascript), but the HTML is only a template - it doesn’t hold any data. Separately, the Javascript in the server response fetches the data from an API and uses it to create the page <em>client-side</em>.</p>
<p>To illustrate, view the source of <a href="http://stats.nba.com/player/#!/202322/tracking/shotslogs/">John Wall’s shot log</a> page on NBA.com - there’s no data to scrape! <a href="view-source:http://stats.nba.com/player/#!/202322/tracking/shotslogs/">See for yourself</a>. Ctrl+f or Cmd+f for “Was @”. Despite there being many instances of it in the shot log table, none are found in the page source.</p>
<p>If you’re thinking “Oh crap, I can’t scrape this data,” well, you’re in luck! Applications using an API are often <em>easier</em> to scrape - you just need to know how to find the API. Which means I should probably tell you how to do that.</p>
<h2>Finding the API</h2>
<p>With a client-side app, your browser is doing much of the work. And because your browser is what’s rendering the HTML, we can use it to see where the data is coming from using its built-in developer tools.</p>
<p>To illustrate, I’ll be using Chrome, but Firefox should be more or less the same (Internet Explorer users … you should switch to Chrome or Firefox and not look back).</p>
<p>To open Chrome’s Developer Tools, go to View -> Developer -> Developer Tools. In Firefox, it’s Tools -> Web Developer -> Toggle Tools. We’ll be using the Network tab, so click on that one. It should be empty.</p>
<p>Now, go to the page that has your data. In this case, it’s <a href="http://stats.nba.com/player/#!/202322/tracking/shotslogs/">John Wall’s shot logs</a>. If you’re already on the page, hit refresh. Your Network tab should look similar to this:</p>
<p><img alt="network tab example" src="/images/scraping-network-tab.png"></p>
<p>Next, click on the XHR filter. XHR is short for <a href="http://en.wikipedia.org/wiki/XMLHttpRequest">XMLHttpRequest</a> - this is the type of request used to fetch XML or JSON data. You should see a couple entries in this table (screenshot below). One of them is the API request that returns the data you’re looking for (in this case, John Wall’s shots).</p>
<p><img alt="XHR requests example" src="/images/scraping-xhr-tab.png"></p>
<p>At this point, you’ll need to explore a bit to determine which request is the one you want. For our example, the one starting with “playerdashptshotlog” sounds promising. Let’s click on it and view it in the Preview tab. Things should now look like this:</p>
<p><img alt="API response preview" src="/images/scraping-api-preview.png"></p>
<p>Bingo! That’s the API endpoint. We can use the Preview tab to explore the response.</p>
<p><img alt="API results preview" src="/images/scraping-api-results-preview.png"></p>
<p>You should see a couple of objects:</p>
<ol>
<li>The resource name - <em>playerdashptshotlog</em>.</li>
<li>The parameters (you might need to expand the resource section). These are the request parameters that were passed to the API. You can think of them like the <code>WHERE</code> clause of a SQL query. This request has parameters of <code>Season=2014-15</code> and <code>PlayerID=202322</code> (amongst others). Change the parameters in the URL and you’ll get different data (more on that in a bit).</li>
<li>The result sets. This is self-explanatory.</li>
<li>Within the result sets, you’ll find the headers and row set. Each object in the row set is essentially the result of a database query, while the headers tell you the column order. We can see that the first item in each row corresponds to the Game_ID, while the second is the Matchup.</li>
</ol>
<p>Now, if you go to the Headers tab, grab the request URL, and open it in a new browser tab, you’ll see the data we’re looking for (example below). Note that I'm using <a href="https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc?hl=en">JSONView</a>, which nicely formats JSON in your browser.</p>
<p><img alt="API response" src="/images/scraping-api-response.png"></p>
<p>To grab this data, we can use something like Python’s <a href="http://docs.python-requests.org/en/latest/">requests</a>. Here’s an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">requests</span>
<span class="n">shots_url</span> <span class="o">=</span> <span class="s1">'http://stats.nba.com/stats/playerdashptshotlog?'</span><span class="o">+</span> \
<span class="s1">'DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&'</span> <span class="o">+</span> \
<span class="s1">'Location=&Month=0&OpponentTeamID=0&Outcome=&Period=0&'</span> <span class="o">+</span> \
<span class="s1">'PlayerID=202322&Season=2014-15&SeasonSegment=&'</span> <span class="o">+</span> \
<span class="s1">'SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision='</span>
<span class="c1"># request the URL and parse the JSON</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">shots_url</span><span class="p">)</span>
<span class="n">response</span><span class="o">.</span><span class="n">raise_for_status</span><span class="p">()</span> <span class="c1"># raise exception if invalid response</span>
<span class="n">shots</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()[</span><span class="s1">'resultSets'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s1">'rowSet'</span><span class="p">]</span>
<span class="c1"># do whatever we want with the shots data</span>
<span class="n">do_things</span><span class="p">(</span><span class="n">shots</span><span class="p">)</span>
</code></pre></div>
<p>That’s it. Now you have the data and can get to work.</p>
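<p>Since the response pairs one <code>headers</code> list with many <code>rowSet</code> rows, it's often convenient to zip them together so each row becomes a dictionary keyed by column name. A small sketch - the headers and rows below are illustrative, not the API's full schema:</p>

```python
def rows_to_dicts(result_set):
    """Pair each row in rowSet with the result set's headers."""
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]

# Illustrative structure only -- mirrors the headers/rowSet shape above
example = {
    "headers": ["GAME_ID", "MATCHUP", "SHOT_NUMBER", "SHOT_MADE_FLAG"],
    "rowSet": [["0021400010", "WAS @ MIA", 1, 1],
               ["0021400010", "WAS @ MIA", 2, 0]],
}
records = rows_to_dicts(example)  # each record is now a dict keyed by column
```

From here, each shot can be accessed by column name (e.g. <code>records[0]["MATCHUP"]</code>) rather than by positional index.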
<p>Note that passing different parameter values to the API yields different results. For instance, change the Season parameter to 2013-14 - now you have John Wall’s shots for the 2013-14 season. Change the PlayerID to 201935 - now you have James Harden’s shots.</p>
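<p>Rather than concatenating that long query string by hand as above, you can let the standard library build it. A sketch - the parameter names come from the request we inspected, though the live API may also insist on the remaining empty parameters shown earlier:</p>

```python
from urllib.parse import urlencode  # urllib.urlencode on Python 2

def shot_log_url(player_id, season):
    """Build a playerdashptshotlog URL for a given player and season."""
    base = "http://stats.nba.com/stats/playerdashptshotlog"
    params = {"PlayerID": player_id, "Season": season,
              "SeasonType": "Regular Season", "LeagueID": "00", "TeamID": "0"}
    return base + "?" + urlencode(params)

url = shot_log_url("201935", "2013-14")  # James Harden, 2013-14 season
```

Swapping the arguments is all it takes to pull a different player or season.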
<p>Additionally, different APIs return different types of data. Some might send XML; others, JSON. Some might store the results in an array of arrays; others, an array of maps or dictionaries. Some might not return the column headers at all. Things vary between sites.</p>
<p>Ever been in a situation where you couldn't find the data you were looking for in the page source? Well, now you know how to find it.</p>
<p><em>Was there something I missed? Have questions? <a href="https://twitter.com/gjreda">Let me know</a>.</em></p>
<hr class="small">
<p>[1] Really this can be any server-side framework - Ruby on Rails, PHP’s Drupal or CodeIgniter, etc.</p>[Talk] Translating SQL to pandas2014-12-22T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2014-12-22:/2014/12/22/translating-sql-to-pandas-video/<p>A few weeks ago, I gave a <a href="http://pandas.pydata.org/">pandas</a> tutorial at <a href="http://pydata.org/nyc2014/">PyData NYC</a> titled "Translating SQL to pandas. And back." I don't remember why I put the "And back" in there - if you can translate things one way, you can translate them the other way, too.</p>
<p>Anyway, here's the abstract:</p>
<blockquote>
<p>SQL …</p></blockquote><p>A few weeks ago, I gave a <a href="http://pandas.pydata.org/">pandas</a> tutorial at <a href="http://pydata.org/nyc2014/">PyData NYC</a> titled "Translating SQL to pandas. And back." I don't remember why I put the "And back" in there - if you can translate things one way, you can translate them the other way, too.</p>
<p>Anyway, here's the abstract:</p>
<blockquote>
<p>SQL is still the bread-and-butter of the data world, and data analysts/scientists/engineers need to have some familiarity with it as the world runs on relational databases.</p>
<p>When first learning pandas (and coming from a database background), I found myself wanting to be able to compare equivalent pandas and SQL statements side-by-side, knowing that it would allow me to pick up the library quickly, but most importantly, apply it to my workflow.</p>
<p>This tutorial will provide an introduction to both syntaxes, allowing those inexperienced with either SQL or pandas to learn a bit of both, while also bridging the gap between the two, so that practitioners of one can learn the other from their perspective. Additionally, I'll discuss the tradeoffs between each and why one might be better suited for some tasks than the other.</p>
</blockquote>
<p>Having never been to a technical conference, much less given a talk at one, it was quite a new experience for me - and something I'd like to do again.</p>
<p>I highly recommend giving a talk at an event like PyData if you ever have the opportunity. And if you think you don't have anything interesting to say, or aren't experienced enough to give a tutorial, or are just plain nervous ... don't worry, I felt all those things too. You should do it anyway.</p>
<p>Below is the video of my talk. You can find the accompanying materials <a href="https://github.com/gjreda/pydata2014nyc">here</a>.</p>
<div class="center">
<iframe width="560" height="315" src="//www.youtube.com/embed/1uVWjdAbgBg" allowfullscreen></iframe>
</div>Scraping Craigslist for sold out concert tickets2014-07-27T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2014-07-27:/2014/07/27/scraping-craigslist-for-tickets/<p>Recently, I've been listening to a lot of lo-fi rock band, <a href="http://en.wikipedia.org/wiki/Cloud_Nothings">Cloud Nothings</a>. Their album, <a href="http://www.amazon.com/gp/product/B00HZJH97Q/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B00HZJH97Q&linkCode=as2&tag=gjreda-20&linkId=H7HYP35ZYKFAKH7H">Here & Nowhere Else</a>, has been <a href="http://www.metacritic.com/music/here-and-nowhere-else/cloud-nothings">critically lauded</a>, including <a href="http://pitchfork.com/reviews/albums/19075-cloud-nothings-here-and-nowhere-else/">garnering "Best New Music" from Pitchfork</a>. As a result, when they came to Chicago's tiny Lincoln Hall in May, tickets sold out in a hurry - well before …</p><p>Recently, I've been listening to a lot of lo-fi rock band, <a href="http://en.wikipedia.org/wiki/Cloud_Nothings">Cloud Nothings</a>. Their album, <a href="http://www.amazon.com/gp/product/B00HZJH97Q/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B00HZJH97Q&linkCode=as2&tag=gjreda-20&linkId=H7HYP35ZYKFAKH7H">Here & Nowhere Else</a>, has been <a href="http://www.metacritic.com/music/here-and-nowhere-else/cloud-nothings">critically lauded</a>, including <a href="http://pitchfork.com/reviews/albums/19075-cloud-nothings-here-and-nowhere-else/">garnering "Best New Music" from Pitchfork</a>. As a result, when they came to Chicago's tiny Lincoln Hall in May, tickets sold out in a hurry - well before I found out about the show. Desperately wanting to go, I started checking Craigslist every day or two for tickets.</p>
<p>Lincoln Hall only holds about 500 people, so Craigslist postings were few and far between. When a post did pop up, I always ended up seeing it a couple hours after it was posted and was too late - the tickets had been sold. Noticing that my frustration was beginning to grow, I figured it was time to automate my Craigslist searches for tickets.</p>
<p>If you search on Craigslist and look at the URL of the results page, you'll notice that it looks very similar to this:</p>
<p><img alt="Craigslist Search Results URL" src="/images/craigslist-search-results-url.png"></p>
<p>Note the section that says <code>query=this+is+my+search+term</code> - that's where your search term gets passed to the databases that back Craigslist (with spaces replaced by + signs). This means we can write code to automate any "for sale" search by hitting <code>http://<city>.craigslist.org/search/sss?query=<term></code> where <code><city></code> corresponds to the subdomain of your city's respective Craigslist and <code><term></code> is our search term.</p>
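<p>That substitution is easy to script. A quick sketch using the standard library, which also handles the space-to-plus replacement for us:</p>

```python
from urllib.parse import quote_plus  # urllib.quote_plus on Python 2

def search_url(city, term):
    """Build a Craigslist 'for sale' search URL for a city and search term."""
    return "http://{0}.craigslist.org/search/sss?query={1}".format(
        city, quote_plus(term))
```

For example, <code>search_url("chicago", "cloud nothings")</code> yields the kind of URL shown above.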
<p>For my use case, there were very few Craigslist results for each search of "Cloud Nothings" and none of them were spammy. I decided to write a script which would run every 10 minutes and send me a text message if any of the results were new. If I got a text, I could quickly head over to Craigslist, email the seller, and go back about my day. I was lucky that ticket brokers hadn't started putting "Cloud Nothings" in their spammy posts - if they had, this solution likely would not have worked - the text messages would have been more noise than signal.</p>
<p>Thankfully, it worked. I was able to get a ticket for face value two nights before the show.</p>
<p>In the sections below, I'll walk through the code behind it all. If you're unfamiliar with web scraping, I suggest reading my previous posts <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">here</a> and <a href="http://www.gregreda.com/2013/05/06/more-web-scraping-with-python/">here</a>.</p>
<h3>Code Walk-Through</h3>
<p>Most of the code's functionality is contained within the four functions below.</p>
<h4>parse_results</h4>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">parse_results</span><span class="p">(</span><span class="n">search_term</span><span class="p">):</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">search_term</span> <span class="o">=</span> <span class="n">search_term</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">' '</span><span class="p">,</span> <span class="s1">'+'</span><span class="p">)</span>
<span class="n">search_url</span> <span class="o">=</span> <span class="n">BASE_URL</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">search_term</span><span class="p">)</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">urlopen</span><span class="p">(</span><span class="n">search_url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
<span class="n">rows</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'div'</span><span class="p">,</span> <span class="s1">'content'</span><span class="p">)</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">'p'</span><span class="p">,</span> <span class="s1">'row'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">rows</span><span class="p">:</span>
<span class="n">url</span> <span class="o">=</span> <span class="s1">'http://chicago.craigslist.org'</span> <span class="o">+</span> <span class="n">row</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s1">'href'</span><span class="p">]</span>
<span class="n">create_date</span> <span class="o">=</span> <span class="n">row</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'span'</span><span class="p">,</span> <span class="s1">'date'</span><span class="p">)</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">row</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">'a'</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span>
<span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">({</span><span class="s1">'url'</span><span class="p">:</span> <span class="n">url</span><span class="p">,</span> <span class="s1">'create_date'</span><span class="p">:</span> <span class="n">create_date</span><span class="p">,</span> <span class="s1">'title'</span><span class="p">:</span> <span class="n">title</span><span class="p">})</span>
<span class="k">return</span> <span class="n">results</span>
</code></pre></div>
<p>The above function takes a <code>search_term</code>, which is used to execute a search on Craigslist. It returns a list of dictionaries, where each dictionary represents a post found within the search results.</p>
<p>Note the global <code>BASE_URL</code> variable - this is the search results URL mentioned earlier. Here, we're injecting our search term into the section of the URL that had <code>query=<term></code>.</p>
<p>The majority of this function utilizes <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> to parse the HTML of Craigslist's search results page. For each post in the search results, we store the URL of the post, its creation date, and its title.</p>
<p>In the next function, we'll write these results to a CSV file, which we'll later use to check whether or not there are "new" posts.</p>
<h4>write_results</h4>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">write_results</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
<span class="sd">"""Writes list of dictionaries to file."""</span>
<span class="n">fields</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'results.csv'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">dw</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictWriter</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">fields</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s1">'|'</span><span class="p">)</span>
<span class="n">dw</span><span class="o">.</span><span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">dw</span><span class="o">.</span><span class="n">fieldnames</span><span class="p">)</span>
<span class="n">dw</span><span class="o">.</span><span class="n">writerows</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
</code></pre></div>
<p>As mentioned above, <code>write_results</code> takes a list of dictionaries and writes them to a CSV file called <code>results.csv</code>. Each line of the file will store a post's title, create date, and URL.</p>
<p>You can think of this file similarly to how you might think of a database - we're storing information that we'll need to refer to later on. Since we aren't storing much data, there's really no need to use something like SQLite, MySQL or any other datastore - a text file works just fine for our use case. I'm a big proponent of <a href="http://en.wikipedia.org/wiki/KISS_principle">KISS methodology</a> (Keep It Simple, Stupid).</p>
<h4>has_new_records</h4>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">has_new_records</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
<span class="n">current_posts</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">results</span><span class="p">]</span>
<span class="n">fields</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="s1">'results.csv'</span><span class="p">):</span>
<span class="k">return</span> <span class="kc">True</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'results.csv'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictReader</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">fieldnames</span><span class="o">=</span><span class="n">fields</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s1">'|'</span><span class="p">)</span>
<span class="n">seen_posts</span> <span class="o">=</span> <span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s1">'url'</span><span class="p">]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">]</span>
<span class="n">is_new</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">for</span> <span class="n">post</span> <span class="ow">in</span> <span class="n">current_posts</span><span class="p">:</span>
<span class="k">if</span> <span class="n">post</span> <span class="ow">in</span> <span class="n">seen_posts</span><span class="p">:</span>
<span class="k">pass</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">is_new</span> <span class="o">=</span> <span class="kc">True</span>
<span class="k">return</span> <span class="n">is_new</span>
</code></pre></div>
<p>This function determines whether or not any of the posts are new (not present in the results from the last time our code was run).</p>
<p>It takes a list of dictionaries (exactly the same as the one <code>parse_results</code> returns) and checks it against the CSV file we created with the <code>write_results</code> function. Since a URL can only point to one post, we can consider it a <a href="http://en.wikipedia.org/wiki/Unique_key">unique key</a> to check against.</p>
<p>If any of the URLs in results are not found within the CSV file, this function will return <code>True</code>, which we'll use as a trigger to sending off a text message as notification.</p>
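<p>Because the URL is a unique key, the same check can be written as a set difference, which additionally tells you <em>which</em> posts are new rather than just whether any are:</p>

```python
def new_posts(current_urls, seen_urls):
    """URLs present in this run's results but absent from the last run."""
    return set(current_urls) - set(seen_urls)
```

An empty set means nothing new; a non-empty one could go straight into the notification message.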
<h4>send_text</h4>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">send_text</span><span class="p">(</span><span class="n">phone_number</span><span class="p">,</span> <span class="n">msg</span><span class="p">):</span>
<span class="n">fromaddr</span> <span class="o">=</span> <span class="s2">"Craigslist Checker"</span>
<span class="n">toaddrs</span> <span class="o">=</span> <span class="n">phone_number</span> <span class="o">+</span> <span class="s2">"@txt.att.net"</span>
<span class="n">msg</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"From: </span><span class="si">{0}</span><span class="se">\r\n</span><span class="s2">To: </span><span class="si">{1}</span><span class="se">\r\n\r\n</span><span class="si">{2}</span><span class="s2">"</span><span class="p">)</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">fromaddr</span><span class="p">,</span> <span class="n">toaddrs</span><span class="p">,</span> <span class="n">msg</span><span class="p">)</span>
<span class="n">server</span> <span class="o">=</span> <span class="n">smtplib</span><span class="o">.</span><span class="n">SMTP</span><span class="p">(</span><span class="s1">'smtp.gmail.com:587'</span><span class="p">)</span>
<span class="n">server</span><span class="o">.</span><span class="n">starttls</span><span class="p">()</span>
<span class="n">server</span><span class="o">.</span><span class="n">login</span><span class="p">(</span><span class="n">config</span><span class="o">.</span><span class="n">email</span><span class="p">[</span><span class="s1">'username'</span><span class="p">],</span> <span class="n">config</span><span class="o">.</span><span class="n">email</span><span class="p">[</span><span class="s1">'password'</span><span class="p">])</span>
<span class="n">server</span><span class="o">.</span><span class="n">sendmail</span><span class="p">(</span><span class="n">fromaddr</span><span class="p">,</span> <span class="n">toaddrs</span><span class="p">,</span> <span class="n">msg</span><span class="p">)</span>
<span class="n">server</span><span class="o">.</span><span class="n">quit</span><span class="p">()</span>
</code></pre></div>
<p><code>send_text</code> requires two parameters - the first being the 10-digit phone number that will receive the SMS message, and the second being the content of the message.</p>
<p>This function makes use of the <a href="http://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol">Simple Mail Transfer Protocol</a> (or SMTP) as well as AT&T's email-to-SMS gateway (notice the <code>@txt.att.net</code>). This allows us to use a GMail account to send the text message.</p>
<p>Note that if you are not a GMail user or do not use AT&T for your cell phone service, you'll need to make some changes to this function. You can find a list of other email-to-SMS gateways <a href="http://www.emailtextmessages.com/">here</a>.</p>
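<p>One way to make the carrier swappable is a small lookup table. The gateway addresses below are common US ones, but verify them against your carrier - they change over time:</p>

```python
# Common US email-to-SMS gateways -- verify with your carrier, since
# these addresses change over time.
SMS_GATEWAYS = {
    "att": "txt.att.net",
    "verizon": "vtext.com",
    "tmobile": "tmomail.net",
    "sprint": "messaging.sprintpcs.com",
}

def sms_address(phone_number, carrier):
    """Email address that a carrier delivers as a text message."""
    return phone_number + "@" + SMS_GATEWAYS[carrier]
```

Then <code>send_text</code> could build its <code>toaddrs</code> with <code>sms_address(phone_number, "att")</code> instead of hardcoding the gateway.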
<p>Since this function uses my GMail credentials, I've stored them in a separate Python file which I am referencing when I call <code>config.email['username']</code> and <code>config.email['password']</code>. You can find the config setup <a href="https://github.com/gjreda/craigslist-checker/blob/master/config.py">here</a>. Just make sure you don't accidentally check in your GMail credentials if you're putting this on GitHub.</p>
<h4>Putting it all together</h4>
<p>You can take a look at the final script <a href="https://github.com/gjreda/craigslist-checker/blob/master/craigslist.py">here</a>. Feel free to use it however you'd like. Deploying it is as simple as spinning up a micro EC2 instance and setting up a cronjob to run the script as often as you'd like.</p>
<p>Did you like this post? Was there something I missed? <a href="https://twitter.com/gjreda">Let me know on Twitter</a>.</p>Principles of good data analysis2014-03-23T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2014-03-23:/2014/03/23/principles-of-good-data-analysis/<p>Data analysis is hard.</p>
<p>What makes it hard is the intuitive aspect of it - knowing the direction you want to take based on the limited information you have at the moment. Additionally, it's communicating the results and showing <em>why</em> your analysis is right that makes this all the more difficult …</p><p>Data analysis is hard.</p>
<p>What makes it hard is the intuitive aspect of it - knowing the direction you want to take based on the limited information you have at the moment. Additionally, it's communicating the results and showing <em>why</em> your analysis is right that makes this all the more difficult - doing it deeply, at scale, and in a <em>consistent</em> fashion.</p>
<p>Having been a part of many of these deep-dive analyses, I've noticed some "principles" that I've found useful to follow throughout.</p>
<h4>Know your approach</h4>
<p>Before you begin the analysis, know the questions you're trying to answer and what you're trying to accomplish - don't fall into an analytical rabbit hole. Additionally, you should know some basic things about your potential data - what data sources are available to answer the questions? How is that data structured? Is it in a database? CSVs? Third-party APIs? What tools will you be able to use for the analysis?</p>
<p>Your approach will likely change throughout, but it's helpful to start with a plan and adjust.</p>
<h4>Know how the data was generated</h4>
<p>Once you've settled on your approach and data sources, you need to make sure you understand how the data was generated or captured, especially if you are using your own company's data.</p>
<p>For instance, let's say you're a data scientist at Amazon and you're doing some analysis on orders. Let's assume there's a table somewhere in the Amazon world called "orders" that stores data about an order. Does this table store incomplete orders? What is the interaction on Amazon.com that creates a new record in this table? If I start an order and do not <em>fully</em> complete the payment flow, will a record have been written to this table? What <em>exactly</em> does each field in the table mean?</p>
<p>You need to know this level of detail in order to have confidence in your analysis - your audience will ask these questions.</p>
<h4>Profile your data</h4>
<p>Once you're confident you're looking at the right data, you need to develop some familiarity with it. Not only will this allow you to gain a basic understanding of what you're looking at, but it also allows you to gain a certain level of comfort that things are still "right" later on in the analysis.</p>
<p>For example, I was once helping a friend analyze a fairly large time series dataset (~10GB). The results of the analysis didn't intuitively jive with me - something felt off. When digging deeper into the analysis, I decided to plot the events by date and noticed we had two days without any data - that shouldn't have been the case.</p>
<p>Profiling your data early on helps to ensure your work throughout the analysis - you'll notice sooner when something is "off."</p>
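<p>Gaps like the one I mentioned above are easy to catch programmatically. A sketch that flags any calendar days with no events at all:</p>

```python
from datetime import date, timedelta

def missing_days(event_dates):
    """Days between the first and last event that have no data at all."""
    seen = set(event_dates)
    day, last = min(seen), max(seen)
    gaps = []
    while day <= last:
        if day not in seen:
            gaps.append(day)  # a day inside the range with zero events
        day += timedelta(days=1)
    return gaps
```

Running a check like this at the start of an analysis would have surfaced those two missing days immediately.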
<h4>Facet all the things</h4>
<p>I'm increasingly convinced that <a href="http://en.wikipedia.org/wiki/Simpson's_paradox">Simpson's Paradox</a> is one of the most important things for anyone working with data to understand. In cases of Simpson's paradox, a trend appearing in different groups of data disappears when the groups are combined and looked at in aggregate. It illustrates the importance of looking at your data by multiple dimensions.</p>
<p>As an example, take a look at the below table.</p>
<p><img alt="Simpson's paradox (combined)" src="/images/simpsons-paradox-combined.png"></p>
<p>The above table shows admission rates for men and women into the University of California, Berkeley's graduate programs for the fall of 1973. Based on the above numbers, the University was sued for an alleged bias against women. However, when faceting the data by sex AND department, we see women were actually admitted into many departments' graduate programs at a rate higher than men.</p>
<p><img alt="Simpson's paradox (splits)" src="/images/simpsons-paradox-split.png"></p>
<p>This is probably the most infamous case of Simpson's paradox. The folks over at Berkeley's VUDLab have put together a <a href="http://vudlab.com/simpsons/">fantastic visualization</a> allowing you to explore the data further.</p>
<p>When going through your data, do so with Simpson's paradox in mind. It's extremely important to understand how aggregate statistics can be misleading and why looking at your data from multiple facets is necessary.</p>
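<p>A toy numeric example (made-up numbers, not the actual Berkeley data) makes the mechanism concrete: women are admitted at a higher rate in every department, yet lose in aggregate because they applied more heavily to the harder department:</p>

```python
# (admitted, applied) by department and sex -- illustrative numbers only
admissions = {
    ("easy", "men"):   (80, 100),
    ("easy", "women"): (18, 20),
    ("hard", "men"):   (4, 20),
    ("hard", "women"): (25, 100),
}

def rate(*cells):
    """Overall admission rate across one or more (admitted, applied) cells."""
    admitted = sum(a for a, n in cells)
    applied = sum(n for a, n in cells)
    return admitted / float(applied)

# Within each department, women win: 0.90 vs 0.80 and 0.25 vs 0.20.
# In aggregate, men win: 84/120 = 0.70 vs 43/120 = 0.36.
```

The aggregate numbers aren't wrong - they're just answering a different question than the per-department ones.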
<h4>Be skeptical</h4>
<p>In addition to profiling and faceting your data, you <em>need</em> to be skeptical throughout your analysis. If something doesn't look or feel right, it probably isn't. Pore through your data to make sure nothing unexpected is going on, and if there <em>is</em> something unexpected, make sure you understand why it's occurring and are comfortable with it before you proceed.</p>
<p>I'd argue that no data is better than incorrect data in most cases. Make sure the base layer of your analysis is correct.</p>
<h4>Think like a trial lawyer</h4>
<p>A good trial attorney will prepare their case while also considering how the opposition might respond. When the opposition presents a new piece of evidence or testimony, our attorney will (hopefully) have prepared for that very piece, allowing them to counter in a meaningful way.</p>
<p>Much like a good trial attorney, you need to think ahead and consider the audience of your analysis and the questions they might ask. Preparing appropriately for those questions will lend credibility to your work. No one likes to hear "I'm not sure, I didn't look at that" and you don't want to be caught flat-footed.</p>
<h4>Clarify your assumptions</h4>
<p>It's unlikely that your data is perfect, and it probably doesn't capture everything you need to complete a thorough and exhaustive analysis - you'll need to make some assumptions throughout your work. These need to be explicitly stated when you're sharing results.</p>
<p>Additionally, your stakeholders are crucial in helping you determine your assumptions. You should be working with them and other domain experts to ensure your assumptions are logical and unbiased.</p>
<h4>Check your work</h4>
<p>It seems obvious, but people just don't check their work sometimes. Understandably, there are deadlines, quick turnarounds, and last minute requests; however, I can assure you that your audience would rather your results be correct than quick.</p>
<p>I find it useful to regularly check the basic statistics of the data (sums, counts, etc.) throughout an analysis in order to make sure nothing is lost along the way - essentially creating a trail of breadcrumbs I can follow backwards in case something doesn't seem right later on.</p>
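<p>A minimal sketch of that breadcrumb habit, using hypothetical order data - record the base totals up front, then assert they survive each transformation:</p>

```python
import pandas as pd

# Hypothetical orders data; the checks, not the data, are the point.
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'region':   ['east', 'east', 'west', 'west'],
    'revenue':  [100.0, 250.0, 75.0, 300.0],
})

# Breadcrumbs: capture base statistics before transforming anything.
total_rows = len(orders)
total_revenue = orders.revenue.sum()

# ... joins, filters, aggregations ...
by_region = orders.groupby('region').revenue.sum()

# If either assertion fails, something was lost (or duplicated) along the way.
assert by_region.sum() == total_revenue
assert orders.order_id.nunique() == total_rows
```

<p>When an assertion fails mid-analysis, you know exactly which step to walk back to.</p>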
<h4>Communicate</h4>
<p>Lastly, the whole process should be a conversation with stakeholders - don't work in a silo. It's possible your audience isn't necessarily concerned with decimal point accuracy - maybe they just want to understand directional impact.</p>
<p>In the end, remember that data analysis is most often about <em>solving a problem</em> and that problem has stakeholders - you should be working <em>with</em> them to answer the questions that are most important; not necessarily those that are most interesting. Interesting doesn't always mean "valuable."</p>Finding the midpoint of film releases2014-01-23T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2014-01-23:/2014/01/23/film-releases-midpoint/<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<script src="http://d3js.org/d3.v3.min.js"></script>
<style>
.chart {
font: 12px sans-serif;
}
.axis path,
.axis line {
fill: none;
stroke: #000;
shape-rendering: crispEdges;
}
.x.axis path {
/*display: none;*/
}
.line {
fill: none;
stroke: steelblue;
stroke-width: 1.5px;
}
.bar {
fill: steelblue;
}
.overlay {
fill: none;
pointer-events: all;
}
.focus circle {
fill: none;
stroke: steelblue;
}
.mouseover-text {
color: black;
/*font-weight: bold;*/
font-size: 14px;
}
</style>
<blockquote>
<p>"We're talking about Thunderdome. It's from before you were born."</p>
<p>"Most movies are from before I was born."</p>
</blockquote>
<p>That statement spurred a pretty interesting question: <em>what's the date where that statement is no longer true?</em> Put another way, <em>what date in history has an equal number of films made before and after it?</em></p>
<p>My birthday, November 4, 1985, <em>felt</em> like a relatively safe date, but really, no one had a clue if what I said was true, including me. Guesses by about a dozen coworkers included dates from September 1963 all the way to September 2001.</p>
<p>Knowing that <a href="http://imdb.com">IMDB</a> makes their <a href="http://www.imdb.com/interfaces">data publicly available</a>, I decided to find the actual date. Using the most current <a href="ftp://ftp.fu-berlin.de/pub/misc/movies/database/release-dates.list.gz">releases.list file</a> (1/17/14 at the time of writing), I made the following assumptions:</p>
<ol>
<li>
<p>Only films count. The release-dates.list also includes TV shows and video games, which were excluded. Movies that went straight to video do count.</p>
</li>
<li>
<p>Films with a release date in the future do not count.</p>
</li>
<li>
<p>If the film was released multiple times (different release dates for different countries), use the earliest release date.</p>
</li>
<li>
<p>If only a release month and year were provided, assume the 15th of that month.</p>
</li>
<li>
<p>If only a release year was provided, assume <a href="http://en.wikipedia.org/wiki/July_2">July 2nd</a> of that year.</p>
</li>
</ol>
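<p>With the dates parsed under those assumptions, finding the midpoint is just taking the median release date - the date with an equal number of films on either side. A minimal sketch, using a hypothetical handful of dates rather than the full IMDB list:</p>

```python
from datetime import date
import statistics

# Hypothetical release dates; in practice these come from parsing
# release-dates.list with the assumptions above.
releases = [date(1942, 11, 26), date(1977, 5, 25), date(1985, 7, 3),
            date(1999, 3, 31), date(2008, 7, 18), date(2012, 7, 20),
            date(2013, 11, 22)]

# Convert dates to ordinals so we can take a median, then convert back.
ordinals = sorted(d.toordinal() for d in releases)
midpoint = date.fromordinal(int(statistics.median(ordinals)))
```

<p>For an odd number of films the median is an exact date; for an even number, <code>statistics.median</code> averages the two middle ordinals, hence the <code>int()</code> truncation.</p>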
<p>The result?</p>
<div id="vis" class="chart"></div>
<p>May 15, 2002.</p>
<p>Of course, given the current rate at which films are being made, this analysis is already out of date.</p>
<p>For those interested, you can find the code <a href="https://github.com/gjreda/movie-release-timeline">here</a>.</p>
<script>
var path = "/data/movie-releases.tsv"
// dynamically generate chart width
var parentWidth = $("#content").innerWidth();
var margin = {top: 20, right: 50, bottom: 20, left: 50},
width = parentWidth - margin.left - margin.right,
height = (parentWidth/2.0) - margin.top - margin.bottom;
var monthNames = [ "Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
"July", "Aug.", "Sep.", "Oct.", "Nov.", "Dec." ];
var parseDate = d3.time.format("%Y-%m").parse,
bisectDate = d3.bisector(function(d) { return d.release_date; }).left,
formatPercent = function(d) { return (d * 100).toFixed(2) + "%"; },
formatDate = function(d) { return monthNames[d.getMonth()] + " " + d.getFullYear(); };
var x = d3.time.scale().range([0, width]);
var y = d3.scale.linear().range([height, 0]);
var xAxis = d3.svg.axis().scale(x).orient("bottom");
var yAxis = d3.svg.axis().scale(y).orient("left");
var line = d3.svg.line()
.x(function(d) { return x(d.release_date); })
.y(function(d) { return y(d.cumulative); });
var svg = d3.select("#vis").append("svg")
.attr("width", width + margin.left + margin.right)
.attr("height", height + margin.top + margin.bottom)
.append("g")
.attr("transform", "translate(" + margin.left + "," + margin.top + ")");
d3.tsv(path, function(error, data) {
data.forEach(function(d) {
d.release_date = parseDate(d.release_date);
d.percentage = +d.percentage;
d.cumulative = +d.cumulative;
});
x.domain(d3.extent(data, function(d) { return d.release_date; }));
y.domain(d3.extent(data, function(d) { return d.cumulative; }));
svg.append("g")
.attr("class", "x axis")
.attr("transform", "translate(0," + height + ")")
.call(xAxis);
svg.append("g")
.attr("class", "y axis")
.call(yAxis)
.append("text")
.attr("transform", "rotate(-90)")
.attr("y", 6)
.attr("dy", ".71em")
.style("text-anchor", "end")
.text("% of total films");
svg.append("path")
.datum(data)
.attr("class", "line")
.attr("d", line);
// mouseover labels
var focus = svg.append("g")
.attr("class", "focus")
.style("display", "none");
focus.append("circle")
.attr("r", 4.5);
focus.append("text")
.attr("x", 9)
.attr("dy", ".35em");
svg.append("rect")
.attr("class", "overlay")
.attr("width", width)
.attr("height", height)
.on("mouseover", function() { focus.style("display", null); })
.on("mouseout", function() { focus.style("display", "none"); })
.on("mousemove", mousemove);
var textArea = svg.append("text")
.attr("class", "mouseover-text")
.attr("x", width - 125)
.attr("y", height - 10)
.on("mouseover", function() { focus.style("display", null); })
.on("mouseout", function() { focus.style("display", "none"); })
.on("mousemove", mousemove);
function mousemove() {
var x0 = x.invert(d3.mouse(this)[0]),
i = bisectDate(data, x0, 1),
d0 = data[i - 1],
d1 = data[i],
d = x0 - d0.release_date > d1.release_date - x0 ? d1 : d0;
focus.attr("transform", "translate(" + x(d.release_date) + "," + y(d.cumulative) + ")");
textArea.text(formatDate(d.release_date) + ": " + formatPercent(d.cumulative));
}
});
</script>3-pointers after offensive rebounds2013-12-26T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-12-26:/2013/12/26/three-pointers-after-offensive-rebounds/<p>I love college basketball. A lot. My beloved Marquette Golden Eagles are probably the only sports team that can really put me in agony.</p>
<p>Last season, I was watching a <a href="https://www.espn.com/mens-college-basketball/recap?gameId=330730269">nail-biter against Notre Dame</a>. With 3:36 left, Notre Dame's Jack Cooley grabbed an offensive board and promptly passed out to Pat Connaughton for a successful three-pointer. It was a dagger.</p>
<p>Last season, I was watching a <a href="https://www.espn.com/mens-college-basketball/recap?gameId=330730269">nail-biter against Notre Dame</a>. With 3:36 left, Notre Dame's Jack Cooley grabbed an offensive board and promptly passed out to Pat Connaughton for a successful three-pointer. It was a dagger.</p>
<p>One thing stuck out to me though - after the shot, <a href="https://en.wikipedia.org/wiki/Jay_Bilas">Jay Bilas</a>, who was calling the game with <a href="https://en.wikipedia.org/wiki/Bill_Raftery">Bill Raftery</a> and <a href="https://en.wikipedia.org/wiki/Sean_McDonough">Sean McDonough</a>, stated that the best time to attempt a 3-pointer is after an offensive rebound.</p>
<p>Intuitively, this statement makes sense - the defensive front line is crashing the boards in hopes of getting a rebound to end the offensive possession, while the defensive backcourt is out on the wings looking for an outlet pass from their teammates, likely leaving their offensive counterparts unguarded.</p>
<p>I've never seen any data that indicates whether three pointers are more successful after an offensive rebound though, much less whether it's the best time to shoot one. It seemed like something worth investigating.</p>
<p>In the following analysis, we'll try to determine whether there is a material difference between "normal" 3P% (those not shot after an offensive rebound) and 3P% when the shot was preceded by an offensive rebound.</p>
<p>I'll be going step by step through data collection, munging, and analysis. If you're just interested in the answer, skip to the last section.</p>
<h3>Getting the data</h3>
<p>ESPN has <a href="https://www.espn.com/mens-college-basketball/playbyplay?gameId=330620221">play-by-play data</a> for almost every NCAA Division I game. I've written a python script that will collect all of this data for a given date range. You can find the script <a href="https://gist.github.com/gjreda/7175267">here</a>. If you're unfamiliar with web scraping, <a href="/2013/03/03/web-scraping-101-with-python/">check out the tutorial</a> I wrote previously.</p>
<h3>Analysis</h3>
<p>Now let's start our analysis using <a href="https://pandas.pydata.org/">pandas</a>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="c1"># read PSVs into DataFrame</span>
<span class="n">games</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">files</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">'*.psv'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'|'</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'game_id'</span><span class="p">]</span> <span class="o">=</span> <span class="n">f</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'.psv'</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span>
<span class="n">games</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Read </span><span class="si">{0}</span><span class="s1"> games'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">games</span><span class="p">)))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">Read 2931 games</span>
</code></pre></div>
<p>To start, we need to find all incidents of a three pointer immediately after an offensive rebound.</p>
<p>This format is kind of crappy though - since events are in separate columns for the home and away teams, we'd have to write logic to check against each column. Let's munge our data into a slightly different format - one column for <code>team</code>, which will indicate home/away, and another for <code>event</code>, which will store the description of what occurred.</p>
<div class="highlight"><pre><span></span><code><span class="n">games_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span><span class="n">games</span><span class="p">)</span>
<span class="c1"># add event_id to maintain event order</span>
<span class="c1"># we can use the index since pandas defaults to the Nth row of the file</span>
<span class="n">games_df</span><span class="p">[</span><span class="s1">'event_id'</span><span class="p">]</span> <span class="o">=</span> <span class="n">games_df</span><span class="o">.</span><span class="n">index</span>
<span class="c1"># melt data into one column for home/away and another for event</span>
<span class="c1"># maintain play order by sorting on event_id</span>
<span class="n">melted</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">melt</span><span class="p">(</span><span class="n">games_df</span><span class="p">,</span> <span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s1">'event_id'</span><span class="p">,</span> <span class="s1">'game_id'</span><span class="p">,</span> <span class="s1">'time'</span><span class="p">,</span> <span class="s1">'score'</span><span class="p">],</span>
<span class="n">var_name</span><span class="o">=</span><span class="s1">'team'</span><span class="p">,</span> <span class="n">value_name</span><span class="o">=</span><span class="s1">'event'</span><span class="p">)</span>
<span class="n">melted</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="s1">'event_id'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># drop rows with NaN events - an event only belongs to one team</span>
<span class="n">melted</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">notnull</span><span class="p">()]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">melted</span><span class="p">[</span><span class="mi">10</span><span class="p">:</span><span class="mi">15</span><span class="p">])</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> event_id game_id time score team event</span>
<span class="go">917522 10 330010066 19:11 0-0 home Melvin Ejim missed Free Throw.</span>
<span class="go">11 11 330010066 19:11 0-0 away Jeremiah Kreisberg Defensive Rebound.</span>
<span class="go">12 12 330010066 18:37 0-0 away Justin Sears missed Jumper.</span>
<span class="go">917525 13 330010066 18:37 0-0 home Percy Gibson Defensive Rebound.</span>
<span class="go">917526 14 330010066 18:31 0-0 home Chris Babb missed Three Point Jumper.</span>
</code></pre></div>
<p>We need to know whether the three pointers were missed or made - let's write a function called <code>get_shot_result</code> to extract the shot result from the <code>event</code> column. We can apply it to every row that contains a three pointer, storing the results in a new column called <code>shot_result</code>.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># label whether three pointers were made or missed</span>
<span class="n">get_shot_result</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s1">'(made|missed)'</span><span class="p">,</span> <span class="n">x</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">shot3</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Three Point'</span><span class="p">)</span>
<span class="n">melted</span><span class="p">[</span><span class="s1">'shot_result'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">shot3</span><span class="p">]</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">get_shot_result</span><span class="p">)</span>
</code></pre></div>
<p>Now let's write a function to label the events that meet our criteria - a three point attempt that was preceded by an offensive rebound. We can use <code>shift(1)</code> to reference the event on the previous row.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">criteria</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="sd">"""Labels if the three pointer was preceded by an offensive rebound."""</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'after_oreb'</span><span class="p">]</span> <span class="o">=</span> <span class="p">((</span><span class="n">df</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Three Point'</span><span class="p">))</span> <span class="o">&</span> \
<span class="n">df</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Offensive Rebound'</span><span class="p">))</span>
<span class="n">df</span><span class="o">.</span><span class="n">after_oreb</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="kc">False</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">melted</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'game_id'</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">criteria</span><span class="p">)</span>
<span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">notnull</span><span class="p">()]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>event_id</th>
<th>game_id</th>
<th>time</th>
<th>score</th>
<th>team</th>
<th>event</th>
<th>shot_result</th>
<th>after_oreb</th>
</tr>
</thead>
<tbody>
<tr>
<th>2 </th>
<td> 2</td>
<td> 330010066</td>
<td> 19:31</td>
<td> 0-0</td>
<td> away</td>
<td> Austin Morgan missed Three Point Jumper.</td>
<td> missed</td>
<td> False</td>
</tr>
<tr>
<th>917518</th>
<td> 6</td>
<td> 330010066</td>
<td> 19:14</td>
<td> 0-0</td>
<td> home</td>
<td> Will Clyburn missed Three Point Jumper.</td>
<td> missed</td>
<td> True</td>
</tr>
<tr>
<th>917526</th>
<td> 14</td>
<td> 330010066</td>
<td> 18:31</td>
<td> 0-0</td>
<td> home</td>
<td> Chris Babb missed Three Point Jumper.</td>
<td> missed</td>
<td> False</td>
</tr>
</tbody>
</table>
<p>Finally, we can calculate the 3P% for our groups and plot the results.</p>
<div class="highlight"><pre><span></span><code><span class="n">threes</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">notnull</span><span class="p">()]</span>
<span class="n">attempts</span> <span class="o">=</span> <span class="n">threes</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'shot_result'</span><span class="p">,</span> <span class="s1">'after_oreb'</span><span class="p">])</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">attempts</span><span class="p">[</span><span class="s1">'perc'</span><span class="p">]</span> <span class="o">=</span> <span class="n">attempts</span><span class="o">.</span><span class="n">made</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">attempts</span><span class="o">.</span><span class="n">made</span> <span class="o">+</span> <span class="n">attempts</span><span class="o">.</span><span class="n">missed</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">attempts</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">shot_result made missed perc</span>
<span class="go">after_oreb </span>
<span class="go">False 33244 63608 0.343245</span>
<span class="go">True 2861 5341 0.348817</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">attempts</span><span class="o">.</span><span class="n">index</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'No'</span><span class="p">,</span> <span class="s1">'Yes'</span><span class="p">]</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">[</span><span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">])</span>
<span class="n">attempts</span><span class="o">.</span><span class="n">perc</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">'bar'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'3P%'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'After Offensive Rebound?'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">);</span>
</code></pre></div>
<p><img alt="3P% after offensive rebound vs not" src="/images/three-point-percent-after-oreb.png"></p>
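<p>As a quick check on whether the small gap above could simply be chance, we can run a two-proportion z-test on the counts printed earlier. A minimal sketch using only the standard library (scipy would give the p-value directly):</p>

```python
import math

# Counts from the table above: (made, missed) for each group.
made_no, missed_no = 33244, 63608    # not after an offensive rebound
made_yes, missed_yes = 2861, 5341    # after an offensive rebound

n_no, n_yes = made_no + missed_no, made_yes + missed_yes
p_no, p_yes = made_no / n_no, made_yes / n_yes

# Pooled two-proportion z-test.
pooled = (made_no + made_yes) / (n_no + n_yes)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_no + 1 / n_yes))
z = (p_yes - p_no) / se

# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
```

<p>The z statistic comes out around 1.0 with a two-sided p-value near 0.31 - well within what we'd expect from chance alone.</p>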
<h3>Digging deeper</h3>
<p>At the most basic level, the difference is negligible. Not all post-offensive rebound three pointers are created equally though. Let's investigate whether three pointers shot shortly after the rebound - specifically, within seven seconds of the offensive rebound - are more successful.</p>
<p>To do so, we'll need to munge our data a bit more in order to calculate the seconds elapsed between rebound and shot attempt.</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="p">[</span><span class="s1">'minutes'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">time</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">':'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">melted</span><span class="p">[</span><span class="s1">'seconds'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">time</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">':'</span><span class="p">)[</span><span class="mi">1</span><span class="p">]))</span>
</code></pre></div>
<p>Notice below that time-outs and end of periods are duplicated within the data (this is because they originally appeared as both a home and away event).</p>
<div class="highlight"><pre><span></span><code><span class="n">duped_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="s1">'event_id'</span><span class="p">,</span> <span class="s1">'time'</span><span class="p">,</span> <span class="s1">'event'</span><span class="p">]</span>
<span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">duplicated</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="n">duped_cols</span><span class="p">)][:</span><span class="mi">3</span><span class="p">]</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>event_id</th>
<th>game_id</th>
<th>time</th>
<th>score</th>
<th>team</th>
<th>event</th>
<th>shot_result</th>
<th>after_oreb</th>
<th>minutes</th>
<th>seconds</th>
</tr>
</thead>
<tbody>
<tr>
<th>917546</th>
<td> 34</td>
<td> 330010066</td>
<td> 15:22</td>
<td> Official TV Timeout.</td>
<td> home</td>
<td> Official TV Timeout.</td>
<td> NaN</td>
<td> False</td>
<td> 15</td>
<td> 22</td>
</tr>
<tr>
<th>917559</th>
<td> 47</td>
<td> 330010066</td>
<td> 13:40</td>
<td> Yale Full Timeout.</td>
<td> home</td>
<td> Yale Full Timeout.</td>
<td> NaN</td>
<td> False</td>
<td> 13</td>
<td> 40</td>
</tr>
<tr>
<th>917575</th>
<td> 63</td>
<td> 330010066</td>
<td> 11:53</td>
<td> Official TV Timeout.</td>
<td> home</td>
<td> Official TV Timeout.</td>
<td> NaN</td>
<td> False</td>
<td> 11</td>
<td> 53</td>
</tr>
</tbody>
</table>
<p>Let's get rid of them so we can easily label events within each period (keeping them in will throw our function off a bit).</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="s1">'event_id'</span><span class="p">,</span> <span class="s1">'event'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<p>Now we can label each period based on when the <code>End of ...</code> event appears - events before it are the first half; events after, the second half. To do so, we can use <code>cumsum</code>, which will treat <code>True</code> values as 1. This means we can just shift our results down a row and take the running total.</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="p">[</span><span class="s1">'period_end'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">event</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'End of'</span><span class="p">))</span>
<span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">period_end</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>event_id</th>
<th>game_id</th>
<th>time</th>
<th>score</th>
<th>team</th>
<th>event</th>
<th>shot_result</th>
<th>after_oreb</th>
<th>minutes</th>
<th>seconds</th>
<th>period_end</th>
</tr>
</thead>
<tbody>
<tr>
<th>171</th>
<td> 171</td>
<td> 330010066</td>
<td> 0:00</td>
<td> End of the 1st Half.</td>
<td> away</td>
<td> End of the 1st Half.</td>
<td> NaN</td>
<td> False</td>
<td> 0</td>
<td> 0</td>
<td> True</td>
</tr>
<tr>
<th>480</th>
<td> 123</td>
<td> 330010120</td>
<td> 0:00</td>
<td> End of the 1st Half.</td>
<td> away</td>
<td> End of the 1st Half.</td>
<td> NaN</td>
<td> False</td>
<td> 0</td>
<td> 0</td>
<td> True</td>
</tr>
<tr>
<th>770</th>
<td> 142</td>
<td> 330010228</td>
<td> 0:00</td>
<td> End of the 1st Half.</td>
<td> away</td>
<td> End of the 1st Half.</td>
<td> NaN</td>
<td> False</td>
<td> 0</td>
<td> 0</td>
<td> True</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="n">calculate_period</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">melted</span><span class="p">[</span><span class="s1">'period'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'game_id'</span><span class="p">)</span><span class="o">.</span><span class="n">period_end</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">calculate_period</span><span class="p">)</span>
</code></pre></div>
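<p>To see why the <code>shift(1).cumsum()</code> chain yields period numbers, here's a minimal sketch with a hand-made flag column (hypothetical values, with an explicit float cast so the NaN handling is clear):</p>

```python
import pandas as pd

# 1.0 marks an end-of-period event; every event *after* such a flag
# belongs to the next period, hence the shift before the cumsum.
flags = pd.Series([0, 0, 1, 0, 1, 0], dtype=float)
period = flags.shift(1).cumsum().fillna(0) + 1
print(period.tolist())  # -> [1.0, 1.0, 1.0, 2.0, 2.0, 3.0]
```

The shift pushes each flag down one row, the cumulative sum counts how many period boundaries precede each event, and <code>fillna(0)</code> handles the first row, which has no predecessor.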
<p>Now we can use each game's maximum period to calculate its total length in minutes.</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># 40min regulation game + (# periods - 2 halves) * 5min OTs</span>
<span class="n">gametime</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mi">40</span> <span class="o">+</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="mi">5</span>
<span class="n">melted</span><span class="p">[</span><span class="s1">'gametime'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">period</span><span class="o">.</span><span class="n">max</span><span class="p">()</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">gametime</span><span class="p">)</span>
<span class="n">melted</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<p>Setting game_id as the index was necessary here because pandas aligns on the index when assigning a result back: the one-value-per-game Series produced by the groupby is matched to each game's rows automatically.</p>
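<p>A toy example of that alignment behavior (made-up data, not from the play-by-play set): assigning a one-row-per-group aggregate back to the full DataFrame broadcasts it to every row sharing that index value.</p>

```python
import pandas as pd

# Two events for game 'a', one for game 'b'.
df = pd.DataFrame({'game_id': ['a', 'a', 'b'],
                   'period': [1, 2, 3]}).set_index('game_id')

# The aggregate has one row per game_id; assigning it back aligns
# on the index, repeating each game's value across its rows.
df['max_period'] = df.groupby(level=0).period.max()
print(df.max_period.tolist())  # -> [2, 2, 3]
```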
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'gametime'</span><span class="p">)</span><span class="o">.</span><span class="n">game_id</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">gametime</span>
<span class="mi">35</span> <span class="mi">54</span>
<span class="mi">40</span> <span class="mi">2365</span>
<span class="mi">45</span> <span class="mi">442</span>
<span class="mi">50</span> <span class="mi">61</span>
<span class="mi">55</span> <span class="mi">7</span>
<span class="mi">60</span> <span class="mi">1</span>
<span class="mi">65</span> <span class="mi">1</span>
<span class="n">dtype</span><span class="p">:</span> <span class="n">int64</span>
</code></pre></div>
<p>Notice above that some games have a total gametime of 35 minutes. All college basketball games are at least 40 minutes, so something is off.</p>
<p>It turns out there are some inconsistencies in ESPN's play-by-play data - a couple games do not have the "End of the 1st Half." event. This throws off our <code>period</code> and <code>gametime</code> calculations. Such is life when dealing with scraped data though.</p>
<p>Let's keep things simple and just assume they were normal, non-OT games.</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">gametime</span> <span class="o">==</span> <span class="mi">35</span><span class="p">,</span> <span class="s1">'gametime'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">40</span>
</code></pre></div>
<p>Now let's normalize the event times to seconds left in the game. This will allow us to see how much time elapsed between the offensive rebound and three point attempt.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">clock_to_secs_left</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="sd">"""Calculates the total seconds left in the game."""</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'secs_left'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">nan</span>
<span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">period</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span> <span class="s1">'secs_left'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">minutes</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1200</span> <span class="o">+</span> <span class="n">df</span><span class="o">.</span><span class="n">seconds</span>
<span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">period</span> <span class="o">></span> <span class="mi">1</span><span class="p">,</span> <span class="s1">'secs_left'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">minutes</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span> <span class="o">+</span> <span class="n">df</span><span class="o">.</span><span class="n">seconds</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">clock_to_secs_left</span><span class="p">(</span><span class="n">melted</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">melted</span><span class="p">[[</span><span class="s1">'game_id'</span><span class="p">,</span> <span class="s1">'time'</span><span class="p">,</span> <span class="s1">'event'</span><span class="p">,</span> <span class="s1">'period'</span><span class="p">,</span> <span class="s1">'secs_left'</span><span class="p">]][:</span><span class="mi">5</span><span class="p">])</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> game_id time event period secs_left</span>
<span class="go">0 330010066 19:47 Chris Babb Turnover. 1 2387</span>
<span class="go">1 330010066 19:45 Austin Morgan Steal. 1 2385</span>
<span class="go">2 330010066 19:31 Austin Morgan missed Three Point Jumper. 1 2371</span>
<span class="go">3 330010066 19:31 Korie Lucious Defensive Rebound. 1 2371</span>
<span class="go">4 330010066 19:21 Korie Lucious missed Jumper. 1 2361</span>
</code></pre></div>
<p>We can finally see how much time elapsed between offensive rebound and three point attempt.</p>
<p>We'll create a new field which will store the seconds elapsed since the previous event. Then we'll create a new DataFrame called <code>threes_after_orebs</code> which will hold three point attempts that were shot within seven seconds of an offensive rebound.</p>
<div class="highlight"><pre><span></span><code><span class="n">melted</span><span class="p">[</span><span class="s1">'secs_elapsed'</span><span class="p">]</span> <span class="o">=</span> <span class="n">melted</span><span class="o">.</span><span class="n">secs_left</span><span class="o">.</span><span class="n">shift</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">melted</span><span class="o">.</span><span class="n">secs_left</span>
<span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">melted</span><span class="o">.</span><span class="n">secs_elapsed</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">melted</span><span class="o">.</span><span class="n">secs_elapsed</span> <span class="o"><=</span> <span class="mi">7</span><span class="p">)</span>
<span class="n">threes_after_orebs</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">melted</span><span class="o">.</span><span class="n">after_oreb</span> <span class="o">&</span> <span class="n">mask</span><span class="p">]</span>
</code></pre></div>
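<p>For intuition on the <code>shift(1)</code> difference, here's a tiny example with made-up clock values: each event's elapsed time is the previous event's clock reading minus its own.</p>

```python
import pandas as pd

# Seconds left on the clock at four consecutive events of one game.
secs_left = pd.Series([2387, 2385, 2371, 2361])
secs_elapsed = secs_left.shift(1) - secs_left  # first event has no predecessor -> NaN
print(secs_elapsed.tolist())
```

The first value is NaN, which is why the mask above requires <code>secs_elapsed >= 0</code> before comparing against the seven-second cutoff.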
<p>Finally, let's group our data by the seconds elapsed and shot result to get the three point percentage for each bucket, which we can plot.</p>
<div class="highlight"><pre><span></span><code><span class="n">grouped</span> <span class="o">=</span> <span class="n">threes_after_orebs</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'shot_result'</span><span class="p">,</span> <span class="s1">'secs_elapsed'</span><span class="p">])</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
<span class="n">grouped</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">grouped</span><span class="p">[</span><span class="s1">'attempts'</span><span class="p">]</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">made</span> <span class="o">+</span> <span class="n">grouped</span><span class="o">.</span><span class="n">missed</span>
<span class="n">grouped</span><span class="p">[</span><span class="s1">'percentage'</span><span class="p">]</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">made</span> <span class="o">/</span> <span class="n">grouped</span><span class="o">.</span><span class="n">attempts</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">threes</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">t</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="s1">'made'</span><span class="p">])</span> <span class="o">/</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="s1">'made'</span><span class="p">]</span> <span class="o">+</span> <span class="n">t</span><span class="p">[</span><span class="s1">'missed'</span><span class="p">])</span>
<span class="n">figsize</span><span class="p">(</span><span class="mf">12.5</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">grouped</span><span class="o">.</span><span class="n">percentage</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'O-Reb 3P%'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s1">'#377EB8'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hlines</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'"Normal" 3P%'</span><span class="p">,</span> <span class="n">linestyles</span><span class="o">=</span><span class="s1">'--'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Seconds Since Offensive Rebound'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">8</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'3-Point Percentage'</span><span class="p">,</span> <span class="n">labelpad</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s1">'lower right'</span><span class="p">);</span>
</code></pre></div>
<p><img alt="3P% vs seconds since offensive rebound" src="/images/three-pointers-secs-since-oreb.png"></p>
<p>It looks like there might be some truth to Jay Bilas' statement that the best time to shoot a three pointer is after an offensive rebound. However, we can go one step further and simulate our way to a numeric value of how correct he was.</p>
<h3>Simulations</h3>
<p>To start, we'll create two Series based on the <code>melted</code> DataFrame that we have been using throughout the analysis. One Series, which we'll call <code>normal</code>, will hold the results of three pointers that were normally attempted - that is, they were not shot immediately after an offensive rebound. The other Series, <code>after</code>, will contain the results of those shot after an offensive rebound. <code>True</code> will be used to indicate a made basket.</p>
<div class="highlight"><pre><span></span><code><span class="n">convert</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="kc">True</span> <span class="k">if</span> <span class="n">x</span> <span class="o">==</span> <span class="s1">'made'</span> <span class="k">else</span> <span class="kc">False</span>
<span class="n">normal_criteria</span> <span class="o">=</span> <span class="p">(</span><span class="n">melted</span><span class="o">.</span><span class="n">after_oreb</span> <span class="o">==</span> <span class="kc">False</span><span class="p">)</span> <span class="o">&</span> <span class="n">melted</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">notnull</span><span class="p">()</span>
<span class="n">normal</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">normal_criteria</span><span class="p">]</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">convert</span><span class="p">)</span>
<span class="n">after_criteria</span> <span class="o">=</span> <span class="p">(</span><span class="n">melted</span><span class="o">.</span><span class="n">after_oreb</span><span class="p">)</span> <span class="o">&</span> <span class="n">melted</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">notnull</span><span class="p">()</span> <span class="o">&</span> \
<span class="p">(</span><span class="n">melted</span><span class="o">.</span><span class="n">secs_elapsed</span> <span class="o"><=</span> <span class="mi">7</span><span class="p">)</span>
<span class="n">after</span> <span class="o">=</span> <span class="n">melted</span><span class="p">[</span><span class="n">after_criteria</span><span class="p">]</span><span class="o">.</span><span class="n">shot_result</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">convert</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"After O-Reb 3P%:"</span><span class="p">,</span> <span class="n">after</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Sample Size:"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">after</span><span class="p">))</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"All other 3P%:"</span><span class="p">,</span> <span class="n">normal</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Sample Size:"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">normal</span><span class="p">))</span>
<span class="nb">print</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Absolute difference: </span><span class="si">%.4f</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">after</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="o">-</span> <span class="n">normal</span><span class="o">.</span><span class="n">mean</span><span class="p">()))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">After O-Reb 3P%: 0.350169109357</span>
<span class="go">Sample Size: 4435</span>
<span class="go">All other 3P%: 0.343264007931</span>
<span class="go">Sample Size: 96838</span>
<span class="go">Absolute difference: 0.0069</span>
</code></pre></div>
<p>While we have data for 2,932 games, it turns out that three pointers shot within seven seconds of an offensive rebound aren't very common - it only occurred 4,435 times, while "normal" three pointers were shot 96,838 times.</p>
<p>The much smaller population means that we're more uncertain about the "true" success rate of those after offensive rebounds.</p>
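<p>A quick way to quantify that uncertainty (using the percentages and sample sizes printed above) is the binomial standard error, <code>sqrt(p * (1 - p) / n)</code> — roughly 0.7 percentage points for the post-rebound sample versus 0.15 for the rest:</p>

```python
import math

# Binomial standard errors from the printed percentages and sample sizes.
se_after = math.sqrt(0.3502 * (1 - 0.3502) / 4435)
se_normal = math.sqrt(0.3433 * (1 - 0.3433) / 96838)
print(round(se_after, 4), round(se_normal, 4))  # -> 0.0072 0.0015
```

With a standard error comparable in size to the observed 0.0069 difference, we can't simply eyeball whether the gap is real.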
<p>Using <a href="http://pymc-devs.github.io/pymc/">pymc</a>, we can run simulations to determine how likely it is that three pointers after offensive rebounds really are easier, while taking this uncertainty into account.</p>
<p>We'll first assume a uniform distribution between 30% and 40% for both "normal" three pointers and those after offensive rebounds. This means that we believe the "true" success rate of each to be somewhere between 30-40%. This seems reasonable given that the 3P% for all of NCAA Division I basketball during the 2012-2013 season was 33.89% [source].</p>
<p>We then generate observations using our <code>normal</code> and <code>after</code> Series. Each observation is a Bernoulli trial, meaning the outcome is binary - the three pointer was either made (True) or missed (False).</p>
<p>We'll then run 20,000 simulations using our existing data and assumptions.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pymc</span> <span class="k">as</span> <span class="nn">pm</span>
<span class="c1"># no chance 3P% is out of this range</span>
<span class="n">p_normal</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Uniform</span><span class="p">(</span><span class="s2">"p_normal"</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">)</span>
<span class="n">p_after</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Uniform</span><span class="p">(</span><span class="s2">"p_after"</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">)</span>
<span class="nd">@pm</span><span class="o">.</span><span class="n">deterministic</span>
<span class="k">def</span> <span class="nf">delta</span><span class="p">(</span><span class="n">p_normal</span><span class="o">=</span><span class="n">p_normal</span><span class="p">,</span> <span class="n">p_after</span><span class="o">=</span><span class="n">p_after</span><span class="p">):</span>
<span class="k">return</span> <span class="n">p_after</span> <span class="o">-</span> <span class="n">p_normal</span>
<span class="c1"># scraped observations</span>
<span class="n">obs_normal</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="s2">"obs_normal"</span><span class="p">,</span> <span class="n">p_normal</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="n">normal</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">obs_after</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Bernoulli</span><span class="p">(</span><span class="s2">"obs_after"</span><span class="p">,</span> <span class="n">p_after</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="n">after</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">MCMC</span><span class="p">([</span><span class="n">p_normal</span><span class="p">,</span> <span class="n">p_after</span><span class="p">,</span> <span class="n">delta</span><span class="p">,</span> <span class="n">obs_normal</span><span class="p">,</span> <span class="n">obs_after</span><span class="p">])</span>
<span class="n">m</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">20000</span><span class="p">)</span>
<span class="n">p_normal_samples</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s2">"p_normal"</span><span class="p">)[:]</span>
<span class="n">p_after_samples</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s2">"p_after"</span><span class="p">)[:]</span>
<span class="n">delta_samples</span> <span class="o">=</span> <span class="n">m</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s2">"delta"</span><span class="p">)[:]</span>
</code></pre></div>
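<p>As a sanity check on the MCMC output, the same comparison can be done with conjugate Beta posteriors and plain numpy, no sampler required. The made/missed counts below are reconstructed from the percentages and sample sizes printed earlier, and a flat Beta(1, 1) prior stands in for the uniform one:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Counts reconstructed from the printed output: ~35.02% of 4,435
# attempts after an offensive rebound, ~34.33% of 96,838 otherwise.
made_after, n_after = 1553, 4435
made_normal, n_normal = 33241, 96838

# Beta(1, 1) prior + binomial likelihood -> Beta posterior draws.
p_after = rng.beta(1 + made_after, 1 + n_after - made_after, 20000)
p_normal = rng.beta(1 + made_normal, 1 + n_normal - made_normal, 20000)

# Fraction of draws in which the post-rebound rate is higher.
print((p_after > p_normal).mean())
```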
<p>Finally, we can plot the results of our simulations.</p>
<div class="highlight"><pre><span></span><code><span class="n">figsize</span><span class="p">(</span><span class="mf">12.5</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">311</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.401</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">p_normal_samples</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'stepfilled'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">normed</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s1">'#E41A1C'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'3P% "Normal"'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">normal</span><span class="o">.</span><span class="n">mean</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">,</span> <span class="n">linestyles</span><span class="o">=</span><span class="s1">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'True "Normal" 3P%'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">312</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.401</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">p_after_samples</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'stepfilled'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">normed</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s1">'#4DAF4A'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'3P% After Off. Reb.'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">after</span><span class="o">.</span><span class="n">mean</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">,</span> <span class="n">linestyles</span><span class="o">=</span><span class="s1">'--'</span><span class="p">,</span>
<span class="n">label</span><span class="o">=</span><span class="s1">'True 3P% After Off. Reb.'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">313</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="o">-</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.051</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">delta_samples</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'stepfilled'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">normed</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s1">'#377EB8'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Delta'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">300</span><span class="p">,</span> <span class="n">linestyles</span><span class="o">=</span><span class="s1">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'$H_0$ (No difference)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="kc">False</span><span class="p">);</span>
</code></pre></div>
<p><img alt="3P% simulations" src="/images/three-pointer-simulations.png"></p>
<p>Notice the much narrower distribution at the top? This is because we had so many observations for "normal" three pointers. There wasn't much uncertainty! The second distribution is quite different: its greater width indicates a greater level of uncertainty about three pointers attempted after offensive rebounds, due to the much smaller sample size.</p>
<p>The final distribution shows the difference between "normal" and "after" three pointers in our simulations. Much more often than not, "after" three pointers had a higher 3P% in the simulations.</p>
<p>In what percentage of simulations was "after" better than "normal" though?</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="s2">"3P% after offensive rebounds was more successful "</span>
      <span class="s2">"in {0:.1f}% of simulations"</span><span class="o">.</span><span class="n">format</span><span class="p">((</span><span class="n">delta_samples</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span><span class="p">))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">3P% after offensive rebounds was more successful in 83.1% of simulations</span>
</code></pre></div>
<p>Looks like Jay Bilas is largely right about this one. While the absolute difference in 3P% is small, threes after an offensive rebound were more successful in about 83% of our simulations.</p>
<p>Was there something I missed or got wrong? I'd love to hear from you.</p>Using pandas on the MovieLens dataset2013-10-26T03:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-10-26:/2013/10/26/using-pandas-on-the-movielens-dataset/<p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><em>This is part three of a three part introduction to <a href="http://pandas.pydata.org">pandas</a>, a Python library for data analysis. The tutorial is primarily …</em></p><p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><em>This is part three of a three part introduction to <a href="http://pandas.pydata.org">pandas</a>, a Python library for data analysis. The tutorial is primarily geared towards SQL users, but is useful for anyone wanting to get started with the library.</em></p>
<ul>
<li><a href="/2013/10/26/intro-to-pandas-data-structures/">Part 1: Intro to pandas data structures</a>, covers the basics of the library's two main data structures - Series and DataFrames.</li>
<li><a href="/2013/10/26/working-with-pandas-dataframes/">Part 2: Working with DataFrames</a>, dives a bit deeper into the functionality of DataFrames. It shows how to inspect, select, filter, merge, combine, and group your data.</li>
<li><a href="/2013/10/26/using-pandas-on-the-movielens-dataset/">Part 3: Using pandas with the MovieLens dataset</a>, applies the learnings of the first two parts in order to answer a few basic analysis questions about the MovieLens ratings data.</li>
</ul>
<h2>Using pandas on the MovieLens dataset</h2>
<p>To show pandas in a more "applied" sense, let's use it to answer some questions about the <a href="https://grouplens.org/datasets/movielens/">MovieLens</a> dataset. Recall that we've already read our data into DataFrames and merged it.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># pass in column names for each CSV</span>
<span class="n">u_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'user_id'</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">,</span> <span class="s1">'sex'</span><span class="p">,</span> <span class="s1">'occupation'</span><span class="p">,</span> <span class="s1">'zip_code'</span><span class="p">]</span>
<span class="n">users</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.user'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'|'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">u_cols</span><span class="p">,</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
<span class="n">r_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'user_id'</span><span class="p">,</span> <span class="s1">'movie_id'</span><span class="p">,</span> <span class="s1">'rating'</span><span class="p">,</span> <span class="s1">'unix_timestamp'</span><span class="p">]</span>
<span class="n">ratings</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.data'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">r_cols</span><span class="p">,</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
<span class="c1"># the movies file contains columns indicating the movie's genres</span>
<span class="c1"># let's only load the first five columns of the file with usecols</span>
<span class="n">m_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="s1">'title'</span><span class="p">,</span> <span class="s1">'release_date'</span><span class="p">,</span> <span class="s1">'video_release_date'</span><span class="p">,</span> <span class="s1">'imdb_url'</span><span class="p">]</span>
<span class="n">movies</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.item'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'|'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">m_cols</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
<span class="c1"># create one merged DataFrame</span>
<span class="n">movie_ratings</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">movies</span><span class="p">,</span> <span class="n">ratings</span><span class="p">)</span>
<span class="n">lens</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">movie_ratings</span><span class="p">,</span> <span class="n">users</span><span class="p">)</span>
</code></pre></div>
<h3>What are the 25 most rated movies?</h3>
<div class="highlight"><pre><span></span><code><span class="n">most_rated</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'title'</span><span class="p">)</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">25</span><span class="p">]</span>
<span class="n">most_rated</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">title</span>
<span class="go">Star Wars (1977) 583</span>
<span class="go">Contact (1997) 509</span>
<span class="go">Fargo (1996) 508</span>
<span class="go">Return of the Jedi (1983) 507</span>
<span class="go">Liar Liar (1997) 485</span>
<span class="go">English Patient, The (1996) 481</span>
<span class="go">Scream (1996) 478</span>
<span class="go">Toy Story (1995) 452</span>
<span class="go">Air Force One (1997) 431</span>
<span class="go">Independence Day (ID4) (1996) 429</span>
<span class="go">Raiders of the Lost Ark (1981) 420</span>
<span class="go">Godfather, The (1972) 413</span>
<span class="go">Pulp Fiction (1994) 394</span>
<span class="go">Twelve Monkeys (1995) 392</span>
<span class="go">Silence of the Lambs, The (1991) 390</span>
<span class="go">Jerry Maguire (1996) 384</span>
<span class="go">Chasing Amy (1997) 379</span>
<span class="go">Rock, The (1996) 378</span>
<span class="go">Empire Strikes Back, The (1980) 367</span>
<span class="go">Star Trek: First Contact (1996) 365</span>
<span class="go">Back to the Future (1985) 350</span>
<span class="go">Titanic (1997) 350</span>
<span class="go">Mission: Impossible (1996) 344</span>
<span class="go">Fugitive, The (1993) 336</span>
<span class="go">Indiana Jones and the Last Crusade (1989) 331</span>
<span class="go">dtype: int64</span>
</code></pre></div>
<p>There's a lot going on in the code above, but it's very idiomatic. We're splitting the DataFrame into groups by movie title and applying the <code>size</code> method to get the count of records in each group. Then we order our results in descending order and limit the output to the top 25 using Python's slicing syntax.</p>
<p>In SQL, this would be equivalent to:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">lens</span><span class="w"></span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">title</span><span class="w"></span>
<span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="w"></span>
<span class="k">LIMIT</span><span class="w"> </span><span class="mi">25</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>Alternatively, pandas has a nifty <code>value_counts</code> method. Yes, it's simpler, but the goal above was to show a basic <code>groupby</code> example.</p>
<div class="highlight"><pre><span></span><code><span class="n">lens</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()[:</span><span class="mi">25</span><span class="p">]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">Star Wars (1977) 583</span>
<span class="go">Contact (1997) 509</span>
<span class="go">Fargo (1996) 508</span>
<span class="go">Return of the Jedi (1983) 507</span>
<span class="go">Liar Liar (1997) 485</span>
<span class="go">English Patient, The (1996) 481</span>
<span class="go">Scream (1996) 478</span>
<span class="go">Toy Story (1995) 452</span>
<span class="go">Air Force One (1997) 431</span>
<span class="go">Independence Day (ID4) (1996) 429</span>
<span class="go">Raiders of the Lost Ark (1981) 420</span>
<span class="go">Godfather, The (1972) 413</span>
<span class="go">Pulp Fiction (1994) 394</span>
<span class="go">Twelve Monkeys (1995) 392</span>
<span class="go">Silence of the Lambs, The (1991) 390</span>
<span class="go">Jerry Maguire (1996) 384</span>
<span class="go">Chasing Amy (1997) 379</span>
<span class="go">Rock, The (1996) 378</span>
<span class="go">Empire Strikes Back, The (1980) 367</span>
<span class="go">Star Trek: First Contact (1996) 365</span>
<span class="go">Titanic (1997) 350</span>
<span class="go">Back to the Future (1985) 350</span>
<span class="go">Mission: Impossible (1996) 344</span>
<span class="go">Fugitive, The (1993) 336</span>
<span class="go">Indiana Jones and the Last Crusade (1989) 331</span>
<span class="go">Name: title, dtype: int64</span>
</code></pre></div>
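<p>To convince yourself the two approaches agree, here's a small sketch on a made-up frame (the titles below are just placeholders):</p>

```python
import pandas as pd

# a tiny stand-in for the lens DataFrame
toy = pd.DataFrame({'title': ['Fargo (1996)', 'Contact (1997)', 'Fargo (1996)']})

# groupby/size, then order descending -- same shape as the longer example above
via_groupby = toy.groupby('title').size().sort_values(ascending=False)

# value_counts does the grouping, counting, and sorting in one call
via_counts = toy.title.value_counts()
```

<p>Both produce the same counts in the same order; <code>value_counts</code> just saves the typing.</p>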
<h3>Which movies are most highly rated?</h3>
<div class="highlight"><pre><span></span><code><span class="n">movie_stats</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'title'</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'rating'</span><span class="p">:</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">]})</span>
<span class="n">movie_stats</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="2" halign="left">rating</th>
</tr>
<tr>
<th></th>
<th>size</th>
<th>mean</th>
</tr>
<tr>
<th>title</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>'Til There Was You (1997)</th>
<td>9</td>
<td>2.333333</td>
</tr>
<tr>
<th>1-900 (1994)</th>
<td>5</td>
<td>2.600000</td>
</tr>
<tr>
<th>101 Dalmatians (1996)</th>
<td>109</td>
<td>2.908257</td>
</tr>
<tr>
<th>12 Angry Men (1957)</th>
<td>125</td>
<td>4.344000</td>
</tr>
<tr>
<th>187 (1997)</th>
<td>41</td>
<td>3.024390</td>
</tr>
</tbody>
</table>
<p>We can use the <code>agg</code> method to pass a dictionary specifying the columns to aggregate (as keys) and a list of functions we'd like to apply.</p>
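<p>As a side note, newer pandas versions also support "named aggregation", which produces flat column names instead of a MultiIndex. A minimal sketch, using an invented miniature of the <code>lens</code> frame:</p>

```python
import pandas as pd

# a made-up miniature of the lens DataFrame
toy = pd.DataFrame({'title':  ['Fargo (1996)', 'Fargo (1996)', 'Contact (1997)'],
                    'rating': [5, 4, 3]})

# dict-of-lists style, as above: yields MultiIndex columns
stats = toy.groupby('title').agg({'rating': ['size', 'mean']})

# named aggregation (pandas >= 0.25): yields flat 'size' and 'mean' columns
flat = toy.groupby('title').agg(size=('rating', 'size'),
                                mean=('rating', 'mean'))
```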
<p>Let's sort the resulting DataFrame so that we can see which movies have the highest average score.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># sort by rating average</span>
<span class="n">movie_stats</span><span class="o">.</span><span class="n">sort_values</span><span class="p">([(</span><span class="s1">'rating'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">)],</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="2" halign="left">rating</th>
</tr>
<tr>
<th></th>
<th>size</th>
<th>mean</th>
</tr>
<tr>
<th>title</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>They Made Me a Criminal (1939)</th>
<td>1</td>
<td>5</td>
</tr>
<tr>
<th>Marlene Dietrich: Shadow and Light (1996)</th>
<td>1</td>
<td>5</td>
</tr>
<tr>
<th>Saint of Fort Washington, The (1993)</th>
<td>2</td>
<td>5</td>
</tr>
<tr>
<th>Someone Else's America (1995)</th>
<td>1</td>
<td>5</td>
</tr>
<tr>
<th>Star Kid (1997)</th>
<td>3</td>
<td>5</td>
</tr>
</tbody>
</table>
<p>We sort with the <code>sort_values</code> method, which works on both Series and DataFrames (it replaced the older <code>sort</code> and <code>order</code> methods). Additionally, because our columns are now a <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced">MultiIndex</a>, we need to pass in a tuple specifying how to sort.</p>
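<p>The tuple requirement is easy to see on a toy frame with the same column structure (the data here is invented):</p>

```python
import pandas as pd

toy = pd.DataFrame({'title':  ['A', 'A', 'B'],
                    'rating': [2, 4, 5]})
stats = toy.groupby('title').agg({'rating': ['size', 'mean']})

# the column key is the full tuple ('rating', 'mean'), not just 'mean'
by_mean = stats.sort_values(by=[('rating', 'mean')], ascending=False)
```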
<p>The above movies are rated so rarely that we can't count them as quality films. Let's only look at movies that have been rated at least 100 times.</p>
<div class="highlight"><pre><span></span><code><span class="n">atleast_100</span> <span class="o">=</span> <span class="n">movie_stats</span><span class="p">[</span><span class="s1">'rating'</span><span class="p">][</span><span class="s1">'size'</span><span class="p">]</span> <span class="o">>=</span> <span class="mi">100</span>
<span class="n">movie_stats</span><span class="p">[</span><span class="n">atleast_100</span><span class="p">]</span><span class="o">.</span><span class="n">sort_values</span><span class="p">([(</span><span class="s1">'rating'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">)],</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">15</span><span class="p">]</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="2" halign="left">rating</th>
</tr>
<tr>
<th></th>
<th>size</th>
<th>mean</th>
</tr>
<tr>
<th>title</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Close Shave, A (1995)</th>
<td>112</td>
<td>4.491071</td>
</tr>
<tr>
<th>Schindler's List (1993)</th>
<td>298</td>
<td>4.466443</td>
</tr>
<tr>
<th>Wrong Trousers, The (1993)</th>
<td>118</td>
<td>4.466102</td>
</tr>
<tr>
<th>Casablanca (1942)</th>
<td>243</td>
<td>4.456790</td>
</tr>
<tr>
<th>Shawshank Redemption, The (1994)</th>
<td>283</td>
<td>4.445230</td>
</tr>
<tr>
<th>Rear Window (1954)</th>
<td>209</td>
<td>4.387560</td>
</tr>
<tr>
<th>Usual Suspects, The (1995)</th>
<td>267</td>
<td>4.385768</td>
</tr>
<tr>
<th>Star Wars (1977)</th>
<td>583</td>
<td>4.358491</td>
</tr>
<tr>
<th>12 Angry Men (1957)</th>
<td>125</td>
<td>4.344000</td>
</tr>
<tr>
<th>Citizen Kane (1941)</th>
<td>198</td>
<td>4.292929</td>
</tr>
<tr>
<th>To Kill a Mockingbird (1962)</th>
<td>219</td>
<td>4.292237</td>
</tr>
<tr>
<th>One Flew Over the Cuckoo's Nest (1975)</th>
<td>264</td>
<td>4.291667</td>
</tr>
<tr>
<th>Silence of the Lambs, The (1991)</th>
<td>390</td>
<td>4.289744</td>
</tr>
<tr>
<th>North by Northwest (1959)</th>
<td>179</td>
<td>4.284916</td>
</tr>
<tr>
<th>Godfather, The (1972)</th>
<td>413</td>
<td>4.283293</td>
</tr>
</tbody>
</table>
<p>Those results look realistic. Notice that we used boolean indexing to filter our <code>movie_stats</code> frame.</p>
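<p>Boolean indexing itself is simple: comparing a Series to a value produces a Series of True/False, which you can then use to select rows. A quick sketch with made-up counts:</p>

```python
import pandas as pd

counts = pd.Series([583, 9, 125],
                   index=['Star Wars (1977)',
                          "'Til There Was You (1997)",
                          '12 Angry Men (1957)'])

mask = counts >= 100     # boolean Series, aligned on the index
popular = counts[mask]   # keeps only the rows where mask is True
```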
<p>We broke this question down into many parts, so here's the Python needed to get the 15 movies with the highest average rating, requiring that they had at least 100 ratings:</p>
<div class="highlight"><pre><span></span><code><span class="n">movie_stats</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'title'</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'rating'</span><span class="p">:</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">]})</span>
<span class="n">atleast_100</span> <span class="o">=</span> <span class="n">movie_stats</span><span class="p">[</span><span class="s1">'rating'</span><span class="p">][</span><span class="s1">'size'</span><span class="p">]</span> <span class="o">>=</span> <span class="mi">100</span>
<span class="n">movie_stats</span><span class="p">[</span><span class="n">atleast_100</span><span class="p">]</span><span class="o">.</span><span class="n">sort_values</span><span class="p">([(</span><span class="s1">'rating'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">)],</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">15</span><span class="p">]</span>
</code></pre></div>
<p>The SQL equivalent would be:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="k">size</span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="n">rating</span><span class="p">)</span><span class="w"> </span><span class="n">mean</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">lens</span><span class="w"></span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">title</span><span class="w"></span>
<span class="k">HAVING</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="mi">100</span><span class="w"></span>
<span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="k">DESC</span><span class="w"></span>
<span class="k">LIMIT</span><span class="w"> </span><span class="mi">15</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<h3>Limiting our population going forward</h3>
<p>Going forward, let's only look at the 50 most rated movies. Let's make a Series of movies that meet this threshold so we can use it for filtering later.</p>
<div class="highlight"><pre><span></span><code><span class="n">most_50</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'movie_id'</span><span class="p">)</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">50</span><span class="p">]</span>
</code></pre></div>
<p>The SQL to match this would be:</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">most_50</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="p">(</span><span class="w"></span>
<span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">movie_id</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">lens</span><span class="w"></span>
<span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">movie_id</span><span class="w"></span>
<span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="w"></span>
<span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">50</span><span class="w"></span>
<span class="p">);</span><span class="w"></span>
</code></pre></div>
<p>This table would then allow us to use EXISTS, IN, or JOIN whenever we wanted to filter our results. Here's an example using EXISTS:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">lens</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="p">(</span><span class="k">SELECT</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">most_50</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">lens</span><span class="p">.</span><span class="n">movie_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">most_50</span><span class="p">.</span><span class="n">movie_id</span><span class="p">);</span><span class="w"></span>
</code></pre></div>
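<p>On the pandas side, the equivalent filter is <code>isin</code> against the index of our <code>most_50</code>-style Series. A sketch on invented data (two movies instead of fifty):</p>

```python
import pandas as pd

lens = pd.DataFrame({'movie_id': [1, 2, 3, 1, 2],
                     'rating':   [5, 3, 4, 4, 2]})

# the 2 most-rated movie_ids, analogous to most_50 above
most_2 = lens.groupby('movie_id').size().sort_values(ascending=False)[:2]

# pandas equivalent of the SQL EXISTS / IN filter
filtered = lens[lens.movie_id.isin(most_2.index)]
```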
<h3>Which movies are most controversial amongst different ages?</h3>
<p>Let's look at how these movies are viewed across different age groups. First, let's look at how age is distributed amongst our users.</p>
<div class="highlight"><pre><span></span><code><span class="n">users</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">bins</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"Distribution of users' ages"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'count of users'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'age'</span><span class="p">);</span>
</code></pre></div>
<p><img alt="Distribution of user ages" src="/images/pandas-movielens-age-histogram.png"></p>
<p>pandas' integration with <a href="https://matplotlib.org/index.html">matplotlib</a> makes basic graphing of Series/DataFrames trivial. In this case, calling <code>plot.hist</code> on the column produces a histogram. We can also use <a href="https://matplotlib.org/stable/tutorials/introductory/pyplot.html">matplotlib.pyplot</a> to customize our graph a bit (always label your axes).</p>
<h3>Binning our users</h3>
<p>I don't think it'd be very useful to compare individual ages - let's bin our users into age groups using <code>pandas.cut</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'0-9'</span><span class="p">,</span> <span class="s1">'10-19'</span><span class="p">,</span> <span class="s1">'20-29'</span><span class="p">,</span> <span class="s1">'30-39'</span><span class="p">,</span> <span class="s1">'40-49'</span><span class="p">,</span> <span class="s1">'50-59'</span><span class="p">,</span> <span class="s1">'60-69'</span><span class="p">,</span> <span class="s1">'70-79'</span><span class="p">]</span>
<span class="n">lens</span><span class="p">[</span><span class="s1">'age_group'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">lens</span><span class="o">.</span><span class="n">age</span><span class="p">,</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">81</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="n">right</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">)</span>
<span class="n">lens</span><span class="p">[[</span><span class="s1">'age'</span><span class="p">,</span> <span class="s1">'age_group'</span><span class="p">]]</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">()[:</span><span class="mi">10</span><span class="p">]</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>age</th>
<th>age_group</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>60</td>
<td>60-69</td>
</tr>
<tr>
<th>397</th>
<td>21</td>
<td>20-29</td>
</tr>
<tr>
<th>459</th>
<td>33</td>
<td>30-39</td>
</tr>
<tr>
<th>524</th>
<td>30</td>
<td>30-39</td>
</tr>
<tr>
<th>782</th>
<td>23</td>
<td>20-29</td>
</tr>
<tr>
<th>995</th>
<td>29</td>
<td>20-29</td>
</tr>
<tr>
<th>1229</th>
<td>26</td>
<td>20-29</td>
</tr>
<tr>
<th>1664</th>
<td>31</td>
<td>30-39</td>
</tr>
<tr>
<th>1942</th>
<td>24</td>
<td>20-29</td>
</tr>
<tr>
<th>2270</th>
<td>32</td>
<td>30-39</td>
</tr>
</tbody>
</table>
<p><code>pandas.cut</code> allows you to bin numeric data. In the above lines, we first created labels to name our bins, then split our users into eight bins of ten years (0-9, 10-19, 20-29, etc.). Our use of <code>right=False</code> told the function to make each bin inclusive of its lower edge and exclusive of its upper edge (e.g. a 30 year old user falls in the 30-39 bin, not 20-29).</p>
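<p>The edge behavior of <code>right=False</code> is worth verifying for yourself on a handful of ages (the ages below are arbitrary):</p>

```python
import pandas as pd

labels = ['0-9', '10-19', '20-29', '30-39', '40-49']
ages = pd.Series([9, 10, 30, 39, 40])

# right=False: each bin includes its lower edge and excludes its upper edge
groups = pd.cut(ages, range(0, 51, 10), right=False, labels=labels)
```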
<p>Now we can compare ratings across age groups.</p>
<div class="highlight"><pre><span></span><code><span class="n">lens</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'age_group'</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'rating'</span><span class="p">:</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">]})</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="2" halign="left">rating</th>
</tr>
<tr>
<th></th>
<th>size</th>
<th>mean</th>
</tr>
<tr>
<th>age_group</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>0-9</th>
<td>43</td>
<td>3.767442</td>
</tr>
<tr>
<th>10-19</th>
<td>8181</td>
<td>3.486126</td>
</tr>
<tr>
<th>20-29</th>
<td>39535</td>
<td>3.467333</td>
</tr>
<tr>
<th>30-39</th>
<td>25696</td>
<td>3.554444</td>
</tr>
<tr>
<th>40-49</th>
<td>15021</td>
<td>3.591772</td>
</tr>
<tr>
<th>50-59</th>
<td>8704</td>
<td>3.635800</td>
</tr>
<tr>
<th>60-69</th>
<td>2623</td>
<td>3.648875</td>
</tr>
<tr>
<th>70-79</th>
<td>197</td>
<td>3.649746</td>
</tr>
</tbody>
</table>
<p>Young users seem a bit more critical than other age groups. Let's look at how the 50 most rated movies are viewed across each age group. We can use the <code>most_50</code> Series we created earlier for filtering.</p>
<div class="highlight"><pre><span></span><code><span class="n">lens</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">by_age</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">most_50</span><span class="o">.</span><span class="n">index</span><span class="p">]</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'title'</span><span class="p">,</span> <span class="s1">'age_group'</span><span class="p">])</span>
<span class="n">by_age</span><span class="o">.</span><span class="n">rating</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">15</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">title age_group</span>
<span class="go">Air Force One (1997) 10-19 3.647059</span>
<span class="go"> 20-29 3.666667</span>
<span class="go"> 30-39 3.570000</span>
<span class="go"> 40-49 3.555556</span>
<span class="go"> 50-59 3.750000</span>
<span class="go"> 60-69 3.666667</span>
<span class="go"> 70-79 3.666667</span>
<span class="go">Alien (1979) 10-19 4.111111</span>
<span class="go"> 20-29 4.026087</span>
<span class="go"> 30-39 4.103448</span>
<span class="go"> 40-49 3.833333</span>
<span class="go"> 50-59 4.272727</span>
<span class="go"> 60-69 3.500000</span>
<span class="go"> 70-79 4.000000</span>
<span class="go">Aliens (1986) 10-19 4.050000</span>
<span class="go">Name: rating, dtype: float64</span>
</code></pre></div>
<p>Notice that both the title and age group are indexes here, with the average rating value being a Series. This is going to produce a really long list of values.</p>
<p>Wouldn't it be nice to see the data as a table? Each title as a row, each age group as a column, and the average rating in each cell.</p>
<p>Behold! The magic of <code>unstack</code>!</p>
<div class="highlight"><pre><span></span><code><span class="n">by_age</span><span class="o">.</span><span class="n">rating</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)[</span><span class="mi">10</span><span class="p">:</span><span class="mi">20</span><span class="p">]</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>age_group</th>
<th>0-9</th>
<th>10-19</th>
<th>20-29</th>
<th>30-39</th>
<th>40-49</th>
<th>50-59</th>
<th>60-69</th>
<th>70-79</th>
</tr>
<tr>
<th>title</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>E.T. the Extra-Terrestrial (1982)</th>
<td>0</td>
<td>3.680000</td>
<td>3.609091</td>
<td>3.806818</td>
<td>4.160000</td>
<td>4.368421</td>
<td>4.375000</td>
<td>0.000000</td>
</tr>
<tr>
<th>Empire Strikes Back, The (1980)</th>
<td>4</td>
<td>4.642857</td>
<td>4.311688</td>
<td>4.052083</td>
<td>4.100000</td>
<td>3.909091</td>
<td>4.250000</td>
<td>5.000000</td>
</tr>
<tr>
<th>English Patient, The (1996)</th>
<td>5</td>
<td>3.739130</td>
<td>3.571429</td>
<td>3.621849</td>
<td>3.634615</td>
<td>3.774648</td>
<td>3.904762</td>
<td>4.500000</td>
</tr>
<tr>
<th>Fargo (1996)</th>
<td>0</td>
<td>3.937500</td>
<td>4.010471</td>
<td>4.230769</td>
<td>4.294118</td>
<td>4.442308</td>
<td>4.000000</td>
<td>4.333333</td>
</tr>
<tr>
<th>Forrest Gump (1994)</th>
<td>5</td>
<td>4.047619</td>
<td>3.785714</td>
<td>3.861702</td>
<td>3.847826</td>
<td>4.000000</td>
<td>3.800000</td>
<td>0.000000</td>
</tr>
<tr>
<th>Fugitive, The (1993)</th>
<td>0</td>
<td>4.320000</td>
<td>3.969925</td>
<td>3.981481</td>
<td>4.190476</td>
<td>4.240000</td>
<td>3.666667</td>
<td>0.000000</td>
</tr>
<tr>
<th>Full Monty, The (1997)</th>
<td>0</td>
<td>3.421053</td>
<td>4.056818</td>
<td>3.933333</td>
<td>3.714286</td>
<td>4.146341</td>
<td>4.166667</td>
<td>3.500000</td>
</tr>
<tr>
<th>Godfather, The (1972)</th>
<td>0</td>
<td>4.400000</td>
<td>4.345070</td>
<td>4.412844</td>
<td>3.929412</td>
<td>4.463415</td>
<td>4.125000</td>
<td>0.000000</td>
</tr>
<tr>
<th>Groundhog Day (1993)</th>
<td>0</td>
<td>3.476190</td>
<td>3.798246</td>
<td>3.786667</td>
<td>3.851064</td>
<td>3.571429</td>
<td>3.571429</td>
<td>4.000000</td>
</tr>
<tr>
<th>Independence Day (ID4) (1996)</th>
<td>0</td>
<td>3.595238</td>
<td>3.291429</td>
<td>3.389381</td>
<td>3.718750</td>
<td>3.888889</td>
<td>2.750000</td>
<td>0.000000</td>
</tr>
</tbody>
</table>
<p><code>unstack</code>, well, unstacks the specified level of a MultiIndex (by default, <code>groupby</code> turns the grouped fields into an index - since we grouped by two fields, it became a MultiIndex). We unstacked the second level (remember that Python uses 0-based indexes), and then filled in missing (NaN) values with 0.</p>
<p>If we had used:</p>
<div class="highlight"><pre><span></span><code><span class="n">by_age</span><span class="o">.</span><span class="n">rating</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div>
<p>We would have had our age groups as rows and movie titles as columns.</p>
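<p>To make the level argument concrete, here's a tiny self-contained sketch (toy numbers, not the MovieLens data) showing how <code>unstack(1)</code> and <code>unstack(0)</code> pivot different levels of the MultiIndex into columns:</p>

```python
import pandas as pd

# a small two-level grouped Series, analogous to by_age.rating.mean()
ratings = pd.DataFrame({
    'title': ['Alien', 'Alien', 'Fargo', 'Fargo'],
    'age_group': ['10-19', '20-29', '10-19', '20-29'],
    'rating': [4.1, 4.0, 3.9, 4.2],
})
means = ratings.groupby(['title', 'age_group']).rating.mean()

# unstack(1): age groups become columns, titles stay as rows
wide = means.unstack(1)
print(wide)

# unstack(0): titles become columns, age groups stay as rows
tall = means.unstack(0)
print(tall)
```

<p>Either way, the same values end up in the cells - only which level becomes the column axis changes.</p>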
<h3>Which movies do men and women most disagree on?</h3>
<p><em>EDIT: I realized after writing this question that Wes McKinney basically went through the exact same question in his book. It's a good, yet simple example of pivot_table, so I'm going to leave it here. Seriously though, <a href="https://www.amazon.com/Python-Data-Analysis-Wrangling-Jupyter/dp/109810403X/ref=sr_1_1">go buy the book</a>.</em></p>
<p>Think about how you'd have to do this in SQL for a second. You'd have to use a combination of IF/CASE statements with aggregate functions in order to pivot your dataset. Your query would look something like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="k">IF</span><span class="p">(</span><span class="n">sex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'F'</span><span class="p">,</span><span class="w"> </span><span class="n">rating</span><span class="p">,</span><span class="w"> </span><span class="k">NULL</span><span class="p">)),</span><span class="w"> </span><span class="k">AVG</span><span class="p">(</span><span class="k">IF</span><span class="p">(</span><span class="n">sex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'M'</span><span class="p">,</span><span class="w"> </span><span class="n">rating</span><span class="p">,</span><span class="w"> </span><span class="k">NULL</span><span class="p">))</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">lens</span><span class="w"></span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">title</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>Imagine how annoying it'd be if you had to do this on more than two columns.</p>
<p>DataFrames have a <code>pivot_table</code> method that makes these kinds of operations much easier (and less verbose).</p>
<div class="highlight"><pre><span></span><code><span class="n">lens</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">pivoted</span> <span class="o">=</span> <span class="n">lens</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="s1">'title'</span><span class="p">],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">'sex'</span><span class="p">],</span>
<span class="n">values</span><span class="o">=</span><span class="s1">'rating'</span><span class="p">,</span>
<span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">pivoted</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>sex</th>
<th>F</th>
<th>M</th>
</tr>
<tr>
<th>movie_id</th>
<th>title</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<th>Toy Story (1995)</th>
<td>3.789916</td>
<td>3.909910</td>
</tr>
<tr>
<th>2</th>
<th>GoldenEye (1995)</th>
<td>3.368421</td>
<td>3.178571</td>
</tr>
<tr>
<th>3</th>
<th>Four Rooms (1995)</th>
<td>2.687500</td>
<td>3.108108</td>
</tr>
<tr>
<th>4</th>
<th>Get Shorty (1995)</th>
<td>3.400000</td>
<td>3.591463</td>
</tr>
<tr>
<th>5</th>
<th>Copycat (1995)</th>
<td>3.772727</td>
<td>3.140625</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="n">pivoted</span><span class="p">[</span><span class="s1">'diff'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pivoted</span><span class="o">.</span><span class="n">M</span> <span class="o">-</span> <span class="n">pivoted</span><span class="o">.</span><span class="n">F</span>
<span class="n">pivoted</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>sex</th>
<th>F</th>
<th>M</th>
<th>diff</th>
</tr>
<tr>
<th>movie_id</th>
<th>title</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<th>Toy Story (1995)</th>
<td>3.789916</td>
<td>3.909910</td>
<td>0.119994</td>
</tr>
<tr>
<th>2</th>
<th>GoldenEye (1995)</th>
<td>3.368421</td>
<td>3.178571</td>
<td>-0.189850</td>
</tr>
<tr>
<th>3</th>
<th>Four Rooms (1995)</th>
<td>2.687500</td>
<td>3.108108</td>
<td>0.420608</td>
</tr>
<tr>
<th>4</th>
<th>Get Shorty (1995)</th>
<td>3.400000</td>
<td>3.591463</td>
<td>0.191463</td>
</tr>
<tr>
<th>5</th>
<th>Copycat (1995)</th>
<td>3.772727</td>
<td>3.140625</td>
<td>-0.632102</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="n">pivoted</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">disagreements</span> <span class="o">=</span> <span class="n">pivoted</span><span class="p">[</span><span class="n">pivoted</span><span class="o">.</span><span class="n">movie_id</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">most_50</span><span class="o">.</span><span class="n">index</span><span class="p">)][</span><span class="s1">'diff'</span><span class="p">]</span>
<span class="n">disagreements</span><span class="o">.</span><span class="n">sort_values</span><span class="p">()</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">'barh'</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">[</span><span class="mi">9</span><span class="p">,</span> <span class="mi">15</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Male vs. Female Avg. Ratings</span><span class="se">\n</span><span class="s1">(Difference > 0 = Favored by Men)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'Title'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Average Rating Difference'</span><span class="p">);</span>
</code></pre></div>
<p><img alt="bar chart of rating difference between men and women" src="/images/pandas-movielens-rating-differences.png"></p>
<p>Of course men like Terminator more than women. Independence Day though? Really?</p>
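<p>If you'd rather read the extremes off as a list than from the chart, sorting the <code>diff</code> column directly works too. A self-contained sketch with made-up numbers (not the actual MovieLens averages):</p>

```python
import pandas as pd

# hypothetical diff values (men's average minus women's average)
diff = pd.Series({
    'Dumb & Dumber (1994)': 0.5,
    'Terminator, The (1984)': 0.4,
    'Independence Day (ID4) (1996)': 0.2,
    'Jurassic Park (1993)': -0.3,
})

# most positive = favored by men; most negative = favored by women
print(diff.sort_values(ascending=False).head(2))
print(diff.sort_values().head(2))
```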
<h3>Additional Resources</h3>
<ul>
<li><a href="https://pandas.pydata.org/pandas-docs/stable/">pandas documentation</a></li>
<li><a href="https://pyvideo.org/search?models=videos.video&q=pandas">pandas videos from PyCon</a></li>
<li><a href="http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/">pandas and Python top 10</a></li>
<li><a href="https://tomaugspurger.github.io/modern-1-intro.html">Tom Augspurger's Modern pandas series</a></li>
<li><a href="https://www.youtube.com/watch?v=otCriSKVV_8&ab_channel=PyData">Video</a> from Tom's pandas tutorial at PyData Seattle 2015</li>
</ul>
<h3>Closing</h3>
<p>This is the point where I finally wrap this tutorial up. Hopefully I've covered the basics well enough to pique your interest and help you get started with the library. If I've missed something critical, feel free to <a href="https://twitter.com/gjreda">let me know on Twitter</a> or in the comments - I'd love constructive feedback.</p>Working with DataFrames2013-10-26T02:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-10-26:/2013/10/26/working-with-pandas-dataframes/<p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><em>This is part two of a three part introduction to <a href="http://pandas.pydata.org">pandas</a>, a Python library for data analysis. The tutorial is primarily …</em></p><p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><em>This is part two of a three part introduction to <a href="http://pandas.pydata.org">pandas</a>, a Python library for data analysis. The tutorial is primarily geared towards SQL users, but is useful for anyone wanting to get started with the library.</em></p>
<ul>
<li><a href="/2013/10/26/intro-to-pandas-data-structures/">Part 1: Intro to pandas data structures</a>, covers the basics of the library's two main data structures - Series and DataFrames.</li>
<li><a href="/2013/10/26/working-with-pandas-dataframes/">Part 2: Working with DataFrames</a>, dives a bit deeper into the functionality of DataFrames. It shows how to inspect, select, filter, merge, combine, and group your data.</li>
<li><a href="/2013/10/26/using-pandas-on-the-movielens-dataset/">Part 3: Using pandas with the MovieLens dataset</a>, applies the learnings of the first two parts in order to answer a few basic analysis questions about the MovieLens ratings data.</li>
</ul>
<h2>Working with DataFrames</h2>
<p>Now that we can get data into a DataFrame, we can finally start working with them. pandas has an abundance of functionality, far too much for me to cover in this introduction. I'd encourage anyone interested in diving deeper into the library to check out its <a href="https://pandas.pydata.org/pandas-docs/stable/">excellent documentation</a>. Or just use Google - there are a lot of Stack Overflow questions and blog posts covering specifics of the library.</p>
<p>We'll be using the <a href="https://grouplens.org/datasets/movielens/">MovieLens</a> dataset in many examples going forward. The dataset contains 100,000 ratings made by 943 users on 1,682 movies.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># pass in column names for each CSV</span>
<span class="n">u_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'user_id'</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">,</span> <span class="s1">'sex'</span><span class="p">,</span> <span class="s1">'occupation'</span><span class="p">,</span> <span class="s1">'zip_code'</span><span class="p">]</span>
<span class="n">users</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.user'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'|'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">u_cols</span><span class="p">,</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
<span class="n">r_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'user_id'</span><span class="p">,</span> <span class="s1">'movie_id'</span><span class="p">,</span> <span class="s1">'rating'</span><span class="p">,</span> <span class="s1">'unix_timestamp'</span><span class="p">]</span>
<span class="n">ratings</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.data'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">r_cols</span><span class="p">,</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
<span class="c1"># the movies file contains columns indicating the movie's genres</span>
<span class="c1"># let's only load the first five columns of the file with usecols</span>
<span class="n">m_cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'movie_id'</span><span class="p">,</span> <span class="s1">'title'</span><span class="p">,</span> <span class="s1">'release_date'</span><span class="p">,</span> <span class="s1">'video_release_date'</span><span class="p">,</span> <span class="s1">'imdb_url'</span><span class="p">]</span>
<span class="n">movies</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'ml-100k/u.item'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'|'</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">m_cols</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'latin-1'</span><span class="p">)</span>
</code></pre></div>
<h3>Inspection</h3>
<div class="highlight"><pre><span></span><code><span class="n">movies</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"><class 'pandas.core.frame.DataFrame'></span>
<span class="go">Int64Index: 1682 entries, 0 to 1681</span>
<span class="go">Data columns (total 5 columns):</span>
<span class="go">movie_id 1682 non-null int64</span>
<span class="go">title 1682 non-null object</span>
<span class="go">release_date 1681 non-null object</span>
<span class="go">video_release_date 0 non-null float64</span>
<span class="go">imdb_url 1679 non-null object</span>
<span class="go">dtypes: float64(1), int64(1), object(3)</span>
<span class="go">memory usage: 78.8+ KB</span>
</code></pre></div>
<p>The output tells a few things about our DataFrame.</p>
<ol>
<li>It's obviously an instance of a DataFrame.</li>
<li>Each row was assigned an index of 0 to N-1, where N is the number of rows in the DataFrame. pandas will do this by default if an index is not specified. Don't worry, this can be changed later.</li>
<li>There are 1,682 rows (every row must have an index).</li>
<li>Our dataset has five total columns, one of which isn't populated at all (video_release_date) and two that are missing some values (release_date and imdb_url).</li>
<li>The <code>dtypes</code> line summarizes how many columns hold each datatype; it is not necessarily in the same order as the columns listed above it. Use the <code>dtypes</code> attribute to see the datatype of each individual column.</li>
<li>The last line shows an approximate amount of RAM used to hold the DataFrame. See the <code>memory_usage</code> method for a per-column breakdown.</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="n">movies</span><span class="o">.</span><span class="n">dtypes</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">movie_id int64</span>
<span class="go">title object</span>
<span class="go">release_date object</span>
<span class="go">video_release_date float64</span>
<span class="go">imdb_url object</span>
<span class="go">dtype: object</span>
</code></pre></div>
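<p>That memory estimate can be broken down per column with <code>memory_usage</code>. A quick sketch on a toy frame - passing <code>deep=True</code> also counts the Python strings behind object columns, so its numbers are more accurate (and usually larger):</p>

```python
import pandas as pd

df = pd.DataFrame({'movie_id': [1, 2, 3],
                   'title': ['Toy Story (1995)', 'GoldenEye (1995)',
                             'Four Rooms (1995)']})

# bytes used by the index and each column
print(df.memory_usage())
print(df.memory_usage(deep=True))
```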
<p>DataFrames also have a <code>describe</code> method, which is great for seeing basic statistics about the dataset's numeric columns. Be careful though, since this will return information on all columns of a numeric datatype.</p>
<div class="highlight"><pre><span></span><code><span class="n">users</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>user_id</th>
<th>age</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>943.000000</td>
<td>943.000000</td>
</tr>
<tr>
<th>mean</th>
<td>472.000000</td>
<td>34.051962</td>
</tr>
<tr>
<th>std</th>
<td>272.364951</td>
<td>12.192740</td>
</tr>
<tr>
<th>min</th>
<td>1.000000</td>
<td>7.000000</td>
</tr>
<tr>
<th>25%</th>
<td>236.500000</td>
<td>25.000000</td>
</tr>
<tr>
<th>50%</th>
<td>472.000000</td>
<td>31.000000</td>
</tr>
<tr>
<th>75%</th>
<td>707.500000</td>
<td>43.000000</td>
</tr>
<tr>
<th>max</th>
<td>943.000000</td>
<td>73.000000</td>
</tr>
</tbody>
</table>
<p>Notice user_id was included since it's numeric. Since this is an ID value, the stats for it don't really matter.</p>
<p>We can quickly see the average age of our users is just above 34 years old, with the youngest being 7 and the oldest being 73. The median age is 31, with the youngest quartile of users being 25 or younger, and the oldest quartile being at least 43.</p>
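<p>If you only care about one column's statistics, you can call <code>describe</code> on a single Series instead, which sidesteps the meaningless ID stats. A small sketch with made-up users:</p>

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3, 4],
                      'age': [24, 53, 23, 33]})

# describe just the age Series, skipping user_id entirely
stats = users['age'].describe()
print(stats)
```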
<p>You've probably noticed that I've used the <code>head</code> method regularly throughout this post - by default, <code>head</code> displays the first five records of the dataset, while <code>tail</code> displays the last five.</p>
<div class="highlight"><pre><span></span><code><span class="n">movies</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>movie_id</th>
<th>title</th>
<th>release_date</th>
<th>video_release_date</th>
<th>imdb_url</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>Toy Story (1995)</td>
<td>01-Jan-1995</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Toy%20Story%2...</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>GoldenEye (1995)</td>
<td>01-Jan-1995</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?GoldenEye%20(...</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>Four Rooms (1995)</td>
<td>01-Jan-1995</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Four%20Rooms%...</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>Get Shorty (1995)</td>
<td>01-Jan-1995</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Get%20Shorty%...</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>Copycat (1995)</td>
<td>01-Jan-1995</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Copycat%20(1995)</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="n">movies</span><span class="o">.</span><span class="n">tail</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>movie_id</th>
<th>title</th>
<th>release_date</th>
<th>video_release_date</th>
<th>imdb_url</th>
</tr>
</thead>
<tbody>
<tr>
<th>1679</th>
<td>1680</td>
<td>Sliding Doors (1998)</td>
<td>01-Jan-1998</td>
<td>NaN</td>
<td>http://us.imdb.com/Title?Sliding+Doors+(1998)</td>
</tr>
<tr>
<th>1680</th>
<td>1681</td>
<td>You So Crazy (1994)</td>
<td>01-Jan-1994</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?You%20So%20Cr...</td>
</tr>
<tr>
<th>1681</th>
<td>1682</td>
<td>Scream of Stone (Schrei aus Stein) (1991)</td>
<td>08-Mar-1996</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Schrei%20aus%...</td>
</tr>
</tbody>
</table>
<p>Alternatively, Python's regular <a href="https://docs.python.org/release/2.3.5/whatsnew/section-slices.html">slicing</a> syntax works as well.</p>
<div class="highlight"><pre><span></span><code><span class="n">movies</span><span class="p">[</span><span class="mi">20</span><span class="p">:</span><span class="mi">22</span><span class="p">]</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>movie_id</th>
<th>title</th>
<th>release_date</th>
<th>video_release_date</th>
<th>imdb_url</th>
</tr>
</thead>
<tbody>
<tr>
<th>20</th>
<td>21</td>
<td>Muppet Treasure Island (1996)</td>
<td>16-Feb-1996</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Muppet%20Trea...</td>
</tr>
<tr>
<th>21</th>
<td>22</td>
<td>Braveheart (1995)</td>
<td>16-Feb-1996</td>
<td>NaN</td>
<td>http://us.imdb.com/M/title-exact?Braveheart%20...</td>
</tr>
</tbody>
</table>
<h3>Selecting</h3>
<p>You can think of a DataFrame as a group of Series that share an index (the row labels), keyed by the column headers. This makes it easy to select specific columns.</p>
<p>Selecting a single column from the DataFrame will return a Series object.</p>
<div class="highlight"><pre><span></span><code><span class="n">users</span><span class="p">[</span><span class="s1">'occupation'</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">0 technician</span>
<span class="go">1 other</span>
<span class="go">2 writer</span>
<span class="go">3 technician</span>
<span class="go">4 other</span>
<span class="go">Name: occupation, dtype: object</span>
</code></pre></div>
<p>To select multiple columns, simply pass a list of column names to the DataFrame; the output will be a DataFrame as well.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="p">[[</span><span class="s1">'age'</span><span class="p">,</span> <span class="s1">'zip_code'</span><span class="p">]]</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="c1"># can also store in a variable to use later</span>
<span class="n">columns_you_want</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'occupation'</span><span class="p">,</span> <span class="s1">'sex'</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="p">[</span><span class="n">columns_you_want</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> age zip_code</span>
<span class="go">0 24 85711</span>
<span class="go">1 53 94043</span>
<span class="go">2 23 32067</span>
<span class="go">3 24 43537</span>
<span class="go">4 33 15213</span>
<span class="go"> occupation sex</span>
<span class="go">0 technician M</span>
<span class="go">1 other F</span>
<span class="go">2 writer M</span>
<span class="go">3 technician M</span>
<span class="go">4 other F</span>
</code></pre></div>
<p>Row selection can be done multiple ways, but selecting by an individual index or by boolean indexing is typically easiest.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># users older than 25</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="p">[</span><span class="n">users</span><span class="o">.</span><span class="n">age</span> <span class="o">></span> <span class="mi">25</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="c1"># users aged 40 AND male</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="p">[(</span><span class="n">users</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="mi">40</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">sex</span> <span class="o">==</span> <span class="s1">'M'</span><span class="p">)]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="c1"># users younger than 30 OR female</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="p">[(</span><span class="n">users</span><span class="o">.</span><span class="n">sex</span> <span class="o">==</span> <span class="s1">'F'</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">age</span> <span class="o"><</span> <span class="mi">30</span><span class="p">)]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> user_id age sex occupation zip_code</span>
<span class="go">1 2 53 F other 94043</span>
<span class="go">4 5 33 F other 15213</span>
<span class="go">5 6 42 M executive 98101</span>
<span class="go"> user_id age sex occupation zip_code</span>
<span class="go">18 19 40 M librarian 02138</span>
<span class="go">82 83 40 M other 44133</span>
<span class="go">115 116 40 M healthcare 97232</span>
<span class="go"> user_id age sex occupation zip_code</span>
<span class="go">0 1 24 M technician 85711</span>
<span class="go">1 2 53 F other 94043</span>
<span class="go">2 3 23 M writer 32067</span>
</code></pre></div>
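<p>Selecting by an individual index usually means <code>loc</code> (label-based) or <code>iloc</code> (position-based). A minimal sketch with a toy frame whose index labels deliberately differ from the row positions:</p>

```python
import pandas as pd

# index labels (10, 20, 30) differ from positions (0, 1, 2)
users = pd.DataFrame({'user_id': [1, 2, 3],
                      'age': [24, 53, 23],
                      'sex': ['M', 'F', 'M']},
                     index=[10, 20, 30])

print(users.loc[20])   # row whose index *label* is 20
print(users.iloc[0])   # first row by *position*, regardless of label
```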
<p>Since our index is kind of meaningless right now, let's set it to the user_id using the <code>set_index</code> method. By default, <code>set_index</code> returns a new DataFrame, so you'll have to specify if you'd like the changes to occur in place.</p>
<p>This has confused me in the past, so look carefully at the code and output below.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'user_id'</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">^^^ I didn't actually change the DataFrame. ^^^</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="n">with_new_index</span> <span class="o">=</span> <span class="n">users</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'user_id'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">with_new_index</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">^^^ set_index actually returns a new DataFrame. ^^^</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> age sex occupation zip_code</span>
<span class="go">user_id </span>
<span class="go">1 24 M technician 85711</span>
<span class="go">2 53 F other 94043</span>
<span class="go">3 23 M writer 32067</span>
<span class="go">4 24 M technician 43537</span>
<span class="go">5 33 F other 15213</span>
<span class="go"> user_id age sex occupation zip_code</span>
<span class="go">0 1 24 M technician 85711</span>
<span class="go">1 2 53 F other 94043</span>
<span class="go">2 3 23 M writer 32067</span>
<span class="go">3 4 24 M technician 43537</span>
<span class="go">4 5 33 F other 15213</span>
<span class="go">^^^ I didn't actually change the DataFrame. ^^^</span>
<span class="go"> age sex occupation zip_code</span>
<span class="go">user_id </span>
<span class="go">1 24 M technician 85711</span>
<span class="go">2 53 F other 94043</span>
<span class="go">3 23 M writer 32067</span>
<span class="go">4 24 M technician 43537</span>
<span class="go">5 33 F other 15213</span>
<span class="go">^^^ set_index actually returns a new DataFrame. ^^^</span>
</code></pre></div>
<p>If you want to modify your existing DataFrame, use the <code>inplace</code> parameter. Most DataFrame methods return a new DataFrame, while offering an <code>inplace</code> parameter. Note that the <code>inplace</code> version might not actually be any more efficient (in terms of speed or memory usage) than the regular version.</p>
<div class="highlight"><pre><span></span><code><span class="n">users</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'user_id'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">users</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>age</th>
<th>sex</th>
<th>occupation</th>
<th>zip_code</th>
</tr>
<tr>
<th>user_id</th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>24</td>
<td>M</td>
<td>technician</td>
<td>85711</td>
</tr>
<tr>
<th>2</th>
<td>53</td>
<td>F</td>
<td>other</td>
<td>94043</td>
</tr>
<tr>
<th>3</th>
<td>23</td>
<td>M</td>
<td>writer</td>
<td>32067</td>
</tr>
<tr>
<th>4</th>
<td>24</td>
<td>M</td>
<td>technician</td>
<td>43537</td>
</tr>
<tr>
<th>5</th>
<td>33</td>
<td>F</td>
<td>other</td>
<td>15213</td>
</tr>
</tbody>
</table>
<p>Notice that we've lost the default pandas 0-based index and moved the user_id into its place. We can select rows by position using the <code>iloc</code> method.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">99</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">iloc</span><span class="p">[[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">300</span><span class="p">]])</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">age 36</span>
<span class="go">sex M</span>
<span class="go">occupation executive</span>
<span class="go">zip_code 90254</span>
<span class="go">Name: 100, dtype: object</span>
<span class="go"> age sex occupation zip_code</span>
<span class="go">user_id </span>
<span class="go">2 53 F other 94043</span>
<span class="go">51 28 M educator 16509</span>
<span class="go">301 24 M student 55439</span>
</code></pre></div>
<p>And we can select rows by label with the <code>loc</code> method.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="mi">100</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">users</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">51</span><span class="p">,</span> <span class="mi">301</span><span class="p">]])</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">age 36</span>
<span class="go">sex M</span>
<span class="go">occupation executive</span>
<span class="go">zip_code 90254</span>
<span class="go">Name: 100, dtype: object</span>
<span class="go"> age sex occupation zip_code</span>
<span class="go">user_id </span>
<span class="go">2 53 F other 94043</span>
<span class="go">51 28 M educator 16509</span>
<span class="go">301 24 M student 55439</span>
</code></pre></div>
<p>If we realize later that we liked the old pandas default index, we can just <code>reset_index</code>. The same rules for <code>inplace</code> apply.</p>
<div class="highlight"><pre><span></span><code><span class="n">users</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">users</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>user_id</th>
<th>age</th>
<th>sex</th>
<th>occupation</th>
<th>zip_code</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>24</td>
<td>M</td>
<td>technician</td>
<td>85711</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>53</td>
<td>F</td>
<td>other</td>
<td>94043</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>23</td>
<td>M</td>
<td>writer</td>
<td>32067</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>24</td>
<td>M</td>
<td>technician</td>
<td>43537</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>33</td>
<td>F</td>
<td>other</td>
<td>15213</td>
</tr>
</tbody>
</table>
<p>The simplified rules of indexing are:</p>
<ul>
<li>Use <code>loc</code> for label-based indexing</li>
<li>Use <code>iloc</code> for positional indexing</li>
</ul>
<p>I've found that I can usually get by with boolean indexing, <code>loc</code>, and <code>iloc</code>, but pandas has a whole host of <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html">other ways to do selection</a>.</p>
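<p>The distinction between the two matters most when the integer labels don't line up with the positions. A minimal sketch (a hypothetical three-row frame, not the MovieLens data):</p>

```python
import pandas as pd

# A frame whose integer labels deliberately differ from their positions
df = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[10, 20, 30])

print(df.loc[10])   # label-based: the row labeled 10 (happens to be first)
print(df.iloc[0])   # position-based: the first row (also labeled 10)
print(df.iloc[2])   # third row, which loc would reach via df.loc[30]
```

With the default 0-based index, <code>loc</code> and <code>iloc</code> happen to agree, which is why the difference is easy to miss until you set a custom index.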
<h3>Joining</h3>
<p>Throughout an analysis, we'll often need to merge/join datasets as data is typically stored in a <a href="https://en.wikipedia.org/wiki/Relational_database">relational</a> manner.</p>
<p>Our MovieLens data is a good example of this - a rating requires both a user and a movie, and the datasets are linked together by a key - in this case, the user_id and movie_id. It's possible for a user to be associated with zero or many ratings and movies. Likewise, a movie can be rated zero or many times, by a number of different users.</p>
<p>Like SQL's JOIN clause, <code>pandas.merge</code> allows two DataFrames to be joined on one or more keys. The function provides a series of parameters (<code>on, left_on, right_on, left_index, right_index</code>) allowing you to specify the columns or indexes on which to join.</p>
<p>By default, <code>pandas.merge</code> operates as an inner join, which can be changed using the <code>how</code> parameter.</p>
<p>From the function's docstring:</p>
<blockquote>
<p>how : {'left', 'right', 'outer', 'inner'}, default 'inner'
- left: use only keys from left frame (SQL: left outer join)
- right: use only keys from right frame (SQL: right outer join)
- outer: use union of keys from both frames (SQL: full outer join)
- inner: use intersection of keys from both frames (SQL: inner join)</p>
</blockquote>
<p>Below are some examples of what each looks like.</p>
<div class="highlight"><pre><span></span><code><span class="n">left_frame</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'key'</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span>
<span class="s1">'left_value'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="s1">'c'</span><span class="p">,</span> <span class="s1">'d'</span><span class="p">,</span> <span class="s1">'e'</span><span class="p">]})</span>
<span class="n">right_frame</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'key'</span><span class="p">:</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">7</span><span class="p">),</span>
<span class="s1">'right_value'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'f'</span><span class="p">,</span> <span class="s1">'g'</span><span class="p">,</span> <span class="s1">'h'</span><span class="p">,</span> <span class="s1">'i'</span><span class="p">,</span> <span class="s1">'j'</span><span class="p">]})</span>
<span class="nb">print</span><span class="p">(</span><span class="n">left_frame</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">right_frame</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> key left_value</span>
<span class="go">0 0 a</span>
<span class="go">1 1 b</span>
<span class="go">2 2 c</span>
<span class="go">3 3 d</span>
<span class="go">4 4 e</span>
<span class="go"> key right_value</span>
<span class="go">0 2 f</span>
<span class="go">1 3 g</span>
<span class="go">2 4 h</span>
<span class="go">3 5 i</span>
<span class="go">4 6 j</span>
</code></pre></div>
<h4>inner join (default)</h4>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s1">'key'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'inner'</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2</td>
<td>c</td>
<td>f</td>
</tr>
<tr>
<th>1</th>
<td>3</td>
<td>d</td>
<td>g</td>
</tr>
<tr>
<th>2</th>
<td>4</td>
<td>e</td>
<td>h</td>
</tr>
</tbody>
</table>
<p>We lose values from both frames since certain keys do not match up. The SQL equivalent is:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="p">,</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="n">left_value</span><span class="p">,</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="n">right_value</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">left_frame</span><span class="w"></span>
<span class="k">INNER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">right_frame</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>Had our key columns not been named the same, we could have used the <code>left_on</code> and <code>right_on</code> parameters to specify which fields to join from each frame.</p>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="s1">'left_key'</span><span class="p">,</span> <span class="n">right_on</span><span class="o">=</span><span class="s1">'right_key'</span><span class="p">)</span>
</code></pre></div>
<p>Alternatively, if our keys were indexes, we could use the <code>left_index</code> or <code>right_index</code> parameters, which accept a True/False value. You can mix and match columns and indexes like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">left_on</span><span class="o">=</span><span class="s1">'key'</span><span class="p">,</span> <span class="n">right_index</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<h4>left outer join</h4>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s1">'key'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'left'</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>a</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>b</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>c</td>
<td>f</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>d</td>
<td>g</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>e</td>
<td>h</td>
</tr>
</tbody>
</table>
<p>We keep everything from the left frame, pulling in the value from the right frame where the keys match up. Where the keys do not match, right_value is NULL (shown as NaN).</p>
<p>SQL Equivalent:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="p">,</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="n">left_value</span><span class="p">,</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="n">right_value</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">left_frame</span><span class="w"></span>
<span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">right_frame</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<h4>right outer join</h4>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s1">'key'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'right'</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2</td>
<td>c</td>
<td>f</td>
</tr>
<tr>
<th>1</th>
<td>3</td>
<td>d</td>
<td>g</td>
</tr>
<tr>
<th>2</th>
<td>4</td>
<td>e</td>
<td>h</td>
</tr>
<tr>
<th>3</th>
<td>5</td>
<td>NaN</td>
<td>i</td>
</tr>
<tr>
<th>4</th>
<td>6</td>
<td>NaN</td>
<td>j</td>
</tr>
</tbody>
</table>
<p>This time we've kept everything from the right frame with the left_value being NULL where the right frame's key did not find a match.</p>
<p>SQL Equivalent:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">,</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="n">left_value</span><span class="p">,</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="n">right_value</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">left_frame</span><span class="w"></span>
<span class="k">RIGHT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">right_frame</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<h4>full outer join</h4>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s1">'key'</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'outer'</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>a</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>b</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>c</td>
<td>f</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>d</td>
<td>g</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>e</td>
<td>h</td>
</tr>
<tr>
<th>5</th>
<td>5</td>
<td>NaN</td>
<td>i</td>
</tr>
<tr>
<th>6</th>
<td>6</td>
<td>NaN</td>
<td>j</td>
</tr>
</tbody>
</table>
<p>We've kept everything from both frames, regardless of whether or not there was a match on both sides. Where there was not a match, the values corresponding to that key are NULL.</p>
<p>SQL Equivalent (though some databases, such as MySQL, don't support FULL OUTER JOIN):</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">IFNULL</span><span class="p">(</span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="p">,</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">)</span><span class="w"> </span><span class="k">key</span><span class="w"></span>
<span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="n">left_value</span><span class="p">,</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="n">right_value</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">left_frame</span><span class="w"></span>
<span class="k">FULL</span><span class="w"> </span><span class="k">OUTER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">right_frame</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">left_frame</span><span class="p">.</span><span class="k">key</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_frame</span><span class="p">.</span><span class="k">key</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
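<p>When a join doesn't produce what you expect, it helps to see which side each row came from. A quick sketch using <code>merge</code>'s <code>indicator</code> parameter, which tags every row in an extra <code>_merge</code> column:</p>

```python
import pandas as pd

left_frame = pd.DataFrame({'key': range(5), 'left_value': list('abcde')})
right_frame = pd.DataFrame({'key': range(2, 7), 'right_value': list('fghij')})

# indicator=True adds a _merge column: 'left_only', 'right_only', or 'both'
merged = pd.merge(left_frame, right_frame, on='key', how='outer', indicator=True)
print(merged)
print(merged['_merge'].value_counts())
```

This is essentially a full outer join annotated with each row's provenance, which makes spotting key mismatches much faster than eyeballing NaNs.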
<h3>Combining</h3>
<p>pandas also provides a way to combine DataFrames along an axis - <code>pandas.concat</code>. While the function is roughly equivalent to SQL's UNION ALL clause (duplicate rows are kept, not removed), there's a lot more that can be done with it.</p>
<p><code>pandas.concat</code> takes a list of Series or DataFrames and returns a Series or DataFrame of the concatenated objects. Note that because the function takes a list, you can combine many objects at once.</p>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">])</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>a</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>b</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>c</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>d</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>e</td>
<td>NaN</td>
</tr>
<tr>
<th>0</th>
<td>2</td>
<td>NaN</td>
<td>f</td>
</tr>
<tr>
<th>1</th>
<td>3</td>
<td>NaN</td>
<td>g</td>
</tr>
<tr>
<th>2</th>
<td>4</td>
<td>NaN</td>
<td>h</td>
</tr>
<tr>
<th>3</th>
<td>5</td>
<td>NaN</td>
<td>i</td>
</tr>
<tr>
<th>4</th>
<td>6</td>
<td>NaN</td>
<td>j</td>
</tr>
</tbody>
</table>
<p>By default, the function will vertically append the objects to one another, combining columns with the same name. We can see above that values not matching up will be NULL.</p>
<p>Additionally, objects can be concatenated side-by-side using the function's axis parameter.</p>
<div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">left_frame</span><span class="p">,</span> <span class="n">right_frame</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>key</th>
<th>left_value</th>
<th>key</th>
<th>right_value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>a</td>
<td>2</td>
<td>f</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>b</td>
<td>3</td>
<td>g</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>c</td>
<td>4</td>
<td>h</td>
</tr>
<tr>
<th>3</th>
<td>3</td>
<td>d</td>
<td>5</td>
<td>i</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>e</td>
<td>6</td>
<td>j</td>
</tr>
</tbody>
</table>
<p><code>pandas.concat</code> can be used in a variety of ways; however, I've typically only used it to combine Series/DataFrames into one unified object. The <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concatenating-objects">documentation</a> has some examples on the ways it can be used.</p>
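<p>Two <code>concat</code> parameters worth knowing (a sketch using frames that mirror the ones above): <code>ignore_index</code> rebuilds a clean 0-based index instead of keeping the duplicated ones you can see in the first example, and <code>keys</code> labels each piece so you can tell where a row came from.</p>

```python
import pandas as pd

left_frame = pd.DataFrame({'key': range(5), 'left_value': list('abcde')})
right_frame = pd.DataFrame({'key': range(2, 7), 'right_value': list('fghij')})

# ignore_index=True discards the original (duplicated) indexes
stacked = pd.concat([left_frame, right_frame], ignore_index=True)
print(stacked.index.tolist())  # 0 through 9, no repeats

# keys=... builds a hierarchical index labeling each source frame
labeled = pd.concat([left_frame, right_frame], keys=['left', 'right'])
print(labeled.loc['right'])  # just the rows that came from right_frame
```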
<h3>Grouping</h3>
<p>Grouping in pandas took some time for me to grasp, but it's pretty awesome once it clicks.</p>
<p>pandas <code>groupby</code> method draws largely from the <a href="https://vita.had.co.nz/papers/plyr.html">split-apply-combine strategy for data analysis</a>. If you're not familiar with this methodology, I highly suggest you read up on it. It does a great job of illustrating how to properly think through a data problem, which I feel is more important than any technical skill a data analyst/scientist can possess.</p>
<p>When approaching a data analysis problem, you'll often break it apart into manageable pieces, perform some operations on each of the pieces, and then put everything back together again (this is the gist of the split-apply-combine strategy). pandas <code>groupby</code> is great for these problems (R users should check out the <a href="http://plyr.had.co.nz/">plyr</a> and <a href="https://github.com/tidyverse/dplyr">dplyr</a> packages).</p>
<p>If you've ever used SQL's GROUP BY or an Excel Pivot Table, you've thought with this mindset, probably without realizing it.</p>
<p>Assume we have a DataFrame and want to get the average for each group - visually, the split-apply-combine method looks like this:
<img src="http://i.imgur.com/yjNkiwL.png" alt="The split-apply-combine strategy"> (gratuitously borrowed from <a href="http://courses.had.co.nz/12-oscon/">Hadley Wickham's Data Science in R slides</a>)</p>
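<p>To make the three steps concrete, here's a toy sketch (hypothetical data, not the salary data we'll load below) doing split, apply, and combine by hand, then letting <code>groupby</code> do all three at once:</p>

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                   'value': [1, 3, 2, 4, 6]})

# split: break the frame into one piece per group
pieces = {name: piece for name, piece in df.groupby('group')}
# apply: operate on each piece independently
means = {name: piece['value'].mean() for name, piece in pieces.items()}
# combine: put the per-group results back into a single object
combined = pd.Series(means)
print(combined)

# groupby does all three steps in one expression
print(df.groupby('group')['value'].mean())
```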
<p>The City of Chicago is kind enough to publish all city employee salaries to its open data portal. Let's go through some basic <code>groupby</code> examples using this data.</p>
<div class="highlight"><pre><span></span><code>!head -n <span class="m">3</span> city-of-chicago-salaries.csv
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">Name,Position Title,Department,Employee Annual Salary</span>
<span class="go">"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$85512.00</span>
<span class="go">"AARON, JEFFERY M",POLICE OFFICER,POLICE,$75372.00</span>
</code></pre></div>
<p>Since the data contains a dollar sign in each salary, pandas will read the field in as strings. We can use the <code>converters</code> parameter to change this when reading in the file.</p>
<blockquote>
<p>converters : dict, optional
- Dict of functions for converting values in certain columns. Keys can either be integers or column labels</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="n">headers</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'name'</span><span class="p">,</span> <span class="s1">'title'</span><span class="p">,</span> <span class="s1">'department'</span><span class="p">,</span> <span class="s1">'salary'</span><span class="p">]</span>
<span class="n">chicago</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'city-of-chicago-salaries.csv'</span><span class="p">,</span>
<span class="n">header</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">names</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span>
<span class="n">converters</span><span class="o">=</span><span class="p">{</span><span class="s1">'salary'</span><span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'$'</span><span class="p">,</span> <span class="s1">''</span><span class="p">))})</span>
<span class="n">chicago</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>title</th>
<th>department</th>
<th>salary</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>AARON, ELVIA J</td>
<td>WATER RATE TAKER</td>
<td>WATER MGMNT</td>
<td>85512</td>
</tr>
<tr>
<th>1</th>
<td>AARON, JEFFERY M</td>
<td>POLICE OFFICER</td>
<td>POLICE</td>
<td>75372</td>
</tr>
<tr>
<th>2</th>
<td>AARON, KIMBERLEI R</td>
<td>CHIEF CONTRACT EXPEDITER</td>
<td>GENERAL SERVICES</td>
<td>80916</td>
</tr>
<tr>
<th>3</th>
<td>ABAD JR, VICENTE M</td>
<td>CIVIL ENGINEER IV</td>
<td>WATER MGMNT</td>
<td>99648</td>
</tr>
<tr>
<th>4</th>
<td>ABBATACOLA, ROBERT J</td>
<td>ELECTRICAL MECHANIC</td>
<td>AVIATION</td>
<td>89440</td>
</tr>
</tbody>
</table>
<p>pandas <code>groupby</code> returns a DataFrameGroupBy object which has a variety of methods, many of which are similar to standard SQL aggregate functions.</p>
<div class="highlight"><pre><span></span><code><span class="n">by_dept</span> <span class="o">=</span> <span class="n">chicago</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'department'</span><span class="p">)</span>
<span class="n">by_dept</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"><pandas.core.groupby.DataFrameGroupBy object at 0x1128ca1d0></span>
</code></pre></div>
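<p>Grouped objects are also iterable, yielding <code>(name, group)</code> pairs - handy for spot-checking a grouping before aggregating. A self-contained sketch with a tiny made-up frame (so it runs without the salary CSV):</p>

```python
import pandas as pd

df = pd.DataFrame({'department': ['POLICE', 'POLICE', 'FIRE'],
                   'salary': [75372.0, 80916.0, 99648.0]})

for name, group in df.groupby('department'):
    # name is the group key; group is the sub-DataFrame for that key
    print(name, len(group))
```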
<p>Calling <code>count</code> returns the total number of NOT NULL values within each column. If we were interested in the total number of records in each group, we could use <code>size</code>.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">by_dept</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">head</span><span class="p">())</span> <span class="c1"># NOT NULL records within each column</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">by_dept</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">tail</span><span class="p">())</span> <span class="c1"># total records for each department</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> name title salary</span>
<span class="go">department </span>
<span class="go">ADMIN HEARNG 42 42 42</span>
<span class="go">ANIMAL CONTRL 61 61 61</span>
<span class="go">AVIATION 1218 1218 1218</span>
<span class="go">BOARD OF ELECTION 110 110 110</span>
<span class="go">BOARD OF ETHICS 9 9 9</span>
<span class="go">department</span>
<span class="go">PUBLIC LIBRARY 926</span>
<span class="go">STREETS & SAN 2070</span>
<span class="go">TRANSPORTN 1168</span>
<span class="go">TREASURER 25</span>
<span class="go">WATER MGMNT 1857</span>
<span class="go">dtype: int64</span>
</code></pre></div>
<p>Summation can be done via <code>sum</code>, averaging by <code>mean</code>, etc. (if it's a SQL function, chances are it exists in pandas). Oh, and there's median too, something not available in most databases.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">by_dept</span><span class="o">.</span><span class="n">sum</span><span class="p">()[</span><span class="mi">20</span><span class="p">:</span><span class="mi">25</span><span class="p">])</span> <span class="c1"># total salaries of each department</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">by_dept</span><span class="o">.</span><span class="n">mean</span><span class="p">()[</span><span class="mi">20</span><span class="p">:</span><span class="mi">25</span><span class="p">])</span> <span class="c1"># average salary of each department</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">by_dept</span><span class="o">.</span><span class="n">median</span><span class="p">()[</span><span class="mi">20</span><span class="p">:</span><span class="mi">25</span><span class="p">])</span> <span class="c1"># take that, RDBMS!</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> salary</span>
<span class="go">department </span>
<span class="go">HUMAN RESOURCES 4850928.0</span>
<span class="go">INSPECTOR GEN 4035150.0</span>
<span class="go">IPRA 7006128.0</span>
<span class="go">LAW 31883920.2</span>
<span class="go">LICENSE APPL COMM 65436.0</span>
<span class="go"> salary</span>
<span class="go">department </span>
<span class="go">HUMAN RESOURCES 71337.176471</span>
<span class="go">INSPECTOR GEN 80703.000000</span>
<span class="go">IPRA 82425.035294</span>
<span class="go">LAW 70853.156000</span>
<span class="go">LICENSE APPL COMM 65436.000000</span>
<span class="go"> salary</span>
<span class="go">department </span>
<span class="go">HUMAN RESOURCES 68496</span>
<span class="go">INSPECTOR GEN 76116</span>
<span class="go">IPRA 82524</span>
<span class="go">LAW 66492</span>
<span class="go">LICENSE APPL COMM 65436</span>
</code></pre></div>
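<p>Rather than calling <code>sum</code>, <code>mean</code>, and <code>median</code> separately, all three can be computed in a single pass with <code>agg</code>. A minimal sketch, using a toy DataFrame standing in for the Chicago salary data (the figures here are made up for illustration):</p>

```python
import pandas as pd

# toy stand-in for the Chicago salary data used above
chicago = pd.DataFrame({
    'department': ['LAW', 'LAW', 'IPRA', 'IPRA'],
    'salary': [70000, 66492, 82524, 82326],
})

# one groupby, three aggregates at once
summary = chicago.groupby('department')['salary'].agg(['sum', 'mean', 'median'])
print(summary)
```

<p>This returns a DataFrame with one column per aggregate, which is often more convenient than three separate calls.</p>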
<p>Operations can also be done on an individual Series within a grouped object. Say we were curious about the five departments with the most distinct titles - the pandas equivalent to:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">department</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="k">DISTINCT</span><span class="w"> </span><span class="n">title</span><span class="p">)</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">chicago</span><span class="w"></span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">department</span><span class="w"></span>
<span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="k">DESC</span><span class="w"></span>
<span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>pandas is a lot less verbose here ...</p>
<div class="highlight"><pre><span></span><code><span class="n">by_dept</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)[:</span><span class="mi">5</span><span class="p">]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">department</span>
<span class="go">WATER MGMNT 153</span>
<span class="go">TRANSPORTN 150</span>
<span class="go">POLICE 130</span>
<span class="go">AVIATION 125</span>
<span class="go">HEALTH 118</span>
<span class="go">Name: title, dtype: int64</span>
</code></pre></div>
<h3>split-apply-combine</h3>
<p>The real power of <code>groupby</code> comes from its split-apply-combine ability.</p>
<p>What if we wanted to see the highest paid employee within each department? Given our current dataset, we'd have to do something like this in SQL:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">chicago</span><span class="w"> </span><span class="k">c</span><span class="w"></span>
<span class="k">INNER</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="p">(</span><span class="w"></span>
<span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">department</span><span class="p">,</span><span class="w"> </span><span class="k">max</span><span class="p">(</span><span class="n">salary</span><span class="p">)</span><span class="w"> </span><span class="n">max_salary</span><span class="w"></span>
<span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">chicago</span><span class="w"></span>
<span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">department</span><span class="w"></span>
<span class="p">)</span><span class="w"> </span><span class="n">m</span><span class="w"></span>
<span class="k">ON</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">department</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">department</span><span class="w"></span>
<span class="k">AND</span><span class="w"> </span><span class="k">c</span><span class="p">.</span><span class="n">salary</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">max_salary</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>This would give you the highest paid person in each department, but it would return multiple rows for a department if several people tied for its highest salary.</p>
<p>Alternatively, you could alter the table, add a column, and then write an update statement to populate that column. However, that's not always an option.</p>
<p><em>Note: This would be a lot easier in PostgreSQL, T-SQL, and possibly Oracle due to the existence of partition/window/analytic functions. I've chosen to use MySQL syntax throughout this tutorial because of its popularity. Unfortunately, MySQL doesn't have similar functions.</em></p>
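<p>For completeness, the SQL self-join above also has a direct pandas analogue: <code>transform</code> broadcasts a group-level aggregate back to every row, so each salary can be compared to its department's maximum. A sketch on a small made-up DataFrame (like the SQL version, ties would still return multiple rows):</p>

```python
import pandas as pd

# hypothetical stand-in for the Chicago salary DataFrame
chicago = pd.DataFrame({
    'name': ['PATTON', 'SMITH', 'SANTIAGO'],
    'department': ['LAW', 'LAW', 'FIRE'],
    'salary': [173664, 90000, 202728],
})

# transform('max') returns a Series aligned row-for-row with the original
max_salary = chicago.groupby('department')['salary'].transform('max')
top = chicago[chicago['salary'] == max_salary]
print(top)
```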
<p>Using <code>groupby</code> we can define a function (which we'll call <code>ranker</code>) that will label each record from 1 to N, where N is the number of employees within the department. We can then call <code>apply</code> to, well, apply that function to each group (in this case, each department).</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">ranker</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="sd">"""Assigns a rank to each employee based on salary, with 1 being the highest paid.</span>
<span class="sd"> Assumes the data is DESC sorted."""</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'dept_rank'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">chicago</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">'salary'</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">chicago</span> <span class="o">=</span> <span class="n">chicago</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'department'</span><span class="p">)</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">ranker</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">chicago</span><span class="p">[</span><span class="n">chicago</span><span class="o">.</span><span class="n">dept_rank</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">7</span><span class="p">))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go"> name title department \</span>
<span class="go">18039 MC CARTHY, GARRY F SUPERINTENDENT OF POLICE POLICE </span>
<span class="go">8004 EMANUEL, RAHM MAYOR MAYOR'S OFFICE </span>
<span class="go">25588 SANTIAGO, JOSE A FIRE COMMISSIONER FIRE </span>
<span class="go">763 ANDOLINO, ROSEMARIE S COMMISSIONER OF AVIATION AVIATION </span>
<span class="go">4697 CHOUCAIR, BECHARA N COMMISSIONER OF HEALTH HEALTH </span>
<span class="go">21971 PATTON, STEPHEN R CORPORATION COUNSEL LAW </span>
<span class="go">12635 HOLT, ALEXANDRA D BUDGET DIR BUDGET & MGMT </span>
<span class="go"> salary dept_rank </span>
<span class="go">18039 260004 1 </span>
<span class="go">8004 216210 1 </span>
<span class="go">25588 202728 1 </span>
<span class="go">763 186576 1 </span>
<span class="go">4697 177156 1 </span>
<span class="go">21971 173664 1 </span>
<span class="go">12635 169992 1 </span>
</code></pre></div>
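<p>As an aside, the same ranking can be done without a custom function using <code>rank</code> on the grouped Series. This is an alternative to the <code>ranker</code> approach above, not a replacement for it; a sketch on toy data:</p>

```python
import pandas as pd

# toy stand-in for the Chicago salary data
chicago = pd.DataFrame({
    'name': ['MC CARTHY', 'JONES', 'SANTIAGO'],
    'department': ['POLICE', 'POLICE', 'FIRE'],
    'salary': [260004, 85000, 202728],
})

# method='first' breaks salary ties by row order, mirroring ranker's behavior
chicago['dept_rank'] = (chicago.groupby('department')['salary']
                        .rank(method='first', ascending=False)
                        .astype(int))
top = chicago[chicago.dept_rank == 1]
print(top)
```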
<p><em>Move onto part three, <a href="/2013/10/26/using-pandas-on-the-movielens-dataset/">using pandas with the MovieLens dataset</a>.</em></p>Intro to pandas data structures2013-10-26T01:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-10-26:/2013/10/26/intro-to-pandas-data-structures/<p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><a href="/2013/01/23/translating-sql-to-pandas-part1/">A while back I claimed</a> I was going to write a couple of posts on translating <a href="http://pandas.pydata.org">pandas</a> to SQL. I never …</p><p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p><a href="/2013/01/23/translating-sql-to-pandas-part1/">A while back I claimed</a> I was going to write a couple of posts on translating <a href="http://pandas.pydata.org">pandas</a> to SQL. I never followed up. However, the other week a couple of coworkers expressed their interest in learning a bit more about it - this seemed like a good reason to revisit the topic.
What follows is a fairly thorough introduction to the library. I chose to break it into three parts as I felt it was too long and daunting as one.</p>
<ul>
<li><a href="/2013/10/26/intro-to-pandas-data-structures/">Part 1: Intro to pandas data structures</a>, covers the basics of the library's two main data structures - Series and DataFrames.</li>
<li><a href="/2013/10/26/working-with-pandas-dataframes/">Part 2: Working with DataFrames</a>, dives a bit deeper into the functionality of DataFrames. It shows how to inspect, select, filter, merge, combine, and group your data.</li>
<li><a href="/2013/10/26/using-pandas-on-the-movielens-dataset/">Part 3: Using pandas with the MovieLens dataset</a>, applies the learnings of the first two parts in order to answer a few basic analysis questions about the MovieLens ratings data.</li>
</ul>
<p>If you'd like to follow along, you can find the necessary CSV files <a href="https://github.com/gjreda/gregreda.com/tree/master/content/notebooks/data">here</a> and the MovieLens dataset <a href="http://files.grouplens.org/datasets/movielens/ml-100k.zip">here</a>.
My goal for this tutorial is to teach the basics of pandas by comparing and contrasting its syntax with SQL. Since all of my coworkers are familiar with SQL, I feel this is the best way to provide a context that can be easily understood by the intended audience.
If you're interested in learning more about the library, pandas author <a href="https://twitter.com/wesmckinn">Wes McKinney</a> has written <a href="http://www.amazon.com/gp/product/1449319793/ref=as_li_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=1449319793&linkCode=as2&tag=gjreda-20&linkId=MCGW4C4NOBRVV5OC">Python for Data Analysis</a>, which covers it in much greater detail.</p>
<h3>What is it?</h3>
<p><a href="http://pandas.pydata.org/">pandas</a> is an open source <a href="http://www.python.org/">Python</a> library for data analysis. Python has always been great for prepping and munging data, but it's never been great for analysis - you'd usually end up using <a href="http://www.r-project.org/">R</a> or loading it into a database and using SQL (or worse, Excel). pandas makes Python great for analysis.</p>
<h2>Data Structures</h2>
<p>pandas introduces two new data structures to Python - <a href="http://pandas.pydata.org/pandas-docs/dev/dsintro.html#series">Series</a> and <a href="http://pandas.pydata.org/pandas-docs/dev/dsintro.html#dataframe">DataFrame</a>, both of which are built on top of <a href="http://www.numpy.org/">NumPy</a> (this means it's fast).</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s1">'max_columns'</span><span class="p">,</span> <span class="mi">50</span><span class="p">)</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
</code></pre></div>
<h3>Series</h3>
<p>A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># create a Series with an arbitrary list</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">7</span><span class="p">,</span> <span class="s1">'Heisenberg'</span><span class="p">,</span> <span class="mf">3.14</span><span class="p">,</span> <span class="o">-</span><span class="mi">1789710578</span><span class="p">,</span> <span class="s1">'Happy Eating!'</span><span class="p">])</span>
<span class="n">s</span>
</code></pre></div>
<pre>
0 7
1 Heisenberg
2 3.14
3 -1789710578
4 Happy Eating!
dtype: object
</pre>
<p>Alternatively, you can specify an index to use when creating the Series.</p>
<div class="highlight"><pre><span></span><code><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">7</span><span class="p">,</span> <span class="s1">'Heisenberg'</span><span class="p">,</span> <span class="mf">3.14</span><span class="p">,</span> <span class="o">-</span><span class="mi">1789710578</span><span class="p">,</span> <span class="s1">'Happy Eating!'</span><span class="p">],</span>
<span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'Z'</span><span class="p">,</span> <span class="s1">'C'</span><span class="p">,</span> <span class="s1">'Y'</span><span class="p">,</span> <span class="s1">'E'</span><span class="p">])</span>
<span class="n">s</span>
</code></pre></div>
<pre>
A 7
Z Heisenberg
C 3.14
Y -1789710578
E Happy Eating!
dtype: object
</pre>
<p>The Series constructor can convert a dictionary as well, using the keys of the dictionary as its index.</p>
<div class="highlight"><pre><span></span><code><span class="n">d</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'Chicago'</span><span class="p">:</span> <span class="mi">1000</span><span class="p">,</span> <span class="s1">'New York'</span><span class="p">:</span> <span class="mi">1300</span><span class="p">,</span> <span class="s1">'Portland'</span><span class="p">:</span> <span class="mi">900</span><span class="p">,</span> <span class="s1">'San Francisco'</span><span class="p">:</span> <span class="mi">1100</span><span class="p">,</span>
<span class="s1">'Austin'</span><span class="p">:</span> <span class="mi">450</span><span class="p">,</span> <span class="s1">'Boston'</span><span class="p">:</span> <span class="kc">None</span><span class="p">}</span>
<span class="n">cities</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>
<span class="n">cities</span>
</code></pre></div>
<pre>
Austin 450
Boston NaN
Chicago 1000
New York 1300
Portland 900
San Francisco 1100
dtype: float64
</pre>
<p>You can use the index to select specific items from the Series ...</p>
<div class="highlight"><pre><span></span><code><span class="n">cities</span><span class="p">[</span><span class="s1">'Chicago'</span><span class="p">]</span>
</code></pre></div>
<pre>
1000.0
</pre>
<div class="highlight"><pre><span></span><code><span class="n">cities</span><span class="p">[[</span><span class="s1">'Chicago'</span><span class="p">,</span> <span class="s1">'Portland'</span><span class="p">,</span> <span class="s1">'San Francisco'</span><span class="p">]]</span>
</code></pre></div>
<pre>
Chicago 1000
Portland 900
San Francisco 1100
dtype: float64
</pre>
<p>Or you can use boolean indexing for selection.</p>
<div class="highlight"><pre><span></span><code><span class="n">cities</span><span class="p">[</span><span class="n">cities</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">]</span>
</code></pre></div>
<pre>
Austin 450
Portland 900
dtype: float64
</pre>
<p>That last one might look a little odd, so let's make it clearer - <code>cities &lt; 1000</code> returns a Series of True/False values, which we then pass to our <code>cities</code> Series, returning the items where the value is True.</p>
<div class="highlight"><pre><span></span><code><span class="n">less_than_1000</span> <span class="o">=</span> <span class="n">cities</span> <span class="o"><</span> <span class="mi">1000</span>
<span class="nb">print</span><span class="p">(</span><span class="n">less_than_1000</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[</span><span class="n">less_than_1000</span><span class="p">])</span>
</code></pre></div>
<pre>
Austin True
Boston False
Chicago False
New York False
Portland True
San Francisco False
dtype: bool
Austin 450
Portland 900
dtype: float64
</pre>
<p>You can also change the values in a Series on the fly.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># changing based on the index</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Old value:'</span><span class="p">,</span> <span class="n">cities</span><span class="p">[</span><span class="s1">'Chicago'</span><span class="p">])</span>
<span class="n">cities</span><span class="p">[</span><span class="s1">'Chicago'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1400</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'New value:'</span><span class="p">,</span> <span class="n">cities</span><span class="p">[</span><span class="s1">'Chicago'</span><span class="p">])</span>
</code></pre></div>
<pre>
Old value: 1000.0
New value: 1400.0
</pre>
<div class="highlight"><pre><span></span><code><span class="c1"># changing values using boolean logic</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[</span><span class="n">cities</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="n">cities</span><span class="p">[</span><span class="n">cities</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">]</span> <span class="o">=</span> <span class="mi">750</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[</span><span class="n">cities</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">])</span>
</code></pre></div>
<pre>
Austin 450
Portland 900
dtype: float64
Austin 750
Portland 750
dtype: float64
</pre>
<p>What if you aren't sure whether an item is in the Series? You can check using idiomatic Python.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="s1">'Seattle'</span> <span class="ow">in</span> <span class="n">cities</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'San Francisco'</span> <span class="ow">in</span> <span class="n">cities</span><span class="p">)</span>
</code></pre></div>
<pre>
False
True
</pre>
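<p>Relatedly, <code>Series.get</code> lets you look up a label without risking a <code>KeyError</code>, returning <code>None</code> (or a default you supply) when the label is missing. A small sketch:</p>

```python
import pandas as pd

cities = pd.Series({'Chicago': 1400, 'Portland': 750})

print(cities.get('Chicago'))       # 1400
print(cities.get('Seattle'))       # None - missing label, no KeyError
print(cities.get('Seattle', 0))    # 0 - caller-supplied default
```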
<p>Mathematical operations can be done using scalars and functions.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># divide city values by 3</span>
<span class="n">cities</span> <span class="o">/</span> <span class="mi">3</span>
</code></pre></div>
<pre>
Austin 250.000000
Boston NaN
Chicago 466.666667
New York 433.333333
Portland 250.000000
San Francisco 366.666667
dtype: float64
</pre>
<div class="highlight"><pre><span></span><code><span class="c1"># square city values</span>
<span class="n">np</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="n">cities</span><span class="p">)</span>
</code></pre></div>
<pre>
Austin 562500
Boston NaN
Chicago 1960000
New York 1690000
Portland 562500
San Francisco 1210000
dtype: float64
</pre>
<p>You can add two Series together, which returns a union of the two Series with the addition occurring on the shared index values. Index labels present in only one of the Series will produce a NULL/NaN (not a number) value.</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[[</span><span class="s1">'Chicago'</span><span class="p">,</span> <span class="s1">'New York'</span><span class="p">,</span> <span class="s1">'Portland'</span><span class="p">]])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[[</span><span class="s1">'Austin'</span><span class="p">,</span> <span class="s1">'New York'</span><span class="p">]])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[[</span><span class="s1">'Chicago'</span><span class="p">,</span> <span class="s1">'New York'</span><span class="p">,</span> <span class="s1">'Portland'</span><span class="p">]]</span> <span class="o">+</span> <span class="n">cities</span><span class="p">[[</span><span class="s1">'Austin'</span><span class="p">,</span> <span class="s1">'New York'</span><span class="p">]])</span>
</code></pre></div>
<pre>
Chicago 1400
New York 1300
Portland 750
dtype: float64
Austin 750
New York 1300
dtype: float64
Austin NaN
Chicago NaN
New York 2600
Portland NaN
dtype: float64
</pre>
<p>Notice that because Austin, Chicago, and Portland were not found in both Series, they were returned with NULL/NaN values.</p>
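<p>If those NaNs are unwanted, <code>Series.add</code> accepts a <code>fill_value</code> parameter that treats labels missing from one side as a given value before adding. A sketch using the same city figures:</p>

```python
import pandas as pd

s1 = pd.Series({'Chicago': 1400, 'New York': 1300, 'Portland': 750})
s2 = pd.Series({'Austin': 750, 'New York': 1300})

# labels missing from one side are treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```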
<p>NULL checking can be performed with <code>isnull</code> and <code>notnull</code>.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># returns a boolean series indicating which values aren't NULL</span>
<span class="n">cities</span><span class="o">.</span><span class="n">notnull</span><span class="p">()</span>
</code></pre></div>
<pre>
Austin True
Boston False
Chicago True
New York True
Portland True
San Francisco True
dtype: bool
</pre>
<div class="highlight"><pre><span></span><code><span class="c1"># use boolean logic to grab the NULL cities</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="o">.</span><span class="n">isnull</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cities</span><span class="p">[</span><span class="n">cities</span><span class="o">.</span><span class="n">isnull</span><span class="p">()])</span>
</code></pre></div>
<pre>
Austin False
Boston True
Chicago False
New York False
Portland False
San Francisco False
dtype: bool
Boston NaN
dtype: float64
</pre>
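<p>Once you've found the NULLs, two common follow-ups are dropping them or filling them in, via <code>dropna</code> and <code>fillna</code>. A brief sketch:</p>

```python
import pandas as pd

cities = pd.Series({'Austin': 750.0, 'Boston': None, 'Chicago': 1400.0})

print(cities.dropna())      # remove the NULL entries
print(cities.fillna(0))     # or replace them with a default value
```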
<h2>DataFrame</h2>
<p>A DataFrame is a tabular data structure comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can also think of a DataFrame as a group of Series objects that share an index (the column names).
For the rest of the tutorial, we'll be primarily working with DataFrames.</p>
<h3>Reading Data</h3>
<p>To create a DataFrame out of common Python data structures, we can pass a dictionary of lists to the DataFrame constructor.</p>
<p>Using the <code>columns</code> parameter allows us to tell the constructor how we'd like the columns ordered. By default, the DataFrame constructor will order the columns alphabetically (though this isn't the case when reading from a file - more on that next).</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'year'</span><span class="p">:</span> <span class="p">[</span><span class="mi">2010</span><span class="p">,</span> <span class="mi">2011</span><span class="p">,</span> <span class="mi">2012</span><span class="p">,</span> <span class="mi">2011</span><span class="p">,</span> <span class="mi">2012</span><span class="p">,</span> <span class="mi">2010</span><span class="p">,</span> <span class="mi">2011</span><span class="p">,</span> <span class="mi">2012</span><span class="p">],</span>
<span class="s1">'team'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'Bears'</span><span class="p">,</span> <span class="s1">'Bears'</span><span class="p">,</span> <span class="s1">'Bears'</span><span class="p">,</span> <span class="s1">'Packers'</span><span class="p">,</span> <span class="s1">'Packers'</span><span class="p">,</span> <span class="s1">'Lions'</span><span class="p">,</span> <span class="s1">'Lions'</span><span class="p">,</span> <span class="s1">'Lions'</span><span class="p">],</span>
<span class="s1">'wins'</span><span class="p">:</span> <span class="p">[</span><span class="mi">11</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
<span class="s1">'losses'</span><span class="p">:</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">12</span><span class="p">]}</span>
<span class="n">football</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">'year'</span><span class="p">,</span> <span class="s1">'team'</span><span class="p">,</span> <span class="s1">'wins'</span><span class="p">,</span> <span class="s1">'losses'</span><span class="p">])</span>
<span class="n">football</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>team</th>
<th>wins</th>
<th>losses</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2010</td>
<td>Bears</td>
<td>11</td>
<td>5</td>
</tr>
<tr>
<th>1</th>
<td>2011</td>
<td>Bears</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<th>2</th>
<td>2012</td>
<td>Bears</td>
<td>10</td>
<td>6</td>
</tr>
<tr>
<th>3</th>
<td>2011</td>
<td>Packers</td>
<td>15</td>
<td>1</td>
</tr>
<tr>
<th>4</th>
<td>2012</td>
<td>Packers</td>
<td>11</td>
<td>5</td>
</tr>
<tr>
<th>5</th>
<td>2010</td>
<td>Lions</td>
<td>6</td>
<td>10</td>
</tr>
<tr>
<th>6</th>
<td>2011</td>
<td>Lions</td>
<td>10</td>
<td>6</td>
</tr>
<tr>
<th>7</th>
<td>2012</td>
<td>Lions</td>
<td>4</td>
<td>12</td>
</tr>
</tbody>
</table>
<p>Much more often, you'll have a dataset you want to read into a DataFrame. Let's go through several common ways of doing so.</p>
<h4>CSV</h4>
<p>Reading a CSV is as simple as calling the <code>read_csv</code> function. By default, it expects the column separator to be a comma, but you can change that using the <code>sep</code> parameter.</p>
<div class="highlight"><pre><span></span><code>%cd ~/Dropbox/tutorials/pandas/
</code></pre></div>
<pre>
/Users/gjreda/Dropbox (Personal)/tutorials/pandas
</pre>
<div class="highlight"><pre><span></span><code><span class="c1"># Source: baseball-reference.com/players/r/riverma01.shtml</span>
!head -n <span class="m">5</span> mariano-rivera.csv
</code></pre></div>
<pre>
Year,Age,Tm,Lg,W,L,W-L%,ERA,G,GS,GF,CG,SHO,SV,IP,H,R,ER,HR,BB,IBB,SO,HBP,BK,WP,BF,ERA+,WHIP,H/9,HR/9,BB/9,SO/9,SO/BB,Awards
1995,25,NYY,AL,5,3,.625,5.51,19,10,2,0,0,0,67.0,71,43,41,11,30,0,51,2,1,0,301,84,1.507,9.5,1.5,4.0,6.9,1.70,
1996,26,NYY,AL,8,3,.727,2.09,61,0,14,0,0,5,107.2,73,25,25,1,34,3,130,2,0,1,425,240,0.994,6.1,0.1,2.8,10.9,3.82,CYA-3MVP-12
1997,27,NYY,AL,6,4,.600,1.88,66,0,56,0,0,43,71.2,65,17,15,5,20,6,68,0,0,2,301,239,1.186,8.2,0.6,2.5,8.5,3.40,ASMVP-25
1998,28,NYY,AL,3,0,1.000,1.91,54,0,49,0,0,36,61.1,48,13,13,3,17,1,36,1,0,0,246,233,1.060,7.0,0.4,2.5,5.3,2.12,
</pre>
<div class="highlight"><pre><span></span><code><span class="n">from_csv</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'mariano-rivera.csv'</span><span class="p">)</span>
<span class="n">from_csv</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Year</th>
<th>Age</th>
<th>Tm</th>
<th>Lg</th>
<th>W</th>
<th>L</th>
<th>W-L%</th>
<th>ERA</th>
<th>G</th>
<th>GS</th>
<th>GF</th>
<th>CG</th>
<th>SHO</th>
<th>SV</th>
<th>IP</th>
<th>H</th>
<th>R</th>
<th>ER</th>
<th>HR</th>
<th>BB</th>
<th>IBB</th>
<th>SO</th>
<th>HBP</th>
<th>BK</th>
<th>WP</th>
<th>BF</th>
<th>ERA+</th>
<th>WHIP</th>
<th>H/9</th>
<th>HR/9</th>
<th>BB/9</th>
<th>SO/9</th>
<th>SO/BB</th>
<th>Awards</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1995</td>
<td>25</td>
<td>NYY</td>
<td>AL</td>
<td>5</td>
<td>3</td>
<td>0.625</td>
<td>5.51</td>
<td>19</td>
<td>10</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>67.0</td>
<td>71</td>
<td>43</td>
<td>41</td>
<td>11</td>
<td>30</td>
<td>0</td>
<td>51</td>
<td>2</td>
<td>1</td>
<td>0</td>
<td>301</td>
<td>84</td>
<td>1.507</td>
<td>9.5</td>
<td>1.5</td>
<td>4.0</td>
<td>6.9</td>
<td>1.70</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>1996</td>
<td>26</td>
<td>NYY</td>
<td>AL</td>
<td>8</td>
<td>3</td>
<td>0.727</td>
<td>2.09</td>
<td>61</td>
<td>0</td>
<td>14</td>
<td>0</td>
<td>0</td>
<td>5</td>
<td>107.2</td>
<td>73</td>
<td>25</td>
<td>25</td>
<td>1</td>
<td>34</td>
<td>3</td>
<td>130</td>
<td>2</td>
<td>0</td>
<td>1</td>
<td>425</td>
<td>240</td>
<td>0.994</td>
<td>6.1</td>
<td>0.1</td>
<td>2.8</td>
<td>10.9</td>
<td>3.82</td>
<td>CYA-3MVP-12</td>
</tr>
<tr>
<th>2</th>
<td>1997</td>
<td>27</td>
<td>NYY</td>
<td>AL</td>
<td>6</td>
<td>4</td>
<td>0.600</td>
<td>1.88</td>
<td>66</td>
<td>0</td>
<td>56</td>
<td>0</td>
<td>0</td>
<td>43</td>
<td>71.2</td>
<td>65</td>
<td>17</td>
<td>15</td>
<td>5</td>
<td>20</td>
<td>6</td>
<td>68</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>301</td>
<td>239</td>
<td>1.186</td>
<td>8.2</td>
<td>0.6</td>
<td>2.5</td>
<td>8.5</td>
<td>3.40</td>
<td>ASMVP-25</td>
</tr>
<tr>
<th>3</th>
<td>1998</td>
<td>28</td>
<td>NYY</td>
<td>AL</td>
<td>3</td>
<td>0</td>
<td>1.000</td>
<td>1.91</td>
<td>54</td>
<td>0</td>
<td>49</td>
<td>0</td>
<td>0</td>
<td>36</td>
<td>61.1</td>
<td>48</td>
<td>13</td>
<td>13</td>
<td>3</td>
<td>17</td>
<td>1</td>
<td>36</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>246</td>
<td>233</td>
<td>1.060</td>
<td>7.0</td>
<td>0.4</td>
<td>2.5</td>
<td>5.3</td>
<td>2.12</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>1999</td>
<td>29</td>
<td>NYY</td>
<td>AL</td>
<td>4</td>
<td>3</td>
<td>0.571</td>
<td>1.83</td>
<td>66</td>
<td>0</td>
<td>63</td>
<td>0</td>
<td>0</td>
<td>45</td>
<td>69.0</td>
<td>43</td>
<td>15</td>
<td>14</td>
<td>2</td>
<td>18</td>
<td>3</td>
<td>52</td>
<td>3</td>
<td>1</td>
<td>2</td>
<td>268</td>
<td>257</td>
<td>0.884</td>
<td>5.6</td>
<td>0.3</td>
<td>2.3</td>
<td>6.8</td>
<td>2.89</td>
<td>ASCYA-3MVP-14</td>
</tr>
</tbody>
</table>
<p>Our file had headers, which the function inferred upon reading in the file. Had we wanted to be more explicit, we could have passed <code>header=None</code> to the function along with a list of column names to use:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Source: pro-football-reference.com/players/M/MannPe00/touchdowns/passing/2012/</span>
!head -n <span class="m">5</span> peyton-passing-TDs-2012.csv
</code></pre></div>
<pre>
1,1,2012-09-09,DEN,,PIT,W 31-19,3,71,Demaryius Thomas,Trail 7-13,Lead 14-13*
2,1,2012-09-09,DEN,,PIT,W 31-19,4,1,Jacob Tamme,Trail 14-19,Lead 22-19*
3,2,2012-09-17,DEN,@,ATL,L 21-27,2,17,Demaryius Thomas,Trail 0-20,Trail 7-20
4,3,2012-09-23,DEN,,HOU,L 25-31,4,38,Brandon Stokley,Trail 11-31,Trail 18-31
5,3,2012-09-23,DEN,,HOU,L 25-31,4,6,Joel Dreessen,Trail 18-31,Trail 25-31
</pre>
<div class="highlight"><pre><span></span><code><span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'num'</span><span class="p">,</span> <span class="s1">'game'</span><span class="p">,</span> <span class="s1">'date'</span><span class="p">,</span> <span class="s1">'team'</span><span class="p">,</span> <span class="s1">'home_away'</span><span class="p">,</span> <span class="s1">'opponent'</span><span class="p">,</span>
<span class="s1">'result'</span><span class="p">,</span> <span class="s1">'quarter'</span><span class="p">,</span> <span class="s1">'distance'</span><span class="p">,</span> <span class="s1">'receiver'</span><span class="p">,</span> <span class="s1">'score_before'</span><span class="p">,</span>
<span class="s1">'score_after'</span><span class="p">]</span>
<span class="n">no_headers</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'peyton-passing-TDs-2012.csv'</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">','</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
<span class="n">names</span><span class="o">=</span><span class="n">cols</span><span class="p">)</span>
<span class="n">no_headers</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>num</th>
<th>game</th>
<th>date</th>
<th>team</th>
<th>home_away</th>
<th>opponent</th>
<th>result</th>
<th>quarter</th>
<th>distance</th>
<th>receiver</th>
<th>score_before</th>
<th>score_after</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>1</td>
<td>2012-09-09</td>
<td>DEN</td>
<td>NaN</td>
<td>PIT</td>
<td>W 31-19</td>
<td>3</td>
<td>71</td>
<td>Demaryius Thomas</td>
<td>Trail 7-13</td>
<td>Lead 14-13*</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>1</td>
<td>2012-09-09</td>
<td>DEN</td>
<td>NaN</td>
<td>PIT</td>
<td>W 31-19</td>
<td>4</td>
<td>1</td>
<td>Jacob Tamme</td>
<td>Trail 14-19</td>
<td>Lead 22-19*</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>2</td>
<td>2012-09-17</td>
<td>DEN</td>
<td>@</td>
<td>ATL</td>
<td>L 21-27</td>
<td>2</td>
<td>17</td>
<td>Demaryius Thomas</td>
<td>Trail 0-20</td>
<td>Trail 7-20</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>3</td>
<td>2012-09-23</td>
<td>DEN</td>
<td>NaN</td>
<td>HOU</td>
<td>L 25-31</td>
<td>4</td>
<td>38</td>
<td>Brandon Stokley</td>
<td>Trail 11-31</td>
<td>Trail 18-31</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>3</td>
<td>2012-09-23</td>
<td>DEN</td>
<td>NaN</td>
<td>HOU</td>
<td>L 25-31</td>
<td>4</td>
<td>6</td>
<td>Joel Dreessen</td>
<td>Trail 18-31</td>
<td>Trail 25-31</td>
</tr>
</tbody>
</table>
<p>pandas' various reader functions have many parameters allowing you to do things like skipping lines of the file, parsing dates, or specifying how to handle NA/NULL datapoints.</p>
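<p>For example, here's a quick sketch of a few of those parameters in action. The CSV contents below are made up for illustration; the parameter names are real <code>read_csv</code> options.</p>

```python
import io
import pandas as pd

# A small CSV with a junk line before the header, a date column,
# and a custom missing-value marker. (Invented data for illustration.)
raw = io.StringIO(
    "# exported 2013-01-19\n"
    "tow_date,make,color\n"
    "01/19/2013,FORD,RED\n"
    "01/20/2013,HONDA,N/A\n"
)

df = pd.read_csv(
    raw,
    skiprows=1,                # skip the comment line above the header
    parse_dates=["tow_date"],  # parse the column into datetime64 values
    na_values=["N/A"],         # treat "N/A" as missing (NaN)
)

print(df.dtypes)
print(df["color"].isnull().sum())
```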
<p>There's also a set of writer functions for writing to a variety of formats (CSVs, HTML tables, JSON). They function exactly as you'd expect and are typically called <code>to_format</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">my_dataframe</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">'path_to_file.csv'</span><span class="p">)</span>
</code></pre></div>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html">Take a look at the IO documentation</a> to familiarize yourself with file reading/writing functionality.</p>
<h4>Excel</h4>
<p>Know who hates VBA? Me. I bet you do, too. Thankfully, pandas allows you to read and write Excel files, so you can easily read from Excel, write your code in Python, and then write back out to Excel - no need for VBA.</p>
<p>Reading Excel files requires the <a href="https://pypi.org/project/xlrd/">xlrd</a> library. You can install it via pip (<code>pip install xlrd</code>).</p>
<p>Let's first write a DataFrame to Excel.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># this is the DataFrame we created from a dictionary earlier</span>
<span class="n">football</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>team</th>
<th>wins</th>
<th>losses</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2010</td>
<td>Bears</td>
<td>11</td>
<td>5</td>
</tr>
<tr>
<th>1</th>
<td>2011</td>
<td>Bears</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<th>2</th>
<td>2012</td>
<td>Bears</td>
<td>10</td>
<td>6</td>
</tr>
<tr>
<th>3</th>
<td>2011</td>
<td>Packers</td>
<td>15</td>
<td>1</td>
</tr>
<tr>
<th>4</th>
<td>2012</td>
<td>Packers</td>
<td>11</td>
<td>5</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="c1"># since our index on the football DataFrame is meaningless, let's not write it</span>
<span class="n">football</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="s1">'football.xlsx'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>!ls -l *.xlsx
</code></pre></div>
<pre>
-rw-r--r--@ 1 gjreda staff 5665 Mar 26 17:58 football.xlsx
</pre>
<div class="highlight"><pre><span></span><code><span class="c1"># delete the DataFrame</span>
<span class="k">del</span> <span class="n">football</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="c1"># read from Excel</span>
<span class="n">football</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">'football.xlsx'</span><span class="p">,</span> <span class="s1">'Sheet1'</span><span class="p">)</span>
<span class="n">football</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>year</th>
<th>team</th>
<th>wins</th>
<th>losses</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2010</td>
<td>Bears</td>
<td>11</td>
<td>5</td>
</tr>
<tr>
<th>1</th>
<td>2011</td>
<td>Bears</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<th>2</th>
<td>2012</td>
<td>Bears</td>
<td>10</td>
<td>6</td>
</tr>
<tr>
<th>3</th>
<td>2011</td>
<td>Packers</td>
<td>15</td>
<td>1</td>
</tr>
<tr>
<th>4</th>
<td>2012</td>
<td>Packers</td>
<td>11</td>
<td>5</td>
</tr>
<tr>
<th>5</th>
<td>2010</td>
<td>Lions</td>
<td>6</td>
<td>10</td>
</tr>
<tr>
<th>6</th>
<td>2011</td>
<td>Lions</td>
<td>10</td>
<td>6</td>
</tr>
<tr>
<th>7</th>
<td>2012</td>
<td>Lions</td>
<td>4</td>
<td>12</td>
</tr>
</tbody>
</table>
<h4>Database</h4>
<p>pandas also has some support for <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html">reading/writing DataFrames directly from/to a database</a>. You'll typically just need to pass a connection object or SQLAlchemy engine to the <code>read_sql</code> or <code>to_sql</code> functions within the <code>pandas.io</code> module.</p>
<p>Note that <code>to_sql</code> executes as a series of INSERT INTO statements and thus trades speed for simplicity. If you're writing a large DataFrame to a database, it might be quicker to write the DataFrame to a CSV and load it directly using the database's bulk import facility.</p>
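<p>As a small sketch of the round trip, here's <code>to_sql</code> and <code>read_sql</code> against a throwaway in-memory SQLite database (the table name and rows are invented, not the towed-vehicles data used below):</p>

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

df = pd.DataFrame({
    "make": ["FORD", "HONDA", "FORD"],
    "color": ["RED", "GRN", "BLK"],
})

# to_sql issues INSERT statements row by row - fine for small frames
df.to_sql("towed", conn, index=False)

fords = pd.read_sql("SELECT * FROM towed WHERE make = 'FORD';", conn)
print(len(fords))
```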
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pandas.io</span> <span class="kn">import</span> <span class="n">sql</span>
<span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s1">'/Users/gjreda/Dropbox/gregreda.com/_code/towed'</span><span class="p">)</span>
<span class="n">query</span> <span class="o">=</span> <span class="s2">"SELECT * FROM towed WHERE make = 'FORD';"</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">sql</span><span class="o">.</span><span class="n">read_sql</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">con</span><span class="o">=</span><span class="n">conn</span><span class="p">)</span>
<span class="n">results</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>tow_date</th>
<th>make</th>
<th>style</th>
<th>model</th>
<th>color</th>
<th>plate</th>
<th>state</th>
<th>towed_address</th>
<th>phone</th>
<th>inventory</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>01/19/2013</td>
<td>FORD</td>
<td>LL</td>
<td></td>
<td>RED</td>
<td>N786361</td>
<td>IL</td>
<td>400 E. Lower Wacker</td>
<td>(312) 744-7550</td>
<td>877040</td>
</tr>
<tr>
<th>1</th>
<td>01/19/2013</td>
<td>FORD</td>
<td>4D</td>
<td></td>
<td>GRN</td>
<td>L307211</td>
<td>IL</td>
<td>701 N. Sacramento</td>
<td>(773) 265-7605</td>
<td>6738005</td>
</tr>
<tr>
<th>2</th>
<td>01/19/2013</td>
<td>FORD</td>
<td>4D</td>
<td></td>
<td>GRY</td>
<td>P576738</td>
<td>IL</td>
<td>701 N. Sacramento</td>
<td>(773) 265-7605</td>
<td>6738001</td>
</tr>
<tr>
<th>3</th>
<td>01/19/2013</td>
<td>FORD</td>
<td>LL</td>
<td></td>
<td>BLK</td>
<td>N155890</td>
<td>IL</td>
<td>10300 S. Doty</td>
<td>(773) 568-8495</td>
<td>2699210</td>
</tr>
<tr>
<th>4</th>
<td>01/19/2013</td>
<td>FORD</td>
<td>LL</td>
<td></td>
<td>TAN</td>
<td>H953638</td>
<td>IL</td>
<td>10300 S. Doty</td>
<td>(773) 568-8495</td>
<td>2699209</td>
</tr>
</tbody>
</table>
<h4>Clipboard</h4>
<p>While the results of a query can be read directly into a DataFrame, I prefer to read the results directly from the clipboard. I'm often tweaking queries in my SQL client (<a href="http://www.sequelpro.com/">Sequel Pro</a>), so I would rather see the results before reading them into pandas. Once I'm confident I have the data I want, I'll read it into a DataFrame.</p>
<p>This works just as well with any type of delimited data you've copied to your clipboard. The function does a good job of inferring the delimiter, but you can also use the <code>sep</code> parameter to be explicit.</p>
<p><a href="https://www.baseball-reference.com/players/a/aaronha01.shtml">Hank Aaron</a></p>
<p><img src="http://i.imgur.com/xiySJ2e.png" alt="hank-aaron-stats-screenshot"></p>
<div class="highlight"><pre><span></span><code><span class="n">hank</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_clipboard</span><span class="p">()</span>
<span class="n">hank</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Year</th>
<th>Age</th>
<th>Tm</th>
<th>Lg</th>
<th>G</th>
<th>PA</th>
<th>AB</th>
<th>R</th>
<th>H</th>
<th>2B</th>
<th>3B</th>
<th>HR</th>
<th>RBI</th>
<th>SB</th>
<th>CS</th>
<th>BB</th>
<th>SO</th>
<th>BA</th>
<th>OBP</th>
<th>SLG</th>
<th>OPS</th>
<th>OPS+</th>
<th>TB</th>
<th>GDP</th>
<th>HBP</th>
<th>SH</th>
<th>SF</th>
<th>IBB</th>
<th>Pos</th>
<th>Awards</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1954</td>
<td>20</td>
<td>MLN</td>
<td>NL</td>
<td>122</td>
<td>509</td>
<td>468</td>
<td>58</td>
<td>131</td>
<td>27</td>
<td>6</td>
<td>13</td>
<td>69</td>
<td>2</td>
<td>2</td>
<td>28</td>
<td>39</td>
<td>0.280</td>
<td>0.322</td>
<td>0.447</td>
<td>0.769</td>
<td>104</td>
<td>209</td>
<td>13</td>
<td>3</td>
<td>6</td>
<td>4</td>
<td>NaN</td>
<td>*79</td>
<td>RoY-4</td>
</tr>
<tr>
<th>1</th>
<td>1955 ★</td>
<td>21</td>
<td>MLN</td>
<td>NL</td>
<td>153</td>
<td>665</td>
<td>602</td>
<td>105</td>
<td>189</td>
<td>37</td>
<td>9</td>
<td>27</td>
<td>106</td>
<td>3</td>
<td>1</td>
<td>49</td>
<td>61</td>
<td>0.314</td>
<td>0.366</td>
<td>0.540</td>
<td>0.906</td>
<td>141</td>
<td>325</td>
<td>20</td>
<td>3</td>
<td>7</td>
<td>4</td>
<td>5</td>
<td>*974</td>
<td>AS,MVP-9</td>
</tr>
<tr>
<th>2</th>
<td>1956 ★</td>
<td>22</td>
<td>MLN</td>
<td>NL</td>
<td>153</td>
<td>660</td>
<td>609</td>
<td>106</td>
<td>200</td>
<td>34</td>
<td>14</td>
<td>26</td>
<td>92</td>
<td>2</td>
<td>4</td>
<td>37</td>
<td>54</td>
<td>0.328</td>
<td>0.365</td>
<td>0.558</td>
<td>0.923</td>
<td>151</td>
<td>340</td>
<td>21</td>
<td>2</td>
<td>5</td>
<td>7</td>
<td>6</td>
<td>*9</td>
<td>AS,MVP-3</td>
</tr>
<tr>
<th>3</th>
<td>1957 ★</td>
<td>23</td>
<td>MLN</td>
<td>NL</td>
<td>151</td>
<td>675</td>
<td>615</td>
<td>118</td>
<td>198</td>
<td>27</td>
<td>6</td>
<td>44</td>
<td>132</td>
<td>1</td>
<td>1</td>
<td>57</td>
<td>58</td>
<td>0.322</td>
<td>0.378</td>
<td>0.600</td>
<td>0.978</td>
<td>166</td>
<td>369</td>
<td>13</td>
<td>0</td>
<td>0</td>
<td>3</td>
<td>15</td>
<td>*98</td>
<td>AS,MVP-1</td>
</tr>
<tr>
<th>4</th>
<td>1958 ★</td>
<td>24</td>
<td>MLN</td>
<td>NL</td>
<td>153</td>
<td>664</td>
<td>601</td>
<td>109</td>
<td>196</td>
<td>34</td>
<td>4</td>
<td>30</td>
<td>95</td>
<td>4</td>
<td>1</td>
<td>59</td>
<td>49</td>
<td>0.326</td>
<td>0.386</td>
<td>0.546</td>
<td>0.931</td>
<td>152</td>
<td>328</td>
<td>21</td>
<td>1</td>
<td>0</td>
<td>3</td>
<td>16</td>
<td>*98</td>
<td>AS,MVP-3,GG</td>
</tr>
</tbody>
</table>
<h4>URL</h4>
<p>With <code>read_table</code>, we can also read directly from a URL.</p>
<p>Let's use the <a href="https://raw.githubusercontent.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv">best sandwiches data</a> that I <a href="/2013/05/06/more-web-scraping-with-python/">wrote about scraping</a> a while back.</p>
<div class="highlight"><pre><span></span><code><span class="n">url</span> <span class="o">=</span> <span class="s1">'https://raw.github.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv'</span>
<span class="c1"># fetch the text from the URL and read it into a DataFrame</span>
<span class="n">from_url</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">)</span>
<span class="n">from_url</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>rank</th>
<th>sandwich</th>
<th>restaurant</th>
<th>description</th>
<th>price</th>
<th>address</th>
<th>city</th>
<th>phone</th>
<th>website</th>
<th>full_address</th>
<th>formatted_address</th>
<th>lat</th>
<th>lng</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>BLT</td>
<td>Old Oak Tap</td>
<td>The B is applewood smoked&mdash;nice and snapp...</td>
<td>$10</td>
<td>2109 W. Chicago Ave.</td>
<td>Chicago</td>
<td>773-772-0406</td>
<td>theoldoaktap.com</td>
<td>2109 W. Chicago Ave., Chicago</td>
<td>2109 West Chicago Avenue, Chicago, IL 60622, USA</td>
<td>41.895734</td>
<td>-87.679960</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>Fried Bologna</td>
<td>Au Cheval</td>
<td>Thought your bologna-eating days had retired w...</td>
<td>$9</td>
<td>800 W. Randolph St.</td>
<td>Chicago</td>
<td>312-929-4580</td>
<td>aucheval.tumblr.com</td>
<td>800 W. Randolph St., Chicago</td>
<td>800 West Randolph Street, Chicago, IL 60607, USA</td>
<td>41.884672</td>
<td>-87.647754</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>Woodland Mushroom</td>
<td>Xoco</td>
<td>Leave it to Rick Bayless and crew to come up w...</td>
<td>$9.50.</td>
<td>445 N. Clark St.</td>
<td>Chicago</td>
<td>312-334-3688</td>
<td>rickbayless.com</td>
<td>445 N. Clark St., Chicago</td>
<td>445 North Clark Street, Chicago, IL 60654, USA</td>
<td>41.890602</td>
<td>-87.630925</td>
</tr>
</tbody>
</table>
<p><em>Move onto the next section, which covers <a href="/2013/10/26/working-with-pandas-dataframes/">working with DataFrames</a>.</em></p>New theme for Pelican2013-10-24T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-10-24:/2013/10/24/new-theme-for-pelican/<p>I spent some time last weekend making minor changes to this site. Specifically:</p>
<ol>
<li>New typography - headers are <a href="http://www.google.com/fonts/specimen/Droid+Serif">Droid Serif</a>, while everything else is <a href="http://www.google.com/fonts/specimen/Droid+Sans">Droid Sans</a>. Fonts are also a bit bigger (I think it's easier to read).</li>
<li>Added <a href="http://jakevdp.github.io/">Jake Vanderplas'</a> <a href="https://github.com/getpelican/pelican-plugins/tree/master/liquid_tags">liquid tags plugin</a> for Pelican, which allows for easy embedding …</li></ol><p>I spent some time last weekend making minor changes to this site. Specifically:</p>
<ol>
<li>New typography - headers are <a href="http://www.google.com/fonts/specimen/Droid+Serif">Droid Serif</a>, while everything else is <a href="http://www.google.com/fonts/specimen/Droid+Sans">Droid Sans</a>. Fonts are also a bit bigger (I think it's easier to read).</li>
<li>Added <a href="http://jakevdp.github.io/">Jake Vanderplas'</a> <a href="https://github.com/getpelican/pelican-plugins/tree/master/liquid_tags">liquid tags plugin</a> for Pelican, which allows for easy embedding of <a href="http://ipython.org/notebook.html">IPython Notebooks</a>.</li>
</ol>
<p>I plan on continuing to tweak things over time, but I'm pretty happy with the way it looks right now.</p>
<p><a href="https://github.com/gjreda/notebook-simpler">Check it out on GitHub</a> and feel free to use it. It's mobile-friendly, too.</p>Useful Unix commands for data science2013-07-15T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-07-15:/2013/07/15/unix-commands-for-data-science/<p>Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column.</p>
<p>How would you do it?</p>
<p>Writing a script in <a href="http://www.python.org/">python</a>/<a href="http://www.ruby-lang.org/">ruby</a>/<a href="http://www.perl.org/">perl</a>/whatever would probably take a …</p><p>Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column.</p>
<p>How would you do it?</p>
<p>Writing a script in <a href="http://www.python.org/">python</a>/<a href="http://www.ruby-lang.org/">ruby</a>/<a href="http://www.perl.org/">perl</a>/whatever would probably take a few minutes and then even more time for the script to actually complete. A <a href="http://en.wikipedia.org/wiki/Database">database</a> and <a href="http://en.wikipedia.org/wiki/SQL">SQL</a> would be fairly quick, but then you'd have to load the data, which is kind of a pain.</p>
<p>Thankfully, the <a href="http://en.wikipedia.org/wiki/List_of_Unix_utilities">Unix utilities</a> exist and they're awesome.</p>
<p>To get the sum of a column in a huge text file, we can easily use <a href="http://en.wikipedia.org/wiki/AWK_(programming_language)">awk</a>. And we won't even need to read the entire file into memory.</p>
<p>Let's assume our data, which we'll call <em>data.csv</em>, is pipe-delimited ( | ), and we want to sum the fourth column of the file.</p>
<div class="highlight"><pre><span></span><code><span class="n">cat</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">awk</span> <span class="o">-</span><span class="n">F</span> <span class="s2">"|"</span> <span class="s1">'{ sum += $4 } END { printf "</span><span class="si">%.2f</span><span class="se">\n</span><span class="s1">", sum }'</span>
</code></pre></div>
<p>The above line says:</p>
<ol>
<li>Use the <a href="http://en.wikipedia.org/wiki/Cat_(Unix)">cat</a> command to stream (print) the contents of the file to <a href="http://en.wikipedia.org/wiki/Standard_streams">stdout</a>.</li>
<li><a href="http://en.wikipedia.org/wiki/Pipeline_(Unix)">Pipe</a> the streaming contents from our cat command to the next one - awk. </li>
<li>
<p>With <a href="http://en.wikipedia.org/wiki/AWK_(programming_language)">awk</a>:</p>
<ol>
<li>Set the field separator to the pipe character (-F "|"). Note that this has nothing to do with our pipeline in point #2.</li>
<li>Increment the variable <em>sum</em> with the value in the fourth column ($4). Since we used a pipeline in point #2, the contents of each line are being streamed to this statement.</li>
<li>Once the stream is done, print out the value of <em>sum</em>, using <a href="http://www.gnu.org/software/gawk/manual/html_node/Printf-Examples.html">printf</a> to format the value with two decimal places.</li>
</ol>
</li>
</ol>
<p>It took less than two minutes to run on the entire file - much faster than other options and written in a lot fewer characters.</p>
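<p>For comparison, a rough Python equivalent of that streaming sum looks like this - it also reads one line at a time, but takes noticeably more typing (the sample data here is invented):</p>

```python
import csv
import io

def sum_column(lines, col=3, delimiter="|"):
    """Stream delimited rows and total one column, one line at a time."""
    total = 0.0
    for row in csv.reader(lines, delimiter=delimiter):
        total += float(row[col])
    return total

# In practice `lines` would be an open file handle; a StringIO stands in here.
data = io.StringIO("a|b|c|1.5\nd|e|f|2.5\n")
print(f"{sum_column(data):.2f}")  # 4.00
```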
<p><a href="http://www.hilarymason.com">Hilary Mason</a> and <a href="http://www.columbia.edu/~chw2/">Chris Wiggins</a> wrote over at the <a href="http://www.dataists.com/">dataists blog</a> about the importance of any <a href="http://www.dataists.com/2010/09/a-taxonomy-of-data-science/">data scientist being familiar with the command line</a>, and I couldn't agree with them more. The command line is essential to my daily work, so I wanted to share some of the commands I've found most useful.</p>
<p>For those who are a bit newer to the command line than the rest of this post assumes, Hilary previously wrote a <a href="http://www.hilarymason.com/articles/intro-to-the-linux-command-line/">nice introduction to it</a>.</p>
<h3>Other commands</h3>
<h4><a href="http://en.wikipedia.org/wiki/Head_(Unix)">head</a> & <a href="http://en.wikipedia.org/wiki/Tail_(Unix)">tail</a></h4>
<p>Sometimes you just need to inspect the structure of a huge file. That's where <a href="http://en.wikipedia.org/wiki/Head_(Unix)">head</a> and <a href="http://en.wikipedia.org/wiki/Tail_(Unix)">tail</a> come in. Head prints the first ten lines of a file, while tail prints the last ten lines. Optionally, you can pass the <em>-n N</em> parameter to change the number of lines displayed.</p>
<div class="highlight"><pre><span></span><code><span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># time|away|score|home</span>
<span class="c1"># 20:00||0-0|Jump Ball won by Virginia Commonwealt.</span>
<span class="c1"># 19:45||0-0|Juvonte Reddic Turnover.</span>
<span class="n">tail</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># 0:14|Trey Davis Turnover.|62-71|</span>
<span class="c1"># 0:14||62-71|Briante Weber Steal.</span>
<span class="c1"># 0:00|End Game|End Game|End Game</span>
</code></pre></div>
<h4><a href="http://en.wikipedia.org/wiki/Wc_(Unix)">wc</a> (word count)</h4>
<p>By default, <a href="http://en.wikipedia.org/wiki/Wc_(Unix)">wc</a> will quickly tell you how many lines, words, and bytes are in a file. If you're looking for just the line count, you can pass the <em>-l</em> parameter in.</p>
<p>I use it most often to verify record counts between files or database tables throughout an analysis.</p>
<div class="highlight"><pre><span></span><code><span class="n">wc</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># 377 1697 17129 data.csv</span>
<span class="n">wc</span> <span class="o">-</span><span class="n">l</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># 377 data.csv</span>
</code></pre></div>
<h4><a href="http://en.wikipedia.org/wiki/Grep">grep</a></h4>
<p><a href="http://en.wikipedia.org/wiki/Grep">Grep</a> allows you to search through plain text files using <a href="http://en.wikipedia.org/wiki/Regular_expression">regular expressions</a>. I tend to <a href="http://regex.info/blog/2006-09-15/247">avoid regular expressions</a> when possible, but still find grep to be invaluable when searching through log files for a particular event.</p>
<p>There's an assortment of extra parameters you can use with grep, but the ones I tend to use the most are <em>-i</em> (ignore case), <em>-r</em> (recursively search directories), <em>-B N</em> (N lines before), <em>-A N</em> (N lines after).</p>
<div class="highlight"><pre><span></span><code><span class="n">grep</span> <span class="o">-</span><span class="n">i</span> <span class="o">-</span><span class="n">B</span> <span class="mi">1</span> <span class="o">-</span><span class="n">A</span> <span class="mi">1</span> <span class="n">steal</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># 17:25||2-4|Darius Theus Turnover.</span>
<span class="c1"># 17:25|Terrell Vinson Steal.|2-4|</span>
<span class="c1"># 17:18|Chaz Williams made Layup. Assisted by Terrell Vinson.|4-4|</span>
</code></pre></div>
<h4><a href="http://en.wikipedia.org/wiki/Sed">sed</a></h4>
<p><a href="http://en.wikipedia.org/wiki/Sed">Sed</a> is similar to <a href="http://en.wikipedia.org/wiki/Grep">grep</a> and <a href="http://en.wikipedia.org/wiki/AWK_(programming_language)">awk</a> in many ways, however I find that I most often use it when needing to do some find and replace magic on a very large file. The usual occurrence is when I've received a CSV file that was generated on Windows and my <a href="http://stackoverflow.com/questions/6373888/converting-newline-formatting-from-mac-to-windows">Mac isn't able to handle the carriage return</a> properly.</p>
<div class="highlight"><pre><span></span><code><span class="n">grep</span> <span class="n">Block</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span>
<span class="c1"># 16:43||5-4|Juvonte Reddic Block.</span>
<span class="c1"># 15:37||7-6|Troy Daniels Block.</span>
<span class="c1"># 14:05|Raphiael Putney Block.|11-8|</span>
<span class="n">sed</span> <span class="o">-</span><span class="n">e</span> <span class="s1">'s/Block/Rejection/g'</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">></span> <span class="n">rejection</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># replace all instances of the word 'Block' in data.csv with 'Rejection'</span>
<span class="c1"># stream the results to a new file called rejection.csv</span>
<span class="n">grep</span> <span class="n">Rejection</span> <span class="n">rejection</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">3</span>
<span class="c1"># 16:43||5-4|Juvonte Reddic Rejection.</span>
<span class="c1"># 15:37||7-6|Troy Daniels Rejection.</span>
<span class="c1"># 14:05|Raphiael Putney Rejection.|11-8|</span>
</code></pre></div>
<h4><a href="http://en.wikipedia.org/wiki/Sort_(Unix)">sort</a> & <a href="http://en.wikipedia.org/wiki/Uniq">uniq</a></h4>
<p><a href="http://en.wikipedia.org/wiki/Sort_(Unix)">Sort</a> outputs the lines of a file in order based on a column key using the <em>-k</em> parameter. If a key isn't specified, sort treats each entire line as the key and sorts lexicographically. The <em>-n</em> and <em>-r</em> parameters allow you to sort numerically and in reverse order, respectively.</p>
<div class="highlight"><pre><span></span><code><span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span>
<span class="c1"># time|away|score|home</span>
<span class="c1"># 20:00||0-0|Jump Ball won by Virginia Commonwealt.</span>
<span class="c1"># 19:45||0-0|Juvonte Reddic Turnover.</span>
<span class="c1"># 19:45|Chaz Williams Steal.|0-0|</span>
<span class="c1"># 19:39|Sampson Carter missed Layup.|0-0|</span>
<span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">sort</span>
<span class="c1"># 19:39|Sampson Carter missed Layup.|0-0|</span>
<span class="c1"># 19:45|Chaz Williams Steal.|0-0|</span>
<span class="c1"># 19:45||0-0|Juvonte Reddic Turnover.</span>
<span class="c1"># 20:00||0-0|Jump Ball won by Virginia Commonwealt.</span>
<span class="c1"># time|away|score|home</span>
<span class="c1"># columns separated by '|', sort on column 2 (-k2), case insensitive (-f)</span>
<span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">sort</span> <span class="o">-</span><span class="n">f</span> <span class="o">-</span><span class="n">t</span><span class="s1">'|'</span> <span class="o">-</span><span class="n">k2</span>
<span class="c1"># time|away|score|home</span>
<span class="c1"># 19:45|Chaz Williams Steal.|0-0|</span>
<span class="c1"># 19:39|Sampson Carter missed Layup.|0-0|</span>
<span class="c1"># 20:00||0-0|Jump Ball won by Virginia Commonwealt.</span>
<span class="c1"># 19:45||0-0|Juvonte Reddic Turnover.</span>
</code></pre></div>
<p>Sometimes you want to check for duplicate records in a large text file - that's when <a href="http://en.wikipedia.org/wiki/Uniq">uniq</a> comes in handy. By using the <em>-c</em> parameter, uniq will output the count of occurrences along with the line. You can also use the <em>-d</em> and <em>-u</em> parameters to output only duplicated or unique records. Note that uniq only compares <em>adjacent</em> lines, which is why each example below sorts the file first.</p>
<div class="highlight"><pre><span></span><code><span class="n">sort</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">uniq</span> <span class="o">-</span><span class="n">c</span> <span class="o">|</span> <span class="n">sort</span> <span class="o">-</span><span class="n">nr</span> <span class="o">|</span> <span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">7</span>
<span class="c1"># 2 8:47|Maxie Esho missed Layup.|46-54|</span>
<span class="c1"># 2 8:47|Maxie Esho Offensive Rebound.|46-54|</span>
<span class="c1"># 2 7:38|Trey Davis missed Free Throw.|51-56|</span>
<span class="c1"># 2 12:12||16-11|Rob Brandenberg missed Free Throw.</span>
<span class="c1"># 1 time|away|score|home</span>
<span class="c1"># 1 9:51||20-11|Juvonte Reddic Steal.</span>
<span class="n">sort</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">uniq</span> <span class="o">-</span><span class="n">d</span>
<span class="c1"># 12:12||16-11|Rob Brandenberg missed Free Throw.</span>
<span class="c1"># 7:38|Trey Davis missed Free Throw.|51-56|</span>
<span class="c1"># 8:47|Maxie Esho Offensive Rebound.|46-54|</span>
<span class="c1"># 8:47|Maxie Esho missed Layup.|46-54|</span>
<span class="n">sort</span> <span class="n">data</span><span class="o">.</span><span class="n">csv</span> <span class="o">|</span> <span class="n">uniq</span> <span class="o">-</span><span class="n">u</span> <span class="o">|</span> <span class="n">wc</span> <span class="o">-</span><span class="n">l</span>
<span class="c1"># 369 (unique lines)</span>
</code></pre></div>
<p>While it's sometimes difficult to remember all of the parameters for the Unix commands, getting familiar with them has been beneficial to my productivity and allowed me to avoid many headaches when working with large text files.</p>
<p>Hopefully you'll find them as useful as I have.</p>
<p><em>Additional Resources:</em></p>
<ul>
<li><a href="http://www.drbunsen.org/explorations-in-unix/">Explorations in Unix</a> by <a href="http://www.drbunsen.org/">Seth Brown</a></li>
<li><a href="http://www.ceri.memphis.edu/computer/docs/unix/bshell.htm">An Introduction to the Unix Shell</a></li>
<li><a href="http://blog.comsysto.com/2013/04/25/data-analysis-with-the-unix-shell/">Data Analysis with the Unix Shell</a></li>
<li><a href="http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html">7 Command Line Tools for Data Science</a></li>
</ul>How random is JavaScript's Math.random()?2013-06-30T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-06-30:/2013/06/30/testing-javascripts-random-function/<p>A few weeks back, I was talking with my friend <a href="http://mollybierman.tumblr.com">Molly</a> about personal domains and realized that her nickname, Bierface, was available. The exchange basically went like this:</p>
<blockquote>
<p>Me: I should buy bierface.com and just put up a ridiculous picture of you.</p>
<p>Molly: You would have to do a …</p></blockquote><p>A few weeks back, I was talking with my friend <a href="http://mollybierman.tumblr.com">Molly</a> about personal domains and realized that her nickname, Bierface, was available. The exchange basically went like this:</p>
<blockquote>
<p>Me: I should buy bierface.com and just put up a ridiculous picture of you.</p>
<p>Molly: You would have to do a slideshow. Too many gems.</p>
</blockquote>
<p><a href="http://www.bierface.com">So I did just that</a>, switching randomly between 14 pictures every time the page is loaded. The laughs from it have been well worth the $10 spent purchasing the domain.</p>
<p>She started to question the randomness though. Here's what the code that loads each image looks like:</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">a</span> <span class="na">href</span><span class="o">=</span><span class="s">"http://mollybierman.tumblr.com"</span><span class="p">></span>
<span class="p"><</span><span class="nt">img</span> <span class="na">id</span><span class="o">=</span><span class="s">"bierface"</span> <span class="na">src</span><span class="o">=</span><span class="s">""</span><span class="p">/></span>
<span class="p"></</span><span class="nt">a</span><span class="p">></span>
<span class="p"><</span><span class="nt">script</span> <span class="na">type</span><span class="o">=</span><span class="s">"text/javascript"</span><span class="p">></span><span class="w"></span>
<span class="w"> </span><span class="kd">var</span><span class="w"> </span><span class="nx">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">Math</span><span class="p">.</span><span class="nx">ceil</span><span class="p">(</span><span class="nb">Math</span><span class="p">.</span><span class="nx">random</span><span class="p">()</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mf">14</span><span class="p">);</span><span class="w"></span>
<span class="w"> </span><span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="s2">"bierface"</span><span class="p">).</span><span class="nx">src</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"./pictures/"</span><span class="o">+</span><span class="nx">n</span><span class="o">+</span><span class="s2">".jpg"</span><span class="p">;</span><span class="w"></span>
<span class="p"></</span><span class="nt">script</span><span class="p">></span>
</code></pre></div>
<p>All we're doing is creating an empty <em><code><img></code></em> element, and then changing the src attribute of that element via JavaScript. The first line of JavaScript uses a combination of <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/ceil">Math.ceil()</a> and <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/random">Math.random()</a> to get a random integer between 1 and 14 (the images are named 1.jpg through 14.jpg). The second line uses that integer to create a file path and tells our <em><code><img></code></em> element to use that path as the src for the image.</p>
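<p><em>An aside that isn't in the original post:</em> because <em>Math.random()</em> returns a float in the half-open interval [0, 1), <em>Math.ceil()</em> can in principle yield 0 (when the generator returns exactly 0). The floor-plus-one idiom avoids that edge case entirely. A rough Python sketch of both mappings:</p>

```python
import math
import random

def ceil_style(k=14):
    # Mirrors Math.ceil(Math.random() * 14): random() is in [0, 1),
    # so this lands in 1..k except in the rare case random() == 0.0,
    # which maps to 0.
    return math.ceil(random.random() * k)

def floor_style(k=14):
    # floor(x * k) + 1 always lands in 1..k, one value per equal-width slice.
    return math.floor(random.random() * k) + 1

print(all(1 <= floor_style() <= 14 for _ in range(1000)))  # -> True
```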
<p>Since the image is loaded by your web client, this seemed like a great opportunity to learn the very basics of grabbing client-side data - I could write some code to repeatedly get which image was loaded in order to determine how random <em>Math.random()</em> truly is.</p>
<h4>The Setup</h4>
<p>We're going to be using <a href="http://jeanphix.me/Ghost.py/">Ghost.py</a> to simulate a <a href="http://en.wikipedia.org/wiki/WebKit">WebKit</a> client. Ghost.py requires <a href="http://en.wikipedia.org/wiki/PyQt">PyQt</a> or <a href="http://en.wikipedia.org/wiki/PySide">PySide</a>, so you'll want to grab one of those, too. I'm on OS X 10.8.2 and using PySide 1.1.0 for Python 2.7, which you can get <a href="http://qt-project.org/wiki/PySide_Binaries_MacOSX">here</a>. You'll also need to grab Qt 4.7, which you can find <a href="http://packages.kitware.com/item/3736">here</a>.</p>
<h4>The Code</h4>
<p>With a little Python and Ghost.py, we can simulate a browser, allowing us to execute JavaScript telling us which image was loaded. We can also use <a href="http://matplotlib.org/">matplotlib</a> to plot the distribution.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">ghost</span> <span class="kn">import</span> <span class="n">Ghost</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">ghost</span> <span class="o">=</span> <span class="n">Ghost</span><span class="p">()</span>
<span class="c1"># JavaScript to grab the src file name for the image loaded</span>
<span class="n">js</span> <span class="o">=</span> <span class="s2">"document.getElementById('bierface').src.substr(33);"</span>
<span class="c1"># initialize zero'd out dictionary to hold image counts</span>
<span class="c1"># this way we can draw a nice, empty, base plot before we have actual values</span>
<span class="n">counts</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">15</span><span class="p">),</span> <span class="p">[</span><span class="mi">0</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">15</span><span class="p">)]))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">xrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1002</span><span class="p">):</span>
    <span class="c1"># draw empty plot on first pass</span>
    <span class="k">if</span> <span class="n">i</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">page</span><span class="p">,</span> <span class="n">page_resources</span> <span class="o">=</span> <span class="n">ghost</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">'http://www.bierface.com'</span><span class="p">)</span>
        <span class="n">image</span> <span class="o">=</span> <span class="n">ghost</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">js</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">image</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">image</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'.'</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># grab just the image number</span>
        <span class="n">counts</span><span class="p">[</span><span class="n">image</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">15</span><span class="p">),</span> <span class="n">counts</span><span class="o">.</span><span class="n">values</span><span class="p">(),</span> <span class="n">align</span><span class="o">=</span><span class="s1">'center'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">15</span><span class="p">),</span> <span class="n">counts</span><span class="o">.</span><span class="n">keys</span><span class="p">())</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Image'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'# of times shown'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'n = </span><span class="si">{0}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">zfill</span><span class="p">(</span><span class="mi">4</span><span class="p">)))</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">()</span>
    <span class="n">path</span> <span class="o">=</span> <span class="s1">'</span><span class="si">{0}</span><span class="s1">/images/</span><span class="si">{1}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">getcwd</span><span class="p">(),</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">.</span><span class="n">zfill</span><span class="p">(</span><span class="mi">4</span><span class="p">))</span>
    <span class="n">save</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">close</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">os</span><span class="o">.</span><span class="n">system</span><span class="p">(</span><span class="s1">'ffmpeg -f image2 -r 10 -i images/</span><span class="si">%04d</span><span class="s1">.png -s 480x360 random.avi'</span><span class="p">)</span>
</code></pre></div>
<p>Let's walk through the code:</p>
<ol>
<li>Load our libraries and create an instance of the Ghost class.</li>
<li>Store the JavaScript we'll need to execute in order to grab the image file name into a variable named <em>js</em>.</li>
<li>The comment should explain this one - we're initializing a zero'd out dictionary called <em>counts</em> so that our first plot doesn't have an x-axis with just one value. Each key of the dictionary will correspond to one of the images.</li>
<li>The <a href="http://docs.python.org/2/reference/compound_stmts.html#for">for loop</a> is used to run 1,000 simulations. My <a href="http://docs.python.org/2/library/functions.html#xrange">xrange</a> usage is a little wacky because I'm using it to title and name the plots - typically <em>xrange</em> starts with 0 and runs up <em>until</em> the number specified (e.g. 1,001 will be the last loop, not 1,002).</li>
<li>
<p>This is the section that grabs which image was loaded by simulating a WebKit client with Ghost.py. This section does not get run on the first pass since we want to start with an empty plot.</p>
<ol>
<li>Load bierface.com into our <em>page</em> variable.</li>
<li>Execute the JavaScript mentioned in #2 and store it in the <em>image</em> variable. Remember that this will be a string.</li>
<li>Split the <em>image</em> string so that we just grab the image number loaded.</li>
<li>Update our dictionary of counts for the given <em>image</em>.</li>
</ol>
</li>
<li>
<p>Here we're using <a href="http://matplotlib.org/api/pyplot_api.html">matplotlib.pyplot</a> to draw a bar chart. Thanks to <a href="http://www.jesshamrick.com/">Jess Hamrick</a> for some awesome <a href="http://www.jesshamrick.com/2012/09/03/saving-figures-from-pyplot/">plot-saving boilerplate</a>, which I'm using behind the <em>save</em> function.</p>
</li>
<li>Finally, use <a href="https://en.wikipedia.org/wiki/FFmpeg">ffmpeg</a> to stitch our plots together into a video.</li>
</ol>
<h4>The Results</h4>
<p><em>Math.random()</em> is pretty random (though #7 is the clear loser in the video below). It's easy to think it's not when working with a small sample size, but it's clear the numbers start to even out as the sample size increases.</p>
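<p><em>A possible extension, not part of the original experiment:</em> one way to put a number on "pretty random" is a chi-squared goodness-of-fit test against a uniform distribution over the 14 images. This sketch uses only Python's standard library, with simulated draws standing in for the scraped counts:</p>

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
n_draws, k = 1000, 14
counts = [0] * k
for _ in range(n_draws):
    counts[random.randrange(k)] += 1  # stand-in for the scraped image counts

# Chi-squared statistic against a uniform expectation
expected = n_draws / k
chi2 = sum((c - expected) ** 2 / expected for c in counts)

# With 13 degrees of freedom, the 5% critical value is roughly 22.36;
# a statistic below that is consistent with a uniform generator.
print(round(chi2, 2))
```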
<p><center><iframe width="480" height="360" src="//www.youtube.com/embed/y-tRXCyBk4w" frameborder="0" allowfullscreen></iframe></center></p>Join vs Exists vs In (SQL)2013-06-03T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-06-03:/2013/06/03/join-vs-exists-vs-in/<p>Last weekend, I came across <a href="http://en.wikipedia.org/wiki/Jeff_Atwood">Jeff Atwood</a>'s excellent <a href="http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html">visual explanation of SQL joins</a> on Hacker News.</p>
<p>It reminded me of teaching SQL to the incoming batch of <a href="http://www.pwc.com/us/en/forensic-services/technology-solutions.jhtml">PwC FTS</a> associates a few years ago. Not many of them had prior programming experience, much less SQL exposure, so it was …</p><p>Last weekend, I came across <a href="http://en.wikipedia.org/wiki/Jeff_Atwood">Jeff Atwood</a>'s excellent <a href="http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html">visual explanation of SQL joins</a> on Hacker News.</p>
<p>It reminded me of teaching SQL to the incoming batch of <a href="http://www.pwc.com/us/en/forensic-services/technology-solutions.jhtml">PwC FTS</a> associates a few years ago. Not many of them had prior programming experience, much less SQL exposure, so it was a fun week that tested how well we instructors could teach the topic.</p>
<p>Most of them intuitively picked up on how the IN clause worked, but struggled with EXISTS and JOINs initially. An explanation that always seemed to help illustrate the concept was to show that often you can write the exact same query using an IN, EXISTS, or a JOIN.</p>
<p>As an example, let's assume the following two tables, which we'll call <em>tableA</em> and <em>tableB</em>.</p>
<div class="highlight"><pre><span></span><code><span class="n">id</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="n">title</span><span class="w"></span>
<span class="o">--</span><span class="w"> </span><span class="o">----</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="o">----</span><span class="w"></span>
<span class="mh">1</span><span class="w"> </span><span class="n">Kenny</span><span class="w"> </span><span class="mh">1</span><span class="w"> </span><span class="n">Analyst</span><span class="w"></span>
<span class="mh">1</span><span class="w"> </span><span class="n">Rob</span><span class="w"> </span><span class="mh">2</span><span class="w"> </span><span class="n">Sales</span><span class="w"></span>
<span class="mh">4</span><span class="w"> </span><span class="n">Molly</span><span class="w"> </span><span class="mh">3</span><span class="w"> </span><span class="n">Manager</span><span class="w"></span>
<span class="mh">1</span><span class="w"> </span><span class="n">Greg</span><span class="w"></span>
<span class="mh">2</span><span class="w"> </span><span class="n">John</span><span class="w"></span>
</code></pre></div>
<p>If we wanted to get everyone that's an Analyst, we could do the following:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tableA</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">tableA</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="k">IN</span><span class="w"> </span><span class="p">(</span><span class="k">SELECT</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">tableB</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Analyst'</span><span class="p">);</span><span class="w"></span>
<span class="c1">-- Returns 3 records - Kenny, Rob, and Greg</span>
</code></pre></div>
<p>For those not very familiar with SQL, this should be relatively easy to understand. We have written a <a href="http://en.wikipedia.org/wiki/Correlated_subquery">subquery</a> that will get the <em>id</em> for the <em>Analyst</em> title in <em>tableB</em>. Using IN, we can then grab all of the employees from <em>tableA</em> who have that title.</p>
<p>While IN statements are fairly intuitive, they're often less efficient than the same query written as a JOIN or EXISTS statement would be.</p>
<p>To produce the same results as above, we can do the following:</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- EXISTS</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tableA</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="p">(</span><span class="k">SELECT</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">tableB</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Analyst'</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">tableA</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="p">);</span><span class="w"></span>
<span class="c1">-- JOIN (INNER is the default when only JOIN is specified)</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tableA</span><span class="w"></span>
<span class="k">JOIN</span><span class="w"> </span><span class="n">tableB</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">tableA</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Analyst'</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>In many cases, EXISTS or JOIN will be more efficient (and faster) than an IN statement. Why?</p>
<p>When using an IN combined with a subquery, the database may process <em>the entire subquery</em> first, materialize its results, and then match the outer query against them based on the relationship specified for the IN.</p>
<p>With an EXISTS or a JOIN, the database can check the relationship row by row and stop as soon as it finds a match. That said, many modern query optimizers rewrite IN subqueries as semi-joins, in which case all three forms perform the same; when they do differ, EXISTS or JOIN usually wins unless the table in the subquery is <em>very</em> small, so it's worth benchmarking.</p>
<p>Furthermore, writing the query as a JOIN gives us some additional flexibility to easily return all of the employees if we'd like, or to even check for employees who do not have a title (orphan records).</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- Return employees and display their title</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tableA</span><span class="w"></span>
<span class="k">JOIN</span><span class="w"> </span><span class="n">tableB</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">tableA</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="p">;</span><span class="w"></span>
<span class="c1">-- 1 Kenny 1 Analyst</span>
<span class="c1">-- 1 Rob 1 Analyst</span>
<span class="c1">-- 1 Greg 1 Analyst</span>
<span class="c1">-- 2 John 2 Sales</span>
<span class="c1">-- Which employees do not have a title?</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tableA</span><span class="w"></span>
<span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">tableB</span><span class="w"></span>
<span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">tableA</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">tableB</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="k">IS</span><span class="w"> </span><span class="k">NULL</span><span class="p">;</span><span class="w"></span>
<span class="c1">-- 4 Molly NULL NULL</span>
</code></pre></div>
<p>In the first query above, Molly falls out because she does not have a title. If we wanted her to appear in the record set, we could simply change the JOIN to a LEFT JOIN, and she would appear with NULL data from <em>tableB</em>.</p>
<p>If you have many IN statements littered throughout your code, you should compare the performance of these queries against an EXISTS or JOIN version of the same query - you'll likely see performance gains.</p>
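<p>As a quick sanity check - sketched here with Python's built-in <em>sqlite3</em>, not something from the original post - all three forms return the same employees for the sample tables above:</p>

```python
import sqlite3

# Build the post's sample tables in an in-memory database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER, name TEXT);
    CREATE TABLE tableB (id INTEGER, title TEXT);
    INSERT INTO tableA VALUES (1,'Kenny'),(1,'Rob'),(4,'Molly'),(1,'Greg'),(2,'John');
    INSERT INTO tableB VALUES (1,'Analyst'),(2,'Sales'),(3,'Manager');
""")

in_q = """SELECT name FROM tableA
          WHERE id IN (SELECT id FROM tableB WHERE title = 'Analyst')"""
exists_q = """SELECT name FROM tableA WHERE EXISTS
              (SELECT 1 FROM tableB WHERE title = 'Analyst' AND tableA.id = tableB.id)"""
join_q = """SELECT name FROM tableA JOIN tableB ON tableA.id = tableB.id
            WHERE tableB.title = 'Analyst'"""

# Sort each result set so row order doesn't matter in the comparison
results = [sorted(r[0] for r in conn.execute(q)) for q in (in_q, exists_q, join_q)]
print(results[0] == results[1] == results[2])  # -> True
```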
<p>I hope this illustrated some of the subtle differences between INs, EXISTS, and JOINs. Questions and feedback in the comments are appreciated.</p>More web scraping with Python (and a map)2013-04-29T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-04-29:/2013/04/29/more-web-scraping-with-python/<p><em>This is a follow-up to my <a href="/2013/03/03/web-scraping-101-with-python/" title="Web Scraping 101 with Python">previous post</a> about web scraping with Python</em>.</p>
<p>Previously, I wrote a basic intro to scraping data off of websites. Since I wanted to keep the intro fairly simple, I didn't cover storing the data. In this post, I'll cover the basics of writing the …</p><p><em>This is a follow-up to my <a href="/2013/03/03/web-scraping-101-with-python/" title="Web Scraping 101 with Python">previous post</a> about web scraping with Python</em>.</p>
<p>Previously, I wrote a basic intro to scraping data off of websites. Since I wanted to keep the intro fairly simple, I didn't cover storing the data. In this post, I'll cover the basics of writing the scraped data to a flat file and then take things a bit further from there.</p>
<p>Last time, we used the Chicago Reader's Best of 2011 list, but let's change it up a bit this time and scrape a different site. Why? Because scrapers break, so we might as well practice a little bit more by scraping something different.</p>
<p>In this post, we're going to use the data from <a href="http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/">Chicago Magazine's Best Sandwiches list</a> because ... who doesn't like sandwiches?</p>
<p>If you're new to scraping, it might be a good idea to go back and read my <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/" title="Web Scraping 101 with Python">previous post</a> as a refresher as I don't intend to be methodical in this one.</p>
<h4>Finding the data</h4>
<p>Looking at the list, it's clear everything is in a fairly standard format - each of the sandwiches in the list gets a <em><code><div class="sammy"></code></em> and each div holds a bit more information - specifically, the rank, sandwich name, location, and a URL to a detailed page about each entry.</p>
<p><img alt="Delicious sammy divs" src="/images/sammy-divs.png"></p>
<p>Clicking through a few of the sammy links, we can see that each sandwich also gets a detailed page that includes the sandwich's name, rank, description, and price along with the restaurant's name, address, phone number, and website. Each of these details is contained within <em><code><div id="sandwich"></code></em>, which will make them very easy to get at.</p>
<p><img alt="Sandwich details HTML" src="/images/sammy-details.png"></p>
<h4>Package choices</h4>
<p>We'll again be using the <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> and <a href="http://docs.python.org/2/library/urllib2.html">urllib2</a> libraries. Last time around, the choice of these two libraries generated some discussion in the post's <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/#disqus_thread">comments section</a>, on <a href="http://www.reddit.com/r/Python/comments/19lnth/web_scraping_101_with_python_and_beautifulsoup/">Reddit</a>, and <a href="https://news.ycombinator.com/item?id=5353347">Hacker News</a>.</p>
<p>The reason I use <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> is because I've found it to be very easy to use and understand, but YMMV. It's been around for a very long time (since 2004) and is certainly in the tool belt of many. That said, Python has a vast ecosystem with a lot of scraping libraries and ones like <a href="http://scrapy.org/">Scrapy</a> and <a href="http://pythonhosted.org/pyquery/">PyQuery</a> (amongst many others) are worth a look.</p>
<p><a href="http://docs.python.org/2/library/urllib2.html">Urllib2</a> is <em>one</em> of Python's URL handling packages within its standard library. Because the standard library has <a href="http://docs.python.org/2/library/urllib.html">urllib</a> and <a href="http://docs.python.org/2/library/urllib2.html">urllib2</a>, it has at times been confusing to know which is the one you're actually looking for. On top of that, <a href="http://kennethreitz.org/">Kenneth Reitz</a>'s fantastic <a href="http://docs.python-requests.org/en/latest/">requests</a> library exists, which really simplifies dealing with HTTP.</p>
<p>In this example, and in the previous one, I use urllib2 simply because I <em>only</em> need the <a href="http://docs.python.org/2/library/urllib2.html#urllib2.urlopen">urlopen</a> function. If this scraper were more complex, I would likely use <a href="http://docs.python-requests.org/en/latest/">requests</a>, but I think using a third party library is a bit of overkill for this very simple use case.</p>
<h4>Getting the data</h4>
<p>Our code this time is going to be very similar to what it was <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/" title="Web Scraping 101 with Python">in the previous post</a>, save for a few minor changes. Since the details pages have the data we're looking for, let's get all of their URLs from the initial list page, and then process each details page. We're also going to write all of the data to a tab-delimited file using Python's <a href="http://docs.python.org/2/library/csv.html">csv</a> module.</p>
<p>Last time around, we wrote our code as a set of functions, which I think helps the code's readability since it makes clear what each piece of the code is doing. This time around, we're just going to write a short script since this is really a one-off thing - once we have our data written to a CSV, we don't really have a use for this code anymore.</p>
<p>Our script will do the following:</p>
<ol>
<li>Load our libraries</li>
<li>Read our <em>base_url</em> into a BeautifulSoup object, grab all <em><code><div class="sammy"></code></em> sections, and then from each section, grab our sammy details URL.</li>
<li>Open up a file named <em>src-best-sandwiches.tsv</em> for writing. We'll write to this file using Python's <a href="http://docs.python.org/2/library/csv.html#csv.writer">csv.writer</a> object and separate the fields by a tab (\t). We'll also pass in a list of field names so that our file has a header row.</li>
<li>Loop through all of our sammy details URLs, grabbing each piece of information we're interested in, and writing that data to our <em>src-best-sandwiches.tsv</em> file.</li>
</ol>
<div class="highlight"><pre><code>from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

base_url = ("http://www.chicagomag.com/Chicago-Magazine/"
            "November-2012/Best-Sandwiches-Chicago/")

soup = BeautifulSoup(urlopen(base_url).read())
sammies = soup.find_all("div", "sammy")
sammy_urls = [div.a["href"] for div in sammies]

with open("data/src-best-sandwiches.tsv", "w") as f:
    fieldnames = ("rank", "sandwich", "restaurant", "description", "price",
                  "address", "phone", "website")
    output = csv.writer(f, delimiter="\t")
    output.writerow(fieldnames)

    for url in sammy_urls:
        url = url.replace("http://www.chicagomag.com", "")  # inconsistent URL
        page = urlopen("http://www.chicagomag.com{0}".format(url))
        soup = BeautifulSoup(page.read()).find("div", {"id": "sandwich"})
        rank = soup.find("div", {"id": "sandRank"}).encode_contents().strip()
        sandwich = soup.h1.encode_contents().strip().split("<br/>")[0]
        restaurant = soup.h1.span.encode_contents()
        description = soup.p.encode_contents().strip()
        addy = soup.find("p", "addy").em.encode_contents().split(",")[0].strip()
        price = addy.partition(" ")[0].strip()
        address = addy.partition(" ")[2].strip()
        phone = soup.find("p", "addy").em.encode_contents().split(",")[1].strip()
        if soup.find("p", "addy").em.a:
            website = soup.find("p", "addy").em.a.encode_contents()
        else:
            website = ""
        output.writerow([rank, sandwich, restaurant, description, price,
                         address, phone, website])

print "Done writing file"
</code></pre></div>
<p>While our scraper does a good job of getting all of the sandwiches and restaurants, a couple of restaurants had "multiple locations" listed as their address. If we need this data, we'll have to find another way to get it (like checking each restaurant's website and manually adding their locations to our dataset). We'll also need to manually fix some oddities that wound up in our data due to some inconsistent HTML on the other end (addresses and URLs winding up in the phone numbers column).</p>
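<p>A quick sanity check makes those manual fixes easier to find. As a rough sketch (in Python 3, with made-up sample rows mimicking our TSV), we can flag any row whose phone field doesn't look like a phone number:</p>

```python
import re

# Hypothetical sample rows mimicking the scraped TSV -- the second one
# shows the kind of oddity described above (a URL in the phone column).
rows = [
    {"restaurant": "Old Oak Tap", "phone": "773-772-0406"},
    {"restaurant": "Some Deli", "phone": "http://example.com"},
]

# Digits plus common separators, at least seven characters long.
PHONE_RE = re.compile(r"^[\d\-\.\(\) ]{7,}$")

flagged = [r["restaurant"] for r in rows if not PHONE_RE.match(r["phone"])]
print(flagged)
```

<p>Rows that get flagged can then be fixed by hand, which is usually faster than trying to handle every HTML inconsistency in the scraper itself.</p>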
<p>We're now left with a file full of data about Chicago Magazine's fifty best sandwiches. Sure, it's nice to have the data structured neatly in a flat file, but that's not all that interesting.</p>
<p>Collecting and hoarding data isn't of use to anyone - it's a waste of a potentially very valuable resource unless it's taken a step further. In some cases, this means a thorough analysis in search of patterns and trends, surfacing relationships we did not necessarily expect, and utilizing that information to better our decision-making. Data should be used to inform. In other cases, even a very basic visualization of the data can be of use.</p>
<p>Since we have addresses for each restaurant, this seems like a great time to make a map, but first, geocoding!</p>
<h4>Geocoding</h4>
<p>We're going to make our map using the <a href="https://developers.google.com/maps/">Google Maps API</a>, but in order to do so, we're first going to need to geocode our addresses to a set of lat/long points. Don't worry, I've taken the time to manually fill in the blanks on those "multiple locations" restaurants (you can grab the new file from my <a href="https://github.com/gjreda/best-sandwiches">GitHub repo</a> - it's called <em>best-sandwiches.tsv</em>).</p>
<p>To do so, we'll just write a short Python script which hits the <a href="https://developers.google.com/maps/documentation/geocoding/">Google Geocoding API</a>. Our script will do the following:</p>
<ol>
<li>Read our <em>best-sandwiches.tsv</em> file using the CSV module's <a href="http://docs.python.org/2/library/csv.html#csv.DictReader">DictReader</a> class, which reads each line of the file into its own dictionary object.</li>
<li>For each address, make a call to the Google Geocoding API, which will return a JSON response full of details about that address.</li>
<li>Using the <a href="http://docs.python.org/2/library/csv.html#csv.DictWriter">DictWriter</a> class, write a new file with our data along with the formatted address, lat, and long that we got back from the geocoder.</li>
</ol>
<div class="highlight"><pre><code>from urllib2 import urlopen
import csv
import json
from time import sleep

def geocode(address):
    url = ("http://maps.googleapis.com/maps/api/geocode/json?"
           "sensor=false&address={0}".format(address.replace(" ", "+")))
    return json.loads(urlopen(url).read())

with open("data/best-sandwiches.tsv", "r") as f:
    reader = csv.DictReader(f, delimiter="\t")

    with open("data/best-sandwiches-geocode.tsv", "w") as w:
        fields = ["rank", "sandwich", "restaurant", "description", "price",
                  "address", "city", "phone", "website", "full_address",
                  "formatted_address", "lat", "lng"]
        writer = csv.DictWriter(w, fieldnames=fields, delimiter="\t")
        writer.writeheader()

        for line in reader:
            print "Geocoding: {0}".format(line["full_address"])
            response = geocode(line["full_address"])
            if response["status"] == u"OK":
                results = response.get("results")[0]
                line["formatted_address"] = results["formatted_address"]
                line["lat"] = results["geometry"]["location"]["lat"]
                line["lng"] = results["geometry"]["location"]["lng"]
            else:
                line["formatted_address"] = ""
                line["lat"] = ""
                line["lng"] = ""
            sleep(1)
            writer.writerow(line)

print "Done writing file"
</code></pre></div>
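<p>One fragile spot worth noting: replacing spaces with "+" handles the common case, but addresses containing characters like "&", "#", or non-ASCII text would corrupt the query string. Python's standard library has a proper encoder for this; a sketch of the Python 3 version (the address here is made up):</p>

```python
from urllib.parse import quote_plus

# quote_plus encodes spaces as '+' and percent-encodes everything else
# that isn't safe to put in a query string.
address = "1120 W Grand Ave, Chicago & environs #2"
print(quote_plus(address))
```

<p>In Python 2, the equivalent function lives at <em>urllib.quote_plus</em>.</p>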
<p>Our file now has everything we need to make our map, which we're able to do with some basic HTML, CSS, JavaScript, and a little <a href="https://developers.google.com/fusiontables/">Google Fusion Tables</a> magic.</p>
<h4>Mapping</h4>
<p>While we could write another Python script to turn our flat file data into KML for mapping, it's much, much easier to use Google Fusion Tables. However, one important note with the Fusion Tables approach is that the underlying data must be within a <em>public</em> Fusion Table. Since our data is scraped from a publicly accessible website, that's not an issue here.</p>
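<p>For the curious, the KML route wouldn't be much code either. A minimal sketch (Python 3, standard library only, with a made-up in-memory row standing in for our geocoded TSV) that turns lat/lng rows into KML placemarks:</p>

```python
import csv
import io

KML_TEMPLATE = ('<?xml version="1.0" encoding="UTF-8"?>'
                '<kml xmlns="http://www.opengis.net/kml/2.2">'
                '<Document>{placemarks}</Document></kml>')

# Note KML wants coordinates in lng,lat order.
PLACEMARK = ("<Placemark><name>{name}</name>"
             "<Point><coordinates>{lng},{lat},0</coordinates></Point>"
             "</Placemark>")

# A tiny in-memory stand-in for best-sandwiches-geocode.tsv
tsv = "restaurant\tlat\tlng\nOld Oak Tap\t41.8902\t-87.6667\n"
reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")

placemarks = "".join(
    PLACEMARK.format(name=row["restaurant"], lat=row["lat"], lng=row["lng"])
    for row in reader
)
kml = KML_TEMPLATE.format(placemarks=placemarks)
print(kml)
```

<p>A real version would also escape any XML-special characters in the names, which is part of why Fusion Tables is the easier path here.</p>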
<p>If you don't see Fusion Tables as an option in your Google Drive account, you'll need to "connect more apps" and add it from there.</p>
<p><img alt="Adding Fusion Tables" src="/images/add-fusion-tables.png"></p>
<p>Once you've added the app, create a new Fusion Table from the delimited file on your computer (our <em>best-sandwiches-geocode.tsv</em>).</p>
<p><img alt="Loading to Fusion Tables" src="/images/loading-to-fusion-tables.png"></p>
<p>After you've finished your upload process, you should now have a spreadsheet-like table with the data in it. You'll notice that some of the columns are highlighted in yellow - this means that Fusion Tables recognizes them as locations. Our lat and lng columns should be all the way at the right - hover over the lat column header and select <em>change</em> from the drop-down. This should display a prompt showing that the column type is a two-column location made up of our lat and lng.</p>
<p>This is probably where I should point out that we could have also used Fusion Tables to geocode our data, but writing a script in Python seemed like more fun to me.</p>
<p><img alt="Lat Lng column type" src="/images/lat-lng-column-type.png"></p>
<p>Now that we have our data successfully in the Fusion Table, we can use a combination of HTML, CSS, some JavaScript, and the Fusion Tables API to serve up a map (you could also just click the map tab in Fusion Tables to see an embedded map of the data, but that's not as fun). We can even style the map with the <a href="http://gmaps-samples-v3.googlecode.com/svn/trunk/styledmaps/wizard/index.html">Google Maps Style Wizard</a>.</p>
<p>Head over to my <a href="https://github.com/gjreda/best-sandwiches">GitHub repo</a> to see the HTML, CSS, and JavaScript used to create the map (along with the rest of the code and data used throughout this post). I've done my best to comment the <em>best-sandwiches.html</em> file to indicate what each piece is doing. I've also used HTML5's geolocation capabilities so that fellow Chicagoans can easily see which sandwiches are near them (it displays pretty nicely on a mobile browser, too).</p>
<p>You can check out the awesome map we made <a href="http://www.gregreda.com/best-sandwiches.html">here</a>. Note that if you aren't in Chicago and let your browser know your location, you likely won't see any of the data - you'll have to scroll over to Chicago.</p>
<p>Hopefully you found this post fun and informative. Was there something I didn't cover? Let me know in the comments.</p>Write online about what you love2013-03-16T00:00:00-07:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-03-16:/2013/03/16/why-you-should-write-online/<p>The other week, I wrote a <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">very basic intro to web scraping with Python</a>. Some friends knew that I had experience scraping data and they wanted to learn, so I figured it would be a great opportunity to write something publicly and test how well I could explain it.</p>
<p>I'll …</p><p>The other week, I wrote a <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">very basic intro to web scraping with Python</a>. Some friends knew that I had experience scraping data and they wanted to learn, so I figured it would be a great opportunity to write something publicly and test how well I could explain it.</p>
<p>I'll be extending that scraping post a bit more in the future, but first I wanted to write about how the week and a half since I posted it has gone - or, explain why I think you should write online about what you love.</p>
<h4>How it started</h4>
<p>Shortly after finishing the post and feeling fairly satisfied with the way it turned out, I posted it and e-mailed a link to three of my friends - two of which were the ones who asked me to teach them. One of them, <a href="https://twitter.com/kennylong">Kenny</a>, immediately messaged me, read through the post, and said I should share it on Twitter. <a href="https://twitter.com/gjreda/status/308337050065727489">So I did</a>. To me, any feedback was better than no feedback, so I posted it <a href="http://www.reddit.com/r/Python/comments/19lnth/web_scraping_101_with_python_and_beautifulsoup/">in the Python subreddit</a> too.</p>
<p>I was really just hoping some people would see it and let me know what they thought.</p>
<p>Turns out, quite a few more people than my 92 followers (at the time) have seen it in the week and a half since. About 32,000 more.</p>
<p>It was pretty exhilarating to watch something I wrote be shared in real-time. Many of the <a href="https://twitter.com/siah/status/308719789524799488">data</a> <a href="https://twitter.com/treycausey/status/308342790180458496">nerds</a> that I admire and follow on Twitter were sharing it. Hell, even Philadelphia's Chief Data Officer <a href="https://twitter.com/mheadd/status/308576308810637312">shared it</a>. It was a ton of fun to watch and read both (fairly) positive and constructive comments about it on /r/python. It immediately made me want to write the post you're currently reading, which I started working on two days later.</p>
<h4>A week later</h4>
<p>Sitting around the following Sunday, making minor CSS tweaks to this site and finishing up the previously mentioned post, I decided to check my Google Analytics to see what the final traffic from /r/python and Twitter looked like. Surprisingly, the real-time section of Analytics showed 250+ on the site. What? How?!</p>
<p>That's when I realized it wound up at the top of <a href="https://news.ycombinator.com/item?id=5353347">Hacker News</a>. And then <a href="http://www.reddit.com/r/programming/comments/1a20lf/web_scraping_101_with_python/">/r/programming</a>. Traffic went through the roof.</p>
<p><img alt="Hacker News, /r/programming, and lots of Twitter sharing" src="/images/more-traffic-20130313.png"></p>
<p>And again, the comments were positive and constructive.</p>
<h4>Lesson learned</h4>
<p>And this leads me to why you should write about things you're passionate about online. When you're truly passionate about something, you spend a lot of time thinking and learning about it - you try to make it a part of your life. You try to become a reputable source on the topic (or even an expert). It can be something as broad as beer, personal finance, or film; or as niche as stand-up comedy, vegan baking, or <a href="http://en.wikipedia.org/wiki/Emo#Underground_popularity:_mid-1990s">90s midwest emo bands</a> (guilty). It doesn't matter what it is as long as <em>you</em> love it.</p>
<p>Like me, you're likely ecstatic when you find someone you're able to get nerdy with about something you love. You truly enjoy the topic and are always looking for ways to <em>learn more</em> and <em>teach others</em> about it (or just banter).</p>
<p>That's why you should share your knowledge about whatever the field may be. <em>It doesn't matter whether 10 or 10,000 people see what you've shared</em>. There are people who are interested but might not know where to start. And that's the best way to reinforce how well we know something - by teaching it to others. You'll be prompted with questions you hadn't thought about before, which will only further your own curiosity. You're forced to explain concepts in simple terms that anyone can understand - you become a better teacher and communicator. Sometimes, someone else crazily passionate about the same topic will even come along and teach you a thing or two.</p>
<p>We all have a thirst for knowledge in some form. The internet's a magnificent place to test our existing knowledge by teaching others and learning more throughout the process thanks to feedback from those with differing experiences.</p>
<p>Put your passions out there. More often than not, you'll be amazed at what you get back.</p>Web Scraping 101 with Python2013-03-03T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-03-03:/2013/03/03/web-scraping-101-with-python/<p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python …</a></li></ol><p><em>This is part of a series of posts I have written about web scraping with Python.</em></p>
<ol>
<li><a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/">Web Scraping 101 with Python</a>, which covers the basics of using Python for web scraping.</li>
<li><a href="http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/">Web Scraping 201: Finding the API</a>, which covers when sites load data client-side with Javascript.</li>
<li><a href="http://www.gregreda.com/2016/10/16/asynchronous-scraping-with-python/">Asynchronous Scraping with Python</a>, showing how to use multithreading to speed things up.</li>
<li><a href="http://www.gregreda.com/2020/11/17/scraping-pages-behind-login-forms/">Scraping Pages Behind Login Forms</a>, which shows how to log into sites using Python.</li>
</ol>
<hr>
<p>Yea, yea, I know I said I was going to <a href="http://www.gregreda.com/2013/01/23/translating-sql-to-pandas-part1/">write more</a> on <a href="http://pandas.pydata.org">pandas</a>, but recently I've had a couple friends ask me if I could teach them how to scrape data. While they said they were able to find a ton of resources online, all assumed some level of knowledge already. Here's my attempt at assuming a very minimal knowledge of programming.</p>
<h4>Getting Setup</h4>
<p>We're going to be using Python 2.7, <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, and <a href="http://lxml.de/">lxml</a>. If you don't already have Python 2.7, you'll want to download the proper version for your OS <a href="http://python.org/download/releases/2.7.3/">here</a>.</p>
<p>To check if you have Python 2.7 on OSX, open up <a href="http://en.wikipedia.org/wiki/Terminal_(OS_X)">Terminal</a> and type <em>python --version</em>. You should see something like this:</p>
<p><img alt="What Terminal should look like" src="/images/python-version.png"></p>
<p>Next, you'll need to install <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>. If you're on OSX, you'll already have <a href="http://pypi.python.org/pypi/setuptools">setuptools</a> installed. Let's use it to install <a href="http://www.pip-installer.org/en/latest/">pip</a> and use that for package management instead.</p>
<p>In Terminal, run <em>sudo easy_install pip</em>. You'll be prompted for your password - type it in and let it run. Once that's done, again in Terminal, <em>sudo pip install BeautifulSoup4</em>. Finally, you'll need to <a href="http://lxml.de/installation.html">install lxml</a>.</p>
<h4>A few scraping rules</h4>
<p>Now that we have the packages we need, we can start scraping. But first, a couple of rules.</p>
<ol>
<li>You should check a site's terms and conditions before you scrape it. It's the site's data, and it likely has rules governing how you can use it.</li>
<li>Be nice - A computer will send web requests much quicker than a user can. Make sure you space out your requests a bit so that you don't hammer the site's server.</li>
<li>Scrapers break - Sites change their layout all the time. If that happens, be prepared to rewrite your code.</li>
<li>Web pages are inconsistent - There's sometimes some manual clean up that has to happen even after you've gotten your data.</li>
</ol>
<h4>Finding your data</h4>
<p>For this example, we're going to use the <a href="http://www.chicagoreader.com/chicago/best-of-chicago-2011/BestOf?oid=4100483">Chicago Reader's Best of 2011</a> list. Why? Because I think it's a great example of terrible data presentation on the web. Go ahead and browse it for a bit.</p>
<p>All you want to see is a list of the category, winner, and maybe the runners-up, right? But you have to continuously click link upon link, slowly navigating your way through the list.</p>
<p>Hopefully in your clicking you noticed the important thing though - all the pages are structured the same.</p>
<h4>Planning your code</h4>
<p>In looking at the <a href="http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228">Food and Drink</a> section of the Best of 2011 list, we see that each category is a link. Each of those links leads to a page with the winner, maybe some information about the winner (like an address), and the runners-up. It's probably a good idea to break these things into separate functions in our code.</p>
<p>To start, we need to take a look at the HTML that displays these categories. If you're in Chrome or Firefox, highlight "Readers' Poll Winners", right-click, and select Inspect Element.</p>
<p><img alt="Inspect Element" src="/images/inspect-element.png"></p>
<p>This opens up the browser's Developer Tools (in Firefox, you might now have to click the HTML button on the right side of the developer pane to fully show it). Now we'll be able to see the page layout. The browser has brought us directly to the piece of HTML that's used to display the "Readers' Poll Winners" <em><code><dt></code></em> element.</p>
<p><img alt="Inspect Element some more" src="/images/inspect-element-more.png"></p>
<p>This seems to be the area of code where there's going to be some consistency in how the category links are displayed. See that <em><code><dl class="boccat"></code></em> just above our "Readers' Poll Winners" line? If you mouse over that line in your browser's dev tools, you'll notice that it highlights the <strong>entire section</strong> of category links we want. And every category link is within a <em><code><dd></code></em> element. Perfect! Let's get all of them.</p>
<p><img alt="Inspect Element mouse over" src="/images/inspect-element-mouseover.png"></p>
<h4>Our first function - getting the category links</h4>
<p>Now that we know the <em><code><dl class="boccat"></code></em> section holds all the links we want, let's write some code to find that section, and then grab all of the links within the <em><code><dd></code></em> elements of that section.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">from</span> <span class="nn">urllib2</span> <span class="kn">import</span> <span class="n">urlopen</span>
<span class="n">BASE_URL</span> <span class="o">=</span> <span class="s2">"http://www.chicagoreader.com"</span>
<span class="k">def</span> <span class="nf">get_category_links</span><span class="p">(</span><span class="n">section_url</span><span class="p">):</span>
<span class="n">html</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">section_url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s2">"lxml"</span><span class="p">)</span>
<span class="n">boccat</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s2">"dl"</span><span class="p">,</span> <span class="s2">"boccat"</span><span class="p">)</span>
<span class="n">category_links</span> <span class="o">=</span> <span class="p">[</span><span class="n">BASE_URL</span> <span class="o">+</span> <span class="n">dd</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s2">"href"</span><span class="p">]</span> <span class="k">for</span> <span class="n">dd</span> <span class="ow">in</span> <span class="n">boccat</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s2">"dd"</span><span class="p">)]</span>
<span class="k">return</span> <span class="n">category_links</span>
</code></pre></div>
<p>Hopefully this code is relatively easy to follow, but if not, here's what we're doing:</p>
<ol>
<li>Loading the urlopen function from the urllib2 library into our local <a href="http://en.wikipedia.org/wiki/Namespace_(computer_science)">namespace</a>.</li>
<li>Loading the BeautifulSoup class from the bs4 (BeautifulSoup4) library into our local namespace.</li>
<li>Setting a variable named <em>BASE_URL</em> to "http://www.chicagoreader.com". We do this because the links used through the site are relative - meaning they do not include the base domain. In order to store our links properly, we need to concatenate the base domain with each relative link.</li>
<li>Define a function named <em>get_category_links</em>.<ol>
<li>The function requires a parameter of <em>section_url</em>. In this example, we're going to use the <a href="http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228">Food and Drink</a> section of the BOC list, however we could use a different section URL - for instance, the <a href="http://www.chicagoreader.com/chicago/best-of-chicago-2011-city-life/BestOf?oid=4106233">City Life</a> section's URL. We're able to create just one generic function because each section page is structured the same.</li>
<li>Open the section_url and read it into the <em>html</em> object.</li>
<li>Create an object called <em>soup</em> based on the BeautifulSoup class. The <em>soup</em> object is an <a href="http://en.wikipedia.org/wiki/Instance_(computer_science)">instance</a> of the BeautifulSoup class. It is initialized with the html object and parsed with <a href="http://lxml.de/">lxml</a>.</li>
<li>In our BeautifulSoup instance (which we called <em>soup</em>), find the <em><code><dl></code></em> element with a class of "boccat" and store that section in a variable called <em>boccat</em>.</li>
<li>This is a <a href="http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions">list comprehension</a>. For every <em><code><dd></code></em> element found within our <em>boccat</em> variable, we're getting the href of its <em><code><a></code></em> element (our category links) and concatenating on our <em>BASE_URL</em> to make it a complete link. All of these links are being stored in a list called <em>category_links</em>. You could also write this line with a <a href="http://docs.python.org/2/tutorial/controlflow.html#for-statements">for loop</a>, but I prefer a list comprehension here because of its simplicity.</li>
<li>Finally, our function returns the <em>category_links</em> list that we created on the previous line.</li>
</ol>
</li>
</ol>
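<p>If you'd like to sanity-check this extraction logic without a network connection or BeautifulSoup, here's a rough sketch of the same idea using only the standard library's html.parser module (the Python 3 spelling; in 2.7 it lived in the HTMLParser module). The snippet of markup below is made up to mimic the structure we saw in the dev tools, not copied from the Reader's site:</p>

```python
from html.parser import HTMLParser

BASE_URL = "http://www.chicagoreader.com"

class CategoryLinkParser(HTMLParser):
    """Collects hrefs from <a> tags inside <dd> elements of <dl class="boccat">."""
    def __init__(self):
        super().__init__()
        self.in_boccat = False
        self.in_dd = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "dl" and attrs.get("class") == "boccat":
            self.in_boccat = True
        elif self.in_boccat and tag == "dd":
            self.in_dd = True
        elif self.in_boccat and self.in_dd and tag == "a" and "href" in attrs:
            # Relative link, so prepend the base domain as before.
            self.links.append(BASE_URL + attrs["href"])

    def handle_endtag(self, tag):
        if tag == "dd":
            self.in_dd = False
        elif tag == "dl":
            self.in_boccat = False

# Made-up snippet mimicking the structure we found with Inspect Element.
snippet = """
<dl class="boccat">
  <dt>Readers' Poll Winners</dt>
  <dd><a href="/chicago/best-chef/BestOf?oid=4088191">Best Chef</a></dd>
  <dd><a href="/chicago/best-bang-for-your-buck/BestOf?oid=4088018">Best Bang for Your Buck</a></dd>
</dl>
"""
parser = CategoryLinkParser()
parser.feed(snippet)
```

<p>BeautifulSoup does the same traversal with far less bookkeeping, which is exactly why we're using it for the real scraper.</p>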
<h4>Our second function - getting the category, winner, and runners-up</h4>
<p>Now that we have our list of category links, we'd better start going through them to get our winners and runners-up. Let's figure out which elements contain the parts we care about.</p>
<p>If we look at the <a href="http://www.chicagoreader.com/chicago/best-chef/BestOf?oid=4088191">Best Chef</a> category, we can see that our category is in <em><code><h1 class="headline"></code></em>. Shortly after that, we find our winner and runners-up stored in <em><code><h2 class="boc1"></code></em> and <em><code><h2 class="boc2"></code></em>, respectively.</p>
<p><img alt="Finding our winners and runners-up" src="/images/winners-and-runners-up.png"></p>
<p>Let's write some code to get all of them.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_category_winner</span><span class="p">(</span><span class="n">category_url</span><span class="p">):</span>
<span class="n">html</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">category_url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s2">"lxml"</span><span class="p">)</span>
<span class="n">category</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s2">"h1"</span><span class="p">,</span> <span class="s2">"headline"</span><span class="p">)</span><span class="o">.</span><span class="n">string</span>
<span class="n">winner</span> <span class="o">=</span> <span class="p">[</span><span class="n">h2</span><span class="o">.</span><span class="n">string</span> <span class="k">for</span> <span class="n">h2</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s2">"h2"</span><span class="p">,</span> <span class="s2">"boc1"</span><span class="p">)]</span>
<span class="n">runners_up</span> <span class="o">=</span> <span class="p">[</span><span class="n">h2</span><span class="o">.</span><span class="n">string</span> <span class="k">for</span> <span class="n">h2</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s2">"h2"</span><span class="p">,</span> <span class="s2">"boc2"</span><span class="p">)]</span>
<span class="k">return</span> <span class="p">{</span><span class="s2">"category"</span><span class="p">:</span> <span class="n">category</span><span class="p">,</span>
<span class="s2">"category_url"</span><span class="p">:</span> <span class="n">category_url</span><span class="p">,</span>
<span class="s2">"winner"</span><span class="p">:</span> <span class="n">winner</span><span class="p">,</span>
<span class="s2">"runners_up"</span><span class="p">:</span> <span class="n">runners_up</span><span class="p">}</span>
</code></pre></div>
<p>It's very similar to our last function, but let's walk through it anyway.</p>
<ol>
<li>Define a function called <em>get_category_winner</em>. It requires a <em>category_url</em>.</li>
<li>Lines two and three are actually exactly the same as before - we'll come back to this in the next section.</li>
<li>Find the string within the <em><code><h1 class="headline"></code></em> element and store it in a variable named category.</li>
<li>Another list comprehension - store the string within every <em><code><h2 class="boc1"></code></em> element in a list called <em>winner</em>. But shouldn't there be only one winner? You'd think that, but some have multiple (e.g. <a href="http://www.chicagoreader.com/chicago/best-bang-for-your-buck/BestOf?oid=4088018">Best Bang for your Buck</a>).</li>
<li>Same as the previous line, but this time we're getting the runners-up.</li>
<li>Finally, return a <a href="http://docs.python.org/2/tutorial/datastructures.html#dictionaries">dictionary</a> with our data.</li>
</ol>
<h4>DRY - Don't Repeat Yourself</h4>
<p>As mentioned in the previous section, lines two and three of our second function mirror lines in our first function.</p>
<p>Imagine a scenario where we want to change the parser we're passing into our BeautifulSoup instance (in this case, lxml). With the way we've currently written our code, we'd have to make that change in two places. Now imagine you've written many more functions to scrape this data - maybe one to get addresses and another to get <a href="http://www.chicagoreader.com/chicago/best-new-food-truckfood/BestOf?oid=4101387">paragraphs of text about the winner</a> - you've likely repeated those same two lines of code in these functions and you now have to remember to make changes in four different places. That's not ideal.</p>
<p>A good principle in writing code is <a href="http://en.wikipedia.org/wiki/Don't_repeat_yourself">DRY - Don't Repeat Yourself</a>. When you notice that you've written the same lines of code a couple times throughout your script, it's probably a good idea to step back and think if there's a better way to structure that piece.</p>
<p>In our case, we're going to write another function to simply process a URL and return a BeautifulSoup instance. We can then call this function in our other functions instead of duplicating our logic.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">make_soup</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="n">html</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="k">return</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s2">"lxml"</span><span class="p">)</span>
</code></pre></div>
<p>We'll have to change our other functions a bit now, but it's pretty minor - we just need to replace our duplicated lines with the following:</p>
<div class="highlight"><pre><span></span><code><span class="n">soup</span> <span class="o">=</span> <span class="n">make_soup</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="c1"># where url is the url we're passing into the original function</span>
</code></pre></div>
<h4>Putting it all together</h4>
<p>Now that we have our main functions written, we can write a script to output the data however we'd like. Want to write to a CSV file? Check out Python's <a href="http://docs.python.org/2/library/csv.html#csv.DictWriter">DictWriter</a> class. Storing the data in a database? Check out the <a href="http://docs.python.org/2/library/sqlite3.html">sqlite3</a> or <a href="http://wiki.python.org/moin/DatabaseInterfaces">other various database libraries</a>. While both tasks are somewhat outside of my intentions for this post, if there's interest, let me know in the comments and I'd be happy to write more.</p>
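<p>As a quick sketch of the CSV route: assuming you've collected the dictionaries returned by get_category_winner into a list, DictWriter can write them out like this. The sample row below is illustrative (the winner name is invented), and the winner/runners-up lists are joined so each fits in a single CSV cell:</p>

```python
import csv
import io

# One illustrative row in the shape get_category_winner returns.
results = [{
    "category": "Best Chef",
    "category_url": "http://www.chicagoreader.com/chicago/best-chef/BestOf?oid=4088191",
    "winner": ["Sample Winner"],
    "runners_up": ["Runner-up A", "Runner-up B"],
}]

buffer = io.StringIO()  # swap in open("winners.csv", "w") to write a real file
writer = csv.DictWriter(buffer, fieldnames=["category", "category_url", "winner", "runners_up"])
writer.writeheader()
for row in results:
    # Flatten the lists so each value occupies a single cell.
    writer.writerow({**row,
                     "winner": "; ".join(row["winner"]),
                     "runners_up": "; ".join(row["runners_up"])})

output = buffer.getvalue()
```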
<p>Hopefully you found this post useful. I've put a final example script in <a href="http://bit.ly/13yd9ng">this gist</a>.</p>Translating SQL to Pandas, Part 12013-01-23T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-01-23:/2013/01/23/translating-sql-to-pandas-part1/<p><em>I wrote a three part pandas tutorial for SQL users that you can find <a href="http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/">here</a></em>.</p>
<p><em>UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk <a href="http://reda.io/sql2pandas">here</a>.</em></p>
<p>For some reason, I've always found SQL to be a much more intuitive tool for exploring a tabular dataset than I have other languages (namely R and Python).</p>
<p>If you know SQL well, you can do a whole lot with it, and since data is often in a relational database anyway, it usually makes sense to stick with it. I find that my workflow often includes writing a lot of queries in SQL (using <a href="http://www.sequelpro.com/">Sequel Pro</a>) to get the data the way I want it, reading it into R (with <a href="http://www.rstudio.com/">RStudio</a>), and then maybe a bit more exploration, modeling, and visualization (with <a href="http://ggplot2.org/">ggplot2</a>).</p>
<p>Not too long ago though, I came across <a href="http://blog.wesmckinney.com/">Wes McKinney</a>'s <a href="http://pandas.pydata.org">pandas</a> package and my interest was immediately piqued. Pandas adds a bunch of functionality to Python, but most importantly, it allows for a DataFrame data structure - much like a database table or R's data frame.</p>
<p>Given the great things I've been reading about pandas lately, I wanted to make a conscious effort to play around with it. Instead of my typical workflow being a couple disjointed steps with SQL + R + (sometimes) Python, my thought is that it might make sense to have pandas work its way in and take over the R work. While I probably won't be able to completely give up R (too much ggplot2 love over here), I get bored if I'm not learning something new, so pandas it is.</p>
<p>I intend to document the process a bit - hopefully a couple posts illustrating the differences between SQL and pandas (and maybe some R too).</p>
<p>Throughout the rest of this post, we're going to be working with data from the <a href="https://data.cityofchicago.org">City of Chicago's open data</a> - specifically the <a href="https://data.cityofchicago.org/Transportation/Towed-Vehicles/ygr5-vcbg">Towed Vehicles data</a>.</p>
<h4>Loading the data</h4>
<h5>Using SQLite</h5>
<p>To be able to use SQL with this dataset, we'd first have to create the table. Using <a href="http://www.sqlite.org/">SQLite</a> syntax, we'd run the following:</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="p">(</span><span class="w"></span>
<span class="w"> </span><span class="n">tow_date</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">make</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">plate</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="k">state</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">towed_address</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">phone</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">inventory</span><span class="w"> </span><span class="nb">text</span><span class="w"></span>
<span class="p">);</span><span class="w"></span>
</code></pre></div>
<p>Because SQLite <a href="http://www.sqlite.org/datatype3.html">uses a very generic type system</a>, we don't get the strict data types that we would in most other databases (such as MySQL and PostgreSQL); therefore, all of our data is going to be stored as text. In other databases, we'd store tow_date as a date or datetime field.</p>
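<p>You can see this directly from Python's built-in sqlite3 module: even a value that's conceptually a date comes back from a text column as plain text. A small sketch with a made-up value:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE towed (tow_date text)")
# Conceptually a date, but SQLite's text affinity stores it as a plain string.
conn.execute("INSERT INTO towed VALUES ('01/08/2013')")
stored_type = conn.execute("SELECT typeof(tow_date) FROM towed").fetchone()[0]
```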
<p>Before we read the data into SQLite, we need to tell the database that the fields are separated by a comma. Then we can use the import command to read the file into our table.</p>
<div class="highlight"><pre><span></span><code><span class="p">.</span><span class="n">separator</span><span class="w"> </span><span class="s1">','</span><span class="w"></span>
<span class="p">.</span><span class="n">import</span><span class="w"> </span><span class="p">.</span><span class="o">/</span><span class="n">Towed_Vehicles</span><span class="p">.</span><span class="n">csv</span><span class="w"> </span><span class="n">towed</span><span class="w"></span>
</code></pre></div>
<p>Note that the downloaded CSV contains two header rows, so we'll need to delete those rows from our table.</p>
<div class="highlight"><pre><span></span><code><span class="k">DELETE</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">tow_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Tow Date'</span><span class="p">;</span><span class="w"></span>
</code></pre></div>
<p>We should have 5,068 records in our table now (note: the City of Chicago regularly updates this dataset, so you might get a different number).</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="p">;</span><span class="w"> </span><span class="c1">-- 5068</span>
</code></pre></div>
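<p>The .separator and .import commands above are specific to the sqlite3 command-line shell. If you'd rather do the load from Python, the standard library's csv and sqlite3 modules can replicate it. In this sketch, the two data rows are fabricated placeholders (not real city records), and a single header row stands in for the file's two, purely to keep things self-contained:</p>

```python
import csv
import io
import sqlite3

# Stand-in for Towed_Vehicles.csv: one header row plus two made-up records.
raw = io.StringIO(
    "tow_date,make,style,model,color,plate,state,towed_address,phone,inventory\n"
    "01/08/2013,FORD,4D,,BLK,0000000,IL,123 Example St,(000) 000-0000,1\n"
    "01/08/2013,KIA,4D,,RED,0000000,TX,123 Example St,(000) 000-0000,2\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE towed (
    tow_date text, make text, style text, model text, color text,
    plate text, state text, towed_address text, phone text, inventory text
)""")

reader = csv.reader(raw)
next(reader)  # skip the header row up front instead of DELETEing it afterwards
conn.executemany("INSERT INTO towed VALUES (?,?,?,?,?,?,?,?,?,?)", reader)

row_count = conn.execute("SELECT COUNT(*) FROM towed").fetchone()[0]
```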
<h5>Using Python + pandas</h5>
<p>Let's do the same with <a href="http://pandas.pydata.org">pandas</a> now.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">col_names</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"tow_date"</span><span class="p">,</span> <span class="s2">"make"</span><span class="p">,</span> <span class="s2">"style"</span><span class="p">,</span> <span class="s2">"model"</span><span class="p">,</span> <span class="s2">"color"</span><span class="p">,</span> <span class="s2">"plate"</span><span class="p">,</span> <span class="s2">"state"</span><span class="p">,</span>
<span class="s2">"towed_address"</span><span class="p">,</span> <span class="s2">"phone"</span><span class="p">,</span> <span class="s2">"inventory"</span><span class="p">]</span>
<span class="n">towed</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"Towed_Vehicles.csv"</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">col_names</span><span class="p">,</span>
<span class="n">skiprows</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s2">"tow_date"</span><span class="p">])</span>
</code></pre></div>
<p>The read_csv function in pandas actually allowed us to skip the two header rows and translate the tow_date field to a datetime field.</p>
<p>Let's check our count just to make sure.</p>
<div class="highlight"><pre><span></span><code><span class="nb">len</span><span class="p">(</span><span class="n">towed</span><span class="p">)</span> <span class="c1"># 5068</span>
</code></pre></div>
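<p>To see skiprows and parse_dates at work without downloading the real file, here's a sketch against a tiny in-memory CSV (requires pandas; the rows are made up, and the doubled header line mimics the city's file):</p>

```python
import io
import pandas as pd

csv_text = (
    "Tow Date,Make\n"   # the downloaded file has two header rows...
    "Tow Date,Make\n"   # ...which skiprows=2 discards
    "01/08/2013,FORD\n"
    "01/10/2013,KIA\n"
)

sample = pd.read_csv(io.StringIO(csv_text), header=None, names=["tow_date", "make"],
                     skiprows=2, parse_dates=["tow_date"])
```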
<h4>Selecting data</h4>
<h5>SQL</h5>
<p>Selecting data with SQL is fairly intuitive - just SELECT the columns you want FROM the particular table you're interested in. You can also take advantage of the LIMIT clause to only see a subset of your data.</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- Return every column for every record in the towed table</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="p">;</span><span class="w"></span>
<span class="c1">-- Return the tow_date, make, style, model, and color for every record in the towed table</span>
<span class="k">SELECT</span><span class="w"> </span><span class="n">tow_date</span><span class="p">,</span><span class="w"> </span><span class="n">make</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="p">,</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="p">;</span><span class="w"></span>
<span class="c1">-- Return every column for the first five records of the towed table</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="p">;</span><span class="w"></span>
<span class="c1">-- Return every column in the towed table - start at the fifth record and show the next ten</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">LIMIT</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span><span class="w"> </span><span class="c1">-- records 5-14</span>
</code></pre></div>
<p>Additionally, you can throw a WHERE or ORDER BY (or both) into your queries for proper filtering and ordering of the data returned:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="k">state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'TX'</span><span class="p">;</span><span class="w"> </span><span class="c1">-- Only towed vehicles from Texas</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">make</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'KIA'</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="k">state</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'TX'</span><span class="p">;</span><span class="w"> </span><span class="c1">-- KIAs with Texas plates</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">towed</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">make</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'KIA'</span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">color</span><span class="p">;</span><span class="w"> </span><span class="c1">-- All KIAs ordered by color (A to Z)</span>
</code></pre></div>
<h5>Python + pandas</h5>
<p>Let's do some of the same, but this time let's use pandas:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># show only the make column for all records</span>
<span class="n">towed</span><span class="p">[</span><span class="s2">"make"</span><span class="p">]</span>
<span class="c1"># tow_date, make, style, model, and color for the first ten records</span>
<span class="n">towed</span><span class="p">[[</span><span class="s2">"tow_date"</span><span class="p">,</span> <span class="s2">"make"</span><span class="p">,</span> <span class="s2">"style"</span><span class="p">,</span> <span class="s2">"model"</span><span class="p">,</span> <span class="s2">"color"</span><span class="p">]][:</span><span class="mi">10</span><span class="p">]</span>
<span class="n">towed</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span> <span class="c1"># first five rows (alternatively, you could use towed.head())</span>
</code></pre></div>
<p>Because pandas is built on top of <a href="http://www.numpy.org/">NumPy</a>, we're able to use <a href="http://pandas.pydata.org/pandas-docs/dev/indexing.html#boolean-indexing">boolean indexing</a>. Since we're going to replicate similar statements to the ones we did in SQL, we know we're going to need towed cars from TX made by KIA.</p>
<div class="highlight"><pre><span></span><code><span class="n">towed</span><span class="p">[</span><span class="n">towed</span><span class="p">[</span><span class="s2">"state"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"TX"</span><span class="p">]</span> <span class="c1"># all columns and records where the car was from TX</span>
<span class="n">towed</span><span class="p">[(</span><span class="n">towed</span><span class="p">[</span><span class="s2">"state"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"TX"</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">towed</span><span class="p">[</span><span class="s2">"make"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"KIA"</span><span class="p">)]</span> <span class="c1"># made by KIA AND from TX</span>
<span class="n">towed</span><span class="p">[(</span><span class="n">towed</span><span class="p">[</span><span class="s2">"state"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"MA"</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">towed</span><span class="p">[</span><span class="s2">"make"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"JAGU"</span><span class="p">)]</span> <span class="c1"># made by Jaguar OR from MA</span>
<span class="n">towed</span><span class="p">[</span><span class="n">towed</span><span class="p">[</span><span class="s2">"make"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"KIA"</span><span class="p">]</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s2">"color"</span><span class="p">)</span> <span class="c1"># made by KIA, ordered by color (A to Z)</span>
</code></pre></div>
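<p>One caveat for anyone following along on a newer pandas: the DataFrame.sort method used above was later removed from the library, with sort_values as its replacement. Here's the same filtering and ordering against a toy frame (the rows are made up; requires pandas):</p>

```python
import pandas as pd

# Toy frame standing in for the towed data (made-up rows, not the real dataset).
df = pd.DataFrame({
    "make": ["KIA", "FORD", "KIA", "JAGU"],
    "state": ["TX", "TX", "MA", "IL"],
    "color": ["RED", "BLU", "GRN", "BLK"],
})

tx_kias = df[(df["state"] == "TX") & (df["make"] == "KIA")]  # KIAs with Texas plates
# In modern pandas, DataFrame.sort is gone; sort_values does the same job:
kias_by_color = df[df["make"] == "KIA"].sort_values("color")
```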
<h4>Conclusion, Part 1</h4>
<p>This was obviously a very basic start, but there are a lot of good things about pandas - it's certainly concise and readable. Plus, since it works well with the various science + math packages (<a href="http://www.scipy.org">SciPy</a>, <a href="http://www.numpy.org/">NumPy</a>, <a href="http://matplotlib.org/">Matplotlib</a>, <a href="http://statsmodels.sourceforge.net/">statsmodels</a>, etc.), there's the potential to work almost entirely in one language for analysis tasks.</p>
<p>I plan on covering aggregate functions, pivots, and maybe some matplotlib in my next post.</p>Hello World2013-01-22T00:00:00-08:002023-10-26T18:23:07-07:00Greg Redatag:www.gregreda.com,2013-01-22:/2013/01/22/hello-world/<p>So I finally got around to putting something of my own up.</p>
<p>My intentions are mainly to use this space as a way to document mini projects that I'm working on, so plan on it being pretty programming, data, visualization, and statistics heavy - I get bored if I'm not learning something new or doing something I find challenging. That said, I'm known to go on tangents about music and beer (and whatever else I feel like ranting about at the time).</p>
<p>There's also a high likelihood that I'll constantly be tweaking the layout of the site - hopefully the four people reading won't mind.</p>