I’ve recently been mapping the missions of the Paulist Fathers over the course of the nineteenth century. One problem with the data is that the Paulists were often in cities like New York, Philadelphia, Chicago, and Baltimore, so when the missions are mapped, many of the points sit on top of one another.

This is a common problem in mapping, which Leaflet.js solves admirably for web maps. See, for example, the DPLA’s map of items. Another solution is to make the points transparent, so that places where points overlap appear darker than places with a single point. While a judicious use of transparency can help in some places, it is generally a poor solution. It’s hard to explain to users what the layered colors mean, and the eye is poor at detecting the difference anyway.
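To put a rough number on that last point, here is a minimal sketch of how opacity builds up as points stack. It assumes standard "over" alpha compositing and an arbitrary opacity of 0.3 per point; neither the blending model nor the value comes from the Paulist maps.

```r
# Assumed opacity of a single point; 0.3 is an arbitrary illustrative value.
alpha <- 0.3

# Number of points stacked at the same coordinates.
n <- c(1, 2, 3, 5, 10, 20)

# Under "over" compositing, each extra layer covers a fraction `alpha`
# of whatever background is still showing through.
stacked_opacity <- 1 - (1 - alpha)^n

print(data.frame(n = n, opacity = round(stacked_opacity, 3)))
```

Under these assumptions ten stacked points are already about 97 percent opaque, so ten missions and twenty missions in the same city would look nearly identical.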

What I wanted to do was to sum together the overlapping points. So for example, instead of having a data file with a mission at St. Peter’s Church in New York in September 1851 with 4,000 confessions, and another mission at St. Patrick’s Cathedral in October 1851 with 7,000 confessions, I wanted a data file where those points are aggregated as New York missions in 1851 with 11,000 total confessions and 2 missions.

This task is a simple job for Hadley Wickham’s plyr package in R.

    library(plyr)

    raw <- read.csv("data/paulist-chronicles/paulist-missions.geocoded.csv")

    aggregated <- ddply(raw, .(city, state, year), summarize,
                        long = max(geo.lon),
                        lat = max(geo.lat),
                        confessions = sum(confessions_total, na.rm = TRUE),
                        converts = sum(converts_total, na.rm = TRUE),
                        number_missions = length(location))

    write.csv(aggregated, "data/paulist-chronicles/paulist-missions.aggregated.csv",
              row.names = FALSE)

First we load the package and read in the raw data. One function, `ddply`, will accomplish our work for us. The first argument passes our data to the function. The second argument, `.(city, state, year)`, splits up our original data by finding all the unique combinations of the `city`, `state`, and `year` variables. In other words, `ddply` makes a new data frame for each combination, such as New York missions in 1851, New York missions in 1852, Chicago missions in 1851, Chicago missions in 1852, and so on. (We could leave out the `year` variable if we wanted to aggregate the missions just by place, not the combination of time and place.) Then `ddply` applies the `summarize` function to each of those split-up data frames and combines the results. The remaining arguments define the columns of our new aggregated data frame. For `long` and `lat`, we take the maximum value within each group, but every mission in, say, New York should have the same latitude and longitude, so it doesn't matter which one we keep. For `confessions` and `converts` we sum the totals, skipping missing values with `na.rm = TRUE`. Then for `number_missions` we take the `length` of the `location` column in each split-up data frame (in other words, the number of observations that share that combination of `city`, `state`, and `year`).
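To make the split, apply, and combine steps concrete, here is a small sketch on a toy data frame that uses the same column names as the script. Only the two 1851 New York confession counts come from the example earlier in this post. The coordinates, the convert counts, and the two rows labeled "Hypothetical" are made up for illustration.

```r
library(plyr)

# Toy stand-in for the geocoded missions file. Apart from the 4,000 and
# 7,000 confessions at the two New York churches, all values are invented.
raw <- data.frame(
  location          = c("St. Peter's Church", "St. Patrick's Cathedral",
                        "Hypothetical Church A", "Hypothetical Church B"),
  city              = c("New York", "New York", "New York", "Chicago"),
  state             = c("NY", "NY", "NY", "IL"),
  year              = c(1851, 1851, 1852, 1851),
  geo.lon           = c(-74.0, -74.0, -74.0, -87.6),
  geo.lat           = c(40.7, 40.7, 40.7, 41.9),
  confessions_total = c(4000, 7000, 1500, NA),
  converts_total    = c(3, NA, 2, 1),
  stringsAsFactors  = FALSE
)

# One output row per unique combination of city, state, and year.
by_place_year <- ddply(raw, .(city, state, year), summarize,
                       long = max(geo.lon),
                       lat = max(geo.lat),
                       confessions = sum(confessions_total, na.rm = TRUE),
                       converts = sum(converts_total, na.rm = TRUE),
                       number_missions = length(location))
print(by_place_year)

# Dropping `year` from the grouping aggregates by place alone.
by_place <- ddply(raw, .(city, state), summarize,
                  long = max(geo.lon),
                  lat = max(geo.lat),
                  confessions = sum(confessions_total, na.rm = TRUE),
                  number_missions = length(location))
print(by_place)
```

The two 1851 New York rows collapse into one row with 11,000 confessions and 2 missions. One thing to watch: because of `na.rm = TRUE`, a group whose counts are all missing (the Chicago row here) is reported as 0 confessions rather than `NA`, which may or may not be what you want on a map.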

You can see what this does by comparing the original data with the aggregated data that the script produces from it. I won’t belabor the explanation, since Wickham explains it better than I can in his article “The Split-Apply-Combine Strategy for Data Analysis.” [1] What I do want to point out is that this is a useful technique for making maps with overlapping points. And even if you are making the maps outside of R, perhaps in GIS software or in D3, R and plyr can still be powerful tools to get your data into the proper format.

[1] Hadley Wickham, "The Split-Apply-Combine Strategy for Data Analysis," Journal of Statistical Software 40, no. 1 (April 2011): 1–29, http://www.jstatsoft.org/v40/i01.