I’ve recently been mapping the missions of the Paulist Fathers over the course of the nineteenth century. One problem with the data is that many of the points overlap, since the Paulists were often in cities like New York, Philadelphia, Chicago, and Baltimore. Plotted on a map, those missions pile on top of one another.
This is a common problem in mapping, which Leaflet.js solves admirably for web maps. See, for example, the DPLA’s map of items. Another solution is to make the points transparent, so that places where points overlap appear darker. While a judicious use of transparency can help in some places, it is generally a poor solution. It’s hard to explain to users what the layered colors mean, and the eye is poor at detecting the difference anyway.
What I wanted to do was sum the overlapping points. For example, instead of a data file with a mission at St. Peter’s Church in New York in September 1851 with 4,000 confessions and another mission at St. Patrick’s Cathedral in October 1851 with 7,000 confessions, I wanted a data file where those points are aggregated as New York missions in 1851 with 11,000 total confessions and 2 missions.
This task is a simple job for Hadley Wickham’s plyr package in R.
```r
library(plyr)

raw <- read.csv("data/paulist-chronicles/paulist-missions.geocoded.csv")
aggregated <- ddply(raw, .(city, state, year), summarize,
                    long = max(geo.lon),
                    lat = max(geo.lat),
                    confessions = sum(confessions_total, na.rm = TRUE),
                    converts = sum(converts_total, na.rm = TRUE),
                    number_missions = length(location))
write.csv(aggregated,
          "data/paulist-chronicles/paulist-missions.aggregated.csv",
          row.names = FALSE)
```
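To make the grouping concrete, here is a minimal sketch of the same split, apply, and combine steps in base R, run on a toy data frame rather than the real file. It is not how plyr is implemented, only an illustration of the idea. The two New York rows come from the example above; the Baltimore row, its church name, and the `pieces` and `summaries` variables are hypothetical and exist only so that there is more than one group.

```r
# Toy data: the two New York missions from the example above, plus one
# hypothetical Baltimore row (invented for illustration) with a missing count.
raw <- data.frame(
  city = c("New York", "New York", "Baltimore"),
  state = c("NY", "NY", "MD"),
  year = c(1851, 1851, 1851),
  location = c("St. Peter's Church", "St. Patrick's Cathedral",
               "Hypothetical Church"),
  confessions_total = c(4000, 7000, NA),
  stringsAsFactors = FALSE
)

# Split: one data frame per city/state/year combination that actually occurs.
pieces <- split(raw, list(raw$city, raw$state, raw$year), drop = TRUE)

# Apply: collapse each piece to a single summary row.
summaries <- lapply(pieces, function(piece) {
  data.frame(
    city = piece$city[1],
    state = piece$state[1],
    year = piece$year[1],
    confessions = sum(piece$confessions_total, na.rm = TRUE),
    number_missions = nrow(piece),
    stringsAsFactors = FALSE
  )
})

# Combine: stack the summary rows back into one data frame.
aggregated <- do.call(rbind, summaries)
rownames(aggregated) <- NULL
print(aggregated)
```

The New York row should come out with 11,000 confessions and 2 missions, which is the aggregation described above. Because of `na.rm = TRUE`, the hypothetical Baltimore row sums to 0 rather than `NA`, so a group with no recorded confessions is indistinguishable from one with zero.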
You can see what this does by looking at the original data, which goes through this script to produce this aggregated data. I won’t belabor the explanation, since Wickham explains it better than I can in his article “The Split-Apply-Combine Strategy for Data Analysis.”[^1] What I do want to point out is that this is a useful technique for making maps with overlapping points. And even if you are making the maps outside of R, perhaps in GIS software or in D3, R and plyr can still be powerful tools for getting your data into the proper format.

[^1]: Hadley Wickham, “The Split-Apply-Combine Strategy for Data Analysis,” Journal of Statistical Software 40, no. 1 (April 2011): 1–29, http://www.jstatsoft.org/v40/i01.