Please note that this book is a work in progress, the equivalent of a pre-alpha release. All the code should work, because if it didn’t the site could not be built. But there is still a lot of work to do to explain the historical methods under discussion. Feel free to leave feedback as issues on the GitHub repository, or to e-mail me.

Mapping is difficult in large part because representing geospatial data is difficult. In particular, the dominant format for spatial data is the shapefile, a proprietary but mostly open format with some severe limitations. Geospatial data can be very large: in the hundreds of megabytes. Often it is unnecessarily large for the resolution of the map that you want to make, and must be simplified in order to be manageable. Finally, geospatial information is projected, meaning that it can represent a point's position on the earth by latitude and longitude but also by any number of other coordinate reference systems. These systems have to be translated into one another. In general, it is easiest to prepare geospatial data outside of R and then to load and import it. This chapter focuses primarily on what you can do with R, but the companion chapter on working with shapefiles demonstrates how to take a shapefile and then convert and simplify it into a usable format.

Another problem with geospatial information is that it is usually intended for contemporary use, which means that the information about boundaries and even about natural features like rivers and coastlines may be anachronistic. You will have to be cautious about finding geospatial information that fits your time and place.

There are some good resources for getting geospatial data. NHGIS provides aggregated census information for the United States, keyed by special ID codes to shapefiles that it also provides. We will use NHGIS data about the value of farmland in this chapter. Natural Earth provides all kinds of geographic information, such as rivers, coastlines, and mountains.

Nevertheless, mapping is the digital history technique from which it is easiest to make historical arguments. Not coincidentally, it is also the digital history technique which is easiest to explain to non-digital historians. Historians have been working with maps since at least the beginnings of the historical profession: digital mapping gives you the capability to map data and places yourself.

Installation and setup

Most R packages that work with geospatial data are wrappers around libraries written in lower-level languages. In order to use these packages, you will first have to install those libraries, which are external to R. The two main libraries are GDAL/OGR, which you will likely have to install anyway in order to get the command-line tools to prepare your geospatial data, and GEOS.1 (In most cases, the R wrapper for GEOS can install the external library itself.)

You can install these packages by following the instructions on their respective websites. If you are on Mac and use Homebrew, you can install them with the following commands.

brew update
brew install gdal
brew install geos

If you are on Ubuntu, you can install them with the following commands. These libraries will be found in almost any Linux package manager.

apt-get update
apt-get install gdal-bin libgdal-dev libproj-dev
apt-get install libgeos-dev

Once you have those libraries installed, you can install the necessary R packages: sp, rgdal, rgeos, maptools, ggmap, ggplot2, RColorBrewer, and classInt. You can install all of these with the following command.

install.packages(c("sp", "rgdal", "rgeos", "maptools",
                   "ggmap", "ggplot2", "RColorBrewer", "classInt"),
                 dependencies = TRUE)

On a Mac, R will try to install binary versions of rgdal and rgeos, which do not need the external dependencies. In general, install.packages() tries to get a binary package, meaning a package which contains everything necessary for your operating system. If R can do that for, say, rgdal, then you don't need to install the external library. But this doesn't always work, in which case R lets you build the package from source, and then you do need the external libraries. If this method of installation fails, and R complains that a binary package is not available, then try the following installation commands.

install.packages("rgdal", type = "source")
install.packages("rgeos", type = "source")

Loading and plotting a shapefile

The first step in making a map is to load our geospatial data. I have prepared a shapefile from the NHGIS with the state boundaries of the United States in 1850 by reprojecting and simplifying it. We can load this shapefile using the readOGR() function in the rgdal package. The readOGR() function takes two main arguments: the first is the path to the directory containing the shapefile, and the second is the name of the shapefile without its extension.

library(rgdal)
## Loading required package: sp
## rgdal: version: 0.9-2, (SVN revision 526)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.11.1, released 2014/09/24
## Path to GDAL shared files: /usr/local/Cellar/gdal/1.11.1_3/share/gdal
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)
map_sp <- readOGR("data/nhgis-shp/", "state_1850")
## OGR data source with driver: ESRI Shapefile 
## Source: "data/nhgis-shp/", layer: "state_1850"
## with 37 features
## It has 7 fields

We now have an object which we have saved as map_sp. We can investigate the class and structure of this object.

class(map_sp)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

The "sp" attribute tells us that this class is provided by the sp package; almost all spatial objects in R are built on the classes that sp provides. The object is a "SpatialPolygonsDataFrame". There are three basic types of spatial data: points, lines, and polygons. This file contains polygons.

str(map_sp, max.level = 2)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  37 obs. of  7 variables:
##   ..@ polygons   :List of 37
##   ..@ plotOrder  : int [1:37] 33 15 29 35 5 31 12 14 6 9 ...
##   ..@ bbox       : num [1:2, 1:2] -124.7 24.5 -67 49.4
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot

The structure of the object is even more revealing. The object has several "slots," which are the places where R stores information for objects. TODO: Explain R object systems. We can see that the object contains 37 polygons: one for each of the states and territories on the map. It also contains information such as the bbox, which is the maximum extent of the map in terms of latitude and longitude, and the proj4string, which contains the coordinate reference system. We can look at either of those slots with the @ operator.

map_sp@bbox
##          min       max
## x -124.73290 -66.95211
## y   24.54479  49.38436
map_sp@proj4string
## CRS arguments:
##  +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0

The latitude and longitude of the bounding box are what we would expect for the United States. The proj4string tells us that this object is not projected; it happens to use the EPSG 4326 coordinate reference system, which uses latitude and longitude.2 Notice that in both cases the longitude comes first because it is mapped to the x axis, and the latitude comes second because it is mapped to the y axis.
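
As a quick check, the sp package also provides is.projected(), which reports whether an object's coordinate reference system is projected. For longitude and latitude data like ours it returns FALSE.

is.projected(map_sp)
## [1] FALSE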

The other interesting slot is data, which contains a data frame. We can investigate this data frame:

head(map_sp@data)
##   NHGISST ICPSRST             STATENAM GISJOIN GISJOIN2   SHAPE_AREA
## 0     250       3        Massachusetts    G250      250  21030126218
## 1     240      52             Maryland    G240      240  25709868254
## 2     470      54            Tennessee    G470      470 109151322774
## 3     280      46          Mississippi    G280      280 123453258391
## 4     355      66 New Mexico Territory    G355      355 608422545080
## 5     510      40             Virginia    G510      510 166415975990
##   SHAPE_LEN
## 0   2716620
## 1   6413762
## 2   2111489
## 3   2908321
## 4   4290397
## 5   8467633

This shows us that the 37 rows of the data frame correspond to the states and territories. There is some important information in this data frame. First, we have several columns that will permit us to join this data frame to other data sets. NHGISST, ICPSRST, GISJOIN, and GISJOIN2 are all special ID fields to be used for this purpose; in a pinch, we could use the STATENAM column too. We are also given SHAPE_AREA, which will let us normalize data by the area of each polygon.
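
For example, here is a sketch of computing each state's area in square kilometers, on the assumption that SHAPE_AREA is measured in square meters, as it is in NHGIS shapefiles.

# Convert SHAPE_AREA from square meters to square kilometers
map_sp@data$area_sqkm <- map_sp@data$SHAPE_AREA / 1e6
head(map_sp@data[, c("STATENAM", "area_sqkm")])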

The data in this data frame does not come standard with every shapefile. In this case it has been added by the NHGIS. Other shapefiles will come with data already included: the Natural Earth shapefiles have many columns which are useful for plotting. Some shapefiles produced by historical organizations will contain the dates for which the boundaries are valid. The Newberry Library's Atlas of Historical County Boundaries is a particularly good example of using shapefiles for this purpose. Many shapefiles will come with no information whatsoever.

The sp package provides a method so that spatial objects can be plotted using the generic R function plot(). This is the fastest and easiest way to make sure that your shapefile contains something useful.

plot(map_sp)

The sp and rgdal packages contain many useful functions. Not least is the ability to reproject a shapefile. A projection transforms the three-dimensional space of a (nearly) spherical globe into two-dimensional space. There are many map projections which are useful for different purposes. We can see what we mean by using a shapefile of the earth's land from Natural Earth and plotting it.

earth <- readOGR("data/ne_50m_land/", "ne_50m_land") 
## OGR data source with driver: ESRI Shapefile 
## Source: "data/ne_50m_land/", layer: "ne_50m_land"
## with 1420 features
## It has 2 fields
plot(earth, col = "gray")
title("The world according to EPSG 4326")

Using the spTransform() function, we can set a new projection. We will use the Winkel tripel projection, since it is obviously different from (and superior to) the standard projection. (You can find the PROJ.4 definitions for many coordinate reference systems online.) First we transform the shapefile, then plot it.

winkel <- spTransform(earth, CRS("+proj=wintri"))
plot(winkel, col = "gray")
title("The world according to Oswald Winkel")

It is well worth learning the base R graphics in order to interact with plots in these ways. But just as we used ggplot2 in the plotting chapter, so we will use it here because it provides a grammar of graphics, and thus a way of thinking through our problems. Before we explore how to make maps in ggplot2, we will first have to get some data to plot on top of our shapefiles.

Geocoding

There are essentially three kinds of spatial information that can be plotted: points, lines, and polygons. We have already seen what polygons look like. Often it is sufficient to draw a point representing a city or an event. Given a place name, it is possible to look up the latitude and longitude for that place using Google's API.3

The ggmap package provides a geocode() function which calls Google’s API and returns a latitude and longitude.

library(ggmap)
## Loading required package: ggplot2
geocode("San Francisco, CA")
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=San+Francisco,+CA&sensor=false
##         lon      lat
## 1 -122.4194 37.77493

The function’s default arguments return only latitude and longitude, but it is capable of returning quite a lot of information, such as the precision of the guess as to the location.

geocode("San Francisco, CA", output = "more")
##         lon      lat     type     loctype                address  north
## 1 -122.4194 37.77493 locality approximate san francisco, ca, usa 37.812
##     south      east     west postal_code       country
## 1 37.7034 -122.3482 -122.527        <NA> united states
##   administrative_area_level_2 administrative_area_level_1      locality
## 1        san francisco county                  california san francisco
##   street streetNo point_of_interest             query
## 1   <NA>       NA              <NA> San Francisco, CA

Let’s take a small data frame of cities that we might want to plot on a map.

cities <- data.frame(name = c("Saint Louis, MO",
                              "San Francisco, CA",
                              "Boston, MA",
                              "Charleston, SC",
                              "Houston, TX"),
                     stringsAsFactors = FALSE)

cities
##                name
## 1   Saint Louis, MO
## 2 San Francisco, CA
## 3        Boston, MA
## 4    Charleston, SC
## 5       Houston, TX

We can then add two columns to our data frame using geocode(). The geocode() function returns a data frame, which makes it awkward to use with dplyr.

cities_geocoded <- geocode(cities$name)

cities_geocoded
##          lon      lat
## 1  -90.19940 38.62700
## 2 -122.41942 37.77493
## 3  -71.05888 42.36008
## 4  -79.93105 32.77647
## 5  -95.36980 29.76043

We will have to bind this new data frame to our cities data frame.

cities <- cbind(cities, cities_geocoded)
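
As an alternative to cbind(), recent versions of dplyr provide bind_cols(), which does the same thing.

# An equivalent alternative to the cbind() above; run one or the other
cities <- dplyr::bind_cols(cities, cities_geocoded)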

Now we have a data frame that we can plot using ggplot2, remembering to map the longitude to x and the latitude to y. We will also use the ggplot2 function coord_map(), which will set the coordinate system for our plot to something that makes sense for a map. (The default is the Mercator projection; we will change this later.) The geom_text() function will print the name of the city.

library(ggplot2)
ggplot(cities, aes(x = lon, y = lat)) +
  geom_point() +
  geom_text(aes(label = name), vjust = -1) +
  coord_map()

This looks about right, though the labels could be better placed. But this is hardly useful without some kind of boundaries beneath it.

Plotting polygons in ggplot2

To plot our earlier map_sp of U.S. state boundaries in 1850, it is necessary to convert the shapefile object to a data frame. The ggplot2 package contains a function fortify(), which will perform the conversion. We will need to pass it a region = argument with the name of the field that we might want to perform a merge with. With NHGIS data, the GISJOIN field is the best to use.

We can also perform the same conversion using the tidy() function from the broom package.
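
Here is a minimal sketch of that alternative, assuming a version of broom that includes tidiers for sp objects; it produces the same data frame as the fortify() call below.

library(broom)
# Equivalent to fortify(map_sp, region = "GISJOIN")
map_df_broom <- tidy(map_sp, region = "GISJOIN")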

map_df <- fortify(map_sp, region = "GISJOIN")

head(map_df)
##        long      lat order  hole piece  group   id
## 1 -85.00237 31.00068     1 FALSE     1 G010.1 G010
## 2 -85.14583 31.00070     2 FALSE     1 G010.1 G010
## 3 -85.24363 31.00088     3 FALSE     1 G010.1 G010
## 4 -85.36962 30.99872     4 FALSE     1 G010.1 G010
## 5 -85.49993 30.99690     5 FALSE     1 G010.1 G010
## 6 -85.56811 30.99624     6 FALSE     1 G010.1 G010

Now that we have a data frame, we can plot it. Because we are also going to plot the cities data frame, we will specify different data frames for different layers: instead of specifying the data = argument in the main ggplot() function, we will specify it in each layer.

map_1850 <- ggplot() + 
  geom_map(data = map_df,
           map = map_df,
           aes(x = long, y = lat, group = group, map_id = id),
           fill = "white",
           color = "black",
           size = 0.2) +
  coord_map() +
  theme_minimal()

map_1850

Now we have a map of the states in 1850. We can modify our code that plotted the cities above.

map_1850 +
  geom_point(data = cities, aes(x = lon, y = lat),
             color = "red", size = 3) +
  geom_text(data = cities, aes(x = lon, y = lat, label = name),
            vjust = -1)

We can use a more sophisticated dataset to plot points on the map. The historydata package contains a geocoded set of Paulist missions. We might normally display just the missions from 1850, but for learning purposes we will display all the missions at once.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(historydata)
data(paulist_missions)
glimpse(paulist_missions)
## Observations: 841
## Variables:
## $ mission_number (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ church         (chr) "St. Joseph's Church", "St. Michael's Church", ...
## $ city           (chr) "New York", "Loretto", "Hollidaysburg", "Johnst...
## $ state          (chr) "NY", "PA", "PA", "PA", "NY", "NY", "PA", "PA",...
## $ start_date     (chr) "4/6/1851", "4/27/1851", "5/18/1851", "5/31/185...
## $ end_date       (chr) "4/20/1851", "5/11/1851", "5/28/1851", "6/8/185...
## $ confessions    (int) 6000, 1700, 1000, 1000, 4000, 7000, 1000, 270, ...
## $ converts       (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0,...
## $ order          (chr) "Redemptorist", "Redemptorist", "Redemptorist",...
## $ lat            (dbl) 40.71435, 40.50313, 40.42729, 40.32674, 40.7143...
## $ long           (dbl) -74.00597, -78.63030, -78.38890, -78.92197, -74...

We can add these as points to our 1850 map. Instead of sizing all the points the same, we will size them by the number of confessions at each mission, and we will use hollow circles (shape = 1) to mitigate the problem of overplotting.

map_1850 +
  geom_point(data = paulist_missions,
             aes(x = long, y = lat, size = confessions),
             color = "red", shape = 1) +
  theme(legend.position = "bottom") +
  scale_size(range = c(2, 8)) + 
  ggtitle("Paulist missions as points")
## Warning in loop_apply(n, do.ply): Removed 6 rows containing missing values
## (geom_point).

When plotting many points, sometimes it makes more sense to plot the density of the points instead of the points themselves. We can do this in ggplot2 with either geom_hex() or geom_density2d().

map_1850 +
  geom_density2d(data = paulist_missions,
             aes(x = long, y = lat)) +
  theme(legend.position = "bottom") +
  ggtitle("Paulist missions with density plot")

Choropleths

A choropleth map shades each region with a color keyed to some value. For example, we might color regions by population, with sparsely populated regions represented by white and the most populous regions represented by green. We already know how to shade a region in ggplot2: we simply map the fill = argument to some variable in our dataset.

Before we can do this, however, we have to join our spatial data to the data we wish to map. (Joining is covered in more detail in the chapter on data manipulation.)

We’ll begin by loading in a shapefile of the US counties in 1850 and turning it into a data frame which we can use with ggplot2.

counties_1850_sp <- readOGR("data/nhgis-shp/", "US_county_1850")
## OGR data source with driver: ESRI Shapefile 
## Source: "data/nhgis-shp/", layer: "US_county_1850"
## with 1632 features
## It has 20 fields
counties_1850_df <- fortify(counties_1850_sp, region = "GISJOIN")
glimpse(counties_1850_df)
## Observations: 110325
## Variables:
## $ long  (dbl) -86.22141, -86.23277, -86.25719, -86.25266, -86.26951, -...
## $ lat   (dbl) 32.52566, 32.52220, 32.52181, 32.50748, 32.50344, 32.486...
## $ order (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ hole  (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ piece (fctr) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ group (fctr) G0100010.1, G0100010.1, G0100010.1, G0100010.1, G010001...
## $ id    (chr) "G0100010", "G0100010", "G0100010", "G0100010", "G010001...

Notice that this data frame, counties_1850_df, has an ID column which we can use for merging our data. Let’s load in NHGIS data about the “Average Value of Farmland and Buildings” in dollars by county in 1850.

farms_1850 <- read.csv("data/nhgis-farmland-value/nhgis0032_ds11_1850_county.csv",
                       stringsAsFactors = FALSE)
glimpse(farms_1850)
## Observations: 1517
## Variables:
## $ GISJOIN  (chr) "G0100010", "G0100030", "G0100050", "G0100070", "G010...
## $ YEAR     (int) 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850,...
## $ STATE    (chr) "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"...
## $ STATEA   (int) 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...
## $ COUNTY   (chr) "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "B...
## $ COUNTYA  (int) 10, 30, 50, 70, 90, 130, 170, 190, 230, 250, 310, 350...
## $ AREANAME (chr) "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "B...
## $ AE7001   (int) 5, 3, 5, 3, 3, 3, 6, 7, 9, 3, 4, 3, 4, 3, 3, 7, 6, 3,...

Notice that this data has the same kind of ID field, here called GISJOIN. We can perform a join between the two data frames using the id and GISJOIN columns from the respective data frames.

farms_merged <- counties_1850_df %>%
  left_join(farms_1850, by = c("id" = "GISJOIN"))
glimpse(farms_merged)
## Observations: 110325
## Variables:
## $ long     (dbl) -86.22141, -86.23277, -86.25719, -86.25266, -86.26951...
## $ lat      (dbl) 32.52566, 32.52220, 32.52181, 32.50748, 32.50344, 32....
## $ order    (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ hole     (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ piece    (fctr) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ group    (fctr) G0100010.1, G0100010.1, G0100010.1, G0100010.1, G010...
## $ id       (chr) "G0100010", "G0100010", "G0100010", "G0100010", "G010...
## $ YEAR     (int) 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850,...
## $ STATE    (chr) "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"...
## $ STATEA   (int) 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...
## $ COUNTY   (chr) "Autauga", "Autauga", "Autauga", "Autauga", "Autauga"...
## $ COUNTYA  (int) 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...
## $ AREANAME (chr) "Autauga", "Autauga", "Autauga", "Autauga", "Autauga"...
## $ AE7001   (int) 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...

Now we have all the information in one place. We can make a map of the counties in 1850, but this time we will set fill = AE7001 to make the choropleth.

ggplot(data = farms_merged,
       aes(x = long, y = lat, group = group, fill = AE7001, map_id = id)) +
  geom_map(map = farms_merged) + 
  ggtitle("Average value of farms and buildings in dollars/acre in 1850") + 
  coord_map() +
  theme_minimal() +
  theme(legend.position = "bottom") 

This is not a very good map. It has two related problems. First, the color scheme is difficult to read: we need to pick a better color palette. Second, the map plots the color on a continuous scale. That is, there is an infinite number of possible values between 0 and the maximum. But the values tend to clump at the low end of the scale, and there are a few very high values. We can use the summary() function and then a histogram to see the distribution of the AE7001 variable in our original dataset.

summary(farms_1850$AE7001)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    7.00   12.12   12.00 1847.00
ggplot(data = farms_1850, aes(x = AE7001)) + geom_histogram(binwidth = 10) +
  ggtitle("Distribution of average values of farmland in 1850")

We can see from both the summary statistics and the histogram that the vast majority of counties have farmland and buildings valued on average at $12 per acre or less. But one county has farmland averaging $1,847 per acre. (We might suspect that this value is an error, but looking at the data shows that the outlier is New York City, which makes sense.) That one value reduces the contrast in our data, so to speak, so that we cannot see distinctions among the rest of the counties.
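
Here is a quick way to inspect that outlier, using the dplyr functions we loaded earlier.

farms_1850 %>%
  arrange(desc(AE7001)) %>%
  select(STATE, COUNTY, AE7001) %>%
  head(3)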

We will do much better if we break up the values into bins. That is, we could say that counties with property values of $1 or $2 per acre will receive the same color, and so on. The best way to do this is with the classInt package. (Install it with install.packages("classInt").) This package provides several ways of assigning groupings to a variable. For example, we can try the pretty method in the package. We have to tell the classIntervals() function the variable for which we want to compute the breaks, the number of groupings that we want (keep this to 9 or fewer), and the method of computing the breaks.

library(classInt)
classIntervals(farms_1850$AE7001, 9, "pretty")
## style: pretty
##   one of 65,033,528,560 possible partitions of this variable into 10 classes
##     [0,200)   [200,400)   [400,600)   [600,800)  [800,1000) [1000,1200) 
##        1515           1           0           0           0           0 
## [1200,1400) [1400,1600) [1600,1800) [1800,2000] 
##           0           0           0           1

These breaks are indeed pretty to read, but not very helpful. There are 1515 counties between $0 and $200 average value, but only 1 between $200 and $400, and then 0 in all the remaining intervals until we get to New York City. We clearly need a better algorithm. The Jenks natural breaks algorithm tries to make the members of each group as alike as possible, while making each group as a whole as distinct from the other groups as possible.

intervals <- classIntervals(farms_1850$AE7001, 9, "jenks")
intervals$brks
##  [1]    1    5   10   18   29   43   60  120  208 1847

Those intervals are far more useful. (Notice that the intervals object is a list, and we can access the brks themselves with the $.) We see some real resolution or contrast in the data. We should create a new column in our original farms data frame with those values and remerge the data back to the county shapefile. The cut() function will use our intervals to “cut” up our existing data points into different groups.

head(cut(farms_1850$AE7001, breaks = intervals$brks))
## [1] (1,5] (1,5] (1,5] (1,5] (1,5] (1,5]
## 9 Levels: (1,5] (5,10] (10,18] (18,29] (29,43] (43,60] ... (208,1.85e+03]
farms_1850 <- farms_1850 %>%
  mutate(value_classified = cut(AE7001, intervals$brks))

farms_merged <- counties_1850_df %>%
  left_join(farms_1850, by = c("id" = "GISJOIN"))

glimpse(farms_merged)
## Observations: 110325
## Variables:
## $ long             (dbl) -86.22141, -86.23277, -86.25719, -86.25266, -...
## $ lat              (dbl) 32.52566, 32.52220, 32.52181, 32.50748, 32.50...
## $ order            (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $ hole             (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ piece            (fctr) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ group            (fctr) G0100010.1, G0100010.1, G0100010.1, G0100010...
## $ id               (chr) "G0100010", "G0100010", "G0100010", "G0100010...
## $ YEAR             (int) 1850, 1850, 1850, 1850, 1850, 1850, 1850, 185...
## $ STATE            (chr) "Alabama", "Alabama", "Alabama", "Alabama", "...
## $ STATEA           (int) 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...
## $ COUNTY           (chr) "Autauga", "Autauga", "Autauga", "Autauga", "...
## $ COUNTYA          (int) 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...
## $ AREANAME         (chr) "Autauga", "Autauga", "Autauga", "Autauga", "...
## $ AE7001           (int) 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ value_classified (fctr) (1,5], (1,5], (1,5], (1,5], (1,5], (1,5], (1...

Now we can redo our plot, filling by this new variable.

ggplot(data = farms_merged,
       aes(x = long, y = lat, group = group, map_id = id,
           fill = value_classified)) +
  geom_map(map = farms_merged) + 
  ggtitle("Average alue of farms and buildings in dollars/acre in 1850") + 
  coord_map() +
  theme_minimal() +
  theme(legend.position = "bottom") 

This map is far better: instead of an undifferentiated mass of color, we can see distinctions between the value of property across the country. It is important to note that these distinctions are not simply random, as if we had chosen bad groupings for our variables, but seem at first glance to create contiguous regions and to show variation which could begin to be an argument.

The biggest problem is that the colors do not help us understand the data. The default palette offers colors which are as distinguishable from one another as possible (that is, roughly equally distant from one another on a color wheel). This is what one might call, using Cynthia Brewer's terminology, a qualitative scale. If we were mapping political parties in the 1852 election, say, it would be desirable to distinguish between the Democrats, the Whigs, and the Free Soil party. But our scale is what Brewer calls a sequential scale: the values increase sequentially. This map would be better served by a set of colors that grow darker as the land grows more expensive, so that we could see a progression in the colors. Brewer also offers a diverging scale, which has a neutral color (like white) in the middle and two opposite colors at the beginning and end of the scale. The Color Brewer 2.0 website by Cynthia Brewer and Mark Harrower provides a very helpful set of color palettes and maps to pick the right palette for your map.

The Color Brewer 2.0 website

The same palettes can be accessed in R and used with ggplot2 using the RColorBrewer package. (Install this package with install.packages("RColorBrewer").) The package provides a display.brewer.all() function which lets us see all the available color palettes.

library(RColorBrewer)
display.brewer.all()
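
To preview a single palette, or to get its hexadecimal color codes for use elsewhere, use display.brewer.pal() and brewer.pal().

# Preview the sequential palette we will use below, with nine classes
display.brewer.pal(9, "YlGnBu")
# Return its hexadecimal color codes
brewer.pal(9, "YlGnBu")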

Now we can change the fill scale of our map to use the RColorBrewer palette of our choice, in this case YlGnBu.

ggplot(data = farms_merged,
       aes(x = long, y = lat, group = group, map_id = id,
           fill = value_classified)) +
  geom_map(map = farms_merged, color = "gray", size = 0.2) + 
  ggtitle("Average value of farms and buildings in dollars/acre in 1850") + 
  coord_map() +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = "YlGnBu")

Finally, we can improve this map by setting more sensible limits on the longitude, since we don't have data for the western United States.

ggplot(data = farms_merged,
       aes(x = long, y = lat, group = group, map_id = id,
           fill = value_classified)) +
  geom_map(map = farms_merged, color = "gray", size = 0.2) + 
  ggtitle("Average value of farms and buildings in dollars/acre in 1850") + 
  coord_map() +
  theme_minimal() +
  scale_fill_brewer(palette = "YlGnBu") +
  xlim(-98, -65)

Note that we cannot simply take this map as read, but must bring our historical knowledge to bear on it. This map seems to show that the northern states are wealthier than the southern in terms of real property. Or, we should say, the property was more expensive on average; we cannot say what the total wealth in each county was from this data alone. But what the map does not show is people held as real property, that is, slavery. Our map would have a considerably different meaning if we could add in the value of slaves held on southern farms.4

Dot maps

Specifying projections in ggplot
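
For now, a minimal sketch: coord_map() passes a projection name and its parameters to the mapproj package, so we can draw the 1850 state map in an Albers equal-area projection rather than the default Mercator.

# A sketch: the 1850 state map in an Albers equal-area projection.
# coord_map() hands the projection name and parameters to mapproj.
ggplot() +
  geom_map(data = map_df, map = map_df,
           aes(x = long, y = lat, group = group, map_id = id),
           fill = "white", color = "black", size = 0.2) +
  coord_map("albers", lat0 = 29.5, lat1 = 45.5) +
  theme_minimal()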

Named Entity Extraction

How to get place (and other kinds of names) out of a document.

Next Steps

Further Reading


  1. Make sure you get version 1.11.0 or later of GDAL/OGR, which added support for TopoJSON.

  2. The website http://spatialreference.org/ is helpful for understanding various coordinate reference systems.

  3. One must be cautious of anachronism, of course, since Google’s API only has current place names. In general, though, cities do not move.

  4. Ed Baptist, The Half Has Never Been Told, TODO.