Please note that this book is a work in progress, the equivalent of a pre-alpha release. All the code should work, because if it didn’t the site could not be built. But there is still a lot of work to do to explain the historical methods under discussion. Feel free to leave feedback as issues on the GitHub repository, or to e-mail me.

There are several formats in which geographic data (especially data for boundaries) might come, but the most common format in the shapefile. The format of shapefiles is controlled by Esri, a corporation that makes ArcGIS. The format, however, is widely used. Shapefiles can contain coordinates in latitude and longitude, but they can also contain projected coordinates in some other coordinate reference system which translates the coordinates on a three-dimensional globe to coordinates on a two-dimensional representation. For example, the coordinates of New Haven, Connecticut, in latitude and longitude are 41.3100° N, 72.9236° W. In the popular coordinate reference system used by Google Maps and most other web mapping services, called Google Mercator, those coordinates would be represented as -8117860.8230976425 5058258.564581587 (notice that longitude comes first, then latitude, and that the unit of measurement is not degrees but meters). While R can handle converting between coordinate reference systems using the rgdal package, it is often easier to handle the conversion outside of R. Furthermore, shapefiles can often be enormous—in the hundreds of megabytes—because they contain much more detail about borders or coastlines than will appear in a an actual map. Simplifying these shapefiles is often a task better done outside of R, though many of the tasks can be done inside R as well. This chapter will explain how to reproject and simplify shapefiles gathered from the NHGIS website for use with ggplot2. This pattern should be generalizable for most shapefiles.

Installing external programs

We will use two external programs to work with shapefiles. The first is ogr* family of programs in GDAL/OGR. The second is mapshaper, which depends on Node.js.

Installing on Mac

To install these programs on a Mac, run the following commands (assuming you use Homebrew).

brew update
brew install gdal
brew install node

You can then install mapshaper globally by running the following commands.

npm update
npm install -g mapshaper

Installing on Ubuntu

To install these programs on Ubuntu, run the following commands.

apt-get update
apt-get install gdal-bin libgdal-dev libproj-dev
apt-get install nodejs

Then run the following command to install mapshaper.

npm update
npm install -g mapshaper

Getting information about a shapefile

You can download a shapefile from the NHGIS. For this example, we will use the shapefile of U.S. counties in 1850.

When you unzip the directory which contains the shapefile, you will notice that it actually contains several files. The file that ends .shp is the shapefile itself, but the other files are important too. The file that ends .prj contains the projection information for the shapefile, and the file that ends .dbf contains the attributes associated with the polygons or points or lines in the shapefile. (R will let you read in this DBF file on its own if you just want that data.)

A shapefile

We can see what kind of information is available in the shapefile using the command ogrinfowith some flags that tell the command to give us all the information in the file. Assuming you are in your terminal in the same directory as the shapefile, you can run the following command.

ogrinfo -so -al US_county_1850.shp

This command will return a lot of information:

INFO: Open of `US_county_1850.shp'
      using driver `ESRI Shapefile' successful.

Layer name: US_county_1850
Geometry: Polygon
Feature Count: 1632
Extent: (-2356113.743199, -1337508.077280) - (2258224.796357, 1565781.659379)
Layer SRS WKT:
PROJCS["USA_Contiguous_Albers_Equal_Area_Conic",
    GEOGCS["GCS_North_American_1983",
        DATUM["North_American_Datum_1983",
            SPHEROID["GRS_1980",6378137.0,298.257222101]],
        PRIMEM["Greenwich",0.0],
        UNIT["Degree",0.0174532925199433]],
    PROJECTION["Albers_Conic_Equal_Area"],
    PARAMETER["False_Easting",0.0],
    PARAMETER["False_Northing",0.0],
    PARAMETER["longitude_of_center",-96.0],
    PARAMETER["Standard_Parallel_1",29.5],
    PARAMETER["Standard_Parallel_2",45.5],
    PARAMETER["latitude_of_center",37.5],
    UNIT["Meter",1.0]]
DECADE: String (4.0)
NHGISNAM: String (50.0)
NHGISST: String (3.0)
NHGISCTY: String (4.0)
ICPSRST: String (3.0)
ICPSRCTY: String (4.0)
ICPSRNAM: String (35.0)
STATENAM: String (25.0)
ICPSRSTI: Integer (10.0)
ICPSRCTYI: Integer (10.0)
ICPSRFIP: Real (17.5)
STATE: String (3.0)
COUNTY: String (4.0)
PID: Real (19.8)
X_CENTROID: Real (19.8)
Y_CENTROID: Real (19.8)
GISJOIN: String (8.0)
GISJOIN2: String (7.0)
SHAPE_AREA: Real (19.11)
SHAPE_LEN: Real (19.11)

This output tells us that the file contains polygons instead of points or lines (Geometry: Polygon) and that there are 1,632 polygons in the file (Feature Count: 1632). We can also see that the file is projected in using an Albers equal area projection. That is a fine choice for a projection, but it will make working with the shapefile in R more difficult. The output also tells us the variables associated with each polygon, such as STATE and COUNTY. In this file the information is mostly identification of the polygons, but other shapefiles will contain actual data. Note that we could obtain the same information in R by using the rgdal wrapper around the ogrinfo command. The ogrInfo() function takes two arguments: the path to the directory with the shapefile, and the name of the shapefile without the .shp extension.

library(rgdal)
## Loading required package: sp
## rgdal: version: 0.9-2, (SVN revision 526)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.11.1, released 2014/09/24
## Path to GDAL shared files: /usr/local/Cellar/gdal/1.11.1_3/share/gdal
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)
ogrInfo("data/county-shapefile/", "US_county_1850")
## Source: "data/county-shapefile/", layer: "US_county_1850"
## Driver: ESRI Shapefile number of rows 1632 
## Feature type: wkbPolygon with 2 dimensions
## Extent: (-2356114 -1337508) - (2258225 1565782)
## CRS: +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs  
## LDID: 87 
## Number of fields: 20 
##          name type length typeName
## 1      DECADE    4      4   String
## 2    NHGISNAM    4     50   String
## 3     NHGISST    4      3   String
## 4    NHGISCTY    4      4   String
## 5     ICPSRST    4      3   String
## 6    ICPSRCTY    4      4   String
## 7    ICPSRNAM    4     35   String
## 8    STATENAM    4     25   String
## 9    ICPSRSTI    0     10  Integer
## 10  ICPSRCTYI    0     10  Integer
## 11   ICPSRFIP    2     17     Real
## 12      STATE    4      3   String
## 13     COUNTY    4      4   String
## 14        PID    2     19     Real
## 15 X_CENTROID    2     19     Real
## 16 Y_CENTROID    2     19     Real
## 17    GISJOIN    4      8   String
## 18   GISJOIN2    4      7   String
## 19 SHAPE_AREA    2     19     Real
## 20  SHAPE_LEN    2     19     Real

Reprojecting a shapefile

Next we want to reproject that shapefile into something more useable in R. There are many coordinate reference systems (CRS) or spatial reference systems (SRS), which are formal ways of describing the mathematics of a projection. These CRSes are usually defined either by a Well-Known Text (WKT) or by a Proj4 string. Notice that the output of the ogrinfo command line tool gave us the WKT representation:

Layer SRS WKT:
PROJCS["USA_Contiguous_Albers_Equal_Area_Conic",
    GEOGCS["GCS_North_American_1983",
        DATUM["North_American_Datum_1983",
            SPHEROID["GRS_1980",6378137.0,298.257222101]],
        PRIMEM["Greenwich",0.0],
        UNIT["Degree",0.0174532925199433]],
    PROJECTION["Albers_Conic_Equal_Area"],
    PARAMETER["False_Easting",0.0],
    PARAMETER["False_Northing",0.0],
    PARAMETER["longitude_of_center",-96.0],
    PARAMETER["Standard_Parallel_1",29.5],
    PARAMETER["Standard_Parallel_2",45.5],
    PARAMETER["latitude_of_center",37.5],
    UNIT["Meter",1.0]]

But the ogrInfo() R function gave us the Proj4 string:

+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs

These are essentially interchangable, though sometimes functions or programs will expect one rather than the other. There are formal systems provided by ESRI, EPSG, and other organizations which provide identifying numbers to these projections. The best guide to these coordinate reference systems is Spatial Reference, which will list the common names, authority names and numbers, well-known texts, Proj4 strings, and other modes of representing projections. Looking up the “USA Contiguous Albers Equal Area Conic” projection, we can find various information about the projection.1

Spatial reference

We want to convert the shapefile to EPSG 4326, a system which is useful because it is unprojected. That is, is will represents data points in latitude and longitude. We can do this at the command line using the program ogr2ogr. We will specify the SRS that we want to convert to (-t_srs) and the name of input and output files. Notice that the output file (here called US_county_1850_epsg4326.shp) comes before the name of the input file.

ogr2ogr -t_srs EPSG:4326 US_county_1850_epsg4326.shp US_county_1850.shp

We could also use rgdal to reproject the file inside R. First we would have to load the shapefile using the function readOGR(). We can find its Proj4 string with the function proj4string().

sp <- readOGR("data/county-shapefile/", "US_county_1850")
## OGR data source with driver: ESRI Shapefile 
## Source: "data/county-shapefile/", layer: "US_county_1850"
## with 1632 features
## It has 20 fields
proj4string(sp)
## [1] "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"

We can define a new projection using the CRS() function, which takes as its argument a Proj4 string. This new projection can be used as an argument in the spTransform() function, which does the reprojection. From Spatial Reference we can find that the Proj4 string we want is +proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs.

sp_epsg4326 <- spTransform(sp, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"))
proj4string(sp_epsg4326)
## [1] "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0"

If we wished, we could then save the reprojected shapefile to disk using writeOGR() function.

Simplifying shapefiles

The real problem with using these shapefiles in R is their size. The 1850 counties shapefile is about 55 MB. While that is not extremely large in terms of disk space, it is very large in terms of the complexity of the shapes that it represents. The shapefile is a vector format, meaning it is a mathematical description of the curves involved rather than a pixel by pixel grid as in a photograph. That means that the file is very space efficient. Consider the needlessly complex, fractal-like level of detail in the coastline of the Chesapeake Bay as contained in this shapefile. (The square in the lower left is the District of Columbia.)

The Chesapeake in an unsimplified shapefile

At almost any size map that we make, that level of detail will be invisible to the user. But it comes at a heavy cost in terms of processing. A plot of the shapefile using the base R function plot() will take several minutes on even a fast computer; it will probably never plot at all in ggplot2. It is possible to use various algorithms such as the Douglas-Peucker or Visvalingam algorithm (admirably explained by Mike Bostock) to simplify those lines and remove unnecessary polygons, such as small islands.

The MapShaper website will let you upload a shapefile (or GeoJSON or TopoJSON file) and perform such simplification interactively. Keeping only 2% of the information in the file, we can create a much simpler geometry without losing any information that will be visible the user. This will make working the shapefile in R much faster.

Mapshaper

For simplifying shapefiles one at a time, the mapshaper website is often sufficient. But whenever you have more than one shapefile to simplify or wish to make your work reproducible, it is better to use the mapshaper command line tool. (See the chapter on reproducible research.) We will use that command on our shapefile that we reprojected to EPSG:4326. The command takes an input file, a percentage by which to simplify the file, and an output file. It is often good to use the auto-snap option, which aligns the points to a grid to make sure that polygons continue to overlap with one another. The -force flag tells mapshaper to overwrite the output file if it already exists.2

mapshaper US_county_1850_epsg4326.shp auto-snap -simplify 2.5% -o force US_county_1850_simplified.shp

The resulting file is much smaller, about 1.8 megabytes instead of 52 megabytes. We can now plot it quite easily after loading in the new shapefile.

sp_simplified <- readOGR("data/county-shapefile", "US_county_1850_simplified")
## OGR data source with driver: ESRI Shapefile 
## Source: "data/county-shapefile", layer: "US_county_1850_simplified"
## with 1632 features
## It has 20 fields
plot(sp_simplified)

Especially when fortifying shapefiles to use them with ggplot2, as explained in the chapter on mapping, a simplified shapefile make complain about missing holes or rings. This is usually a sign that you have simplified the shapefile too much and lost some essential information. You should resimplify the original shapefile keeping more of the information. This usually requires some trial and error.

It is possible to use the R wrappers around GDAL/OGR to simplify shapefiles, but I do not recommend that you use them. Mapshaper is a far superior implementation.

A Makefile for reprojecting a simplifying shapefiles

The following Makefile will take a directory full of shapefiles, convert them to EPSG:4326 and simplify them to a user-specified percentage. This Makefile can be run in parallel to convert many shapefiles at the same time. (For an explanation of Make, see the chapter on reproducible research.) The Makefile assumes that within the project directory, the original shapefiles are stored in the shp/ directory, and that the final shapefiles will be created in the shp_out/ directory.

# A Makefile to re-project and simplify shapefiles.

SIMPLIFY_PERCENTAGE := 2.5
REPROJECTION_CODE   := EPSG:4326

SIMPLIFIED  := $(patsubst shp/%.shp, shp_out/%.shp, $(wildcard shp/*.shp))
REPROJECTED := $(patsubst shp/%.shp, temp/%.shp, $(wildcard shp/*.shp))

all : $(SIMPLIFIED)
.SECONDARY : $(REPROJECTED)

shp_out/%.shp : temp/%.shp
  @echo "Simplifying $*"
    mkdir -p shp_out
    mapshaper $^ auto-snap -simplify $(SIMPLIFY_PERCENTAGE)% -o force $@

temp/%.shp : shp/%.shp
    @echo "Reprojecting $*"
    mkdir -p temp
    ogr2ogr -t_srs $(REPROJECTION_CODE) $@ $^

clean :
    rm -f temp/*

clobber : clean
    rm -f shp_out/*

You can modify the options at the top of the file as necessary. Running the command make all will re-project and simplify all the shapefiles. You can run this task in parallel with the command make all -j 8, where the number is the number of parallel processes that you want. The command make clobber will remove all output files.


  1. The EPSG codes and their associated Proj4 strings are available as a data frame via the make_EPSG() function in rgdal.

  2. See the other options at the mapshaper wiki.