Please note that this book is a work in progress, the equivalent of a pre-alpha release. All the code should work, because if it didn’t the site could not be built. But there is still a lot of work to do to explain the historical methods under discussion. Feel free to leave feedback as issues on the GitHub repository, or to e-mail me.

One of the most crucial tasks in digital history is manipulating data. It is no exaggeration to say that in many projects, manipulating and cleaning the data is 80 percent or more of the work, while visualizing or analyzing it is only 20 percent. It might seem that visualizing or analyzing is the real historical work, and it is certainly the part of the work that leads to historical interpretations. But data manipulation is the crucial task for the digital historian, because it is in manipulating the data that we explore its possibilities and its perils: that is, the possibilities that exist for finding interesting historical interpretations, but also the places where we could make errors by trusting the data too much.

You can learn several things by learning data manipulation. Most obviously, you will learn how to get data from whatever form you receive it into a form that is usable for any kind of task, as well as how to take large datasets and summarize them meaningfully. Less obviously, the methods for manipulating data are closely related to some of the basic principles of databases. For example, filtering data is like querying it; merging data is like joining it; tidying data is like normalizing a database. If you learn the principles behind data manipulation, you will be well prepared for understanding databases.

Definitions

We need to begin with a few definitions. What is data? For our purposes, let us say that data is any source amenable to computation. Most obviously this could be numeric information, such as in the reports of the U.S. Census Bureau. But it could also be a corpus of text files to be text mined, or bibliographic information retrieved from an API.

Data comes in different data structures. Each of these types of data is likely to have its own format. The Census Bureau information is likely to come as a table of numbers; the API information is likely to come as a list (an R data structure like an object in JavaScript, a dictionary in Python, or a hash in Ruby); and the text corpus is likely to be a set of plain text files. This chapter will briefly discuss how to work with lists, though we will leave most of those examples for the chapter on APIs. Working with text files will be discussed in the chapter on text analysis. This chapter will mostly discuss how to work with data that comes in tables: the form you are most likely to work with.

Tables in R are represented by a data structure called a data frame. A data frame is very much like a spreadsheet represented in R. It contains columns, whose names are usually the names of variables, and rows, which contain the observations. A data frame is the type of data structure that you get whenever you load a .csv file into R.

Take for example the data frame of US state populations from the historydata package.

library(stringi)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(historydata)
data(us_state_populations)
us_state_populations
## Source: local data frame [983 x 4]
## 
##    GISJOIN year          state population
## 1     G090 1790    Connecticut     237655
## 2     G100 1790       Delaware      59096
## 3     G130 1790        Georgia      82548
## 4     G240 1790       Maryland     319728
## 5     G250 1790  Massachusetts     475199
## 6     G330 1790  New Hampshire     141899
## 7     G340 1790     New Jersey     184139
## 8     G360 1790       New York     340241
## 9     G370 1790 North Carolina     395005
## 10    G420 1790   Pennsylvania     433611
## ..     ...  ...            ...        ...

This data frame has columns of varying types. The state column is a character vector containing state names; the population column is a numeric vector.

This can be seen when examining the structure of the data frame.

str(us_state_populations)
## Classes 'tbl_df', 'tbl' and 'data.frame':    983 obs. of  4 variables:
##  $ GISJOIN   : chr  "G090" "G100" "G130" "G240" ...
##  $ year      : int  1790 1790 1790 1790 1790 1790 1790 1790 1790 1790 ...
##  $ state     : chr  "Connecticut" "Delaware" "Georgia" "Maryland" ...
##  $ population: int  237655 59096 82548 319728 475199 141899 184139 340241 395005 433611 ...

We can see that there are 983 “observations” (rows) and 4 “variables” (columns) in the object which has the class data.frame. The $ gives us a hint that we can access the parts of the data frame with that operator. We can, for example, find the median population of all the states in U.S. history:

median(us_state_populations$population, na.rm = TRUE)
## [1] 1131356

We can think of data frames as coming in two forms. One is a wide data frame, in which there are many columns of variables. Take for instance this data set from the NHGIS which contains information about the birthplace of citizens of each state in the United States in 1850.

birthplace_1850 <- read.csv("data/nhgis-nativity/nhgis0023_ds12_1850_state.csv",
                            stringsAsFactors = FALSE)
str(birthplace_1850)
## 'data.frame':    36 obs. of  70 variables:
##  $ GISJOIN : chr  "G010" "G050" "G060" "G090" ...
##  $ YEAR    : int  1850 1850 1850 1850 1850 1850 1850 1850 1850 1850 ...
##  $ STATE   : chr  "Alabama" "Arkansas" "California" "Connecticut" ...
##  $ STATEA  : int  10 50 60 90 100 110 120 130 170 180 ...
##  $ AREANAME: chr  "Alabama" "Arkansas" "California" "Connecticut" ...
##  $ AFB001  : int  237542 11250 631 74 4 45 2340 3154 1335 395 ...
##  $ AFB002  : int  91 63206 350 0 0 4 5 25 727 151 ...
##  $ AFB003  : int  0 6 6602 0 0 0 0 0 3 0 ...
##  $ AFB004  : int  66 49 86 50 28 24967 33 72 226 227 ...
##  $ AFB005  : int  612 121 1317 292653 50 135 179 712 6899 2485 ...
##  $ AFB006  : int  73 51 305 58 72351 99 9 117 1397 2737 ...
##  $ AFB007  : int  1060 38 54 46 4 26 20563 1103 23 21 ...
##  $ AFB008  : int  58997 6367 876 217 14 67 11316 402666 1341 761 ...
##  $ AFB009  : int  114 3276 2722 80 5 24 8 41 343618 4173 ...
##  $ AFB010  : int  93 2128 2077 47 19 29 14 50 30953 541079 ...
##  $ AFB011  : int  7 106 341 18 0 1 0 1 1511 407 ...
##  $ AFB012  : int  2694 7428 4690 41 16 90 87 458 49588 68651 ...
##  $ AFB013  : int  628 1096 929 64 4 58 146 42 480 321 ...
##  $ AFB014  : int  215 80 2700 670 24 87 140 178 3693 976 ...
##  $ AFB015  : int  757 326 1164 265 4360 9245 194 703 6898 10177 ...
##  $ AFB016  : int  654 174 4760 11366 113 331 235 594 9230 2678 ...
##  $ AFB017  : int  3 17 284 89 12 28 7 3 2158 1817 ...
##  $ AFB018  : int  2852 4463 772 23 6 55 92 184 490 287 ...
##  $ AFB019  : int  158 5328 5890 28 8 28 7 60 7228 1006 ...
##  $ AFB020  : int  151 49 904 795 31 84 61 122 4288 886 ...
##  $ AFB021  : int  271 117 1022 1174 1186 163 83 331 6848 7837 ...
##  $ AFB022  : int  1443 537 10160 14416 218 817 614 1203 67180 24310 ...
##  $ AFB023  : int  28521 8772 1027 95 18 100 3537 37522 13851 33175 ...
##  $ AFB024  : int  276 1051 5500 400 54 123 53 46 64219 120193 ...
##  $ AFB025  : int  876 702 4506 1055 5067 1164 240 642 37979 44245 ...
##  $ AFB026  : int  74 36 861 6890 204 23 66 138 1051 438 ...
##  $ AFB027  : int  48663 4587 519 116 13 100 4470 52154 4162 4069 ...
##  $ AFB028  : int  22541 33807 3145 13 4 58 112 8211 32303 12734 ...
##  $ AFB029  : int  55 336 250 20 1 7 8 28 63 44 ...
##  $ AFB030  : int  155 82 1194 1508 12 43 55 186 11381 3183 ...
##  $ AFB031  : int  10387 4737 3407 228 139 4950 643 7331 24697 41819 ...
##  $ AFB032  : int  3 13 248 23 1 2 3 2 1095 99 ...
##  $ AFB033  : int  0 9 317 3 2 3 0 0 16 11 ...
##  $ AFB034  : int  941 196 3050 5091 952 682 300 679 18628 5550 ...
##  $ AFB035  : int  584 71 883 1916 155 142 182 367 4661 1341 ...
##  $ AFB036  : int  67 11 182 111 17 20 11 13 572 169 ...
##  $ AFB037  : int  3639 514 2452 26689 3513 2373 878 3202 27786 12787 ...
##  $ AFB038  : int  45 24 158 42 28 11 17 25 286 740 ...
##  $ AFB039  : int  1068 516 2926 1671 343 1404 307 947 38160 28584 ...
##  $ AFB040  : int  33 0 87 20 0 3 8 13 65 17 ...
##  $ AFB041  : int  1 2 63 19 5 4 8 11 220 43 ...
##  $ AFB042  : int  3 1 124 1 0 0 17 6 2415 18 ...
##  $ AFB043  : int  18 7 92 16 1 6 21 24 93 10 ...
##  $ AFB044  : int  51 1 162 13 2 5 33 11 1123 16 ...
##  $ AFB045  : int  4 2 12 2 1 14 4 41 33 86 ...
##  $ AFB046  : int  113 12 177 55 22 36 7 38 1635 724 ...
##  $ AFB047  : int  10 6 48 5 1 2 2 8 27 6 ...
##  $ AFB048  : int  503 77 1546 321 73 80 67 177 3396 2279 ...
##  $ AFB049  : int  163 3 220 12 1 20 70 13 70 3 ...
##  $ AFB050  : int  39 3 109 74 0 6 17 5 42 6 ...
##  $ AFB051  : int  90 15 228 16 0 74 40 33 43 6 ...
##  $ AFB052  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ AFB053  : int  7 0 9 1 0 0 0 1 4 0 ...
##  $ AFB054  : int  1 0 0 2 0 0 0 1 0 0 ...
##  $ AFB055  : int  0 0 660 5 0 1 0 0 1 0 ...
##  $ AFB056  : int  0 0 117 16 0 4 3 2 2 4 ...
##  $ AFB057  : int  3 0 319 45 0 0 0 0 9 0 ...
##  $ AFB058  : int  18 1 65 72 10 2 23 13 11 4 ...
##  $ AFB059  : int  49 41 834 970 21 32 97 108 10699 1878 ...
##  $ AFB060  : int  39 68 6454 4 3 9 6 8 30 31 ...
##  $ AFB061  : int  3 0 39 0 0 0 0 0 0 0 ...
##  $ AFB062  : int  2 0 877 35 3 5 3 8 12 4 ...
##  $ AFB063  : int  28 7 64 192 25 15 599 95 75 12 ...
##  $ AFB064  : int  116 50 400 57 35 17 37 58 495 108 ...
##  $ AFB065  : int  1109 824 629 794 63 77 58 517 3946 2598 ...

This dataset contains only 36 rows, one for each state or territory in the United States. But it contains 70 columns. Some of these contain the YEAR and STATE information. But most of these columns have cryptic names such as AFB001 and AFB057. Looking at an excerpt from the codebook that NHGIS provides with the dataset, we can see what these columns mean.

Table 2:     Place of Birth
Universe:    Persons
Source code: NT12
NHGIS code:  AFB
    AFB001:      Native-born: Alabama
    AFB002:      Native-born: Arkansas
    AFB003:      Native-born: California
...
    AFB034:      Foreign-born: England
    AFB035:      Foreign-born: Scotland
    AFB036:      Foreign-born: Wales
    AFB037:      Foreign-born: Ireland

The name of each column in the dataset is itself a piece of data. The column name represents where someone was born. Actually, it represents two pieces of information: where someone was born, and whether that birthplace was inside or outside the United States. This type of wide dataset is very difficult to work with for most purposes. For example, it would require a complex set of commands to find the proportion of native- and foreign-born people in each state. We have similar tables for other years, but there is no easy way to join the data frames together. It is also difficult to replace the codes with the names that they represent.

A much better solution is to use a long (or narrow) data format. Fortunately R has a command to reshape the data frame from wide to long. Consider the same data frame in long format. Don’t worry for now about the commands that do the transformation (but notice that we can do this transformation with a single function). Instead just look at the final product.

library(tidyr)
birthplace_1850_long <- birthplace_1850 %>%
  gather(birthplace, count, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME)

head(birthplace_1850_long)
##   GISJOIN YEAR                STATE STATEA             AREANAME birthplace
## 1    G010 1850              Alabama     10              Alabama     AFB001
## 2    G050 1850             Arkansas     50             Arkansas     AFB001
## 3    G060 1850           California     60           California     AFB001
## 4    G090 1850          Connecticut     90          Connecticut     AFB001
## 5    G100 1850             Delaware    100             Delaware     AFB001
## 6    G110 1850 District Of Columbia    110 District of Columbia     AFB001
##    count
## 1 237542
## 2  11250
## 3    631
## 4     74
## 5      4
## 6     45
str(birthplace_1850_long)
## 'data.frame':    2340 obs. of  7 variables:
##  $ GISJOIN   : chr  "G010" "G050" "G060" "G090" ...
##  $ YEAR      : int  1850 1850 1850 1850 1850 1850 1850 1850 1850 1850 ...
##  $ STATE     : chr  "Alabama" "Arkansas" "California" "Connecticut" ...
##  $ STATEA    : int  10 50 60 90 100 110 120 130 170 180 ...
##  $ AREANAME  : chr  "Alabama" "Arkansas" "California" "Connecticut" ...
##  $ birthplace: Factor w/ 65 levels "AFB001","AFB002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ count     : int  237542 11250 631 74 4 45 2340 3154 1335 395 ...

Notice that now we have 2340 rows but only 7 columns. All of the columns whose names were codes are now gathered into the birthplace and count columns. The birthplace column contains the codes (which, as you remember, actually encode information about the birthplace) and the count column contains the number of people who had that particular birthplace for each combination of state and year.

This long dataset is far more useful. We could much more easily replace the hard-to-read codes with the birthplaces that they represent. We could also do the same transformation on datasets for other years, and then append them to one another, creating a new time series.

This way of structuring data has been called tidy data by Hadley Wickham.1 We can use his definition of tidy data from the article in which he explains the concept:

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.2

We will explore this definition by applying it to our dataset later in this chapter. In this case we have applied principles 1 and 2. For us an observation takes the form, “n people from birthplace x lived in state y in year z.” Our wide version of the birthplace data was messy because the column names contained information (the birthplace), not just the names of the variables. By reshaping the data, we have moved all of the information in the table into a place where it can be easily manipulated. The tidy data format is not superior to the wide data format for every purpose: for presentation it is often much better to have data in a wide format, and sometimes it is easier to perform certain calculations in that format. The R package tidyr (again, written by Wickham) also provides a function to go from long to wide data. But for most data manipulation the goal is to achieve a tidy data format, from which further manipulation is far easier.
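That function is spread(), the inverse of gather(). As a minimal sketch, using a small made-up long data frame (the names and values here are invented for illustration):

```r
library(tidyr)

# A small made-up long data frame: membership counts by church and year
long_df <- data.frame(
  name = c("A", "A", "B", "B"),
  year = c(1830, 1840, 1830, 1840),
  members = c(490, 540, 600, 150),
  stringsAsFactors = FALSE)

# spread() is the inverse of gather(): the values of the key column (year)
# become new column names, and the value column (members) fills them in
wide_df <- spread(long_df, year, members)
wide_df
```

The result has one row per church and one column per year, the wide shape we started from in the birthplace example.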

There is one other definition that we need to offer. So far we have been talking about tidying data, by which we mean changing its structure into a more useful form. Another task altogether is cleaning data. Data often contains inconsistencies or outright errors. For example, in our data frame of birthplaces, it is entirely possible that the names of states could be inconsistent: "MA", "Massachusetts", and "Mass." and even "MA " and "Massachusetts " (note the extra spaces) might convey the same meaning to you, but they are decidedly not the same thing to your computer. It is common for dates to contain errors: I recently cleaned some data where an event started in 1880 and ended in 1800. Any human-entered data will be rife with such errors; any data that is OCR’ed is likely to be worse. The methods of data manipulation, including data tidying, are generalizable to most datasets. The problems of data cleaning are particular to each dataset. Only by working closely with a dataset will you be able to identify the problems and find an appropriate way to solve them. This chapter will suggest some basic strategies for data cleaning, but on the whole it will focus on data tidying and manipulation.
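To give a flavor of such strategies, here is a small sketch in base R that normalizes the inconsistent state names mentioned above (the vector of messy values is made up for the example):

```r
# A made-up vector with the kinds of inconsistencies described above
states <- c("MA", "Massachusetts", "Mass.", "MA ", "Massachusetts ")

# Trim stray leading and trailing whitespace ...
states <- gsub("^\\s+|\\s+$", "", states)

# ... then map known variants to a single canonical form
states[states %in% c("MA", "Mass.")] <- "Massachusetts"
unique(states)
## [1] "Massachusetts"
```

Real cleaning jobs involve many more variants than this, which is why the work is particular to each dataset.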

This chapter will begin by identifying the common verbs of a grammar of data manipulation provided by two packages, dplyr and tidyr, using a small sample dataset. Then it will take you through the process of combining those verbs to turn the NHGIS dataset on birthplaces into a usable dataset.

The grammar of data manipulation

When we manipulate data, it helps to think of what we are doing as a set of verbs. Since almost everything in R is a function, those verbs will be expressed as functions. This concept of data manipulation as a grammar, as well as the actual R functions which make it possible, are drawn from Hadley Wickham’s work.3

There are essentially eight verbs for data manipulation. These eight verbs are provided by two R packages, dplyr and tidyr. While these manipulations can also be performed in base R, they are far more easily explained, and even faster to compute, using these two packages.

We’ll begin our exploration of the grammar of data manipulation by creating a toy dataset small enough that we can perform all of the calculations manually to understand what is going on. We will use a made up example of church memberships.

set.seed(337)
sample_df <- data.frame(
  name = toupper(letters[1:10]),
  denomination = sample(c("Presbyterian", "Episcopalian",
                          "Catholic", "Baptist"), 10, replace = TRUE),
  city = sample(c("New York", "Boston", "Baltimore"), 10,  replace = TRUE),
  members = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
  attendees = sample(seq(1e2, 1e4, 10), 10, replace = TRUE),
  stringsAsFactors = FALSE)
sample_df
##    name denomination      city members attendees
## 1     A      Baptist  New York     350      7390
## 2     B      Baptist    Boston     720       900
## 3     C Presbyterian Baltimore     130      8730
## 4     D      Baptist Baltimore     280      4380
## 5     E Presbyterian  New York     800      6570
## 6     F Episcopalian Baltimore     110      6130
## 7     G      Baptist Baltimore     460      1490
## 8     H     Catholic    Boston     320      7510
## 9     I     Catholic  New York     490      4560
## 10    J     Catholic  New York     240      9900

Now that we have a data frame with sample data, we can use it to try out the data manipulation verbs. We also will load our two necessary libraries:

library(dplyr)
library(tidyr)

The pipe (%>%) function

First we have to learn about the pipe operator, %>%. The pipe operator is not one of the data manipulation verbs, but it will make our task of data manipulation far easier.4

The pipe operator is like the Unix pipe, |: it passes the results of one command to the input of the next. You can think of it as the word “then” for R: do this then do that. What the pipe operator does is take the output of one function and pass it as the first argument in the next function. Consider the following example.

head(sample_df)
##   name denomination      city members attendees
## 1    A      Baptist  New York     350      7390
## 2    B      Baptist    Boston     720       900
## 3    C Presbyterian Baltimore     130      8730
## 4    D      Baptist Baltimore     280      4380
## 5    E Presbyterian  New York     800      6570
## 6    F Episcopalian Baltimore     110      6130
sample_df %>% head()
##   name denomination      city members attendees
## 1    A      Baptist  New York     350      7390
## 2    B      Baptist    Boston     720       900
## 3    C Presbyterian Baltimore     130      8730
## 4    D      Baptist Baltimore     280      4380
## 5    E Presbyterian  New York     800      6570
## 6    F Episcopalian Baltimore     110      6130

Notice that the two examples are equivalent. The pipe operator lets us call a function on a dataset by writing it after instead of around the variable name. Even more important, it lets us pass the dataset through a chain of operators. We can do this without saving any intermediate state. Consider the following more complicated example to find out how many unique denominations are in our dataset:

length(unique(sample_df$denomination))
## [1] 4
sample_df$denomination %>%
  unique() %>%
  length()
## [1] 4

The top example is nearly unreadable, and that is with only two function calls and a variable name. The bottom example is far more readable: “Take the list of denominations, then find the unique names, then find the number of unique names.” The ability to chain operators lets us string together our data manipulation verbs to perform complex actions.5

select() (dplyr)

Our first data manipulation verb is select. The select() function lets us pick which columns we want from a data frame. We can do this by specifying the names of the columns that we wish to keep:

sample_df %>%
  select(name, city, members)
##    name      city members
## 1     A  New York     350
## 2     B    Boston     720
## 3     C Baltimore     130
## 4     D Baltimore     280
## 5     E  New York     800
## 6     F Baltimore     110
## 7     G Baltimore     460
## 8     H    Boston     320
## 9     I  New York     490
## 10    J  New York     240

It is also possible to specify the columns we do not want and keep all the others.

sample_df %>%
  select(-denomination, -attendees)
##    name      city members
## 1     A  New York     350
## 2     B    Boston     720
## 3     C Baltimore     130
## 4     D Baltimore     280
## 5     E  New York     800
## 6     F Baltimore     110
## 7     G Baltimore     460
## 8     H    Boston     320
## 9     I  New York     490
## 10    J  New York     240

You can also specify a range of columns using the : symbol.

sample_df %>%
  select(name, city:attendees)
##    name      city members attendees
## 1     A  New York     350      7390
## 2     B    Boston     720       900
## 3     C Baltimore     130      8730
## 4     D Baltimore     280      4380
## 5     E  New York     800      6570
## 6     F Baltimore     110      6130
## 7     G Baltimore     460      1490
## 8     H    Boston     320      7510
## 9     I  New York     490      4560
## 10    J  New York     240      9900

filter() (dplyr)

The verb to select only certain rows from a data frame is filter. The filter() function takes a conditional expression using one or more of the columns in the data frame. A conditional checks the truth of an expression and returns a logical value, TRUE or FALSE. All of the following are examples of conditionals.

100 > 0
## [1] TRUE
4 <= 0
## [1] FALSE
"string" == "another string"
## [1] FALSE
is.na("a value")
## [1] FALSE
is.na(NA)
## [1] TRUE
nchar("string") == 6
## [1] TRUE

You can use the filter() function to return only the rows that match a particular conditional or set of conditionals. For example, we could return only the Catholic churches in our sample data frame.

sample_df %>%
  filter(denomination == "Catholic")
##   name denomination     city members attendees
## 1    H     Catholic   Boston     320      7510
## 2    I     Catholic New York     490      4560
## 3    J     Catholic New York     240      9900

Alternatively, we could find only the churches with at least 500 members.

sample_df %>%
  filter(members >= 500)
##   name denomination     city members attendees
## 1    B      Baptist   Boston     720       900
## 2    E Presbyterian New York     800      6570

Or we could use both conditions. This is the equivalent of joining the two conditions with the & operator.

sample_df %>%
  filter(denomination == "Catholic",
         members >= 500)
## [1] name         denomination city         members      attendees   
## <0 rows> (or 0-length row.names)

arrange() (dplyr)

The verb arrange and its corresponding arrange() function let us sort a data frame. We might want to sort our sample list of churches by the number of attendees:

sample_df %>%
  arrange(attendees)
##    name denomination      city members attendees
## 1     B      Baptist    Boston     720       900
## 2     G      Baptist Baltimore     460      1490
## 3     D      Baptist Baltimore     280      4380
## 4     I     Catholic  New York     490      4560
## 5     F Episcopalian Baltimore     110      6130
## 6     E Presbyterian  New York     800      6570
## 7     A      Baptist  New York     350      7390
## 8     H     Catholic    Boston     320      7510
## 9     C Presbyterian Baltimore     130      8730
## 10    J     Catholic  New York     240      9900

Notice that the churches are sorted in ascending order of attendees. If we want to sort them from greatest to least, we can use the desc() function.

sample_df %>%
  arrange(desc(attendees))
##    name denomination      city members attendees
## 1     J     Catholic  New York     240      9900
## 2     C Presbyterian Baltimore     130      8730
## 3     H     Catholic    Boston     320      7510
## 4     A      Baptist  New York     350      7390
## 5     E Presbyterian  New York     800      6570
## 6     F Episcopalian Baltimore     110      6130
## 7     I     Catholic  New York     490      4560
## 8     D      Baptist Baltimore     280      4380
## 9     G      Baptist Baltimore     460      1490
## 10    B      Baptist    Boston     720       900

Sorting will work with character vectors as well as with numeric vectors.
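For instance, arranging a small made-up data frame by a character column sorts it alphabetically:

```r
library(dplyr)

# A tiny made-up data frame with a character column
cities <- data.frame(city = c("New York", "Baltimore", "Boston"),
                     churches = c(4, 4, 2),
                     stringsAsFactors = FALSE)

# arrange() sorts character columns alphabetically
arrange(cities, city)
##        city churches
## 1 Baltimore        4
## 2    Boston        2
## 3  New York        4
```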

mutate() (dplyr)

The verb mutate and its mutate() function let us create new columns from existing columns.6 For example, in our churches dataset, we might like to know the proportion of attendees that are members. We can use the = operator to assign a calculation to a new column named whatever we like.

sample_df %>%
  mutate(membership_proportion = members / attendees)
##    name denomination      city members attendees membership_proportion
## 1     A      Baptist  New York     350      7390            0.04736130
## 2     B      Baptist    Boston     720       900            0.80000000
## 3     C Presbyterian Baltimore     130      8730            0.01489118
## 4     D      Baptist Baltimore     280      4380            0.06392694
## 5     E Presbyterian  New York     800      6570            0.12176560
## 6     F Episcopalian Baltimore     110      6130            0.01794454
## 7     G      Baptist Baltimore     460      1490            0.30872483
## 8     H     Catholic    Boston     320      7510            0.04260985
## 9     I     Catholic  New York     490      4560            0.10745614
## 10    J     Catholic  New York     240      9900            0.02424242

You can use the mutate() function for all kinds of operations, not just numeric calculations. For example, we could create a column, protestant, which has the value TRUE if the denomination field is not Catholic, and FALSE otherwise.

sample_df %>%
  mutate(protestant = ifelse(denomination == "Catholic", FALSE, TRUE))
##    name denomination      city members attendees protestant
## 1     A      Baptist  New York     350      7390       TRUE
## 2     B      Baptist    Boston     720       900       TRUE
## 3     C Presbyterian Baltimore     130      8730       TRUE
## 4     D      Baptist Baltimore     280      4380       TRUE
## 5     E Presbyterian  New York     800      6570       TRUE
## 6     F Episcopalian Baltimore     110      6130       TRUE
## 7     G      Baptist Baltimore     460      1490       TRUE
## 8     H     Catholic    Boston     320      7510      FALSE
## 9     I     Catholic  New York     490      4560      FALSE
## 10    J     Catholic  New York     240      9900      FALSE

summarize() and group_by() (dplyr)

One of the most powerful data manipulation verbs is summarize, expressed by the summarize() function. It is often the case that our dataset contains many individual observations which we want to collapse into a summary table. In this pattern, we take our data frame and split it apart into several data frames by grouping together unique variables or combinations of variables. We then summarize each data frame using some function, and combine them back together.

This concept is difficult to wrap our minds around without an example. Say, for example, that we wanted to know the number of churches of each denomination in our data frame. We have four different denominations represented. We could group all the Episcopalians together, all the Baptists together, and so on. Then we could count how many are in each group (using the function n()) and combine that information into a new data frame. The three pieces here are the variable (or combination of variables) to group by, the function used to summarize them, and the name of the new column to represent the summed-up value.

sample_df %>%
  group_by(denomination) %>%
  summarize(number_of_churches = n())
## Source: local data frame [4 x 2]
## 
##   denomination number_of_churches
## 1      Baptist                  4
## 2     Catholic                  3
## 3 Episcopalian                  1
## 4 Presbyterian                  2

We could perform the same essential operation by counting the number of churches in each city.

sample_df %>%
  group_by(city) %>%
  summarize(number_of_churches = n())
## Source: local data frame [3 x 2]
## 
##        city number_of_churches
## 1 Baltimore                  4
## 2    Boston                  2
## 3  New York                  4

Or we could summarize the churches in a more sophisticated way by summing up (using the function sum()) the number of attendees and members for each denomination, or by using mean() to find the average number of attendees and members.7

sample_df %>%
  group_by(denomination) %>%
  summarize(total_members = sum(members),
            total_attendees = sum(attendees),
            avg_members = mean(members),
            avg_attendees = mean(attendees))
## Source: local data frame [4 x 5]
## 
##   denomination total_members total_attendees avg_members avg_attendees
## 1      Baptist          1810           14160       452.5      3540.000
## 2     Catholic          1050           21970       350.0      7323.333
## 3 Episcopalian           110            6130       110.0      6130.000
## 4 Presbyterian           930           15300       465.0      7650.000

We could also find the total number of members for each denomination by city by adding a field to the group_by() function. This will yield all the unique combinations of city and denomination.

sample_df %>%
  group_by(denomination, city) %>%
  summarize(total_members = sum(members))
## Source: local data frame [8 x 3]
## Groups: denomination
## 
##   denomination      city total_members
## 1      Baptist Baltimore           740
## 2      Baptist    Boston           720
## 3      Baptist  New York           350
## 4     Catholic    Boston           320
## 5     Catholic  New York           730
## 6 Episcopalian Baltimore           110
## 7 Presbyterian Baltimore           130
## 8 Presbyterian  New York           800

When using summarize it is important to keep the three components in mind.

  1. The group_by() function takes the name of a variable or variables for which it will find all the unique combinations. The number of unique combinations will be the number of rows in our summarized data frame. Those unique combinations will be the basis of several new data frames which will be passed to step 2.

  2. The function to the right of the = sign inside summarize() provides a function which turns multiple rows in our split-apart data frames into a single row. For example, sum() and mean() both take a vector of numbers and return a single value, while n() counts the number of rows in the split-apart data frame.

  3. The variable name to the left of the = sign inside summarize() provides the new column name in the summarized data frame.8
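These verbs become most useful when chained together. As a sketch (rebuilding a small made-up data frame so that the example stands alone), we can filter, group, summarize, and arrange in a single pipeline:

```r
library(dplyr)

# A small made-up churches data frame, on the pattern of sample_df
churches <- data.frame(
  denomination = c("Baptist", "Baptist", "Catholic", "Catholic", "Presbyterian"),
  members = c(720, 350, 320, 490, 800),
  stringsAsFactors = FALSE)

# Keep churches with at least 350 members, then count and total them
# by denomination, then sort by total membership, largest first
churches %>%
  filter(members >= 350) %>%
  group_by(denomination) %>%
  summarize(number = n(), total_members = sum(members)) %>%
  arrange(desc(total_members))
```

Read aloud with the pipe as “then,” the chain describes the whole computation in the order it happens.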

gather() (tidyr)

Our next two data manipulation verbs are drawn from the tidyr package instead of from the dplyr package. Because they involve manipulating data from wide to long formats and back again, we will create a slightly different example data frame. Suppose that our data frame from earlier now contains three columns, one for the membership in each of three different years:

sample_df_wide <- data.frame(
  name = toupper(letters[1:10]),
  denomination = sample(c("Presbyterian", "Episcopalian",
                          "Catholic", "Baptist"), 10, replace = TRUE),
  city = sample(c("New York", "Boston", "Baltimore"), 10,  replace = TRUE),
  members_1830 = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
  members_1840 = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
  members_1850 = sample(seq(1e2, 1e3, 10), 10, replace = TRUE),
  stringsAsFactors = FALSE)
sample_df_wide
##    name denomination      city members_1830 members_1840 members_1850
## 1     A Presbyterian Baltimore          490          540          270
## 2     B     Catholic  New York          600          150          300
## 3     C Episcopalian  New York          520          630          390
## 4     D      Baptist  New York          720          480          720
## 5     E     Catholic    Boston          990          440          740
## 6     F      Baptist    Boston          460          830          300
## 7     G Presbyterian  New York          450          350          460
## 8     H Presbyterian  New York          180          900          120
## 9     I      Baptist    Boston          270          200          340
## 10    J Presbyterian    Boston          490          480          270

This information would be easier to plot if it were in long format instead of wide format. The gather() function from tidyr lets us perform such an operation.

What we want to do in this case is take the three columns members_1830, members_1840, and members_1850 and turn them into two columns: year and members. As they stand now, the column names actually represent data, that is, the year in which an observation was taken, and that information belongs in its own column. The gather() function expects us to give the names of the key and value columns: in this case, year (whose values will become the existing names of the columns) and members (whose values will become the current values of the columns we are gathering). The function gather() will also let us specify which columns should remain untouched, in this case name, denomination, and city. The columns to be gathered can be specified with the same syntax as in select().

sample_df_wide %>%
  gather(year, members, -name, -denomination, -city)
##    name denomination      city         year members
## 1     A Presbyterian Baltimore members_1830     490
## 2     B     Catholic  New York members_1830     600
## 3     C Episcopalian  New York members_1830     520
## 4     D      Baptist  New York members_1830     720
## 5     E     Catholic    Boston members_1830     990
## 6     F      Baptist    Boston members_1830     460
## 7     G Presbyterian  New York members_1830     450
## 8     H Presbyterian  New York members_1830     180
## 9     I      Baptist    Boston members_1830     270
## 10    J Presbyterian    Boston members_1830     490
## 11    A Presbyterian Baltimore members_1840     540
## 12    B     Catholic  New York members_1840     150
## 13    C Episcopalian  New York members_1840     630
## 14    D      Baptist  New York members_1840     480
## 15    E     Catholic    Boston members_1840     440
## 16    F      Baptist    Boston members_1840     830
## 17    G Presbyterian  New York members_1840     350
## 18    H Presbyterian  New York members_1840     900
## 19    I      Baptist    Boston members_1840     200
## 20    J Presbyterian    Boston members_1840     480
##  [ reached getOption("max.print") -- omitted 10 rows ]

Notice that our data frame is now much longer but also narrower. We would have to do some additional cleanup to transform character values like members_1830 into integers like 1830, but we could do that using mutate() and functions from the stringr package. (See the chapter on manipulating strings.)
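A sketch of that cleanup, assuming the stringr package is installed: strip the members_ prefix with str_replace() and convert what remains to an integer.

```r
library(dplyr)
library(tidyr)
library(stringr)

sample_df_wide %>%
  gather(year, members, -name, -denomination, -city) %>%
  # "members_1830" becomes the integer 1830
  mutate(year = as.integer(str_replace(as.character(year), "members_", "")))
```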

spread() (tidyr)

The spread() function is the exact inverse of gather(). Where the gather() function took column names and turned them into the values of rows, spread() takes the values of rows and turns them into columns. We can see this with our newly transformed long dataset made with gather(), which we can retransform using spread(). We have to pass spread() the names of the key and value columns. Each unique value in the key column will become a new column with that name; each associated value in the value column will be filed under that column.

sample_df_long <- sample_df_wide %>%
  gather(year, members, -name, -denomination, -city)

sample_df_long %>%
  spread(year, members)
##    name denomination      city members_1830 members_1840 members_1850
## 1     A Presbyterian Baltimore          490          540          270
## 2     B     Catholic  New York          600          150          300
## 3     C Episcopalian  New York          520          630          390
## 4     D      Baptist  New York          720          480          720
## 5     E     Catholic    Boston          990          440          740
## 6     F      Baptist    Boston          460          830          300
## 7     G Presbyterian  New York          450          350          460
## 8     H Presbyterian  New York          180          900          120
## 9     I      Baptist    Boston          270          200          340
## 10    J Presbyterian    Boston          490          480          270

Sometimes you will want to spread or gather a column to perform calculations on it, and then you will want to do the reverse to get it back to its original form.
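For example, with the sample_df_long data frame we just created, we could spread the years into columns, calculate the change in membership between 1830 and 1850, and then gather the membership columns back into long format:

```r
library(dplyr)
library(tidyr)

sample_df_long %>%
  spread(year, members) %>%
  # With one column per year, arithmetic across years is easy
  mutate(change = members_1850 - members_1830) %>%
  # Return to long format, keeping the new change column
  gather(year, members, -name, -denomination, -city, -change)
```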

The join family of functions (dplyr and base R)

The final data manipulation verb is join, which has a wide array of functions associated with it. Joining is a powerful action which combines two separate data frames based on a key column (or columns) shared by both of them. For example, let’s suppose we have a separate data frame with the population of cities.

cities <- data.frame(
  city = c("New York", "Baltimore", "Boston"),
  population = sample(seq(1e6, 1e7, 1e3), 3))

In our original sample_df data frame, it would be nice to associate the population of each city with the churches located in those cities. Then we could calculate what percentage of the city's population attends each church or denomination. Since the city name is the same in the city columns in both data frames, we can join the two together. Churches in New York will get a new column population with the population of New York, and so forth. The join is performed by the function left_join() from dplyr.

sample_df %>%
  left_join(cities, by = "city")
## Warning: joining factor and character vector, coercing into character
## vector
##    name denomination      city members attendees population
## 1     A      Baptist  New York     350      7390    9584000
## 2     B      Baptist    Boston     720       900    4482000
## 3     C Presbyterian Baltimore     130      8730    7910000
## 4     D      Baptist Baltimore     280      4380    7910000
## 5     E Presbyterian  New York     800      6570    9584000
## 6     F Episcopalian Baltimore     110      6130    7910000
## 7     G      Baptist Baltimore     460      1490    7910000
## 8     H     Catholic    Boston     320      7510    4482000
## 9     I     Catholic  New York     490      4560    9584000
## 10    J     Catholic  New York     240      9900    9584000

Notice that in the new data frame, all the churches in New York have the same value for population, drawn from the cities data frame; the same is true for Baltimore and Boston.

There are many different kinds of joins, including left_join(), inner_join(), semi_join(), and anti_join(). You can read about the exact definition of each join by looking up ?join in the dplyr help. Suffice it to say that the different kinds of joins do different things when matching data frames together, usually having to do with what happens when there is not a perfect match.
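As a small illustration of the difference, consider two made-up data frames where one city appears on only one side. A left_join() keeps every row of the left data frame, filling in NA where there is no match; an inner_join() keeps only the rows that match in both.

```r
library(dplyr)

churches <- data.frame(city = c("Boston", "Providence"),
                       members = c(320, 150),
                       stringsAsFactors = FALSE)
populations <- data.frame(city = c("Boston", "New York"),
                          population = c(4482000, 9584000),
                          stringsAsFactors = FALSE)

# Keeps Providence, with NA for its population
left_join(churches, populations, by = "city")

# Drops Providence, since it has no match in populations
inner_join(churches, populations, by = "city")
```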

The by = "city" argument in left_join() specifies the name of the column which holds the shared key. You can join by multiple columns, but each key column has to have the same name in both data frames. If one data frame had a column named city and the other had a column named city_name, the join would not work. It is likely that eventually dplyr will remove this limitation. In the meantime, you can rename columns using select(): specifying select(city = city_name) will rename a column, or you can add a new column with mutate(city = city_name). You could also use the base R function merge(), which lets you specify by.x and by.y arguments giving the names of the shared columns in each data frame.
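A sketch of the base R approach, using a hypothetical cities2 data frame whose key column is named city_name instead of city:

```r
# merge() can join on differently named key columns via by.x and by.y
cities2 <- data.frame(city_name = c("New York", "Boston", "Baltimore"),
                      population = c(9584000, 4482000, 7910000),
                      stringsAsFactors = FALSE)

merged <- merge(sample_df, cities2, by.x = "city", by.y = "city_name")
```

By default merge() performs an inner join; adding all.x = TRUE keeps unmatched rows from the first data frame, like left_join() does.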

Do

Finally, dplyr also provides a useful do() function for applying models and the like to data frames. We will use this function in the chapter on statistics.

Example: Tidying and analyzing birthplace data

Let's return to our NHGIS birthplace dataset to put these data manipulation verbs into practice. Here we have a very interesting dataset which can tell us about immigration to and migration within the United States. But it is contained in three different files, and the format of each file is almost unusable. In particular, the codes are unreadable. This dataset will give us practice manipulating and querying data.

Combining the data files

The first thing to do is to get the paths of our three files and the associated codebooks.

birthplace1 <- "data/nhgis-nativity/nhgis0023_ds12_1850_state.csv"
birthplace2 <- "data/nhgis-nativity/nhgis0024_ds15_1860_state.csv"
birthplace3 <- "data/nhgis-nativity/nhgis0024_ds17_1870_state.csv"

codes1 <- "data/nhgis-nativity/nhgis0023_ds12_1850_state_codebook.txt"
codes2 <- "data/nhgis-nativity/nhgis0024_ds15_1860_state_codebook.txt"
codes3 <- "data/nhgis-nativity/nhgis0024_ds17_1870_state_codebook.txt"

Now we can load those files.

birthdata1 <- read.csv(birthplace1, stringsAsFactors = FALSE)
birthdata2 <- read.csv(birthplace2, stringsAsFactors = FALSE)
birthdata3 <- read.csv(birthplace3, stringsAsFactors = FALSE)

A quick look at the first few lines of one of the files will remind us what we are working with.

head(birthdata1)
##   GISJOIN YEAR                STATE STATEA             AREANAME AFB001
## 1    G010 1850              Alabama     10              Alabama 237542
##   AFB002 AFB003 AFB004 AFB005 AFB006 AFB007 AFB008 AFB009 AFB010 AFB011
## 1     91      0     66    612     73   1060  58997    114     93      7
##   AFB012 AFB013 AFB014 AFB015 AFB016 AFB017 AFB018 AFB019 AFB020 AFB021
## 1   2694    628    215    757    654      3   2852    158    151    271
##   AFB022 AFB023 AFB024 AFB025 AFB026 AFB027 AFB028 AFB029 AFB030 AFB031
## 1   1443  28521    276    876     74  48663  22541     55    155  10387
##   AFB032 AFB033 AFB034 AFB035 AFB036 AFB037 AFB038 AFB039 AFB040 AFB041
## 1      3      0    941    584     67   3639     45   1068     33      1
##   AFB042 AFB043 AFB044 AFB045 AFB046 AFB047 AFB048 AFB049 AFB050 AFB051
## 1      3     18     51      4    113     10    503    163     39     90
##   AFB052 AFB053 AFB054 AFB055 AFB056 AFB057 AFB058 AFB059 AFB060 AFB061
## 1      0      7      1      0      0      3     18     49     39      3
##   AFB062 AFB063 AFB064 AFB065
## 1      2     28    116   1109
##  [ reached getOption("max.print") -- omitted 5 rows ]

We want to keep the columns GISJOIN, YEAR, STATE, STATEA, and AREANAME as columns, but we want to turn all the other columns into key-value pairs using gather().9 For good measure, we will also give these data frames the extra class tbl_df.10

library(dplyr)
library(tidyr)

birthdata1 %>%
  gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
  tbl_df() -> birthdata1

birthdata2 %>%
  gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
  tbl_df() -> birthdata2

birthdata3 %>%
  gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
  tbl_df() -> birthdata3

Now we can check our work by looking at one of our data frames.

birthdata1
## Source: local data frame [2,340 x 7]
## 
##    GISJOIN YEAR                STATE STATEA             AREANAME   code
## 1     G010 1850              Alabama     10              Alabama AFB001
## 2     G050 1850             Arkansas     50             Arkansas AFB001
## 3     G060 1850           California     60           California AFB001
## 4     G090 1850          Connecticut     90          Connecticut AFB001
## 5     G100 1850             Delaware    100             Delaware AFB001
## 6     G110 1850 District Of Columbia    110 District of Columbia AFB001
## 7     G120 1850              Florida    120              Florida AFB001
## 8     G130 1850              Georgia    130              Georgia AFB001
## 9     G170 1850             Illinois    170             Illinois AFB001
## 10    G180 1850              Indiana    180              Indiana AFB001
## ..     ...  ...                  ...    ...                  ...    ...
## Variables not shown: value (int)

It would be nice to combine these three data frames into one. We could not do this before, when they had different numbers of columns with different names. But now that all three data frames have the same columns, we can easily combine them one after another with the function rbind().

birthplace <- rbind(birthdata1, birthdata2, birthdata3)
birthplace
## Source: local data frame [9,277 x 7]
## 
##    GISJOIN YEAR                STATE STATEA             AREANAME   code
## 1     G010 1850              Alabama     10              Alabama AFB001
## 2     G050 1850             Arkansas     50             Arkansas AFB001
## 3     G060 1850           California     60           California AFB001
## 4     G090 1850          Connecticut     90          Connecticut AFB001
## 5     G100 1850             Delaware    100             Delaware AFB001
## 6     G110 1850 District Of Columbia    110 District of Columbia AFB001
## 7     G120 1850              Florida    120              Florida AFB001
## 8     G130 1850              Georgia    130              Georgia AFB001
## 9     G170 1850             Illinois    170             Illinois AFB001
## 10    G180 1850              Indiana    180              Indiana AFB001
## ..     ...  ...                  ...    ...                  ...    ...
## Variables not shown: value (int)

Our new birthplace variable now has 9277 rows: the same as the sum of the number of rows in birthdata1, birthdata2, and birthdata3. We now have a single variable containing all our data in a long instead of wide format. This will be much easier to work with.

Excursus: a better way of loading multiple files

You might have noticed that our method of opening multiple files is clunky. We had to keep track of a lot of variable names using numeric suffixes. Even worse, we had to do a lot of copying and pasting of code. This was barely manageable for three files; it would have been unmanageable for dozens of files. A good rule for programming is DRY: don’t repeat yourself. If you find yourself copying and pasting code, there is probably a better way to do it.

This section uses some of R’s functional programming techniques and its data structures to perform the same action more easily. This technique could load a hundred files as easily as it could load two. You can skip this section if you want.

Instead of loading each file individually, we are going to use R's lapply() function to apply the read.csv() function to each element in a vector of file names. Those files will be read into a list of data frames. We can then use lapply() again to reshape the data frames. Then we can combine them all together.

csv_files <- Sys.glob("data/nhgis-nativity/*state.csv")

Notice that in one line of code we’ve loaded all of the filenames, using a Unix-like glob (or wildcard expansion).

csv_files
## [1] "data/nhgis-nativity/nhgis0023_ds12_1850_state.csv"
## [2] "data/nhgis-nativity/nhgis0024_ds15_1860_state.csv"
## [3] "data/nhgis-nativity/nhgis0024_ds17_1870_state.csv"

Now we use lapply() to apply the function that reads the CSV files.

birthplace_data <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)

The result is a list of data frames. Those data frames are identical to the data frames we loaded individually above.

str(birthplace_data, max.level = 1)
## List of 3
##  $ :'data.frame':    36 obs. of  70 variables:
##  $ :'data.frame':    42 obs. of  84 variables:
##  $ :'data.frame':    47 obs. of  82 variables:

Next we use lapply() again to reshape the data frames. Notice that the code is essentially the same as what we used above with gather(), but we only have to write it once, no matter how many data frames are in our list. We’ve put the code inside an anonymous function (that is, a function that hasn’t been assigned to a variable) because lapply() applies a function to some kind of vector or list.

birthplace_reshaped <- lapply(birthplace_data, function(df) {
  df %>%
    gather(code, value, -GISJOIN, -YEAR, -STATE, -STATEA, -AREANAME) %>%
    tbl_df()
  })

We can double check that the code did what we expected.

birthplace_reshaped
## [[1]]
## Source: local data frame [2,340 x 7]
## 
##    GISJOIN YEAR                STATE STATEA             AREANAME   code
## 1     G010 1850              Alabama     10              Alabama AFB001
## 2     G050 1850             Arkansas     50             Arkansas AFB001
## 3     G060 1850           California     60           California AFB001
## 4     G090 1850          Connecticut     90          Connecticut AFB001
## 5     G100 1850             Delaware    100             Delaware AFB001
## 6     G110 1850 District Of Columbia    110 District of Columbia AFB001
## 7     G120 1850              Florida    120              Florida AFB001
## 8     G130 1850              Georgia    130              Georgia AFB001
## 9     G170 1850             Illinois    170             Illinois AFB001
## 10    G180 1850              Indiana    180              Indiana AFB001
## ..     ...  ...                  ...    ...                  ...    ...
## Variables not shown: value (int)
## 
## [[2]]
## Source: local data frame [3,318 x 7]
## 
##    GISJOIN YEAR                STATE STATEA             AREANAME   code
## 1     G010 1860              Alabama     10              Alabama AII001
## 2     G050 1860             Arkansas     50             Arkansas AII001
## 3     G060 1860           California     60           California AII001
## 4     G085 1860   Colorado Territory     85   Colorado Territory AII001
## 5     G090 1860          Connecticut     90          Connecticut AII001
## 6     G095 1860     Dakota Territory     95     Dakota Territory AII001
## 7     G100 1860             Delaware    100             Delaware AII001
## 8     G110 1860 District Of Columbia    110 District of Columbia AII001
## 9     G120 1860              Florida    120              Florida AII001
## 10    G130 1860              Georgia    130              Georgia AII001
## ..     ...  ...                  ...    ...                  ...    ...
## Variables not shown: value (int)
## 
## [[3]]
## Source: local data frame [3,619 x 7]
## 
##    GISJOIN YEAR                STATE STATEA             AREANAME   code
## 1     G010 1870              Alabama     10              Alabama ALF001
## 2     G045 1870    Arizona Territory     45    Arizona Territory ALF001
## 3     G050 1870             Arkansas     50             Arkansas ALF001
## 4     G060 1870           California     60           California ALF001
## 5     G085 1870   Colorado Territory     85   Colorado Territory ALF001
## 6     G090 1870          Connecticut     90          Connecticut ALF001
## 7     G095 1870     Dakota Territory     95     Dakota Territory ALF001
## 8     G100 1870             Delaware    100             Delaware ALF001
## 9     G110 1870 District Of Columbia    110 District of Columbia ALF001
## 10    G120 1870              Florida    120              Florida ALF001
## ..     ...  ...                  ...    ...                  ...    ...
## Variables not shown: value (int)

This is still a list of data frames, and now we need to bind them together. We can use the rbind_all() function provided by dplyr to combine them:

birthplace <- rbind_all(birthplace_reshaped)
birthplace
## Source: local data frame [9,277 x 7]
## 
##    GISJOIN YEAR                STATE STATEA             AREANAME   code
## 1     G010 1850              Alabama     10              Alabama AFB001
## 2     G050 1850             Arkansas     50             Arkansas AFB001
## 3     G060 1850           California     60           California AFB001
## 4     G090 1850          Connecticut     90          Connecticut AFB001
## 5     G100 1850             Delaware    100             Delaware AFB001
## 6     G110 1850 District Of Columbia    110 District of Columbia AFB001
## 7     G120 1850              Florida    120              Florida AFB001
## 8     G130 1850              Georgia    130              Georgia AFB001
## 9     G170 1850             Illinois    170             Illinois AFB001
## 10    G180 1850              Indiana    180              Indiana AFB001
## ..     ...  ...                  ...    ...                  ...    ...
## Variables not shown: value (int)

The result is exactly the same as what we did above, but it took many fewer lines of code, and the code is generalizable to any problem where we have to load multiple CSV files and combine them. We could have made the code even shorter by using the pipe operator. We will demonstrate that by loading and parsing the codebooks in a series of steps. Exactly why we are parsing the code books will be explained in the next section.

library(mullenMisc)
Sys.glob("data/nhgis-nativity/*state_codebook.txt") %>%
  lapply(parse_nhgis_codebook) %>%
  rbind_all() -> codebooks

codebooks
## Source: local data frame [227 x 2]
## 
##      code                           meaning
## 1  AFA001                       Native-born
## 2  AFA002                      Foreign-born
## 3  AFB001              Native-born: Alabama
## 4  AFB002             Native-born: Arkansas
## 5  AFB003           Native-born: California
## 6  AFB004 Native-born: District of Columbia
## 7  AFB005          Native-born: Connecticut
## 8  AFB006             Native-born: Delaware
## 9  AFB007              Native-born: Florida
## 10 AFB008              Native-born: Georgia
## ..    ...                               ...

Merging codes with their values

Our birthplace data frame is now much more usable than it was before. But those codes are still a problem. We have the codebooks that NHGIS provided to explain them, but we can hardly go looking at them every time we want to see what a code means. The mullenMisc package contains a function, parse_nhgis_codebook(), which can turn a codebook text file into a data frame.11 We will do that for each of the files, then bind them together.

library(mullenMisc)
codebook1 <- parse_nhgis_codebook(codes1)
codebook2 <- parse_nhgis_codebook(codes2)
codebook3 <- parse_nhgis_codebook(codes3)

codebooks <- rbind(codebook1, codebook2, codebook3)

Now we have a list of all the codes and their meanings, and we can use a join to combine our data frame of codes with our data frame of birthplaces. In this case the proper join is left_join(), since a left join will return all of the rows from our birthplace data (we want all of that, even if we don’t have a matching code) and all of the columns from both the birthplace data frame and the codebook data frame. Both our birthplace data frame and our codebooks data frame have a column called code, so we don’t need to specify which columns to join by.

birthplace <- birthplace %>%
  left_join(codebooks)
## Joining by: "code"
## Warning: joining factor and character vector, coercing into character
## vector
str(birthplace)
## Classes 'tbl_df', 'tbl' and 'data.frame':    12595 obs. of  8 variables:
##  $ GISJOIN : chr  "G010" "G050" "G060" "G090" ...
##  $ YEAR    : int  1850 1850 1850 1850 1850 1850 1850 1850 1850 1850 ...
##  $ STATE   : chr  "Alabama" "Arkansas" "California" "Connecticut" ...
##  $ STATEA  : int  10 50 60 90 100 110 120 130 170 180 ...
##  $ AREANAME: chr  "Alabama" "Arkansas" "California" "Connecticut" ...
##  $ code    : chr  "AFB001" "AFB001" "AFB001" "AFB001" ...
##  $ value   : int  237542 11250 631 74 4 45 2340 3154 1335 395 ...
##  $ meaning : Factor w/ 86 levels "Birthplace unknown",..: 35 35 35 35 35 35 35 35 35 35 ...

Now our data frame has a meaning column. This is much better than dealing with those codes. But the meaning column still has two kinds of information in it: whether the place is “native” or “foreign”, and what the name of the place is. We can use mutate() to create new columns which extract just that information. The stri_detect_fixed() function from the stringi package will detect whether the word "Foreign" is present. If so, we will give the field native_or_foreign the value "foreign"; otherwise we will give it the value "native".

library(stringi)

birthplace <- birthplace %>%
  mutate(native_or_foreign = ifelse(stri_detect_fixed(meaning, "Foreign"),
                                    "foreign", "native"))

Now that we have a field with a native or foreign value, we can use the meaning column for the actual place of birth and start asking interesting questions. In most cases these questions will involve summarizing the data. One interesting question is what proportion of the population was native- and foreign-born in each year.

birthplace %>%
  group_by(YEAR) %>%
  mutate(total = sum(value, na.rm = TRUE),
         proportion = value / total) %>%
  group_by(YEAR, native_or_foreign) %>%
  summarize(proportion = sum(proportion, na.rm = TRUE)) ->
  birthplace_proportions_by_year

birthplace_proportions_by_year
## Source: local data frame [5 x 3]
## Groups: YEAR
## 
##   YEAR native_or_foreign proportion
## 1 1850           foreign 0.11061073
## 2 1850            native 0.88938927
## 3 1860           foreign 0.07565308
## 4 1860            native 0.92434692
## 5 1870                NA 1.00000000
library(ggplot2)
birthplace_proportions_by_year %>%
  ggplot(aes(x = YEAR, y = proportion, fill= native_or_foreign)) + 
  geom_bar(stat = "identity") +
  coord_flip()

Alternatively, we could look for the largest foreign-born groups in each state by year.

birthplace %>%
  group_by(YEAR) %>%
  mutate(total = sum(value, na.rm = TRUE),
         proportion = value / total) %>%
  filter(native_or_foreign == "foreign") %>%
  ungroup() %>%
  arrange(desc(proportion)) %>%
  select(YEAR, STATE, meaning, value, proportion)
## Source: local data frame [4,140 x 5]
## 
##    YEAR         STATE               meaning  value  proportion
## 1  1850      New York Foreign-born: Ireland 343111 0.017166225
## 2  1850  Pennsylvania Foreign-born: Ireland 151723 0.007590870
## 3  1850      New York Foreign-born: Germany 118398 0.005923584
## 4  1850 Massachusetts Foreign-born: Ireland 115917 0.005799456
## 5  1850          Ohio Foreign-born: Germany 111257 0.005566311
## 6  1860      New York Foreign-born: Ireland 498072 0.004527583
## 7  1860      New York Foreign-born: Ireland 498072 0.004527583
## 8  1850      New York Foreign-born: England  84820 0.004243639
## 9  1850  Pennsylvania Foreign-born: Germany  78592 0.003932045
## 10 1850          Ohio Foreign-born: Ireland  51562 0.002579704
## ..  ...           ...                   ...    ...         ...

Unsurprisingly, Irish and German immigrants comprised the biggest immigrant groups in the most populous states. Keep in mind that the proportions here are measured against the total population of the whole country in a given year, not of each state, which is why even the largest group, the Irish in New York in 1850, accounts for under two percent.
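If we wanted literally the single largest foreign-born group in each state, a sketch would group by year and state and keep only the row with the maximum value:

```r
library(dplyr)

birthplace %>%
  filter(native_or_foreign == "foreign") %>%
  group_by(YEAR, STATE) %>%
  # Within each year-state group, keep the row with the most people
  filter(value == max(value, na.rm = TRUE)) %>%
  select(YEAR, STATE, meaning, value)
```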

Broom package

R functions often return data in non-tidy formats. The spatial objects from the sp package are one example, as are the models returned from various statistical functions. It is often much easier to use the broom package to tidy this data than to work with it in its raw form. TODO

Manipulating lists and lists of lists

Data Cleaning

Further reading


  1. Hadley Wickham, “Tidy Data,” Journal of Statistical Software 50, no. 10 (2014): http://www.jstatsoft.org/v59/i10.

  2. Wickham, “Tidy Data,” 4.

  3. The dplyr package is co-written by Romain Francois.

  4. The pipe operator is provided by the magrittr package by Stefan Bache. It is not necessary to load magrittr directly if you are loading dplyr, since dplyr imports the pipe operator. The pipe operator is, like everything in R, a function and not technically an operator. But it is a special type of binary function which takes its arguments as a left-hand side and a right-hand side.

  5. For more, see the magrittr vignette.

  6. You can think of this, if you must, as being like a formula in Excel.

  7. Keep in mind that the sum() function will return NA if the vector passed to it contains any NAs. This is because R is (rightly) conservative and won’t pretend to assign a value to those missing values. You can avoid this by adding the na.rm = TRUE argument to sum. So if you have a vector x with NA values, then sum(x, na.rm = TRUE) will remove the missing values and sum the rest. The same argument also applies to mean() and median() and quite a few other R functions.

  8. These three steps correspond to the “split-apply-combine” steps in Hadley Wickham’s article by that name. Hadley Wickham, “The Split-Apply-Combine Strategy for Data Analysis,” Journal of Statistical Software 40, no. 1 (2011): http://www.jstatsoft.org/v40/i01/.

  9. Some of these columns will be useful later for merging our data with geospatial data, so we’ll keep them around, even though they aren’t strictly necessary for our purposes now.

  10. The main value added by tbl_df, which is provided by dplyr, is that it makes the default printing of data frames much nicer.

  11. mullenMisc is my personal package which contains miscellaneous functions that I use repeatedly. You might eventually want to create your own personal package for such purposes.