Please note that this book is a work in progress, the equivalent of a pre-alpha release. All the code should work, because if it didn’t the site could not be built. But there is still a lot of work to do to explain the historical methods under discussion. Feel free to leave feedback as issues on the GitHub repository, or to e-mail me.

A standard historian’s joke goes something like this: “On or about October 31, 1517, modernity began.” Martin Luther did indeed nail his ninety-five theses to the door of the Wittenberg church on that date. But the beginning of modernity, or capitalism, or conceptions of the self, or any other historical abstraction worth studying cannot be so readily pinned down to a specific date, whether on or about. The problem is one of precision. Historical events and movements have fuzzy beginnings and endings, and even straightforward facts often have uncertain dates. Think of chronological terms like “the long nineteenth century,” “the last third of the eighteenth century,” “the American Revolution,” or “antiquity” and “modernity,” and you’ll recognize how good historians are at expressing the necessary chronological uncertainty in prose.

The problem of precision and dates is even more vexing when it comes to computation. Computers demand precision in dates. The computer has no built-in concept of periods such as centuries. To explain the nineteenth century to a computer, you will have to tell it that the century runs from January 1, 1801 to December 31, 1900.1 Even more difficult is that computers require precision down to the second or below. On Unix operating systems, dates are computed as the number of seconds elapsed since the beginning of the Unix epoch: 00:00:00 in Coordinated Universal Time (roughly, Greenwich Mean Time) on Thursday, January 1, 1970, excluding leap seconds.2

If I ask R for the current time, I see that more than 1.4 billion seconds have elapsed since then, and R reports the value down to the hundred-thousandth of a second.

now <- Sys.time()
as.character(as.numeric(now))
## [1] "1428784866.71154"
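
To make the epoch more concrete, here is a minimal sketch in base R. The variable names are chosen only for illustration. It shows that the epoch itself is stored as zero, that a moment before 1970 is stored as a negative number of seconds, and that a date without a time is stored as a count of days rather than seconds.

```r
# A minimal sketch in base R; the variable names are illustrative only.
epoch <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
moon  <- as.POSIXct("1969-07-21 00:00:00", tz = "UTC")

# Date-times are stored as seconds since the epoch.
epoch_seconds <- as.numeric(epoch)  # zero: this is the epoch itself
moon_seconds  <- as.numeric(moon)   # negative: this moment precedes 1970

# Dates without times are stored as days since the epoch.
moon_days <- as.numeric(as.Date("1969-07-21"))

# The two representations agree once seconds are converted to days.
moon_seconds / (60 * 60 * 24) == moon_days
```

The practical consequence is that every date R stores is really a number, which is what makes sorting and arithmetic on dates possible.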

The problem for the computational historian, then, is to work with artificially precise dates in a way that makes sense for the discipline of history. This will require a deep knowledge on your part about the date systems of the period you are working in. A historian of early modern Europe, for example, will have to be aware of the distinction between old style and new style dates, while a historian of China will have to be able to translate Chinese dates into a format amenable to R. In general, R will only work with dates in the Gregorian calendar.

This chapter will work with a sample dataset of missions by the Paulist Fathers, a Roman Catholic missionary order in the United States. After a brief introduction to R’s native date formats, the chapter will explain the lubridate package. In particular it will cover parsing dates, extracting information from them, and using them in calculations.3

R’s native date system

Creating a date object

Extracting information from a date object

Lubridate

As is the case with much of R, a lot of the pain in manipulating dates can be removed by using the right package. Lubridate is a package which provides many useful functions for parsing dates, getting information out of them, and performing calculations on them.4

Parsing dates

The most useful feature of lubridate is the set of functions it provides to parse dates from character strings. If you have control over how data is stored and represented, then you should use the widely agreed upon standard ISO-8601. That standard has many recommendations, but its basic recommendation is that dates be stored as strings, arranged from year to month to day to time. You can omit any value for which you do not have data.5 For example, July 21, 1969 would be represented as "1969-07-21", while just the month of July 1969 would be represented as "1969-07". Dates in this format can be easily parsed, and as a bonus, they can be easily sorted as well, even by a program which only knows how to sort text or numbers and not dates. Chances are good, however, that you do not have control over the formats in which dates are represented in your data, so an easy way of parsing them is important.
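
The claim about sorting can be checked with a small sketch in base R. The strings below are illustrative only, and nothing here parses them as dates: they are sorted purely as text.

```r
# Illustrative strings only; no date parsing is involved here.
iso_dates   <- c("1969-07-21", "1962-09-12", "1963-11-22")
other_dates <- c("7/21/1969", "9/12/1962", "11/22/1963")

# Sorting ISO 8601 strings as plain text puts them in chronological order.
sort(iso_dates)

# Sorting month/day/year strings as plain text does not, because the
# comparison starts with the month rather than the year.
sort(other_dates)
```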

Lubridate provides a set of functions in the form ymd() which can parse strings into dates. To take our example above, we can use that function to parse our string representing a date into a date object.

library(lubridate)
ymd("1969-07-21")
## [1] "1969-07-21 UTC"

But lubridate includes a number of other functions that rearrange the order of year, month, and day. For example, mdy() will parse dates arranged by month, day, and year. And lubridate is very good at taking representations of dates such as "January" or "Jan." and turning them into date objects. All of the following are dates that lubridate knows how to parse.

ymd("1969 July 21")
## [1] "1969-07-21 UTC"
mdy("July 21, 1969")
## [1] "1969-07-21 UTC"
mdy("7/21/1969")
## [1] "1969-07-21 UTC"
dmy("21 July 1969")
## [1] "1969-07-21 UTC"

If lubridate does not know how to parse a date, it does the right thing. It will emit a warning message so you will know that a string did not parse into a date, but it will also assign the value NA. For example, a dataset I worked with represented unknown dates with the string "unknown".

ymd("unknown")
## Warning: All formats failed to parse. No formats found.
## [1] NA
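
In a longer column of dates it helps to know which strings failed, not just that some did. The following is a minimal sketch with made-up values (raw_dates is hypothetical, not taken from any dataset). It uses the quiet argument of the parsing functions to suppress the warning and then looks up the unparsed strings by their NA values.

```r
library(lubridate)

# Hypothetical raw values, including one placeholder for a missing date.
raw_dates <- c("July 21, 1969", "unknown", "September 12, 1962")

# quiet = TRUE suppresses the warning; failures still become NA.
parsed <- mdy(raw_dates, quiet = TRUE)

# List the original strings that did not parse, so they can be checked by hand.
failed <- raw_dates[is.na(parsed)]
failed
```

Silencing the warning is only sensible if you inspect the failures yourself, as the last lines do.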

If you have a date that represents only a year and a month, lubridate will not be able to parse it because R needs a year, a month, and a day to create a date object. A common practice in such cases is to assign an arbitrary day, usually the first day of the month. As long as you don’t mix dates that are specific only down to the month with dates that contain real days, this should not cause a problem.

month <- "1969-07"
ymd(month)
## Warning: All formats failed to parse. No formats found.
## [1] NA
# Use paste0() to add an arbitrary day to the date without inserting a space
ymd(paste0(month, "-01"))
## [1] "1969-07-01 UTC"
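
If a dataset does mix month-level and day-level dates, one option is to record the precision alongside the padded date so the two kinds can be separated later. This is only a sketch under that assumption: the raw values and the precision column are hypothetical, and it uses base R's as.Date() rather than lubridate.

```r
# Hypothetical raw values: two full dates and one known only to the month.
raw <- c("1969-07-21", "1969-07", "1962-09-12")

# A "YYYY-MM" string has seven characters; record what we actually know.
precision <- ifelse(nchar(raw) == 7, "month", "day")

# Pad month-level strings with an arbitrary first day, then parse.
padded <- ifelse(precision == "month", paste0(raw, "-01"), raw)
dates  <- as.Date(padded)

data.frame(raw, date = dates, precision)
```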

If you have dates that are only years, it is not usually necessary to convert them to date objects. You can treat them simply as integers.

Extracting information

Once you have dates stored as date objects, you can use lubridate’s functions to extract the components of the date. For example, given a list of dates, you might be interested only in the year in which they occurred, or perhaps you want to see if there is a cyclical pattern by extracting the month. Lubridate can also extract the day of the week. (This will only work for Gregorian dates.)

Take our example date of July 21, 1969 from above. If we save this in a variable, we can extract all kinds of information from it in a variety of numeric and character formats.

moon_walk <- mdy("July 21, 1969")
year(moon_walk)
## [1] 1969
month(moon_walk)
## [1] 7
month(moon_walk, label = TRUE)
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
month(moon_walk, label = TRUE, abbr = FALSE)
## [1] July
## 12 Levels: January < February < March < April < May < June < ... < December
day(moon_walk)
## [1] 21
yday(moon_walk) # day of the year
## [1] 202
weekdays(moon_walk)
## [1] "Monday"
weekdays(moon_walk, abbr = TRUE)
## [1] "Mon"
week(moon_walk) # week of the year
## [1] 29
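
The weekdays() function used above comes from base R and returns plain text. Lubridate's own equivalent is wday(), sketched below. With label = TRUE it returns an ordered factor, which keeps the days in calendar order when tabulating or plotting. The default numbering assumes the week starts on Sunday.

```r
library(lubridate)

moon_walk <- mdy("July 21, 1969")

# By default wday() counts from Sunday = 1, so a Monday is 2.
wday(moon_walk)

# With label = TRUE the result is an ordered factor with seven levels.
day_label <- wday(moon_walk, label = TRUE, abbr = FALSE)
day_label
levels(day_label)
```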

In the section on Paulist missions below, we will demonstrate how extracting this kind of information can lead to useful analysis.

Calculations with dates

When we use date objects instead of character strings to represent dates, it becomes possible to perform calculations on the dates. We might ask, for example, how long a period there was between John F. Kennedy’s “We choose to go to the moon” speech and the Apollo 11 moon walk. After we create the two date objects, we can simply subtract the date of the Kennedy speech from the date of the moon walk:

kennedy_speech <- mdy("September 12, 1962")

moon_walk - kennedy_speech
## Time difference of 2504 days

The - operator here is doing the work of base R’s difftime() function. We can call that function directly, which allows us to specify the units:

difftime(moon_walk, kennedy_speech, units = "weeks")
## Time difference of 357.7143 weeks
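
A time difference prints with its units attached, which is convenient to read but awkward for further arithmetic. As a small sketch in base R (with as.Date() standing in for the lubridate parsers), the difference can be turned into a plain number. The conversion to years assumes an average year of 365.25 days and is only approximate.

```r
# Base R only; as.Date() stands in for the lubridate parsers used above.
kennedy_speech <- as.Date("1962-09-12")
moon_walk      <- as.Date("1969-07-21")

gap <- difftime(moon_walk, kennedy_speech, units = "days")

# Strip the units to get a plain number for further arithmetic.
gap_days <- as.numeric(gap)

# A rough conversion to years, assuming an average year of 365.25 days.
gap_years <- gap_days / 365.25
round(gap_years, 2)
```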

There are other kinds of arithmetic that can be performed with dates. For example, one could add a week, a month, or a year to the date of the Kennedy speech.

kennedy_speech
## [1] "1962-09-12 UTC"
kennedy_speech + weeks(1)
## [1] "1962-09-19 UTC"
kennedy_speech + months(1)
## [1] "1962-10-12 UTC"
kennedy_speech + years(1)
## [1] "1963-09-12 UTC"
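
Adding months needs some care near the end of a month, because the resulting day may not exist. The sketch below uses an illustrative date, January 31, 1962, and shows both the default behavior and lubridate's %m+% operator, which rolls back to the last real day of the month.

```r
library(lubridate)

jan_31 <- as.Date("1962-01-31")

# There is no February 31, so adding a one-month period yields NA.
jan_31 + months(1)

# The %m+% operator rolls back to the last real day of the month instead.
jan_31 %m+% months(1)
```

Which behavior is appropriate depends on the question being asked; an NA at least makes the ambiguity visible.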

Intervals

Lubridate also provides the ability to define intervals of time using the interval() function. For example, we could define a set of intervals for each of the presidential administrations:

kennedy_admin <- interval(mdy("January 20, 1961"),  mdy("November 22, 1963"))
johnson_admin <- interval(mdy("November 22, 1963"), mdy("January 20, 1969"))
nixon_admin   <- interval(mdy("January 20, 1969"),  mdy("August 9, 1974"))
administrations <- c(kennedy_admin, johnson_admin, nixon_admin)

There are a number of helpful operations that can be performed on intervals. The most obvious is to check whether a date occurred within an interval. We can figure out whether Kennedy’s speech occurred during his administration.

kennedy_speech %within% kennedy_admin
## [1] TRUE

This becomes more powerful when we check the date of the first moon walk against the vector of presidential administrations.

moon_walk %within% administrations
## [1] FALSE FALSE  TRUE

The moon walk was only within the third administration, which was Nixon’s. Lubridate will also allow you to perform set operations such as finding the intersection between two intervals and their union with intersect() and union() respectively, which lubridate extends to work on intervals. A particularly useful function is int_overlaps() which returns a TRUE value if two intervals overlap at all. For example, we can test which administrations kept U.S. troops in Vietnam.

troops_in_vietnam <- interval(mdy("9/1/1950"), mdy("4/30/1975"))

int_overlaps(troops_in_vietnam, administrations)
## [1] TRUE TRUE TRUE

There were American troops in Vietnam during all of the Kennedy, Johnson, and Nixon administrations.
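
Where int_overlaps() only answers yes or no, intersect() returns the shared stretch of time itself. The following sketch redefines two of the intervals from above so that it runs on its own, then inspects the start, end, and length of their intersection.

```r
library(lubridate)

troops_in_vietnam <- interval(mdy("9/1/1950"), mdy("4/30/1975"))
nixon_admin       <- interval(mdy("January 20, 1969"), mdy("August 9, 1974"))

# The intersection is the stretch of time that both intervals share.
overlap <- intersect(troops_in_vietnam, nixon_admin)

int_start(overlap)
int_end(overlap)

# Length of the overlap in days (int_length() returns seconds).
int_length(overlap) / (60 * 60 * 24)
```

Because the Nixon administration falls entirely inside the longer interval, the intersection here is simply the administration itself.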

Working with dates

Beginning in 1851, Redemptorist priests who became the Paulist Fathers held missions throughout the United States. They recorded their missions in manuscript volumes, keeping track of the start and end dates of the missions, where they were held, and how many Catholics came to confession and how many Protestants converted. A transcription of these records is available in the historydata package. Because of the chronological information contained in the records, they make a good source to explore how to work with dates.

We will begin by loading the data, along with the packages we will use to manipulate and visualize the data.

library(lubridate)
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
library(ggplot2)
library(historydata)
data(paulist_missions)
paulist_missions
## Source: local data frame [841 x 11]
## 
##    mission_number                             church          city state
## 1               1                St. Joseph's Church      New York    NY
## 2               2               St. Michael's Church       Loretto    PA
## 3               3                  St. Mary's Church Hollidaysburg    PA
## 4               4      Church of St. John Evangelist     Johnstown    PA
## 5               5                 St. Peter's Church      New York    NY
## 6               6            St. Patrick's Cathedral      New York    NY
## 7               7               St. Patrick's Church          Erie    PA
## 8               8           St. Philip Benizi Church     Cussewago    PA
## 9               9 St. Vincent's Church (Benedictine)    Youngstown    PA
## 10             10                 St. Peter's Church      Saratoga    NY
## ..            ...                                ...           ...   ...
## Variables not shown: start_date (chr), end_date (chr), confessions (int),
##   converts (int), order (chr), lat (dbl), long (dbl)

Now we can use lubridate to parse the date fields, start_date and end_date, into date objects.

paulist_missions <- paulist_missions %>%
  mutate(start = mdy(start_date),
         end   = mdy(end_date))

We can then extract some information from the date objects. Let’s find out the year, the month, and the day of the week. We can also create a field for the duration of the mission.

paulist_missions <- paulist_missions %>%
  mutate(year_start  = year(start),
         year_end    = year(end),
         month_start = month(start, label = TRUE),
         month_end   = month(end, label = TRUE),
         day_start   = weekdays(start),
         day_end     = weekdays(end),
         duration    = as.numeric(difftime(end, start, units = "days")))
head(paulist_missions)
## Source: local data frame [6 x 20]
## 
##   mission_number                        church          city state
## 1              1           St. Joseph's Church      New York    NY
## 2              2          St. Michael's Church       Loretto    PA
## 3              3             St. Mary's Church Hollidaysburg    PA
## 4              4 Church of St. John Evangelist     Johnstown    PA
## 5              5            St. Peter's Church      New York    NY
## 6              6       St. Patrick's Cathedral      New York    NY
## Variables not shown: start_date (chr), end_date (chr), confessions (int),
##   converts (int), order (chr), lat (dbl), long (dbl), start (time), end
##   (time), year_start (dbl), year_end (dbl), month_start (fctr), month_end
##   (fctr), day_start (chr), day_end (chr), duration (dbl)

In their narrative sources, the Paulists record that the typical mission began on a Sunday evening and lasted for two weeks until the next Sunday. That was their ideal mission, but how did the actual missions proceed? And did the typical mission vary over time?

We can calculate a simple mean and median of the duration of the missions.

mean(paulist_missions$duration)
## [1] 9.560048
median(paulist_missions$duration)
## [1] 8

Surprisingly, the average mission (9.56 days) or median mission (8 days) was quite a bit shorter than what the Paulists described as typical or ideal. We can use dplyr and ggplot to find out a breakdown of the kinds of missions.

missions_duration <- paulist_missions %>%
  group_by(duration) %>%
  summarize(n = n())
missions_duration
## Source: local data frame [25 x 2]
## 
##    duration   n
## 1         0  24
## 2         2   4
## 3         3  18
## 4         4  27
## 5         5  20
## 6         6  13
## 7         7 266
## 8         8  73
## 9         9  54
## 10       10  60
## ..      ... ...
ggplot(missions_duration,
       aes(x = duration, y = n)) +
  geom_bar(stat = "identity") +
  xlim(0, 30) + 
  ggtitle("Durations of Paulist missions")
## Warning in loop_apply(n, do.ply): Removed 2 rows containing missing values
## (position_stack).

The missions show a definite multi-modal pattern: missions were usually either one week long or two weeks long, though they could range anywhere from one day to twenty-one.6
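
A pattern like this also helps explain why a mean can sit above a median: a few unusually long values pull the mean upward while leaving the median nearly unchanged. The sketch below uses invented durations, not the Paulist data, to show the effect of a single long outlier.

```r
# Invented durations in days, for illustration only (not the Paulist data).
durations <- c(7, 7, 7, 8, 14, 14, 60)

# With the outlier included, the mean is far above the median.
mean(durations)
median(durations)

# Dropping the single long value moves the mean much more than the median.
typical <- durations[durations < 60]
mean(typical)
median(typical)
```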

We can complicate the analysis by calculating the average duration of the missions per year, to see whether the Paulists lengthened or shortened their missions. First we will plot the duration of the mission against the start date:

ggplot(paulist_missions, aes(start, duration)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Duration of Paulist missions by date")

This shows no significant variation in the pattern, though we note that after 1880 missions could be considerably longer. We can also calculate the mean length of the mission for each year.

missions_length_by_year <- paulist_missions %>%
  group_by(year_start) %>%
  summarize(duration = mean(duration))
missions_length_by_year
## Source: local data frame [38 x 2]
## 
##    year_start  duration
## 1        1851 11.333333
## 2        1852 13.750000
## 3        1853 10.818182
## 4        1854  9.916667
## 5        1855 10.400000
## 6        1856  9.642857
## 7        1857 10.000000
## 8        1858  7.812500
## 9        1859  9.200000
## 10       1860  9.384615
## ..        ...       ...
ggplot(missions_length_by_year, 
       aes(x = year_start, y = duration)) +
  geom_line() + geom_point() +
  ylim(0, 14) +
  ggtitle("Average mission length by year")

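By this chart we can see that there are noticeable fluctuations from year to year, but that they do not appear to have any meaning behind them. Notice that here the x-axis is the year extracted from each start date, so all the missions in a given year are averaged into a single point. In the earlier scatterplot the x-axis was the date object itself, which means that the points were organized not just by year but also by month and day.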

Since it appears that the Paulists preferred missions that were one or two weeks long, we can ask which days of the week were most popular to start and end the missions. Here we will simply combine the data manipulation and plotting.

paulist_missions %>%
  select(day_start, day_end) %>% 
  gather(type, day, day_start, day_end) %>% 
  mutate(day = factor(day,
                      levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                 "Thursday", "Friday", "Saturday"))) %>%
  ggplot(aes(x = day)) + 
  geom_bar() + 
  facet_grid(type ~ .) + 
  ggtitle("Start and end days of Paulist missions")

It appears that as a rule Paulist missions started on a Sunday with only very rare exceptions. In general they ended on a Sunday as well, but it was also common for them to end on some other day of the week.

Finally, the Paulists tell us that they did not go on missions during the summer. We can plot the cyclical pattern of missions each year.

ggplot(paulist_missions, aes(x = month_start)) + 
  geom_bar() + 
  ggtitle("Months in which Paulists started missions")

The Paulists never started missions in July (notice it is absent from the chart) and only rarely started missions in June or August. Fall, winter, and spring were the Paulist campaigning seasons.
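
An empty month is easy to miss when it simply drops out of a chart. A table of counts makes it explicit, because month() with label = TRUE returns a factor that has all twelve months as levels. The start dates below are invented for illustration and are not drawn from the Paulist records.

```r
library(lubridate)

# Hypothetical start dates, with none in July.
starts <- as.Date(c("1858-01-10", "1858-01-24", "1858-06-06",
                    "1858-09-05", "1858-11-14"))

# All twelve months are levels of the factor, so table() reports a zero
# for any month that never occurs.
month_counts <- table(month(starts, label = TRUE))
month_counts
```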

Using lubridate, we can create tables and visualizations that take into account the diachronic and cyclical patterns in temporal data.

Times

Both R and lubridate offer functions for dealing with times too. Besides date objects, it is possible to create date and time objects. I have seldom found reason to analyze times as a historian. In general, you should only represent times if you absolutely need them. The simpler form of just date objects is generally sufficient. Read the documentation for lubridate if you need to represent times. The pattern is essentially the same as for dates.

Further Reading


  1. Or 1800 and 1899: whichever definition of the start of a century you prefer.

  2. If we wanted to represent a date before January 1, 1970, then R would store that date as a negative number.

  3. I will assume that you know how to manipulate data as explained in that chapter.

  4. A much fuller introduction to lubridate can be found in Garrett Grolemund and Hadley Wickham, “Dates and Times Made Easy with lubridate,” Journal of Statistical Software 40, no. 3 (2011): http://www.jstatsoft.org/v40/i03/ and the package’s documentation and vignette.

  5. The standard also allows you to represent dates as numbers without the hyphens, for example, 19690721.

  6. One mission is recorded as taking 60 days, but looking at the manuscript shows that it was a series of small three- or four-day-long retreats for which no specific dates were recorded.