In my last post I explained that historians of U.S. religion have barely begun to scratch the surface of the data (meaning, sources that are amenable to computation) that are available to them. To demonstrate this I gave the example of a single source, the Minutes of the Annual Conferences of the Methodist Episcopal Church.

In this post I want to attempt a very preliminary taxonomy of the kinds of sources that are available to religious historians who wish to use mapping or quantitative analysis of some kind or another. Let’s call this a taxonomy instead of a catalog, because I’m going to list the kinds of sources that I’ve come across rather than try to give a bibliography of all of the sources themselves. I’d love to be able to list all the sources, but I haven’t done all that work yet. And let’s say this is very preliminary, because I hope this post is an example of the so-called Cunningham’s Law: “the best way to get the right answer on the Internet is not to ask a question; it’s to post the wrong answer.” That is to say, if you know of a source or category of source that I don’t know about, I hope you’ll correct me in the comments. Finally, I should mention that I’m teaching a course this semester on “Data and Visualization in Digital History” where we are working on nineteenth-century U.S. religious statistics. I’m indebted to the excellent students in that course, who have already turned up many sources that I didn’t know about.

Enough throat clearing.

All U.S. religious statistics are divided into two parts, those from the Census, and those not from the Census.


While everyone else is live tweeting, I’m live blogging the AHA’s digital projects lightning round. While of course the projects are widely varied in terms of content, they all have something in common. With very few exceptions, all of the digital projects are expressed in terms of a historical argument or interpretation. This is rather different from many DH presentations, which tend to focus on methods or tools or technologies. Why the difference? It can probably be attributed to the way that the context of the AHA meeting pushes everyone to frame their project in disciplinary terms. That is a very good thing. And maybe everyone here was in the room for Cameron Blevins’s talk at the AHA last year, and took his admonishment to heart.

AHA16 Digital Projects Lightning Round Lineup →

Here is a summary of what I said in the DH pedagogy lightning sessions at the AHA.

Simple idea 1: Installing software takes a lot of time, and installing software can often be harder and require more technological skill than actually using the software.

Simple idea 2: You must scaffold your digital history courses, so that one assignment leads into the next, and so that students build the methodological and technical skills that they need as they go through the course.

The problem is that students need to install the software before they can use it. The most technologically difficult, and the least pedagogically or historically interesting task, happens at the beginning of the course. This presents a tremendous barrier to student involvement. It wastes course time early in the semester, when building momentum is crucial.

Not so simple solution: My solution to this problem is to try to take the burden of installing software on myself so as to not waste students’ time. For my “Data and DH” course next semester, as well as for previous courses, I have relied on an RRCHNM installation of RStudio Server. This lets students access a full development environment through their browser: no installing or configuring software.

You might object: there is no way my institution will give me a server of my own, and installing RStudio Server might be too difficult for me. For RStudio, at least, the analogsea package can help you get a server up and running at Digital Ocean. Assuming you already have a Digital Ocean account, it can be as simple as these few lines of code (though you will also have to add users and configure the memory).


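library(analogsea)

# Create a droplet with Docker on it, then start RStudio Server in a container.
docklet_create() %>%
  docklet_rstudio()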

The point is not that you should use RStudio Server (though it’s great); there are other options, like Anaconda for Python. The point is to find a way to reduce or eliminate the waste of student time and attention that comes from installing software. Find a way to scaffold your courses so that you can get straight into the digital history.

Cameron Blevins and I recently published an article in Digital Humanities Quarterly titled “Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction.” The article has two related goals. First we explain the historical method behind the gender package for R, showing how it takes into account changes in the associations between names and genders. This method can be used by historians and other scholars to guess genders from first names as reliably as possible. Then, to show how the method can actually be used to make an argument, we apply the method to show that, while the number of history dissertations written by men and women is nearly equal, there continues to be a gap between the number of books on history reviewed in the American Historical Review written by men and women.

Here is the abstract:

This article describes a new method for inferring the gender of personal names using large historical datasets. In contrast to existing methods of gender prediction that treat names as if they are timelessly associated with one gender, this method uses a historical approach that takes into account how naming practices change over time. It uses historical data to measure the likelihood that a name was associated with a particular gender based on the time or place under study. This approach generates more accurate results for sources that encompass changing periods of time, providing digital humanities scholars with a tool to estimate the gender of names across large textual collections. The article first describes the methodology as implemented in the gender package for the R programming language. It goes on to apply the method to a case study in which we examine gender and gatekeeping in the American historical profession over the past half-century. The gender package illustrates the importance of incorporating historical approaches into computer science and related fields.
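To make the core idea concrete, here is a minimal sketch in base R. It is not the gender package's implementation: the `toy_names` table and the `predict_gender()` helper are hypothetical, and the counts are invented purely for illustration. The sketch only shows why restricting the counts to the years under study can change the answer for a name like Leslie.

```r
# Toy counts, invented purely for illustration (NOT real birth-record data).
# Each row: how many girls and boys received a name in a given year.
toy_names <- data.frame(
  name   = c("leslie", "leslie", "leslie", "leslie", "jane", "jane"),
  year   = c(1900, 1910, 1990, 2000, 1900, 2000),
  female = c(10, 12, 90, 85, 100, 95),
  male   = c(90, 88, 10, 15, 0, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: estimate the gender associated with a name by pooling
# counts only within the requested range of years, not across all years.
predict_gender <- function(name, years, data = toy_names) {
  in_range <- data$name == tolower(name) &
    data$year >= years[1] & data$year <= years[2]
  rows <- data[in_range, ]
  if (nrow(rows) == 0) {
    # No evidence for this name in this period
    return(data.frame(name = name, proportion_female = NA_real_,
                      gender = NA_character_, stringsAsFactors = FALSE))
  }
  prop_female <- sum(rows$female) / sum(rows$female + rows$male)
  data.frame(
    name = name,
    proportion_female = prop_female,
    gender = if (prop_female > 0.5) "female" else "male",
    stringsAsFactors = FALSE
  )
}

# The same name, two different periods
early <- predict_gender("Leslie", c(1900, 1910))
late  <- predict_gender("Leslie", c(1990, 2000))
print(rbind(early, late))
```

With these invented counts the prediction for Leslie flips between the two periods, which is exactly the kind of change that a method treating names as timeless would miss.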

Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction →

A new release of the USAboundaries package (v0.2.0) for R is available on CRAN. This package continues to provide historical boundaries of U.S. counties and states from 1629 to 2000, thanks to the Newberry Library’s Atlas of Historical County Boundaries. In this release I have added current county, state, and congressional district boundaries from the U.S. Census Bureau. Both the historical and contemporary boundaries data gain higher resolution versions suitable for mapping at the level of the state rather than the nation. This higher resolution data is optional, and will be installed the first time that a user requests it. Finally, the entire package interface has been improved, adding geography-specific functions (e.g., us_states(), us_counties()) instead of forcing everything through a single function, and removing a bunch of needless package dependencies.
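As a rough sketch of the new interface, the snippet below wraps `us_states()` in a small helper. The `states_on()` wrapper and the example date are my own assumptions for illustration; the sketch also assumes that `us_states()` accepts a `map_date` argument and that the package is installed, and it returns `NULL` if it is not.

```r
# Hypothetical wrapper around USAboundaries::us_states().
# Assumes the USAboundaries package is installed; returns NULL otherwise.
states_on <- function(map_date = NULL) {
  if (!requireNamespace("USAboundaries", quietly = TRUE)) {
    message("USAboundaries is not installed; returning NULL")
    return(NULL)
  }
  # With map_date = NULL this requests the contemporary boundaries;
  # with a date it requests the historical boundaries as of that day.
  USAboundaries::us_states(map_date = map_date)
}

states_now  <- states_on()              # current state boundaries
states_1840 <- states_on("1840-03-12")  # boundaries on an arbitrary example date
```

The same pattern should apply to `us_counties()`, which is the point of the geography-specific functions: each kind of boundary gets its own function rather than an argument to a single catch-all.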

One of my side projects (eventually to turn into a main project) is figuring out what can be done with historical data about religious groups in the United States. This ground is in some ways well trodden. The field has a very fine atlas in the form of Gaustad, Barlow, and Dishno’s New Historical Atlas of Religion in America, as well as an experimental Digital Atlas of American Religion for the twentieth century. Then too, the field has more or less decided that this ground is not worth treading anyway. There are a number of sophisticated critiques of the whole enterprise of dealing with religious statistics and mapping. If I can sum these up in a broad statement, the point is that numbers don’t tell us anything that the field actually wants to know. As Laurie Maffly-Kipp puts it in a well-argued review essay, “our dazzling new technologies and spatial theories” might only have “brought us back to much more circumscribed definitions of religious experience.” I recognize the weight of these arguments, and a full justification for dealing with religious statistics will eventually have to take them into account.

But not yet. I want to argue that historians of American religion have barely begun to take advantage of the quantitative data available to them. While we have to keep the theoretical arguments I alluded to in mind at all times, the pressing issue at the moment is one of basic research. Until we make a fuller attempt at using these quantitative records, we can’t really know whether we will find anything useful from them.

Here is the argument. Mapping and quantitative analysis of historical statistics about U.S. religion have been sorely limited by the kinds of data that have typically been used, namely county-level aggregates of Federal census data, and by the way that mapping has focused on general comparisons rather than the specifics of the data.


A number of problems in digital history/humanities require one to calculate the similarity of documents or to identify how one text borrows from another. To give one example, the Viral Texts project, by Ryan Cordell, David Smith, et al., has been very successful at identifying reprinted articles in American newspapers. Kellen Funk and I have been working on a text reuse problem in nineteenth-century legal history, where we seek to track how codes of civil procedure were borrowed and modified in jurisdictions across the United States.

As part of that project, I have recently released the textreuse package for R to CRAN. (Thanks to Noam Ross for giving this package a very thorough open peer review for rOpenSci, to whom I’ve contributed the package.) This package is a general purpose implementation of several algorithms for detecting text reuse, as well as classes and functions for investigating a corpus of texts. Put most simply, full text goes in and measures of similarity come out. Put more formally, here is the package description:

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality-sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
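To give a sense of what "full text goes in and measures of similarity come out" means, here is a minimal base R sketch of two of the ideas named in that description: shingled n-grams and Jaccard similarity. It does not use the textreuse package or reproduce its implementation. The helper names `shingle_ngrams()` and `jaccard()` are hypothetical, and the three sample sentences are invented for illustration.

```r
# Split a text into lowercase word tokens, then into overlapping n-grams
# ("shingles") of n consecutive words.
shingle_ngrams <- function(text, n = 3) {
  words <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Jaccard similarity: the number of shared shingles divided by the number
# of distinct shingles across both documents.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  all_shingles <- union(a, b)
  if (length(all_shingles) == 0) return(NA_real_)
  length(intersect(a, b)) / length(all_shingles)
}

# Invented example sentences: doc_b changes one word of doc_a,
# while doc_c is unrelated.
doc_a <- "Every action shall be prosecuted in the name of the real party in interest."
doc_b <- "Every action must be prosecuted in the name of the real party in interest."
doc_c <- "The court may grant a new trial on the motion of either party."

sim_ab <- jaccard(shingle_ngrams(doc_a), shingle_ngrams(doc_b))
sim_ac <- jaccard(shingle_ngrams(doc_a), shingle_ngrams(doc_c))
print(c(a_vs_b = sim_ab, a_vs_c = sim_ac))
```

The near-duplicate pair scores much higher than the unrelated pair. Comparing every pair of documents this way becomes expensive for a large corpus, which is where techniques like minhash and locality-sensitive hashing come in.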


Earlier today Paul Putz wrote a post about an interactive bibliography that he and I created of books that study American religion in the context of cities. Paul explained our motivation for the project and how we created it. I’d like to offer a few observations about what I think we can learn from the map.

Figure 1: Screenshot of the Bibliography of Urban Religious History.

First, and utterly unsurprisingly, the map basically aligns with the urban population of the United States. So New York, Chicago and Boston, followed closely by New Orleans, Washington, Detroit, San Francisco, and Los Angeles, are the most written about cities.
