Makefiles for Writing, Data Analysis, OCR, and Converting Shapefiles

I love Makefiles more than I ought to.1 If you haven’t come across GNU Make before, then Mike Bostock’s “Why Use Make” is a good introduction, as is the section on Make at Kieran Healy’s “Plain Person’s Guide to Plain Text Social Science.” I like Make for several reasons. It lets you specify how your final products (like a website or a PDF document) are related to inputs, and that discipline is invaluable for producing reproducible research and for structuring your project sensibly. For lots of tasks it provides free parallelization and rebuilds only what is absolutely necessary. Since my projects fit into several different genres, once I have created a Makefile for the genre, it is trivial to adapt it to different projects. Whether it is an article or a book manuscript, a data analysis project, a website, my CV, or some kind of file conversion process, all that I need to remember how to do is type make to build the project, make deploy to put it on the web, and make clean to start over.

Continue reading “Makefiles for Writing, Data Analysis, OCR, and Converting Shapefiles”

A Course in Computational Methods and Nineteenth-Century Religious Data

This semester I am teaching a graduate course on Data and Visualization in Digital History. The aim of this course is to teach students how to do the kind of data analysis and visualization that they are likely to do for a dissertation chapter or a journal article. In my way of working, that means the first part of the semester is an introduction to scripting in R, focusing on the grammar of graphics with ggplot2 and the grammar of data manipulation with dplyr and tidyr. Then the second part of the course is aimed at introducing specific kinds of analysis in the context of historical work. My aim is that this course will be the first in a two course sequence, where the second course (colloquially known as Clio 3) will have more programming in R (as opposed to scripting), will have more *nix-craft, will tackle a more advanced historical problem, will possibly cover more machine learning, and will end up creating interactive analyses in Shiny.

There are a few things about the Data and Visualization course that I think are worth mentioning.

First, I’ve been creating worksheets for historical data analysis each week. These worksheets tend to demonstrate some technique, then ask students to build up an analysis step by step. The questions within each worksheet range in difficulty from the rote and mechanical to the very difficult. While for now these worksheets are aimed at this class in particular, I intend over time to write worksheets like these for any topic in R I end up teaching. I’m rather pleased with these worksheets as a method of teaching data analysis by example.1 If I’m judging my students’ initial reactions correctly, they are also finding them helpful, if rather difficult at times.

Continue reading “A Course in Computational Methods and Nineteenth-Century Religious Data”

Working Paper on the Migration of Codes of Civil Procedure

Kellen Funk and I are working on detecting how a New York legal code of civil procedure spread to most other jurisdictions in the United States. That Field Code and the other codes derived from it are the basis of modern American legal practice, so tracking the network and content of the borrowings reveals the structure of a significant part of American legal history.

Figure 1: States which adopted a version of the Field Code. [JPEG]

In response to an invitation from the Digital Humanities Working Group at George Mason, we wrote a working paper that describes the current state of our research. In the paper we explain the historical problem to show why it is worth tracking how the Field Code spread. Then we give an overview of how we went about detecting which civil procedure codes were similar to one another, after which we give a few sample visualizations to show how we went about learning from those similarities. And finally we wrap up with a summary of what we think our project tells us about the history of nineteenth-century American law. We are working on an article, which will be structured rather differently with a fuller statement of our argument and many more visualizations, but in the meantime the working paper gives a fairly succinct overview of the project and its argument. It may also be of interest for going into more detail as to how a historical data analysis project proceeds from problem to interpretation than we may be able to do in the article. We also have a notebook with more details about the project.

A Very Preliminary Taxonomy of Sources of Nineteenth-Century U.S. Religious Data

In my last post I explained that historians of U.S. religion have barely begun to scratch the surface of the data (meaning, sources that are amenable to computation) that are available to them. To demonstrate this I gave the example of a single source, the Minutes of the Annual Conferences of the Methodist Episcopal Church.

In this post I want to attempt a very preliminary taxonomy of the kinds of sources that are available to religious historians who wish to use mapping or quantitative analysis of some kind or another. Let’s call this a taxonomy instead of a catalog, because I’m going to list the kinds of sources that I’ve come about rather than try to give a bibliography of all of the sources themselves. I’d love to be able to list all the sources, but I haven’t done all that work yet. And let’s say this is very preliminary, because I hope this post is an example of the so-called Cunningham’s Law: “the best way to get the right answer on the Internet is not to ask a question; it’s to post the wrong answer.” That is to say, if you know of a source or category of source that I don’t know about, I hope you’ll correct me in the comments. Finally, I should mention that I’m teaching a course this semester on “Data and Visualization in Digital History” where we are working on nineteenth-century U.S. religious statistics. I’m indebted to the excellent students in that course, who have already turned up many sources that I didn’t know about.

Enough throat clearing.

All U.S. religious statistics are divided into two parts, those from the Census, and those not from the Census.

Continue reading “A Very Preliminary Taxonomy of Sources of Nineteenth-Century U.S. Religious Data”

How Digital Projects Are Framed at the AHA Lightening Round

While everyone else is live tweeting, I’m live blogging the AHA’s digital projects lightening round. While of course the projects are widely varied in terms of content, they all have something in common. With very few exceptions, all of the digital projects are expressed in terms of a historical argument or interpretation. This is rather different than many DH presentations, which tend to focus on methods or tools or technologies. Why the difference? It can probably be attributed to the way that the context of the AHA meeting pushes everyone to frame their project in disciplinary terms. That is a very good thing. And maybe everyone here was in the room for Cameron Blevin’s talk at the AHA last year, and took his admonishment to heart.

Don’t Waste Students’ Time Installing Software in DH Courses

Here is a summary of what I said in the DH pedagogy lightening sessions at the AHA.

Simple idea 1: Installing software takes a lot of time, and installing software can often be harder and require more technological skill than actually using the software.

Simple idea 2: You must scaffold your digital history courses, so that one assignment leads into the next, and so that students build the methodological and technical skills that they need as they go through the course.

The problem is that students need to install the software before they can use it. The most technologically difficult, and the least pedagogically or historically interesting task, happens at the beginning of the course. This presents a tremendous barrier to student involvement. It wastes course time early in the semester, when building momentum is crucial.

Not so simple solution: My solution to this problem is to try to take the burden of installing software on myself so as to not waste students’ time. For my “Data and DH” course next semester, as well as for previous courses, I have relied on an RRCHNM installation of RStudio Server. This lets students access a full development environment through their browser: no installing or configuring software.

You might object: there is no way my institution will give me a server of my own, and installing RStudio Server might be too difficult for me. For RStudio, at least, the analogsea package can help you get a server up and running at Digital Ocean. Assuming you already have a Digital Ocean account, it can be as simple as these few lines of code (though you will also have to add users and configure the memory).

docklet_create() %>%

The point is not that you should use RStudio Server (though it’s great), and there are other options like Anaconda for Python. The point is to find a way to reduce or eliminate the waste of student time and attention that comes from installing software. Find a way to scaffold your courses so that you can get straight into the digital history.

Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction

Cameron Blevins and I recently published an article in Digital Humanities Quarterly titled “Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction.” The article has two related goals. First we explain the historical method behind the gender package for R, showing how it takes into account changes in the associations between names and genders. This method can be used by historians and other scholars to guess genders from first names as reliably as possible. Then, to show how the method can actually be used to make an argument, we apply the method to show that, while the number of history dissertations written by men and women is nearly equal, there continues to be a gap between the number of books on history reviewed in the American Historical Review written by men and women.

Here is the abstract:

This article describes a new method for inferring the gender of personal names using large historical datasets. In contrast to existing methods of gender prediction that treat names as if they are timelessly associated with one gender, this method uses a historical approach that takes into account how naming practices change over time. It uses historical data to measure the likelihood that a name was associated with a particular gender based on the time or place under study. This approach generates more accurate results for sources that encompass changing periods of time, providing digital humanities scholars with a tool to estimate the gender of names across large textual collections. The article first describes the methodology as implemented in the gender package for the R programming language. It goes on to apply the method to a case study in which we examine gender and gatekeeping in the American historical profession over the past half-century. The gender package illustrates the importance of incorporating historical approaches into computer science and related fields.

An introduction to the textreuse package, with suggested applications

A number of problems in digital history/humanities require one to calculate the similarity of documents or to identify how one text borrows from another. To give one example, the Viral Texts project, by Ryan Cordell, David Smith, et al., has been very successful at identifying reprinted articles in American newspapers.1 Kellen Funk and I have been working on a text reuse problem in nineteenth-century legal history, where we seek to track how codes of civil procedure were borrowed and modified in jurisdictions across the United States.

As part of that project, I have recently released the textreuse package for R to CRAN. (Thanks to Noam Ross for giving this package a very thorough open peer review for rOpenSci, to whom I’ve contributed the package.) This package is a general purpose implementation of several algorithms for detecting text reuse, as well as classes and functions for investigating a corpus of texts. Put most simply, full text goes in and measures of similarity come out.2 Put more formally, here is the package description:

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality- sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.

Continue reading “An introduction to the textreuse package, with suggested applications”

Some Thoughts on Digital Dissertations at Mason

The Department of History and Art History at George Mason University has recently approved guidelines for digital dissertations. While PhD students at several universities have already produced digital dissertations in the humanities, to my knowledge these are the first guidelines for born-digital dissertations created at the departmental level. The guidelines take a broad view of what producing a digital dissertation might entail. The primary author of the guidelines was my colleague Sharon Leon, who has written a post about how she put the guidelines together. These guidelines open a lot of room for graduate students to determine the form of their dissertations, while also providing some concrete guidance about the essential elements. I hope the guidelines clear the way for graduate students in our department to create the kinds of dissertations that they want. Graduate students should have room to be intellectual pioneers without having to always be institutional pioneers as well.

There are three things that I want to say about these guidelines.

Continue reading “Some Thoughts on Digital Dissertations at Mason”