Syllabus for “Text Analysis for Historians”

This semester I am teaching an independent study for graduate students on “Text Analysis for Historians.” You can see the syllabus here. It’s an unashamedly disciplinary course. While the readings of course draw heavily on work being done in digital humanities and digital literary studies, the organizing principle is whether a method is likely to be useful for historical questions. And the syllabus is organized around a corpus of Anglo-American legal treatises, with readings to frame our work in the context of U.S. legal history.

As is acknowledged on the syllabus, this class draws on syllabi by Ted Underwood, Andrew Goldstone, and Ben Schmidt, and Kellen Funk offered suggestions for the readings.


New package tokenizers joins rOpenSci

This post originally appeared at the rOpenSci blog.

The R package ecosystem for natural language processing has been flourishing recently. R packages for text analysis have usually been based on the classes provided by the NLP or tm packages, and many of them depend on Java. But recently there have been a number of new packages for text analysis in R, most notably text2vec, quanteda, and tidytext. These packages are built on top of Rcpp instead of rJava, which makes them much more reliable and portable. And instead of the classes based on NLP, which I have never thought particularly idiomatic for R, they use standard R data structures. The text2vec and quanteda packages both rely on the sparse matrices provided by the rock-solid Matrix package. The tidytext package is idiosyncratic (in the best possible way!) in doing all of its work in data frames rather than matrices, but a data frame is about as standard as you can get. For a long time when I would recommend R to people, I had to add the caveat that they should use Python if they were primarily interested in text analysis. But now I no longer feel the need to hedge.
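As a minimal sketch of what working in data frames looks like (the two-document “corpus” here is invented for illustration), tidytext’s `unnest_tokens()` returns an ordinary data frame with one token per row, no special corpus class required:

```r
library(dplyr)
library(tidytext)

# A tiny invented corpus: one row per document
texts <- data.frame(
  doc  = c("a", "b"),
  text = c("The quick brown fox", "jumped over the dog"),
  stringsAsFactors = FALSE
)

# unnest_tokens() replaces the text column with one row per token,
# lowercased by default; the result is still a plain data frame
tokens <- texts %>% unnest_tokens(word, text)
```

From there, counting word frequencies is just dplyr’s `count(tokens, word)`.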

Still, there is a lot of duplicated effort among these packages on the one hand, and a number of incompatibilities between them on the other. The R ecosystem for text analysis is not exactly coherent or consistent at the moment.

My small contribution to the new text analysis ecosystem is the tokenizers package, which was recently accepted into rOpenSci after a careful peer review by Kevin Ushey. A new version of the package is on CRAN. (Also check out Jeroen Ooms’s hunspell package, which is also a part of rOpenSci.)
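A minimal example of what the package provides: each tokenizer takes a character vector and returns a list with one element per input document, using base R data structures throughout.

```r
library(tokenizers)

# Word tokenization: lowercases and strips punctuation by default
words <- tokenize_words("The quick brown fox jumped.")

# Sentence tokenization: preserves case and punctuation by default
sents <- tokenize_sentences("First sentence. Second sentence.")
```

Because the output is just a list of character vectors, it slots directly into the data structures that text2vec, quanteda, or tidytext expect.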


Introducing America’s Public Bible (Beta)

It’s the start of August, and I don’t want to presume on the good graces of this blog’s readers. So in the spirit of late summer, I’m finally getting around to briefly describing one of my summer projects in the hope that you find it fun, leaving a fuller accounting of the why and wherefore of the project for another time.

America’s Public Bible is a website that looks for all of the biblical quotations in Chronicling America, a collection of digitized newspapers created by the Library of Congress as part of the NEH’s National Digital Newspaper Program. ChronAm currently has some eleven million newspaper pages, spanning the years 1836 to 1922. Using the text that ChronAm provides, I have looked for which Bible verses (just from the KJV for now) are quoted or alluded to on each page. If you want an explanation of why I think this is an interesting scholarly question, there is an introductory essay at the site.


Makefiles for Writing, Data Analysis, OCR, and Converting Shapefiles

I love Makefiles more than I ought to. If you haven’t come across GNU Make before, then Mike Bostock’s “Why Use Make” is a good introduction, as is the section on Make in Kieran Healy’s “Plain Person’s Guide to Plain Text Social Science.” I like Make for several reasons. It lets you specify how your final products (like a website or a PDF document) are related to their inputs, and that discipline is invaluable for producing reproducible research and for structuring your project sensibly. For many tasks it provides free parallelization, and it rebuilds only what is absolutely necessary. Since my projects fit into several different genres, once I have created a Makefile for a genre, it is trivial to adapt it to different projects. Whether it is an article or a book manuscript, a data analysis project, a website, my CV, or some kind of file conversion process, all I need to remember is to type make to build the project, make deploy to put it on the web, and make clean to start over.
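A minimal sketch of the pattern for a writing project (the file names, server path, and pandoc invocation are placeholders, not taken from any of my actual Makefiles):

```make
# Build a PDF from every Markdown chapter; rebuild only when a chapter changes
SOURCES := $(wildcard chapters/*.md)

all : book.pdf

book.pdf : $(SOURCES)
	pandoc $(SOURCES) --output book.pdf

deploy : book.pdf
	rsync -av book.pdf server:/var/www/book/

clean :
	rm -f book.pdf

.PHONY : all deploy clean
```

Note that the recipe lines under each target must be indented with a tab, not spaces; declaring the phony targets keeps Make from confusing them with files of the same name.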


A Course in Computational Methods and Nineteenth-Century Religious Data

This semester I am teaching a graduate course on Data and Visualization in Digital History. The aim of this course is to teach students how to do the kind of data analysis and visualization that they are likely to do for a dissertation chapter or a journal article. In my way of working, that means the first part of the semester is an introduction to scripting in R, focusing on the grammar of graphics with ggplot2 and the grammar of data manipulation with dplyr and tidyr. Then the second part of the course is aimed at introducing specific kinds of analysis in the context of historical work. My aim is that this course will be the first in a two-course sequence, where the second course (colloquially known as Clio 3) will have more programming in R (as opposed to scripting), will have more *nix-craft, will tackle a more advanced historical problem, will possibly cover more machine learning, and will end up creating interactive analyses in Shiny.
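As a minimal sketch of the kind of workflow the first part of the semester covers (the membership figures below are invented purely for illustration), dplyr supplies the grammar of data manipulation and ggplot2 the grammar of graphics:

```r
library(dplyr)
library(ggplot2)

# Invented denominational membership figures, for illustration only
membership <- data.frame(
  year         = rep(c(1850, 1860), each = 2),
  denomination = rep(c("Methodists", "Baptists"), times = 2),
  members      = c(100, 80, 120, 95)
)

# Grammar of data manipulation: group the rows, then summarize each group
totals <- membership %>%
  group_by(year) %>%
  summarize(total = sum(members))

# Grammar of graphics: map variables to aesthetics, then add a geom
p <- ggplot(totals, aes(x = year, y = total)) +
  geom_line() +
  labs(title = "Total membership by year")
```

The point of teaching these two grammars together is that the same pipeline of verbs carries a student from raw table to summary to plot without changing idioms midstream.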

There are a few things about the Data and Visualization course that I think are worth mentioning.

First, I’ve been creating worksheets for historical data analysis each week. These worksheets tend to demonstrate some technique, then ask students to build up an analysis step by step. The questions within each worksheet range in difficulty from the rote and mechanical to the very difficult. While for now these worksheets are aimed at this class in particular, I intend over time to write worksheets like these for any topic in R I end up teaching. I’m rather pleased with these worksheets as a method of teaching data analysis by example. If I’m judging my students’ initial reactions correctly, they are also finding them helpful, if rather difficult at times.


Materials for DH Workshop at the University of Helsinki

This past December I was invited by the Department of Modern Languages at the University of Helsinki to give a workshop introduction to DH with a special emphasis on data visualization. I had a wonderful time with the scholars there, and learned more about the wide-ranging DH research coming out of that university. I posted my workshop materials online here.

Working Paper on the Migration of Codes of Civil Procedure

Kellen Funk and I are working on detecting how a New York legal code of civil procedure spread to most other jurisdictions in the United States. That Field Code and the other codes derived from it are the basis of modern American legal practice, so tracking the network and content of the borrowings reveals the structure of a significant part of American legal history.


Figure 1: States which adopted a version of the Field Code.

In response to an invitation from the Digital Humanities Working Group at George Mason, we wrote a working paper that describes the current state of our research. In the paper we explain the historical problem to show why it is worth tracking how the Field Code spread. Then we give an overview of how we went about detecting which civil procedure codes were similar to one another, after which we give a few sample visualizations to show how we went about learning from those similarities. And finally we wrap up with a summary of what we think our project tells us about the history of nineteenth-century American law. We are working on an article, which will be structured rather differently, with a fuller statement of our argument and many more visualizations, but in the meantime the working paper gives a fairly succinct overview of the project and its argument. It may also be of interest because it goes into more detail about how a historical data analysis project proceeds from problem to interpretation than we will be able to do in the article. We also have a notebook with more details about the project.
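The working paper describes our actual method; as a toy illustration of the general idea of measuring textual borrowing between code sections, one could compare the overlap of word trigrams between two sections. The snippets below are invented, and this base-R sketch is a stand-in, not the technique the paper uses:

```r
# Split a section of text into overlapping word trigrams
ngrams <- function(text, n = 3) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

# Two invented snippets standing in for sections of two procedure codes
a <- ngrams("the plaintiff may demur to the answer")
b <- ngrams("the defendant may demur to the complaint")

# Jaccard similarity of the two trigram sets: shared phrasing
# divided by all phrasing, ranging from 0 (disjoint) to 1 (identical)
jaccard <- length(intersect(a, b)) / length(union(a, b))
```

Computed over every pair of sections in every code, a score like this yields the matrix of similarities from which a network of borrowings can be drawn.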

Kate Bowler – Death, the Prosperity Gospel, and Me

Kate Bowler writes about the prosperity gospel and her cancer:

The prosperity gospel popularized a Christian explanation for why some people make it and some do not. They revolutionized prayer as an instrument for getting God always to say “yes.” It offers people a guarantee: Follow these rules, and God will reward you, heal you, restore you. It’s also distressingly similar to the popular cartoon emojis for the iPhone, the ones that show you images of yourself in various poses. One of the standard cartoons shows me holding a #blessed sign. My world is conspiring to make me believe that I am special, that I am the exception whose character will save me from the grisly predictions and the CT scans in my inbox. I am blessed.

The prosperity gospel holds to this illusion of control until the very end. If a believer gets sick and dies, shame compounds the grief. Those who are loved and lost are just that — those who have lost the test of faith. In my work, I have heard countless stories of refusing to acknowledge that the end had finally come. An emaciated man was pushed about a megachurch in a wheelchair as churchgoers declared that he was already healed. A woman danced around her sister’s deathbed shouting to horrified family members that the body can yet live. There is no graceful death, no ars moriendi, in the prosperity gospel. There are only jarring disappointments after fevered attempts to deny its inevitability.

The prosperity gospel has taken a religion based on the contemplation of a dying man and stripped it of its call to surrender all. Perhaps worse, it has replaced Christian faith with the most painful forms of certainty. The movement has perfected a rarefied form of America’s addiction to self-rule, which denies much of our humanity: our fragile bodies, our finitude, our need to stare down our deaths (at least once in a while) and be filled with dread and wonder. At some point, we must say to ourselves, I’m going to need to let go.

A Very Preliminary Taxonomy of Sources of Nineteenth-Century U.S. Religious Data

In my last post I explained that historians of U.S. religion have barely begun to scratch the surface of the data (meaning, sources that are amenable to computation) that are available to them. To demonstrate this I gave the example of a single source, the Minutes of the Annual Conferences of the Methodist Episcopal Church.

In this post I want to attempt a very preliminary taxonomy of the kinds of sources that are available to religious historians who wish to use mapping or quantitative analysis of some kind or another. Let’s call this a taxonomy instead of a catalog, because I’m going to list the kinds of sources that I’ve come across rather than try to give a bibliography of all of the sources themselves. I’d love to be able to list all the sources, but I haven’t done all that work yet. And let’s say this is very preliminary, because I hope this post is an example of the so-called Cunningham’s Law: “the best way to get the right answer on the Internet is not to ask a question; it’s to post the wrong answer.” That is to say, if you know of a source or category of source that I don’t know about, I hope you’ll correct me in the comments. Finally, I should mention that I’m teaching a course this semester on “Data and Visualization in Digital History” where we are working on nineteenth-century U.S. religious statistics. I’m indebted to the excellent students in that course, who have already turned up many sources that I didn’t know about.

Enough throat clearing.

All U.S. religious statistics are divided into two parts, those from the Census, and those not from the Census.
