Syllabus for “Text Analysis for Historians”

This semester I am teaching an independent study for graduate students on “Text Analysis for Historians.” You can see the syllabus here. It’s an unashamedly disciplinary course. While of course the readings are heavily dependent on the work that is being done in digital humanities or digital literary studies, the organizing principle is whether a method is likely to be useful for historical questions. And the syllabus is organized around a corpus of Anglo-American legal treatises, with readings to frame our work in the context of U.S. legal history.

They are credited on the syllabus, but it bears repeating here: this class draws on syllabi by Ted Underwood, Andrew Goldstone, and Ben Schmidt, and Kellen Funk offered suggestions for the readings.


New package tokenizers joins rOpenSci

This post originally appeared at the rOpenSci blog.

The R package ecosystem for natural language processing has been flourishing of late. R packages for text analysis have usually been based on the classes provided by the NLP or tm packages. Many of them depend on Java. But recently there have been a number of new packages for text analysis in R, most notably text2vec, quanteda, and tidytext. These packages are built on top of Rcpp instead of rJava, which makes them much more reliable and portable. And instead of the classes based on NLP, which I have never found particularly idiomatic for R, they use standard R data structures. The text2vec and quanteda packages both rely on the sparse matrices provided by the rock-solid Matrix package. The tidytext package is idiosyncratic (in the best possible way!) for doing all of its work in data frames rather than matrices, but a data frame is about as standard as you can get. For a long time when I would recommend R to people, I had to add the caveat that they should use Python if they were primarily interested in text analysis. But now I no longer feel the need to hedge.

Still, there is a lot of duplicated effort among these packages on the one hand and a lot of incompatibility among them on the other. The R ecosystem for text analysis is not exactly coherent or consistent at the moment.

My small contribution to the new text analysis ecosystem is the tokenizers package, which was recently accepted into rOpenSci after a careful peer review by Kevin Ushey. A new version of the package is on CRAN. (Also check out Jeroen Ooms’s hunspell package, which is a part of rOpenSci.)

Continue reading “New package tokenizers joins rOpenSci”

Introducing America’s Public Bible (Beta)

It’s the start of August, and I don’t want to presume on the good graces of this blog’s readers. So in the spirit of late summer, I’m finally getting around to briefly describing one of my summer projects in the hope that you find it fun, leaving a fuller accounting of the why and wherefore of the project for another time.

America’s Public Bible is a website which looks for all of the biblical quotations in Chronicling America. Chronicling America is a collection of digitized newspapers from the Library of Congress as part of the NEH’s National Digital Newspaper Program. ChronAm currently has some eleven million newspaper pages, spanning the years 1836 to 1922. Using the text that ChronAm provides, I have looked for which Bible verses (just from the KJV for now) are quoted or alluded to on every page. If you want an explanation of why I think this is an interesting scholarly question, there is an introductory essay at the site.

Continue reading “Introducing America’s Public Bible (Beta)”

An introduction to the textreuse package, with suggested applications

A number of problems in digital history/humanities require one to calculate the similarity of documents or to identify how one text borrows from another. To give one example, the Viral Texts project, by Ryan Cordell, David Smith, et al., has been very successful at identifying reprinted articles in American newspapers. Kellen Funk and I have been working on a text reuse problem in nineteenth-century legal history, where we seek to track how codes of civil procedure were borrowed and modified in jurisdictions across the United States.

As part of that project, I have recently released the textreuse package for R to CRAN. (Thanks to Noam Ross for giving this package a very thorough open peer review for rOpenSci, to which I’ve contributed the package.) This package is a general purpose implementation of several algorithms for detecting text reuse, as well as classes and functions for investigating a corpus of texts. Put most simply, full text goes in and measures of similarity come out. Put more formally, here is the package description:

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality-sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
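To make "full text goes in and measures of similarity come out" a little more concrete, here is a minimal sketch of two ideas named in that description: shingled n-grams and a Jaccard similarity score. It is written in plain Python for illustration and is not the textreuse package's API. The function names `shingle_ngrams` and `jaccard_similarity` are hypothetical, and tokenizing by lowercasing and splitting on whitespace is a simplifying assumption.

```python
def shingle_ngrams(text, n=3):
    """Return the set of overlapping word n-grams ("shingles") in a text.

    Assumption: words are found by lowercasing and splitting on whitespace,
    which is far cruder than a real tokenizer.
    """
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty documents
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps"
doc_b = "the quick brown fox sleeps"

# Each document yields three trigrams; two are shared and four are distinct.
score = jaccard_similarity(shingle_ngrams(doc_a), shingle_ngrams(doc_b))
print(score)  # 0.5
```

Comparing every pair of documents this way grows quadratically with the size of the corpus, which is presumably why the description also lists minhash and locality-sensitive hashing: they are standard ways of finding likely matches without scoring every pair.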

Continue reading “An introduction to the textreuse package, with suggested applications”