At his blog, Andrew Goldstone has posted a pre-print of his essay on “Teaching Quantitative Methods: What Makes It Hard (in Literary Studies)” for the forthcoming Debates in DH 2018. It’s a “lessons learned” essay from one of his courses that is well worth reading if you’re teaching or taking that kind of a course in a humanities discipline. This semester I’m teaching my fourth course that fits into that category (fifth, if you count DHSI), and I can co-sign nearly everything that Goldstone writes, having committed many of the same mistakes and learned some of the same lessons. (Except over time I’ve relaxed my *nix-based fundamentalism and repealed my ban on Windows.) Here is a response to Goldstone’s main points.
In my first semester teaching one of my department’s graduate methods courses in digital history, I realized that there was not a lot of good material for teaching computer programming and data analysis in R for historians. So I started writing up a series of tutorials for my students, which they said were helpful. It seemed like those materials could be the nucleus of a textbook, so I started writing one with the title Digital History Methods in R.
It was too soon to start writing, though. Besides needing to spend my time on more pressing projects, I didn’t really have a clear conception of how to teach the material. And in the past few years, the landscape for teaching computational history has been transformed. There are many more books available, some specifically aimed at humanists, such as Graham, Milligan, and Weingart’s Exploring Big Historical Data and Arnold and Tilton’s Humanities Data in R, and others aimed at teaching a modern version of R, such as Hadley Wickham’s Advanced R and R for Data Science. The “tidyverse” of R packages has made a consistent approach to data analysis possible, and the set of packages for text analysis in R is now much better. R markdown and bookdown have made writing a technical book about R much easier, and Shiny has made it much easier to demonstrate concepts interactively.
Kellen Funk and I have just published an article titled “A Servile Copy: Text Reuse and Medium Data in American Civil Procedure” (PDF). The article is a brief invited contribution to a forum in Rechtsgeschichte [Legal History] on legal history and digital history. Kellen and I give an overview of our project to discover how nineteenth-century codes of civil procedure in the United States borrowed from one another. (We will have more soon about this project in a longer research article.)
If you are interested in digital legal history, you might also look at some of the articles which have been posted in advance of the next issue of Law and History Review, which will be focused on digital legal history.
This semester I am teaching an independent study for graduate students on “Text Analysis for Historians.” You can see the syllabus here. It’s an unashamedly disciplinary course. While of course the readings are heavily dependent on the work that is being done in digital humanities or digital literary studies, the organizing principle is whether a method is likely to be useful for historical questions. And the syllabus is organized around a corpus of Anglo-American legal treatises, with readings to frame our work in the context of U.S. legal history.
This post originally appeared at the rOpenSci blog.
The R package ecosystem for natural language processing has been flourishing in recent days. R packages for text analysis have usually been based on the classes provided by the NLP or tm packages. Many of them depend on Java. But recently there have been a number of new packages for text analysis in R, most notably text2vec, quanteda, and tidytext. These packages are built on top of Rcpp instead of rJava, which makes them much more reliable and portable. And instead of the classes based on NLP, which I have never thought to be particularly idiomatic for R, they use standard R data structures. The text2vec and quanteda packages both rely on the sparse matrices provided by the rock solid Matrix package. The tidytext package is idiosyncratic (in the best possible way!) for doing all of its work in data frames rather than matrices, but a data frame is about as standard as you can get. For a long time when I would recommend R to people, I had to add the caveat that they should use Python if they were primarily interested in text analysis. But now I no longer feel the need to hedge.
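To make the contrast between the two representations concrete, here is a minimal sketch in Python rather than R. It does not use text2vec, quanteda, or tidytext, and the two sample documents and all function names are invented for illustration. It stores the same word counts twice: once as "tidy" rows of one count per document and word (the data frame style), and once as a sparse document-term matrix that records only the non-zero cells.

```python
import re
from collections import Counter

# Two invented sample documents, used only for illustration.
docs = {
    "doc1": "The law is the law.",
    "doc2": "Codes borrow from codes.",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tidy_counts(docs):
    """Return one (document, word, count) row per word per document,
    analogous to a long data frame."""
    rows = []
    for doc_id, text in docs.items():
        for word, n in sorted(Counter(tokenize(text)).items()):
            rows.append((doc_id, word, n))
    return rows

def sparse_matrix(docs):
    """Return row labels, column labels, and a dict holding only the
    non-zero cells of the document-term matrix, keyed by (row, column)."""
    vocab = sorted({w for text in docs.values() for w in tokenize(text)})
    column = {word: j for j, word in enumerate(vocab)}
    doc_ids = list(docs)
    entries = {}
    for i, doc_id in enumerate(doc_ids):
        for word, n in Counter(tokenize(docs[doc_id])).items():
            entries[(i, column[word])] = n
    return doc_ids, vocab, entries
```

The tidy form is easy to filter, join, and plot with ordinary data frame tools, while the sparse form is the shape that most modeling code expects. Both hold exactly the same information.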
Still, there is a lot of duplicated effort among these packages on the one hand and a lot of incompatibilities among them on the other. The R ecosystem for text analysis is not exactly coherent or consistent at the moment.
My small contribution to the new text analysis ecosystem is the tokenizers package, which was recently accepted into rOpenSci after a careful peer review by Kevin Ushey. A new version of the package is on CRAN. (Also check out Jeroen Ooms’s hunspell package, which is a part of rOpenSci.)
It’s the start of August, and I don’t want to presume on the good graces of this blog’s readers. So in the spirit of late summer, I’m finally getting around to briefly describing one of my summer projects in the hope that you find it fun, leaving a fuller accounting of the why and wherefore of the project for another time.
America’s Public Bible is a website which looks for all of the biblical quotations in Chronicling America. Chronicling America is a collection of digitized newspapers from the Library of Congress as part of the NEH’s National Digital Newspaper Program. ChronAm currently has some eleven million newspaper pages, spanning the years 1836 to 1922. Using the text that ChronAm provides, I have looked for which Bible verses (just from the KJV for now) are quoted or alluded to on every page. If you want an explanation of why I think this is an interesting scholarly question, there is an introductory essay at the site.
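This post does not say how the matching is done, so the following is only a hypothetical sketch of one simple way to flag candidate quotations: count the word n-grams that a page shares with each verse. The function names, the choice of four-word n-grams, the threshold, and the sample page text are all assumptions made for illustration. This is not the method behind the site.

```python
import re

def ngrams(text, n=4):
    """Return the set of n-word sequences in a text, ignoring case and punctuation."""
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def candidate_verses(page_text, verses, n=4, min_shared=1):
    """Map each verse reference to the number of n-grams it shares with the page,
    keeping only verses that share at least `min_shared` n-grams."""
    page = ngrams(page_text, n)
    hits = {}
    for reference, verse_text in verses.items():
        shared = len(page & ngrams(verse_text, n))
        if shared >= min_shared:
            hits[reference] = shared
    return hits

# Two KJV verses and an invented newspaper sentence, for illustration only.
verses = {
    "John 11:35": "Jesus wept.",
    "Psalm 23:1": "The LORD is my shepherd; I shall not want.",
}
page = ("The speaker reminded the crowd that the Lord is my shepherd, "
        "I shall not want, and sat down.")

matches = candidate_verses(page, verses)
```

A sketch like this misses very short verses (a two-word verse contains no four-word n-grams) and loose allusions. That is one reason a real project would need something more careful than exact n-gram overlap.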
I love Makefiles more than I ought to. If you haven’t come across GNU Make before, then Mike Bostock’s “Why Use Make” is a good introduction, as is the section on Make at Kieran Healy’s “Plain Person’s Guide to Plain Text Social Science.” I like Make for several reasons. It lets you specify how your final products (like a website or a PDF document) are related to inputs, and that discipline is invaluable for producing reproducible research and for structuring your project sensibly. For lots of tasks it provides free parallelization and rebuilds only what is absolutely necessary. Since my projects fit into several different genres, once I have created a Makefile for the genre, it is trivial to adapt it to different projects. Whether it is an article or a book manuscript, a data analysis project, a website, my CV, or some kind of file conversion process, all that I need to remember how to do is type:
make to build the project,
make deploy to put it on the web, and
make clean to start over.
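The idea that Make “rebuilds only what is absolutely necessary” comes down to a timestamp comparison: a target is out of date if it does not exist or if any of its inputs was modified more recently than it was. Here is a minimal sketch of that rule in Python. The function name is hypothetical, and real Make does considerably more, such as chaining rules together and building missing prerequisites.

```python
import os

def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite.

    This mirrors only the core staleness check in Make. It assumes every
    prerequisite file already exists; a missing prerequisite raises an error
    here, whereas Make would look for a rule to build it.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For example, needs_rebuild("article.pdf", ["article.md", "references.bib"]) would report whether a PDF is stale relative to its two sources. The file names are placeholders.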
This semester I am teaching a graduate course on Data and Visualization in Digital History. The aim of this course is to teach students how to do the kind of data analysis and visualization that they are likely to do for a dissertation chapter or a journal article. In my way of working, that means the first part of the semester is an introduction to scripting in R, focusing on the grammar of graphics with ggplot2 and the grammar of data manipulation with dplyr and tidyr. Then the second part of the course is aimed at introducing specific kinds of analysis in the context of historical work. My aim is that this course will be the first in a two-course sequence, where the second course (colloquially known as Clio 3) will have more programming in R (as opposed to scripting), will have more *nix-craft, will tackle a more advanced historical problem, will possibly cover more machine learning, and will end up creating interactive analyses in Shiny.
There are a few things about the Data and Visualization course that I think are worth mentioning.
First, I’ve been creating worksheets for historical data analysis each week. These worksheets tend to demonstrate some technique, then ask students to build up an analysis step by step. The questions within each worksheet range in difficulty from the rote and mechanical to the very difficult. While for now these worksheets are aimed at this class in particular, I intend over time to write worksheets like these for any topic in R I end up teaching. I’m rather pleased with these worksheets as a method of teaching data analysis by example. If I’m judging my students’ initial reactions correctly, they are also finding them helpful, if rather difficult at times.
This past December I was invited by the Department of Modern Languages at the University of Helsinki to give a workshop introduction to DH with a special emphasis on data visualization. I had a wonderful time with the scholars there, and learned more about the wide-ranging DH research coming out of that university. I posted my workshop materials online here.
Kellen Funk and I are working on detecting how a New York legal code of civil procedure spread to most other jurisdictions in the United States. That Field Code and the other codes derived from it are the basis of modern American legal practice, so tracking the network and content of the borrowings reveals the structure of a significant part of American legal history.
In response to an invitation from the Digital Humanities Working Group at George Mason, we wrote a working paper that describes the current state of our research. In the paper we explain the historical problem to show why it is worth tracking how the Field Code spread. Then we give an overview of how we went about detecting which civil procedure codes were similar to one another, after which we give a few sample visualizations to show how we went about learning from those similarities. And finally we wrap up with a summary of what we think our project tells us about the history of nineteenth-century American law. We are working on an article, which will be structured rather differently with a fuller statement of our argument and many more visualizations, but in the meantime the working paper gives a fairly succinct overview of the project and its argument. It may also be of interest for going into more detail as to how a historical data analysis project proceeds from problem to interpretation than we may be able to do in the article. We also have a notebook with more details about the project.