All blog posts: by date RSS feed

Chronicling America OCR debatcher

This is probably useful only for me, but I’ve made a small utility to help get the Chronicling America OCR files. The batch files from the Chronicling America bulk data downloads are .tar.bz2 files with both plain text and XML versions of the OCR text of the newspaper pages. The files are slow to unzip, and unzipping them dumps tens of thousands of files, at least half of which you don’t need, onto your disk. So the utility processes the batches without unzipping them and creates a CSV file with the text and the IDs used elsewhere in Chronicling America. You can get the utility at GitHub.
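To give a sense of the approach, here is a minimal Python sketch of reading a batch as a stream and writing a CSV. It is not the utility's actual code. The function name `batch_to_csv`, the column names, and the assumption that a page's ID can be taken from the directory path of its `.txt` file are all hypothetical, chosen only for illustration.

```python
import csv
import tarfile


def batch_to_csv(batch_path, csv_path):
    """Stream a .tar.bz2 batch and write one CSV row per plain-text OCR page.

    Returns the number of pages written.
    """
    rows = 0
    # Mode "r|bz2" reads the archive as a forward-only stream, so nothing is
    # extracted to disk and members are visited one at a time.
    with tarfile.open(batch_path, mode="r|bz2") as archive, \
            open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["page_id", "text"])
        for member in archive:
            # Skip directories and the XML versions of each page.
            if not member.isfile() or not member.name.endswith(".txt"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            text = handle.read().decode("utf-8", errors="replace")
            # Assumption: the member's directory path identifies the page.
            page_id = member.name.rsplit("/", 1)[0]
            writer.writerow([page_id, text])
            rows += 1
    return rows
```

Reading the archive as a stream means the XML files are skipped as they go by rather than being written to disk and deleted afterwards.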

Goodman on the meaning of “tradition”

Martin Goodman in his History of Judaism:1

The past 2,000 years have witnessed a great variety of expressions of Judaism. It would be straightforward to define the essence of Judaism in light of the characteristics valued by one or another of its branches in the present day, and to trace the development of those characteristics over the centuries, and such histories have indeed been written in past centuries. But it is evidently unsatisfactory to assume that what now seems essential was always seen as such. In any case it cannot be taken for granted that there was always a mainstream within Judaism and that the other varieties of the religion were, and should be, seen as tributaries. The metaphors of a great river of tradition, or of a tree with numerous branches, are seductive but dangerous, for the most important aspects of Judaism now may have little connection with antiquity. It is self-evident, for instance, that the central liturgical concern of 2,000 years ago—the performance of sacrificial worship in the Jerusalem Temple—has little to do with most forms of Judaism today.

Goodman’s questioning of the metaphors for tradition is helpful. While there is a place for finding the origins of traditions, as Goodman goes on to explain, the discontinuities in the history of Judaism in Goodman’s case or the history of Christianity in mine are almost more striking, and harder to craft into a historical narrative.

  1. Martin Goodman, A History of Judaism (Princeton University Press, 2018), xxiii–xxiv.

The most interesting tech company of 2018

After thinking about it, I came to the conclusion that the most interesting tech company of 2018 was … Microsoft? My formative experiences with computers came in the 1990s, and even though the first computer my family had was a Windows PC, I imbibed anti-Microsoft sentiment in my youth. That attitude only hardened once I came to do much of my work in a way that requires a *nix system. My new-found appreciation for the company comes as a surprise, but let me make my case.

  • Xbox. I went about fifteen years without playing video games, skipping every console from the Super NES to Xbox One, only recently returning to playing on the Xbox platform. Whatever problems Microsoft may have had with this generation of consoles, they obviously righted the ship in 2018. From the Xbox One X, which is a phenomenal piece of hardware, to the studio acquisitions, to the backwards compatibility program which shows an appreciation for their history and which allows latecomers like me to catch up, Microsoft has been more interesting than any other gaming company, though the Nintendo Switch follows close behind. Since my employer provides me with one of Apple’s frightfully expensive computers, I likely spend more of my own money on Microsoft than on any other tech company.
  • Visual Studio Code. In 2018, Visual Studio Code became my primary text editor instead of Vim. Vim will always have a place in my heart and my work, and I can’t imagine using any text editor that doesn’t have Vim keybindings. But Code’s IDE-like features work much better than even a highly customized Vim. I have flirted with other modern text editors like Sublime Text and Atom, but Visual Studio Code’s performance and features are much better.
  • Windows Subsystem for Linux. Maybe Windows 10 is good; maybe it’s a mess. I don’t know and I don’t care. But the Windows Subsystem for Linux has made my life better even though I don’t use it. Almost all of my digital history work requires a *nix system. But students show up in class with Windows machines and it is hard to support them. Windows Subsystem for Linux has helped a great deal by giving them a bona fide Linux terminal that they can use with only some fuss.
  • GitHub. Microsoft bought GitHub in 2018 and they haven’t screwed it up. As far as I can tell, the only substantial change is that GitHub users now get free private repositories. That was probably a bad move since people will be more likely to keep their software private instead of making it public. But honestly, I intend to make a lot of my half-baked repositories private when I get a chance, so it’s hard to find fault.
  • Programming languages. Microsoft has been doing interesting things with R for a while now. I don’t use most of it, except for a few packages here and there, but they do make the ecosystem stronger. If I had to do anything with R in the cloud, I would probably try Azure. It also helps with teaching that R’s support for Windows is strong, though that is more a virtue of the R core team than of Windows. Recently I have been getting back into JavaScript, but via TypeScript, a Microsoft-created superset of JavaScript. I like it so far, since it addresses much of what is ugly about JavaScript. Microsoft also seems to be a strong supporter of Go, a Google-created language that I have been dabbling with.
  • Microsoft Word. After years of avoiding all things Word (including a few tense moments with the publisher of my book), I finally broke down and installed it on my work computer. You know what? As long as you aren’t writing in Word and are just commenting on other people’s documents, it’s not that bad.

That’s an entirely personal case, of course. But in terms of where I spend my money and what I use to do my work, Microsoft made more of a play in 2018 than any of the other tech giants.

Bunkmail’s “Best American History Reads” and public engagement

The most recent “Bunkmail” offers up a list of “Best American History Reads of 2018.” It’s a remarkable collection of, by my count, sixty-one publicly-engaged essays, visualizations, or even bibliographies on topics ranging from Trump (of course) to historic preservation.

It’s not clear to me how many of the authors cited in that list are academic historians engaging the public, though I certainly recognize many of the names, or how many are journalists writing about historical topics. But it does seem clear that the tired old story that historians don’t engage with public audiences and that public audiences don’t engage with history is put to rest by collections like Bunk’s and the #everythinghasahistory hashtag popularized by Jim Grossman, or by the frequency with which historians write for venues like The Atlantic or the Washington Post’s Made by History blog, to say nothing of the public engagement that goes on in museums and classrooms. My own take is that the reason this worn-out idea sticks around is not because historians aren’t engaging with public audiences, though we could certainly do more. Rather the problem is that at least some public audiences don’t want the hard-to-swallow interpretations historians offer in place of the spoonful-of-sugar myths about American history that they’ve been fed.

You can subscribe to the Bunkmail newsletter at Bunk.

Lamin Sanneh (1942–2019)

Lamin Sanneh:

The idea of what Christianity should look like back when it was conceived and launched from its base in metropolitan centers often bore little relationship to realities on the ground once the religion was adopted. On the contrary, those who adopted the faith often expanded and transformed the assumptions of those who transmitted it. Thus, the history of Christianity has become properly the history of the world’s peoples and cultures, not simply the history of missionaries and their cultures. It goes without saying that the gospel has necessarily been conveyed in the cultural vessels of missionaries, yet only in the crucible of indigenous appropriation did new faith emerge among the recipients.

That quotation is from Sanneh’s book Disciples of All Nations: Pillars of World Christianity, which I have been revisiting in preparation for my class on the history of Christianity. If that passage’s claim about the transmission and reception of the Christian gospel seems commonplace to scholars today, it is only because Sanneh so thoroughly and convincingly explained that idea in Translating the Message and Whose Religion is Christianity?, as well as his other works. For my own part, I am grateful to have had my assumptions “expanded and transformed” by Sanneh’s writings.

From Sanneh’s autobiography, Summoned from the Margin: Homecoming of an African:

That is where the empty tomb juts in to solidify the idea that Jesus’ embodiment of death and resurrection was a necessary and designated landmark of the God of history. The ground is God’s own by design and choice, and it compels engagement and response on our part because the historical events in question are laden with moral import for us here and now: not the import of our natural and commendable desire to rescue Jesus, but the import that his death and resurrection speak solicitously to our estrangement and reconciliation with God—on God’s terms. … It would be better to be a forgiven enemy of Jesus, I reasoned, than to be his unforgiving defender.


Tomorrow is the start of the annual meeting of the American Society of Church History, now completely divorced—as far as I can tell—from the American Historical Association annual meeting. (I hear the MLA will also be in town.) I’m planning to spend most of my time at ASCH, though I will wander over to a few AHA panels as well. I was originally scheduled for only one session, but for me this is the year of filling in, and now I am on three.

On Thursday (3:30 p.m.), I’ll be a last-minute sub for an ASCH roundtable discussion of “Timothy Larsen’s John Stuart Mill: A Secular Life and OUP’s Spiritual Lives Series.” It’s a fascinating biography and I’m looking forward to hearing what Tim and the other panelists have to say.

My friend (and collaborator) Kellen Funk is out sick, so on Saturday (8:30 a.m.) I will be filling in for him at an AHA/American Society of Legal History panel on “New Directions in American Legal History.” The subject is our ongoing work doing text analysis on the Making of Modern Law corpus of legal treatises.

Keeping alive my long tradition of drawing an early morning Sunday panel, on Sunday (9:00 a.m.) I’ll be the chair and commentator for the ASCH panel “Eighteenth- and Nineteenth-Century Anti-Catholicism in America and its Legacies.” This session will feature the work of Maura Jane Farrelly, Paul Gutacker, and Timothy D. Grundmeier.

The Chance of Salvation at Fall for the Book

Fall for the Book is an annual book festival held at George Mason University and other venues in Fairfax, VA. If you are in Fairfax on October 10–13, it’s worth attending.

I will be talking about The Chance of Salvation: A History of Conversion in America on Thursday, October 11, at 1:30 p.m. in the Johnson Center, third floor, meeting room F. Here is the event on the Fall for the Book schedule. The stories of seven converts from several religious traditions or no religion at all in 45 minutes.

How long does it take to publish in the AHR?

The April 2018 issue of the American Historical Review has a note by the editor, Alex Lichtenstein, explaining the journal’s process of peer review and giving a summary of the average length of time it takes an article to go from initial submission to final acceptance. It’s an interesting note, and I appreciate the editor’s transparency. I also appreciate that the journal has clearly stated in its author guidelines how many articles it receives, how many of those make it through the full review process, and how many are published. Since Kellen Funk and I published an article in the AHR earlier this year, I thought I might comment on the process of peer review, and especially the time it takes to get one’s work into circulation, from the perspective of an author.

The peer review reports from the AHR were certainly the most useful that I have received on any of my published work. The reviewers genuinely helped us develop our article to broaden its reach. They couched their suggestions in terms of enthusiasm for the article’s possibility rather than fatal flaws in its argument, which certainly made those suggestions more palatable from our perspective. Eric Nystrom, who signed his review, even ran our code and gave it a thorough review as well. I also want to note that when the article was in production the AHR staff were fabulous, and Jane Lyle did a great deal to make the visualizations successful on the web, in the PDF version, and on the printed page.

Lichtenstein writes that “our peer review process leads to a much less elitist procedure of publication selection than many people imagine.” I’m inclined to think that’s right. Digital historians often suspect that traditional historical journals are hostile to their work. Perhaps that is true in some cases, but my own experience has been that journals and editors are actually interested in digital historical scholarship if it can be framed in a way that also appeals to the historical profession more broadly. And I think that the AHR’s system of review is likely to give such articles a fair shake.

The process of peer review, however, did take a long time. Lichtenstein sums up the time to acceptance for articles published between 2015 and 2017:

Of the remaining thirty-seven articles, the elapsed time between submission and final acceptance ranged between 408 and 1,259 days: the average time was 740 days, and the median was 701 days. (We have little backlog, so delay between acceptance and publication is not significant.) … While this typical duration of about two years is longer than I would like—a median closer to 500 days would be optimal, in my view—I do not think it is so bad in light of the elaborate procedures described above.

Frankly, I find those figures staggering. I’m sympathetic to the difficulties of getting peer reviewers and the other considerations that the editor mentions. But an average period of two years from submission to acceptance (let alone the further delay for publication) strikes me as a real problem for getting ideas into circulation and for carrying on scholarly conversations.

Let me compare those averages with a table detailing how long it took our article to go through each step in the AHR’s process. (Our 2018 article was not included in Lichtenstein’s calculations, since he counted articles published in 2015–2017.) The column “days elapsed” shows how long that step took from the previous. The “author elapsed” and “journal elapsed” columns are a running total of how long the article was in each party’s hands.

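| Step                        | Date           | Days elapsed | Author elapsed | Journal elapsed | Total elapsed |
|-----------------------------|----------------|--------------|----------------|-----------------|---------------|
| Article submitted           | July 28, 2016  | 0            | 0              | 0               | 0             |
| Sent to editorial board     | Aug. 25, 2016  | 28           | 0              | 28              | 28            |
| Sent to external reviewers  | Oct. 6, 2016   | 42           | 0              | 70              | 70            |
| Accepted pending revisions  | March 10, 2017 | 155          | 0              | 225             | 225           |
| Revised article resubmitted | June 11, 2017  | 93           | 93             | 225             | 318           |
| Article accepted            | July 12, 2017  | 31           | 93             | 256             | 349           |
| Final version submitted     | Nov. 15, 2017  | 126          | 219            | 256             | 475           |
| Article published           | Feb. 1, 2018   | 78           | 219            | 334             | 553           |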
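For anyone who wants to check the arithmetic, the figures in the table can be recomputed from the dates alone. The following Python sketch does that. The dates come from the table; the label for which party held the article before each step is read off from which running total increases in that row, and the names `STEPS` and `elapsed_table` are only illustrative.

```python
from datetime import date

# (step, date reached, party holding the article since the previous step)
STEPS = [
    ("Article submitted", date(2016, 7, 28), None),
    ("Sent to editorial board", date(2016, 8, 25), "journal"),
    ("Sent to external reviewers", date(2016, 10, 6), "journal"),
    ("Accepted pending revisions", date(2017, 3, 10), "journal"),
    ("Revised article resubmitted", date(2017, 6, 11), "author"),
    ("Article accepted", date(2017, 7, 12), "journal"),
    ("Final version submitted", date(2017, 11, 15), "author"),
    ("Article published", date(2018, 2, 1), "journal"),
]


def elapsed_table(steps):
    """Return rows of (step, days, author total, journal total, overall total)."""
    totals = {"author": 0, "journal": 0}
    previous = steps[0][1]
    rows = []
    for name, reached, holder in steps:
        days = (reached - previous).days
        if holder is not None:
            totals[holder] += days
        overall = totals["author"] + totals["journal"]
        rows.append((name, days, totals["author"], totals["journal"], overall))
        previous = reached
    return rows


for row in elapsed_table(STEPS):
    print(row)
```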

To sum up, after we submitted the article, the journal took 225 days for editorial and external peer review, after which it was accepted pending minor revisions. We took 93 days to make the required changes, and the article was accepted. At that point we were assigned to the February 2018 issue and given a November deadline to make any other changes we wished. Sending it in earlier at that point wouldn’t have gotten the article published any faster.

In other words, we went from submission to acceptance in 349 days, well below the 408-day minimum for the articles that the editor counted, and less than half of the average for those articles. Relative to the figures that the editor reports, we sped through.

But in absolute terms, the article took a very long time to publish. The time from submission to publication was 553 days, or 18 months. I don’t want to minimize the contributions the AHR editors and reviewers made: the article as published was better, but its argument was substantially the same as that of the article we submitted.

I don’t see how a lag from submission to publication of a minimum of a year and a half, and quite likely of three years, can be good for scholarship. It is clear that no effort will be made to speed up the publication process for the AHR, since Lichtenstein writes, “Whatever other changes might be in the offing during my editorship, a revamping of the peer review process for our refereed articles is not one of them.” Nor do I think that the AHR is that far out of the norm for humanities journals in terms of time to publication.

Nevertheless, I want to make two suggestions that I think could shorten the delay from completing an article to getting it into circulation among scholars.

The first suggestion is to give peer reviewers shorter deadlines. I don’t have any hard figures, but from my own experience as a referee, humanities journals tend to give scholars two or even three months to complete a peer review. To be honest, when I get a review deadline that is three months out, I start it about a week before the review is due. There are just too many other things to do to get a review back months early. But when I review software or software papers, I am typically given a deadline that is two or three weeks away, and I still start the review about a week before it is due. Whether the deadline is three weeks or three months, some reviewers will be late, and some will never finish the review. But with a shorter deadline, there isn’t a guaranteed delay of several months before problems crop up. (The Journal of the American Academy of Religion, for example, requests reviews within two weeks.)

The second suggestion is that authors should post preprints at the earliest possible opportunity. Oxford University Press, the publisher of the AHR, has a fairly reasonable preprint policy. In brief, authors are allowed to do whatever they like with the “author’s original version” that is initially submitted. Kellen and I posted that version to the SSRN and SocArXiv preprint servers the day that our article was accepted. That was about a year after we submitted the article, but it got the piece into circulation nearly seven months before the article was published. If we had been a little bolder, we could have posted that preprint the same day we submitted the article—a delay between submission and circulation of zero days.

Whether or not humanities journals speed up the time to publication on their end, authors can use preprints to speed up the time to circulation. In a later post, I’ll have some further thoughts about an effective preprint strategy for authors.

Why I joined the rOpenSci editorial team

Today I joined the rOpenSci editorial team, taking on a role editing R packages and seeing them through rOpenSci’s peer-review process. It might seem a bit strange for a historian to formally join a group of scientists writing packages for a programming language. So why am I joining rOpenSci?

I’ve been involved with rOpenSci since about 2015, after Scott Chamberlain reached out to see if I was interested in participating with the group. Since then I’ve contributed a number of R packages, including several that went through rOpenSci’s process of peer review, and I’ve guest edited several packages that have gone through the same process. I’ve also been to two of their unconferences and both of their workshops for developers of text analysis packages. The rOpenSci developer collective has been very helpful for me in improving and peer reviewing the software that I write for many of my digital projects.

I’m joining the rOpenSci editorial team because I believe that their mission of creating “a culture that values open and reproducible research using shared data and reusable software” is just as much needed for digital history and the digital humanities as it is for the sciences. (For “science,” maybe read Wissenschaft?) It is not because I think digital history is a science, or some other such nonsense that might get written about in the Chronicle of Higher Education. At higher levels of abstraction the disciplinary differences between the sciences and the humanities are very real, but at the level of code and computation there is a great deal that the two domains of knowledge can learn from one another.

Because I value the contributions that scores of editors and peer reviewers have made to my prose scholarship, I want there to be a similar process of editing and review available for scholarship expressed in software. And the rOpenSci process of onboarding R packages through an open, well-documented peer-review process is pretty great. It’s rather similar, I think, to the open review process created by the Programming Historian, though for software rather than tutorials. So I am glad to have a chance to give something back to the #rstats community and help academic developers get their software reviewed as scholarship.

If you are working in R for digital history or the digital humanities, especially if it involves text analysis or geospatial data, please take a look at the rOpenSci descriptions of packages that are within their scope and consider submitting your work.

New release: tokenizers v0.2.0

A new v0.2.0 release of the tokenizers package for R is now available on CRAN. The tokenizers package provides a fast, consistent set of functions to turn natural language text into tokens. This is a fairly substantial release which is a result of collaboration with Dmitry Selivanov, Os Keyes, Ken Benoit, and Jeffrey Arnold. This version adds these features (see the changelog for more details and attribution):

  • A new tokenizer for tweets that preserves usernames, hashtags, and URLs.
  • A new tokenizer for Penn Treebank style tokenization.
  • A new function to split long documents into pieces of equal length.
  • New functions to count words, characters, and sentences without tokenization.
  • The package now uses C++98 rather than C++11, so more users will be able to install it without upgrading their compiler. (No more e-mails from CentOS 6 users.)
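
The idea behind the document-splitting feature in the list above can be pictured with a short sketch. This is Python rather than R, and it is not the package's implementation: the name `chunk_words`, the whitespace-based word splitting, and the handling of a short final piece are all assumptions made for illustration.

```python
def chunk_words(text, chunk_size=100):
    """Split text into pieces of at most chunk_size whitespace-separated words.

    Every piece has exactly chunk_size words except possibly the last one.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


pieces = chunk_words("one two three four five", chunk_size=2)
print(pieces)  # ['one two', 'three four', 'five']
```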

Most important, the package implements the draft recommendations of the Text Interchange Format. The TIF standards were drafted at the 2017 rOpenSci Text Workshop. They define standards for tokens, corpora, and document-term matrices to allow R text analysis packages to interoperate with one another. I think these standards, once finalized and widely adopted, will be a very positive development for the coherence of the ecosystem of packages around text analysis in R. A new vignette explains how the tokenizers package fits into this ecosystem.

Finally, the package now has a new website for vignettes and documentation, thanks to pkgdown. What the package does not have is a nice hex-sticker logo, but perhaps that can come in due time.
