Yesterday when I wrote about experimenting with TEI Boilerplate, I mentioned that one of the impediments I’d found to using TEI was not being able to do something with it right away. TEI Boilerplate lets you see a TEI file in your browser immediately. But I also wanted to experiment with analyzing a TEI file programmatically, so I found some sample documents and wrote a simple script in Ruby to serve as my own proof of concept.
For experimental purposes, I downloaded the Folger Shakespeare Library’s Digital Texts, a collection of Shakespeare’s plays encoded in TEI. I chose these texts because they had each speaker marked up, as in this snippet from Macbeth. For my purposes, a text that marked up names, dates, or places would be more interesting, but the principles are identical.
I decided to write a Ruby script that identified all the speakers and counted the number of times each spoke. The heavy lifting is done by the Ruby library Nokogiri, “an HTML, XML, SAX, and Reader parser” able “to search documents via XPath or CSS3 selectors.” I learned about Nokogiri from this post by Jason Heppler, from whom I’ve learned most of what I know about Ruby. (See his Rubyist Historian for a primer.)
Nokogiri is very powerful—more powerful than I know what to do with. In this script, it does all the analytical work in two lines. One line opens the TEI file, and Nokogiri then parses the document.
The other line finds each <speaker> element and cleans up the name of the speaker.
The rest of the script just sets up some scaffolding to keep track of the speakers and the number of times each one speaks. Here is the whole thing.
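The counting scaffolding can be sketched in a few lines of plain Ruby. This is an illustration under stated assumptions rather than the script itself: the `speakers` array is hypothetical input standing in for names already extracted from a TEI file, and the sort order (most frequent first, ties alphabetical) is my own choice.

```ruby
# Hypothetical input: speaker names as they might come out of the parsing step.
speakers = [
  'FIRST WITCH', 'SECOND WITCH', 'THIRD WITCH',
  'FIRST WITCH', 'SECOND WITCH', 'FIRST WITCH'
]

# A hash whose default value is 0, so a speaker seen for the first time
# starts counting from zero without an explicit check.
counts = Hash.new(0)
speakers.each { |name| counts[name] += 1 }

# Order by descending count, breaking ties alphabetically by name.
ranked = counts.sort_by { |name, count| [-count, name] }

# Print one "name: count" line per speaker.
ranked.each { |name, count| puts "#{name}: #{count}" }
```

Because each `<speaker>` element marks one speech, a tally like this counts speeches rather than individual verse lines.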
The output for Macbeth looks like this. It’s nothing too impressive, but it does show how Ruby and Nokogiri can be used to analyze TEI files.