Parsing TEI Files in Ruby with Nokogiri


Yesterday, when I wrote about experimenting with TEI Boilerplate, I mentioned that one of the impediments I’d found to using TEI was not being able to do something with it immediately. TEI Boilerplate lets you see a TEI file in your browser right away. But I also wanted to experiment with analyzing a TEI file programmatically, so I found some sample documents and wrote a simple Ruby script to serve as my own proof of concept.

For experimental purposes, I downloaded the Folger Shakespeare Library’s Digital Texts, a collection of Shakespeare’s plays encoded in TEI. I chose these texts because they have each speaker marked up, as in this snippet from Macbeth. For my purposes, a text that marked up names, dates, or places would be more interesting, but the principles are identical.

<speaker xml:id="spk-1490">
<w xml:id="w0221430">SECOND</w>
<c xml:id="c0221440"> </c>
<w xml:id="w0221450">WITCH</w>
</speaker>

I decided to write a Ruby script that identified all the speakers and counted the number of times each spoke. The heavy lifting is done by the Ruby library Nokogiri, “an HTML, XML, SAX, and Reader parser” able “to search documents via XPath or CSS3 selectors.” I learned about Nokogiri from this post by Jason Heppler, from whom I’ve learned most of what I know about Ruby. (See his Rubyist Historian for a primer.)

Nokogiri is very powerful—more powerful than I know what to do with. In this script, it does all the analytical work in two lines. One line opens the TEI file, and Nokogiri then parses the document.

doc = Nokogiri::XML(open(filename))

The other line takes the text of each <speaker> element that the script finds and strips the line breaks out of the speaker's name.

name = speaker.content.gsub(/\n/,"")
The rest of the script just sets up some scaffolding to keep track of the speakers and the number of their lines. Here is the whole thing.

#!/usr/bin/env ruby
# encoding: utf-8

# Name::      speakers.rb
# Author::    Lincoln Mullen (mailto:lincoln@lincolnmullen.com)
# Copyright:: Copyright © 2013 Lincoln Mullen
# License::   MIT License | http://lmullen.mit-license.org/

# This program finds all of the speakers in a TEI file and lists them
# by the number of times that they speak.
# Usage: ./speakers.rb my-tei-file.xml

require 'nokogiri'  # for xml parsing
require 'pp'        # for a nicer output

# Get the file name to open
filename = ARGV[0]

# Open a hash to store our data
speakers = Hash.new

begin
  # Open the file and parse it with Nokogiri
  doc = Nokogiri::XML(open(filename))
  # Find each instance of a <speaker> tag
  doc.search('speaker').each do |speaker|
    # Clean up the line breaks in the speaker's name
    name = speaker.content.gsub(/\n/,"")
    if speakers.has_key?(name)
      # If the speaker is already in our hash then add 1 to the count
      # of utterances
      speakers[name] += 1
    else
      # If the speaker is not already in the hash then add the speaker
      # with a count of 1
      speakers[name] = 1
    end
  end
rescue Errno::ENOENT
  # If the file we've been passed doesn't exist, catch the error
  puts "That file does not exist."
end

# Sort the hash of speakers by the number of times they speak, in
# descending order, then print the output
pp speakers.sort_by { |name, lines| lines }.reverse

The output for Macbeth looks like this. It’s nothing too impressive, but it does show how Ruby and Nokogiri can be used to analyze TEI files.

[["MACBETH", 145],
 ["LADY MACBETH", 59],
 ["MACDUFF", 59],
 ["MALCOLM", 40],
 ["ROSS", 39],
 ["BANQUO", 33],
 ["FIRST WITCH", 23],
 ["LENNOX", 21],
 ["DOCTOR", 20],
 ["LADY MACDUFF", 19],
 ["DUNCAN", 18],
 ["SECOND WITCH", 15],
 ["SON", 14],
 ["ALL", 13],
 ["THIRD WITCH", 13],
 ["GENTLEWOMAN", 11],
 ["SIWARD", 11],
 ["FIRST MURDERER", 11],
 ["MURDERER", 7],
 ["SERVANT", 6],
 ["SECOND MURDERER", 6],
 ["THIRD MURDERER", 6],
 ["MESSENGER", 6],
 ["SEYTON", 5],
 ["MENTEITH", 5],
 ["YOUNG SIWARD", 4],
 ["PORTER", 4],
 ["ANGUS", 4],
 ["OLD MAN", 4],
 ["DONALBAIN", 3],
 ["CAITHNESS", 3],
 ["CAPTAIN", 3],
 ["LORDS", 3],
 ["LORD", 3],
 ["MURDERERS", 3],
 ["HECATE", 2],
 ["FLEANCE", 2],
 ["SECOND APPARITION", 2],
 ["THIRD APPARITION", 1],
 ["FIRST APPARITION", 1],
 ["SOLDIER", 1],
 ["MACBETH AND LENNOX", 1]]
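The `has_key?` scaffolding in the script can be condensed with a couple of standard Ruby idioms. The sketch below is an alternative, not the script's own approach, and the `names` list is made up so that the example runs without a TEI file. Note that `Enumerable#tally` needs Ruby 2.7 or later.

```ruby
# Speaker names as they might come out of the cleanup step.
# This short list is invented for illustration.
names = ["MACBETH", "BANQUO", "MACBETH", "ROSS", "MACBETH", "BANQUO"]

# Option 1: a hash whose missing keys default to 0,
# so no has_key? check is needed before incrementing.
counts = Hash.new(0)
names.each { |name| counts[name] += 1 }

# Option 2: Enumerable#tally (Ruby 2.7+) builds the same hash in one call.
tallied = names.tally

# Sort by count, largest first. Negating the count avoids a separate reverse.
ranked = counts.sort_by { |_name, lines| -lines }

p ranked  # => [["MACBETH", 3], ["BANQUO", 2], ["ROSS", 1]]
```

Either option could replace the `if`/`else` inside the loop; which one reads better is largely a matter of taste and of the Ruby version available.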