Text Analysis for Historians

Independent study, fall 2016. Department of History and Art History, George Mason University. Meets every other Wednesday at 1 p.m. for discussion; work sessions on alternate weeks as necessary. Instructor: Lincoln Mullen <lmullen@gmu.edu>. Office: Research Hall 457.

This independent study is an advanced course in the theory and practice of text analysis for historians. You will read current research in digital history and cognate fields. The aim is to learn the methods of text analysis which are most likely to produce insights useful for historical interpretation. You will work primarily in the R programming language and with common Unix-style utilities, performing an analysis using each method we discuss week by week. By the end of the semester, you will write a paper which uses text analysis of a corpus to make a historical argument or interpretation.

Schedule

Week 1 (August 31): Introduction

Practicum: Before our first meeting, install R and the RStudio Desktop IDE (you may wish to install the preview version to get notebook support). Start to become familiar with the basics of R as described in the introductory chapters of either Jockers or Arnold and Tilton. You should also start to become familiar with the basics of the Unix-style command line (see Shotts, Linux Command Line, as a reference).

Week 2 (September 7): Working with corpus metadata

Practicum: Download at least one of the provided corpora. Extract the metadata into tables. Do an exploratory data analysis of the corpus metadata, paying special attention to the question of which texts actually belong in the corpus. See the R Markdown documentation for help getting started with your first notebook. See Jenny Bryan, Happy Git and GitHub for the useR, for guidance using Git/GitHub.

Week 3 (September 21): Vector space models

Practicum: Using text2vec, create a document-term matrix of a corpus. Experiment with using words and n-grams, filtering terms and stemming, applying transformations such as TF-IDF, and applying distance measures such as cosine distance. Produce plots of how terms are used. (You may want to work ahead on unsupervised clustering or principal component analysis.)
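To make the transformations named above concrete, here is a minimal sketch of term counts, one common TF-IDF variant, and cosine distance. It is written in plain Python with only the standard library so that the arithmetic is visible; it is not how text2vec implements these steps, and its weighting formula may differ. The three toy documents and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus; in the practicum these would be texts from your corpus.
docs = {
    "treatise_a": "the law of contracts governs the promise",
    "treatise_b": "the law of torts governs the injury",
    "sermon_c": "grace and mercy follow the faithful",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

# One row of the document-term matrix per document: term -> raw count.
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
vocab = sorted(set().union(*counts.values()))

def idf(term):
    """Inverse document frequency: log(N / number of documents containing term)."""
    doc_freq = sum(1 for c in counts.values() if term in c)
    return math.log(len(counts) / doc_freq)

def tfidf_vector(name):
    """Relative term frequency times IDF, in vocabulary order."""
    c = counts[name]
    total = sum(c.values())
    return [c[t] / total * idf(t) for t in vocab]

def cosine_distance(u, v):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

vectors = {name: tfidf_vector(name) for name in docs}
dist_ab = cosine_distance(vectors["treatise_a"], vectors["treatise_b"])
dist_ac = cosine_distance(vectors["treatise_a"], vectors["sermon_c"])
```

Because "the" occurs in every document, its IDF is zero and it drops out of the comparison. The two treatises then end up closer to each other than either is to the sermon, which shares no other terms with them.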

Week 4 (October 5): Clustering and classification

Practicum: Try a variety of unsupervised clustering methods (e.g., k-means) and dimensionality reduction methods (e.g., PCA) if you did not do so last week. Using the caret package, build a supervised classifier (trying various machine learning methods) to predict some aspect of your texts. Use the corpus metadata for your classification labels.
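As a reminder of what k-means does, the following sketch alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its points. It is plain Python for illustration only; in the practicum you would use R's own implementations. The points, the hand-picked starting centroids (chosen so the result is deterministic), and the function name are all hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign points to the nearest centroid, then update centroids."""
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    labels = [min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return labels, centroids

# Hypothetical two-dimensional points, e.g. documents after dimensionality reduction.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

Real analyses usually start from random centroids and run several restarts, since the result depends on the starting positions.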

Week 5 (October 19): Text reuse

Practicum: Using either the textreuse or LSHR packages, look for reused passages in the corpus you are working with. Create clusters or networks of those reuses.
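The general idea behind detecting reuse can be sketched by breaking each document into overlapping word n-grams ("shingles") and comparing the sets with Jaccard similarity. The example below is plain Python and only illustrates the idea; it is not the implementation in textreuse or LSHR, and it omits the hashing techniques that make such comparisons feasible on large corpora. The passages and function names are hypothetical.

```python
def shingles(text, n=3):
    """Return the set of overlapping word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical passages: the second borrows a long phrase from the first.
source = ("no person shall be deprived of life liberty or property "
          "without due process of law")
borrower = ("the court held that no person shall be deprived of life "
            "liberty or property by such a statute")
unrelated = "the weather in virginia was mild that autumn"

score_reuse = jaccard(shingles(source), shingles(borrower))
score_none = jaccard(shingles(source), shingles(unrelated))
```

Pairs whose score exceeds some threshold can then be treated as edges in a network of borrowings, which is one way to approach the clustering step.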

Week 6 (November 2): Topic modeling

Practicum: Using the textmineR package, create topic models of the corpus that you are working with. Make plots of changes in topics over time.

Week 7 (November 16): Word embedding models

Practicum: Train a word embedding model on your corpus, or train several models on chronological or thematic subsets of your corpus. Use the model(s) to find distinctive uses of language. More ambitious: try some of the more cutting-edge algorithms that take word vectors as inputs, for example, for finding document distances.

Week 8 (November 30): Named-entity recognition

Practicum: Using the openNLP or coreNLP packages, run named-entity recognition on your corpus. Plot the entities over time or space.

Assignments and expectations

Come prepared to discuss the readings assigned for each meeting. There are several kinds of reading. Works on legal history set the stage for the corpus we are investigating, so that you know what questions are worth asking. If you decide to work on a corpus other than MOML (The Making of Modern Law), plan on substituting your own readings for works in that category. Works on other historical subjects demonstrate how the methods we are studying can be applied to historical or literary historical questions. The methodological pieces, well, explain how the methods work. Don’t get hung up on the details of implementation: read to understand what transformation the method accomplishes. Some of the methodological pieces are more practical, while others provide the theoretical basis. There is much more to read than I can assign, but the references below (also in a Zotero group library) contain many additional works.

Before each meeting, you will create an R Markdown notebook which—in prose, code, and figures—explores your chosen corpus of texts using the specified methodology. Get as far as you can with each method. You are always welcome to share code and ideas with other people in the class, though each person must turn in his or her own notebook. Submit all of the notebooks for the semester in a single GitHub repository. Name each notebook something sensible, like 08-named-entity-recognition.nb.html.

At the end of the semester, you will write a conference paper which makes a historical argument on the basis of the corpus you have chosen. Submit the paper in a separate GitHub repository. Prepare a one-page proposal by week 4, share a one-page statement of progress and problems by week 7, and submit the draft by 5 p.m. on December 16.

Grades will be assigned with 50% of the weight given to completing the readings and notebooks, and 50% to the final paper.

Acknowledgments and fine print

See the George Mason University catalog for general policies, as well as the university statement on diversity. You are expected to know and follow George Mason’s policies on academic integrity and the honor code. If you are a student with a disability and you need academic accommodations, please see me and contact the Office of Disability Services at 703-993-2474 or through their website. All academic accommodations must be arranged through that office.

References

Archer, Dawn, ed. What’s in a Word-List?: Investigating Word Frequency and Keyword Extraction. Ashgate, 2009.

Arnold, Taylor, and Lauren Tilton. Humanities Data in R. Springer, 2015. http://link.springer.com/10.1007/978-3-319-20702-5.

Binder, Jeffrey M. “Alien Reading: Text Mining, Language Standardization, and the Humanities.” In Debates in the Digital Humanities 2016, edited by Matthew K. Gold and Lauren F. Klein, 201–17. University of Minnesota Press, 2016. http://dhdebates.gc.cuny.edu/debates/text/69.

Blevins, Cameron. “Space, Nation, and the Triumph of Region: A View of the World from Houston.” Journal of American History 101, no. 1 (June 1, 2014): 122–147. doi:10.1093/jahist/jau184.

Bryan, Jenny. Happy Git and GitHub for the useR, 2016. http://happygitwithr.com/.

Cohen, Dan. “Searching for the Victorians,” October 4, 2010. http://www.dancohen.org/2010/10/04/searching-for-the-victorians/.

Cordell, Ryan. “Reprinting, Circulation, and the Network Author in Antebellum Newspapers.” American Literary History 27, no. 3 (September 1, 2015): 417–445. doi:10.1093/alh/ajv028.

Dalgaard, Peter. Introductory Statistics with R. Statistics and Computing. Springer, 2008. http://link.springer.com/10.1007/978-0-387-79054-1.

Flanders, Julia, and Fotis Jannidis. “Data Modeling.” In A New Companion to the Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth, 229–37. Wiley Blackwell, 2016.

Fraas, Mitch, and Benjamin Schmidt. “Mapping the State of the Union.” The Atlantic (January 18, 2015). http://www.theatlantic.com/politics/archive/2015/01/mapping-the-state-of-the-union/384576/.

Friedman, Lawrence M. A History of American Law. 2nd ed. New York: Simon & Schuster, 1985.

Gavin, Michael A. “The Arithmetic of Concepts: A Response to Peter de Bolla.” Modeling Literary History, September 18, 2015. http://modelingliteraryhistory.org/2015/09/18/the-arithmetic-of-concepts-a-response-to-peter-de-bolla/.

Gold, Matthew K., Lauren F. Klein, Stephen Ramsay, Ted Underwood, Tanya E. Clement, Lisa Marie Rhody, Tressie McMillan Cottom, Benjamin M. Schmidt, Joanna Swafford, and Alan Liu. “Forum: Text Analysis at Scale.” In Debates in the Digital Humanities 2016, 525–568. University of Minnesota Press, 2016. http://dhdebates.gc.cuny.edu/debates/text/93.

Goldstone, Andrew, and Ted Underwood. “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us.” New Literary History 45, no. 3 (2014): 359–384. doi:10.1353/nlh.2014.0025.

Graham, Shawn, Ian Milligan, and Scott Weingart. Exploring Big Historical Data: The Historian’s Macroscope. Imperial College Press, 2015.

Hitchcock, Tim, and William J. Turkel. “The Old Bailey Proceedings, 1674–1913: Text Mining for Evidence of Court Behavior.” Law and History Review 34, no. 4 (August 2016): 1–27. doi:10.1017/S0738248016000304.

Hoeflich, Michael H. Legal Publishing in Antebellum America. New York: Cambridge University Press, 2010.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, 2013.

Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, 2013.

———. Text Analysis with R for Students of Literature. Springer, 2014. http://link.springer.com/10.1007/978-3-319-03164-4.

Jockers, Matthew L., and Ted Underwood. “Text-Mining the Humanities.” In A New Companion to the Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth, 291–306. Wiley Blackwell, 2016.

Knox, Doug. “Understanding Regular Expressions.” Programming Historian (June 22, 2013). http://programminghistorian.org/lessons/understanding-regular-expressions.

Kuhn, Max, and Kjell Johnson. Applied Predictive Modeling. Springer, 2013.

Kusner, Matt J., Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. “From Word Embeddings to Document Distances.” In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), 957–966, 2015. http://www.jmlr.org/proceedings/papers/v37/kusnerb15.pdf.

Leskovec, Jure, Anand Rajaraman, and Jeff Ullman. Mining of Massive Datasets. 2nd ed. Cambridge University Press, 2014. http://www.mmds.org/.

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331, no. 6014 (January 14, 2011): 176–182. doi:10.1126/science.1199644.

Mikolov, T., and J. Dean. “Distributed Representations of Words and Phrases and Their Compositionality.” Advances in Neural Information Processing Systems (2013). https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

Milligan, Ian. “Automated Downloading with Wget.” Programming Historian (June 27, 2012). http://programminghistorian.org/lessons/automated-downloading-with-wget.

Moretti, Franco. Distant Reading. Verso, 2013.

———. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso, 2005.

Nelson, Robert K., and Digital Scholarship Lab, University of Richmond. “Mining the Dispatch,” 2011. http://dsl.richmond.edu/dispatch/.

Newman, David J., and Sharon Block. “Probabilistic Topic Decomposition of an Eighteenth-Century American Newspaper.” Journal of the American Society for Information Science and Technology 57, no. 6 (2006): 753–767. http://onlinelibrary.wiley.com/doi/10.1002/asi.20342/full.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “GloVe: Global Vectors for Word Representation.” In EMNLP, 14:1532–43, 2014. http://nlp.stanford.edu/pubs/glove.pdf.

Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. University of Illinois Press, 2011.

Robertson, Stephen. “Searching for Anglo-American Digital Legal History.” Law and History Review 34, no. 4 (August 2016). doi:10.1017/S0738248016000389.

———. “Signs, Marks, and Private Parts: Doctors, Legal Discourses, and Evidence of Rape in the United States, 1823–1930.” Journal of the History of Sexuality 8, no. 3 (1998): 345–388. http://www.jstor.org/stable/3704870.

———. “The Differences Between Digital Humanities and Digital History.” In Debates in the Digital Humanities 2016, edited by Matthew K. Gold and Lauren F. Klein, 289–307. University of Minnesota Press, 2016. http://dhdebates.gc.cuny.edu/debates/text/76.

Schmidt, Benjamin. “Age Cohort and Vocabulary Use.” Sapping Attention, April 11, 2011. http://sappingattention.blogspot.com/2011/04/age-cohort-and-vocabulary-use.html.

———. “Rejecting the Gender Binary: A Vector-Space Operation,” October 30, 2015. http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html.

———. “Vector Space Models for the Digital Humanities,” October 25, 2015. http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html.

———. “Women in the Libraries.” Sapping Attention, May 8, 2012. http://sappingattention.blogspot.com/2012/05/women-in-libraries.html.

Schmidt, Benjamin, and Mitch Fraas. “The Language of the State of the Union.” The Atlantic (January 18, 2015). http://www.theatlantic.com/politics/archive/2015/01/the-language-of-the-state-of-the-union/384575/.

Sculley, D., and Bradley M. Pasanek. “Meaning and Mining: The Impact of Implicit Assumptions in Data Mining for the Humanities.” Literary and Linguistic Computing 23, no. 4 (2008): 409–424. http://llc.oxfordjournals.org/content/23/4/409.short.

Shotts, William. The Linux Command Line. 3rd internet ed. No Starch Press, 2016. http://linuxcommand.org/tlcl.php.

Silge, Julia, and David Robinson. Tidy Text Mining in R, 2016. http://tidytextmining.com/.

Simpson, A. W. B. “The Rise and Fall of the Legal Treatise: Legal Principles and the Forms of Legal Literature.” The University of Chicago Law Review 48, no. 3 (1981): 632–679. doi:10.2307/1599330.

Sinclair, Stéfan, and Geoffrey Rockwell. “Text Analysis and Visualization: Making Meaning Count.” In A New Companion to the Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth, 274–90. Wiley Blackwell, 2016.

Smith, David A., Ryan Cordell, and Abby Mullen. “Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers.” American Literary History 27, no. 3 (September 1, 2015): E1–E15. doi:10.1093/alh/ajv029.

Smith, David A., Ryan Cordell, Elizabeth Maddock Dillon, Nick Stramp, and John Wilkerson. “Detecting and Modeling Local Text Reuse.” In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, 183–192. IEEE Press, 2014. http://dl.acm.org/citation.cfm?id=2740800.

Underwood, Ted. “Seven Ways Humanists Are Using Computers to Understand Text.” The Stone and the Shell, June 4, 2015. https://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/.

———. “The Literary Uses of High-Dimensional Space.” Big Data & Society 2, no. 2 (December 1, 2015): 2053951715602494. doi:10.1177/2053951715602494.

———. “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago.” Representations 127, no. 1 (August 1, 2014): 64–72. doi:10.1525/rep.2014.127.1.64.

Welke, Barbara Young. Law and the Borders of Belonging in the Long Nineteenth Century United States. Cambridge University Press, 2010.

Wickham, Hadley. Advanced R. Chapman & Hall, 2014. http://adv-r.had.co.nz/.

Wickham, Hadley, and Garrett Grolemund. R for Data Science. O’Reilly, 2016. http://r4ds.had.co.nz/.

Witmore, Michael. “Text: A Massively Addressable Object.” In Debates in the Digital Humanities 2012. University of Minnesota Press, 2012. http://dhdebates.gc.cuny.edu/debates/text/28.

Xu, Shaobin, David A. Smith, Abigail Mullen, and Ryan Cordell. “Detecting and Evaluating Local Text Reuse in Social Networks.” ACL 2014 (2014): 50. http://www.aclweb.org/website/old_anthology/W/W14/W14-27.pdf#page=62.