Aug. 27: Unix as a Way of Life
How to interact with your computer and run programs through the command-line interface. You will also learn a philosophy for writing programs.
- Mike Gancarz, Linux and the Unix Philosophy, chs. 1–8, focusing on the ten tenets.
- William E. Shotts Jr., The Linux Command Line: A Complete Introduction. Most of this book is a reference source, but familiarize yourself at a minimum with chapters 2 (navigation), 4 (file manipulation), 5 (commands), 6 (redirection), 10 (processes), 11 (environment). Nearly all of what Shotts writes about Linux will apply to the Unix terminal in Mac OS X.
Try out all the Unix style commands in your terminal.
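As a warm-up, here is a short session touching several of the chapter topics from Shotts (navigation, file manipulation, redirection, pipes); the file names are invented for the demonstration.

```shell
# Navigation: make a scratch directory and move into it.
mkdir -p demo && cd demo
printf 'apple\nbanana\napple\n' > fruit.txt   # redirection: write a file
cp fruit.txt fruit-backup.txt                 # file manipulation: copy it
grep 'apple' fruit.txt | wc -l                # pipes: count matching lines
sort fruit.txt | uniq -c                      # combine small programs, Unix-style
```

Notice that each command does one small job, and the pipe (`|`) chains them together: exactly the philosophy Gancarz describes.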
Before class, do your best to get the following installed:
- A text editor of your choice: Sublime Text, TextWrangler, Atom, and Vim are all solid choices.
- Google Chrome
- Homebrew (if you’re on a Mac)
- Git (through Homebrew on a Mac; through the package manager on Linux)
- Node.js (through Homebrew on a Mac; through a package manager on Linux)
- R language
- R Studio Desktop
If you are on a Mac, you should install Homebrew and any necessary dependencies as you go along. If you are on some kind of Linux machine, then probably everything you need is in your package manager. If you are on a Windows PC, you should install Ubuntu 14.04 LTS inside VirtualBox using Vagrant. Follow this tutorial on Vagrant, substituting
Sept. 3: Version Control and Reproducible Research
Version control lets you contribute to projects and distribute your code. GNU Make helps automate and reproduce your results.
- Work through GitHub’s interactive tutorial for Git.
- GitRef on basics and branching and merging; GitHub’s tutorial on pull requests (video).
- Look at Scott Chacon, Pro Git, especially chs. 1–3, 5, for reference.
- Read Karl Broman’s lectures about reproducible research: introduction; command line; version control.
- Read Mike Bostock, “Why Use Make”.
- Read the documentation for GNU Make.
- At least one day before class, submit a pull request to the repository for this syllabus. The pull request should modify the list of participants (`source/participants.md`) to add your name with a link to your personal website, as well as your GitHub user name and a link to your GitHub user profile. Feel free to include your Twitter user name and link if you like. (A guide to Markdown if you need it.)
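If the branch-and-commit steps behind a pull request are new to you, here is a local dry run. The repository, file, and names are all placeholders; for the real assignment you would fork and clone the syllabus repository instead of running `git init`.

```shell
# Set up a throwaway repository to practice in.
cd "$(mktemp -d)"
git init -q syllabus && cd syllabus
git config user.email "you@example.com"
git config user.name "Your Name"

# Simulate the existing participants file and an initial commit.
printf '# Participants\n' > participants.md
git add participants.md && git commit -q -m "Initial commit"

# The pull-request workflow: new branch, edit, commit.
git checkout -q -b add-my-name
printf -- '- Your Name (https://github.com/you)\n' >> participants.md
git add participants.md && git commit -q -m "Add my name to participants"
git log --oneline
```

On GitHub, you would then push the `add-my-name` branch to your fork and open the pull request from there.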
- Create a minimal Makefile. This Makefile should take a text file (provided by you) and find and replace words of your choosing, writing the result to a new text file. (Hint: `sed 's/foo/bar/g' input-file.txt > output-file.txt` replaces all instances of `foo` with `bar` and redirects standard output to a file.) The Makefile should also put the time stamp for when the output file was generated at the bottom of the file. (Hint: in your shell the `>>` operator appends to a file; there is also a command to get the current time.) Can you rewrite the Makefile so that it uses rules? So that it uses special targets? So that it works on several text files at once? On an arbitrary number of files? So that it uses a default rule? Post your Makefile and input text files to GitHub.
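Before wiring the hints into a Makefile, it helps to run the two commands by hand. The file names and the foo/bar pair below are placeholders; substitute your own words.

```shell
# Create a sample input file.
printf 'foo baz foo\n' > input-file.txt

# Hint 1: replace every foo with bar, redirecting stdout to a new file.
sed 's/foo/bar/g' input-file.txt > output-file.txt

# Hint 2: append a time stamp to the bottom of the output file.
date >> output-file.txt

cat output-file.txt
```

Once both commands work interactively, a Makefile rule is just these same lines with `output-file.txt` as the target and `input-file.txt` as the prerequisite.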
- Create separate `.js` files with the solutions for each exercise in these chapters, and post them to GitHub.
An introduction to how data is structured, and to the object-oriented style of programming for modeling data.
- Read EJ, ch. 4, ch. 6.
- Browse the documentation for the DPLA’s API and sign up for an API key.
- Browse the API for the American Converts Database, especially the items page, as well as the Omeka REST API documentation.
- Create separate `.js` files with the solutions for each exercise in these chapters, and post them to GitHub.
Sept. 24: Introduction to R / Grammar of Graphics in R
We learn our second programming language and begin to make real visualizations.
- Watch the Google Developers’ introduction to R. You might also like R Twotorials.
- For a more thorough introduction to R, read the opening chapters of Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design (No Starch Press, 2011) or of Hadley Wickham, Advanced R.
- Read Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis (Springer, 2009), chs. 1–5. For the theory behind ggplot, look at Leland Wilkinson, The Grammar of Graphics, 2nd ed. (Springer, 2005). You may find Winston Chang, R Graphics Cookbook, appendix A, chs. 1–4, a useful introduction to ggplot.
- Browse the ggplot2 documentation.
- Experiment with ggplot2 in R Studio as you read the assigned books.
- Find a historical data set and make as many different kinds of charts with it as you can. (Some of them should be bad charts or unhelpful charts.) Annotate the charts in RMarkdown and Knitr (guide here). Post the code to GitHub and the document to RPubs.
Oct. 1: Manipulating Data in R
Data seldom comes in the format we need it: this is how to munge it into a useful form.
- Watch Hadley Wickham, “Tidy Data and Tidy Tools,” NYC Open Statistical Computing Meetup, Dec. 2011.
- Read Hadley Wickham, “Tidy Data,” Journal of Statistical Software 59, no. 10 (2014).
- Browse documentation for tidyr and dplyr. Be aware of the more full-featured packages reshape2 and plyr.
- Skim Hadley Wickham, “Reshaping Data in R,” Statistical Computing and Graphics 16, no. 2 (Dec. 2005): 5–8.
- Skim Hadley Wickham, “The Split-Apply-Combine Strategy for Data Analysis,” Journal of Statistical Software 40, no. 1 (Apr. 2011): 1–29.
- You may find Seth van Hooland, Ruben Verborgh, and Max De Wilde, “Cleaning Data with OpenRefine,” to be helpful.
- In the `data-raw` directory of the historydata package, there are several raw data files stored in untidy formats as CSV files. I have transformed these into tidy data in the actual package. In other words, loading `sarna.csv` gives you different results than loading the package and accessing the `sarna` dataset. Try turning those untidy datasets into tidy datasets that match the versions actually in the package, using dplyr and tidyr. (Start with `sarna.csv`.) You can see how I have done this by looking at the corresponding R files in `data-raw`.
- In the `data-raw` directory, there is a file, `nhgis0011_ts_state.csv`, which has counts of the state populations. Can you use this data with the `summarize()` function to create counts of the national population for each census year? (In other words, can you `sum()` up the state populations for each year?)
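The state-to-national exercise above can be sketched in a few lines of dplyr. The column names `year` and `population` are guesses; check `names()` on the actual CSV before running this.

```r
# Sketch of the summarize() exercise, assuming columns `year` and `population`.
library(dplyr)

nhgis <- read.csv("data-raw/nhgis0011_ts_state.csv")

national <- nhgis %>%
  group_by(year) %>%                                        # one group per census year
  summarize(population = sum(population, na.rm = TRUE))     # sum the states
```

The `group_by()` / `summarize()` pair is the split-apply-combine pattern from the Wickham reading: split by year, apply `sum()`, combine into one row per year.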
- Now take some dataset of your own. Can you turn it into a tidy dataset? Can you clean the data as necessary? Can you use all seven data manipulation verbs on your data? The verbs include `spread()`. (There is also a family of verbs that fall under the category of joins: in dplyr this includes `left_join()`; the base R function is `merge()`. We'll deal with these later.) What new visualizations can you make?
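For reference, the two join idioms mentioned above look like this on a pair of invented data frames:

```r
library(dplyr)

people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Ben", "Cal"))
cities <- data.frame(id = c(1, 3), city = c("Boston", "Chicago"))

# dplyr: keep every row of `people`, matching cities where possible.
left_join(people, cities, by = "id")

# Base R equivalent: all.x = TRUE makes merge() behave like a left join.
merge(people, cities, by = "id", all.x = TRUE)
```

Both return Ben with an `NA` city, since there is no matching row in `cities`.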
- The Biographical Directory of Federal Judges, 1789-present is an immensely interesting dataset, but also messy and untidy. Can you make it better?
Share your code on GitHub, and publish your results to RPubs.
Oct. 8: Spatial Analysis in R
How to make maps and perform other kinds of spatial analysis.
- Read Roger S. Bivand, Edzer Pebesma, and Virgilio Gómez-Rubio, Applied Spatial Data Analysis with R (Springer, 2013).
- Read Robin Lovelace and James Cheshire, “Introduction to Spatial Data and ggplot2,” Spatial.ly, Dec. 9, 2013.
- Select from the spatial data sets available to you, or find your own. Make maps. Publish your code to GitHub and your results to RPubs. Hint: if you use ggplot2 with a projected shapefile (i.e., a shapefile whose coordinates are stored in some coordinate reference system other than latitude and longitude), it will probably blow up. First convert the shapefile to EPSG 4326 (WGS 84).
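One way to do the conversion mentioned in the hint is with the sp and rgdal packages. The directory and layer names below are placeholders for your own shapefile.

```r
# Reproject a shapefile to WGS 84 (EPSG 4326) before handing it to ggplot2.
library(rgdal)   # loads sp as well

# readOGR(dsn, layer): dsn is the directory, layer is the shapefile name
# without the .shp extension. Both names here are placeholders.
shp <- readOGR(dsn = "shapefiles", layer = "counties")

shp_wgs84 <- spTransform(shp, CRS("+init=epsg:4326"))
```

After the transform, the object's coordinates are plain longitude and latitude, which is what ggplot2's map functions expect.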
Sara: OpenRefine
Oct. 15: More about R
This week we’ll learn whatever we haven’t covered about R that would be most helpful for your projects.
George: Web scraping
Oct. 22: Text Mining in R
How to do “distant reading,” document similarity, and other kinds of textual analysis.
- Read Matthew Jockers, Text Analysis with R for Students of Literature (Springer, 2014). You may also wish to consult Matthew Jockers, Macroanalysis (University of Illinois Press, 2013).
- Read Fred Gibbs, “Document Similarity with R.”
- Read the text mining and topic modeling sections of Shawn Graham, Ian Milligan, and Scott Weingart, The Historian’s Macroscope.
- Read the topic-modeling issue of the Journal of Digital Humanities 2, no. 1 (2012).
- Browse the documentation for MALLET and the mallet and tm R packages.
Choose one (or both) of the following, in either case posting your code to GitHub and your results to RPubs:
- In the nineteenth-century United States, there was a fierce debate over whether to codify laws. New York created several codes of civil procedure, which other states then borrowed. You will be given a handful of codes. Which codes borrowed from one another? What did they borrow? How can you visualize this? How can you browse the borrowings? What interpretations do you draw from this? You can clone this repository: the OCRed codes are in the `text/` directory. The RMarkdown files in the directory will provide some hints about how to proceed.
- You will be given a cleaned-up set of texts from the Oxford Movement’s Tracts for the Times (zipfile here). What do text mining and topic modeling tell you about these texts? You may substitute another corpus if you wish.
Peter: Image processing
Oct. 29: Network Analysis in R
How to measure and visualize networks of people, events, ideas, sources, you name it.
- Read Eric D. Kolaczyk and Gábor Csárdi, Statistical Analysis of Network Data with R (Springer, 2014).
- Read the networks sections of Shawn Graham, Ian Milligan, and Scott Weingart, The Historian’s Macroscope.
- Read Elijah Meeks, “More Networks in the Humanities.”
- Read Scott Weingart, “Demystifying Networks, Parts I & II,” Journal of Digital Humanities 1, no. 1 (2011).
- Browse the documentation for the statnet, sna, and network R packages.
You will be provided with some historical data suitable for network analysis, or you may bring your own. Do some network analysis with visualizations and interpretations.
Nov. 5: D3.js Concepts
The basics of a powerful visualization library for the web.
- Read Scott Murray, Interactive Data Visualization for the Web (O’Reilly, 2013).
- Read Mike Bostock, “Let’s Make a Bar Chart” parts 1–3, “Let’s Make a Map,” “Let’s Make a Bubble Map,” and “Thinking with Joins.”
- Browse the D3 documentation.
- Experiment with the examples in the D3 gallery.
- Using some suitable data set(s), create as many different kinds of D3 visualizations as you can manage. (These need not be complicated visualizations.) Can you add interactivity to them? What does interactivity add to the graphics? What does it take away?
Nov. 12: D3.js Applications
From D3 basics to D3 for history.
No assigned reading, but you may find Elijah Meeks, D3.js in Action (Manning, 2014) useful for advanced D3.
Over the course of the semester we have written programs to do many kinds of analysis. Take one of the kinds of analysis that seems most promising for your work, and translate it to the web using D3. Create the most sophisticated (not flashy) visualization that you can, and embed it in an interpretation or narrative. Use the principles of reproducible research as appropriate.
Nov. 19: Workshop day / TBD
This week we will work collaboratively on the projects for the course. We may also cover additional topics such as web applications and frameworks (Ruby on Rails, Sinatra, Node.js); programming practices such as debugging, refactoring, and testing; other programming languages (Python, Ruby, PHP); basic statistics of use to historians; or other topics relevant to your research.
Nov. 26: No class
Dec. 3: Project Presentations
You will present your final projects, with an emphasis on both their code and historical interpretations. Final projects are due by 6 p.m. on December 10.