Analyzing Historical History Dissertations: Cleaning the Data Again

This series on “Analyzing Historical History Dissertations” is a work in progress. If you would like to cite or link to it, please consider using the landing page, which will always have the most up-to-date information and a list of all the posts.

After posts about the length of history dissertations, university by university and for the American academy as a whole, along with a map of where history dissertations have been written, this post will be much less interesting. In this post I want to write about some decisions that I’ve made to further clean and tidy the data, and the consequent revisions to earlier visualizations. I’m writing about this process because I have conceived of this series of posts as being not just a presentation of my findings about history dissertations, but a set of reflections on the process of data analysis as performed by a historian. I hope this series will be pedagogically useful. Certainly no one ever taught me how to do this kind of data analysis in a history department, and it’s not hard to find scientists complaining that no one ever taught them how to do this kind of work either. So I hope this series of posts and the accompanying code is one resource among many, and in particular that it will be useful as a guide to the many steps in digging through a mid- to large-sized data set.

First, what do I mean by cleaning and tidying data? By cleaning data I mean standardizing the values of the data to make them more useful. For example, I have filtered out at least twenty thousand irrelevant dissertations from the data set. Cleaning might also mean performing an operation like standardizing the name of a university, or joining all the fields that hold paragraphs for an abstract into one field. By tidying data I mean what Hadley Wickham means in this fantastic journal article and talk, namely, changing the structure of data so that “each variable is a column, each observation is a row, and each type of observational unit is a table.” To put it a different way, tidying data is normalizing the data; cleaning the data is akin to proofreading the data.
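To make the distinction concrete, here is a minimal sketch in Python. It is not the code behind this series: the field names (`university`, `abstract_1`, `subjects`), the toy records, and the `CANONICAL` lookup are all assumptions made for illustration. The first half cleans values (standardizing a university name, joining abstract paragraphs into one field); the second half tidies structure (one row per dissertation-subject pair).

```python
# Toy records; field names and values are hypothetical, not the ProQuest schema.
raw = [
    {"id": 1, "university": "  Yale  University ",
     "abstract_1": "First paragraph.", "abstract_2": "Second paragraph.",
     "subjects": "History, United States; Religion, History of"},
    {"id": 2, "university": "YALE UNIVERSITY",
     "abstract_1": "Only paragraph.", "abstract_2": None,
     "subjects": "Biography"},
]

# Hypothetical lookup from a normalized spelling to one canonical name.
CANONICAL = {"yale university": "Yale University"}


def clean_university(name):
    """Collapse whitespace, then map known variants to a canonical name."""
    collapsed = " ".join(name.split())
    return CANONICAL.get(collapsed.lower(), collapsed)


def join_abstract(record):
    """Join the non-empty abstract_N fields into one string.

    Sorting the keys as strings is only correct for fewer than ten parts.
    """
    keys = sorted(k for k in record if k.startswith("abstract_"))
    return " ".join(record[k] for k in keys if record[k])


# Cleaning: same rows, better values.
cleaned = [
    {"id": r["id"],
     "university": clean_university(r["university"]),
     "abstract": join_abstract(r)}
    for r in raw
]

# Tidying: a separate table with one row per (dissertation, subject) pair.
subject_rows = [
    {"id": r["id"], "subject": s.strip()}
    for r in raw
    for s in r["subjects"].split(";")
    if s.strip()
]
```

After this, both records share the same university name, each has a single `abstract` field, and `subject_rows` can be filtered or counted by subject without parsing a packed string.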

So why do I need to clean and tidy the data after five posts? One possible model of how data analysis should proceed is that you first clean the data, then analyze. Another possible model is that you first set up the basic structure of the analysis, then clean the data to get the results that you want. In practice, learning from data is a process that iterates between analysis and cleaning. The process is iterative because you have to analyze the data before you know how the data must be cleaned and tidied, and you have to clean and tidy the data to make certain kinds of analysis available.

These are the kinds of tidying and cleaning that can happen at this stage.

First, mapping the dissertations made it obvious that the ProQuest data set contains very incomplete information for dissertations written outside North America. We may as well throw away those observations, since they are too incomplete to tell us anything useful. The data set also contains 4,608 dissertations from Canada. Looking at the quality of the data recorded for those Canadian dissertations, it seems that they are worth keeping. The deciding point for me is that the AHA Directory of History Dissertations also includes Canadian dissertations.
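As a rough sketch of that first filter, assuming each record carries a `country` field (a hypothetical name, as are the country strings below), the rule is simply to keep the United States and Canada and drop everything else, including records with no country at all:

```python
# Countries to keep; the exact strings are an assumption about the data.
KEEP_COUNTRIES = {"United States", "Canada"}


def keep_us_and_canada(records):
    """Return only records whose country is in KEEP_COUNTRIES.

    Records with a missing country are dropped as well.
    """
    return [r for r in records if r.get("country") in KEEP_COUNTRIES]


toy = [
    {"id": 1, "country": "United States"},
    {"id": 2, "country": "Canada"},
    {"id": 3, "country": "Germany"},
    {"id": 4},  # no country recorded
]
kept = keep_us_and_canada(toy)
```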

Second, looking closely at the list of universities shows that some, such as the “Yale University, School of Forestry and Environmental Studies,” represent the occasional dissertation that might be about history but is probably so far afield from the norm of a dissertation in a history department that it may as well be excluded. Most of these will be dropped anyway when we filter the dissertations by subject.

Now we have a more difficult decision to make, since we have to decide what counts as a dissertation in history. There is a strong argument to be made that the best definition is a dissertation written in a history department. Unfortunately, since only 7.64 percent of the dissertations in this data set identify their department, it is impossible to use that definition. But there are intellectual justifications for a more capacious definition. Certainly in my own field of American religious history, many if not most of the dissertations are written by people in religious studies departments, divinity schools, Judaic Studies departments, and the like. Economic history, particularly of a quantitative bent, has mostly moved to economics departments (though history of capitalism is making a roaring comeback in history departments). So I have no problem with using the ProQuest subject headings to study a broader definition of what a history dissertation is.
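A figure like that department percentage comes from a very simple check: count the records where a field is filled in and divide by the total. A minimal sketch, with a hypothetical `department` field and toy records rather than the real data:

```python
def share_with_field(records, field):
    """Percentage of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field))
    return 100 * present / len(records)


toy = [
    {"id": 1, "department": "History"},
    {"id": 2, "department": ""},
    {"id": 3, "department": None},
    {"id": 4},
]
pct = share_with_field(toy, "department")
```

Running the same check on every candidate field is a cheap way to find out which definitions the data can support before committing to one.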

The sticking point for me is whether to include dissertations with the subject “Education, History of” as dissertations in history. I find persuasive the argument that such dissertations are usually written in schools of education rather than colleges of arts and sciences, are therefore held to different norms, and so should be excluded from this study. But we can also test the hypothesis that they are held to different norms.

Here is a chart of the page lengths of history of education dissertations vs all other kinds of history dissertations.

There is a very wide divergence between the history of education and the other history dissertations, so I’ve decided to reverse my earlier decision and exclude them from the data set.
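The comparison behind a chart like this amounts to splitting the dissertations into two groups and summarizing page counts per year for each. Here is a minimal sketch using the median; the field names (`year`, `subjects`, `pages`) are assumptions, and the page counts are invented toy numbers, not results from the data set.

```python
from collections import defaultdict
from statistics import median

EDUCATION = "Education, History of"


def median_pages_by_group(records):
    """Median page count keyed by (year, group).

    The group is "education" when the record carries the history of
    education subject heading, and "other" otherwise.
    """
    groups = defaultdict(list)
    for r in records:
        label = "education" if EDUCATION in r["subjects"] else "other"
        groups[(r["year"], label)].append(r["pages"])
    return {key: median(pages) for key, pages in groups.items()}


# Invented numbers, only to show the shape of the computation.
toy = [
    {"year": 1990, "subjects": ["Education, History of"], "pages": 180},
    {"year": 1990, "subjects": ["Education, History of"], "pages": 220},
    {"year": 1990, "subjects": ["History, United States"], "pages": 300},
    {"year": 1990, "subjects": ["History, United States"], "pages": 340},
    {"year": 1990, "subjects": ["History, United States"], "pages": 400},
]
medians = median_pages_by_group(toy)
```

Swapping the `EDUCATION` constant for another subject heading gives the same comparison for religion, science, economics, or biography.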

There is no such divergence for history of religion dissertations, so they can stay.

History of science has a bizarre trajectory compared to every other kind of history. (Historians of science, any ideas what is going on here?) But even though historians of science are often in different departments, they can stay in the data set.

History of economics diverges from general history (though not as sharply as history of education). I’m betting that’s because such dissertations are more likely to use quantitative methods, so they can stay.

Biography also seems close to the profession in general.

As a last step, I’m dropping all universities with fewer than five dissertations that meet all these criteria. Looking through these universities, many of their dissertations seem like false positives, so this cut-off should clean up problems in the long tail.
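This cut-off is a two-pass operation: count the dissertations per university, then keep only the records whose university meets the threshold. A minimal sketch, assuming a hypothetical `university` field and toy records:

```python
from collections import Counter

MIN_DISSERTATIONS = 5


def drop_small_universities(records, minimum=MIN_DISSERTATIONS):
    """Keep records from universities with at least `minimum` dissertations."""
    counts = Counter(r["university"] for r in records)
    return [r for r in records if counts[r["university"]] >= minimum]


# University A has five dissertations, University B only four.
toy = (
    [{"university": "University A"} for _ in range(5)]
    + [{"university": "University B"} for _ in range(4)]
)
kept = drop_small_universities(toy)
```

The order matters: the count has to be taken after the subject and country filters, since the threshold applies to dissertations that meet all the other criteria.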

Making these changes to the data set gives us fewer dissertations: 84,428 dissertations before, 80,575 after cleaning, which is 3,853 fewer dissertations. It’s a messy process, but then the data is messy. Keep that in mind the next time you’re impressed by a slick visualization.
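It is worth recording how much each round of cleaning removes. Using the two totals from this post, a quick check of the arithmetic and of the share of the data set that was dropped might look like this (the variable names are my own):

```python
before = 84428  # dissertations before this round of cleaning
after = 80575   # dissertations after cleaning

dropped = before - after
share_dropped = 100 * dropped / before

print(f"Dropped {dropped} of {before} dissertations ({share_dropped:.2f}%)")
```

So this round of cleaning removed a little under 5 percent of the records.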


Now that the data is more thoroughly cleaned, I have redone some of the more significant visualizations from the earlier posts.

First, the number of dissertations per year.

Second, the mean and median page counts of history dissertations. These are slightly higher after cleaning the data more thoroughly.

Third, a map of the locations where history dissertations were written.




Finally, a redo of all the page counts by university. (Because nothing makes a post popular like giving every historian in North America a little blue dot of their own to look for.)


Columbia University.

The University of Wisconsin – Madison.

University of California, Berkeley.

Yale University.

University of California, Los Angeles.

Harvard University.

New York University.

University of Michigan.

University of Pennsylvania.

The University of Chicago.

Princeton University.

Indiana University.

University of Minnesota.

Stanford University.

The University of Texas at Austin.

The University of North Carolina at Chapel Hill.

University of Virginia.

The Ohio State University.

Duke University.

The Johns Hopkins University.

University of Illinois at Urbana-Champaign.

Cornell University.

City University of New York.

Northwestern University.

University of Washington.

Rutgers The State University of New Jersey – New Brunswick.

University of Maryland, College Park.

University of California, Santa Barbara.

The University of Iowa.

Emory University.

The Florida State University.

Brown University.

University of Kansas.

University of Pittsburgh.

Temple University.

Michigan State University.

University of Notre Dame.

Boston University.

Georgetown University.

University of Southern California.

Brandeis University.

University of Missouri – Columbia.

State University of New York at Binghamton.

The Pennsylvania State University.

State University of New York at Buffalo.

Vanderbilt University.

University of California, San Diego.

The University of Arizona.

The Catholic University of America.

University of Colorado at Boulder.

Syracuse University.

University of Florida.

Saint Louis University.

The University of New Mexico.

The Claremont Graduate University.

The George Washington University.

Tulane University.

University of Massachusetts Amherst.

University of California, Irvine.

State University of New York at Stony Brook.

University of Rochester.

University of Kentucky.

University of South Carolina.

University of Delaware.

University of Hawai’i at Manoa.

The University of Oklahoma.

University of California, Davis.

Arizona State University.

University of Connecticut.

University of Oregon.

Louisiana State University and Agricultural & Mechanical College.

University of Georgia.

American University.

Washington University in St. Louis.

Boston College.

The University of Nebraska – Lincoln.

University of California, Riverside.

Case Western Reserve University.

Drew University.

Fordham University.

Texas A&M University.

The University of Utah.

Rice University.

Washington State University.

Kent State University.

University of Illinois at Chicago.

Purdue University.

University of Cincinnati.

Loyola University Chicago.

Howard University.

Texas Christian University.

Ohio University.

Marquette University.

The University of Tennessee.

Boston University Graduate School.

The University of Alabama.

Graduate Theological Union.

The College of William and Mary.

Bryn Mawr College.

University of California, Santa Cruz.

University of North Texas.

The Southern Baptist Theological Seminary.

West Virginia University.

University of Houston.

Bowling Green State University.

University of Denver.

Texas Tech University.

Auburn University.

State University of New York at Albany.

Northern Illinois University.

The Union Institute.

Southern Illinois University at Carbondale.

Southwestern Baptist Theological Seminary.

Wayne State University.

The University of Mississippi.

Miami University.

New School for Social Research.

Princeton Theological Seminary.

Oklahoma State University.

The University of Maine.

University of Arkansas.

University of Miami.

George Mason University.

Georgia State University.

Carnegie Mellon University.

Fuller Theological Seminary.

Mississippi State University.

University of New Hampshire.

Baylor University.

Kansas State University.

Clark University.

Iowa State University.

The University of Toledo.

The University of Southern Mississippi.

Ball State University.

Union Institute and University.

New School University.

Lehigh University.

The University of Wisconsin – Milwaukee.

The University of Texas at Arlington.

Tufts University.

Brigham Young University.

The University of Memphis.

Northern Arizona University.

The University of Texas at Dallas.

Union Theological Seminary.

Southern Methodist University.

Virginia Polytechnic Institute and State University.

Westminster Theological Seminary.

Duquesne University.

Western Michigan University.

University of Idaho.

California Institute of Integral Studies.

Regent University.

The University of Akron.

The Jewish Theological Seminary of America.

University of Missouri – Kansas City.

Florida International University.

Pacifica Graduate Institute.

University of Nevada, Reno.

Virginia Commonwealth University.

The University of North Carolina at Greensboro.

Trinity Evangelical Divinity School.

New Orleans Baptist Theological Seminary.

University of Louisville.

Hebrew Union College – Jewish Institute of Religion (Ohio).

Georgia Institute of Technology.

Andrews University.

Northeastern University.

Southeastern Baptist Theological Seminary.

Yeshiva University.

The Iliff School of Theology and University of Denver.

University of Nevada, Las Vegas.

Oregon State University.

Rensselaer Polytechnic Institute.

Mid-America Baptist Theological Seminary.

Memphis State University.

University of California, San Francisco.

Texas Woman’s University.

Colorado State University.

Salve Regina University.

Indiana University of Pennsylvania.

University of South Florida.

Walden University.

Lutheran School of Theology at Chicago.

Middle Tennessee State University.

University of Wyoming.

Capella University.

Dallas Theological Seminary.

University of New Orleans.