Please note that this book is a work in progress, the equivalent of a pre-alpha release. All the code should work, because if it didn’t the site could not be built. But there is still a lot of work to do to explain the historical methods under discussion. Feel free to leave feedback as issues on the GitHub repository, or to e-mail me.

Network analysis, like spatial analysis or text analysis, is one of the major types of digital history work that you might want to perform. A network is simply a representation of the relationships between objects: people, places, events, etc. The objects we call vertices; the relationships between them we call edges. Take the following (randomly generated) example. The nine vertices are represented by blue dots. The ten relationships are represented by light gray lines.

Leaving aside for a moment our knowledge that the graph is randomly generated, what might this kind of a network mean? The vertices (blue dots) could represent people, and the edges (gray lines) could indicate kinship, or shared membership in an organization, or friendships on Facebook, or correspondence. Or the vertices could represent organizations, and their relationships could be funding, or political ties. Or perhaps the vertices are texts, and the edges indicate shared text, or ownership. Virtually anything, that is, could be signified by this plot of the network, just as the lines in a bar chart could represent virtually any quantity.

Therein lies the problem. If virtually any relationship can be represented in a network, then a network plot might just as easily mean nothing as it means anything. This might seem overly pessimistic, given how readily most people accept bar charts and other quantitative visualization. But it is really a question of literacy and convention that distinguishes network analysis from quantitative visualization. We have had several centuries to get used to “reading” the conventions of bar charts, and it is all too easy even there to violate best practices or inculcate false understandings. With network analysis, which is based on very old mathematics but is nevertheless a relatively new form of visualization, there is much greater possibility of creating meaningless visualizations which relate objects to one another without any genuine or meaningful relationship.

Then too, networks are easy to plug in to Gephi or some other GUI tool, but they can be very difficult to visualize well. It is fairly easy to notice some groups in the plot above. Vertices 3, 5, 9, and 2, form one group, connected to another group of 7, 8, 1, and 4; vertex 6 stands alone. We might, for example, just as easily and as justifiably have laid the nodes out in a circle.

A change in the parameters for laying out the network, and it is far more difficult to interpret the plot. This problem becomes all the more acute when we have a network with a non-trival number of vertices. Here are plots of networks with ten, one hundred, and five hundred vertices each.

At even 500 vertices, which is still not that many, the graph collapses under its own weight.

Finally there is the worst of all network visualizations, created with both inadequate data and inadequate theorization. A scholar will find a collection of correspondence in an archive or collected works (worse yet, the selected works). The scholar will then make a plot of the correspondence “network” showing letters (edges = gray lines) between people (vertices = blue dots).

This network is true enough: person 1, whose collected works we have, did send letters to people 2 through 8, and they to him or her. But the network is meaningless because, upon a moment’s reflection, one realizes that a collection of correspondence gathered for an individual will by definition take this shape: it only collects letters to or from a single person. To make a meaningful network, one would need not just the correspondence of person 1, but of people 2 through 8, and of all the people they correspondend with to boot. It is still all too common to see naive visualizations.

But less these cautions be taken as too pessimistic, we will emphasize the flip side of the coin in the remainder of the chapter. Because network visualizations can show the relationships between any kinds of objects or people or institutions, they are extraordinarily powerful for historical work. For them to be useful, however, you must do two things. First, you must keep your wits as a historian about you, and ask good historical questions that can be investigated with good historical data. This principle is of course true of any kind of historical work, but it has been too often misplaced when using networks for history. Second, you must understand to some extent the mathematics behind networks. Networks are a part of the study of graph theory, a subfield within mathematics. You need not become an expert in graph theory, of course, though the more you work with networks the more it will become useful to you. (See the references for suggested reading about networks in general.) But you will have to pick up the terminology from graph theory and an understanding of what it means. You have already learned two terms: vertex and edge. You will learn several others in the course of this chapter. This terminology is necessary because frequently the best way to study a network is to extract meaningful subnetworks (which can be more easily visualized) or to compute summary statistics on the network.

This chapter, then, will introduce you to the basics of data which can be used for network analysis. Then it will show you how to plot that network in a variety of layouts. Next it will introduce a special type of graph: the bipartite graph. Finally, it will demonstrate a few basic ways to analyze graphs using the methods of graph theory; these including deriving implied relationships and making predictions.

R packages for networks

There are several packages which will allow you to perform network analysis. These include the network package, the sna (social network analysis) package, and the igraph package. We are going to prefer the igraph package since it is the most feature rich and since it is based on a C library by the same name. However it is worth being aware of the other packages. Since problems for graphs or networks fall into common patterns, it may be that one of the other packages implements a solution which will be easier for you to use without reinventing the wheel.

The historydata package contains several data sets suitable for network analysis. The tudors data set contains spousal and parent/child relationships for selected members of the Tudor dynasty. Since human relationships are familiar, this dataset is useful for experimenting with and checking our results. Also included is the judges_people and judges_appointments tables which provide a much larger dataset of networks between people and courts within the federal judiciary.

Once you have installed these packages, you can load them.

library(igraph)
library(dplyr)
library(historydata)

Format of network data

A network (to repeat) is comprised of vertices and edges. At a minimum, the data that represents a network must include a list of edges in the form of a mapping: node A is connected to node B, node B is connected to node C, and so on. This kind of mapping is easily stored in a CSV file or a data frame. We can generate a very simple mapping as a data frame.^[The data_frame() function is provided by dply, and it provides better default options than the base R function data.frame().

sample_data <- data_frame(node_1 = c("A", "B", "C"), 
                          node_2 = c("B", "C", "A"))
sample_data
## Source: local data frame [3 x 2]
## 
##   node_1 node_2
## 1      A      B
## 2      B      C
## 3      C      A

This data represents the edges (that is, the connections) but by implication it also represents the nodes, since they are also listed in the data. Using the igraph package, we can turn this data frame into a graph (or network).1

sample_graph <- graph.data.frame(sample_data)
sample_graph
## IGRAPH DN-- 3 3 -- 
## + attr: name (v/c)

The default display for an object of class igraph is not particular informative. But using the functions E() and V() we can extract the edges and vertices from the graph. The edges function simply repeats the data that we already knew:

E(sample_graph)
## Edge sequence:
##           
## [1] A -> B
## [2] B -> C
## [3] C -> A

But the vertices function gives us a list of the vertices derived from that network.

V(sample_graph)
## Vertex sequence:
## [1] "A" "B" "C"

Finally, we can plot this graph using the base R plot() function.

plot(sample_graph)

Note that it is possible to represent this same network in alternative forms. One common alternative is an adjacency matrix. In such a matrix, the rows and columns both represent all the vertices of the graph. In the values of the matrix, a non-negative value indicates a connection between the vertices. You will find that functions in other packages, e.g. packages for topic modeling or spatial analysis, will return adjacency matrixes which you can then use in network analysis.2

get.adjacency(sample_graph)
## 3 x 3 sparse Matrix of class "dgCMatrix"
##   A B C
## A . 1 .
## B . . 1
## C 1 . .

In many cases, however, you will want to include other information besides just the edges between vertices. You may wish to include information about the edges, such as the number of letters sent between people or the type of relationship that people have. You may also wish to include information about the vertices, such as the name of a person or institution. The information about the edges can be contained in the same data frame as the mappings from vertex to vertex; the information about the vertices must be contained in a separate data frame. This is easiest to demonstrate using a sample dataset of a kinship network.

An example network: The Tudors

In the historydata package is contained a data frame of kinship within the Tudor dynasty in early modern England. This small and manageable dataset will let us experiment with network visualization. To begin, notice that the dataset includes two columns listing the people who are connected, and a third column defining their relationship.

data(tudors)
tudors
## Source: local data frame [35 x 3]
## 
##          person_1            person_2 relationship
## 1       Henry VII   Elizabeth of York       spouse
## 2    Arthur Tudor Catharine of Aragon       spouse
## 3      Henry VIII Catharine of Aragon       spouse
## 4      Henry VIII         Anne Boleyn       spouse
## 5      Henry VIII        Jane Seymour       spouse
## 6      Henry VIII      Anne of Cleves       spouse
## 7      Henry VIII    Katherine Howard       spouse
## 8      Henry VIII      Catherine Parr       spouse
## 9  Margaret Tudor            James IV       spouse
## 10     Mary Tudor           Louis XII       spouse
## ..            ...                 ...          ...

We can turn this dataset into a graph. Notice in our sample A-B-C-D dataset above that arrows are included indicating the direction of the connection. Often it is desirable to have a network where it is only possible to travel in specified directions. For example, person A sent a letter to person B, but that does not imply that person B sent a letter to person A. In this case, kinship is necessarily reciprocal, and so we want an undirected graph, which we can specify as an argument.

tudors_g <- graph.data.frame(tudors, directed = FALSE)

Now that we have an igraph object, we can create a plot.

plot(tudors_g)
title("The Tudors, take one")

We shouldn’t expect that our first plot will be much good, though this plot is not awful. We can see some kind of relationship among the various people and our vertices don’t overlap too badly. But we can do much better. We should indicate which people were monarchs of England. In particular, we should code the kinds of relationships: a marriage is different than a parent/child relationship. And we could improve the layout of the graph so it reads more or less chronologically.

First we will indicate which people were monarchs of England. This is information about our vertices, and so must be contained in a separate data frame from the information about edges that we loaded from the historydata package. If we had a large dataset, the information might we contained in a separate CSV file. But in this small dataset, we will have to construct it for ourself. The V() function will give us a list of all the people represented in our graph. It represents the data as integers, so we have to access the name value with the $ operator. We can associate that vector with a vector of TRUE or FALSE values for whether or not the person was a monarch. We might just as easily associate a person with a gender, a birth date, a nationality, or whatever information we are interested in studyying.

tudor_people <- data_frame(name = V(tudors_g)$name, 
                           monarch =  c(TRUE, FALSE, TRUE, FALSE, FALSE,
                                       FALSE, FALSE, TRUE, TRUE, TRUE, TRUE,
                                       FALSE, FALSE, FALSE, FALSE, FALSE,
                                       FALSE, FALSE, FALSE, FALSE, FALSE,
                                       FALSE, FALSE, FALSE, FALSE))
tudor_people
## Source: local data frame [25 x 2]
## 
##                   name monarch
## 1            Henry VII    TRUE
## 2         Arthur Tudor   FALSE
## 3           Henry VIII    TRUE
## 4       Margaret Tudor   FALSE
## 5           Mary Tudor   FALSE
## 6              James V   FALSE
## 7  Mary Queen of Scots   FALSE
## 8               Mary I    TRUE
## 9           James VI/I    TRUE
## 10         Elizabeth I    TRUE
## ..                 ...     ...

Now that we know whether each person was a monarch, it is possible to change the shape and color of the vertex. We will use the function ifelse(), which takes three arguments: a comparison or function that returns a boolean, a value to return if TRUE, and a value to return if FALSE. (The chief advantage of using ifelse() over if() is that ifelse() is vectorized and if() is not.) We will change the shape of monarchs to squares, and their color to red. Then we can plot the network again.

V(tudors_g)$shape <- ifelse(tudor_people$monarch, "square", "circle")
V(tudors_g)$color <- ifelse(tudor_people$monarch, "red", "lightblue")
plot(tudors_g)
title("The Tudors, take two")

This makes it much easier to pick out the key people of interest. But it will also be useful to colorize the edges. Because we loaded our edge data from a data frame with a relationship column.

E(tudors_g)$color <- ifelse(E(tudors_g)$relationship == "child",
                            "yellow", "green")
plot(tudors_g)
title("The Tudors, take three")

This visualization is still hard to read because the data has an implied chronological sequence which this chart does not recognize. Using the layout = argument to the plot() function it is possible to specify an alternative mode of laying out a network. There are many different kinds of layouts; some of them we will use in this chapter; the rest you can find by reading the igraph package documentation. These layouts are provided by functions. In this case, we know that our network is heirarchical (e.g., children suceed parents) so the layout.reingold.tilford() function will approximate this.

plot(tudors_g, layout = layout.reingold.tilford)
title("The Tudors, take three")

This layout is not perfect: it confuses spouses with children. Nevertheless, it does indicate the possibilities for creating different visualizations by experimenting with the layout function.

Note that we have set several different properties on the tudors_g object. To find all the properties of the graph itself and its edges and vertices, you can call the str() function with the following options. There are also the functions list.edge.attributes(), get.edge.attribute(), set.edge.attribute(), list.vertex.attributes(), get.vertex.attribute(), get.vertex.attribute() that can get or set the same information.

str(tudors_g, e = TRUE, v = TRUE, g = TRUE)
## IGRAPH UN-- 25 35 -- 
## + attr: name (v/c), shape (v/c), color (v/c), relationship (e/c),
##   color (e/c)
## + vertex attributes:
##                        name  shape     color
## [1]               Henry VII square       red
## [2]            Arthur Tudor circle lightblue
## [3]              Henry VIII square       red
## [4]          Margaret Tudor circle lightblue
## [5]              Mary Tudor circle lightblue
## [6]                 James V circle lightblue
## [7]     Mary Queen of Scots circle lightblue
## [8]                  Mary I square       red
## [9]              James VI/I square       red
## [10]            Elizabeth I square       red
## [11]              Edward VI square       red
## [12]      Elizabeth of York circle lightblue
## [13]    Catharine of Aragon circle lightblue
## [14]            Anne Boleyn circle lightblue
## [15]           Jane Seymour circle lightblue
## [16]         Anne of Cleves circle lightblue
## [17]       Katherine Howard circle lightblue
## [18]         Catherine Parr circle lightblue
## [19]               James IV circle lightblue
## [20]              Louis XII circle lightblue
## [21] Charles Duke of Suffok circle lightblue
## [22]          Mary of Guise circle lightblue
## [23]   Frances II of France circle lightblue
## [24]     Henry Lord Darnley circle lightblue
## [25]              Philip II circle lightblue
## + edges (vertex names) and their attributes:
##                                                edge relationship  color
## [1]  Henry VII             --Elizabeth of York            spouse  green
## [2]  Arthur Tudor          --Catharine of Aragon          spouse  green
## [3]  Henry VIII            --Catharine of Aragon          spouse  green
## [4]  Henry VIII            --Anne Boleyn                  spouse  green
## [5]  Henry VIII            --Jane Seymour                 spouse  green
## [6]  Henry VIII            --Anne of Cleves               spouse  green
## [7]  Henry VIII            --Katherine Howard             spouse  green
## [8]  Henry VIII            --Catherine Parr               spouse  green
## [9]  Margaret Tudor        --James IV                     spouse  green
## [10] Mary Tudor            --Louis XII                    spouse  green
## [11] Mary Tudor            --Charles Duke of Suffok       spouse  green
## [12] James V               --Mary of Guise                spouse  green
## [13] Mary Queen of Scots   --Frances II of France         spouse  green
## [14] Mary Queen of Scots   --Henry Lord Darnley           spouse  green
## [15] Mary I                --Philip II                    spouse  green
## [16] Henry VII             --Arthur Tudor                  child yellow
## [17] Arthur Tudor          --Elizabeth of York             child yellow
## [18] Henry VII             --Henry VIII                    child yellow
## [19] Henry VIII            --Elizabeth of York             child yellow
## [20] Henry VII             --Margaret Tudor                child yellow
## [21] Margaret Tudor        --Elizabeth of York             child yellow
## [22] Henry VII             --Mary Tudor                    child yellow
## [23] Mary Tudor            --Elizabeth of York             child yellow
## [24] James V               --James IV                      child yellow
## [25] Margaret Tudor        --James V                       child yellow
## [26] Mary Queen of Scots   --Mary of Guise                 child yellow
## [27] James V               --Mary Queen of Scots           child yellow
## [28] Mary Queen of Scots   --James VI/I                    child yellow
## [29] James VI/I            --Henry Lord Darnley            child yellow
## [30] Mary I                --Catharine of Aragon           child yellow
## [31] Henry VIII            --Mary I                        child yellow
## [32] Elizabeth I           --Anne Boleyn                   child yellow
## [33] Henry VIII            --Elizabeth I                   child yellow
##  [ reached getOption("max.print") -- omitted 2 rows ]

Bipartite graphs

There is more we can and should do to explore the Tudors graph, but to do so we must get comfortable with exploring and manipulating the underlying structure of the graph. To do so, we will us as an example a particularly important kind of graph: the bipartite graph.

To understand bipartite graphs, let’s take a historical example.

members <- data_frame(person = c("A", "B", "C", "A", "D", 
                                 "D", "E", "F", "F", "G", "G"),
                      organization = c("Odd Fellows", "Odd Fellows",
                                       "Odd Fellows", "Masons", 
                                       "Masons", "Pythians", "Pythians",
                                       "Masons", "Odd Fellows", "Masons",
                                       "Odd Fellows"))
members
## Source: local data frame [11 x 2]
## 
##    person organization
## 1       A  Odd Fellows
## 2       B  Odd Fellows
## 3       C  Odd Fellows
## 4       A       Masons
## 5       D       Masons
## 6       D     Pythians
## 7       E     Pythians
## 8       F       Masons
## 9       F  Odd Fellows
## 10      G       Masons
## 11      G  Odd Fellows
members_g <- graph.data.frame(members, directed = FALSE)

Here we have a network of membership in fraternal organizations. Unlike in our kinship network above, the people are not directly related to one another. Rather they are related to one another through an intermediary, in this case, membership in an organization. We can see the problem when we plot the graph.

plot(members_g)

We want to see the connections between people A, B, and C, but instead they are connected through the Odd Fellows. In general, it is often a bad idea to mix two different kinds of vertices in a network. Our network, however, is of a special type called bipartite. The vertices in a bipartite graph can be divided into two sets (with no vertex appearing in both sets). In our case, we have vertices which represent people and which represent fraternal organizations. In a bipartite graph, each edge connects one vertex from the first set to a vertex in the second set. In this case each edge connects a person to an organization. If our graph also included direct connections between people, or between organizations, it would not be bipartite. 3 Yet another way to say this is that in our data frame of edges, no vertex can appear in both the first column and the second column.

We happen to know that our membership rolls fits that description. But the igraph package cannot infer on its own that the name of an organization is in a different category than the name of a person. So we must supply that information to our graph object, using the types variable inside the vertex objects. In other words, we need to specify who is a person, and what is an organization.

We can retrieve the list of vertices:

V(members_g)
## Vertex sequence:
##  [1] "A"           "B"           "C"           "D"           "E"          
##  [6] "F"           "G"           "Odd Fellows" "Masons"      "Pythians"

If we had a separate list of organizations and people, we could use that to create a vector of values. But the igraph package can compute which verticies belong in which set. To do this we use the bipartite.mapping() function to separate the vertices into two groups.

bipartite.mapping(members_g)
## $res
## [1] TRUE
## 
## $type
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Note that this solution is not guaranteed to be unique; there may be other solutions which are not what we expect. But in this case it is correct. We can use that value to specify the types variable in our graph object.

V(members_g)$type <- bipartite.mapping(members_g)$type

Now we can use the is.bipartite() function to test whether the graph is properly bipartite.

is.bipartite(members_g)
## [1] TRUE

Now that we have included the information, we can plot our graph using a new layout which puts the people on one side and the organizations on the other.

plot(members_g, layout = layout.bipartite)
title("A bipartite layout of a fraternal network")

This is conceptually an improvement, since we have distinguished between people and organizations. In this plot, there are two kinds of vertices, representing people and organizations, and the edges represent membership. Still, what we want are the connections between people directly without the intermediary organizations. We can use the function bipartite.projection() to break up the vertices into the two separate groups and show just the connections among the people.

bipartite <- bipartite.projection(members_g)
str(bipartite)
## List of 2
##  $ proj1:IGRAPH UNW- 7 14 -- 
## + attr: name (v/c), weight (e/n)
## + edges (vertex names):
## A -- B, C, D, F, G
## B -- A, C, F, G
## C -- A, B, F, G
## D -- A, E, F, G
## E -- D
## F -- A, B, C, D, G
## G -- A, B, C, D, F
##  $ proj2:IGRAPH UNW- 3 2 -- 
## + attr: name (v/c), weight (e/n)
## + edges (vertex names):
## [1] Odd Fellows--Masons   Masons     --Pythians

Note that our new bipartite object contains two network graphs, proj1 and proj2. We can plot the first one.

plot(bipartite$proj1)
title("Just the people in the fraternal organizations")

That is exactly what we wanted. Now we can see directly which people likely knew one another because of their shared membership. Now there is only one kind of vertex, people, and the edges represent shared membership in an organization. That graph shows the relationships among people, but we can also use the graph of the relationship among the organizations.

plot(bipartite$proj2)
title("Just the fraternal organizations")

This plot also has only one kind of vertex, organizations, and edges represent at least one person who had membership in both organizations. However, this plot shows the same level of connection, no matter how many people shared membership in the organizations. What we would like to do is draw the connecting lines thicker if organizations have more shared members. When we use bipartite.projection() to separate the original graph into two parts, that information was preserved in the weight variable for the edges.

E(bipartite$proj2)
## Edge sequence:
##                               
## [1] Masons      -- Odd Fellows
## [2] Pythians    -- Masons
E(bipartite$proj2)$weight
## [1] 3 1

The way to read this information is that there were three people who were both Odd Fellows and Masons, but only one person who was both a Pythian and a Mason. Using the edge.width= argument to the plot() function, we can show this visually.

plot(bipartite$proj2, edge.width = E(bipartite$proj2)$weight)
title("Fraternal organizations weighted by shared membership")

Now we have a plot which shows the closer connection between the Masons and the Odd Fellows using a thicker line. But it would be nicer to show the closer connection by making the points, well, closer. The Fruchterman-Reingold layout uses edge weights to bring “heavier” connections closer together. As a final tweak, we can offset the labels slightly so they don’t overlap the vertices, using the vertex.label.dist= argument.

plot(bipartite$proj2, edge.width = E(bipartite$proj2)$weight,
     layout = layout.fruchterman.reingold(bipartite$proj2, 
                                          weights = E(bipartite$proj2)$weight),
     vertex.label.dist = 1)
title("Fraternal organizations weighted by shared membership")

Now that we are familiar with some of the basics of working with and plotting bipartite graphs, we are ready to work with a real and much larger dataset.

Federal judges and courts as a network

The federal judges dataset contains biographical information about judges as well as information about which courts they served on. We can load the judges_appointments table to see that information.

data(judges_appointments)
judges_appointments
## Source: local data frame [4,202 x 15]
## 
##    judge_id                                             court_name
## 1      3419    U. S. District Court, Southern District of New York
## 2         1     U. S. District Court, Eastern District of New York
## 3         2 U. S. District Court, Western District of Pennsylvania
## 4         3     U. S. District Court, Northern District of Alabama
## 5         4           U. S. District Court, District of New Jersey
## 6         5    U. S. District Court, Southern District of Illinois
## 7         6          U. S. District Court, District of Puerto Rico
## 8         7    U. S. District Court, Southern District of Illinois
## 9         8           U. S. Court of Appeals for the Third Circuit
## 10        9     U. S. District Court, Eastern District of Missouri
## ..      ...                                                    ...
## Variables not shown: court_type (chr), president_name (chr),
##   president_party (chr), nomination_date (chr), predecessor_last_name
##   (chr), predecessor_first_name (chr), senate_confirmation_date (chr),
##   commission_date (chr), chief_judge_begin (int), chief_judge_end (int),
##   retirement_from_active_service (chr), termination_date (chr),
##   termination_reason (chr)

There are a lot of variables in that dataset. Now that you are familiar with networks, however, you should notice that the first two columns set us up for a bipartite network graph. That is, in the first column we have judges, in the second colum we have courts, and their a relationship between them. There are many questions we could ask this dataset, but we will ask a question which is structurally similar to what we did with the sample fraternal organization data. Let’s ask the question, how do judges move from one federal court to another? Put in terms of networks, we can phrase that question this way: how are federal courts related to one another by the judges which have served on each of them? Of course, we are likely to be especially interested in which courts are most likely to provide justices for the Supreme Court of the United States.

We can start with some summary counts of the number of judges. First, let’s find out how many judges have served on two or more courts. These judges will be the edges in our network.

judges_appointments %>%
  group_by(judge_id) %>%
  summarize(n = n()) %>%
  filter(n > 1) %>%
  arrange(desc(n))
## Source: local data frame [546 x 2]
## 
##    judge_id n
## 1      2689 6
## 2      1126 5
## 3      1453 5
## 4        57 4
## 5       354 4
## 6       375 4
## 7       945 4
## 8      1026 4
## 9      1065 4
## 10     1293 4
## ..      ... .

There are 3532 judges in the dataset; only 546 of them have served on multiple federal courts. They will be the edges and this is a very modest size. But before we can use this data, we should clean up a peculiarity. The chief judges of certain important courts like SCOTUS are listed separately from the rest of the court, which will make two vertices for SCOTUS where there should be one. We can easily clean these up.

library(stringr)
# Remove appendages to court names
judges_appointments$court_name <- judges_appointments$court_name %>%
  str_replace("Chief Judge, ", "")
judges_appointments$court_name <- judges_appointments$court_name %>%
  str_replace(", Chief Judge", "")
judges_appointments$court_name <- judges_appointments$court_name %>%
  str_replace(", Chief Justice", "")
judges_appointments$court_name <- judges_appointments$court_name %>%
  str_replace("Associate Judge, ", "")

There would be no point in plotting our network now: it will be an inchoate mess and contain vertices for both courts and judges. What we want to do first is turn our data frame into a graph, our graph into a bipartite graph, and then split apart the network of courts weighted by the number of judges connecting them. We can do this in a few steps:

courts <- graph.data.frame(judges_appointments, directed = FALSE)
V(courts)$type <- bipartite.mapping(courts)$type
courts <- bipartite.projection(courts, which = TRUE)
courts
## IGRAPH UNW- 161 271 -- 
## + attr: name (v/c), weight (e/n)

We can see that our graph has 161 vertices: that is the number of different courts that are represented. The graph also has 271 edges. That is, there are that many connections between courts, but because there are 3532 judges who served on multiple courts, those edges are weighted to represent multiple judges.

set.seed(723) # keep the same arrangement each time
l <-  layout.fruchterman.reingold(courts, weights = E(courts)$weight)
plot(courts, 
     vertex.label = NA, 
     vertex.size = 3,
     edge.width = E(courts)$weight,
     layout = l)
title("Movement of judges between federal courts, first attempt")

This first attempt at a plot of the federal courts is by no means perfect, but it at least provides an interesting start. We notice that there are some courts in the center, that some branches of the federal court system seem closer than others, and that some courts have never shared judges. But we don’t know which vertices represent what. We must start to explore the graph more fully.

We’ll begin by adding some color for the kinds of courts. Our judges_appointments data frame contains th court_type variable. We will first have to look up each vertex name to find the court type and assign it to a variable in the graph. The base R function match() fill find a row in our data frame; we can then use that row number to look up the type of court. This is most easily wrapped in a function.

lookup_court <- function(name) {
  judges_appointments$court_type[
    match(V(courts)$name, judges_appointments$court_name)
    ]
}
V(courts)$court_type <- lookup_court(V(courts)$name)

Now our vertices know which type of court they are. We could create a look up table of colors and court types, but in this case we will write a function that returns a color based on the court type and use the Vectorize() wrapper so that it works with vectors.

lookup_color <- function(type) {
  require(stringr)
  if(is.na(type)) return("blue")
  if(type == "USDC") return("green")
  if(type == "USCA") return("yellow")
  if(str_detect(type, "USCC")) return("yellow")
  if(type == "USSC") return("red")
  else return("blue")
} 
lookup_color <- Vectorize(lookup_color, USE.NAMES = FALSE)

Now we can assign a color based on the court type.

V(courts)$color <- lookup_color(V(courts)$court_type)
plot(courts, 
     vertex.label = NA, 
     vertex.size = 3,
     edge.width = E(courts)$weight,
     layout = l)
title("Movement of judges between federal courts, second attempt")
legend("bottomleft", legend = c("Distict", "Circuit/Appeal", "SCOTUS", "Other"),
       col = c("green", "yellow", "red", "blue"), pch = 19,
       title = "Court type")

This new plot is a big improvement, though of course it could be improved fruther. Now we notice that the other courts (mostly commerce and trade courts) have an affinity for one another, though one “other” court has a close affinity for SCOTUS. We might speculate that the branches on this diagram relate to region, because we notice that many district courts are associated with one or at most two appeals courts. Further looking might bear this out and cause us to label the branches. Some appeals courts are apparently much more likely to send their judges on to other courts, including SCOTUS, while other courts have never promoted a judge to a higher federal court. There are many other things we could do to visualize this network, depending on the questions we want to ask. One obvious thing would be to change the size of the vertices based on the number of judges who have ever been affiliated with the court. But for now we will turn to studying the properties of the graph quantitatively rather than visually.

Statistical summaries of network graphs

There are many things that one can do to statistically analyze network graphs. But we will focus on two of them: measures of centrality, and community detection.

First we have to learn some basic techniques of exploring network graphs. The neighbors() function takes a graph and a vertex, and returns the vertices that are attached to it.

neighbors(courts, 1)
## [1]  67 122 130 138

We can make this output more useful. Here we find the vertex that represent the Supreme Court, then find its neighbors, then find the names of the neighbors.

scotus <- which(V(courts)$name == "Supreme Court of the United States")
scotus_promotions <- neighbors(courts, scotus)
V(courts)[scotus_promotions]
## Vertex sequence:
##  [1] "U. S. District Court, Southern District of New York"        
##  [2] "U. S. Court of Appeals for the Third Circuit"               
##  [3] "U. S. District Court, Northern District of Ohio"            
##  [4] "U. S. Court of Appeals for the Ninth Circuit"               
##  [5] "U. S. District Court, Eastern District of Virginia"         
##  [6] "U. S. Court of Appeals for the Sixth Circuit"               
##  [7] "U. S. Court of Appeals for the Seventh Circuit"             
##  [8] "U. S. Court of Appeals for the District of Columbia Circuit"
##  [9] "U. S. Court of Appeals for the First Circuit"               
## [10] "U. S. District Court, Eastern District of Michigan"         
## [11] "U. S. District Court, District of Kentucky"                 
## [12] "U. S. District Court, Western District of Missouri"         
## [13] "U. S. Circuit Courts for the Sixth Circuit"                 
## [14] "U. S. Court of Appeals for the Eighth Circuit"              
## [15] "U. S. Circuit Courts for the Eighth Circuit"                
## [16] "U. S. District Court, Eastern District of Tennessee"        
## [17] "U. S. Court of Appeals for the Second Circuit"              
## [18] "U. S. District Court, Middle District of Tennessee"         
## [19] "U. S. Circuit Courts for the Ninth Circuit"                 
## [20] "U. S. Circuit Courts for the Second Circuit"                
## [21] "U. S. Circuit Courts for the Fifth Circuit"

See also the neighborhood() function to find vertices that are several edges away from a vertex. You can use the function graph.neighborhood() to extract a vertex and its neighbors. Here we pull out the Supreme Court and its immediate neighbors. By change the order, we could control how distant we pull vertices from.

scotus_neighborhood <- graph.neighborhood(courts, scotus, order = 1)
plot(scotus_neighborhood[[1]], vertex.label = NA)

Measures of centrality

There are many measures of centrality (e.g., how important a given vertex or edge is. These are the most important functions:

  • degree()
  • betweenness()
  • closeness()
  • evcent()
  • page.rank()

Community detection

One major thing you might want to do with a network graph is to identify its subgraphs. For example, in a network of people, you might wish to identify the communities that those people have formed. There are many algorithms for identifying communities; which algorithm is most suitable for your purpose will depend on your data. You should try several different algorithms to see which gives the most historically sensible results. The basic workflow within igraph is more or less the same regardless of which algorithm you use. These algorithms are all named in the pattern *.community(). We will try several of these algorithms.

The basic pattern is to pass your graph object to one of the community detection algorithms. The function will return an object of class communities. (See ?communities for documentation.)

comm <- fastgreedy.community(courts)

There are many things you can derive from the community object. The length() will give you the number of communities that were detected.

length(comm)
## [1] 23

The sizes() are the number of members in each group. For each of the 23 communities that were detected, R will tell you how many vertices are in that community. Note that some community detection algorithms will assign vertices to more than one community if appropriate.

sizes(comm)
## Community sizes
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
## 17 21 24 13 15 11  9 12  8  4  7  6  4  1  1  1  1  1  1  1  1  1  1

The modularity() function reports how well the detection algorithm worked. The number runs from 0 to 1: the higher the number, the more valid the separation into subgraphs is. Note that modularity (when called this way) takes into account edge weights.

modularity(comm)
## [1] 0.8104425

And membership() returns a named numeric vector with one element for each vertex in the graph. The number stands for the assigned community; the name of the element is the name of the vertex. Here we will look at the first elements assigned to group 1. Each of those courts is in the Ninth Circuit Court of Appeals, so it is not surprising that judges move around within that circuit.

membership(comm) %>%
  sort() %>%
  head()
## U. S. District Court, Northern District of California 
##                                                     1 
##              U. S. District Court, District of Oregon 
##                                                     1 
##          U. S. Court of Appeals for the Ninth Circuit 
##                                                     1 
##               U. S. District Court, District of Idaho 
##                                                     1 
##  U. S. District Court, Central District of California 
##                                                     1 
## U. S. District Court, Southern District of California 
##                                                     1

It is possible to find all of the members in any given group. Here we subset the membership vector to show just the members of group 4, which appears to be the courts in Virginia, Maryland, West Vrigina, and at least parts of North and South Carolina (interestingly, not Washington, DC).

memb <- membership(comm)
names(memb[memb == 4])
##  [1] "U. S. Court of Appeals for the Fourth Circuit"           
##  [2] "U. S. District Court, Eastern District of Virginia"      
##  [3] "U. S. District Court, District of South Carolina"        
##  [4] "U. S. District Court, Northern District of West Virginia"
##  [5] "U. S. District Court, Western District of Virginia"      
##  [6] "U. S. District Court, District of Maryland"              
##  [7] "U. S. District Court, Southern District of West Virginia"
##  [8] "U. S. Circuit Courts for the Fourth Circuit"             
##  [9] "U. S. District Court, Western District of North Carolina"
## [10] "U. S. District Court, Eastern District of South Carolina"
## [11] "U. S. District Court, District of Virginia"              
## [12] "U. S. District Court, Western District of South Carolina"
## [13] "U. S. District Court, District of West Virginia"

Once you have detected communities, it is easy to plot them. One method is to assign the vertex colors based on their membership. In the code below, the argument that controls the color is vertex.color = memb (memb was previous assigned).

plot(courts, vertex.color = memb, 
     vertex.label = NA, layout = l, vertex.size = 3,
     edge.width = E(courts)$weight)
title("Communities detected by fastgreedy.community()")

Another option is to plot the communities object directly, which will both color the vertices and show their groupings.

plot(comm, courts,
     vertex.label = NA, layout = l, vertex.size = 3,
     edge.width = E(courts)$weight)
title("Communities detected by fastgreedy.community()")

We can try several different community detection algorithms to see what results they produce. The work here is done by the *.community() functions.

courts %>%
  edge.betweenness.community() %>%
  plot(., courts, vertex.label = NA, layout = l, vertex.size = 3)
title("Communities detected by edge.betweenness.community()")

courts %>%
  walktrap.community() %>%
  plot(., courts, vertex.label = NA, layout = l, vertex.size = 3)
title("Communities detected by walktrap.community()")

There are many other community detection functions. One other option for community detection is the optimal.community() function. This function looks for the best community over every possible way of grouping the vertices. The length of time to run the algorithm thus grows exponentially with the number of vertices, so the functions documentation recommends that you use caution for graphs with more than fifty vertices. We won’t try it for our courts graph which is too large, but you can see the function’s documentation for details on how to use it. Whatever community detection algorithm you use, be sure to read its documentation to understand what it does, and then to interrogate its results to see if they make historical sense.

Finally, we can extract one of these communities into its own subgraph. We’ll use community 4 that we selected above. The key function is induced.subgraph().

induced.subgraph(courts, which(membership(comm) == 4)) %>%
  plot()
title("Community 4 among the federal courts")

Plotting Options

There are many options you may wish to change when plotting network graphs. You can find all of these options by reading the relevant manual page: ?igraph.plotting.

Integration with D3.js

D3.js is a very powerful library for data visualization in JavaScript. D3 is often used for network analysis. It is complex to learn and use, however. The d3Network package for R will let you take R objects and conver them to d3 network plots. The use of this package is beyond the scope of this book, but I will demonstrate one way that you can use the package to create an embedded, interactive visualization.

library(d3Network)
courts_network_df <- courts %>%
  get.edgelist() %>%
  as.data.frame

d3SimpleNetwork(courts_network_df, iframe = TRUE, width = 750, height = 900)

Next steps

Further reading:


  1. If you want to save a graph object to disk, you can do so with write.graph(). To open data from other formats, try read.graph(), which probably has a function for the data that you want.

  2. A “sparse” matrix is a matrix is which few of the cells contain actual data. In this adjacency matrix, for example, there are nine cells (3 x 3) but only three connections. Thus less then half of the matrix contains useful information. The opposite of a sparse matrix is a dense matrix, in which more than half of the cells contain data.

  3. “Lastly, a bipartite graph is a graph G = (V,E) such that the vertex set V may be partitioned into two disjoint sets, say V1 and V2, and each edge in E has one endpoint in V1 and the other in V2.” (26) There can be k-partite graphs; that is graphs which extend this definition to several different sets.