Introduction to QGIS
QGIS is an advanced GIS tool, comparable to ArcGIS.[1] In this workshop we will use U.S. Census data from NHGIS to create a data map similar to the one we made earlier. Working with NHGIS will serve as an introduction to navigating the large spatial datasets that are available from governments and academic sources. Because the NHGIS shapefiles are too large to be usable in CartoDB,[2] we will create a similar kind of map in QGIS.
QGIS, like CartoDB, uses map layers. (You might compare it to Photoshop in this regard.) You might initially find it surprising that QGIS does not start with a base layer, such as Google Maps. QGIS expects you to add every layer yourself.
Getting Data from NHGIS
We will use data from the National Historical Geographic Information System (NHGIS), a project from the Minnesota Population Center to digitize U.S. Census data and connect it to spatial data. You should navigate through NHGIS in order to download data for yourself. To download data from NHGIS, you will need to create an account, then navigate the data finder. See the NHGIS User’s Guide for details. After downloading those files, you should unzip them into a directory. Be sure to include both census data and shapefiles. The geographic level (e.g., county, state, or census tract) and the year of the census data must match those of the spatial data. Or you can download this assortment of data from the 1890 U.S. Census (CSV files) with accompanying county boundaries (a shapefile).[3]
Adding a Vector Layer to QGIS
Each data set can be added to QGIS as a layer. The type of layer will depend on the kind of data that you are using. (See the spatial data section for an explanation.) We will begin by adding the county boundaries for the U.S. in 1890. Since these are in a shapefile, they are a vector layer.
To add the layer, click “Layer > Add Layer > Add Vector Layer.” You will then navigate to the directory where the NHGIS shapefile is stored.[4]
After you load the shapefile, it will be displayed as a layer in QGIS.
There are a number of things you can do now that QGIS is displaying data. You can pan and zoom using the tools in the tool bar. It is particularly useful to right click on the layer in the list of layers to the left and choose “Zoom to layer.” Using the “identify features” tool, you can click on individual counties and see the data associated with them.
We are going to do two things: inspect the data associated with the layer, and change how the layer appears on the map based on the data.
First, to see the data associated with the shapefile, you can right click on the layer and choose “Open Attribute Table.” A window will pop up with a table similar to this.
Notice the way that this data is structured. There is one row for each “feature” in the shapefile, in this case, one row for each county. Each county is associated with a set of variables, and each variable is stored in a column. These variables can hold different kinds of information: in this case there are text (i.e., string) fields as well as numeric fields. The kinds of information are important too. Some of the fields are place names, such as NHGISNAM for the county name; others are geographic information, like SHAPE_AREA for the area of the county in square meters. Yet others seem cryptic, such as GISJOIN, but these fields provide a unique identifier that lets us connect this spatial data to other kinds of data. This table could also contain information that might be of interest in mapping, such as the population of the county, but it does not. (For instance, the Natural Earth data, linked from the resources page, includes population data and other fields.) We will eventually have to join that information to the shapefile ourselves.
Second, let’s change the way that the map is displayed. Instead of displaying the map with the random colors assigned to it by QGIS, we will tie the colors on the map to data in the attributes. You can do this by right clicking on the layer in the list of layers, then clicking “properties” and the “style” tab. QGIS calls the way that data is displayed its “symbol.” The symbol is normally the same for each feature in a layer. For instance, if all we wanted was to change the color and border of the boundaries, we could do so by selecting the “single symbol” option. This would be appropriate if we were only interested in the boundaries, or if we had, say, one shapefile for schools, another for churches, and so on, and wanted to represent each by a different symbol. In our case we want to pick the “graduated symbol” option, meaning that we are going to assign each feature to a bin associated with a color. By setting the column to SHAPE_AREA, we are saying that the color should be determined by that variable. The number of “classes” is the number of bins. The “mode” is the way we determine what the boundaries of the bins should be. In this case we will use the Jenks natural breaks algorithm, which tries to make each bin as distinct as possible from every other bin, while making the items in each bin as much alike as possible.[5] There are a number of color ramps to choose from, most taken from the Color Brewer palettes. Clicking “classify” assigns our counties to bins.
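To make the idea of natural breaks concrete, here is a minimal sketch in Python. It is not QGIS's implementation: it simply tries every way of cutting a sorted list into bins and keeps the split with the least variation inside the bins, which is the goal the Jenks method pursues more efficiently. The function names and the area values are invented for illustration.

```python
from itertools import combinations


def sum_squared_deviations(values):
    """Sum of squared differences between each value and the group mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def natural_breaks(values, n_classes):
    """Return the bins that minimize total within-bin squared deviation.

    Brute force over every way to cut the sorted values into n_classes
    contiguous groups. Fine for a handful of values; far too slow for
    thousands of counties.
    """
    ordered = sorted(values)
    best_bins, best_cost = None, float("inf")
    # Each combination of cut positions splits the sorted list into bins.
    for cuts in combinations(range(1, len(ordered)), n_classes - 1):
        edges = (0, *cuts, len(ordered))
        bins = [ordered[a:b] for a, b in zip(edges, edges[1:])]
        cost = sum(sum_squared_deviations(b) for b in bins)
        if cost < best_cost:
            best_bins, best_cost = bins, cost
    return best_bins


# Invented areas (arbitrary units): several small counties, one enormous one.
areas = [1, 2, 3, 10, 11, 12, 500]
bins = natural_breaks(areas, 3)
for number, members in enumerate(bins, start=1):
    print(f"class {number}: {members}")
```

With these made-up values the single enormous area ends up in a bin of its own, while the two clusters of small areas each get their own bin. Equal-width bins over the same range would instead lump all of the small areas together.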
We get the following map as a result. It shows the counties classified by their size. We can inspect the map and see that the bigger counties do in fact receive a darker color.
But we want to create a map of more interesting information than the area of counties. To do that we need to use the data that we downloaded from NHGIS as a CSV. Try opening the file nhgis0040_ds27_1890_county.csv in a program like LibreOffice or Excel. It should look like this.
The first thing to notice about this file is that the first column, GISJOIN, contains the same kind of code that we saw in the attribute table of our shapefile. This key identifies each county in space and time. In other words, G0100010 represents Autauga County, Alabama, in 1890; G2700270 represents Clay County, Minnesota, in 1890; and so on. This is the key to joining our spatial data (the shapefile) to our census data (the CSV file). We can see other data as well: the name of the county, and so on. But the information that is actually of interest to us is contained in columns with cryptic names such as AUM001. To learn what these names mean, we have to use the codebook that is associated with our data file. The codebook is contained in the nhgis0040_ds27_1890_county_codebook.txt file. Opening up the codebook in a text editor lets us know what the columns mean. Now we are ready to join the data.
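The two example keys also hint at how the code is put together. The sketch below assumes that a county-level GISJOIN is the letter G, then a three-character state code, then a four-character county code. That layout is an assumption inferred from the examples and should be checked against the NHGIS documentation; the function name is hypothetical.

```python
def split_county_gisjoin(code):
    """Split a county-level GISJOIN into its state and county parts.

    Assumes the layout "G" + 3-character state code + 4-character county
    code. That layout is an assumption to check against the NHGIS
    documentation, not something stated in this workshop.
    """
    if len(code) != 8 or not code.startswith("G"):
        raise ValueError(f"not a county-level GISJOIN: {code!r}")
    return {"state": code[1:4], "county": code[4:8]}


for code in ["G0100010", "G2700270"]:
    print(code, split_county_gisjoin(code))
```

Under that assumption, every key for the same state begins with the same four characters, which can help you spot a mismatched file before you attempt a join.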
To join the data, we need to load the CSV into QGIS. We can do this by choosing the “Layer > Add Layer > Add Delimited Text Layer” menu option, then navigating to our file. If the file had spatial information (e.g., latitude and longitude) we could let QGIS know where to find it. But this file has no spatial information so we will select “no geometry.”
Now we have both layers in QGIS and need to join them together. To do that, we will right click on the original shapefile layer, select “Properties,” and click the “Joins” tab. Earlier we noticed that both the shapefile and the CSV file had a field called GISJOIN. We need to specify which layer we are joining, then let QGIS know to use the GISJOIN column in both datasets. Note that your data will not always be as clean and well-organized as this. In a different data set, for instance, you might have to join a CSV of country data to a shapefile of countries using their respective columns with the names of the countries, and inevitably there will be some names that will need to be standardized.
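Outside of QGIS, the same kind of join can be sketched in a few lines of Python, which may help clarify what the “Joins” tab is doing. This is only an illustration, not how QGIS implements it. The two GISJOIN codes come from the examples above; the third feature and all of the AVL016 values are invented.

```python
import csv
import io

# Stand-in for the shapefile's attribute table. The first two GISJOIN codes
# come from the text; the third row is hypothetical, to show a feature that
# has no matching census row.
features = [
    {"GISJOIN": "G0100010", "NHGISNAM": "Autauga"},
    {"GISJOIN": "G2700270", "NHGISNAM": "Clay"},
    {"GISJOIN": "G9999990", "NHGISNAM": "Hypothetical"},
]

# Stand-in for the census CSV. The AVL016 values are invented.
census_csv = io.StringIO(
    "GISJOIN,AVL016\n"
    "G0100010,12\n"
    "G2700270,340\n"
)

# Index the census rows by key so each lookup is a dictionary access.
census_by_key = {row["GISJOIN"]: row for row in csv.DictReader(census_csv)}

joined, unmatched = [], []
for feature in features:
    match = census_by_key.get(feature["GISJOIN"])
    if match is None:
        unmatched.append(feature["GISJOIN"])
        match = {}
    # Keep every feature (a left join); missing census columns stay absent.
    joined.append({**feature, **match})

print(joined)
print("features with no census row:", unmatched)
```

Listing the unmatched keys is the useful habit here: when you join on something messier than GISJOIN, such as country names, that list tells you which names still need to be standardized. Note that Python's csv module reads every value as text, so the counts would need converting to numbers before any arithmetic.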
Once we have completed the join, we can reopen the shapefile’s attribute table. Where before we only had geographic information, now we also have access to all of the columns that were in the CSV file. We can create a graduated symbol as we did above. In this case, we can use the column AVL016, which gives us the number of people born in Germany by county.
- Can you plot different columns in the NHGIS data set?
- Can you try a different algorithm for determining the colors?
- Can you normalize the map by population?
- Can you normalize the map by area?
- Can you change the map projection?
- Can you use the map composer to output a map as an image?
- Can you filter the shapefile?
- Can you plot an entirely different data set?
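As a hint for the normalization questions, the arithmetic is a division of one column by another. The sketch below uses two hypothetical counties with invented numbers; the field names and the per_thousand helper are assumptions for illustration, not columns in the NHGIS data.

```python
def per_thousand(count, population):
    """Return count per 1,000 residents, or None when population is unusable."""
    if not population:
        return None
    return 1000 * count / population


# Invented numbers for two hypothetical counties, for illustration only.
counties = [
    {"name": "County A", "german_born": 500, "population": 10000},
    {"name": "County B", "german_born": 500, "population": 50000},
]

for county in counties:
    county["german_born_per_1000"] = per_thousand(
        county["german_born"], county["population"]
    )
    print(county["name"], county["german_born_per_1000"])
```

Both counties have the same raw count, but the rate in the first is five times that of the second. A map of raw counts would color them identically, while a normalized map would not.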
One of the most important skills to learn in using technology for scholarly research is how to read the manual successfully. You may wish to read the guide that QGIS provides: “A Gentle Introduction to GIS.” There are many tutorials for QGIS online; see the resources page for a list. QGIS provides many advanced techniques which you should look over.
[1] Your institution may have a (rather expensive) subscription to ArcGIS. ArcGIS is indeed a powerful tool, and perhaps in some ways the de facto standard among cartographers. However, I prefer QGIS for two reasons. The first is a strong preference for open-source software for scholarly research and teaching. Second, in order to use ArcGIS, you would likely have to use your institutional subscription in a Windows computer lab, while QGIS permits students to use the software on their own machines. That said, what you learn about QGIS should be broadly transferable to ArcGIS should you choose to use it for yourself.
[4] Even though the shapefile actually has several different files associated with it, you want to add it as a file and not as a directory. QGIS will automatically import the other files as well.
[5] In this case, other options like “pretty” breaks don’t work well because the Alaska counties are much larger than any others. We might consider removing Alaska from the shapefile if we don’t intend to include it in the map.