This tutorial covers how to download data from the census. In particular, we are going to use the bulk FTP downloads to download the census variables (e.g. Total population) for census geographies. I will also show how to download the shapefiles (the actual geographic data you can map).
There is an interactive tool you can use to download a variety of census data (the variables) called the American Fact Finder. It is somewhat self-explanatory, but you cannot download the smallest geographies (block groups) from that for the most recent American Community Survey. Also I like doing the bulk downloads, as I can see all of the variables available for a particular geography – figuring this out from the American Fact Finder is difficult.
I have written a tutorial and a helper macro to do this for those using SPSS, but here I am going to show you how to do this using Excel.
Census data is very extensive, so the documentation can be overwhelming. Here I am going to try to make sense of it in some simpler steps. This part is to download variables at particular geographic units, such as total population and median income at census tracts or block groups, from versions of the American Community Survey. Specifically we are going to download all of the data for tracts and block groups in New York for the 2012 5 year ACS data.
First we are going to go to https://www.census.gov/programs-surveys/acs/data/data-via-ftp.html, which provides key links to several locations where we can download the data. Subsequently select the link in the top right of the table. This takes you to the FTP site to bulk download the data.
You should then be in a directory that has separate years. Here go to 2012.
First lets go and download the data, so click on the [data](http://www2.census.gov/programs-surveys/acs/summary_file/2012/data/) folder.
Now since we want to work with the small geographies, we need to work with the 5 year estimates. First click the 2012_5yr_Summary_FileTemplates.zip and download that into your folder you will be conducting this assignment in. Next click the folder 5_year_by_state/.
Now navigate down to the file for New York tracts and block groups only. This is large, so may take awhile to both download and for your computer to now unzip the file.
Before we leave will download some of the documentation we are interested in. Go back to first folder you were in after clicking 2012 that listed data and documentation. This time select documentation and then select 5 year.
Then select the Tech Doc PDF, which in Chrome I can open up right in my browser.
Go ahead and unzip the files we downloaded (the actual data for New York takes quite a while to unzip). While that is going on, lets go to the documentation. Specifically, lets go to Appendix E, which starts on page 46.
The columns you need to pay attention to are the “Table Title” column, the “Table Available for Block Groups on Summary File”, and the “Summary File Sequence Number”. The Table title column gives a description of what that table gives estimates for. The “Table Available for Block Groups on Summary File” says Yes if that data is available at the block group level, if it is empty that data is only available at the Census Tract level. The sequence number tells you what text file the data is available in.
So now lets try to figure out how to get the data for the first row, the unweighted sample count of the population, which is in sequence 1.
First navigate to where you unzipped the ‘NewYork_All_Geographies_Tracts_Block_Groups_Only’ data. You will see a series of text files that have either an e, m, or g as the prefix.
The e stands for “estimates”, the “m” stands for margin of error, and the “g” stands for geography. The suffix data corresponds to the sequence number. The stuff inbetween corresponds to which year the data is for and the 1,3 or 5 year product. So for example e20125ny0001000.txt
would signify the “estimates for the 2012 5 year ACS data in New York - Sequence 1”. The margin of errors correspond to the 90% confidence interval of the mean estimate. So if you had an estimate of 500, and the margin of error was 100, a 90% confidence interval around the estimate would be 400-600.
If you open up one of these text files you will see they are comma delimited, but have no headers to describe the variable fields. We will use the File templates to figure out the variable names. So now navigate to where you unzipped the summary file templates and open that folder. You should see a series of Excel files that look like this
Go ahead and open up the Seq1.xls
file. (Note that this Set1 data file is not the population estimates, but the number of individuals and households sampled at the block group level.) This lists the variable names, but now we want to import our data. First select the A3 cell, then select the Data tab and click the “From Text” option.
Then navigate to the e20125ny0001000.txt
file and click Import.
You will then go through three screens specifying how to import the data. For the first page the defaults are fine, go ahead and click next.
The second you need to change the delimiter from Tab to Comma. Then click next.
The third step should give you a preview of your data. Go ahead and then click finish if the data is split up like the fields below.
Now it will ask you what field you want to import the data into. You want to import it into cell $A$3.
Now your data should look like below.
Now the F column (LOGRECNO) is the field that allows us to link to a geographic area, but not in this file.
Go ahead and now open up the “2012_SFGeoFileTemplate.xls” file. Do the same text data import for this file, except import the g20125ny.csv
file. This should then look like below.
You can see in the E column that it also contains the same LOGRECNO file. Now we need to merge our two tables together.
To merge these two tables together, I dragged the geographytemplate sheet into the Seq1 file. After that go ahead and save the excel spreadsheet as a new file, so we do not lose our work.
Now we are going to create a named range and then use the VLOOKUP
function to merge the tables together. First in the Formula toolbar select the Define Name option.
Now name this range GeoLookup
, and in the refers to option select the range of =geographytemplate!$E$2:$BA$34784
. For VLOOKUP
to work, the matching column needs to be the first in the named range - here LOGRECNO. We can then specify which fields to merge that are to the right of the LOGRECNO field.
Now navigate to the Seq1 worksheet. In Column I and row 3, type in the formula =VLOOKUP(F3,GeoLookup,6)
. Your worksheet should now look like this:
You can see that it imported the value of “36”, which happens to be the census defined state code for New York. Here the main value we want happens to be in the 45th column after the LOGRECNO, the big GEOID field. So go ahead and change the VLOOKUP formula to grab the 45th column. Once you have done that, double click on the lower right hand corner of the I3 cell, and that will propagate the formula down to the rest of the geographies in the state. As a final step, then rename some of the columns (in the second or first row) to more meaningful labels). And then delete the other header row. Here is what my final spreadsheet looks like:
We now have the necessary info to merge this demographic data into a shapefile. The field should be familiar, as it is similar to the GEOID10 field in the prior tutorial. If you need to trim the field down to the GEOID10, make a new column and use the formula =RIGHT(I2,11)
in Excel. Refer back to Tutorial 3 for how to merge tables in ArcGIS.
Now that I have shown you how to download the bulk census data, you may be wondering how to download the actual shapefiles for the census geographies – so we can actually map our data! This is much simpler, and the online web services to do it are much simpler than the American Fact Finder.1 As an FYI you could download the geographies from that same FTP site, but you need to download them for an entire state. This procedure I will walk you through you can download them for smaller areas, such as a specific county.
First go to this webpage, https://www.census.gov/geo/maps-data/data/tiger-line.html. We will now be downloading the census blocks for Albany County NY in 2013. Note that the years after 2010 use the same geographies as the 2010 census, but that the 2000 and 2010 census have changed their boundaries.
Then click the 2012 tab in the webpage. You will then be given an option to use the Web Interface or the FTP site. Go ahead and click the Web Interface link.
Next you will have a self explanatory set of drop-downs, a year and a geography. Here I choose 2013 and blocks. Note that the census maintains the same geographic boundaries for each 10 years, so the 2012 block groups should be equivalent to the 2013 block groups.
For this set of characteristics, it then asks for what State I want. Go ahead and choose New York, and then click download. Note you may need to turn off-pop ups in your browser to get this to download. (I’ve consistently had a poor time using Google Chrome when working with the Census websites and the American Fact Finder.)
This example you need to download the entire state. For some of the other geographies though you have the option to choose both the state and the specific county you are interested in. This then downloads a zip file that contains a shapefile of the census geographies.
In later tutorials I will shown how you can select data, so if you were only interested in blocks in Albany you could select out the ones of interest from the statewide file. Also you will learn how to merge tables together in ArcGIS. So you can construct your table of demographic data, and then merge it to the census shapefile data in ArcGIS.
Pro-academic-tip: The census changes some of its boundaries for areas, like block groups and tracts, every decennial census. For those working with census estimates over a longer period it is a pain in the butt to normalize the data to the same geographies. To normalize the areas between any two census’s, you need the crosswalk files. These files provide the necessary information to use proportional areal allocation (the dasymetric procedure we talked about in class) to make the estimates comparable between census’s.
There are a few projects though that have done some of this work for you. There is the Longitudinal Tract database for tract data from 1970 to 2010. There is the National Historical Geographic Information System that has census tract geographies going back even further, sometimes to 1910. There is also a paid for product, the Neighborhood Change Database, which for one state is currently sold for $1200. (I saw the latter referred to distinguish between the Latino categories in the 1970 data, as this is not available in the first reference.)
For your homework I want you to estimate the percentage white population (available in sequence #4) and make a table of a few specific block groups. This table will already include the total population estimates used for both the numerator and the denominator (not the sample size estimates that were drawn earlier). The block groups I want you to place in the table are:
LOGRECNO GEOID
20345 15000US360470576002
20346 15000US360470578001
20347 15000US360470578002
20348 15000US360470579001
20349 15000US360470579002
In your table include the total population and the white population estimates. Then calculate the percent white population in an additional column.
For 5 points extra credit, include the margin of error white population estimates, which can be found by importing the data in the m20125ny0004000.txt
text file. Then calculate a range of estimates for the white population (so you will have three extra columns, margin of error, low % and high %).
Pretend the total population has zero error, but calculate the range of percent white estimates at the low range of white population and the high range. So for example, if the total population was 1,000, the and the white population was 500 with a margin of error of 100, the percent white population ranges would be:
low = (500-100)/1000 = 0.4
high = (500+100)/1000 = 0.6
So between 40% and 60%. If the percent white happens to be over 100% or under 0%, simply round it to 100 or 0 respectively.
In addition to census geographies, the census also provides water boundaries (for rivers and lakes), as well as roads and railroads.↩