Practical Employment project with R

Visually Exploring Employment Data (R)

In this article, we will cover:
  • Preparing for analysis
  • Importing employment data into R
  • Exploring the employment data
  • Obtaining and merging additional data
  • Adding geographical information
  • Extracting state- and county-level wage and employment information
  • Visualizing geographical distributions of pay
  • Exploring where the jobs are, by industry
  • Animating maps for a geospatial time series
  • Benchmarking performance for some common tasks

Preparing for analysis

This recipe will prepare the groundwork with the tools you need to complete this project. If you do not have R installed on your computer, please see the instructions in Chapter 1, Preparing Your Data Science Environment.

Getting ready

You need a computer with R installed and an Internet connection.

How to do it…

The following steps will get you prepared for the remainder of this chapter by downloading the dataset from the BLS website and ensuring that we have the needed R packages:
  1. Download the 27.5 MB compressed data from http://www.bls.gov/cew/data/files/2012/csv/2012_annual_singlefile.zip, and save it to a location that you will remember.
  2. Uncompress the file by right-clicking on it in Explorer or Finder, and use the appropriate menu item.

    Tip

    If you are familiar with the command line in the terminal in Linux/Mac OS X, you can easily uncompress the downloaded file using unzip 2012_annual_singlefile.zip.
  3. Launch the RStudio IDE on your computer (or just plain R for purists).
  4. Load the R packages we will need for this project:
    library(data.table)
    library(plyr)
    library(dplyr)
    library(stringr)
    library(ggplot2)
    library(maps)
    library(bit64)
    library(RColorBrewer)
    library(choroplethr)
    

    Note

    If you do not have one of these packages installed on your machine, you can easily install it using the following command, exchanging data.table for the package name to be installed:
    install.packages('data.table',repos='http://cran.r-project.org')
    
    The R package repository, known as CRAN, has several mirrors around the world. A "mirror," in this usage, is a duplicate copy of the software repository that is run on a server in a different region, providing faster access for individuals near that location. You can, and should, choose a mirror geographically close to you to speed up the package download. In the preceding code snippet, you can change http://cran.r-project.org to the URL of your preferred CRAN mirror.
  5. Finally, set your working directory to the path where you have saved the file. This will tell R where you want it to look for the file:
    setwd('path')
    

How it works…

We will primarily use three different R packages that are extremely useful in importing, manipulating, and transforming large datasets.
The data.table package improves upon the data.frame object in R, making operations faster and allowing the specification of an index variable in a dataset, a concept familiar to database experts. It also allows very fast aggregation of large data, including very fast ordered joins. We will primarily use data.table for its fread function, which allows (very) fast importing of large, structured datasets into R. We will also investigate the relative performance of the functions in this package against other functions we use in the benchmarking recipe later in this chapter.
The stringr package provides tools for text and string manipulation. This package streamlines and syntactically unifies the string manipulation functionality available in R, making tasks involving string search, manipulation, and extraction much easier. We will need these functionalities here.
The dplyr package is the next iteration of the popular package, plyr, by Dr. Hadley Wickham. It is targeted at rectangular data and allows very fast aggregation, transformation, summarization, column selection, and joins. It also provides syntactical sugar to allow commands to be strung together, piping the results of one command into the next. This will be our workhorse in this project.
The ggplot2 package will be our visualization workhorse. It implements the Grammar of Graphics paradigm in R and provides a very flexible means of creating visualizations.
The maps package provides geographical information about the US that we will use in our visualizations.

Importing employment data into R

Our first step in this project is to import the employment data into R so that we can start assessing the data and perform some preliminary analysis.

Getting ready

You should be ready to go ahead after completing the previous recipe.

How to do it…

The following steps will guide you through two different ways of importing the CSV file:
  1. We can directly load the data file into R (even from the compressed version) using the following command:
    ann2012 <- read.csv(unz('2012_annual_singlefile.zip','2012.annual.singlefile.csv'),stringsAsFactors=F)
    
    However, you will find that this takes a very long time with this file. There are better ways.
  2. We chose to import the data directly into R since there are further manipulations and merges we will do later. We will use the fread function from the data.table package to do this:
    library(data.table)
    ann2012 <- fread('data/2012.annual.singlefile.csv')
    
    That's it. Really! It is also many times faster than the other method. It will not convert character variables to factors by default, so if you need factors later, you will have to convert them, but that is, in our opinion, a desirable feature in any case.

How it works…

We are familiar with read.csv from the previous article, Driving Visual Analysis with Automobile Data (R). It reads the data line by line, separating columns by commas. Although the data we are using for this project has about 3.5 million rows, it is still small enough to fit into the memory of most modern personal computers; it will, however, take quite some time to import using read.csv().
The fread function from the data.table package uses an underlying C-level function to figure out, from the file, the length, number of fields, data types, and delimiters in the file, and it then reads the file using the parameters it has learned. As a result, it is significantly faster than read.csv. There is an extensive description of the details of fread in the R documentation.
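If you are curious about what fread detects, its verbose argument prints the detection steps as it reads; a minimal sketch, reusing the file path from this recipe:
library(data.table)

# Print what fread infers (separator, number of columns, column types)
# while it reads the file
ann2012 <- fread('data/2012.annual.singlefile.csv', verbose=TRUE)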

There's more…

One of the limitations currently in R is that data imported into R needs to fit into the memory of the host computer. For large datasets, using a SQL-based database for data storage and manipulation takes advantage of the speed of the SQL-based database and circumvents R's memory limitation. Often, in enterprise environments, data is stored in Oracle, SAP, MySQL or PostgreSQL databases.
The sqldf package is extremely useful if you are from a SQL background, and many entering the world of data science have such a background. This package allows you to use SQL commands and queries within R, and treats data.frame objects in R just as you would treat tables in a SQL database. It also allows you to connect with most SQL-based databases that you have access to, including SQLite, MySQL, and PostgreSQL, among others.
As a demonstration, we can import data into a SQLite database, and then read and manipulate it there:
library(sqldf)
sqldf('attach blsdb as new')
read.csv.sql('data/2012.annual.singlefile.csv',
             sql='create table main.ann2012 as select * from file',
             dbname='blsdb')

Note

You must have SQLite installed and available on your machine for the preceding code to work. We will leave the installation to you.
This will also take some time, but you'll have the data in SQLite. The preceding command actually doesn't ingest the data into R, but just imports it into SQLite.
You can import the data into R using the following command:
ann2012 <- sqldf("select * from main.ann2012", dbname='blsdb')
You can also choose to do your manipulations using SQL within SQLite with sqldf. For some users, it will be easier to manipulate the data using SQL, and then merely import the munged data into R.
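For instance, a minimal sketch of such a manipulation might aggregate average annual pay by ownership code inside SQLite and pull only the summarized result into R; this assumes you have already run the read.csv.sql step above, and the query itself is purely illustrative:
library(sqldf)

# Summarize inside SQLite, then import only the small result table
pay_by_own <- sqldf("select own_code, avg(avg_annual_pay) as mean_pay
                     from main.ann2012
                     group by own_code",
                    dbname='blsdb')
head(pay_by_own)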
As the first part of the book is focused on R, we will not delve deeper into the use of sqldf beyond what was presented here. However, if you are more familiar with SQL, you are welcome to replicate the steps presented in R in the various recipes with SQL commands, but we'll leave this as an exercise for you.

Exploring the employment data

Now that the data is imported into R and we have learned some strategies to import larger datasets into R, we will do some preliminary analysis of the data. The purpose is to see what the data looks like, identify idiosyncrasies, and ensure that the rest of the analysis plan can move forward.

Getting ready

If you completed the last recipe, you should be ready to go.

How to do it…

The following steps will walk you through this recipe to explore the data:
  1. First, let's see how large this data is:
    >dim(ann2012)
    [1] 3556289    15
    
    Good, it's only 15 columns.
  2. Let's take a peek at the first few rows so that we can see what the data looks like:
    head(ann2012)
    
    You can refer to the following screenshot:
    What are the variables own_code, industry_code, and so on, and what do they mean? We might need more data to understand this data.
  3. There is also a weird idiosyncrasy in this data. Some of the values for total_annual_wages, taxable_annual_wages, and annual_contributions look impossibly small. A peek at the actual data shows that these numbers don't appear to be correct. However, fread actually gives an indication of what is going on:
    ann2012 <- fread('data/2012.annual.singlefile.csv', sep=',', colClasses=c('character', 'integer', 'integer', 'integer', 'integer', 'integer', 'character',rep('integer',8)))
    
    You can refer to the following screenshot:
  4. This points to the fact that the bit64 package might be needed to properly display these large numbers. Installing and loading this package, and then reimporting the data corrects the problem, as seen in the following command lines:
    install.packages('bit64')
    library('bit64')
    ann2012 <- fread('data/2012.annual.singlefile.csv', sep=',', colClasses=c('character', 'integer', 'integer', 'integer', 'integer', 'integer', 'character',rep('integer',8)))
    
    You can refer to the following screenshot:

How it works…

The head command displays the first few lines (the default is the top six lines) of a data frame. We notice that some of the headings are self-explanatory, but some allude to codes that we currently don't have access to. We will have to obtain additional data in order to make a meaningful analysis. We could have looked at the data without importing it into R. The UNIX command, less, and the Windows PowerShell command, type, could have shown us the same thing as head did.
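A few other base R commands give a quick structural overview at this stage; for example:
head(ann2012, n=10)                 # first ten rows instead of the default six
str(ann2012)                        # column names, types, and a short preview
summary(ann2012$annual_avg_emplvl)  # numerical summary of a single column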

Obtaining and merging additional data

In the previous recipe, we found that additional data was needed to understand what the data in the CSV file actually represents, and this recipe will directly address this need.

Getting ready

We can find additional data on the BLS website at http://www.bls.gov/cew/datatoc.htm under the header Associated Codes and Titles. There are five files there, which we will download to our computer. They are as follows:
  • agglevel_titles.csv
  • area_titles.csv
  • industry_titles.csv
  • ownership_titles.csv
  • size_titles.csv
We download them to our computer, remembering where we stored them. We then need to import them into R and merge them with our original data.

How to do it…

The following steps will lead you through loading these files into R and joining them into a larger data frame:
  1. We will now import these data files into R using the following command lines:
    for(u in c('agglevel','area','industry','ownership','size')){
      assign(u, read.csv(paste('data/', u, '_titles.csv', sep=''),
                         stringsAsFactors=F))
    }
    
    This is an example of code that makes it easier for us to do repeated tasks.
  2. Each of these datasets has exactly one variable (column header) in common with our original data, which is ann2012, as shown in the following screenshot:
    So, it should be fairly easy to put the datasets together. We'll join four of these datasets with ann2012 now, and save area, that is, the data from area_titles.csv, for the next recipe, since we want to manipulate it a bit:
    codes <- c('agglevel','industry','ownership','size')
    ann2012full <- ann2012
    for(i in 1:length(codes)){
      eval(parse(text=paste('ann2012full <- left_join(ann2012full, ',
                            codes[i], ')', sep='')))
    }
    
    The end result is shown in the following screenshot:

How it works…

In step 1 of the How to do it… section, we want to assign each dataset to its own object. We could write an individual line of code for each import, but this loop lets us do it much faster, and it will be easier to maintain in the future. The assign command takes two basic inputs: a variable name and a value to be assigned to that variable name. The for loop here doesn't iterate over numbers, but over objects. At the first iteration, u takes the value agglevel. The assign function takes the name agglevel and assigns it the result of the read.csv command. Within the paste command, we again use the value of u, since all the files we are importing have the same naming convention. It is this common naming convention that allows us to use this type of coding. Thus, the first iteration gives assign('agglevel', read.csv('data/agglevel_titles.csv')), and so forth.
In step 2, we join the datasets together. We first copy the original data to a new name, ann2012full, so that we can build up this new dataset without corrupting the original data in case something goes wrong. We then use a macro-like construct to join all the new datasets to the original one, iterating over numbers in the for loop and the indices of the vector codes.
Let's work our way inside out in this complex command (a sound strategy to understand complex code in general). Within the paste command, we create the command we would like evaluated. We want to do a left_join (this is from the dplyr package), joining ann2012full with agglevel in the first iteration. The left_join ensures that all the rows of ann2012full are preserved, and the rows of agglevel are appropriately replicated to match the number of rows in ann2012full. Since there is only one common variable in the two datasets, left_join automatically chooses to join using this.

Note

In general, left_join will join the two datasets using all the variable names it finds common between the two. If you do not want this, you can specify which variables to use for the join, for example, left_join(ann2012full, agglevel, by="agglvl_code").
The eval statement then evaluates the command we constructed using the paste command. We iterate over the names in codes so that each of the four datasets gets joined with ann2012full.
A quick check will show that the number of rows in ann2012full after the joins, and in ann2012, is the same.
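A minimal sketch of that check, using nothing beyond base R:
# The four left joins should neither add nor drop rows
nrow(ann2012)       # row count of the original data
nrow(ann2012full)   # should be identical after the joins
stopifnot(nrow(ann2012) == nrow(ann2012full))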

Adding geographical information

The main purpose of this chapter is to look at the geographical distribution of wages across the US. Mapping this out requires us to first have a map. Fortunately, maps of the US, at both the state and county levels, are available in the maps package, and the data required to make the maps can be extracted. We will align our employment data with the map data in this recipe so that the correct data is represented at the right location on the map.

Getting ready

We already have the area dataset imported into R, so we are ready to go.

How to do it…

The following steps will guide you through the process of creating your first map in R:
  1. Let's first look at the data in area:
    head(area)
    
    The output is shown in the following screenshot:
    We see that there is something called area_fips here. Federal Information Processing Standards (FIPS) codes are used by the Census Bureau to designate counties and other geographical areas in the US.
  2. We want to capitalize the first letter of each word in the names, following the usual convention. We'll write a small function to do this:
    simpleCap <- function(x){
      if(!is.na(x)){
        s <- strsplit(x, ' ')[[1]]
        paste(toupper(substring(s, 1, 1)), substring(s, 2),
              sep='', collapse=' ')
      } else {
        NA
      }
    }
    
  3. The maps package contains two datasets that we will use; they are county.fips and state.fips. We will first do some transformations. If we look at county.fips, we notice that the FIPS code there is missing a leading 0 on the left for some of the codes. All the codes in our employment data comprise five digits:
    > data(county.fips)
    > head(county.fips)
      fips        polyname
    1 1001 alabama,autauga
    2 1003 alabama,baldwin
    3 1005 alabama,barbour
    4 1007    alabama,bibb
    5 1009  alabama,blount
    6 1011 alabama,bullock
    
  4. The stringr package will help us out here:
    county.fips$fips <- str_pad(county.fips$fips, width=5, pad="0")
    
  5. We want to separate the county names from the polyname column in county.fips. We'll get the state names from state.fips in a minute:
    county.fips$polyname <- as.character(county.fips$polyname)
    county.fips$county <- sapply(
      gsub('[a-z ]+,([a-z ]+)', '\\1', county.fips$polyname),
      simpleCap)
    county.fips <- unique(county.fips)
    
  6. The state.fips data involves a lot of details:
    > data(state.fips)
    
    The output is shown in the following screenshot:
  7. We'll again pad the fips column with a 0, if necessary, so that they have two digits, and capitalize the state names from polyname to create a new state column. The code is similar to the one we used for the county.fips data:
    state.fips$fips <- str_pad(state.fips$fips, width=2, pad="0",
                               side='left')
    state.fips$state <- as.character(state.fips$polyname)
    state.fips$state <- gsub("([a-z ]+):[a-z ']+", '\\1', state.fips$state)
    state.fips$state <- sapply(state.fips$state, simpleCap)
    
  8. We make sure that we have unique rows. We need to be careful here, since we only need uniqueness in the fips, abb, and state columns, and not across the other columns:
    mystatefips <-unique(state.fips[,c('fips','abb','state')])
    
    The unique function, when applied to a data.frame object, returns the unique rows of the object. You might be used to using unique on a single vector to find the unique elements in the vector.
  9. We get a list of the lower 48 state names. We will filter our data to look only at these states:
    lower48 <- setdiff(unique(state.fips$state),c('Hawaii','Alaska'))
    

    Note

    The setdiff set operation looks for all the elements in the first set that are not in the second set.
  10. Finally, we put all this information together into a single dataset, myarea:
    myarea <- merge(area, county.fips, by.x='area_fips',by.y='fips', all.x=T)
    myarea$state_fips <- substr(myarea$area_fips, 1,2)
    myarea <- merge(myarea, mystatefips,by.x='state_fips',by.y='fips', all.x=T)
    
  11. Lastly, we join the geographical information with our dataset, and filter it to keep only data on the lower 48 states:
    ann2012full <- left_join(ann2012full, myarea)
    ann2012full <- filter(ann2012full, state %in% lower48)
    
  12. We now store the final dataset in an R data (rda) file on disk. This provides an efficient storage mechanism for R objects:
    save(ann2012full, file='data/ann2012full.rda', compress=T)
    

How it works…

The 12 steps of this recipe covered quite a bit of material, so let's dive into some of the details, starting with step 2. The simpleCap function is an example of a function in R. We use functions to encapsulate repeated tasks, reducing code duplication and ensuring that errors have a single point of origin. If we merely repeat code, changing the input values manually, we can easily make errors in transcription, break hidden assumptions, or accidentally overwrite important variables. Further, if we want to modify the code, we have to do it manually at every duplicate location. This is tedious and error-prone, and so we make functions, a best practice that we strongly encourage you to follow.
The simpleCap function uses three functions: strsplit, toupper, and substring. The strsplit function splits strings (or a vector of strings) wherever it finds the string fragment to split on (in our case, ' ', or a space). The substring function extracts substrings from strings between the character locations specified. Specifying only one character location implies extracting from this location to the end of the string. The toupper function changes the case of a string from lowercase to uppercase. The reverse operation is done by tolower.
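A quick illustration of how these pieces combine inside simpleCap (the input string is made up, and simpleCap is the function defined in step 2):
s <- strsplit('new york', ' ')[[1]]   # "new"  "york"
toupper(substring(s, 1, 1))           # "N" "Y"
substring(s, 2)                       # "ew"  "ork"
simpleCap('new york')                 # "New York"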
As we saw in step 3, packages often have example data bundled with them. county.fips and state.fips are examples of datasets that have been bundled into the maps package.
The stringr package, used in step 4, is another package by Dr. Wickham, which provides string manipulation functions. Here, we use str_pad, which pads a string with a character (here, 0) to give the string a particular width.
In step 5, we use the inbuilt regular expression (regex) capabilities in R. We won't talk about regular expressions too much here. The gsub function looks for the pattern given as its first argument and substitutes the replacement given as its second argument in the strings supplied as its third argument. Here, the pattern we're looking for comprises one or more letters or spaces ([a-z ]+), then a comma, and then one or more letters or spaces. The second set of letters and spaces is what we want to keep, so we put parentheses around it. The \\1 pattern says to replace the entire match with the first group we put parentheses around. This replacement happens for every element of the polyname field.
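A worked example on a single made-up polyname value shows the substitution in action:
gsub('[a-z ]+,([a-z ]+)', '\\1', 'alabama,autauga')
# [1] "autauga"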
Since we want capitalization for every element in polyname, we could use a for loop, but we choose the more compact sapply instead. Every element in polyname is passed through the simpleCap function, and is thus capitalized, in steps 5 and 7.
In step 10, we join the area, county.fips, and mystatefips datasets together. We use the merge function rather than left_join, since the variables we want to join on have different names in the different data.frame objects. The merge function in the R standard library allows this flexibility. To ensure a left join, we specify all.x=TRUE.
In step 11, we join the myarea data frame to our ann2012full dataset. We then use the filter function to subset the data, restricting it to data from the lower 48 states. The filter function is from the dplyr package. We'll speak about the functionalities in dplyr in the next recipe.

Extracting state- and county-level wage and employment information

So far, we have worked to get the data into shape for analysis. We'll now start by looking at the geographical distribution of the average annual pay per state and per county.

Getting ready

If you have thoroughly followed the recipes in this chapter till now, you will have the data in a form from where you can extract information at different levels. We're good to go!

How to do it…

We will first extract data from ann2012full at the state level. We need to perform the following steps:
  1. We look at the aggregate state-level data. A peek at agglevel tells us that the code for the level of data that we want is 50. Also, we only want to look at the average annual pay (avg_annual_pay) and the average annual employment level (annual_avg_emplvl), and not the other variables:
    d.state <- filter(ann2012full, agglvl_code==50)
    d.state <- select(d.state, state, avg_annual_pay, annual_avg_emplvl)
    
  2. We create two new variables, wage and empquantile, which discretize the pay and employment variables:
    d.state$wage <- cut(d.state$avg_annual_pay, 
    quantile(d.state$avg_annual_pay, c(seq(0,.8, by=.2), .9, .95, .99, 1)))
    d.state$empquantile <- cut(d.state$annual_avg_emplvl, quantile(d.state$annual_avg_emplvl, c(seq(0,.8,by=.2),.9,.95,.99,1)))
    
  3. We also want the levels of these discretized variables to be meaningful. So, we run the following commands:
    x <- quantile(d.state$avg_annual_pay, c(seq(0,.8,by=.2),.9, .95, .99, 1))
    xx <- paste(round(x/1000),'K',sep='')
    Labs <- paste(xx[-length(xx)],xx[-1],sep='-')
    levels(d.state$wage) <- Labs
    
    x <- quantile(d.state$annual_avg_emplvl, 
    c(seq(0,.8,by=.2),.9, .95, .99, 1))
    xx <- ifelse(x>1000, paste(round(x/1000),'K',sep=''),
    round(x))
    Labs <- paste(xx[-length(xx)],xx[-1],sep='-')
    levels(d.state$empquantile) <- Labs
    
  4. We repeat this process at the county-level. We will find that the appropriate aggregation level code is 70 (agglvl_code==70). Everything else will be the same. Let's try to be a bit smarter this time around. First of all, we will discretize our variables the same way, and then change the labels to match. A function might be a good idea! The following command lines depict this:
    Discretize <- function(x, breaks=NULL){
      if(is.null(breaks)){
        breaks <- quantile(x, c(seq(0,.8,by=.2),.9, .95, .99, 1))
        if (sum(breaks==0)>1) { 
          temp <- which(breaks==0, arr.ind=TRUE)
          breaks <- breaks[max(temp):length(breaks)]
        }
      }
      x.discrete <- cut(x, breaks, include.lowest=TRUE)
      breaks.eng <- ifelse(breaks > 1000,
                           paste0(round(breaks/1000),'K'),
                           round(breaks))
      Labs <- paste(breaks.eng[-length(breaks.eng)], breaks.eng[-1],
                    sep='-')
      levels(x.discrete) <- Labs
      return(x.discrete)
    }
    
  5. We alluded to the syntactic sugar of dplyr before; now, we see it in action. The dplyr package allows you to string together different operations, piping the results of one operation as input for the next, using the %.% operator. We'll describe the main operations of dplyr in the How it works… section of this recipe. Using some function encapsulation, the following code achieves everything that we spent significantly more lines of code to achieve in steps 1-3:
    d.cty <- filter(ann2012full, agglvl_code==70)%.%
    select(state,county,abb, avg_annual_pay, annual_avg_emplvl)%.%
    mutate(wage=Discretize(avg_annual_pay),
    empquantile=Discretize(annual_avg_emplvl))
    
We now have the basic datasets we need to visualize the geographic patterns in the data.

How it works…

The preceding five steps covered a lot of R code, so let's start breaking things down. The two functions filter and select are from dplyr. The dplyr package provides five basic functions, which are as follows:
  • filter: This creates subsets of the data based on specified criteria
  • select: This selects columns or variables from the dataset
  • mutate: This creates new columns or variables in a dataset, which are derived from other variables in the dataset
  • group_by: This splits the data by a variable or set of variables, and subsequent functions operate on each component defined by a unique variable value or combination
  • arrange: This rearranges the data (or sorts it) according to variable(s) in the dataset
Each of these functions can operate on a data.frame, data.table, or tbl object (the tbl class is part of dplyr).
The cut function discretizes a continuous variable based on specified breakpoints or thresholds. Here, we specify the thresholds based on quantiles of the variable. We specify which quantiles we want by a sequence of fractions:
c(seq(0, .8, by=.2), .9, .95, .99, 1)
This is done using the seq function to create a regular sequence of numbers with a start value, an end value, and the difference between two successive values.
In step 3, we take the specified thresholds and format them. Numbers above 1,000 are truncated at the thousands place and appended with a K, as is conventionally seen. Using round without specifying the number of decimal places implies no decimal places.
Further, in step 3, we want our labels to represent a range. So, we need to create the labels by putting a dash (-) between successive values of our formatted thresholds. One way of doing this is to create two copies of the vector of thresholds, one without the last element and another without the first element. We then paste them together with -. Notice that this trick allows successive thresholds to be aligned and pasted together. If you're not convinced, print out xx[-length(xx)] and xx[-1], and see for yourself.
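Here is that check on a small, made-up vector of formatted thresholds:
xx <- c('21K', '33K', '39K', '46K', '59K')   # made-up formatted thresholds
xx[-length(xx)]                              # "21K" "33K" "39K" "46K"
xx[-1]                                       # "33K" "39K" "46K" "59K"
paste(xx[-length(xx)], xx[-1], sep='-')
# "21K-33K" "33K-39K" "39K-46K" "46K-59K"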
The Discretize function encapsulates the work we were doing in discretizing our outcomes and formatting their labels.
This code snippet uses the syntax of dplyr to string together functions. We first subset the original data, keeping only rows that have agglvl_code equal to 70 (note the == in the code). We then pipe the resulting reduced data into the second function, select, which keeps only the five variables we're interested in. This further reduces the data, which is then passed into the mutate function, which creates two new variables in the data object. This final object is then stored with the variable name d.cty.

Visualizing geographical distributions of pay

We created datasets that contain the data we need to visualize average pay and employment by county and state. In this recipe, we will visualize the geographical distribution of pay by shading the appropriate areas of the map with a color that maps to a particular value or range of values. This is commonly referred to as a choropleth map; this visualization type has become increasingly popular over the last few years as it has become much simpler to make such maps, especially online. Other geographical visualizations overlay a marker or some other shape to denote data, rather than filling shapes bounded by geographically meaningful borders.

Getting ready

After the last recipe, you should be ready to use the datasets we created to visualize geographical distributions. We will use the ggplot2 package to generate our visualizations. We will also use the RColorBrewer package, which provides "palettes" of colors that are visually appealing. If you don't currently have RColorBrewer, install it using install.packages('RColorBrewer', repos='http://cran.r-project.org').

How to do it…

The following steps walk you through the creation of this geospatial data visualization:
  1. We first need to get some data on the map itself. The ggplot2 package provides a convenient function, map_data, to extract this from data bundled in the maps package:
    library(ggplot2)
    library(RColorBrewer)
    state_df <- map_data('state')
    county_df <- map_data('county')
    
  2. We now do a bit of transforming to make this data conform to our data:
    transform_mapdata <- function(x){
      names(x)[5:6] <- c('state','county')
      for(u in c('state','county')){
        x[,u] <- sapply(x[,u], simpleCap)
      }
      return(x)
    }
    state_df <- transform_mapdata(state_df)
    county_df <- transform_mapdata(county_df)
    
  3. The data.frame objects, state_df and county_df, contain the latitude and longitude of points. These are our primary graphical data and need to be joined with the data we created in the previous recipe, which contains what is in effect the color information for the map:
    chor <- left_join(state_df, d.state, by='state')
    ggplot(chor, aes(long, lat, group=group)) +
      geom_polygon(aes(fill=wage)) +
      geom_path(color='black', size=0.2) +
      scale_fill_brewer(palette='PuRd') +
      theme(axis.text.x=element_blank(), axis.text.y=element_blank(),
            axis.ticks.x=element_blank(), axis.ticks.y=element_blank())
    
    This gives us the following figure that depicts the distribution of average annual pay by state:
  4. We can similarly create a visualization of the average annual pay by county, which will give us much more granular information about the geographical distribution of wages:
    chor <- left_join(county_df, d.cty)
    ggplot(chor, aes(long,lat, group=group))+
      geom_polygon(aes(fill=wage))+
      geom_path( color='white',alpha=0.5,size=0.2)+
      geom_polygon(data=state_df, color='black',fill=NA)+
      scale_fill_brewer(palette='PuRd')+
      labs(x='',y='', fill='Avg Annual Pay')+
      theme(axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank(), axis.ticks.y=element_blank())
    
    This produces the following figure showing the geographical distribution of average annual pay by county:
    It is evident from the preceding figure that there are well-paying jobs in western North Dakota, Wyoming, and northwestern Nevada, most likely driven by new oil exploration opportunities in these areas. The more obvious urban and coastal areas also show up quite nicely.

How it works…

Let's dive into the explanation of how the preceding four steps work. The map_data function is provided by ggplot2 to extract map data from the maps package. In addition to county and state, it can also extract data for the france, italy, nz, usa, world, and world2 maps provided by the maps package.
The columns that contain state and county information in county_df and state_df are originally named region and subregion. In step 2, we need to change their names to state and county, respectively, to make joining this data with our employment data easier. We also capitalize the names of the states and counties to conform to the way we formatted the data in our employment dataset.
For the creation of the map in step 3, we create the plotting dataset by joining state_df and d.state using the name of the state. We then use ggplot to draw the map of the US and fill in each state with a color corresponding to wage, the discretized average annual pay created in the previous recipe. To elaborate, we establish that the data for the plot comes from chor, we draw polygons (geom_polygon) based on the latitude and longitude of the borders of each state, filling them with a color depending on how high wage is, and then we draw the actual boundaries of the states (geom_path) in black. We specify a color palette that starts at white, goes through purple, and ends with red for the highest level of wage. The remainder of the code formats the plot, specifying labels and removing axis annotations and ticks.
For step 4, the code is essentially the same as step 3, except that we draw polygons for the boundaries of the counties rather than the states. We add a layer to draw the state boundaries in black (geom_polygon(data=state_df, color='black', fill=NA)), in addition to the county boundaries in white.

Exploring where the jobs are, by industry

In the previous recipe, we saw how to visualize the top-level aggregate data on pay. The employment dataset has more granular data, divided by public/private sectors and types of jobs. The types of jobs in this data follow a hierarchical coding system called the North American Industry Classification System (NAICS). In this recipe, we will consider four particular industries and look at visualizing the geographical distribution of employment in these industries, restricted to private sector jobs.
We will look at four industrial sectors in this recipe:
  • Agriculture, forestry, fishing, and hunting (NAICS 11)
  • Mining, quarrying, and oil and gas extraction (NAICS 21)
  • Finance and insurance (NAICS 52)
  • Professional and technical services (NAICS 54)

How to do it…

We need to create a subset of the employment data, including the data for industrial sectors, but restricting it to the private sector, by performing the following steps:
  1. We start by filtering the data by the conditions we are imposing on the industry and private sectors, and keep only relevant variables:
    d.sectors <- filter(ann2012full, industry_code %in% c(11,21,54,52),
    own_code==5, # Private sector 
    agglvl_code == 74 # county-level
    ) %.%
    select(state,county,industry_code, own_code,agglvl_code,  industry_title, own_title, avg_annual_pay, annual_avg_emplvl)%.%
    mutate(wage=Discretize(avg_annual_pay),
    emplevel=Discretize(annual_avg_emplvl))
    d.sectors <- filter(d.sectors, !is.na(industry_code))
    

    Note

    Here, our selection is based on a set of industry codes, and we restrict ourselves to county-level data. This code is different from before since we're now looking at industry-specific data.
  2. We now create the visualization using ggplot2. This visualization will be an array of four panels, one for each industrial sector. Each panel will have a county-level map of the US, with colors signifying the level of employment in each county in 2012 in each particular industry. We will also choose a blue-dominated palette for this visualization:
    chor <- left_join(county_df, d.sectors)
    ggplot(chor, aes(long,lat,group=group))+
      geom_polygon(aes(fill=emplevel))+
      geom_polygon(data=state_df, color='black',fill=NA)+
      scale_fill_brewer(palette='PuBu')+
      facet_wrap(~industry_title, ncol=2, as.table=T)+
      labs(fill='Avg Employment Level',x='',y='')+
      theme(axis.text.x=element_blank(), axis.text.y=element_blank(),   axis.ticks.x=element_blank(), axis.ticks.y=element_blank())
    
    This produces the following visualization, showing geographical distribution of employment by industry:

How it works…

In this recipe, we used the dplyr functions for data munging. One of our filter criteria was that the industry_code variable should have one of the values 11, 21, 52, or 54. This was achieved with the %in% operator, which is a set operation. It asks whether an element on the left is a member of the set on the right. We have multiple criteria in the filter statement, separated by commas. This implies an AND relationship, in that all of the criteria must be satisfied for the data to pass the filter.
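A one-line illustration of the %in% operator:
c(10, 21, 52) %in% c(11, 21, 52, 54)
# [1] FALSE  TRUE  TRUE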
We noticed that there were some missing values in the industry code. This resulted in an extra panel in the visualization, corresponding to data where the industry code was missing. We didn't want this, so we filtered out this data in this first step.
In the second step, the command to create the visualization is essentially the same as in the previous recipe, except for the following line:
facet_wrap(~industry_title, ncol=2)
This command splits the data up by the value of industry_title, creates separate visualizations for each value of industry_title, and puts them back onto a grid with two columns and an appropriate number of rows. We also used industry_title instead of industry_code here (they give the same visualization) so that the labeling of the panels is understandable, rather than comprising just some numbers that require the reader to look up their meaning.

There's more…

This recipe is the tip of the iceberg for this dataset. There are many levels that can be explored with this data, both in terms of private/public sectors and in terms of drilling down into different industries. Quarterly data for 2012 is also available and can shed light on temporal patterns; in fact, annual and quarterly data is available from 1990 onwards. Further analyses are possible, such as associating temporal patterns of employment with other socioeconomic events. The choroplethr and rMaps packages provide ways of creating animations over time for this type of data.

Animating maps for a geospatial time series

One of the real interests in this project is to see how wage patterns, as a surrogate for income patterns, changed over time. The QCEW site provides data from 2003 to 2012. In this recipe, we will look at the overall average annual pay by county for each of these years and create an animation that displays the changes in the pay pattern over this period.

Getting ready

For this recipe, we need to download the annual data for the years 2003 to 2011 from the BLS website, at http://www.bls.gov/cew/datatoc.htm. You will need to download the files corresponding to these years for the QCEW NAICS-based data files in the column CSVs Single Files-Annual Averages. Store these files (which are compressed .zip files) in the same location as the zipped 2012 data you downloaded at the beginning of this project. Don't unzip them! You must also download and install the choroplethr package using install.packages('choroplethr'), if you haven't already done so.

Tip

Note that this recipe is relatively memory-intensive. Those running on 32-bit machines might face out-of-memory issues.

How to do it…

What we need to do is import the data for all the years from 2003 through 2012 and extract data for the county-level (agglvl_code==70) average annual pay (avg_annual_pay) for each county, plot it, and then string the pay values together in an animation. Since we basically need to do the same things for each year's data, we can do this in a for loop, and we can create functions to encapsulate the repeated actions. We start by writing code for a single year, performing the following steps:
  1. We import the data from the ZIP file that we call zipfile in this prototype code. In reality, the file names are of the pattern 2003_annual_singlefile.zip, and the CSV files in them are of the pattern 2003.annual.singlefile.csv. We will use the common patterns in the ZIP and CSV files in our code to automate the process. For me, the data lies in a folder called data, which is reflected in the following code:
    unzip(file.path('data',zipfile), exdir='data') # unzips the file
    csvfile <- gsub('zip','csv', zipfile) # Change file name
    csvfile <- gsub('_','.',csvfile) # Change _ to . in name
    dat <- fread(file.path('data', csvfile)) # read data
    
  2. We now join the employment data with the geographical data from myarea:
    dat <- left_join(dat, myarea)
    
  3. We then use the dplyr functions to extract the county-level aggregate pay data, keeping the state and county information:
    dat <- filter(dat, agglvl_code==70) %.% # County-level aggregate
    select(state, county, avg_annual_pay) # Keep variables
    
  4. We then encapsulate the actions in steps 1 through 3 in a function:
    get_data <- function(zipfile){
      unzip(file.path('data', zipfile), exdir='data')  # unzip the file
      csvfile <- gsub('zip', 'csv', zipfile)    # change file name
      csvfile <- gsub('_', '.', csvfile)        # change _ to . in name
      dat <- fread(file.path('data', csvfile))  # read data
      dat <- left_join(dat, myarea)
      dat <- filter(dat, agglvl_code==70) %.%   # county-level aggregate
        select(state, county, avg_annual_pay)   # keep variables
      return(dat)
    }
    
  5. We now have to repeat this for each of the 10 years and store the data. For this type of data, a list object usually makes sense:
    files <- dir('data', pattern='annual_singlefile.zip') # file names
    n <- length(files)
    dat_list <- vector('list',n) # Initialize the list
    for(i in 1:n){
      dat_list[[i]] <- get_data(files[i])           # ingest data
      names(dat_list)[i] <- substr(files[i], 1, 4)  # label list with years
    }
    
  6. Next, we start creating the visualizations. Since we are essentially creating 10 visualizations, the colors need to mean the same thing on all of them for comparison purposes. So, the discretization needs to be the same for all the years:
    annpay <- ldply(dat_list) # puts all the data together
    breaks <- quantile(annpay$avg_annual_pay,    c(seq(0,.8,.2),.9,.95,.99,1)) # Makes a common set of breaks
    
  7. We will create the same visualization for each year, using the same breaks. Let's create a function for this common visualization to be produced. We will use ggplot2 for the visualizations. The input values are the data that we create using get_data, and the output is a plot object that can create the visualization:
    mychoro <- function(d, fill_label=''){
      # d has a variable "outcome" that is plotted as the fill measure
      chor <- left_join(county_df, d)
      plt <- ggplot(chor, aes(long, lat, group=group)) +
        geom_polygon(aes(fill=outcome)) +
        geom_path(color='white', alpha=0.5, size=0.2) +
        geom_polygon(data=state_df, color='black', fill=NA) +
        scale_fill_brewer(palette='PuRd') +
        labs(x='', y='', fill=fill_label) +
        theme(axis.text.x=element_blank(), axis.text.y=element_blank(),
              axis.ticks.x=element_blank(), axis.ticks.y=element_blank())
      return(plt)
    }
    
  8. We now create plot objects for each year using a for loop. We store these objects in a list, with each element corresponding to each year. In the process, we create a new variable, outcome, which is the discretized pay variable, using the common breaks. This variable needs to be called outcome because of the way we designed the mychoro function:
    plt_list <- vector('list', n)
    for(i in 1:n){
      dat_list[[i]] <- mutate(dat_list[[i]],
                              outcome=Discretize(avg_annual_pay, breaks=breaks))
      plt_list[[i]] <- mychoro(dat_list[[i]]) + ggtitle(names(dat_list)[i])
    }
    
  9. The choroplethr package has the utility function choroplethr_animate, which takes a list of plot objects created with ggplot2 and makes a web page with an animated GIF, layering the plots we created in order. The default web file is animated_choropleth.html:
    library(choroplethr)
    choroplethr_animate(plt_list)
    
    We extract three panels from this animation here to give you a flavor of what the animation looks like:
    Even from this limited view of the data, we can see the striking growth of employment and wealth in western North Dakota, Wyoming and northeast Nevada, probably due to the discovery, exploration, and mining of shale oil in this region. We can also see that generally, over the 8 years shown here, pay has risen across the country; however, regions throughout the heart of the continental United States have seen almost no change in average pay over this period. We also see a clear increase and expansion of high pay in both California and the northeast. Recall that all three plots are on the same color scale, and so, the interpretation is consistent across them.

How it works…

We covered the individual functionality used in this recipe in previous recipes and so will not repeat them. However, looking at the big picture here shows us two key points worth noting. First, we had to run through a common set of steps multiple times to create a single image. As we had to go through the same steps to create each image used in the final animation, we started to "operationalize" the code a bit, refactoring repeated code blocks into functions.
Secondly, the set of steps needed to create each image is another demonstration of the stages of the data science pipeline. In stage one, we acquire the data through the ingestion of CSV files. Our understanding of the dataset has been built in previous recipes, so the exploration and understanding stage (stage 2 of the pipeline) is a bit light. We join disparate datasets and filter them for stage 3, the munging, wrangling, and manipulation stage. For stage 4, analysis and modeling, we simply discretize the data and then map it to a particular but consistent color scale. Finally, the final data product, the animated choropleth, is used to communicate the vast amount of data in a concise and quickly understandable fashion.

There's more…

The R package choroplethr can directly create individual choropleths using the choropleth function, which uses ggplot2. However, we didn't like the default appearance of the output, and customization was easier using ggplot2 directly.
Dr. Vaidyanathan, the creator of the popular rCharts package, also created the rMaps package. It creates choropleths from R using JavaScript visualization libraries for presentation on the Web, and it can also create animated choropleths using the function ichoropleth. However, the package is still in development at the time of writing this book, so we didn't have the facility to create county-level maps. An example with state-level maps is shown in the rMaps blog at http://rmaps.github.io/blog/posts/animated-choropleths/.

Benchmarking performance for some common tasks

R and its package ecosystem often provide several alternative ways of performing the same task. R also encourages users to create their own functions for particular tasks. When execution time is important, benchmarking performance is necessary to see which strategy works best. We will concentrate on speed in this recipe. The two tasks we will look at are loading the data into R and joining two data objects based on a common variable. All tests are done on a Windows 7 desktop running a 2.4 GHz Intel processor with 8 GB of RAM.

Getting ready

We will use the ann2012full and industry data objects for our performance experiments here, along with the 2012 annual employment data CSV file for data loading. Since you already have these, you are good to go. You will also need the rbenchmark and microbenchmark packages; if you don't have them, install them using the install.packages() command.

How to do it…

The following steps will walk us through benchmarking two different tasks in R:
  1. Our first task is to load the employment data into R. The 2012.annual.singlefile.csv file has 3,556,289 lines of data and 15 columns. While we used fread in this chapter, there are many other possible ways of importing this data, which are as follows:
    • The first and most standard way is to use read.csv to read the CSV file
    • You can also unzip the original 2012_annual_singlefile.zip data file on the fly and read the data using read.csv
    • We can save the data to an RData file the first time we load it, and load this file on subsequent occasions to import the data into R
  2. The most basic way to benchmark speed is using the system.time function, which measures the time (both elapsed and actual computing time) taken for the task to be performed:
    > system.time(fread('data/2012.annual.singlefile.csv'))
       user  system elapsed
      14.817   0.443   15.23
    
    Note that the times you see will differ from those listed in the preceding output.
  3. However, there are packages in R that make benchmarking and comparing different functions much easier. We will introduce the rbenchmark package, which provides the benchmark function that allows the simultaneous comparison of different functions:
    library(rbenchmark)
    opload <- benchmark(
      CSV=read.csv('data/2012.annual.singlefile.csv', stringsAsFactors=F),
      CSVZIP=read.csv(unz('data/2012_annual_singlefile.zip',
       '2012.annual.singlefile.csv'), stringsAsFactors=F),
      LOAD = load('data/ann2012full.rda'),
      FREAD = fread('data/2012.annual.singlefile.csv'),
      order='relative', # Report in order from shortest to longest 
      replications=5
    )
    
    You can refer to the following screenshot for the output of the preceding commands:
    Note that the results are ordered, and the relative times are recorded under the relative column. This shows that fread is quite a bit faster than reading using read.csv. The very interesting thing is that, on average, it is about four times faster than loading the data from an RData file, which is the usual storage method for R data. It is apparently faster to read the raw CSV file using fread than to load data stored in R's own serialized format!
  4. Our second task is to perform a left outer join of two data objects. We'll look at a task that we have already performed—a left join of the employment data with the industry codes. A left join ensures that the rows of data on the left of the operation will be preserved through the operation, and the other data will be expanded by repetition or missing data to have the same number of rows. We used left_join in this chapter, but there are three other strategies we can take, which are as follows:
    • The merge function available in R's standard library
    • The join function from the plyr package
    • The merge function from the data.table package, first transforming the data into data.table objects
  5. We will again use the benchmark function to compare these strategies with left_join:
    ann2012full_dt <- data.table(ann2012full, key='industry_code')
    industry_dt <- data.table(industry, key='industry_code')
    op <- benchmark(
       DT = data.table::merge(ann2012full_dt, industry_dt,
                  by='industry_code', all.x=T),
       PLYR = plyr::join(ann2012full, industry,         by='industry_code',type='left'),
     DPLYR = dplyr::left_join(ann2012full, industry),
     DPLYR2 = dplyr::left_join(ann2012full_dt, industry_dt),
     MERGE = merge(ann2012full, industry,
                   by='industry_code', all.x=T),
     order='relative',
     replications=5
    ) 
    
    You can refer to the following screenshot for the output of the preceding commands:
    Here, we see that the data.table method is a lot faster than any other strategy. Using dplyr is about 12 times slower for this particular task, plyr is about 100 times slower, and the standard merge method is 200 times slower. There is a bit of overhead in converting the data.frame objects to data.table objects, but the margin of advantage in this task overcomes this overhead.

How it works…

The basic workhorse of time benchmarking in R is the system.time function. This function records the time when evaluation of an expression starts, runs the expression, and then notes the time when it finishes. It then reports the difference of the two times. By default, garbage collection takes place before each evaluation so that the results are more consistent and maximal memory is freed for each evaluation.
The benchmark function in the rbenchmark package provides additional flexibility. It wraps the system.time function and allows several expressions to be evaluated in a single run. It also does some basic computations, such as relative times, to simplify reporting.
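The microbenchmark package, mentioned in the Getting ready section, offers a similar interface with finer timing resolution; a minimal sketch, assuming the package is installed:
library(microbenchmark)
library(data.table)

# Compare two import strategies a handful of times; the summary reports
# the distribution of run times rather than a single measurement
microbenchmark(
  FREAD = fread('data/2012.annual.singlefile.csv'),
  CSV   = read.csv('data/2012.annual.singlefile.csv', stringsAsFactors=F),
  times = 5
)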
In terms of our tasks here, fread uses a powerful, optimized C function to read the data, resulting in a high degree of speed optimization. The read.csv function just reads the datafile line by line and parses the fields by the comma separator. We can get some speed improvement in our experiments by specifying the column types in read.csv using the colClasses option, since determining data types consumes some execution time. The load function reads the data from the RData file created using the save function, which stores binary representations of R objects. It compresses the size of the data a lot, but we see that there are more efficient ways of reading data than loading the RData file.
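A sketch of that read.csv variant, adapting the colClasses vector we passed to fread earlier in the chapter (numeric replaces integer for the wage columns, since base R integers cannot hold the largest wage totals):
# Declaring column types up front spares read.csv the type-detection pass
col_types <- c('character', 'integer', 'integer', 'integer', 'integer',
               'integer', 'character', rep('numeric', 8))
ann2012_csv <- read.csv('data/2012.annual.singlefile.csv',
                        stringsAsFactors=FALSE, colClasses=col_types)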
The second task we set ourselves to benchmark is a left outer join of the employment data, ann2012full, with the industry data object containing the industry codes. The former has 3,556,289 rows and 15 columns, and the latter has 2,469 rows and 2 columns. They are merged based on the common variable, industry_code. In a left join, all the rows of ann2012full will be preserved. For this, the merge commands use the all.x=T option. The join function has the type='left' option for a left join. For the data.table merge, we first convert the data.frame objects to data.table objects, specifying that each has the same key variable (think of an index in a database), industry_code. The data.table objects are then merged using this key variable.
There is a bit of new code formatting in this code snippet. We use plyr::join and dplyr::left_join, rather than just join and left_join. This style of coding explicitly specifies that we are using a particular function from a particular package to avoid confusion. Sometimes, this style of coding is useful when you have functions with the same name in two different packages that are both loaded in R.

There's more…

The data.table package provides very fast tools for data loading, munging, and joining. The data.table object is a derivative of R's data.frame object, and many of the functions in R that accept data.frame objects will also accept data.table objects. It is for this reason that the data.table object can become your default container for rectangular data.
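As a closing sketch, the keyed data.table objects built in this recipe also support data.table's bracket-join syntax, which expresses effectively the same left join in another way:
library(data.table)

ann2012full_dt <- data.table(ann2012full, key='industry_code')
industry_dt    <- data.table(industry,    key='industry_code')

# X[Y] keeps every row of Y (here, the full employment data) and attaches
# the matching columns from X, giving the same result as the earlier left join
joined_dt <- industry_dt[ann2012full_dt]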
