Practical Employment project with R
Visually Exploring Employment Data (R)
In this Articule, we will cover:
Preparing for analysis
Importing employment data into R
- Exploring the employment data
- Obtaining and merging additional data
- Adding geographical information
- Extracting state- and county-level wage and employment information
- Visualizing geographical distributions of pay
- Exploring where the jobs are, by industry
- Animating maps for a geospatial time series
- Benchmarking performance for some common tasks
Preparing for analysis
This recipe will prepare the groundwork with the tools you need to complete this project. If you do not have R installed on your computer, please see the instructions in Chapter 1, Preparing Your Data Science Environment.
The following steps will get you prepared for the remainder of this chapter by downloading the dataset from the BLS website and ensuring that we have the needed R packages:
- Download the 27.5 MB compressed data from http://www.bls.gov/cew/data/files/2012/csv/2012_annual_singlefile.zip, and save it to a location that you will remember.
- Uncompress the file by right-clicking on it in Explorer or Finder, and use the appropriate menu item.
- Launch the RStudio IDE on your computer (or just plain R for purists).
- Load the R packages we will need for this project:
library(data.table) library(plyr) library(dplyr) library(stringr) library(ggplot2) library(maps) library(bit64) library(RColorBrewer) library(choroplethr)
Note
If you do not have one of these packages installed on your machine, you can easily install it using the following command, exchangingdata.table
for the package name to be installed:install.packages('data.table',repos='http://cran.r-project.org')
The R package repository, known as CRAN, has several mirrors around the world. A "mirror," in this usage, is a duplicate copy of the software repository that is run on a server in a different region, providing faster access for individuals near the location. You can, and should, choose a mirror geographically closest to your location to speed up the package download. In the preceding code snippet, you changehttp://cran.r-project.org
to the URL of your preferred CRAN mirror. - Finally, set your working directory to the path where you have saved the file. This will tell R where you want it to look for the file:
setwd('path')
We will primarily use three different R packages that are extremely useful in importing, manipulating, and transforming large datasets.
The package
data.table
improves upon the data.frame
object in R, making operations faster and allowing the specification of an index variable in a dataset, a concept familiar to database experts. It also allows the fast aggregation of large data, including very fast-ordered joins. We will primarily use the data.table
package for its function fread
, which allows (very) fast importing of large-structured datasets into R. We will also investigate the relative performance of functions in this package against other functions we use in the benchmarking recipe later in this chapter.
The
stringr
package provides tools for text and string manipulation. This package streamlines and syntactically unifies available string manipulation functionalities available in R, making tasks involving string search, manipulation, and extraction much easier. We will need these functionalities here.
The
dplyr
package is the next iteration of the popular package, plyr
, by Dr. Hadley Wickham. It is targeted at rectangular data and allows very fast aggregation, transformation, summarization, column selection, and joins. It also provides syntactical sugar to allow commands to be strung together, piping the results of one command into the next. This will be our workhorse in this project.
The
ggplot2
package will be our visualization workhorse. It implements the Grammar of Graphics paradigm in R and provides a very flexible means of creating visualizations.
The
maps
package provides geographical information about the US that we will use in our visualizations.Importing employment data into R
Our first step in this project is to import the employment data into R so that we can start assessing the data and perform some preliminary analysis.
The following steps will guide you through two different ways of importing the CSV file:
- We can directly load the data file into R (even from the compressed version) using the following command:
ann2012 <- read.csv(unz('2012_annual_singlefile.zip','2012.annual.singlefile.csv'),stringsAsFactors=F)
However, you will find that this takes a very long time with this file. There are better ways. - We chose to import the data directly into R since there are further manipulations and merges we will do later. We will use the
fread
function from thedata.table
package to do this:library(data.table) ann2012 <- fread('data/2012.annual.singlefile.csv')
That's it. Really! It is also many times faster than the other method. It will not convert character variables to factors by default, so if you need factors later, you will have to convert them, but that is, in our opinion, a desirable feature in any case.
We were familiar with
read.csv
from previous article, Driving Visual Analysis with Automobile Data (R). It basically reads the data line by line, separating columns by commas. As the data we are using for this project is 3.5 million rows, it is still small enough to fit into the memory of most modern personal computers, but will take quite some time to import using read.csv()
.
The
fread
function from the data.table
package uses an underlying C-level function to figure out, from the file, the length, number of fields, data types, and delimiters in the file, and it then reads the file using the parameters it has learned. As a result, it is significantly faster than read.csv
. There is an extensive description of the details of fread
in the R documentation.
One of the limitations currently in R is that data imported into R needs to fit into the memory of the host computer. For large datasets, using a SQL-based database for data storage and manipulation takes advantage of the speed of the SQL-based database and circumvents R's memory limitation. Often, in enterprise environments, data is stored in Oracle, SAP, MySQL or PostgreSQL databases.
The
sqldf
package is extremely useful if you are from a SQL background, and many entering the world of data science have such a background. This package allows you to use SQL commands and queries within R, and treats data.frame
objects in R just as you would treat tables in a SQL database. It also allows you to connect with most SQL-based databases that you have access to, including SQLite, MySQL, and PostgreSQL, among others.
As a demonstration, we can import data into a SQLite database, and then read and manipulate it there:
sqldf('attach blsdb as new')
read.csv.sql('data/2012.annual.singlefile.csv',
sql='create table main.ann2012 as select * from file',
dbname='blsdb')
Note
You must have SQLite installed and available on your machine for the preceding code to work. We will leave the installation to you.
This will also take some time, but you'll have the data in SQLite. The preceding command actually doesn't ingest the data into R, but just imports it into SQLite.
You can import the data into R using the following command:
ann2012<- sqldf("select * from main.ann2012", dbname='blsdb')
You can also choose to do your manipulations using SQL within SQLite with
sqldf
. For some users, it will be easier to manipulate the data using SQL, and then merely import the munged data into R.
As the first part of the book is focused on R, we will not delve deeper into the use of
sqldf
beyond what was presented here. However, if you are more familiar with SQL, you are welcome to replicate the steps presented in R in the various recipes with SQL commands, but we'll leave this as an exercise for you.Exploring the employment data
Now that the data is imported into R and we have learned some strategies to import larger datasets into R, we will do some preliminary analysis of the data. The purpose is to see what the data looks like, identify idiosyncrasies, and ensure that the rest of the analysis plan can move forward.
The following steps will walk you through this recipe to explore the data:
- First, let's see how large this data is:
>dim(ann2012) [1] 3556289 15
Good, it's only15
columns. -
head(ann2012)
You can refer to the following screenshot:What are the variablesown_code
,industry_code
, and so on, and what do they mean? We might need more data to understand this data. - There is also a weird idiosyncrasy in this data. Some of the values for
total_annual_wages
,taxable_annual_wages
andannual_contributions
look impossibly small. A peek at the actual data shows that these numbers don't appear to be correct. However,fread
actually gives an indication of what is going on:ann2012 <- fread('data/2012.annual.singlefile.csv', sep=',', colClasses=c('character', 'integer', 'integer', 'integer', 'integer', 'integer', 'character',rep('integer',8)))
You can refer to the following screenshot: - This points to the fact that the
bit64
package might be needed to properly display these large numbers. Installing and loading this package, and then reimporting the data corrects the problem, as seen in the following command lines:install.packages('bit64') library('bit64') ann2012 <- fread('data/2012.annual.singlefile.csv', sep=',', colClasses=c('character', 'integer', 'integer', 'integer', 'integer', 'integer', 'character',rep('integer',8)))
You can refer to the following screenshot:
The
head
command displays the first few lines (the default is the top six lines) of a data frame. We notice that some of the headings are self-explanatory, but some allude to codes that we currently don't have access to. We will have to obtain additional data in order to make a meaningful analysis. We could have looked at the data without importing it into R. The UNIX command, less
, and the Windows PowerShell command, type
, could have shown us the same thing as head
did.Obtaining and merging additional data
In the previous recipe, we found that additional data was needed to understand what the data in the csv file actually represents, and this recipe will directly address this need.
We can find additional data on the BLS website at http://www.bls.gov/cew/datatoc.htm under the header Associated Codes and Titles. There are five files there, which we will download to our computer. They are as follows:
agglevel_titles.csv
(http://www.bls.gov/cew/doc/titles/agglevel/agglevel_titles.csv)area_titles.csv
(http://www.bls.gov/cew/doc/titles/area/area_titles.csv)industry_titles.csv
(http://www.bls.gov/cew/doc/titles/industry/industry_titles.csv)ownership_titles.csv
(http://www.bls.gov/cew/doc/titles/ownership/ownership_titles.csv)size_titles.csv
(http://www.bls.gov/cew/doc/titles/size/size_titles.csv)
We download them to our computer, remembering where we stored them. We need to get ready to import them into R and merge them with our original data.
The following steps will lead you through loading these files into R and joining them into a larger data frame:
- We will now import these data files into R using the following command lines:
for(u in c('agglevel','area','industry', 'ownership','size')){ assign(u,read.csv(paste('data/',u,'_titles.csv',sep=''), stringsAsFactors=F)) }
This is an example of code that makes it easier for us to do repeated tasks. - Each of these datasets has exactly one variable (column header) in common with our original data, which is
ann2012
, as shown in the following screenshot:So, it should be fairly easy to put the datasets together. We'll join four of these datasets withann2012
now, and savearea
, that is, the data fromarea_titles.csv
, for the next recipe, since we want to manipulate it a bit:codes <- c('agglevel','industry','ownership','size') ann2012full <- ann2012 for(i in 1:length(codes)){ eval(parse(text=paste('ann2012full <- left_join(ann2012full, ',codes[i],')', sep=''))) }
The end result is shown in the following screenshot:
In step 1 of the How to do it… section, we want to assign each dataset to its own object. We can write an individual line of code for each import, but this code allows us to do this much faster, and it will be easier to maintain in the future. The
assign
command takes two basic inputs: a variable name and a value to be assigned to the variable name. The for
loop here doesn't iterate over numbers, but over objects. At the first iteration, u
takes the value agglevel
. The assign
function takes the name agglevel
andassigns it the result of the read.csv
command. Within the paste
command, we again use the value of u
, since all the files we are importing have the same naming convention. It is this common naming convention that allows us to use this type of coding. Thus, the first iteration gives assign('agglevel', read.csv('data/agglevel_title.csv'))
, and so forth.
In step 2, we join the datasets together. We first copy the original data to a new name,
ann2012full
, so that we can build up this new dataset without corrupting the original data in case something goes wrong. We then use a macro-like construct to join all the new datasets to the original one, iterating over numbers in the for
loop and the indices of the vector code.
Let's work our way inside out in this complex command (a sound strategy to understand complex code in general). Within the
paste
command, we create the command we would like evaluated. We want to do a left_join
(this is from the dplyr
package), joining ann2012full
with agglevel
in the first iteration. The left_join
ensures that all the rows of ann2012full
are preserved, and the rows of agglevel
are appropriately replicated to match the number of rows in ann2012full
. Since there is only one common variable in the two datasets, left_join
automatically chooses to join using this.Note
In general,
left_join
will join the two datasets using all the variable names it finds common between the two. If you do not want this, you can specify which variables you want to use for the join by specifying, for example, left_join(ann2012full, agglevel, by="agglvl_code ")
.
The
eval
statement then evaluates the command we constructed using the paste
command. We iterate over the names in code so that each of the four datasets gets joined with ann2012full
.
A quick check will show that the number of rows in
ann2012full
after the joins, and in ann2012
, is the same.Adding geographical information
The main purpose of this chapter is to look at the geographical distribution of wages across the US. Mapping this out requires us to first have a map. Fortunately, maps of the US, both at the state-and county-levels, are available in the
maps
package, and the data required to make the maps can be extracted. We will align our employment data with the map data in this recipe so that the correct data is represented at the right location on the map.
The following steps will guide you through the process of creating your first map in R:
- Let's first look at the data in
area
:head (area)
The output is shown in the following screenshot: - We want to capitalize all the names, according to the conventions. We'll write a small function to do this:
simpleCap <-function(x){ if(!is.na(x)){ s <- strsplit(x,' ')[[1]] paste(toupper(substring(s,1,1)), substring(s,2), sep='', collapse=' ') } else {NA} }
- The
maps
package contains two datasets that we will use; they arecounty.fips
andstate.fips
. We will first do some transformations. If we look atcounty.fips
, we notice that the FIPS code there is missing a leading0
on the left for some of the codes. All the codes in our employment data comprise five digits:> data(county.fips) > head(county.fips) fips polyname 1 1001 alabama,autauga 2 1003 alabama,baldwin 3 1005 alabama,barbour 4 1007 alabama,bibb 5 1009 alabama,blount 6 1011 alabama,bullock
- The
stringr
package will help us out here:county.fips$fips <- str_pad(county.fips$fips, width=5, pad="0")
- We want to separate the county names from the
polyname
column incounty.fips
. We'll get the state names fromstate.fips
in a minute:county.fips$polyname <- as.character(county.fips$polyname) county.fips$county <- sapply( gsub('[a-z\ ]+,([a-z\ ]+)','\\1',county.fips$polyname), simpleCap) county.fips <- unique(county.fips)
- The
state.fips
data involves a lot of details:> data(state.fips)
The output is shown in the following screenshot: - We'll again pad the
fips
column with a0
, if necessary, so that they have two digits, and capitalize the state names frompolyname
to create a newstate
column. The code is similar to the one we used for thecounty.fips
data:state.fips$fips <- str_pad(state.fips$fips, width=2, pad="0", side='left') state.fips$state <- as.character(state.fips$polyname) state.fips$state <- gsub("([a-z\ ]+):[a-z\ \\']+",'\\1',state.fips$state) state.fips$state <- sapply(state.fips$state, simpleCap)
- We make sure that we have unique rows. We need to be careful here, since we only need to have uniqueness in the
fips
andstate
values, and not in the other code:mystatefips <-unique(state.fips[,c('fips','abb','state')])
Theunique function
, when applied to adata.frame
object, returns the unique rows of the object. You might be used to usingunique
on a single vector to find the unique elements in the vector. - We get a list of the lower 48 state names. We will filter our data to look only at these states:
lower48 <- setdiff(unique(state.fips$state),c('Hawaii','Alaska'))
- Finally, we put all this information together into a single dataset,
myarea
:myarea <- merge(area, county.fips, by.x='area_fips',by.y='fips', all.x=T) myarea$state_fips <- substr(myarea$area_fips, 1,2) myarea <- merge(myarea, mystatefips,by.x='state_fips',by.y='fips', all.x=T)
- Lastly, we join the geographical information with our dataset, and filter it to keep only data on the lower 48 states:
ann2012full <- left_join(ann2012full, myarea) ann2012full <- filter(ann2012full, state %in% lower48)
- We now store the final dataset in an R data (
rda
) file on disk. This provides an efficient storage mechanism for R objects:save(ann2012full, file='data/ann2014full.rda',compress=T)
The 12 steps of this recipe covered quite a bit of material, so let's dive into some of the details, starting with step 2. The
simpleCap
function is an example of a function in R. We use functions to encapsulate repeated tasks, reducing code duplication and ensuring that errors have a single point of origin. If we merely repeat code, changing the input values manually, we can easily make errors in transcription, break hidden assumptions, or accidentally overwrite important variables. Further, if we want to modify the code, we have to do it manually at every duplicate location. This is tedious and error-prone, and so we make functions, a best practice that we strongly encourage you to follow.
The
simpleCap
function uses three functions: strsplit
, toupper
and substring
. The strsplit
function splits strings (or a vector of strings) whenever it finds the string fragment to split on (in our case, ' '
or a space). The substring
function extracts substrings from strings between the character locations specified. Specifying only one character location implies extracting from this location to the end of the string. The toupper
function changes the case of a string from lowercase to uppercase. The reverse operation is done by tolower
.
From step 3, packages often have example data bundled with them.
county.fips
and state.fips
are examples of datasets that have been bundled into the maps
package.
The
stringr
package, used in step 4, is another package by Dr. Wickham, which provides string manipulation functions. Here, we use str_pad
, which pads a string with a character (here, 0
) to give the string a particular width.
In step 5, we use the inbuilt regular expression (regex) capabilities in R. We won't talk about regular expressions too much here. The
gsub
function looks for the first pattern and substitutes the second pattern in the string specified as third. Here, the pattern we're looking for comprises one or more letters or spaces ([a-z\ ]+
), then a comma, and then one or more letters or spaces. The second set of letters and spaces is what we want to keep, so we put parentheses around it. The \\1
pattern says to replace the entire pattern with the first pattern we used parentheses around. This replacement happens for every element of the polyname
field.
Since we want capitalization for every element in
polyname
, we can use a for
loop, but choose to use the more efficient sapply
instead. Every element in polyname
is passed through the simpleCap
function, and is thus capitalized in step 7.
In step 10, we join the
area
, county.fips
, and mystatefips
datasets together. We use the merge
function rather than left_join
, since the variables we want to join on have different names for different data.frame
objects. The merge
function in the R standard library allows this flexibility. To ensure a left join, we specify all.x=TRUE
.
In step 11, we join the
myarea
data frame to our ann2014full
dataset. We then use the filter
function to subset the data, restricting it to data from the lower 48 states. The filter
function is from the dplyr
package. We'll speak about the functionalities in dplyr
in the next recipe.Extracting state- and county-level wage and employment information
So far, we worked to get the data into shape for analysis. We'll now start with looking at the geographical distribution of the average annual pay per state and per county.
If you have thoroughly followed the recipes in this chapter till now, you will have the data in a form from where you can extract information at different levels. We're good to go!
We will first extract data from
ann2014full
at the state-level. We need to perform the following steps:- We look at the aggregate state-level data. A peek at
agglevel
tells us that the code for the level of data that we want is50
. Also, we only want to look at the average annual pay (avg_annual_pay
) and the average annual employment level (annual_avg_emplvl
), and not the other variables:d.state <- filter(ann2014full, agglvl_code==50) d.state <- select(d.state, state, avg_annual_pay, annual_avg_emplvl)
- We create two new variables,
wage
andempquantile
, which discretizes thepay
andemployment
variables:d.state$wage <- cut(d.state$avg_annual_pay, quantile(d.state$avg_annual_pay, c(seq(0,.8, by=.2), .9, .95, .99, 1))) d.state$empquantile <- cut(d.state$annual_avg_emplvl, quantile(d.state$annual_avg_emplvl, c(seq(0,.8,by=.2),.9,.95,.99,1)))
- We also want the levels of these discretized variables to be meaningful. So, we run the following commands:
x <- quantile(d.state$avg_annual_pay, c(seq(0,.8,by=.2),.9, .95, .99, 1)) xx <- paste(round(x/1000),'K',sep='') Labs <- paste(xx[-length(xx)],xx[-1],sep='-') levels(d.state$wage) <- Labs x <- quantile(d.state$annual_avg_emplvl, c(seq(0,.8,by=.2),.9, .95, .99, 1)) xx <- ifelse(x>1000, paste(round(x/1000),'K',sep=''), round(x)) Labs <- paste(xx[-length(xx)],xx[-1],sep='-') levels(d.state$empquantile) <- Labs
- We repeat this process at the county-level. We will find that the appropriate aggregation level code is
70
(agglvl_code==70
). Everything else will be the same. Let's try to be a bit smarter this time around. First of all, we will discretize our variables the same way, and then change the labels to match. A function might be a good idea! The following command lines depict this:Discretize <- function(x, breaks=NULL){ if(is.null(breaks)){ breaks <- quantile(x, c(seq(0,.8,by=.2),.9, .95, .99, 1)) if (sum(breaks==0)>1) { temp <- which(breaks==0, arr.ind=TRUE) breaks <- breaks[max(temp):length(breaks)] } } x.discrete <- cut(x, breaks, include.lowest=TRUE) breaks.eng <- ifelse(breaks > 1000, paste0(round(breaks/1000),'K'), round(breaks)) Labs <- paste(breaks.eng[-length(breaks.eng)], breaks.eng[-1], sep='-') levels(x.discrete) <- Labs return(x.discrete) }
- We alluded to the syntactic sugar of
dplyr
before; now, we see it in action. Thedplyr
package allows you to string together different operations, piping the results of one operation as input for the next, using the%.%
operator. We'll describe the main operations ofdplyr
inthe next recipe. Using some function encapsulation, the following code achieves everything that we spent significantly more lines of code to achieve in steps 1-3:d.cty <- filter(ann2012full, agglvl_code==70)%.% select(state,county,abb, avg_annual_pay, annual_avg_emplvl)%.% mutate(wage=Discretize(avg_annual_pay), empquantile=Discretize(annual_avg_emplvl))
We now have the basic datasets we need to visualize the geographic patterns in the data.
The preceding five steps covered a lot of R code, so let's start breaking things down. The two functions
filter
and select
are from dplyr
. The dplyr
package provides five basic functions, which are as follows:
Each of these functions can operate on a
data.frame
, data.table
, or tbl
object, which is part of dplyr
.
The
cut
function discretizes a continuous variable based on specified breakpoints or thresholds. Here, we specify the thresholds based on quantiles of the variable. We specify which quantiles we want by a sequence of fractions:c(seq(0, .8, by=.2), .9, .95, .99, 1)
This is done using the
seq
function to create a regular sequence of numbers with a start value, an end value, and the difference between two successive values.
In step 3, we take the specified thresholds and format them. Numbers above 1,000 are truncated at the thousands place and appended with a
K
, as is conventionally seen. Using round
without specifying the number of decimal places implies no decimal places.
Further, in step 3, we want our labels to represent a range. So, we need to create the labels by putting a dash (
-
) between successive values of our formatted thresholds. One way of doing this is to create two copies of the vector of thresholds, one without the last element and another without the first element. We then paste them together with -
. Notice that this trick allows successive thresholds to be aligned and pasted together. If you're not convinced, print out xx[-length(xx)]
and xx[-1]
, and see for yourself.
The
Discretize
function encapsulates the work we were doing in discretizing our outcomes and formatting their labels.
This code snippet uses the syntax of
dplyr
to string together functions. We first subset the original data, keeping only data that has agglvl_code=50
(note the ==
in the code). We then pipe the resulting reduced data into the second function, select
, which keeps only the four variables we're interested in. This further reduces data, and it is then inputted into the mutate
function, which then creates two new variables in the data object. This final object is then stored with the variable name d.cty
.Visualizing geographical distributions of pay
We created datasets that contain the data we need to visualize average pay and employment by county and state. In this recipe, we will visualize the geographical distribution of pay by shading the appropriate areas of the map with a color that maps to a particular value or range of values. This is commonly referred to as a chloropleth map; this visualization type has become increasingly popular over the last few years as it has become much simpler to make such maps, especially online. Other geographic visualizations will overlay a marker or some other shape to denote data; there is no need to fill specific shapes with geographically meaningful boundaries.
After the last recipe, you should be ready to use the datasets we created to visualize geographical distributions. We will use the
ggplot2
package to generate our visualizations. We will also use the RColorBrewer
package, which provides "palettes" of colors that are visually appealing. If you don't currently have RColorBrewer
, install it using install.packages('RColorBrewer', repos='http://cran.r-project.org')
.- We first need to get some data on the map itself. The
ggplot2
package provides a convenient function,map_data
, to extract this from data bundled in themaps
package:library(ggplot2) library(RColorBrewer) state_df <- map_data('state') county_df <- map_data('county')
- We now do a bit of transforming to make this data conform to our data:
transform_mapdata <- function(x){ names(x)[5:6] <- c('state','county') for(u in c('state','county'){ x[,u] <- sapply(x[,u],simpleCap) } return(x) } state_df <- transform_mapdata(state_df) county_df <- transform_mapdata(county_df)
- The
data.frame
objects,state_df
andcounty_df
, contain the latitude and longitude of points. These are our primary graphical data and need to be joined with the data we created in the previous recipe, which contains what is in effect the color information for the map:chor <- left_join(state_df, d.state, by='state') ggplot(chor, aes(long,lat,group=group))+ geom_polygon(aes(fill=wage))+geom_path(color='black',size=0.2)+ scale_fill_brewer(palette='PuRd') + theme(axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank(), axis.ticks.y=element_blank())
This gives us the following figure that depicts the distribution of average annual pay by state: - We can similarly create a visualization of the average annual pay by county, which will give us a much more granular information about the geographical distribution of wages:
chor <- left_join(county_df, d.cty) ggplot(chor, aes(long,lat, group=group))+ geom_polygon(aes(fill=wage))+ geom_path( color='white',alpha=0.5,size=0.2)+ geom_polygon(data=state_df, color='black',fill=NA)+ scale_fill_brewer(palette='PuRd')+ labs(x='',y='', fill='Avg Annual Pay')+ theme(axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank(), axis.ticks.y=element_blank())
This produces the following figure showing the geographical distribution of average annual pay by county:
Let's dive into the explanation of how the preceding 4 steps work. The
map_data
function is provided by ggplot2
to extract map data from the maps
package. In addition to county and state, it can also extract data for the france
, italy
, nz
, usa
, world
, and world2
maps provided by the maps
package.
The columns that contain state and county information in
county_df
and state_df
are originally named region
and subregion
. In step 2, we need to change their names to state
and county
, respectively, to make joining this data with our employment data easier. We also capitalize the names of the states and counties to conform to the way we formatted the data in our employment
dataset.
For the creation of the map in step 3, we create the plotting dataset by joining
state_df
and d.state
using the name of the state. We then use ggplot
to draw the map of the US and fill in each state with a color corresponding to the level of wage and the discretized average annual pay created in the previous recipe. To elaborate, we establish that the data for the plot comes from chor
, and we draw polygons (geom_polygon
) based on the latitude and longitude of the borders of each state, filling them with a color depending on how high wage is, and then we draw the actual boundaries of the states (geom_path
) in black. We specify that we will use a color palette that starts at white, goes through purple, and has red corresponding to the highest level of wage. The remainder of the code is formatted by specifying labels and removing axis annotations and ticks from the plot.
For step 4, the code is essentially the same as step 3, except that we draw polygons for the boundaries of the counties rather than the states. We add a layer to draw the state boundaries in black (
geom_polygon(data=state_df, color='black', fill=NA)
), in addition to the county boundaries in white.Exploring where the jobs are, by industry
In the previous recipe, we saw how to visualize the top-level aggregate data on pay. The employment dataset has more granular data, divided by public/private sectors and types of jobs. The types of jobs in this data follow a hierarchical coding system called North American Industry Classification System (NIACS). In this recipe, we will consider four particular industries and look at visualizing the geographical distribution of employment in these industries, restricted to private sector jobs.
We will look at four industrial sectors in this recipe:
- Agriculture, forestry, fishing, and hunting (NIACS 11)
- Mining, quarrying, and oil and gas extraction (NIACS 21)
- Finance and insurance (NIACS 52)
- Professional and technical services (NIACS 54)
We need to create a subset of the employment data, including the data for industrial sectors, but restricting it to the private sector, by performing the following steps:
- We start by filtering the data by the conditions we are imposing on the industry and private sectors, and keep only relevant variables:
d.sectors <- filter(ann2012full, industry_code %in% c(11,21,54,52), own_code==5, # Private sector agglvl_code == 74 # county-level ) %.% select(state,county,industry_code, own_code,agglvl_code, industry_title, own_title, avg_annual_pay, annual_avg_emplvl)%.% mutate(wage=Discretize(avg_annual_pay), emplevel=Discretize(annual_avg_emplvl)) d.sectors <- filter(d.sectors, !is.na(industry_code))
- We now create the visualization using
ggplot2
. This visualization will be an array of four panels, one for each industrial sector. Each panel will have a county-level map of the US, with colors signifying the level of employment in each county in 2012 in each particular industry. We will also choose a blue-dominated palette for this visualization:chor <- left_join(county_df, d.sectors) ggplot(chor, aes(long,lat,group=group))+ geom_polygon(aes(fill=emplevel))+ geom_polygon(data=state_df, color='black',fill=NA)+ scale_fill_brewer(palette='PuBu')+ facet_wrap(~industry_title, ncol=2, as.table=T)+ labs(fill='Avg Employment Level',x='',y='')+ theme(axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank(), axis.ticks.y=element_blank())
This produces the following visualization, showing geographical distribution of employment by industry:
In this recipe, we used the
dplyr
functions for data munging. One of our filter criteria was that the industry_code
variable should have one of the values out of 11
, 21
, 52
, or 54
. This was achieved by the %in%
operator, which is a set operations. It asks if an element on the left is a member of the set on the right. We have multiple criteria in the filter
statement, separated by commas. This implies an AND
relationship in that all of the criteria must be satisfied in order for the data to pass the filter.
We noticed that there were some missing values in the industry code. This resulted in an extra panel in the visualization, corresponding to data where the industry code was missing. We didn't want this, so we filtered out this data in this first step.
In the second step, the command to create the visualization is essentially the same as in the previous recipe, except for the following line:
facet_wrap(~industry_title, ncol=2)
This command splits the data up by the value of
industry_title
, creates separate visualizations for each value of industry_title
, and puts them back onto a grid with two columns and an appropriate number of rows. We also used industry_title
instead of industry_code
here (they give the same visualization) so that the labeling of the panels is understandable, rather than comprising just some numbers that require the reader to look up their meaning.
This recipe is the tip of the iceberg for this dataset. There are many levels that can be explored with this data, both in terms of private/public sectors and in terms of drilling down into different industries. The additional quarterly data from 2012 is also available and can shed light on temporal patterns. The annual and quarterly data is available from 1990 onwards. Further analyses are possible association of temporal patterns of employment with other socioeconomic events. The
choroplethr
and rMaps
packages provide ways of creating animations over time for this type of data.Animating maps for a geospatial time series
One of the real interests in this project is to see how wage patterns, as a surrogate for income patterns, changed over time. The QCEW site provides data from 2003 to 2012. In this recipe, we will look at the overall average annual pay by county for each of these years and create an animation that displays the changes in the pay pattern over this period.
For this recipe, we need to download the annual data for the years 2003 to 2011 from the BLS website, at http://www.bls.gov/cew/datatoc.htm. You will need to download the files corresponding to these years for the QCEW NIACS-based data files in the column CSVs Single Files-Annual Averages. Store these files (which are compressed
.zip
files) in the same location as the zipped 2012 data you downloaded at the beginning of this project. Don't unzip them! You must also download and install the choroplethr
package using install.packages('chloroplethr')
, if you haven't already done so.
What we need to do is import the data for all the years from 2003 through 2012 and extract data for the county-level (
agglvl_code==70
) average annual pay (avg_annual_pay
) for each county, plot it, and then string the pay values together in an animation. Since we basically need to do the same things for each year's data, we can do this in a for
loop, and we can create functions to encapsulate the repeated actions. We start by writing code for a single year, performing the following steps:- We import the data from the ZIP file that we call
zipfile
in this prototype code. In reality, the file names are of the pattern2003_annual_singlefile.zip
, and the CSV files in them are of the pattern2003.annual.singlefile.csv
. We will use the common patterns in the ZIP and CSV files in our code to automate the process. For me, the data lies in a folder calleddata
, which is reflected in the following code:unzip(file.path('data',zipfile), exdir='data') # unzips the file csvfile <- gsub('zip','csv', zipfile) # Change file name csvfile <- gsub('_','.',csvfile) # Change _ to . in name dat <- fread(file.path('data', csvfile)) # read data
- We now join the employment data with the geographical data from
myarea
:dat <- left_join(dat, myarea)
- We then use the
dplyr
functions to extract the county-level aggregate pay data, keeping the state and county information:dat <- filter(dat, agglvl_code==70) %.% # County-level aggregate select(state, county, avg_annual_pay) # Keep variables
-
get_data <- function(zipfile){ unzip(file.path('data',zipfile), exdir='data') # unzips the file csvfile <- gsub('zip','csv', zipfile) # Change file name csvfile <- gsub('_','.',csvfile) # Change _ to . in name dat <- fread(file.path('data', csvfile)) # read data dat <- left_join(dat, myarea) dat <- filter(dat, agglvl_code==70) %.% # County-level aggregate select(state, county, avg_annual_pay) # Keep variables return(dat) }
- We now have to repeat this for each of the 10 years and store the data. For this type of data, a list object usually makes sense:
files <- dir('data', pattern='annual_singlefile.zip') # file names n <- length(files) dat_list <- vector('list',n) # Initialize the list for(i in 1:n){ dat_list[[i]]<- get_data(files[i]) # ingest data names(dat_list)[i] <- substr(files[i],1,4) #label list with years }
- Next, we start creating the visualizations. Since we are essentially creating 10 visualizations, the colors need to mean the same thing on all of them for comparison purposes. So, the discretization needs to be the same for all the years:
annpay <- ldply(dat_list) # puts all the data together breaks <- quantile(annpay$avg_annual_pay, c(seq(0,.8,.2),.9,.95,.99,1)) # Makes a common set of breaks
- We will create the same visualization for each year, using the same breaks. Let's create a function for this common visualization to be produced. We will use
ggplot2
for the visualizations. The input values are the data that we create usingget_data
, and the output is aplot
object that can create the visualization:mychoro <- function(d, fill_label=''){ # d has a variable "outcome" that # is plotted as the fill measure chor <- left_join(county_df, d) plt <- ggplot(chor, aes(long,lat, group=group))+ geom_polygon(aes(fill=outcome))+ geom_path(color='white',alpha=0.5,size=0.2)+ geom_polygon(data=state_df, color='black',fill=NA)+ scale_fill_brewer(palette='PuRd')+ labs(x='',y='', fill=fill_label)+ theme(axis.text.x=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank(),axis.ticks.y=element_blank()) return(plt) }
- We now create
plot
objects for each year using afor
loop. We store these objects in a list, with each element corresponding to each year. In the process, we create a new variable,outcome
, which is the discretized pay variable, using the common breaks. This variable needs to be calledoutcome
because of the way we designed themychoro
function:plt_list <- vector('list',n) for(i in 1:n){ dat_list[[i]] <- mutate(dat_list[[i]], outcome=Discretize(avg_annual_pay,breaks=breaks)) plt_list[[i]] <- mychoro(dat_list[[i]])+ggtitle(names(dat_list)[i]) }
- The
choroplethr
package has the utility function,choroplethr_animate
, which takes a list ofplot
objects created withggplot2
and makes a web page with an animated GIF, layeringthe plots we created in order. The default web file isanimated_choropleth.html
:library(choroplethr) choroplethr_animate(plt_list)
We extract three panels from this animation here to give you a flavor of what the animation looks like:Even from this limited view of the data, we can see the striking growth of employment and wealth in western North Dakota, Wyoming and northeast Nevada, probably due to the discovery, exploration, and mining of shale oil in this region. We can also see that generally, over the 8 years shown here, pay has risen across the country; however, regions throughout the heart of the continental United States have seen almost no change in average pay over this period. We also see a clear increase and expansion of high pay in both California and the northeast. Recall that all three plots are on the same color scale, and so, the interpretation is consistent across them.
We covered the individual functionality used in this recipe in previous recipes and so will not repeat them. However, looking at the big picture here shows us two key points worth noting. First, we had to run through a common set of steps multiple times to create a single image. As we had to go through the same steps to create each image used in the final animation, we started to "operationalize" the code a bit, refactoring repeated code blocks into functions.
Secondly, the set of steps needed to create each image is another demonstration of the stages of the data science pipeline. In stage one, we acquire the data through the ingestion of CSV files. Our understanding of the dataset has been built in previous recipes, so the exploration and understanding stage (stage 2 of the pipeline) is a bit light. We join disparate datasets and filter them for stage 3, the munging, wrangling, and manipulation stage. For stage 4, analysis and modeling, we simply discretize the data and then map it to a particular but consistent color scale. Finally, the final data product, the animated choropleth, is used to communicate the vast amount of data in a concise and quickly understandable fashion.
The R package
choroplethr
can directly create the individual choropleths using the choropleth
function, which uses ggplot2
. However, we didn't like the default appearance of the output, and customization was easier using ggplot2
directly.
Dr. Vaidyanathan, the creator of the popular
rCharts
package, also created the rMaps
package. It creates choropleths from R using JavaScript visualization libraries for presentation on the Web, and it can also create animated choropleths using the function ichoropleth
. However, the package is still in development at the time of writing this book, so we didn't have the facility to create county-level maps. An example with state-level maps is shown in the rMaps
blog at http://rmaps.github.io/blog/posts/animated-choropleths/.Benchmarking performance for some common tasks
R and its package ecosystem often provide several alternative ways of performing the same task. R also promotes users to create their own functions for particular tasks. When execution time is important, benchmarking performance is necessary to see which strategy works best. We will concentrate on speed in this recipe. The two tasks we will look at are loading the data into R and joining two data objects based on a common variable. All tests are done on a Windows 7 desktop running a 2.4 GHz Intel processor with 8 GB of RAM.
We will use the
ann2012full
and industry
data objects for our performance experiments here, along with the 2012 annual employment data CSV file for data loading. Since you already have these, you are good to go. If you don't, you will need to install the two functions, rbenchmark
and microbenchmark
, using the install.packages()
command.
The following steps will walk us through benchmarking two different tasks in R:
- Our first task is to load the employment data into R. The
2012.annual.singlefile.csv
file has 3,556,289 lines of data and 15 columns. While we usedfread
in this chapter, there are many other possible ways of importing this data, which are as follows:- The first and most standard way is to use
read.csv
to read the CSV file - You can also unzip the original
2012_annual_singlefile.zip
data file on the fly and read the data usingread.csv
- We can save the data to an
RData
file the first time we load it, and also subsequent times we load this file, to import the data into R
- The most basic way to benchmark speed is using the
system.time
function, which measures the time (both elapsed and actual computing time) taken for the task to be performed:> system.time(fread('data/2012.annual.singlefile.csv')) user system elapsed 14.817 0.443 15.23
Note that the times you see will be different than those listed in the preceding command. - However, there are packages in R that make benchmarking and comparing different functions much easier. We will introduce the
rbenchmark
package, which provides thebenchmark
function that allows the simultaneous comparison of different functions:library(rbenchmark) opload <- benchmark( CSV=read.csv('data/2012.annual.singlefile.csv', stringsAsFactors=F), CSVZIP=read.csv(unz('data/2012_annual_singlefile.zip', '2012.annual.singlefile.csv'), stringsAsFactors=F), LOAD = load('data/ann2012full.rda'), FREAD = fread('data/2012.annual.singlefile.csv'), order='relative', # Report in order from shortest to longest replications=5 )
You can refer to the following screenshot for the output of the preceding commands:Note that the results are ordered, and the relative times are recorded under therelative
column. This shows thatfread
is quite a bit faster than reading usingread.csv
. The very interesting thing is that, on an average, it is 4 times faster than loading the data from anRData
file, which is the usual storage method for R data. It is apparently faster to load the data from the file usingfread
than storing the data in R's own serialized format! - Our second task is to perform a left outer join of two data objects. We'll look at a task that we have already performed—a left join of the employment data with the industry codes. A left join ensures that the rows of data on the left of the operation will be preserved through the operation, and the other data will be expanded by repetition or missing data to have the same number of rows. We used
left_join
in this chapter, but there are three other strategies we can take, which are as follows:- The
merge
function available in R's standard library - The
join
function from theplyr
package - The
merge
function from thedata.table
package, first transforming the data intodata.table
objects
-
ann2012full_dt <- data.table(ann2012full, key='industry_code') industry_dt <- data.table(industry, key='industry_code') op <- benchmark( DT = data.table::merge(ann2012full_dt, industry_dt, by='industry_code', all.x=T), PLYR = plyr::join(ann2012full, industry, by='industry_code',type='left'), DPLYR = dplyr::left_join(ann2012full, industry), DPLYR2 = dplyr::left_join(ann2012full_dt, industry_dt), MERGE = merge(ann2012full, industry, by='industry_code', all.x=T), order='relative', replications=5 )
You can refer to the following screenshot for the output of the preceding commands:Here, we see that thedata.table
method is a lot faster than any other strategy. Usingdplyr
is about 12 times slower for this particular task,plyr
is about 100 times slower, and the standardmerge
method is 200 times slower. There is a bit of overhead in converting thedata.frame
objects todata.table
objects, but the margin of advantage in this task overcomes this overhead.
The basic workhorse of time benchmarking in R is the
system.time
function. This function records the time when evaluation of an expression starts, runs the expression, and then notes the time when it finishes. It then reports the difference of the two times. By default, garbage collection takes place before each evaluation so that the results are more consistent and maximal memory is freed for each evaluation.
The
benchmark
function in the rbenchmark
package provides additional flexibility. It wraps the system.time
function and allows several expressions to be evaluated in a single run. It also does some basic computations, such as relative times, to simplify reporting.
In terms of our tasks here,
fread
uses a powerful optimized C function to read the data, resulting in a high degree of speed optimization. The read.csv
function just reads the datafile line by line and parses the fields by the comma separator. We can get some speed improvements in our experiments by specifying the column types in read.csv
, using the colClasses
option, since determining data types consumes some execution time. The load
function reads the data from the RData
files created using the save
function, which stores binary representations of R objects. It compresses the size of the data a lot, but we see that there are more efficient ways of reading data than loading the RData
file.
The second task we set ourselves to benchmark is a left outer join of the employment data
ann2014full
, with the data object of the industry
industry codes. The former has 3,556,289 rows and 15 columns, and the latter has 2,469 rows and 2 columns. They are merged based on the common variable, industry_code
. In a left join, all the rows of ann2014full
will be preserved. For this, the merge
commands will use the all.x=T
option. The join
function has the type='left'
option for a left join. For the data.table
merge, we first convert the data.frame
objects to data.table
objects, specifying that each has the same key variable, (think index in a database) industry_code
. The data.table
objects are then merged using this key variable.
There is a bit of new code formatting in this code snippet. We use
plyr::join
and dplyr::left_join
, rather than just join
and left_join
. This style of coding explicitly specifies that we are using a particular function from a particular package to avoid confusion. Sometimes, this style of coding is useful when you have functions with the same name in two different packages that are both loaded in R.
The
data.table
package provides very fast tools for data loading, munging, and joining. The data.table
object is a derivative object of the data.frame
package, and many of the functions in R that input data.frame
objects can also import data.table
objects. It is for this reason that the data.table
object becomes your default container for rectangular data.
Comments
Post a Comment