Importing automobile fuel efficiency data into R

For the following recipes, you will need the R statistical programming language installed on your computer (either base R or RStudio; the authors strongly recommend the excellent and free RStudio) and the automobile fuel efficiency dataset. This quick recipe will help you ensure that you have everything you need to complete this analysis project.

Getting ready

You will need an Internet connection to complete this recipe, and we assume that you have installed RStudio for your particular platform, based on the instructions in the previous chapter.

How to do it...

If you are using RStudio, the following three steps will get you ready to roll:

Launch RStudio on your computer.

At the R console prompt, install the three R packages needed for this project:

install.packages("plyr")
install.packages("ggplot2")
install.packages("reshape2")

Load the R packages, as follows:

library(plyr)
library(ggplot2)
library(reshape2)
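
If some of these packages are already installed, there is no need to install them again. The following is a minimal sketch, not part of the original recipe, that works out which of the three packages are missing and installs only those. The helper name missing_packages is made up for illustration, and the install step is guarded so that it only runs in an interactive session.

```r
# Packages used in this project (taken from the steps above).
pkgs <- c("plyr", "ggplot2", "reshape2")

# Hypothetical helper: return the packages from `pkgs` whose namespace
# cannot be loaded, which in practice means they are not installed.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(pkgs)

# Install only what is missing, and only when run interactively.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

After this runs, the library() calls shown above can be used as before.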

Once you have downloaded and installed everything in the previous recipe, you can import the dataset into R to start doing some preliminary analysis and get a sense of what the data looks like.

Getting ready

Much of the analysis in this chapter is cumulative, and the efforts of the previous recipes will be used for subsequent recipes. Thus, if you completed the previous recipe, you should have everything you need to continue.

How to do it...

The following steps will walk you through the initial import of the data into the R environment:

First, set the working directory to the location where we saved the vehicles.csv.zip file:
```
setwd("path")
```

Tip

Replace "path" with the actual directory.

We can load the data directly from compressed (ZIP) files, as long as we know the name of the file inside the ZIP archive that we want to load:
```
vehicles <- read.csv(unz("vehicles.csv.zip", "vehicles.csv"), stringsAsFactors = F)
```
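
If the name of the file inside the archive is not known, it can be looked up before calling unz. The following is a small sketch under that assumption; the helper zip_contents is a hypothetical name, and it returns an empty character vector when the archive is not found in the working directory.

```r
# Hypothetical helper: list the names of the files stored in a ZIP archive.
# Returns character(0) if the archive does not exist at the given path.
zip_contents <- function(zip_path) {
  if (!file.exists(zip_path)) {
    return(character(0))
  }
  # With list = TRUE, unzip() only reports the contents; nothing is extracted.
  unzip(zip_path, list = TRUE)$Name
}

# Names of the files inside the archive used in this recipe.
inner_files <- zip_contents("vehicles.csv.zip")
inner_files
```

Any name reported here is what goes in the second argument of unz.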
To see whether this worked, let's display the first few rows of data using the head command:
```
head(vehicles)
```
You should see the first few rows of the dataset printed on your screen.

Tip

Note that we could have used the tail command, which would have displayed the last few rows of the data frame instead of the first few rows.
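
Both head and tail accept a second argument giving the number of rows to show. The sketch below uses R's built-in mtcars data frame as a stand-in, since it is always available; the same calls should apply to the vehicles data frame once it is loaded.

```r
# mtcars ships with R, so this runs without the vehicles file.
first_rows <- head(mtcars, 3)  # first three rows
last_rows  <- tail(mtcars, 3)  # last three rows

# dim() reports the number of rows and columns of the full data frame.
size <- dim(mtcars)

first_rows
last_rows
size
```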

The variable labels for the vehicles.csv file are stored in a separate file, varlabels.txt. Note that we call the resulting object labels even though labels is also the name of a function in R. A quick look at the file shows that the variable names and their explanations are separated by -. So, we will try to read the file using - as the separator:
```
labels <- read.table("varlabels.txt", sep = "-", header = FALSE)
## Error: line 11 did not have 2 elements
```
This doesn't work! A closer look at the error shows that line 11 of the data file contains two - symbols, so it is broken into three parts rather than two, unlike the other rows. We need to change our file-reading approach so that it ignores hyphenated words:
```
labels <- do.call(rbind, strsplit(readLines("varlabels.txt"), " - "))
```
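
One way to find lines like line 11 before settling on a separator is to count the fields each line would produce. The following sketch uses count.fields on a few in-memory sample lines rather than the real file; the third sample line is made up to mimic a description containing a hyphenated word.

```r
# Sample lines in the "name - description" layout. The first two appear in
# the labels shown in this recipe; the third is invented for illustration.
sample_lines <- c(
  "atvtype - type of alternative fuel or advanced technology vehicle",
  "charge240 - time to charge an electric vehicle in hours at 240 V",
  "example - a description that mentions a plug-in vehicle"
)

# Count the fields each line yields when "-" alone is the separator.
con <- textConnection(sample_lines)
n_fields <- count.fields(con, sep = "-")
close(con)

# Lines that do not split into exactly two fields are the problem lines.
bad_lines <- which(n_fields != 2)
bad_lines
```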

To check whether it works, we use the head function again:

head(labels)

     [,1]         [,2]                                                       
[1,] "atvtype"    "type of alternative fuel or advanced technology vehicle"
[2,] "barrels08"  "annual petroleum consumption in barrels for fuelType1 (1)"
[3,] "barrelsA08" "annual petroleum consumption in barrels for fuelType2 (1)"
[4,] "charge120"  "time to charge an electric vehicle in hours at 120 V"
[5,] "charge240"  "time to charge an electric vehicle in hours at 240 V"

How it works...

Let's break down the last complex statement in step 5, piece by piece, starting from the innermost portion and working outward.

First, let's read the file line by line:

x <- readLines("varlabels.txt")

Each line needs to be split at the string " - ", that is, a hyphen with a space on either side. The spaces are important, so that we don't split hyphenated words (such as in line 11). This results in each line being split into two parts as a vector of strings, with the vectors stored in a single list:

y <- strsplit(x, " - ")

Now, we stack these vectors together to make a matrix of strings, where the first column is the variable name and the second column is the description of the variable:

labels <- do.call(rbind, y)
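
The three pieces can be tried out without the file by starting from a character vector. This is a sketch on sample lines (the third one is invented and contains a hyphenated word); it adds fixed = TRUE so that the separator is matched literally, and the column names are an illustrative addition.

```r
# In-memory stand-in for readLines("varlabels.txt").
x <- c(
  "atvtype - type of alternative fuel or advanced technology vehicle",
  "charge240 - time to charge an electric vehicle in hours at 240 V",
  "example - a description that mentions a plug-in vehicle"
)

# Split each line at space-hyphen-space; hyphenated words stay intact.
y <- strsplit(x, " - ", fixed = TRUE)

# Every line should have produced exactly two parts.
stopifnot(all(lengths(y) == 2))

# Stack the two-element vectors row by row into a character matrix.
labels_demo <- do.call(rbind, y)
colnames(labels_demo) <- c("variable", "description")
labels_demo
```

Checking that every line splits into exactly two parts before calling rbind guards against a silently misshapen matrix.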

There's more...

Astute readers might have noticed that the read.csv function call included stringsAsFactors = F as its final parameter (F is shorthand for FALSE). In versions of R before 4.0.0, read.csv converts strings to a datatype known as factors by default; from R 4.0.0 onward, the default is stringsAsFactors = FALSE, so the argument is redundant but harmless. Factor is the name of R's categorical datatype, which can be thought of as a label or tag applied to the data. Internally, R stores factors as integers with a mapping to the appropriate label. This technique allowed older versions of R to store factors in much less memory than the corresponding character vectors.
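
The integer-plus-label storage can be seen directly. The following is a small illustrative sketch with made-up values; it is not taken from the vehicles dataset.

```r
# A character vector with a repeated value.
colors <- c("green", "red", "yellow", "blue", "red")
colors_factor <- factor(colors)

# The labels, sorted alphabetically by default.
lvls <- levels(colors_factor)

# The integer codes: each one is a position in `lvls`.
codes <- as.integer(colors_factor)

lvls
codes

# Indexing the levels by the codes recovers the original strings.
recovered <- lvls[codes]
```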

Categorical variables do not have a sense of order (where one value is considered greater than another). In the following snippet, we create a quick toy example that converts four character values to a factor and then compares two of them:

colors <- c('green', 'red', 'yellow', 'blue')
colors_factors <- factor(colors)
colors_factors
[1] green  red    yellow blue
Levels: blue green red yellow
colors_factors[1] > colors_factors[2]
[1] NA
Warning message:
In Ops.factor(colors_factors[1], colors_factors[2]) :
  ‘>’ not meaningful for factors

However, there is also an ordered categorical variable, known in the statistical world as ordinal data. Ordinal data is just like categorical data, with one exception: the values have a sense of scale or rank. One value can be said to be larger than another, but the magnitude of the difference cannot be measured.
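
In R, ordinal data can be represented with an ordered factor. The sketch below uses made-up rating values; the level order is stated explicitly, because otherwise R would sort the levels alphabetically.

```r
# Made-up ordinal values with an explicit low-to-high level order.
ratings <- factor(
  c("low", "high", "medium"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

# Comparisons are now meaningful and follow the level order.
is_smaller <- ratings[1] < ratings[2]  # is "low" below "high"?
largest <- as.character(max(ratings))

is_smaller
largest
```

Unlike the unordered example above, the comparison returns a logical value without a warning.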

Further, when importing data into R, we often run into the situation where a column of numeric data might contain an entry that is non-numeric. In this case, R might import the column of data as factors, which is often not what was intended by the data scientist. Converting from factor to character is relatively routine, but converting from factor to numeric can be a bit tricky.
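
The tricky part is that calling as.numeric directly on a factor returns the internal integer codes rather than the values. A common idiom, sketched here on made-up data with one non-numeric entry, is to convert to character first.

```r
# A column that was meant to be numeric but contains a stray "n/a".
raw <- c("10", "25", "n/a", "40")
f <- factor(raw)

# Pitfall: this yields the internal level codes, not 10, 25, 40.
codes <- as.numeric(f)

# Converting to character first recovers the values; the non-numeric
# entry becomes NA (the coercion warning is suppressed here).
values <- suppressWarnings(as.numeric(as.character(f)))

codes
values
```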

There's more...

R is capable of importing data from a wide range of formats. In this recipe, we handled a CSV file, but we could have used a Microsoft Excel file as well. CSV files are preferred as they are universally supported across operating systems and far more portable. Additionally, R can import data from numerous popular statistical programs, including SPSS, Stata, and SAS.
