Exploring and describing fuel efficiency data
Now that we have imported the automobile fuel efficiency dataset into R and learned a little about the nuances of importing, the next step is to do some preliminary analysis of the dataset. The purpose of this analysis is to explore what the data looks like and get your feet wet with some of R's most basic commands.
If you completed the previous recipe, you should have everything you need to continue.
The following steps will lead you through the initial exploration of our dataset, where we compute some basic parameters about the dataset:
- First, let's find out how many observations (rows) are in our data:
nrow(vehicles) ## 34287
- Next, let's find out how many variables (columns) are in our data:
ncol(vehicles) ## 74
- Now, let's get a sense of which columns of data are present in the data frame using the names function:
> names(vehicles)
The preceding command lists the name of each of the 74 columns. Luckily, a lot of these column or variable names are pretty descriptive and give us an idea of what they might contain. Remember, a more detailed description of the variables is available at http://www.fueleconomy.gov/feg/ws/index.shtml#vehicle.
- Let's find out how many unique years of data are included in this dataset by computing a vector of the unique values in the year column, and then computing the length of the vector:
length(unique(vehicles[, "year"])) ## 31
- Now, we determine the first and last years present in the dataset using the min and max functions:
first_year <- min(vehicles[, "year"]) ## 1984
last_year <- max(vehicles[, "year"]) ## 2014
- Also, since we might use the year variable a lot, let's make sure that we have each year covered. The list of years from 1984 to 2014 should contain 31 unique values. To test this, use the following command:
> length(unique(vehicles$year))
[1] 31
- Next, let's find out what types of fuel are used as the automobiles' primary fuel types:
table(vehicles$fuelType1)
##            Diesel       Electricity Midgrade Gasoline       Natural Gas
##              1025                56                41                57
##  Premium Gasoline  Regular Gasoline
##              8521             24587
From this, we can see that most cars in the dataset use regular gasoline, and the second most common fuel type is premium gasoline.
- Let's explore the types of transmissions used by these automobiles. We first need to take care of all missing data by setting it to NA:
vehicles$trany[vehicles$trany == ""] <- NA
- Now, the trany column is text, and we only care whether the car's transmission is automatic or manual. Thus, we use the substr function to extract the first four characters of each trany column value and determine whether it is equal to Auto. If so, we set a new variable, trany2, equal to Auto; otherwise, the value is set to Manual:
vehicles$trany2 <- ifelse(substr(vehicles$trany, 1, 4) == "Auto", "Auto", "Manual")
- Finally, we convert the new variable to a factor and then use the table function to see the distribution of values:
vehicles$trany2 <- as.factor(vehicles$trany2)
table(vehicles$trany2)
##   Auto Manual
##  22451  11825
We can see that there are roughly twice as many automobile models with automatic transmission as there are models with manual transmission.
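The transmission steps above can be tried on a small, made-up data frame. The following is a minimal sketch: the toy data frame and its values are hypothetical and are not taken from the vehicles dataset. It also illustrates that rows where trany is NA stay NA in trany2 instead of being counted as Manual, so the counts from table can add up to fewer rows than nrow reports.

```r
# Hypothetical toy data; the values are illustrative only.
toy <- data.frame(
  trany = c("Automatic 4-spd", "Manual 5-spd", "", "Auto(AV)", "Manual 6-spd"),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing values.
toy$trany[toy$trany == ""] <- NA

# Recode to two levels based on the first four characters.
toy$trany2 <- ifelse(substr(toy$trany, 1, 4) == "Auto", "Auto", "Manual")
toy$trany2 <- as.factor(toy$trany2)

# table() drops NA by default; useNA = "ifany" makes the missing rows visible.
counts <- table(toy$trany2)
counts_with_na <- table(toy$trany2, useNA = "ifany")
print(counts)
print(counts_with_na)
```

Passing useNA = "ifany" is optional, but it is a cheap way to confirm how many rows had no transmission recorded.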
The data frame is an incredibly powerful datatype used by R, and we will leverage it heavily throughout this recipe. The data frame allows us to group variables of different datatypes (numeric, strings, logical, factors, and so on) into rows of related information. One example is a data frame of customer information. Each row in the data frame can contain the name of the person (a string), along with an age (numeric), a gender (a factor), and a flag to indicate whether they are a current customer (logical). If you are familiar with relational databases, this is much like a table in a database.
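As a rough illustration of the customer example, the sketch below builds such a data frame by hand. The column names (name, age, gender, is_current) and all values are hypothetical and chosen only for this example.

```r
# Hypothetical customer records; names and values are made up.
customers <- data.frame(
  name = c("Ann", "Bob", "Cara"),
  age = c(34, 51, 28),
  gender = factor(c("F", "M", "F")),
  is_current = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each column keeps its own type, while each row describes one customer.
str(customers)
column_classes <- sapply(customers, class)
print(column_classes)
```

Setting stringsAsFactors = FALSE keeps the name column as character regardless of the R version in use.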
Further, in this recipe, we looked at several ways of getting a quick read on a dataset imported into R. Most notably, we used the powerful table function to create a count of the occurrence of values for the fuelType1 variable. This function is capable of much more, including cross tabulations, as follows:
with(vehicles, table(sCharger, year))
The preceding command returns a table of counts with one row per sCharger value and one column per year.
Here, we looked at the number of automobile models by year, with and without a supercharger (and we saw that superchargers have seemingly become more popular in recent years than they were in the past).
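To see the shape of such a cross tabulation without the full dataset, here is a small sketch on made-up data. The toy data frame is hypothetical; only the with(..., table(...)) pattern comes from the recipe, and the prop.table step is an optional addition for reading the counts as per-year shares.

```r
# Hypothetical toy data; not taken from the vehicles dataset.
toy <- data.frame(
  sCharger = c("", "S", "", "", "S", "S"),
  year = c(1984, 1984, 1984, 2014, 2014, 2014),
  stringsAsFactors = FALSE
)

# Rows are sCharger values, columns are years.
xtab <- with(toy, table(sCharger, year))
print(xtab)

# Column proportions: the share of models with a supercharger in each year.
share <- prop.table(xtab, margin = 2)
print(round(share, 2))
```

Comparing proportions rather than raw counts helps when the number of models differs from year to year.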
To provide a cautionary tale about data import, let's look at the sCharger and tCharger columns more closely. Note that these columns indicate whether the car contains a supercharger or a turbocharger, respectively.
Starting with sCharger, we look at the class of the variable and the unique values present in the data frame:
> class(vehicles$sCharger)
[1] "character"
> unique(vehicles$sCharger)
[1] "" "S"
We next look at tCharger, expecting things to be the same:
> class(vehicles$tCharger)
[1] "logical"
> unique(vehicles$tCharger)
[1] NA TRUE
However, what we find is that these two seemingly similar variables are completely different datatypes. The tCharger variable is a logical variable, also known as a Boolean variable in other languages, and is used to represent the binary values of true and false, whereas the sCharger variable appears to be the more general character datatype. Something seems wrong. In this case, because we can, let's check the original data. Luckily, the data is in a .csv file, and we can use a simple text editor to open and read the file. (Notepad on Windows and vi on Unix systems are recommended for the task, but feel free to use your favorite basic text editor.) When we open the file, we can see that the sCharger and tCharger data columns are either blank or contain an S or a T, respectively.
Thus, R has read in the T character in the tCharger column as the logical value TRUE, as opposed to the character T. This isn't a fatal flaw and might not impact an analysis. However, undetected bugs such as this can cause problems far down the analytical pipeline and necessitate significant repeated work.
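The behavior can be reproduced on a tiny in-memory file. The sketch below is an illustration under stated assumptions: the three-row CSV text is hypothetical and only mimics the blank, S, and T pattern described above. It also shows one possible way to avoid the surprise, namely declaring the column type with the colClasses argument of read.csv; this is not necessarily how the dataset was imported in the previous recipe.

```r
# Hypothetical three-row CSV that mimics the blank/"S"/"T" pattern described above.
csv_text <- "id,sCharger,tCharger
1,S,T
2,,
3,,T"

# Default type guessing: a column holding only "T" and blanks becomes logical.
guessed <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(sapply(guessed, class))

# Declaring the column type up front keeps the original characters.
explicit <- read.csv(
  text = csv_text,
  stringsAsFactors = FALSE,
  colClasses = c(tCharger = "character")
)
print(unique(explicit$tCharger))
```

Checking class and unique on each column right after import, as done in this recipe, remains a simple way to catch this kind of silent conversion.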