Exploring and describing fuel efficiency data

Now that we have imported the automobile fuel efficiency dataset into R and learned a little about the nuances of importing, the next step is to do some preliminary analysis of the dataset. The purpose of this analysis is to explore what the data looks like and get your feet wet with some of R's most basic commands.

Getting ready

If you completed the previous recipe, you should have everything you need to continue.

How to do it...

The following steps will lead you through the initial exploration of our dataset, where we compute some basic parameters about the dataset:
  1. First, let's find out how many observations (rows) are in our data:
    nrow(vehicles)
    ## 34287
    
  2. Next, let's find out how many variables (columns) are in our data:
    ncol(vehicles)
    ## 74
    
  3. Now, let's get a sense of which columns of data are present in the data frame using the names function:
    > names(vehicles)
    
    The preceding command prints the name of every column in the data frame.
    Luckily, a lot of these column or variable names are pretty descriptive and give us an idea of what they might contain. Remember, a more detailed description of the variables is available at http://www.fueleconomy.gov/feg/ws/index.shtml#vehicle.
  4. Let's find out how many unique years of data are included in this dataset by computing a vector of the unique values in the year column, and then computing the length of the vector:
    length(unique(vehicles[, "year"]))
    ## 31
    
  5. Now, we determine the first and last years present in the dataset using the min and max functions:
    first_year <- min(vehicles[, "year"])
    ## 1984
    last_year <- max(vehicles[, "year"])
    ## 2014
    

    Note

    Note that depending on when you downloaded the dataset, the value of last_year may be greater than 2014.
  6. Also, since we might use the year variable a lot, let's make sure that we have each year covered. The list of years from 1984 to 2014 should contain 31 unique values. To test this, use the following command:
    > length(unique(vehicles$year))
    [1] 31
    
  7. Next, let's find out what types of fuel are used as the automobiles' primary fuel types:
    table(vehicles$fuelType1)
    ##            Diesel       Electricity Midgrade Gasoline       Natural Gas
    ##              1025                56                41                57
    ##  Premium Gasoline  Regular Gasoline
    ##              8521             24587
    
    From this, we can see that most cars in the dataset use regular gasoline, and the second most common fuel type is premium gasoline.
  8. Let's explore the types of transmissions used by these automobiles. We first need to take care of all missing data by setting it to NA:
    vehicles$trany[vehicles$trany == ""] <- NA
    
  9. Now, the trany column is text, and we only care whether the car's transmission is automatic or manual. Thus, we use the substr function to extract the first four characters of each trany column value and determine whether it is equal to Auto. If so, we set a new variable, trany2, equal to Auto; otherwise, the value is set to Manual:
    vehicles$trany2 <- ifelse(substr(vehicles$trany, 1, 4) == "Auto", "Auto", "Manual")
    
  10. Finally, we convert the new variable to a factor and then use the table function to see the distribution of values:
    vehicles$trany2 <- as.factor(vehicles$trany2)
    table(vehicles$trany2)
    ##   Auto Manual
    ##  22451  11825
    
    We can see that there are roughly twice as many automobile models with automatic transmission as there are models with manual transmission.
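
The year check and the transmission recoding from the steps above can be tried on a small stand-in data frame. This is only a sketch: the toy data frame and its values are made up for illustration and are not taken from the real dataset. It uses setdiff to list any years missing between the first and last year, which is a more direct coverage check than counting unique values, and it shows what happens to rows whose transmission is missing.

```r
# Hypothetical stand-in for the vehicles data frame; all values are made up
toy <- data.frame(
  year  = c(1984, 1985, 1985, 1987, 1987),
  trany = c("Automatic 4-spd", "Manual 5-spd", "",
            "Automatic 3-spd", "Manual 4-spd"),
  stringsAsFactors = FALSE
)

# Year coverage: list the years between the first and last that never appear
first_year <- min(toy$year)
last_year  <- max(toy$year)
missing_years <- setdiff(first_year:last_year, unique(toy$year))
print(missing_years)

# Transmission recoding, following steps 8 to 10
toy$trany[toy$trany == ""] <- NA
toy$trany2 <- ifelse(substr(toy$trany, 1, 4) == "Auto", "Auto", "Manual")
toy$trany2 <- as.factor(toy$trany2)

# useNA = "ifany" adds a column for missing values when there are any
print(table(toy$trany2, useNA = "ifany"))
```

In this sketch, the row with a missing transmission stays NA rather than being labelled Manual, because ifelse returns NA when its test is NA. That would be consistent with the counts in step 10 (22451 + 11825 = 34276) adding up to fewer than the 34287 rows reported in step 1.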

How it works...

The data frame is an incredibly powerful datatype used by R, and we will leverage it heavily throughout this recipe. The data frame allows us to group variables of different datatypes (numeric, strings, logical, factors, and so on) into rows of related information. One example would be a data frame of customer information. Each row in the data frame can contain the name of the person (a string), along with an age (numeric), a gender (a factor), and a flag to indicate whether they are a current customer (Boolean). If you are familiar with relational databases, this is much like a table in a database.
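As a minimal sketch of the customer example, the following builds such a data frame by hand. The column names and values here are hypothetical and chosen only to show one column of each datatype.

```r
# Hypothetical customer data; names and values are made up for illustration
customers <- data.frame(
  name    = c("Ana", "Ben", "Chloe"),    # character
  age     = c(34, 51, 27),               # numeric
  gender  = factor(c("F", "M", "F")),    # factor
  current = c(TRUE, FALSE, TRUE),        # logical (Boolean)
  stringsAsFactors = FALSE               # keep name as character in older R versions
)

# One class per column, different datatypes side by side
column_classes <- sapply(customers, class)
print(column_classes)

# Each row groups the related values for one customer
print(customers[2, ])
```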
Further, in this recipe, we looked at several ways of getting a quick read on a dataset imported into R. Most notably, we used the powerful table function to create a count of the occurrence of values for the fuelType1 variable. This function is capable of much more, including cross tabulations, as follows:
with(vehicles, table(sCharger, year))
The preceding command produces a table of counts with one row per sCharger value and one column per year.
Here, we looked at the number of automobile models by year, with and without a super charger, and we saw that super chargers seem to have become more popular in recent years than they were in the past.
Also, note that we use the with command. This command tells R to use vehicles as the default data when performing the subsequent command, in this case, table. Thus, we can omit prefacing the sCharger and year column names with the name of the data frame, vehicles, followed by the dollar sign.
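To make the role of with concrete, here is a small sketch on made-up data. The toy data frame below is hypothetical; it only mirrors the sCharger and year column names. It builds the same cross tabulation twice, once using with and once using the dollar sign, and compares the counts.

```r
# Hypothetical stand-in for vehicles; values are made up
toy <- data.frame(
  sCharger = c("", "S", "", "S", ""),
  year     = c(1984, 1984, 2014, 2014, 2014),
  stringsAsFactors = FALSE
)

# Cross tabulation using with: column names are looked up inside toy
tab_with <- with(toy, table(sCharger, year))

# The same cross tabulation written out with the dollar sign
tab_dollar <- table(sCharger = toy$sCharger, year = toy$year)

print(tab_with)

# Both approaches give the same counts in every cell
print(all(tab_with == tab_dollar))
```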

There's more...

To provide a cautionary tale about data import, let's look at the sCharger and tCharger columns more closely. Note that these columns indicate whether the car contains a super charger or a turbo charger, respectively.
Starting with sCharger, we look at the class of the variable and the unique values present in the data frame:
> class(vehicles$sCharger)
[1] "character"
> unique(vehicles$sCharger)
[1] ""  "S"
We next look at tCharger, expecting things to be the same:
> class(vehicles$tCharger)
[1] "logical"
> unique(vehicles$tCharger)
[1]   NA TRUE
However, what we find is that these two seemingly similar variables are completely different datatypes. The tCharger variable is a logical variable (known as a Boolean variable in other languages), used to represent the binary values true and false, while the sCharger variable appears to be the more general character datatype. Something seems wrong. In this case, because we can, let's check the original data. Luckily, the data is in a .csv file, and we can use a simple text editor to open and read the file. (Notepad on Windows and vi on Unix systems are recommended for the task, but feel free to use your favorite basic text editor.) When we open the file, we can see that the sCharger and tCharger data columns are either blank or contain an S or T, respectively.
Thus, R has read the T character in the tCharger column as the Boolean value TRUE rather than as the character T. This isn't a fatal flaw and might not impact an analysis. However, undetected bugs such as this can cause problems far down the analytical pipeline and necessitate significant repeated work.
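The behavior can be reproduced on a few lines of inline CSV text. This is a sketch under stated assumptions: the three-row CSV below is made up and only mimics the two columns, and passing colClasses to read.csv is one possible way to avoid the conversion, not necessarily how the original import was done.

```r
# Hypothetical three-row CSV that mimics the sCharger and tCharger columns
csv_text <- "id,sCharger,tCharger\n1,,T\n2,S,\n3,,T\n"

# Default import: a column holding only T and blanks is converted to logical
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(class(raw$sCharger))
print(class(raw$tCharger))

# Forcing both columns to character keeps T as text and blanks as ""
fixed <- read.csv(
  text = csv_text,
  stringsAsFactors = FALSE,
  colClasses = c(sCharger = "character", tCharger = "character")
)
print(unique(fixed$tCharger))

# An explicit flag can then be derived deliberately rather than by accident
fixed$hasTurbo <- fixed$tCharger == "T"
```

Declaring the column types at import time makes the two columns behave the same way, so later code does not need to know which one R happened to guess as logical.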

