Analysing automobile fuel efficiency over time

We have now successfully imported the data and looked at some important high-level statistics that provided us with a basic understanding of what values are in the dataset and how frequently some features appear. With this recipe, we continue the exploration by looking at some of the fuel efficiency metrics over time and in relation to other data points.

Getting ready

If you completed the previous recipe, you should have everything you need to continue.

How to do it...

The following steps will use both plyr and the graphing library, ggplot2, to explore the dataset:
  1. Let's start by looking at whether there is an overall trend in how MPG changes over time on average. To do this, we use the ddply function from the plyr package to take the vehicles data frame, aggregate rows by year, and then, for each group, compute the mean highway, city, and combined fuel efficiency. The result is then assigned to a new data frame, mpgByYr. Note that this is our first example of split-apply-combine. We split the data frame into groups by year, we apply the mean function to specific variables, and then we combine the results into a new data frame:
    mpgByYr <- ddply(vehicles, ~year, summarise, avgMPG = mean(comb08), avgHghy = mean(highway08), avgCity = mean(city08))
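
The three phases that ddply performs in one call can be spelled out by hand in base R. The following is only a sketch on a made-up toy data frame (toy and toyByYr are hypothetical names, not part of the recipe), and it is not how plyr is implemented internally:

```r
# Toy stand-in for the vehicles data frame (values are made up for illustration).
toy <- data.frame(
  year   = c(1984, 1984, 1985, 1985, 1985),
  comb08 = c(20, 24, 18, 20, 25)
)

# Split: one vector of comb08 values per year.
groups <- split(toy$comb08, toy$year)

# Apply: compute the mean of each group.
means <- sapply(groups, mean)

# Combine: assemble the per-group results into a new data frame.
toyByYr <- data.frame(year   = as.numeric(names(means)),
                      avgMPG = unname(means))
toyByYr
```

The result has one row per year, which is the same shape that ddply(..., summarise, ...) returns for mpgByYr.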
    
  2. To gain a better understanding of this new data frame, we pass it to the ggplot function, telling it to plot the avgMPG variable against the year variable, using points. In addition, we specify that we want axis labels, a title, and even a smoothed conditional mean (geom_smooth()) represented as a shaded region of the plot:
    ggplot(mpgByYr, aes(year, avgMPG)) + geom_point() + geom_smooth() + xlab("Year") + ylab("Average MPG") + ggtitle("All cars")
    ## geom_smooth: method="auto" and size of largest group is <1000, so using
    ## loess. Use 'method = x' to change the smoothing method.
    
    The preceding commands will give you the following plot:
  3. Based on this visualization, one might conclude that there has been a tremendous increase in the fuel economy of cars sold in the last few years. However, this can be a little misleading as there have been more hybrid and non-gasoline vehicles in the later years, which is shown as follows:
    table(vehicles$fuelType1)
    ##            Diesel       Electricity Midgrade Gasoline       Natural Gas
    ##              1025                56                41                57
    ##  Premium Gasoline  Regular Gasoline
    ##              8521             24587
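
To put these counts in perspective, they can be converted into shares. This sketch simply re-enters the numbers printed above into a named vector (fuelCounts, fuelShare, and gasShare are illustrative names), so it runs without the vehicles data frame:

```r
# Counts copied from the table(vehicles$fuelType1) output above.
fuelCounts <- c("Diesel" = 1025, "Electricity" = 56, "Midgrade Gasoline" = 41,
                "Natural Gas" = 57, "Premium Gasoline" = 8521,
                "Regular Gasoline" = 24587)

# Share of each primary fuel type among all vehicles.
fuelShare <- fuelCounts / sum(fuelCounts)

# Combined share of the three gasoline grades.
gasShare <- sum(fuelShare[grepl("Gasoline", names(fuelShare))])

round(100 * fuelShare, 2)
round(100 * gasShare, 1)
```

On these counts, a gasoline grade is the primary fuel for roughly 97 percent of the vehicles. Note that fuelType1 records only the primary fuel, so this figure by itself says nothing about hybrids.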
    
  4. Let's look at just gasoline cars, even though there are not many non-gasoline powered cars, and redraw the preceding plot. To do this, we use the subset function to create a new data frame, gasCars, which only contains the rows of vehicles in which the fuelType1 variable is one among a subset of values:
    gasCars <- subset(vehicles, fuelType1 %in% c("Regular Gasoline", "Premium Gasoline", "Midgrade Gasoline") & fuelType2 == "" & atvType != "Hybrid")
    mpgByYr_Gas <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08))
    ggplot(mpgByYr_Gas, aes(year, avgMPG)) + geom_point() + geom_smooth() + xlab("Year") + ylab("Average MPG") + ggtitle("Gasoline cars")
    ## geom_smooth: method="auto" and size of largest group is <1000, so using
    ## loess. Use 'method = x' to change the smoothing method.
    
    The preceding commands will give you the following plot:
  5. Have fewer large-engine cars been made recently? If so, this could explain the increase. First, let's verify whether cars with larger engines have worse fuel efficiency. We note that the displ variable, which represents the displacement of the engine in liters, is currently a string variable that we need to convert to a numeric variable:
    typeof(gasCars$displ)
    ## [1] "character"
    gasCars$displ <- as.numeric(gasCars$displ)
    ggplot(gasCars, aes(displ, comb08)) + geom_point() + geom_smooth()
    
    ## geom_smooth: method="auto" and size of largest group is >=1000, so using
    ## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
    ## smoothing method.
    ## Warning: Removed 2 rows containing missing values (stat_smooth).
    ## Warning: Removed 2 rows containing missing values (geom_point).
    
    The preceding commands will give you the following plot:
    This scatter plot offers convincing evidence that there is a negative correlation between engine displacement and fuel efficiency; that is, cars with smaller engines tend to be more fuel-efficient.
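
The visual impression can be backed by a number. The sketch below uses a made-up toy data frame (toyCars is a hypothetical name, and the values are not from the dataset) to show the character-to-numeric conversion and a correlation that skips missing values:

```r
# Toy stand-in for gasCars; the values are made up, and displ is stored as
# character to mirror the situation in this step.
toyCars <- data.frame(
  displ  = c("1.6", "2.0", "2.5", "3.5", "5.0", ""),
  comb08 = c(33, 29, 26, 21, 16, 24),
  stringsAsFactors = FALSE
)

# An empty string cannot be parsed as a number, so it becomes NA.
toyCars$displ <- as.numeric(toyCars$displ)

# Pearson correlation, using only rows where both values are present.
r <- cor(toyCars$displ, toyCars$comb08, use = "complete.obs")
r
```

A value of r close to -1 indicates a strong negative linear relationship. The same call on gasCars$displ and gasCars$comb08 should work the same way, although the relationship in the real data looks curved, so a single correlation coefficient is only a rough summary.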
  6. Now, let's see whether more small cars were made in later years, which could explain the drastic increase in fuel efficiency:
    avgCarSize <- ddply(gasCars, ~year, summarise, avgDispl = mean(displ))
    ggplot(avgCarSize, aes(year, avgDispl)) + geom_point() + geom_smooth() + xlab("Year") + ylab("Average engine displacement (l)")
    
    ## geom_smooth: method="auto" and size of largest group is <1000, so using
    ## loess. Use 'method = x' to change the smoothing method.
    ## Warning: Removed 1 rows containing missing values (stat_smooth).
    ## Warning: Removed 1 rows containing missing values (geom_point).
    
    The preceding commands will give you the following plot:
  7. From the preceding figure, the average engine displacement has decreased substantially since 2008. To get a better sense of the impact this might have had on fuel efficiency, we can put both MPG and displacement by year on the same graph. Using ddply, we create a new data frame, byYear, which contains both the average fuel efficiency and the average engine displacement by year:
    byYear <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08), avgDispl = mean(displ))
    head(byYear)
      year   avgMPG avgDispl
    1 1984 19.12162 3.068449
    2 1985 19.39469       NA
    3 1986 19.32046 3.126514
    4 1987 19.16457 3.096474
    5 1988 19.36761 3.113558
    6 1989 19.14196 3.133393
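
The NA for 1985 in the preceding output is what mean returns when any of its inputs is missing, which is also why the plots drop one point. A small sketch with made-up values (displ1985 is a hypothetical vector) shows the behavior and the na.rm argument that changes it:

```r
# Made-up displacement values for one model year, with one missing entry.
displ1985 <- c(2.0, 3.8, NA, 5.0)

mean(displ1985)                # NA: a single missing value makes the mean NA
mean(displ1985, na.rm = TRUE)  # mean of the three observed values
```

If you would rather keep 1985 in the plots, passing na.rm = TRUE to mean inside the ddply call would be one way to do so; the output shown in this recipe was produced without it.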
    
  8. The head function shows us that the resulting data frame has three columns: year, avgMPG, and avgDispl. To use the faceting capability of ggplot2 to display Average MPG and Avg engine displacement by year on separate but aligned plots, we must melt the data frame, converting it from what is known as a wide format to a long format:
    byYear2 <- melt(byYear, id = "year")
    levels(byYear2$variable) <- c("Average MPG", "Avg engine displacement")
    
    head(byYear2)
      year    variable    value
    1 1984 Average MPG 19.12162
    2 1985 Average MPG 19.39469
    3 1986 Average MPG 19.32046
    4 1987 Average MPG 19.16457
    5 1988 Average MPG 19.36761
    6 1989 Average MPG 19.14196
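
To see what melting does, the same reshaping can be written out by hand in base R. This is only a sketch of the idea, not how melt is implemented; it uses three rows copied from the head(byYear) output shown earlier, and wide and long are illustrative names:

```r
# Three rows of byYear, using values from the head(byYear) output above.
wide <- data.frame(
  year     = c(1984, 1986, 1987),
  avgMPG   = c(19.12162, 19.32046, 19.16457),
  avgDispl = c(3.068449, 3.126514, 3.096474)
)

# Stack the two measurement columns on top of each other, recording in
# 'variable' which column each value came from.
long <- rbind(
  data.frame(year = wide$year, variable = "avgMPG",   value = wide$avgMPG),
  data.frame(year = wide$year, variable = "avgDispl", value = wide$avgDispl)
)
long$variable <- factor(long$variable, levels = c("avgMPG", "avgDispl"))

nrow(wide)  # 3
nrow(long)  # 6: one row per year and measured variable
```

The long table has twice as many rows as the wide one because each year now appears once per measured variable.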
    
    If we use the nrow function, we can see that the byYear2 data frame has 62 rows and the byYear data frame has only 31. The two separate columns from byYear (avgMPG and avgDispl) have now been melted into one new column (value) in the byYear2 data frame. Note that the variable column in the byYear2 data frame serves to identify the column that the value represents:
    ggplot(byYear2, aes(year, value)) + geom_point() + geom_smooth() + facet_wrap(~variable, ncol = 1, scales = "free_y") + xlab("Year") + ylab("")
    ## geom_smooth: method="auto" and size of largest group is <1000, so using
    ## loess. Use 'method = x' to change the smoothing method.
    ## geom_smooth: method="auto" and size of largest group is <1000, so using
    ## loess. Use 'method = x' to change the smoothing method.
    ## Warning: Removed 1 rows containing missing values (stat_smooth).
    ## Warning: Removed 1 rows containing missing values (geom_point).
    
    The preceding commands will give you the following plot:
    From this plot, we can see the following:
    • Engine sizes generally increased until 2008, with a sudden increase in large cars between 2006 and 2008.
    • Since 2009, there has been a decrease in the average car size, which partially explains the increase in fuel efficiency.
    • Until 2005, there was an increase in the average car size, but the fuel efficiency remained roughly constant. This seems to indicate that engine efficiency has increased over the years.
    • The years 2006–2008 are interesting. Though the average engine size increased quite suddenly, the MPG remained roughly the same as in previous years. This seeming discrepancy might require more investigation.
  9. Given the trend toward smaller displacement engines, let's see whether automatic or manual transmissions are more efficient for four-cylinder engines, and how the efficiencies have changed over time:
    gasCars4 <- subset(gasCars, cylinders == "4")
    
    ggplot(gasCars4, aes(factor(year), comb08)) + geom_boxplot() + facet_wrap(~trany2, ncol = 1) + theme(axis.text.x = element_text(angle = 45)) + labs(x = "Year", y = "MPG")
    
    The preceding command will give you the following plot:
    This time, ggplot2 was used to create box plots that help visualize the distribution of values (and not just a single value, such as a mean) for each year.
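
The numbers behind a single box can be inspected with the base R function boxplot.stats. The sketch below uses made-up MPG values for one hypothetical model year. ggplot2 computes its hinges as quartiles, which can differ slightly from the hinges used here, so treat this as an illustration of the idea rather than an exact reproduction of geom_boxplot:

```r
# Made-up combined MPG values for one model year.
mpg <- c(21, 23, 24, 25, 26, 27, 29, 31, 44)

# boxplot.stats() returns the numbers a base R box plot is drawn from.
bs <- boxplot.stats(mpg)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points beyond the whiskers, drawn individually
```

Here the value 44 lies more than 1.5 times the box height above the upper hinge, so it is reported as an outlier instead of stretching the whisker.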
  10. Next, let's look at the change in proportion of manual cars available each year:
    ggplot(gasCars4, aes(factor(year), fill = factor(trany2))) + geom_bar(position = "fill") + labs(x = "Year", y = "Proportion of cars", fill = "Transmission") + theme(axis.text.x = element_text(angle = 45)) + geom_hline(yintercept = 0.5, linetype = 2)
    
    The preceding command will give you the following plot:
In step 9, it appears that manual transmissions are more efficient than automatic transmissions, and both show the same increase, on average, since 2008. However, there is something odd here. There appear to be many very efficient cars (greater than 40 MPG) with automatic transmissions in later years, and almost no manual transmission cars with similar efficiencies in the same time frame. The pattern is reversed in earlier years. Is there a change in the proportion of manual cars available each year? Yes, as the plot from step 10 shows. What are these very efficient cars? In the next section, we look at the makes and models of the cars in the database.
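
The proportions that geom_bar(position = "fill") draws in step 10 can also be computed as plain numbers. This sketch uses a made-up toy data frame (the years and labels are hypothetical, not taken from the dataset) with a trany2-style column:

```r
# Made-up transmission labels by model year (not taken from the dataset).
toy <- data.frame(
  year   = c(1990, 1990, 1990, 1990, 2010, 2010, 2010, 2010),
  trany2 = c("Manual", "Manual", "Manual", "Auto",
             "Manual", "Auto", "Auto", "Auto")
)

# Count cars per year and transmission, then convert each row (year) to
# proportions that sum to 1.
counts <- table(toy$year, toy$trany2)
props  <- prop.table(counts, margin = 1)

props[, "Manual"]  # share of manual cars in each year
```

Applying the same two calls to gasCars4$year and gasCars4$trany2 should give the values behind the stacked bars, which makes it easier to read off the year in which manual cars drop below half.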

How it works...

With this recipe, we threw you into the deep end of data analysis with R, using two very important R packages, plyr and ggplot2. Just as traditional software development has design patterns for common constructs, a few such patterns are emerging in the field of data science. One of the most notable is the split-apply-combine pattern highlighted by Dr. Hadley Wickham. In this strategy, one breaks up the problem into smaller, more manageable pieces by grouping the data on some variable. You then perform an operation on each group, and finally combine the results into a new data structure. As you can see, we used split-apply-combine repeatedly in this recipe to examine the data from many different perspectives.
Beyond plyr, this recipe heavily leveraged the ggplot2 library, which deserves additional exposition. We will refrain from providing an extensive ggplot2 tutorial as there are a number of excellent tutorials available online. What is important is that you understand the important idea of how ggplot2 allows you to construct such complex statistical visualizations in such a terse fashion.
The ggplot2 library is an open source implementation for R of the foundational Grammar of Graphics by Wilkinson, Anand, and Grossman. The Grammar of Graphics attempts to decompose statistical data visualizations into component parts to better understand how such graphics are created. With ggplot2, Hadley Wickham takes these ideas and implements a layered approach, allowing the user to assemble complex visualizations from individual pieces very quickly. Take, for example, the first graph for this recipe, which shows the average fuel efficiency of all models of cars in a particular year over time:
    ggplot(mpgByYr, aes(year, avgMPG)) + geom_point() + geom_smooth() + xlab("Year") + ylab("Average MPG") + ggtitle("All cars")
To construct this plot, we first tell ggplot the data frame that will serve as the data for the plot (mpgByYr), and then the aesthetic mappings that tell ggplot2 which variables will be mapped to visual characteristics of the plot. In this case, aes(year, avgMPG) implicitly specifies that year will be mapped to the x axis and avgMPG will be mapped to the y axis. geom_point() tells the library to plot the specified data as points, and a second geom, geom_smooth(), adds a smoothed mean line with a shaded region showing its confidence interval (set to 0.95 by default) for the same data. Finally, the xlab(), ylab(), and ggtitle() functions are used to add labels to the plot. Thus, we can generate a complex, publication-quality graph in a single line of code; ggplot2 is capable of far more complex plots.
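
The layered approach becomes more visible when the plot is built one piece at a time. The following sketch assumes ggplot2 is installed and uses made-up yearly averages (toyByYr is a hypothetical stand-in for mpgByYr). The smoothing method and level are written out explicitly here, whereas the recipe relies on the defaults:

```r
library(ggplot2)

# Made-up yearly averages standing in for mpgByYr.
toyByYr <- data.frame(
  year   = 2000:2011,
  avgMPG = c(19.6, 19.7, 19.5, 19.8, 19.7, 19.9,
             19.8, 20.0, 20.4, 21.0, 21.6, 22.1)
)

# Data and aesthetic mappings only; no layers yet.
p <- ggplot(toyByYr, aes(x = year, y = avgMPG))

# Each '+' adds one component to the plot object.
p <- p + geom_point()
p <- p + geom_smooth(method = "loess", formula = y ~ x, level = 0.95)
p <- p + xlab("Year") + ylab("Average MPG") + ggtitle("Toy data")

length(p$layers)  # one entry per geom added
# print(p) draws the plot
```

Because the plot is an ordinary R object until it is printed, intermediate versions can be stored, reused with a different geom, or extended later.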
Also, it is important to note that ggplot2, and the grammar of graphics in general, does not tell you how best to visualize your data, but gives you the tools to do so rapidly. If you want more advice on this topic, we strongly recommend looking into the works of Edward Tufte, who has numerous books on the matter, including the classic The Visual Display of Quantitative Information (Graphics Press, USA). Further, ggplot2 does not allow for dynamic data visualizations.

