Analysing automobile fuel efficiency over time
We have now successfully imported the data and looked at some important high-level statistics that provided us with a basic understanding of what values are in the dataset and how frequently some features appear. With this recipe, we continue the exploration by looking at some of the fuel efficiency metrics over time and in relation to other data points.
If you completed the previous recipe, you should have everything you need to continue.
The following steps use both plyr and the graphing library ggplot2 to explore the dataset:
- Let's start by looking at whether there is an overall trend in how average MPG changes over time. To do this, we use the ddply function from the plyr package to take the vehicles data frame, aggregate rows by year, and then, for each group, compute the mean highway, city, and combined fuel efficiency. The result is assigned to a new data frame, mpgByYr. Note that this is our first example of split-apply-combine: we split the data frame into groups by year, apply the mean function to specific variables, and then combine the results into a new data frame:
    mpgByYr <- ddply(vehicles, ~year, summarise,
                     avgMPG = mean(comb08),
                     avgHghy = mean(highway08),
                     avgCity = mean(city08))
- To gain a better understanding of this new data frame, we pass it to the ggplot function, telling it to plot the avgMPG variable against the year variable, using points. In addition, we specify that we want axis labels, a title, and a smoothed conditional mean (geom_smooth()), which is drawn as a line with a shaded confidence band:
    ggplot(mpgByYr, aes(year, avgMPG)) + geom_point() + geom_smooth() +
      xlab("Year") + ylab("Average MPG") + ggtitle("All cars")
    ## geom_smooth: method="auto" and size of largest group is <1000, so using
    ## loess. Use 'method = x' to change the smoothing method.
The preceding commands will give you the following plot:
- Based on this visualization, one might conclude that there has been a tremendous increase in the fuel economy of cars sold in the last few years. However, this can be a little misleading, because the later years include more hybrid and non-gasoline vehicles. The overall counts by primary fuel type are as follows:
    table(vehicles$fuelType1)
    ##            Diesel       Electricity Midgrade Gasoline       Natural Gas
    ##              1025                56                41                57
    ##  Premium Gasoline  Regular Gasoline
    ##              8521             24587
- Let's look at just gasoline cars, even though there are not many non-gasoline powered cars, and redraw the preceding plot. To do this, we use the subset function to create a new data frame, gasCars, which contains only the rows of vehicles in which the fuelType1 variable is one of the gasoline grades, there is no secondary fuel type (fuelType2), and the vehicle is not a hybrid (atvType):
    gasCars <- subset(vehicles,
                      fuelType1 %in% c("Regular Gasoline", "Premium Gasoline", "Midgrade Gasoline") &
                        fuelType2 == "" & atvType != "Hybrid")
    mpgByYr_Gas <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08))
    ggplot(mpgByYr_Gas, aes(year, avgMPG)) + geom_point() + geom_smooth() +
      xlab("Year") + ylab("Average MPG") + ggtitle("Gasoline cars")
    ## geom_smooth: method="auto" and size of largest group is <1000, so using
    ## loess. Use 'method = x' to change the smoothing method.
The preceding commands will give you the following plot:
- Have fewer large engine cars been made recently? If so, this could explain the increase. First, let's verify whether cars with larger engines have worse fuel efficiency. We note that the displ variable, which represents the displacement of the engine in liters, is currently a string variable that we need to convert to a numeric variable:
    typeof(gasCars$displ)
    ## "character"
    gasCars$displ <- as.numeric(gasCars$displ)
    ggplot(gasCars, aes(displ, comb08)) + geom_point() + geom_smooth()
    ## geom_smooth: method="auto" and size of largest group is >=1000, so using
    ## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
    ## smoothing method.
    ## Warning: Removed 2 rows containing missing values (stat_smooth).
    ## Warning: Removed 2 rows containing missing values (geom_point).
The preceding commands will give you the following plot:
This scatter plot offers convincing evidence of a negative correlation between engine displacement and fuel efficiency; that is, cars with smaller engines tend to be more fuel-efficient.
- Now, let's see whether more cars with small engines were made in later years, which could explain the drastic increase in fuel efficiency:
    avgCarSize <- ddply(gasCars, ~year, summarise, avgDispl = mean(displ))
    ggplot(avgCarSize, aes(year, avgDispl)) + geom_point() + geom_smooth() +
      xlab("Year") + ylab("Average engine displacement (l)")
    ## geom_smooth: method="auto" and size of largest group is <1000, so using
    ## loess. Use 'method = x' to change the smoothing method.
    ## Warning: Removed 1 rows containing missing values (stat_smooth).
    ## Warning: Removed 1 rows containing missing values (geom_point).
The preceding commands will give you the following plot:
- From the preceding figure, the average engine displacement has decreased substantially since 2008. To get a better sense of the impact this might have had on fuel efficiency, we can put both MPG and displacement by year on the same graph. Using ddply, we create a new data frame, byYear, which contains both the average fuel efficiency and the average engine displacement by year:
    byYear <- ddply(gasCars, ~year, summarise,
                    avgMPG = mean(comb08), avgDispl = mean(displ))
    head(byYear)
    ##   year   avgMPG avgDispl
    ## 1 1984 19.12162 3.068449
    ## 2 1985 19.39469       NA
    ## 3 1986 19.32046 3.126514
    ## 4 1987 19.16457 3.096474
    ## 5 1988 19.36761 3.113558
    ## 6 1989 19.14196 3.133393
- The head function shows us that the resulting data frame has three columns: year, avgMPG, and avgDispl. To use the faceting capability of ggplot2 to display Average MPG and Avg engine displacement by year on separate but aligned plots, we must melt the data frame, converting it from what is known as a wide format to a long format:
    byYear2 <- melt(byYear, id = "year")
    levels(byYear2$variable) <- c("Average MPG", "Avg engine displacement")
    head(byYear2)
    ##   year    variable    value
    ## 1 1984 Average MPG 19.12162
    ## 2 1985 Average MPG 19.39469
    ## 3 1986 Average MPG 19.32046
    ## 4 1987 Average MPG 19.16457
    ## 5 1988 Average MPG 19.36761
    ## 6 1989 Average MPG 19.14196
If we use the nrow function, we can see that the byYear2 data frame has 62 rows, whereas the byYear data frame has only 31. The two separate columns from byYear (avgMPG and avgDispl) have now been melted into one new column (value) in the byYear2 data frame. Note that the variable column in the byYear2 data frame identifies which original column each value came from:
    ggplot(byYear2, aes(year, value)) + geom_point() + geom_smooth() +
      facet_wrap(~variable, ncol = 1, scales = "free_y") +
      xlab("Year") + ylab("")
    ## geom_smooth: method="auto" and size of largest group is <1000, so using
    ## loess. Use 'method = x' to change the smoothing method.
    ## geom_smooth: method="auto" and size of largest group is <1000, so using
    ## loess. Use 'method = x' to change the smoothing method.
    ## Warning: Removed 1 rows containing missing values (stat_smooth).
    ## Warning: Removed 1 rows containing missing values (geom_point).
The preceding commands will give you the following plot:
From this plot, we can see the following:
    - Since 2009, there has been a decrease in the average engine size, which partially explains the increase in fuel efficiency.
    - Until 2005, there was an increase in the average engine size, but the fuel efficiency remained roughly constant. This seems to indicate that engine efficiency has increased over the years.
    - The years 2006–2008 are interesting. Though the average engine size increased quite suddenly, the MPG remained roughly the same as in previous years. This seeming discrepancy might require more investigation.
- Given the trend toward smaller displacement engines, let's see whether automatic or manual transmissions are more efficient for four-cylinder engines, and how the efficiencies have changed over time:
    gasCars4 <- subset(gasCars, cylinders == "4")
    ggplot(gasCars4, aes(factor(year), comb08)) + geom_boxplot() +
      facet_wrap(~trany2, ncol = 1) +
      theme(axis.text.x = element_text(angle = 45)) +
      labs(x = "Year", y = "MPG")
The preceding command will give you the following plot:
This time, ggplot2 was used to create box plots that help visualize the distribution of values (and not just a single value, such as a mean) for each year.
- Next, let's look at how the proportion of each transmission type among these four-cylinder cars has changed from year to year:
    ggplot(gasCars4, aes(factor(year), fill = factor(trany2))) +
      geom_bar(position = "fill") +
      labs(x = "Year", y = "Proportion of cars", fill = "Transmission") +
      theme(axis.text.x = element_text(angle = 45)) +
      geom_hline(yintercept = 0.5, linetype = 2)
The preceding command will give you the following plot:
In step 9, it appears that manual transmissions are more efficient than automatic transmissions, and that both show the same increase, on average, since 2008. However, there is something odd here. There appear to be many very efficient cars (more than 40 MPG) with automatic transmissions in later years, and almost no manual transmission cars with similar efficiencies in the same time frame. The pattern is reversed in earlier years. Is there a change in the proportion of manual cars available each year? Yes, as the last plot shows. What are these very efficient cars? In the next section, we look at the makes and models of the cars in the database.
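The proportions that geom_bar(position = "fill") draws can also be checked numerically. The following is a minimal sketch in base R, assuming a small made-up data frame (toyCars, with invented year and trany2 values) in place of the real gasCars4 data:

```r
# Toy stand-in for gasCars4; the values are invented for illustration only.
toyCars <- data.frame(
  year   = c(1984, 1984, 1984, 1984, 2012, 2012, 2012, 2012),
  trany2 = c("Manual", "Manual", "Manual", "Auto",
             "Auto", "Auto", "Auto", "Manual")
)

# Count cars per year and transmission type.
counts <- table(toyCars$year, toyCars$trany2)

# Convert each row (one row per year) to proportions that sum to 1.
props <- prop.table(counts, margin = 1)

print(round(props, 2))
```

Reading across a row of props gives the share of each transmission type for that year, which is what the stacked bars in the plot encode.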
With this recipe, we threw you into the deep end of data analysis with R, using two very important R packages, plyr and ggplot2. Just as traditional software development has design patterns for common constructs, a few such patterns are emerging in the field of data science. One of the most notable is the split-apply-combine pattern highlighted by Dr. Hadley Wickham. In this strategy, you break the problem into smaller, more manageable pieces by grouping the data on some variable. You then perform an operation on each group and combine the results into a new data structure. As you can see in this recipe, we used split-apply-combine repeatedly to examine the data from many different perspectives.
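To make the three stages visible, here is a minimal sketch of split-apply-combine written in base R rather than with ddply. The data frame toy and its values are invented for illustration; only the column names (year, comb08, displ) follow the recipe. Unlike the recipe's code, this sketch passes na.rm = TRUE to mean(), which is an assumption about how you might want missing displacements handled:

```r
# Toy stand-in for the vehicles data; values are invented for illustration.
toy <- data.frame(
  year   = c(1984, 1984, 1985, 1985, 1986, 1986),
  comb08 = c(18, 20, 19, 21, 22, 24),
  displ  = c(3.0, 3.2, 2.8, NA, 2.5, 2.7)
)

# Split: one data frame per year.
pieces <- split(toy, toy$year)

# Apply: summarise each piece. mean() returns NA when any value is missing,
# so na.rm = TRUE is used for displ (a departure from the recipe's code).
summaries <- lapply(pieces, function(piece) {
  data.frame(
    year     = piece$year[1],
    avgMPG   = mean(piece$comb08),
    avgDispl = mean(piece$displ, na.rm = TRUE)
  )
})

# Combine: stack the per-year summaries into a single data frame.
byYearToy <- do.call(rbind, summaries)
rownames(byYearToy) <- NULL
print(byYearToy)
```

The ddply calls in the recipe perform the same three stages in one statement. The NA that appears for avgDispl in 1985 in the recipe's byYear output is consistent with mean() being called without na.rm = TRUE on a group that contains a missing displacement.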
Beyond plyr, this recipe relied heavily on the ggplot2 library, which deserves additional exposition. We will refrain from providing an extensive ggplot2 tutorial, as there are a number of excellent tutorials available online. What matters is that you understand how ggplot2 allows you to construct such complex statistical visualizations in such a terse fashion.
The ggplot2 library is an open source implementation of the foundational grammar of graphics by Wilkinson, Anand, and Grossman for R. The Grammar of Graphics attempts to decompose statistical data visualizations into component parts to better understand how such graphics are created. With ggplot2, Hadley Wickham takes these ideas and implements a layered approach, allowing the user to assemble complex visualizations from individual pieces very quickly. Take, for example, the first graph for this recipe, which shows the average fuel efficiency of all models of cars by year:
    ggplot(mpgByYr, aes(year, avgMPG)) + geom_point() + geom_smooth() +
      xlab("Year") + ylab("Average MPG") + ggtitle("All cars")
To construct this plot, we first tell ggplot the data frame that will serve as the data for the plot (mpgByYr), and then the aesthetic mappings that tell ggplot2 which variables will be mapped to visual characteristics of the plot. In this case, aes(year, avgMPG) implicitly specifies that year will be mapped to the x axis and avgMPG will be mapped to the y axis. geom_point() tells the library to plot the specified data as points, and a second geom, geom_smooth(), adds a smoothed mean line with a shaded confidence band (at a level of 0.95, by default) for the same data. Finally, the xlab(), ylab(), and ggtitle() functions are used to add labels to the plot. Thus, we can generate a complex, publication-quality graph in a single statement; ggplot2 is capable of far more complex plots.
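The layered construction can be made explicit by building the same kind of plot one piece at a time. This is a sketch under stated assumptions: toyMpg is a made-up data frame standing in for mpgByYr, and the smoothing method and level are written out explicitly rather than left to the defaults:

```r
library(ggplot2)

# Toy stand-in for mpgByYr; values are invented for illustration.
toyMpg <- data.frame(
  year   = 1984:1993,
  avgMPG = c(19.1, 19.4, 19.3, 19.2, 19.4, 19.1, 19.0, 18.9, 18.9, 19.1)
)

# Start with the data and the aesthetic mappings only; no layers yet.
p <- ggplot(toyMpg, aes(x = year, y = avgMPG))

# Add one layer at a time: points first, then the smoother.
p <- p + geom_point()
p <- p + geom_smooth(method = "loess", level = 0.95)

# Labels are not layers; they annotate the plot.
p <- p + xlab("Year") + ylab("Average MPG") + ggtitle("All cars (toy data)")

# Each geom contributed one layer to the plot object.
print(length(p$layers))
```

Nothing is drawn until the plot object is printed, for example with print(p), so intermediate objects such as p can be stored, inspected, and extended before rendering.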
Also, it is important to note that ggplot2, and the grammar of graphics in general, does not tell you how best to visualize your data, but gives you the tools to do so rapidly. If you want more advice on this topic, we strongly recommend looking into the works of Edward Tufte, who has numerous books on the matter, including the classic The Visual Display of Quantitative Information, Graphics Press USA. Further, ggplot2 does not allow for dynamic data visualizations.