Investigating the makes and models of automobiles
With the first set of questions asked and answered about this dataset, let's move on to additional analyses.
If you completed the previous recipe, you should have everything you need to continue.
This recipe will investigate the makes and models of automobiles and how they have changed over time:
- Let's look at how the makes and models of cars inform fuel efficiency over time. First, let's count the distinct makes of cars available in the US in each year, concentrating on four-cylinder cars:
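# Count the distinct makes of four-cylinder cars offered in each year
carsMake <- ddply(gasCars4, ~year, summarise, numberOfMakes = length(unique(make)))
# Plot the number of makes against the year
ggplot(carsMake, aes(year, numberOfMakes)) +
  geom_point() +
  labs(x = "Year", y = "Number of available makes") +
  ggtitle("Four cylinder cars")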
We see in the following graph that the number of available makes has declined over this period, though there has been a small uptick in recent years:
- Can we look at the makes that have been available for every year of this study? We find there are only 12 manufacturers that made four-cylinder cars every year during this period:
uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
commonMakes <- Reduce(intersect, uniqMakes)
commonMakes
## [1] "Ford" "Honda" "Toyota" "Volkswagen" "Chevrolet"
## [6] "Chrysler" "Nissan" "Dodge" "Mazda" "Mitsubishi"
## [11] "Subaru" "Jeep"
- How have these manufacturers done over time with respect to fuel efficiency? We find that most manufacturers have shown improvement over this time, though several manufacturers have demonstrated quite sharp fuel efficiency increases in the last 5 years:
carsCommonMakes4 <- subset(gasCars4, make %in% commonMakes)
avgMPG_commonMakes <- ddply(carsCommonMakes4, ~year + make, summarise, avgMPG = mean(comb08))
ggplot(avgMPG_commonMakes, aes(year, avgMPG)) +
  geom_line() +
  facet_wrap(~make, nrow = 3)
The preceding commands will give you the following plot:
In step 2, there is definitely some interesting magic at work, with a lot being done in only a few lines of code. This is both a beautiful and a problematic aspect of R. It is beautiful because it allows the concise expression of programmatically complex ideas, but it is problematic because R code can be quite inscrutable if you are not familiar with the particular library.
In the first line, we use dlply (not ddply) to take the gasCars4 data frame, split it by year, and then apply the unique function to the make variable. For each year, the unique available automobile makes are computed, and dlply returns a list of these per-year results (one element per year). We use dlply rather than ddply because dlply takes a data frame (d) as input and returns a list (l) as output, whereas ddply takes a data frame (d) as input and outputs a data frame (d):
uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
commonMakes <- Reduce(intersect, uniqMakes)
commonMakes
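To make the list-versus-data-frame distinction concrete, here is a minimal sketch on a small made-up data frame. The name toyCars and all of its values are invented for illustration, and base R's split, lapply, and aggregate are used as stand-ins for dlply and ddply. They mimic the shape of the output, not how plyr works internally.

```r
# Hypothetical toy data standing in for gasCars4 (all values are made up)
toyCars <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2002, 2002, 2002),
  make = c("Ford", "Honda", "Ford", "Ford", "Saab", "Ford", "Honda", "Saab"),
  stringsAsFactors = FALSE
)

# List output (the shape dlply returns): one element per year,
# each holding the unique makes seen in that year
makesByYear <- lapply(split(toyCars$make, toyCars$year), unique)
str(makesByYear)

# Data frame output (the shape ddply returns): one row per year
makeCounts <- aggregate(
  toyCars["make"],
  by = toyCars["year"],
  FUN = function(m) length(unique(m))
)
names(makeCounts)[2] <- "numberOfMakes"
makeCounts
```

The list form is the convenient one here because each year can hold a different number of makes, which does not fit neatly into the rows and columns of a data frame.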
The next line is even more interesting. It uses the Reduce higher-order function, which embodies the same idea as the reduce step in the MapReduce programming paradigm introduced by Google that underlies Hadoop. R is, in some ways, a functional programming language and offers several higher-order functions as part of its core. A higher-order function accepts another function as input. In this line, we pass the intersect function to Reduce, which applies intersect successively across the list of unique makes per year that was created previously: first to the makes of the first two years, then to that result and the makes of the third year, and so on. Ultimately, this results in a single collection of the automobile makes that are present every year.
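The fold that Reduce performs can be sketched on a small hand-written list. The list below is hypothetical (made-up makes and years, shaped like uniqMakes). The accumulate argument is a standard option of base R's Reduce, shown here only to expose the intermediate results.

```r
# Hypothetical per-year makes (made-up values), shaped like uniqMakes
makesByYear <- list(
  "2000" = c("Ford", "Honda", "Saab"),
  "2001" = c("Ford", "Saab", "Toyota"),
  "2002" = c("Honda", "Ford", "Toyota")
)

# Fold intersect across the list:
# intersect(intersect(makes2000, makes2001), makes2002)
everyYear <- Reduce(intersect, makesByYear)

# accumulate = TRUE keeps every intermediate result, making the fold visible
steps <- Reduce(intersect, makesByYear, accumulate = TRUE)

# The same fold written as an explicit loop
acc <- makesByYear[[1]]
for (makes in makesByYear[-1]) {
  acc <- intersect(acc, makes)
}

everyYear
steps
```

Each element of steps holds the makes still present after one more year has been folded in, so the running result can only shrink or stay the same from one step to the next.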
The two lines of code express a very simple concept (determining all automobile makes present every year) that took two paragraphs to describe.
The final graph in this recipe is an excellent example of the faceted graphics capabilities of ggplot2. Adding + facet_wrap(~make, nrow = 3) tells ggplot2 that we want a separate set of axes for each make of automobile and that these subplots should be distributed across three rows. This is an incredibly powerful data visualization technique, as it allows us to clearly see patterns that might only manifest for a particular value of a variable.
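As a small illustration of what faceting changes, the sketch below builds the same line plot with and without facet_wrap. The data frame toyMPG and its values are invented for illustration, and the sketch assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical toy data (made-up values): average MPG per year for three makes
toyMPG <- data.frame(
  year   = rep(2000:2004, times = 3),
  make   = rep(c("Make A", "Make B", "Make C"), each = 5),
  avgMPG = c(24, 25, 25, 26, 28,
             22, 22, 23, 23, 24,
             27, 26, 27, 29, 31)
)

# Without faceting: all makes share one panel, distinguished only by colour
pSingle <- ggplot(toyMPG, aes(year, avgMPG, colour = make)) +
  geom_line()

# With faceting: one panel per make, arranged in three rows
pFaceted <- ggplot(toyMPG, aes(year, avgMPG)) +
  geom_line() +
  facet_wrap(~make, nrow = 3)
```

Printing pSingle or pFaceted at the console draws the corresponding plot. With only three makes a single coloured panel is still readable, but with twelve makes, as in this recipe, separate panels tend to be much easier to compare.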
We kept things simple in this first data science project. The dataset itself was small (only 12 megabytes uncompressed) and was easily stored and handled on a basic laptop. We used R to import the dataset, check the integrity of some (but not all) of the data fields, and summarize the data. We then moved on to exploring the data by asking a number of questions and using two key libraries, plyr and ggplot2, to manipulate the data and visualize the results. In this data science pipeline, our final stage was simply the text that we wrote to summarize our conclusions and the visualizations produced by ggplot2.