Posts

Showing posts from 2016

Cross-Validation: Concept and Example in R

Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. In machine learning, cross-validation is a resampling method used for model evaluation, so that a model is not tested on the same dataset on which it was trained. Testing on the training data is a common mistake, especially since a separate testing dataset is not always available. It usually leads to inaccurate performance measures: the model gets an almost perfect score because it is being tested on data it has already seen. Cross-validation is usually preferred as a way to avoid this kind of mistake. The concept of cross-validation is actually simple: instead of using the whole dataset to train and then test on the same data, we could randomly divide
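
One common way to carry out such a random division is k-fold cross-validation. The sketch below is a minimal base R illustration under stated assumptions: the toy data frame `toy`, the linear model, the choice of k = 5, and the names `fold` and `rmse` are all illustrative and not taken from the post.

```r
# Toy data (hypothetical): a noisy linear relationship between x and y
set.seed(42)
n <- 100
toy <- data.frame(x = runif(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.1)

k <- 5
# Randomly assign each row to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = n))

rmse <- numeric(k)
for (i in seq_len(k)) {
  train <- toy[fold != i, ]   # fit on the other k - 1 folds
  test  <- toy[fold == i, ]   # evaluate on the held-out fold
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$y - pred)^2))
}

# Average held-out error across the k folds
cv_rmse <- mean(rmse)
print(cv_rmse)
```

Each row is used for testing exactly once and never appears in the training set of the model that scores it, which is what keeps the estimate from being overly optimistic.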

Investigating the makes and models of automobiles

With the first set of questions asked and answered about this dataset, let's move on to additional analyses.

Getting ready

If you completed the previous recipe, you should have everything you need to continue.

How to do it...

This recipe will investigate the makes and models of automobiles and how they have changed over time:

Let's look at how the makes and models of cars inform fuel efficiency over time. First, let's look at the frequency of the makes and models of cars available in the US over this time and concentrate on four-cylinder cars:

    carsMake <- ddply(gasCars4, ~year, summarise, numberOfMakes = length(unique(make)))
    ggplot(carsMake, aes(year, numberOfMakes)) + geom_point() + labs(x = "Year", y = "Number of available makes") + ggtitle("Four cylinder cars")

We see in the following graph that there has been a decline in the number

Analysing automobile fuel efficiency over time

We have now successfully imported the data and looked at some important high-level statistics that gave us a basic understanding of what values are in the dataset and how frequently some features appear. With this recipe, we continue the exploration by looking at some of the fuel efficiency metrics over time and in relation to other data points.

Getting ready

If you completed the previous recipe, you should have everything you need to continue.

How to do it...

The following steps will use both plyr and the graphing library ggplot2 to explore the dataset:

Let's start by looking at whether there is an overall trend in how MPG changes over time on average. To do this, we use the ddply function from the plyr package to take the vehicles data frame, aggregate rows by year, and then, for each group, compute the mean highway, city, and combined fuel efficiency.
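
The group-by-year averaging described above can be sketched on a small made-up data frame. This is only an illustration: `vehicles_toy` and its column names `hwy`, `city`, and `comb` are hypothetical stand-ins for the real vehicles data, and base R's `aggregate` is used here in place of `ddply` so the example runs without extra packages.

```r
# Hypothetical stand-in for the vehicles data frame (values are invented)
vehicles_toy <- data.frame(
  year = c(1984, 1984, 1985, 1985),
  hwy  = c(20, 24, 22, 26),   # highway MPG
  city = c(15, 17, 16, 18),   # city MPG
  comb = c(17, 20, 18, 21)    # combined MPG
)

# Aggregate rows by year and compute the mean of each MPG column per group
mpg_by_year <- aggregate(cbind(hwy, city, comb) ~ year,
                         data = vehicles_toy, FUN = mean)
print(mpg_by_year)

# With plyr loaded, the same step could be written as:
# ddply(vehicles_toy, ~year, summarise,
#       hwy = mean(hwy), city = mean(city), comb = mean(comb))
```

The result has one row per year, which is the shape needed to plot average fuel efficiency against time.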