Linear Discriminant Analysis

Discriminant function analysis - Meauring brines on well

Let us assume that a study of ancient artifacts that have been collected from mines needs to be carried out. Rock samples have been collected from the mines. On the collected rock samples geochemical measurements have been carried out. A similar study has been carried out on the collected artifacts. In order to separate the samples into the mine from which they were excavated, DFA can be used as a function. The function can then be applied to the artifacts to predict which mine was the source of each artifact.

Getting ready

In order to perform discriminant function analysis we shall be using a dataset collected from mines.

Step 1 - collecting and describing data

The dataset on data analysis in geology titled BRINE shall be used. This can be obtained from http://www.kgs.ku.edu/Mathgeo/Books/Stat/ASCII/BRINE.TXT . The dataset is in a standard form, with rows corresponding to samples and columns corresponding to variables. Each sample is assigned to a stratigraphic unit, listed in the last column. There are 19 cases and eight variables in the dataset. The eight numeric measurements include the following:
  • No
  • HCO3
  • SO4
  • CL
  • CA
  • MG
  • NA
  • Group

How to do it...

Let's get into the details.

Step 2 - exploring data

The first step is to load the following package:
    > library(MASS)

Note

Version info: Code for this page was tested in R version 3.2.3 (2015-12-10)
Let's explore the data and understand the relationships among the variables. We'll begin by importing the txt data file named brine.txt. We will be saving the data to the brine data frame, as follows:
> brine <- read.table("d:/brine.txt", header=TRUE, sep=",", row.names=1)
Next we shall print the brine data frame. The head() function returns the brine data frame. The brine data frame is passed as an input parameter. Use the following code:
    > head(brine)
The results are as follows:
    HCO3    SO4      Cl      Ca      Mg       Na   GROUP
1   10.4   30.0    967.1    95.9    53.7    857.7     1
2    6.2   29.6   1174.9   111.7    43.9   1054.7     1
3    2.1   11.4   2387.1   348.3   119.3   1932.4     1
4    8.5   22.5   2186.1   339.6    73.6   1803.4     1
5    6.7   32.8   2015.5   287.6    75.1   1691.8     1
6    3.8   18.9   2175.8   340.4    63.8   1793.9     1
DFA assumes multivariate normality. The data must be checked to verify the normality before performing the analysis.
In order to verify the appropriateness of the transformation, plotting of the data is carried out. The pairs() function is used to plot the data. It produces a matrix of scatterplots. The cross-plots should only compare the measurement variables in columns 1-6, the last (7th column) is the name of the group. Consider the following:
> pairs(brine[ ,1:6])
The plot is as shown in the following screenshot:

Step 3 - transforming data

It is visible that the data has a comet-shaped distribution pattern. This indicates that log transformation of the data is required, which is common for geochemical data. It is good practice to first make a copy of the entire dataset, and then apply the log transformation only to the geochemical measurements. Since the data includes zeros as well; log+1 transformation should be carried out instead of log transformation on the dataset. The brine data frame is copied to the brine.log data frame. The log transformation on the data frame is carried out. As stated earlier, log transformation is carried out. Look at the following code:
    > brine.log <- brine
    > brine.log[ ,1:6] <- log(brine[ ,1:6]+1)
    > pairs(brine.log[ ,1:6])
After the data transformation, in order to re-evaluate the normality condition using the pairs() function data frame, brine.log is replotted. The distribution appears to be more normal. The skewness has been reduced compared to the earlier plot:
    > pairs(brine.log[ ,1:6])
The plot is as shown in the following screenshot:

Step 4 - training the model

The next step is about training the model. This is carried out by discriminant function analysis. The lda()function is called to perform discriminant function analysis as follows:
> brine.log.lda <- lda(GROUP ~ HCO3 + SO4 + Cl + Ca + Mg + Na, data=brine.log)
The format of this call is much like a linear regression or ANOVA, in that we specify a formula. Here, the GROUP variable should be treated as the dependent variable, with the geochemical measurements as the independent variables. In this case, no interactions between the variables are being modeled, so the variables are added with + instead of *. Because attach() was not called, the name of the data frame must be supplied to the data parameter. After running the DFA, the first step is to view the results, as follows:
    > brine.log.lda
The results are as follows:
Call:
lda(GROUP ~ HCO3 + SO4 + Cl + Ca + Mg + Na, data = brine.log)
Prior probabilities of groups:
        1             2             3 
0.3684211     0.3157895     0.3157895 
Group means:
        HCO3        SO4         Cl         Ca         Mg         Na
1   1.759502   3.129009   7.496891   5.500942   4.283490   7.320686
2   2.736481   3.815399   6.829565   4.302573   4.007725   6.765017
3   1.374438   2.378965   6.510211   4.641049   3.923851   6.289692
Coefficients of linear discriminants:
                  LD1             LD2
HCO3      -1.67799521      0.64415802
SO4        0.07983656      0.02903096
Cl         22.27520614     -0.31427770
Ca        -1.26859368      2.54458682
Mg        -1.88732009     -2.89413332
Na       -20.86566883      1.29368129
Proportion of trace:
      LD1        LD2 
   0.7435     0.2565
  • The first part of the output displays the formula that was fitted.
  • The second section is the prior probabilities of the groups, which reflects the proportion of each group within the dataset. In other words, if you had no measurements and the number of measured samples represented the actual relative abundances of the groups, the prior probabilities would describe the probability that any unknown sample would belong to each of the groups.
  • The third section shows the group means, which is a table of the average value of each of the variables for each of your groups. Scanning this table can help you to see if the groups are distinctive in terms of one or more of the variables.
  • The fourth section reports the coefficients of the discriminant function (a, b, and c). Because there are three groups, there are 3-1 linear discriminants (if you had only two groups, you would need only 1 [2-1] linear discriminants). For each linear discriminant (LD1 and LD2), there is one coefficient corresponding, in order, to each of the variables.
  • Finally, the fifth section shows the proportion of the trace, which gives the variance explained by each discriminant function. Here, first the discriminant explains 75% of the variance, with the remainder explained by the second discriminant.

Step 5 - classifying the data

The predict() function, also part of the MASS package, uses the lda() results to assign the samples to the groups. In other words, since lda() derived a linear function that should classify the groups, predict() allows you to apply this function to the same data to see how successful the classification function is. Following the statistical convention that x-hat is the prediction of x, (hat is added to the object name to make it clear that these are the predictions). Consider the following:
    > brine.log.hat <- predict(brine.log.lda)
Let us print brine.log.hat as follows:
> brine.log.hat
The results are as follows:
$class
 [1] 2 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3
Levels: 1 2 3
$posterior
              1                2                3
1    2.312733e-01     7.627845e-01     5.942270e-03
2    9.488842e-01     3.257237e-02     1.854347e-02
3    8.453057e-01     9.482540e-04     1.537461e-01
4    9.990242e-01     8.794725e-04     9.632578e-05
5    9.965920e-01     2.849903e-03     5.581176e-04
6    9.984987e-01     1.845534e-05     1.482872e-03
7    8.676660e-01     7.666611e-06     1.323263e-01
8    4.938019e-03     9.949035e-01     1.584755e-04
9    4.356152e-03     9.956351e-01     8.770078e-06
10   2.545287e-05     9.999439e-01     3.066264e-05
11   2.081510e-02     9.791728e-01     1.210748e-05
12   1.097540e-03     9.989023e-01     1.455693e-07
13   1.440307e-02     9.854613e-01     1.356671e-04
14   4.359641e-01     2.367602e-03     5.616683e-01
15   6.169265e-02     1.540353e-04     9.381533e-01
16   7.500357e-04     4.706701e-09     9.992500e-01
17   1.430433e-03     1.095281e-06     9.985685e-01
18   2.549733e-04     3.225658e-07     9.997447e-01
19   6.433759e-02     8.576694e-03     9.270857e-01
$x
              LD1            LD2
1      -1.1576284     -0.1998499
2     -0.1846803      0.6655823
3       1.0179998      0.6827867
4      -0.3939366      2.6798084
5      -0.3167164      2.0188002
6       1.0061340      2.6434491
7       2.0725443      1.5714400
8      -2.0387449     -0.9731745
9      -2.6054261     -0.2774844
10     -2.5191350     -2.8304663
11     -2.4915044      0.3194247
12     -3.4448401      0.1869864
13     -2.0343204     -0.4674925
14      1.0441237     -0.0991014
15      1.6987023     -0.6036252
16      3.9138884     -0.7211078
17      2.7083649     -1.3896956
18      2.9310268     -1.9243611
19      0.7941483     -1.2819190
The output starts with the assigned classifications of each of our samples. Next, it lists the posterior probabilities of each sample to each group, with the probabilities in each row (that is, for each sample) summing to 1.0. These posterior probabilities measure the strength of each classification. If one of these probabilities for a sample is much greater than all the others, that sample is assigned to one group with a high degree of certainty. If two or more of the probabilities are nearly equal, the assignment is much less certain. If there are many groups, the following command is a quick way to find the maximum probability for each sample:
> apply(brine.log.hat$posterior, MARGIN=1, FUN=max)
        1           2             3  4             5         6             7         8 
0.7627845 0.9488842 0.8453057 0.9990242 0.9965920 0.9984987 0.8676660 0.9949035 
        9          10          11        12          13        14          15        16 
0.9956351 0.9999439 0.9791728 0.9989023 0.9854613 0.5616683 0.9381533 0.9992500 
       17          18          19 
0.9985685 0.9997447 0.9270857
Since most of the probabilities in the dataset are large (>0.9), this indicates that most of the samples in the set have been assigned to one group.
If most of these probabilities are large, the overall classification is successful. The last part of the predict() output lists the scores of each sample for each discriminant function axis. These scores can be plotted to show graphically how the groups are distributed in the discriminant function, just as scores from a principal components analysis could be plotted, as follows:
> plot(brine.log.lda)
The three groups occupy distinctly different and non-overlapping regions. There is just one case of group 1 being close to group 2, so one can clearly state that the discrimination has been successful.
The plot is as shown in the following figure:
A second type of plot shows the data plot along a particular discriminant function axis as follows:
> plot(brine.log.lda, dimen=1, type="both")
Again, note the good separation of the groups along discriminant function 1, and particularly so for group 2.

Step 6 - evaluating the model

The effectiveness of DFA in classifying the groups must be evaluated, and this is done by comparing the assignments made by predict() to the actual group assignments. The table() function is most useful for this. By convention, it is called with the actual assignments as the first argument and the fitted assignments as the second argument, as follows:
> tab <- table(brine.log$GROUP, brine.log.hat$class)
Printing the value of tab.
    > tab
The results are as follows:
      1   2   3
  1   6   1   0
  2   0   6   0
  3   0   0   6
The rows in the output correspond to the groups specified in the original data and the columns correspond to the classification made by the DFA. In a perfect classification, large values would lie along the diagonal, with zeroes off the diagonal, which would indicate that all samples that belong to group 1 were discriminated by the DFA as belonging to group 1, and so on. The form of this table can give you considerable insight into which groups are reliably discriminated. It can also show which groups are likely to be confused and which types of misclassification are more common than others. The following command will calculate the overall predictive accuracy, that is, the proportion of cases that lie along the diagonal:
> sum(tab[row(tab) == col(tab)]) / sum(tab)
The result is as follows:
[1] 0.9473684
Here the predictive accuracy is almost 95%, quite a success. This approach measures what is called the resubstitution error, how well the samples are classified when all the samples are used to develop the discriminant function.
A second approach for evaluating a DFA is leave-one-out cross-validation (also called jackknifed validation), which excludes one observation. This approach of evaluating DFA uses the data that has been left out, that is, excluding one observation. We are now left with n - 1 observation. This cross-validation technique is done automatically for each sample in the dataset. To do this, add CV=TRUE (think Cross-validation) to the lda() call as follows:
> brine.log.lda <- lda(GROUP ~ HCO3 + SO4 + Cl + Ca + Mg + Na, data=brine.log, CV=TRUE) 
The success of the discrimination can be measured similarly, as follows:
> tab <- table(brine.log$GROUP, brine.log.lda$class)
Print the value of tab as follows:
> tab
The result is as follows:
      1   2   3
  1  6   1   0
  2   1   4   1
  3   1   0   5
> sum(tab[row(tab) == col(tab)]) / sum(tab)
The result is as follows:
[1] 0.7894737
In this dataset, the jackknifed validation is considerably less accurate (only 79% accurate), reflecting that the resubstitution error always overestimates the performance of a DFA. Such a discrepancy is particularly common with small datasets such as this, and discriminant function analysis is often much more successful with large datasets.

Comments

Popular posts from this blog

Driving Visual Analysis with Automobile Data (R)

Evaluating Classification Model Performance

Importing automobile fuel efficiency data into R