Unlocking Ancient Secrets with Data
A Guide to Discriminant Function Analysis in R
Imagine you are an archaeologist. You’ve discovered a cache of ancient artifacts near a series of salt mines, but you have a problem: you don't know which specific mine these artifacts originated from.
Fortunately, science gives us a clue. The chemical composition of the artifacts can be compared with the "brine" (salt water) signatures of the mines. But how do you make that match mathematically?
Enter Discriminant Function Analysis (DFA). In this tutorial, we will use R to build a predictive model that acts as a "chemical fingerprint scanner," identifying the source of a sample based on its geochemical makeup.
Getting Started
We will be using the BRINE dataset from the Kansas Geological Survey. It contains 19 water samples with concentrations of six dissolved constituents (bicarbonate, sulfate, chloride, calcium, magnesium, and sodium) and the stratigraphic unit (GROUP) each sample belongs to.
Prerequisites: You will need R installed. We will use the MASS package, which ships with standard R distributions.
Step 1: Load the Data
First, let's load our library and pull the data directly from the source.
library(MASS)
# Load the dataset
url <- "http://www.kgs.ku.edu/Mathgeo/Books/Stat/ASCII/BRINE.TXT"
brine <- read.table(url, header=TRUE, sep=",", row.names=1)
# Peek at the data
head(brine)
What you’ll see:
Rows representing samples and columns representing chemical concentrations ($HCO_3$, $SO_4$, $Cl$, etc.). The last column, GROUP, is our target—it tells us which mine/unit the sample came from.
Step 2: Explore and Check Assumptions
DFA works best when the predictors are normally distributed (bell-shaped) within each group, with similar spread across groups. Let's visualize our data with a scatterplot matrix to see whether it comes close to meeting this requirement.
# Plot columns 1-6 (measurements)
pairs(brine[ ,1:6])
The Diagnosis:
If you look closely at the plot above, you’ll see the data points form "comet" shapes. This indicates the data is skewed, which violates the assumption of normality. We need to fix this before we calculate our formula.
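Beyond eyeballing the scatterplot matrix, you can quantify the skew. As a sketch (on a synthetic right-skewed variable, not the BRINE data itself), the Shapiro-Wilk test in base R flags the non-normality and shows how a log transform helps:

```r
# Synthetic right-skewed data standing in for a skewed chemical variable
set.seed(42)
x <- rlnorm(100, meanlog = 2, sdlog = 0.8)  # lognormal: strongly right-skewed

# Shapiro-Wilk test: a small p-value suggests the data is not normal
shapiro.test(x)$p.value       # typically far below 0.05 for skewed data
shapiro.test(log(x))$p.value  # much larger after a log transform
```

The "comet" shapes in the pairs plot are the visual version of what this test reports numerically.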
Step 3: Transform the Data
To fix the skew, we apply a Log Transformation. Because some values might be zero (and the log of zero is undefined), we add 1 to every value first: log(x + 1). (Base R's log1p(x) computes the same quantity.)
# Create a copy of the data to keep the original safe
brine.log <- brine
# Apply log transformation to the chemical variables (cols 1-6)
brine.log[ ,1:6] <- log(brine[ ,1:6] + 1)
# Check the plot again
pairs(brine.log[ ,1:6])
The Result:
The data points should now look more like round clouds rather than comets. The distributions are much closer to normal, and we are ready to build the model.
Step 4: Train the Model
We will use Linear Discriminant Analysis (LDA). Think of this as drawing lines (or planes) through the data cloud to separate the three different groups as distinctly as possible.
# Train the model
# GROUP is the target (~), everything else represents the predictors (+)
brine.log.lda <- lda(GROUP ~ HCO3 + SO4 + Cl + Ca + Mg + Na, data=brine.log)
# View the results
brine.log.lda
Understanding the Output:
Prior Probabilities: The percentage of samples in each group.
Group Means: The average chemical fingerprint for each group.
Coefficients: The math behind the lines drawn to separate the groups.
Proportion of Trace: How much of the between-group variance each discriminant function explains. Usually LD1 (the first function) explains the majority of the difference between groups.
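Each of these pieces can be pulled out of the fitted object directly. Here is a sketch using R's built-in iris data as a stand-in for brine.log (three groups, numeric predictors), since the components of an lda fit have the same names either way:

```r
library(MASS)

# iris stands in for brine.log: Species is the grouping variable
fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iris)

fit$prior    # prior probabilities: the share of samples in each group
fit$means    # group means: the average "fingerprint" per group
fit$scaling  # coefficients of the linear discriminants (LD1, LD2)

# Proportion of trace, computed from the singular values
fit$svd^2 / sum(fit$svd^2)
```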
Step 5: Visualize the Separation
Let's see how well our model separated the three groups.
plot(brine.log.lda)
Success! The plot should show three clusters with little to no overlap. This means the chemistry of the three mines is distinct enough that our model can tell them apart.
Step 6: Making Predictions
Now that we have a model, we can ask it to classify the existing data to see how confident it is. We use the predict() function.
# Predict the groups based on the model
brine.log.hat <- predict(brine.log.lda)
# Inspect prediction confidence (posterior probabilities)
head(brine.log.hat$posterior)
The Posterior Probability is a number between 0 and 1. If a sample has a probability of 0.99 for Group 1, the model is 99% sure it belongs there. In this dataset, most probabilities are very high, indicating a strong model.
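Under the hood, the predicted class is simply the group with the highest posterior probability for each row. A small sketch (again on the built-in iris data rather than BRINE) makes that link explicit:

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
hat <- predict(fit)

# Pick the column with the largest posterior for every sample...
manual <- colnames(hat$posterior)[apply(hat$posterior, 1, which.max)]

# ...and it reproduces the class labels predict() already gave us
all(manual == as.character(hat$class))  # TRUE
```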
Step 7: Evaluation (The Truth Test)
How accurate is the model really? We use a Confusion Matrix to compare the Predicted group vs. the Actual group.
Method A: Resubstitution Accuracy
This checks how well the model fits the data it was trained on.
# Create the confusion matrix
tab <- table(brine.log$GROUP, brine.log.hat$class)
print(tab)
# Calculate accuracy percentage
sum(tab[row(tab) == col(tab)]) / sum(tab)
Result: ~94.7%. This is excellent, but it can be misleading because the model has already "seen" this data.
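The accuracy line above has a more idiomatic equivalent using diag(), which extracts the diagonal (the correctly classified counts) from a square confusion matrix. A toy matrix shows the two forms agree:

```r
# Toy confusion matrix: 4 of 5 predictions are correct
tab <- table(actual    = c("A", "A", "B", "B", "C"),
             predicted = c("A", "B", "B", "B", "C"))

sum(diag(tab)) / sum(tab)                  # 0.8
sum(tab[row(tab) == col(tab)]) / sum(tab)  # 0.8, the same value
```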
Method B: Leave-One-Out Cross-Validation (Jackknifed)
This is a tougher test. It removes one sample, trains the model on the remaining 18, and then tries to guess the missing one. It repeats this for every sample.
# Run LDA again with CV=TRUE
brine.cv <- lda(GROUP ~ HCO3 + SO4 + Cl + Ca + Mg + Na, data=brine.log, CV=TRUE)
# Create the new confusion matrix
tab_cv <- table(brine.log$GROUP, brine.cv$class)
# Calculate strict accuracy
sum(tab_cv[row(tab_cv) == col(tab_cv)]) / sum(tab_cv)
Result: ~79%.
This drop in accuracy (from 95% to 79%) is normal for small datasets. It gives us a more realistic expectation of how the model would perform if we dug up a brand new artifact today.
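CV = TRUE runs the leave-one-out loop for you in a single call, but it can be demystified by writing the loop by hand. A sketch on the built-in iris data (used as a stand-in because it requires no download):

```r
library(MASS)

n <- nrow(iris)
pred <- character(n)
for (i in seq_len(n)) {
  fit <- lda(Species ~ ., data = iris[-i, ])              # train on n - 1 samples
  pred[i] <- as.character(predict(fit, iris[i, ])$class)  # classify the held-out one
}

mean(pred == as.character(iris$Species))  # leave-one-out accuracy
```

Every sample is classified by a model that never saw it, which is exactly why this estimate is stricter than resubstitution.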
Conclusion
You have successfully trained a machine learning model to perform geochemical fingerprinting! While the model is nearly perfect on known data, the cross-validation shows us that with small sample sizes, we must be careful not to be overconfident.