Hide the code
# install.packages("caret")
# install.packages("CAST")
# install.packages("blockCV")
# install.packages("sf")
# install.packages("terra")
This tutorial shows the application of caret for spatial modelling at the example of predicting air temperature in Spain. Hereby, we use measurements of air temperature available only at specific locations in Spain to create a spatially continuous map of air temperature. Therefore, machine-learning models are trained to learn the relationship between spatially continuous predictors and air temperature.
When using machine-learning methods with spatial data, we need to take care of, e.g., spatial autocorrelation, as well as extrapolation when predicting to regions that are far away from the training data. To deal with these issues, several methods have been developed. In this tutorial, we will show how to combine the machine-learning workflow of caret with packages designed to deal with machine-learning with spatial data. Hereby, we use blockCV::cv_spatial()
and CAST::knndm()
for spatial cross-validation, and CAST::aoa()
to mask areas of extrapolation. We use sf and terra for processing vector and raster data, respectively.
The caret package contains functions to train machine-learning models, as well as for, e.g., model selection. Its main function is the caret::train()
, which provides a uniform interface to over 200 machine-learning algorithms. (User-specified-) cross-Validation methods can be defined via caret::trainControl()
. An extensive online tutorial is available at https://topepo.github.io/caret/. Furthermore, a paper (https://doi.org/10.18637/jss.v028.i05), as well as a book (http://appliedpredictivemodeling.com/), describing the use of caret are available.
# install.packages("caret")
# install.packages("CAST")
# install.packages("blockCV")
# install.packages("sf")
# install.packages("terra")
library(caret)
library(CAST)
library(blockCV)
library(sf)
library(terra)
Load data needed in this modelling example:
<- terra::rast("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/predictors.tif")
predictor_stack <- names(predictor_stack)
predictor_names <- sf::st_read("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/spain.gpkg") spain
Reading layer `spain' from data source
`https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/spain.gpkg'
using driver `GPKG'
Simple feature collection with 1 feature and 0 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -13454.15 ymin: 3988025 xmax: 1020771 ymax: 4859816
Projected CRS: ED50 / UTM zone 30N
<- sf::st_read("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/temp_train.gpkg") train_points
Reading layer `temp_train' from data source
`https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/temp_train.gpkg'
using driver `GPKG'
Simple feature collection with 195 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 36026.79 ymin: 3988818 xmax: 978160.6 ymax: 4858999
Projected CRS: ED50 / UTM zone 30N
<- terra::extract(
train_data
predictor_stack,
train_points,bind = TRUE,
ID = FALSE
|>
) ::st_as_sf()
sf
plot(sf::st_geometry(spain))
plot(sf::st_geometry(train_points), col = "blue4", add = TRUE)
Firstly, a simple modelling workflow without feature selection and hyperparameter tuning is shown:
Geometry column needs to be dropped before using caret::train()
.
# 1. train-test split
<- caret::createDataPartition(
trainIndex $temp,
train_datap = .8,
list = FALSE,
times = 1
)<- train_data[trainIndex, ]
temperature_train <- train_data[-trainIndex, ]
temperature_test
# 2. model training
<- caret::train(
model ~ .,
temp data = st_drop_geometry(temperature_train),
method = "ranger",
tuneGrid = expand.grid(
"mtry" = 4,
"splitrule" = "variance",
"min.node.size" = 5
),num.trees = 100
)
# 3. predict on test data
<- temperature_test[, "temp", drop = FALSE]
test_df $prediction <- predict(model, temperature_test)
test_df
<- caret::postResample(
test_metrics pred = test_df$prediction,
obs = test_df$temp
|>
) round(3) |>
t() |>
as.data.frame()
print(test_metrics)
RMSE Rsquared MAE
1 0.931 0.892 0.727
# 4. predict to raster stack
<- terra::predict(predictor_stack, model, na.rm = TRUE) prediction_spatial
Cross-validation (CV) methods are often employed to obtain optimal hyperparameter values. Therefore, the training data are split into \(k\) folds, and a model is trained on \(k\)-1 folds. The fold not used for model training is then used to obtain the test statistic. This is repeated over all folds, and the test metric is averaged over the \(k\) folds.
In spatial machine-learning, a spatial CV is often needed to prevent very similar data to be in the training and testing fold at the same time, which is often the case if training data are clustered and leads to overly optimistic CV error estimates. R packages that implement spatial CV include, e.g., blockCV and CAST. Here, we will explore the integration of those two with caret.
The blockCV package implements different blocking methods for spatial CV. The resulting object of the main function blockCV::cv_spatial()
contains a nested list of the \(k\) folds and the training data rows that belong to each fold, as well as a list of the test data left out in each of the \(k\) iteration. These lists can be obtained using lapply()
and then be used as an input to the caret::trainControl()
function of caret that defines the CV strategy used in caret::train()
. The grid of hyperparameter values tested during CV is defined using the tune_grid
argument in caret::train()
. Here, we test mtry
values from 2 to 12 and min.node.size
values between 5 and 15. The combination of mtry
and min.node.size
that minimizes the RMSE is then automatically used to re-train a final model with the complete training data set.
<- blockCV::cv_spatial(
spatial_blocks
temperature_train,k = 5,
progress = FALSE
)
train test
1 127 32
2 127 32
3 128 31
4 126 33
5 128 31
<- lapply(spatial_blocks$folds_list, function(x) x[[1]])
train_ids <- lapply(spatial_blocks$folds_list, function(x) x[[2]])
test_ids
<- caret::trainControl(
tr_control_block method = "cv",
index = train_ids,
indexOut = test_ids,
savePredictions = TRUE
)
<- expand.grid(
hyperparameter_grid "mtry" = c(2, 4, 6, 10, 12),
"min.node.size" = c(5, 10, 15),
"splitrule" = "variance"
)
<- caret::train(
model_spatial ~ .,
temp data = st_drop_geometry(temperature_train),
method = "ranger",
trControl = tr_control_block,
tuneGrid = hyperparameter_grid,
num.trees = 100
)
$finalModel model_spatial
Ranger result
Call:
ranger::ranger(dependent.variable.name = ".outcome", data = x, mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size, splitrule = as.character(param$splitrule), write.forest = TRUE, probability = classProbs, ...)
Type: Regression
Number of trees: 100
Sample size: 159
Number of independent variables: 22
Mtry: 12
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.7392349
R squared (OOB): 0.909403
Another spatial CV method is kNNDM, which is implemented in the CAST package and aims at emulating the prediction situation encountered by the model during CV. In this case, the prediction situation is to predict from the temperature measurement stations to the whole area of Spain. Since the temperature measurement stations are rather randomly distributed over the area of Spain, no spatial blocking is needed and kNNDM randomly assigns training points to CV folds. The output of kNNDM contains a list of row indices of training data points that are used in each CV-iteration (indx_train
), as well as of indices that are left out in each iteration (indx_test
). These lists can easily be used as input to the caret::trainControl()
function of caret that defines the CV used in caret::train()
.
<- CAST::knndm(
knndm_folds tpoints = temperature_train,
modeldomain = spain,
space = "geographical",
clustering = "kmeans",
k = 5
)
<- caret::trainControl(
tr_control_knndm method = "cv",
index = knndm_folds$indx_train,
indexOut = knndm_folds$indx_test,
savePredictions = TRUE
)
<- expand.grid(
hyperparameter_grid "mtry" = c(2, 4, 6, 10, 12),
"min.node.size" = c(5, 10, 15),
"splitrule" = "variance"
)
<- caret::train(
model_knndm ~ .,
temp data = st_drop_geometry(temperature_train),
method = "ranger",
trControl = tr_control_knndm,
tuneGrid = hyperparameter_grid,
num.trees = 100
)
$finalModel model_knndm
Ranger result
Call:
ranger::ranger(dependent.variable.name = ".outcome", data = x, mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size, splitrule = as.character(param$splitrule), write.forest = TRUE, probability = classProbs, ...)
Type: Regression
Number of trees: 100
Sample size: 159
Number of independent variables: 22
Mtry: 12
Target node size: 10
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.7454315
R squared (OOB): 0.9086436
To reduce the number of environmental predictors, and thus enhance the generalizability of the model, feature selection is commonly applied in machine-learning workflows. CAST implements Forward-Feature-Selection, that can be used with spatial CV. Here, we use the results of the hyperparameter tuning above and kNNDM CV to select the most relevant features. Plotting the results of FFS()
shows that the variables DEM
, Y
, EDF5
and primaryroads
were selected.
<- model_knndm$bestTune
selected_hyperparams
<- CAST::ffs(
model_ffs predictors = st_drop_geometry(temperature_train)[, predictor_names],
response = st_drop_geometry(temperature_train)$temp,
method = "ranger",
num.trees = 100,
trControl = tr_control_knndm,
tuneGrid = selected_hyperparams,
verbose = FALSE
)
plot(model_ffs, plotType = "selected")
# obtain prediction
<- terra::predict(predictor_stack, model_ffs, na.rm = TRUE) prediction_ffs
Lastly, the area which is too dissimilar from the training data for the models to make reliable predictions (area of applicability, AOA) is delineated using the function CAST::aoa()
. The function CAST::aoa()
takes as inputs the predictor stack, as well as the trained caret model. The resulting object contains the Dissimilarity values, the threshold used to delineate the AOA (every DI above this threshold is considered outside the AOA), as well as the final AOA raster. Since our training data are randomly distributed in the study area, most of the area falls within the AOA.
<- CAST::aoa(
AOA_without_tuning newdata = predictor_stack,
model = model,
verbose = FALSE
)<- CAST::aoa(
AOA_with_tuning newdata = predictor_stack,
model = model_ffs,
verbose = FALSE
)
par(mfrow = c(2, 2))
plot(prediction_spatial, main = "prediction without tuning")
plot(AOA_without_tuning$AOA, main = "AOA without tuning")
plot(prediction_ffs, main = "prediction with model selection")
plot(AOA_with_tuning$AOA, main = "AOA with model selection")
caret has no functions that explicitly deal with spatial data. However, due to its rather flexible design, caret is compatible with several packages designed for spatial machine-learning. The caret::trainControl()
takes a list of CV indices as input, which makes it quite flexible to work with the output of e.g. CAST::knndm()
and blockCV::cv_spatial()
. Furthermore, caret is easy to use due to its functional programming paradigm. The documentation is extensive, and it’s rather easy to find modelling algorithms and their hyperparameters. Lastly, it should be noted that caret is not actively developed, since its main developer moved to tidymodels.