Miscellaneous spatial cross-validation packages in R

Author

Jakub Nowosad

Published

Last Updated: March 2025

This document provides an overview of two R packages, sperrorest and blockCV, that can be used for spatial cross-validation but sit outside of standard machine learning frameworks such as caret, tidymodels, or mlr3.

All of the examples below use the same dataset, which includes temperature measurements in Spain, a set of raster covariates, and the spatial coordinates of the measurements.

spain <- sf::read_sf("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/spain.gpkg")
covariates <- terra::rast("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/predictors.tif") # raster covariates
temperature <- sf::read_sf("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/temp_train.gpkg") # temperature measurement points

# attach the covariate values to the measurement points
temperature <- terra::extract(covariates, temperature, bind = TRUE) |>
    sf::st_as_sf()

1 sperrorest

The sperrorest (https://doi.org/10.32614/CRAN.package.sperrorest) package is designed for spatial error estimation and variable importance assessment for predictive models. The package itself does not fit the models but provides a set of functions for spatial cross-validation, including data partitioning and model cross-validation.

While the sperrorest package has many functions (including a set of functions for data partitioning), its main function is sperrorest(). It performs spatial cross-validation for spatial prediction models, including variable importance assessment and prediction error estimation. To use this function, we need to provide the formula, the data, the coordinates, the model function, the model arguments, the prediction function, the sampling function, and the sampling arguments.

Let’s do it step by step. First, we need to prepare the data by extracting the coordinates and creating a data frame with the dependent variable, covariates, and coordinates.

library(sperrorest)
library(ranger)

coordinates <- sf::st_coordinates(temperature) # extract the point coordinates
temperature_df <- sf::st_drop_geometry(temperature) # data frame without the geometry column
temperature_df$x <- coordinates[, 1]
temperature_df$y <- coordinates[, 2]

Second, we need to define the model formula, combining the name of the response variable with the names of the covariates.

response_name <- "temp"
# covariate names: columns between the response (first) and the last seven columns
covariate_names <- colnames(temperature_df)[2:(ncol(temperature_df) - 7)]
fo <- as.formula(paste(
    response_name,
    "~",
    paste(covariate_names, collapse = " + ")
))

Third, we need to define a custom prediction function. The sperrorest package works with many model functions, but it requires a custom prediction function to extract the predictions from the fitted model object. In this example, we use the ranger model, whose predict() function returns a list with several elements; our function therefore extracts only the predictions element [1].

mypred <- function(object, newdata) {
    # return only the vector of predicted values from the ranger prediction object
    predict(object, newdata)$predictions
}

Fourth, we can perform the spatial cross-validation using the sperrorest() function. We need to provide the previously prepared data, the formula, the model function, and the prediction function. We can also set additional model parameters, such as the number of trees in the ranger model. Finally, the important part is to define the sampling function (smp_fun) and its arguments (smp_args). The sampling function partitions the data into training and testing sets: here, we use the partition_kmeans() function, which partitions the data spatially into folds using k-means clustering of the coordinates [2].

# Spatial cross-validation
sp_res <- sperrorest(
    formula = fo,
    data = temperature_df,
    coords = c("x", "y"),
    model_fun = ranger,
    model_args = list(num.trees = 100),
    pred_fun = mypred,
    smp_fun = partition_kmeans,
    smp_args = list(repetition = 1:2, nfold = 3),
    progress = FALSE
)

The result is a list with several components, including the error at the repetition and fold levels, the resampling object, the variable importance (only when importance = TRUE), the benchmark, and the package version.
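For a quick overview, we can list the names of these components (the exact set may vary slightly between package versions):

names(sp_res)

The repetition-level error estimates, for example, can be summarized as follows: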

summary(sp_res$error_rep)
                      mean          sd        median         IQR
train_bias    -0.003355906 0.004272852  -0.003355906 0.003021362
train_stddev   0.401721377 0.013651958   0.401721377 0.009653392
train_rmse     0.401230837 0.013669339   0.401230837 0.009665682
train_mad      0.316990283 0.010997654   0.316990283 0.007776516
train_median   0.032100513 0.002621221   0.032100513 0.001853483
train_iqr      0.430515600 0.015958048   0.430515600 0.011284044
train_count  390.000000000 0.000000000 390.000000000 0.000000000
test_bias      0.028511170 0.050086462   0.028511170 0.035416477
test_stddev    1.471919330 0.002386663   1.471919330 0.001687626
test_rmse      1.468843235 0.003351606   1.468843235 0.002369944
test_mad       1.406426772 0.103342947   1.406426772 0.073074499
test_median    0.137444954 0.038050318   0.137444954 0.026905638
test_iqr       1.892317500 0.001700626   1.892317500 0.001202524
test_count   195.000000000 0.000000000 195.000000000 0.000000000
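Analogously, the fold-level errors are stored in the error_fold component and, in recent package versions, can be summarized the same way:

summary(sp_res$error_fold)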

We can contrast the obtained results with a non-spatial cross-validation by changing the sampling function to partition_cv(), which partitions the data randomly without regard to spatial location.

# Non-spatial cross-validation
nsp_res <- sperrorest(
    formula = fo,
    data = temperature_df,
    coords = c("x", "y"),
    model_fun = ranger,
    model_args = list(num.trees = 100),
    pred_fun = mypred,
    smp_fun = partition_cv,
    smp_args = list(repetition = 1:2, nfold = 3),
    progress = FALSE
)

To compare both results, we can plot the RMSE values for the training and testing sets of both spatial and non-spatial cross-validation.

library(ggplot2)
# Extract train/test RMSE from spatial CV
sp_train_rmse <- sp_res$error_rep$train_rmse
sp_test_rmse <- sp_res$error_rep$test_rmse
# Extract train/test RMSE from non-spatial CV
nsp_train_rmse <- nsp_res$error_rep$train_rmse
nsp_test_rmse <- nsp_res$error_rep$test_rmse
# Build data frame (two repetitions per combination of CV type and set)
rmse_df <- data.frame(
    CV_Type = rep(c("Spatial", "Non-Spatial"), each = 4),
    Set = rep(rep(c("Train", "Test"), each = 2), times = 2),
    RMSE = c(sp_train_rmse, sp_test_rmse, nsp_train_rmse, nsp_test_rmse)
)
ggplot(rmse_df, aes(x = CV_Type, y = RMSE, fill = Set)) +
    geom_boxplot() +
    facet_wrap(~Set) +
    labs(title = "RMSE Comparison", x = "CV Method", y = "RMSE")

The results show that the error estimate from the spatial cross-validation is less optimistic than that from the non-spatial cross-validation for the test set.
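We can also compare the mean test RMSE values of the two approaches directly:

mean(sp_res$error_rep$test_rmse)
mean(nsp_res$error_rep$test_rmse)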

More examples of the package's use can be found at https://giscience-fsu.github.io/sperrorest/articles/spatial-modeling-use-case.html.

2 blockCV

The blockCV (https://doi.org/10.1111/2041-210X.13107) package provides a set of functions for block cross-validation, spatial and environmental clustering, and spatial autocorrelation estimation. The package itself does not fit the models.

library(blockCV)

Cross-validation strategies separate the data into training and testing sets to evaluate the model’s performance. The blockCV package provides several cross-validation strategies, including block cross-validation, spatial clustering, environmental clustering, buffering LOO, and Nearest Neighbour Distance Matching (NNDM) LOO.

Block cross-validation is performed using the cv_spatial() function. It assigns blocks to the training and testing folds randomly, systematically, or in a checkerboard pattern (the selection argument).

sb1 <- cv_spatial(
    x = temperature,
    k = 10, # number of folds
    size = 300000, # size of the blocks in meters
    selection = "random", # random assignment of blocks to folds
    iteration = 50, # find evenly dispersed folds
    progress = FALSE,
    biomod2 = TRUE # also return a biomod2-compatible table
)

   train test
1    169   26
2    170   25
3    181   14
4    182   13
5    179   16
6    175   20
7    171   24
8    180   15
9    183   12
10   165   30

The result is a list with several components, including the folds list, the fold IDs, the biomod2 table, the number of folds, the input size, the column name, the blocks, and the records. For example, we can check the structure of the folds list with the str() function.

str(sb1$folds_list)
List of 10
 $ :List of 2
  ..$ : int [1:169] 84 98 105 104 121 117 87 119 96 97 ...
  ..$ : int [1:26] 91 110 108 106 88 153 90 144 155 94 ...
 $ :List of 2
  ..$ : int [1:170] 84 98 105 104 121 117 87 119 96 97 ...
  ..$ : int [1:25] 34 58 59 24 32 36 33 28 35 30 ...
 $ :List of 2
  ..$ : int [1:181] 84 98 105 104 121 117 87 119 96 97 ...
  ..$ : int [1:14] 158 156 160 149 161 147 150 146 142 148 ...
 $ :List of 2
  ..$ : int [1:182] 84 98 105 104 121 117 87 119 96 97 ...
  ..$ : int [1:13] 124 126 131 125 132 122 120 116 123 129 ...
 $ :List of 2
  ..$ : int [1:179] 34 58 59 24 32 36 33 28 35 30 ...
  ..$ : int [1:16] 84 98 105 104 121 117 87 119 96 97 ...
 $ :List of 2
  ..$ : int [1:175] 84 98 105 104 121 117 87 119 96 97 ...
  ..$ : int [1:20] 152 162 157 163 71 186 67 41 180 178 ...
 $ :List of 2
  ..$ : int [1:171] 84 98 105 104 121 117 87 119 96 97 ...
  ..$ : int [1:24] 46 168 17 20 47 11 45 169 171 177 ...
 $ :List of 2
  ..$ : int [1:180] 84 98 105 104 121 117 87 119 96 97 ...
  ..$ : int [1:15] 138 133 134 136 188 173 190 192 187 195 ...
 $ :List of 2
  ..$ : int [1:183] 84 98 105 104 121 117 87 119 96 97 ...
  ..$ : int [1:12] 10 5 6 2 8 193 1 164 4 194 ...
 $ :List of 2
  ..$ : int [1:165] 84 98 105 104 121 117 87 119 96 97 ...
  ..$ : int [1:30] 55 77 68 74 76 70 42 64 83 75 ...
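Similarly, the folds_ids component stores the fold assignment of each point as an integer vector:

head(sb1$folds_ids)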

The cv_plot() function additionally allows us to visualize the cross-validation folds.

cv_plot(sb1, temperature)

Let’s compare the results of the block cross-validation with systematic and checkerboard patterns. Note that the checkerboard pattern alternates blocks between two folds, so it always yields exactly two folds regardless of the k argument.

sb2 <- cv_spatial(
    x = temperature,
    k = 10,
    rows_cols = c(4, 6),
    hexagon = FALSE,
    selection = "systematic"
)

   train test
1    172   23
2    180   15
3    177   18
4    169   26
5    178   17
6    182   13
7    180   15
8    162   33
9    178   17
10   177   18

cv_plot(sb2, temperature)

sb3 <- cv_spatial(
    x = temperature,
    k = 10,
    size = 300000,
    hexagon = FALSE,
    selection = "checkerboard"
)

  train test
1    98   97
2    97   98

cv_plot(sb3, temperature)

The clustering strategies (cv_cluster()) group the data into folds based on spatial or environmental similarity. Spatial clustering is based solely on the clustering of the spatial coordinates.

set.seed(6)
scv <- cv_cluster(x = temperature, k = 10)
   train test
1    178   17
2    173   22
3    182   13
4    169   26
5    179   16
6    176   19
7    178   17
8    180   15
9    171   24
10   169   26
cv_plot(scv, temperature)

Environmental clustering, on the other hand, is based on clustering the values of the covariates extracted from the raster data.

Hide the code
set.seed(6)
ecv <- cv_cluster(x = temperature, r = covariates, k = 5, scale = TRUE)
  train test
1    90  105
2   154   41
3   182   13
4   164   31
5   190    5
cv_plot(ecv, temperature)

The next cross-validation strategy is buffering LOO (also known as spatial LOO). It is performed using the cv_buffer() function, which creates a buffer around each point: the point itself serves as the test set, while the points outside the buffer form the training set [3].

bloo <- cv_buffer(x = temperature, size = 300000, progress = FALSE)
     train            test  
 Min.   : 97.0   Min.   :1  
 Mean   :132.7   Mean   :1  
 Max.   :170.0   Max.   :1  
cv_plot(bloo, temperature, num_plots = c(1, 50, 100))

Note that above, we plot only the folds for the first, 50th, and 100th points (each fold has a single test point) to avoid overplotting.

The last cross-validation strategy implemented in the blockCV package is Nearest Neighbour Distance Matching (NNDM) LOO. It is performed using the cv_nndm() function, which tries to match the nearest-neighbor distance distribution function between the test and training data to the one between the target prediction points and the training data. Thus, in this case, we need to provide more arguments, including a raster with the covariates, the number of samples, the sampling strategy, and the minimum training size.

nncv <- cv_nndm(
    x = temperature,
    r = covariates,
    size = 300000,
    num_sample = 5000,
    sampling = "regular",
    min_train = 0.1,
    plot = TRUE
)
     train            test  
 Min.   :192.0   Min.   :1  
 Mean   :192.9   Mean   :1  
 Max.   :193.0   Max.   :1  

cv_plot(nncv, temperature, num_plots = c(1, 50, 100))

Let’s now use the block cross-validation to fit and evaluate a model.

# define formula
response_name <- "temp"
covariate_names <- colnames(temperature_df)[2:(ncol(temperature_df) - 7)]
fo <- as.formula(paste(
    response_name,
    "~",
    paste(covariate_names, collapse = " + ")
))

# extract the folds
folds <- sb1$folds_list

model_rmse <- data.frame(fold = seq_along(folds), rmse = rep(NA, length(folds)))

for (k in seq_along(folds)) {
    trainSet <- folds[[k]][[1]] # training set indices (first element)
    testSet <- folds[[k]][[2]] # testing set indices (second element)
    rf <- ranger(fo, temperature_df[trainSet, ], num.trees = 100) # fit the model on the training set
    pred <- predict(rf, temperature_df[testSet, ])$predictions # predict the testing set
    model_rmse[k, "rmse"] <- sqrt(mean(
        (temperature_df[testSet, response_name] - pred)^2
    )) # calculate the RMSE
}
model_rmse
   fold      rmse
1     1 0.7560125
2     2 1.1496561
3     3 0.6749471
4     4 0.7774700
5     5 1.0578904
6     6 0.8293503
7     7 0.9245386
8     8 0.8030397
9     9 1.3123451
10   10 1.0863700
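A single overall error estimate can then be obtained by averaging the fold-level values:

mean(model_rmse$rmse)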

The blockCV package also provides functions for checking the similarity between the folds (cv_similarity()) and for estimating the effective range of spatial autocorrelation (cv_spatial_autocor()). The first function can be used to check the similarity between the folds of our block cross-validation:

cv_similarity(cv = sb1, x = temperature, r = covariates, progress = FALSE)

The second function estimates the effective range of spatial autocorrelation of all input raster layers or of the response data; its role is to help determine a suitable block size for the block cross-validation.

cv_spatial_autocor(r = covariates, num_sample = 5000, progress = FALSE)

[1] "cv_spatial_autocor"

More examples of the package’s use can be found at https://cran.r-project.org/web/packages/blockCV/vignettes/tutorial_2.html.

Footnotes

  [1] More information on custom prediction functions is available at https://cran.r-project.org/web/packages/sperrorest/vignettes/custom-pred-and-model-functions.html.

  [2] There are several other partition functions available in the package, including partition_disc(), partition_tiles(), and partition_cv().

  [3] This approach is a form of leave-one-out cross-validation.