This document provides an overview of two R packages, sperrorest and blockCV, that can be used for spatial cross validation, but are outside of standard machine learning frameworks like caret, tidymodels, or mlr3.
All of the examples below use the same dataset, which includes the temperature measurements in Spain, a set of covariates, and the spatial coordinates of the temperature measurements.
The sperrorest (https://doi.org/10.32614/CRAN.package.sperrorest) package is designed for spatial error estimation and variable importance assessment for predictive models. The package itself does not fit the models but provides a set of functions for spatial cross-validation, including data partitioning and model cross-validation.
While the sperrorest package has many functions (including a set of functions for data partitioning), its main function is sperrorest(). It performs spatial cross-validation for spatial prediction models, including variable importance assessment and prediction error estimation. To use this function, we need to provide the formula, the data, the coordinates, the model function, the model arguments, the prediction function, the sampling function, and the sampling arguments.
Let’s do it step by step. First, we need to prepare the data by extracting the coordinates and creating a data frame with the dependent variable, covariates, and coordinates.
Third, we need to define the custom prediction function. The sperrorest package works with many model functions, but it requires a custom prediction function to extract the predictions from the model object. In this example, we use the ranger model, so we need to define a custom prediction function that extracts the predictions from the ranger model object. The predict() function from the ranger package returns a list with several elements, so we need to extract the predictions from this list.1
Fourth, we can perform the spatial cross-validation using the sperrorest() function. We just need to provide previously prepared data, the formula, the model function, and the prediction function. Moreover, we can also define some additional parameters of the model, such as the number of trees in the ranger model. Finally, the important part is to define the sampling function (smp_fun) and its arguments (smp_args). The sampling function is used to partition the data into training and testing sets: here, we use the partition_kmeans() function to partition the data spatially into folds using k-means clustering of the coordinates.2
The result is a list with several components, including the error at the repetition and fold levels, the resampling object, the variable importance (only when importance = TRUE), the benchmark, and the package version.
The blockCV (https://doi.org/10.1111/2041-210X.13107) package provides a set of functions for block cross-validation, spatial and environmental clustering, and spatial autocorrelation estimation. The package itself does not fit the models.
Hide the code
library(blockCV)
Cross-validation strategies separate the data into training and testing sets to evaluate the model’s performance. The blockCV package provides several cross-validation strategies, including block cross-validation, spatial clustering, environmental clustering, buffering LOO, and Nearest Neighbour Distance Matching (NNDM) LOO.
The block cross-validation is performed using the cv_spatial() function. It assigns blocks to the training and testing folds randomly, systematically or in a checkerboard pattern (the selection argument).
Hide the code
sb1 <-cv_spatial(x = temperature,k =10, # number of foldssize =300000, # size of the blocks in metersselection ="random", # random blocks-to-folditeration =50, # find evenly dispersed foldsprogress =FALSE,biomod2 =TRUE)
The result is a list with several components, including the folds list, the folds IDs, the biomod table, the number of folds, the input size, the column name, the blocks, and the records. For example, we can check the structure of the folds list with the str() function.
The clustering strategies (cv_cluster()) are used to group the data into clusters based on spatial or environmental similarity. The spatial similarity is based only on the clustering of the spatial coordinates.
Hide the code
set.seed(6)scv <-cv_cluster(x = temperature, k =10)
The next cross-validation strategy is buffering LOO (also known as Spatial LOO). It is performed using the cv_buffer() function, which selects a buffer around each point (test point) and uses the points outside the buffer as the testing set.3
Note that above, we plot only the first, 50th, and 100th points to avoid overplotting.
The last cross-validation strategy implemented in the blockCV package is the Nearest Neighbour Distance Matching (NNDM) LOO. It is performed using the cv_nndm() function, which tries to match the nearest neighbor distance distribution function between the test and training data to the nearest neighbor distance distribution function between the target prediction and training points. Thus, in this base, we need to provide more arguments, including a raster with the covariates, the number of samples, the sampling strategy, and the minimum training size.
Let’s now use the block cross-validation to fit and evaluate a model.
Hide the code
# define formularesponse_name <-"temp"covariate_names <-colnames(temperature_df)[2:(ncol(temperature_df) -7)]fo <-as.formula(paste( response_name,"~",paste(covariate_names, collapse =" + ")))# extract the foldsfolds <- sb1$folds_listmodel_rmse <-data.frame(fold =seq_along(folds), rmse =rep(NA, length(folds)))for (k inseq_along(folds)) { trainSet <-unlist(folds[[k]][1]) # training set indices; first element testSet <-unlist(folds[[k]][2]) # testing set indices; second element rf <-ranger(fo, temperature_df[trainSet, ], num.trees =100) # model fitting on training set pred <-predict(rf, temperature_df[testSet, ])$predictions # predict the test set model_rmse[k, "rmse"] <-sqrt(mean( (temperature_df[testSet, response_name] - pred)^2 )) # calculate RMSE}model_rmse
The blockCV package also provides functions for checking the similarity between the folds (cv_similarity()) and estimating the effective range of spatial autocorrelation (cv_spatial_autocor()). The first function is used to check the similarity between the folds in the cross-validation.
Hide the code
cv_similarity(cv = sb1, x = temperature, r = covariates, progress =FALSE)
The second function is used to estimate the effective range of spatial autocorrelation of all input raster layers or the response data – its role is to help to determine the size of the blocks in the block cross-validation.
There are several other partition functions available in the package, including partition_disc(), partition_tiles(), and partition_cv().↩︎
This approach is a form of leave-one-out cross-validation.↩︎
Source Code
---title: "Miscellaneous spatial cross-validation packages in R"author: "Jakub Nowosad"---This document provides an overview of two R packages, **sperrorest** and **blockCV**, that can be used for spatial cross validation, but are outside of standard machine learning frameworks like **caret**, **tidymodels**, or **mlr3**.All of the examples below use the same dataset, which includes the temperature measurements in Spain, a set of covariates, and the spatial coordinates of the temperature measurements.```{r misc-scv-1 }#| message: false#| warning: falsespain <- sf::read_sf("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/spain.gpkg")covariates <- terra::rast("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/predictors.tif")temperature <- sf::read_sf("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/temp_train.gpkg")temperature <- terra::extract(covariates, temperature, bind = TRUE) |> sf::st_as_sf()```# sperrorestThe **sperrorest** (<https://doi.org/10.32614/CRAN.package.sperrorest>) package is designed for spatial error estimation and variable importance assessment for predictive models.The package itself does not fit the models but provides a set of functions for spatial cross-validation, including data partitioning and model cross-validation.While the **sperrorest** package has many functions (including a set of functions for data partitioning), its main function is `sperrorest()`.It performs spatial cross-validation for spatial prediction models, including variable importance assessment and prediction error estimation.To use this function, we need to provide the formula, the data, the coordinates, the model function, the model arguments, the prediction function, the sampling function, and the sampling arguments.Let's do it step by step.First, we need to prepare the data by extracting the coordinates and creating a data frame with the dependent variable, covariates, and coordinates.```{r misc-scv-2 }library(sperrorest)library(ranger)coordinates <- sf::st_coordinates(temperature)temperature_df <- sf::st_drop_geometry(temperature)temperature_df$x <- coordinates[, 1]temperature_df$y <- coordinates[, 2]```Second, we need to define the formula for the model and the prediction function.```{r misc-scv-3 }response_name <- "temp"covariate_names <- colnames(temperature_df)[2:(ncol(temperature_df) - 7)]fo <- as.formula(paste( response_name, "~", paste(covariate_names, collapse = " + ")))```Third, we need to define the custom prediction function.The **sperrorest** package works with many model functions, but it requires a custom prediction function to extract the predictions from the model object.In this example, we use the `ranger` model, so we need to define a custom prediction function that extracts the predictions from the `ranger` model object.The `predict()` function from the `ranger` package returns a list with several elements, so we need to extract the predictions from this list.^[More information on the custom prediction functions is at <https://cran.r-project.org/web/packages/sperrorest/vignettes/custom-pred-and-model-functions.html>.]```{r misc-scv-4 }mypred <- function(object, newdata) { predict(object, newdata)$predictions}```Fourth, we can perform the spatial cross-validation using the `sperrorest()` function.We just need to provide previously prepared data, the formula, the model function, and the prediction function.Moreover, we can also define some additional parameters of the model, such as the number of trees in the `ranger` model.Finally, the important part is to define the sampling function (`smp_fun`) and its arguments (`smp_args`).The sampling function is used to partition the data into training and testing sets: here, we use the `partition_kmeans()` function to partition the data spatially into folds using k-means clustering of the coordinates.^[There are several other partition functions available in the package, including `partition_disc()`, `partition_tiles()`, and `partition_cv()`.]```{r misc-scv-5 }# Spatial cross-validationsp_res <- sperrorest( formula = fo, data = temperature_df, coords = c("x", "y"), model_fun = ranger, model_args = list(num.trees = 100), pred_fun = mypred, smp_fun = partition_kmeans, smp_args = list(repetition = 1:2, nfold = 3), progress = FALSE)```The result is a list with several components, including the error at the repetition and fold levels, the resampling object, the variable importance (only when `importance = TRUE`), the benchmark, and the package version.```{r misc-scv-6 }summary(sp_res$error_rep)```We can contrast the obtained results with the non-spatial cross-validation by changing the sampling function to `partition_cv()`.```{r misc-scv-7 }# Non-spatial cross-validationnsp_res <- sperrorest( formula = fo, data = temperature_df, coords = c("x", "y"), model_fun = ranger, model_args = list(num.trees = 100), pred_fun = mypred, smp_fun = partition_cv, smp_args = list(repetition = 1:2, nfold = 3), progress = FALSE)```To compare both results, we can plot the RMSE values for the training and testing sets of both spatial and non-spatial cross-validation.```{r misc-scv-8 }library(ggplot2)# Extract train/test RMSE from spatial CVsp_train_rmse <- sp_res$error_rep$train_rmsesp_test_rmse <- sp_res$error_rep$test_rmse# Extract train/test RMSE from non-spatial CVnsp_train_rmse <- nsp_res$error_rep$train_rmsensp_test_rmse <- nsp_res$error_rep$test_rmse# Build data framermse_df <- data.frame( CV_Type = rep(c("Spatial", "Non-Spatial"), each = 4), Set = rep(c("Train", "Test"), each = 2), RMSE = c(sp_train_rmse, sp_test_rmse, nsp_train_rmse, nsp_test_rmse))ggplot(rmse_df, aes(x = CV_Type, y = RMSE, fill = Set)) + geom_boxplot() + facet_wrap(~Set) + labs(title = "RMSE Comparison", x = "CV Method", y = "RMSE")```The results show that the estimation using the spatial-cross validation is less optimistic than the non-spatial cross-validation for the test set.More examples of the package use can be found at <https://giscience-fsu.github.io/sperrorest/articles/spatial-modeling-use-case.html>/# blockCVThe **blockCV** (<https://doi.org/10.1111/2041-210X.13107>) package provides a set of functions for block cross-validation, spatial and environmental clustering, and spatial autocorrelation estimation.The package itself does not fit the models.```{r misc-scv-9 }library(blockCV)```Cross-validation strategies separate the data into training and testing sets to evaluate the model's performance.The **blockCV** package provides several cross-validation strategies, including block cross-validation, spatial clustering, environmental clustering, buffering LOO, and Nearest Neighbour Distance Matching (NNDM) LOO.The block cross-validation is performed using the `cv_spatial()` function.It assigns blocks to the training and testing folds randomly, systematically or in a checkerboard pattern (the `selection` argument).```{r misc-scv-10 }sb1 <- cv_spatial( x = temperature, k = 10, # number of folds size = 300000, # size of the blocks in meters selection = "random", # random blocks-to-fold iteration = 50, # find evenly dispersed folds progress = FALSE, biomod2 = TRUE)```The result is a list with several components, including the folds list, the folds IDs, the biomod table, the number of folds, the input size, the column name, the blocks, and the records.For example, we can check the structure of the folds list with the `str()` function.```{r misc-scv-11 }str(sb1$folds_list)```The `cv_plot()` function additionally allows for the visualization of cross-validation results.```{r misc-scv-12 }cv_plot(sb1, temperature)```Let's compare the results of the block cross-validation with systematic and checkerboard patterns.```{r misc-scv-13 }sb2 <- cv_spatial( x = temperature, k = 10, rows_cols = c(4, 6), hexagon = FALSE, selection = "systematic")cv_plot(sb2, temperature)``````{r misc-scv-14 }sb3 <- cv_spatial( x = temperature, k = 10, size = 300000, hexagon = FALSE, selection = "checkerboard")cv_plot(sb3, temperature)```The clustering strategies (`cv_cluster()`) are used to group the data into clusters based on spatial or environmental similarity.The spatial similarity is based only on the clustering of the spatial coordinates. ```{r misc-scv-15 }set.seed(6)scv <- cv_cluster(x = temperature, k = 10)cv_plot(scv, temperature)```The environmental clustering, on the other hand, is based on the clustering of the values of the covariates extracted from the raster data.```{r misc-scv-16 }set.seed(6)ecv <- cv_cluster(x = temperature, r = covariates, k = 5, scale = TRUE)cv_plot(ecv, temperature)```The next cross-validation strategy is buffering LOO (also known as Spatial LOO).It is performed using the `cv_buffer()` function, which selects a buffer around each point (test point) and uses the points outside the buffer as the testing set.^[This approach is a form of leave-one-out cross-validation.]```{r misc-scv-17 }bloo <- cv_buffer(x = temperature, size = 300000, progress = FALSE)cv_plot(bloo, temperature, num_plots = c(1, 50, 100))```Note that above, we plot only the first, 50th, and 100th points to avoid overplotting.The last cross-validation strategy implemented in the **blockCV** package is the Nearest Neighbour Distance Matching (NNDM) LOO.It is performed using the `cv_nndm()` function, which tries to match the nearest neighbor distance distribution function between the test and training data to the nearest neighbor distance distribution function between the target prediction and training points.Thus, in this base, we need to provide more arguments, including a raster with the covariates, the number of samples, the sampling strategy, and the minimum training size.```{r misc-scv-18 }nncv <- cv_nndm( x = temperature, r = covariates, size = 300000, num_sample = 5000, sampling = "regular", min_train = 0.1, plot = TRUE)cv_plot(nncv, temperature, num_plots = c(1, 50, 100))```Let's now use the block cross-validation to fit and evaluate a model.```{r misc-scv-19 }# define formularesponse_name <- "temp"covariate_names <- colnames(temperature_df)[2:(ncol(temperature_df) - 7)]fo <- as.formula(paste( response_name, "~", paste(covariate_names, collapse = " + ")))# extract the foldsfolds <- sb1$folds_listmodel_rmse <- data.frame(fold = seq_along(folds), rmse = rep(NA, length(folds)))for (k in seq_along(folds)) { trainSet <- unlist(folds[[k]][1]) # training set indices; first element testSet <- unlist(folds[[k]][2]) # testing set indices; second element rf <- ranger(fo, temperature_df[trainSet, ], num.trees = 100) # model fitting on training set pred <- predict(rf, temperature_df[testSet, ])$predictions # predict the test set model_rmse[k, "rmse"] <- sqrt(mean( (temperature_df[testSet, response_name] - pred)^2 )) # calculate RMSE}model_rmse```The **blockCV** package also provides functions for checking the similarity between the folds (`cv_similarity()`) and estimating the effective range of spatial autocorrelation (`cv_spatial_autocor()`).The first function is used to check the similarity between the folds in the cross-validation.```{r misc-scv-20 }cv_similarity(cv = sb1, x = temperature, r = covariates, progress = FALSE)```The second function is used to estimate the effective range of spatial autocorrelation of all input raster layers or the response data -- its role is to help to determine the size of the blocks in the block cross-validation.```{r misc-scv-21 }cv_spatial_autocor(r = covariates, num_sample = 5000, progress = FALSE)```More examples of the package's use can be found at <https://cran.r-project.org/web/packages/blockCV/vignettes/tutorial_2.html>.