This post aims to give a minimal example of how to use mlr3 for a spatial prediction task. We want to get from measurements of temperature at specific locations in Spain to a spatially continuous map of temperature for all of Spain.
Such a spatial prediction task is often done by applying machine learning algorithms that are not necessarily developed for spatial tasks specifically and hence do not consider problems we might encounter in the spatial world, e.g., spatial autocorrelation or map extrapolation. In the last decade, a lot of methodological developments were made by various research groups to consider and deal with such particularities of spatial mapping. Many of these found their way into software packages such as mlr3.
install.packages("mlr3verse")
install.packages("mlr3spatiotempcv")
install.packages("sf")
install.packages("terra")
library(sf)
library(terra)
covariates <- terra::rast("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/predictors.tif")
temperature <- sf::read_sf("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/temp_train.gpkg")

spain <- sf::read_sf("https://github.com/LOEK-RS/FOSSGIS2025-examples/raw/refs/heads/main/data/spain.gpkg") |>
    st_cast("POLYGON") |>
    st_transform(st_crs(temperature))

temperature <- terra::extract(covariates, temperature, bind = TRUE) |>
    sf::st_as_sf()
# the sf object cannot contain a column named "X" or "Y". Otherwise the task creation will fail because "Assertion on 'data' failed: Must have unique colnames"
temperature$X <- NULL
temperature$Y <- NULL
Some words about the terminology that is specific to the example data:
- spain is the region outline, used for visualization purposes and the kNNDM setup
- covariates are the spatially explicit data to predict on (this wording is chosen to prevent confusion with the predict function or the prediction)
- temperature are the measured temperature data (i.e., the response variable, i.e., the ground truth) along with the covariates at the measurement locations

The best general introduction to mlr3 is probably the official mlr3 book, which you can access for free here: https://mlr3book.mlr-org.com/. The homepage https://mlr-org.com/ also gives a very detailed overview of all the associated packages and current developments.
The individual packages also have their own documentation pages, where you will find more specific topics. The “spatial” packages are the following:
- mlr3spatial implements support for spatial data types in R – https://mlr3spatial.mlr-org.com/
- mlr3spatiotempcv implements spatial cross-validation methods in the workflow – https://mlr3spatiotempcv.mlr-org.com/articles/mlr3spatiotempcv.html – and also uses the mlr3spatial package for the spatial data type handling

library(mlr3verse)
library(mlr3spatiotempcv)
library(mlr3viz)
mlr3 functions can be very verbose. For this blog post, I turned off messages for a less overwhelming tutorial.
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
First of all, mlr3 uses R6 classes and therefore an object-oriented design paradigm. This might not be intuitive for a lot of R users and is more in line with something like scikit-learn in Python. In general, with R6 classes you define an object first, and this object contains all the methods and parameters you can access.
The R6 class in R has a significant drawback when it comes to RStudio's convenience features. For example, it does not provide in-line popup help for the available parameters of a model. This means you must manually look up the parameters, e.g., those needed for the ranger() function.
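Since the in-line popup help is missing, one workaround is to inspect a learner object directly through its R6 fields. A minimal sketch (the variable name ranger_learner is just for illustration):

# look up the hyperparameters of the ranger learner via its R6 fields
ranger_learner <- lrn("regr.ranger")
ranger_learner$param_set   # available hyperparameters with types and defaults
ranger_learner$help()      # opens the help page of the learner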
Let’s define a minimum spatial prediction task example with mlr3 classes:
mlr3 uses its own terminology. You have to know a lot of specific terms in order to comfortably use the functions. I try to use the mlr3 terminology here and give some alternative terms in parentheses.
The task
defines general information about the data we have at hand. From our point observations we create the task and define what column the target (response / dependent) variable is, what column the geometry (coordinates) is and whether we want to use the coordinates as features (predictors). In this example, we also split the data in our task into a training and a test partition.
task_spain <- mlr3spatial::as_task_regr_st(
    temperature,
    target = "temp",
    coordinate_names = "geometry",
    coords_as_features = TRUE,
    crs = st_crs(temperature)
)

train_test_split <- mlr3::partition(task_spain, ratio = 0.7)
In caret, if we want to use spatial data, we have to specifically exclude the geometry column of the sf object, e.g., with st_drop_geometry(). We lose the spatial information in the process of model training and prediction. However, this information can be critical, e.g., if we want to use CAST::errorprofiles() or CAST::knncv(). In mlr3 we can keep the geometry and can also directly define whether we want to use the coordinates as predictors.
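For contrast, a minimal sketch of the caret-style workflow (hypothetical code, not part of this post's pipeline): the geometry column has to be dropped before training, so the spatial information is lost.

# caret-style sketch: drop the geometry before model training
temperature_df <- sf::st_drop_geometry(temperature)
caret_model <- caret::train(temp ~ ., data = temperature_df, method = "ranger")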
The learner
is the model type or algorithm we want to use with all its necessary or optional parameters. Here I define a Random Forest regression method from the ranger package with 100 decision trees and an mtry
of 4.
rfmodel <- mlr3::lrn("regr.ranger", num.trees = 100, mtry = 4)
In order to evaluate a score (model performance), e.g., from the previously defined test data partition, we need to define a measure
. Here I use the Root Mean Squared Error (RMSE).
measure_rmse <- mlr3::msr("regr.rmse")
We now have everything defined to actually do something. To train the model we can use the train method of our defined learner. We have to tell the train method what to actually fit the model on – in this case the data in task_spain, but only the subset we sampled as training data.
rfmodel$train(task_spain, row_ids = train_test_split$train)
rfmodel$model
Ranger result
Call:
ranger::ranger(dependent.variable.name = task$target_names, data = task$data(), case.weights = task$weights$weight, mtry = 4L, num.threads = 1L, num.trees = 100L)
Type: Regression
Number of trees: 100
Sample size: 136
Number of independent variables: 22
Mtry: 4
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 1.040724
R squared (OOB): 0.8595686
To test how well our model performs on unseen data, we use the predict method of our fitted learner. Again, we tell the predict method what data to predict on. We can then calculate the score of this test prediction, i.e., the RMSE we defined earlier as our measure.
test_prediction <- rfmodel$predict(task_spain, row_ids = train_test_split$test)
test_prediction$score(measure_rmse)
regr.rmse
1.016704
plot(test_prediction)
Finally, we can also use the model to predict temperatures for the whole raster by using the predict()
function from terra. Alternatively, we could use predict_spatial() from the mlr3spatial package (a short sketch follows after the code below).
prediction <- terra::predict(covariates, rfmodel)
plot(prediction)
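A sketch of the mlr3spatial alternative mentioned above, assuming predict_spatial() takes the covariate raster together with the trained learner:

# sketch: raster prediction via mlr3spatial instead of terra
prediction_mlr3 <- mlr3spatial::predict_spatial(covariates, rfmodel)
plot(prediction_mlr3)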
In many cases, the ideal model hyperparameters are not obvious and we have the possibility to tune them based on the data we have.
Again, we first define the necessary mlr3 objects for our tuning process. With rsmp
we decide on a resampling strategy, i.e., the data partitions we use as cross-validation folds. The package mlr3spatiotempcv contains the popular spatial resampling strategies such as block CV (spcv_block, used in the example below) or kNNDM (a hedged kNNDM sketch follows after the code below).
For the learner, instead of predefined hyperparameters we use the function to_tune(), where we can specify different values for the hyperparameters we want to test.
library(mlr3spatiotempcv)
resampling_blockcv <- rsmp("spcv_block", folds = 5, range = 5000)

rfmodel <- lrn(
    "regr.ranger",
    num.trees = 100,
    mtry = to_tune(c(2, 4, 6, 10, 12)),
    min.node.size = to_tune(c(5, 10, 15)),
    importance = "impurity"
)
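Since kNNDM was mentioned as an alternative (and this is what the spain outline was loaded for), here is a hedged sketch of how such a resampling could be defined. It assumes that mlr3spatiotempcv exposes kNNDM under the key "spcv_knndm" with folds and modeldomain arguments; please check the package reference before relying on it.

# sketch: kNNDM resampling that matches the CV folds to the prediction area (spain)
resampling_knndm <- rsmp("spcv_knndm", folds = 5, modeldomain = spain)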
Now we can assemble everything in a tuning instance object with the ti() function. The first four arguments should be self-explanatory. The terminator defines whether we want to stop the search for the best hyperparameter combination once some criterion is met. store_benchmark_result and store_models regulate whether we want to keep the measures and models for all the hyperparameter combinations we test.
Finally, we have to decide on a strategy for how to search with the tuner object. Here I use grid_search, which is the brute-force method that tests every possible combination once. We then execute the cross-validation with the optimize method of the tuner object.
tuning_blockcv <- ti(
    task = task_spain, # the data
    resampling = resampling_blockcv, # the folds
    learner = rfmodel, # the rfmodel with mtry and min.node.size to tune
    measures = measure_rmse, # how to measure performance
    terminator = trm("none"),
    store_benchmark_result = TRUE,
    store_models = TRUE
)
tuner_grid_search <- mlr3tuning::tnr("grid_search")

tuner_grid_search$optimize(tuning_blockcv)
min.node.size mtry learner_param_vals x_domain regr.rmse
<char> <char> <list> <list> <num>
1: 15 12 <list[5]> <list[2]> 0.8520169
The results are stored in the defined tuning instance
object. We also have some nice plotting options with the mlr3viz package.
tuning_blockcv$result_learner_param_vals
$importance
[1] "impurity"
$num.threads
[1] 1
$num.trees
[1] 100
$min.node.size
[1] 15
$mtry
[1] 12
tuning_blockcv$result_y
regr.rmse
0.8520169
autoplot(tuning_blockcv, type = "parallel")
Because we set store_benchmark_result = TRUE
and store_models = TRUE
, we also have an archive where all the other results are stored.
tuning_blockcv$archive$data
min.node.size mtry regr.rmse warnings errors runtime_learners
<char> <char> <num> <int> <int> <num>
1: 5 12 0.8657247 0 0 0.153
2: 10 6 0.9098643 0 0 0.106
3: 5 6 0.8953927 0 0 0.111
4: 15 6 0.9194065 0 0 0.076
5: 5 2 1.1126819 0 0 0.075
6: 10 10 0.8602758 0 0 0.111
7: 5 4 0.9483508 0 0 0.087
8: 15 2 1.2024457 0 0 0.055
9: 15 12 0.8520169 0 0 0.103
10: 15 10 0.8758340 0 0 0.115
11: 10 2 1.1484717 0 0 0.064
12: 15 4 0.9986427 0 0 0.071
13: 5 10 0.8696826 0 0 0.139
14: 10 12 0.8588805 0 0 0.113
15: 10 4 0.9687282 0 0 0.073
uhash x_domain timestamp batch_nr
<char> <list> <POSc> <int>
1: f746d058-03dd-48ad-b1a5-6da70cc71c39 <list[2]> 2025-03-17 15:50:04 1
2: 40f80983-3657-4845-9a59-d7fc85fdf6b0 <list[2]> 2025-03-17 15:50:04 2
3: 32acbb67-833d-4c78-b135-1ea60ca52021 <list[2]> 2025-03-17 15:50:04 3
4: 5b8d2e6a-608a-4caf-974a-424319bcb60a <list[2]> 2025-03-17 15:50:04 4
5: bf116d61-c38b-46ec-8be5-ca42ab97b781 <list[2]> 2025-03-17 15:50:04 5
6: 113e3414-881c-479b-8dd6-e942116fc9b6 <list[2]> 2025-03-17 15:50:05 6
7: 3442f3ca-bcb5-4f28-a262-9a23470c01d7 <list[2]> 2025-03-17 15:50:05 7
8: a9ab533a-0e98-419e-bfc7-e7dbdb2d0959 <list[2]> 2025-03-17 15:50:05 8
9: dcdb50c6-df5d-40c7-ac4b-a065bd25414c <list[2]> 2025-03-17 15:50:05 9
10: 74510c03-3d05-4972-aea6-207ac92d7b29 <list[2]> 2025-03-17 15:50:05 10
11: 8feb5dc5-9c1d-4a79-9dac-ff90b3cadbd6 <list[2]> 2025-03-17 15:50:06 11
12: 88f2921f-2f54-4492-9466-ee329b35d2f7 <list[2]> 2025-03-17 15:50:06 12
13: db5cd572-cc23-4d5e-9bcb-0eeac93028fd <list[2]> 2025-03-17 15:50:06 13
14: 7221b6c4-6b04-4ba4-818d-00618199a778 <list[2]> 2025-03-17 15:50:06 14
15: b6197cbf-266f-4aa6-b4f8-7e53e3829b28 <list[2]> 2025-03-17 15:50:06 15
Once we have found the optimal hyperparameter combination, we can use it to train a final model on all the data. We define a new learner
and assign the parameters from our tuning instance
to it. Then we can train and predict again.
<- lrn("regr.ranger")
tuned_rfmodel $param_set$values <- tuning_blockcv$result_learner_param_vals
tuned_rfmodel
$train(task_spain)
tuned_rfmodel<- terra::predict(covariates, tuned_rfmodel)
tuned_prediction plot(tuned_prediction)
Similar to the hyperparameter tuning, we can use the cross-validation strategy to test whether we can build a model with a reduced set of predictors / covariates / features / variables. Here I demonstrate a forward feature selection that starts with the best-performing pair of variables and adds further variables as long as the cross-validation performance estimate still improves.
First we have to define the feature selection strategy with fs()
. I use a learner here with fixed hyperparameters. Everything is put together in the fselect()
function that creates a feature selection instance
in which all the information, methods and results are stored.
library(mlr3fselect)
library(mlr3filters)
select_mode <- fs("sequential", min_features = 2)

fs_rfmodel <- lrn(
    "regr.ranger",
    num.trees = 50,
    mtry = 2,
    min.node.size = 5,
    importance = "impurity"
)
set.seed(20)
feature_selection <- fselect(
    fselector = select_mode,
    task = task_spain,
    learner = fs_rfmodel,
    resampling = resampling_blockcv,
    measure = measure_rmse
)
# selected variables
feature_selection$result_feature_set
[1] "Y" "dem" "lst_day" "lst_night" "ntl" "popdens"
# CV RMSE with the selected variables
feature_selection$result$regr.rmse
[1] 0.7795219
Once we have found the ideal combination of variables, we can reduce our task to those predictors, train a model and predict.
fs_task <- task_spain$select(feature_selection$result_feature_set)

fs_rfmodel$train(fs_task)

fs_prediction <- terra::predict(covariates, fs_rfmodel)
plot(fs_prediction)
Some closing observations on working with mlr3:

- mlr3 is built on R6 classes, offering flexibility but requiring familiarity with this system in R.
- The geometry can be kept in the sf object, allowing for the retention of spatial information.
- R6 logic: understanding the R6 logic in help files can be tough for those unfamiliar with the system.
- The missing in-line popup help for R6 methods in RStudio makes it harder to access function details quickly.