NicheToolBox: An example of the model selection protocol for ellipsoid models
NicheToolBox: An example of the model selection protocol for ellipsoid models
- 0.1 The example
- 0.2 The R packages
- 0.3 Loading the occurrence data
- 0.4 Bioclimatic layers
- 0.5 Split occurrence
- 0.6 Extract the environmental information
- 0.7 Non-correlated variables
- 0.8 Specify the number of variables to fit the model
- 0.9 The level parameter
- 0.10 Environmental background to compute the approximated AUC ratio
- 0.11 Set the omission rate criteria
- 0.12 Model calibration and selection protocol
- References
0.1 The example
We demonstrate the use of the model selection protocol implemented in ntbox
(an article with more information about the method is in (Osorio-Olvera et al. 2020)) by using occurrence data for the giant hummingbird (Patagona gigas). The occurrence records and the modeling layers are available in the ntbox
package (if you haven’t installed it go to this link where you will find the installation notes for each of the operating systems).
0.2 The R packages
Let’s load the R
packages that we are going to use in our session
0.3 Loading the occurrence data
With the following commands, you can load into your R
session the geographic records
# Occurrence data for the giant hummingbird (Patagona gigas)
pg <- utils::read.csv(system.file("extdata/p_gigas.csv",
package = "ntbox"))
head(pg)
species | longitude | latitude | type |
---|---|---|---|
Patagona gigas | -65.21600 | -22.790906 | train |
Patagona gigas | -78.32320 | -0.459242 | test |
Patagona gigas | -68.06772 | -16.538205 | train |
Patagona gigas | -70.51691 | -33.385099 | train |
Patagona gigas | -70.31293 | -33.352039 | train |
Patagona gigas | -70.58427 | -34.274945 | test |
As you can see in the table there are 4 columns: “species”, “longitude” and “latitude” and “type” (with train or test as factors). The data points with the train labels are going to be used for model calibration and the test points will be used for selection.
0.4 Bioclimatic layers
Let’s load the environmental information and display it
# Bioclimatic layers path
wcpath <- list.files(system.file("extdata/bios",
package = "ntbox"),
pattern = ".tif$",full.names = TRUE)
wc <- raster::stack(wcpath)
plot(wc)
The environmental information is for South America and here we will project our niche model …
0.5 Split occurrence
Now we split the occurrence data into train and test using the type variable; the choice of the data is normally done using different partition methods, but in our case, the data previously was split using a random partition.
0.6 Extract the environmental information
The following code extracts the environmental information for both train and test data
pg_etrain <- raster::extract(wc,pg_train[,c("longitude",
"latitude")],
df=TRUE)
pg_etrain <- pg_etrain[,-1]
head(pg_etrain)
bio01 | bio02 | bio03 | bio04 | bio05 | bio06 | bio07 | bio08 | bio09 | bio10 | bio11 | bio12 | bio13 | bio14 | bio15 | bio16 | bio17 | bio18 | bio19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
82.80 | 193.68 | 62.16 | 4088.88 | 219.60 | -90.00 | 309.60 | 123.96 | 25.80 | 123.96 | 23.56 | 249.52 | 65.28 | 0.00 | 108.68 | 172.56 | 2.96 | 172.56 | 3.92 |
50.20 | 145.80 | 69.60 | 1523.60 | 145.40 | -62.80 | 208.20 | 61.20 | 28.40 | 65.20 | 27.40 | 605.20 | 129.00 | 6.00 | 77.00 | 325.20 | 29.20 | 191.40 | 29.20 |
94.76 | 147.56 | 56.80 | 4035.48 | 238.52 | -17.76 | 256.28 | 51.08 | 146.36 | 146.36 | 43.08 | 571.20 | 128.80 | 3.64 | 91.88 | 337.48 | 16.64 | 16.64 | 327.20 |
54.08 | 139.56 | 56.84 | 3963.12 | 189.68 | -53.04 | 242.72 | 7.12 | 104.52 | 104.52 | 4.04 | 469.68 | 108.20 | 3.76 | 88.68 | 267.28 | 18.48 | 18.48 | 261.48 |
149.24 | 135.32 | 53.64 | 4122.48 | 290.08 | 39.68 | 250.40 | 101.72 | 200.44 | 202.60 | 96.08 | 424.84 | 109.32 | 1.36 | 105.16 | 278.00 | 9.80 | 9.80 | 269.96 |
111.20 | 140.00 | 54.00 | 4233.40 | 266.20 | 8.80 | 257.40 | 64.60 | 161.20 | 168.60 | 58.20 | 1340.80 | 279.80 | 18.20 | 82.80 | 766.80 | 68.20 | 71.80 | 701.60 |
pg_etest <- raster::extract(wc,pg_test[,c("longitude",
"latitude")],
df=TRUE)
pg_etest <- pg_etest[,-1]
head(pg_etest)
bio01 | bio02 | bio03 | bio04 | bio05 | bio06 | bio07 | bio08 | bio09 | bio10 | bio11 | bio12 | bio13 | bio14 | bio15 | bio16 | bio17 | bio18 | bio19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
56.96 | 84.56 | 83.36 | 500.52 | 112.08 | 11.48 | 100.60 | 56.80 | 53.80 | 60.96 | 49.48 | 1223.80 | 128.04 | 70.00 | 16.56 | 364.88 | 243.44 | 307.56 | 286.68 |
108.80 | 140.20 | 53.20 | 4359.80 | 258.80 | -2.00 | 260.80 | 59.40 | 160.20 | 165.80 | 53.60 | 840.20 | 194.00 | 3.80 | 95.80 | 522.00 | 20.20 | 21.80 | 483.20 |
138.00 | 106.00 | 63.00 | 2240.00 | 231.00 | 63.00 | 168.00 | 109.00 | 164.00 | 168.00 | 109.00 | 349.00 | 95.00 | 0.00 | 111.00 | 239.00 | 3.00 | 3.00 | 239.00 |
135.36 | 128.16 | 59.24 | 3250.52 | 255.56 | 40.28 | 215.28 | 93.84 | 176.84 | 176.92 | 92.84 | 498.08 | 136.56 | 1.36 | 108.64 | 331.80 | 7.32 | 7.32 | 331.52 |
161.80 | 106.60 | 53.20 | 3360.20 | 273.60 | 73.80 | 199.80 | 123.80 | 204.80 | 204.80 | 118.80 | 504.40 | 130.60 | 2.40 | 103.40 | 331.20 | 9.60 | 9.60 | 312.80 |
136.88 | 119.04 | 53.20 | 3559.80 | 265.56 | 43.24 | 222.32 | 98.64 | 178.72 | 183.88 | 91.04 | 835.56 | 201.76 | 3.16 | 99.88 | 533.76 | 15.96 | 21.60 | 508.16 |
0.8 Specify the number of variables to fit the model
Now we specify the number of variables to fit the ellipsoid models; in the example, we will fit for 3,5, and 6 dimensions
0.9 The level parameter
This parameter is to specify the proportion of training points that will be used to fit the minimum volume ellipsoid (Van Aelst and Rousseeuw 2009).
0.10 Environmental background to compute the approximated AUC ratio
This background data is just to compute the partial ROC test
0.11 Set the omission rate criteria
For selecting the model we will use an arbitrary value of 6 percent of omission; it is not a rule but accepted omission rates are those bellow 10%. We will ask the function to return the partial ROC value (Peterson, Papes, and Soberon 2008)
0.12 Model calibration and selection protocol
Now we just need to use the function ellipsoid_selection
to run the model calibration and selection protocol
e_selct <- ntbox::ellipsoid_selection(env_train = pg_etrain,
env_test = pg_etest,
env_vars = env_vars,
level = level,
nvarstest = nvarstest,
env_bg = env_bg,
omr_criteria= omr_criteria,
proc = proc)
## -----------------------------------------------------------------------------------------
## **** Starting model selection process ****
## -----------------------------------------------------------------------------------------
##
## A total number of 56 models will be created for combinations of 8 variables taken by 3
##
## A total number of 56 models will be created for combinations of 8 variables taken by 5
##
## A total number of 28 models will be created for combinations of 8 variables taken by 6
##
## -----------------------------------------------------------------------------------------
## **A total number of 140 models will be tested **
##
## -----------------------------------------------------------------------------------------
## 6 models passed omr_criteria for train data
## 18 models passed omr_criteria for test data
## 2 models passed omr_criteria for train and test data
Let’s see the first 20 rows of the results
fitted_vars | nvars | om_rate_train | om_rate_test | bg_prevalence | pval_bin | pval_proc | env_bg_aucratio | mean_omr_train_test | rank_by_omr_train_test | rank_omr_aucratio |
---|---|---|---|---|---|---|---|---|---|---|
bio01,bio02,bio07 | 3 | 0.0420168 | 0.0392157 | 0.1708667 | 0 | 0 | 1.675877 | 0.0406162 | 1 | 1 |
bio01,bio02,bio12 | 3 | 0.0588235 | 0.0588235 | 0.2062457 | 0 | 0 | 1.657898 | 0.0588235 | 4 | 2 |
bio01,bio03,bio12 | 3 | 0.0756303 | 0.0392157 | 0.1730408 | 0 | 0 | 1.719728 | 0.0574230 | 2 | 3 |
bio07,bio14,bio18 | 3 | 0.0756303 | 0.0392157 | 0.5028165 | 0 | 0 | 1.372321 | 0.0574230 | 3 | 4 |
bio01,bio07,bio12 | 3 | 0.0672269 | 0.0588235 | 0.1707679 | 0 | 0 | 1.738229 | 0.0630252 | 5 | 5 |
bio01,bio02,bio03 | 3 | 0.0672269 | 0.0588235 | 0.2378694 | 0 | 0 | 1.625587 | 0.0630252 | 6 | 6 |
bio01,bio03,bio14 | 3 | 0.0924370 | 0.0392157 | 0.1452713 | 0 | 0 | 1.787052 | 0.0658263 | 7 | 7 |
bio01,bio07,bio14 | 3 | 0.0756303 | 0.0588235 | 0.1571302 | 0 | 0 | 1.730370 | 0.0672269 | 8 | 8 |
bio01,bio12,bio18 | 3 | 0.0756303 | 0.0588235 | 0.2091116 | 0 | 0 | 1.659776 | 0.0672269 | 9 | 9 |
bio01,bio02,bio18 | 3 | 0.0588235 | 0.0784314 | 0.2051586 | 0 | 0 | 1.666558 | 0.0686275 | 10 | 10 |
bio01,bio03,bio18 | 3 | 0.0840336 | 0.0588235 | 0.1554501 | 0 | 0 | 1.761996 | 0.0714286 | 11 | 11 |
bio02,bio14,bio18 | 3 | 0.0840336 | 0.0588235 | 0.4999506 | 0 | 0 | 1.334296 | 0.0714286 | 12 | 12 |
bio01,bio03,bio07 | 3 | 0.0672269 | 0.0784314 | 0.2144481 | 0 | 0 | 1.623151 | 0.0728291 | 13 | 13 |
bio01,bio12,bio14 | 3 | 0.1092437 | 0.0392157 | 0.1656290 | 0 | 0 | 1.784394 | 0.0742297 | 14 | 14 |
bio02,bio03,bio14 | 3 | 0.0924370 | 0.0588235 | 0.4443127 | 0 | 0 | 1.382246 | 0.0756303 | 15 | 15 |
bio01,bio02,bio07,bio12,bio14 | 5 | 0.1092437 | 0.0588235 | 0.1018875 | 0 | 0 | 1.801916 | 0.0840336 | 16 | 16 |
bio01,bio02,bio03,bio12,bio14 | 5 | 0.1092437 | 0.0588235 | 0.1163158 | 0 | 0 | 1.788790 | 0.0840336 | 17 | 17 |
bio07,bio12,bio14 | 3 | 0.1092437 | 0.0588235 | 0.2198834 | 0 | 0 | 1.640416 | 0.0840336 | 18 | 18 |
bio01,bio02,bio14 | 3 | 0.0924370 | 0.0784314 | 0.1987351 | 0 | 0 | 1.655143 | 0.0854342 | 19 | 19 |
bio02,bio12,bio14 | 3 | 0.0924370 | 0.0784314 | 0.2652436 | 0 | 0 | 1.564441 | 0.0854342 | 20 | 20 |
With the following lines of code, I am going to display the model in the first row of the table
# Best ellipsoid model for "omr_criteria"
bestvarcomb <- stringr::str_split(e_selct$fitted_vars,",")[[1]]
# Ellipsoid model (environmental space)
best_mod <- ntbox::cov_center(pg_etrain[,bestvarcomb],
mve = T,
level = 0.99,
vars = 1:length(bestvarcomb))
# Projection model in geographic space
mProj <- ntbox::ellipsoidfit(wc[[bestvarcomb]],
centroid = best_mod$centroid,
covar = best_mod$covariance,
level = 0.99,size = 3)
raster::plot(mProj$suitRaster)
points(pg[,c("longitude","latitude")],pch=20,cex=0.5)
References
Osorio-Olvera, Luis, C. Yañez-Arenas, E. Martínez-Meyer, and A. T. Peterson. 2020. “Relationships between population densities and niche?centroid distances in North American birds.” Ecol. Lett. 23: 555–64. https://doi.org/10.1111/ele.13453.
Peterson, A. Townsend, Monica Papes, and Jorge Soberon. 2008. “Rethinking receiver operating characteristic analysis applications in ecological niche modeling.” Ecol. Modell. 213 (1): 63–72.
Van Aelst, Stefan, and Peter Rousseeuw. 2009. “Minimum volume ellipsoid.” Wiley Interdiscip. Rev. Comput. Stat. 1 (1): 71–82. https://doi.org/10.1002/wics.19.