NicheToolBox: An example of the model selection protocol for ellipsoid models

0.1 The example

We demonstrate the use of the model selection protocol implemented in ntbox (an article with more information about the method is in (Osorio-Olvera et al. 2020)) by using occurrence data for the giant hummingbird (Patagona gigas). The occurrence records and the modeling layers are available in the ntbox package (if you haven’t installed it go to this link where you will find the installation notes for each of the operating systems).

$Fig. 1 Patagona gigas taken from wikimedia$

Fig. 1 Patagona gigas taken from wikimedia

0.2 The R packages

Let’s load the R packages that we are going to use in our session

library(ntbox)
library(raster)
library(rgl)
library(stringr)
set.seed(111)

0.3 Loading the occurrence data

With the following commands, you can load into your R session the geographic records

# Occurrence data for the giant hummingbird (Patagona gigas)
pg <- utils::read.csv(system.file("extdata/p_gigas.csv",
                                  package = "ntbox"))

head(pg)

species	longitude	latitude	type
Patagona gigas	-65.21600	-22.790906	train
Patagona gigas	-78.32320	-0.459242	test
Patagona gigas	-68.06772	-16.538205	train
Patagona gigas	-70.51691	-33.385099	train
Patagona gigas	-70.31293	-33.352039	train
Patagona gigas	-70.58427	-34.274945	test

As you can see in the table there are 4 columns: “species”, “longitude” and “latitude” and “type” (with train or test as factors). The data points with the train labels are going to be used for model calibration and the test points will be used for selection.

0.4 Bioclimatic layers

Let’s load the environmental information and display it

# Bioclimatic layers path
wcpath <- list.files(system.file("extdata/bios",
                                package = "ntbox"),
                    pattern = ".tif$",full.names = TRUE)

wc <- raster::stack(wcpath)

plot(wc)

The environmental information is for South America and here we will project our niche model …

0.5 Split occurrence

Now we split the occurrence data into train and test using the type variable; the choice of the data is normally done using different partition methods, but in our case, the data previously was split using a random partition.

# Split occs in train and test
pgL <- base::split(pg,pg$type)
pg_train <- pgL$train
pg_test <- pgL$test

0.6 Extract the environmental information

The following code extracts the environmental information for both train and test data

pg_etrain <- raster::extract(wc,pg_train[,c("longitude",
                                            "latitude")],
                             df=TRUE)
pg_etrain <- pg_etrain[,-1]

head(pg_etrain)

bio01	bio02	bio03	bio04	bio05	bio06	bio07	bio08	bio09	bio10	bio11	bio12	bio13	bio14	bio15	bio16	bio17	bio18	bio19
82.80	193.68	62.16	4088.88	219.60	-90.00	309.60	123.96	25.80	123.96	23.56	249.52	65.28	0.00	108.68	172.56	2.96	172.56	3.92
50.20	145.80	69.60	1523.60	145.40	-62.80	208.20	61.20	28.40	65.20	27.40	605.20	129.00	6.00	77.00	325.20	29.20	191.40	29.20
94.76	147.56	56.80	4035.48	238.52	-17.76	256.28	51.08	146.36	146.36	43.08	571.20	128.80	3.64	91.88	337.48	16.64	16.64	327.20
54.08	139.56	56.84	3963.12	189.68	-53.04	242.72	7.12	104.52	104.52	4.04	469.68	108.20	3.76	88.68	267.28	18.48	18.48	261.48
149.24	135.32	53.64	4122.48	290.08	39.68	250.40	101.72	200.44	202.60	96.08	424.84	109.32	1.36	105.16	278.00	9.80	9.80	269.96
111.20	140.00	54.00	4233.40	266.20	8.80	257.40	64.60	161.20	168.60	58.20	1340.80	279.80	18.20	82.80	766.80	68.20	71.80	701.60

pg_etest <- raster::extract(wc,pg_test[,c("longitude",
                                          "latitude")],
                            df=TRUE)
pg_etest <- pg_etest[,-1]

head(pg_etest)

bio01	bio02	bio03	bio04	bio05	bio06	bio07	bio08	bio09	bio10	bio11	bio12	bio13	bio14	bio15	bio16	bio17	bio18	bio19
56.96	84.56	83.36	500.52	112.08	11.48	100.60	56.80	53.80	60.96	49.48	1223.80	128.04	70.00	16.56	364.88	243.44	307.56	286.68
108.80	140.20	53.20	4359.80	258.80	-2.00	260.80	59.40	160.20	165.80	53.60	840.20	194.00	3.80	95.80	522.00	20.20	21.80	483.20
138.00	106.00	63.00	2240.00	231.00	63.00	168.00	109.00	164.00	168.00	109.00	349.00	95.00	0.00	111.00	239.00	3.00	3.00	239.00
135.36	128.16	59.24	3250.52	255.56	40.28	215.28	93.84	176.84	176.92	92.84	498.08	136.56	1.36	108.64	331.80	7.32	7.32	331.52
161.80	106.60	53.20	3360.20	273.60	73.80	199.80	123.80	204.80	204.80	118.80	504.40	130.60	2.40	103.40	331.20	9.60	9.60	312.80
136.88	119.04	53.20	3559.80	265.56	43.24	222.32	98.64	178.72	183.88	91.04	835.56	201.76	3.16	99.88	533.76	15.96	21.60	508.16

0.7 Non-correlated variables

Select the non-correlated variables using a threshold correlation value of 0.8

env_varsL <- ntbox::correlation_finder(cor(pg_etrain),
                                       threshold = 0.8,
                                       verbose = F)

env_vars <- env_varsL$descriptors
print(env_vars )

## [1] "bio01" "bio02" "bio03" "bio07" "bio12" "bio14" "bio18" "bio19"

0.8 Specify the number of variables to fit the model

Now we specify the number of variables to fit the ellipsoid models; in the example, we will fit for 3,5, and 6 dimensions

nvarstest <- c(3,5,6)

0.9 The level parameter

This parameter is to specify the proportion of training points that will be used to fit the minimum volume ellipsoid (Van Aelst and Rousseeuw 2009).

# Level
level <- 0.99

0.10 Environmental background to compute the approximated AUC ratio

This background data is just to compute the partial ROC test

env_bg <- ntbox::sample_envbg(wc,10000)

0.11 Set the omission rate criteria

For selecting the model we will use an arbitrary value of 6 percent of omission; it is not a rule but accepted omission rates are those bellow 10%. We will ask the function to return the partial ROC value (Peterson, Papes, and Soberon 2008)

omr_criteria <- 0.06
proc <- TRUE

0.12 Model calibration and selection protocol

Now we just need to use the function ellipsoid_selection to run the model calibration and selection protocol

e_selct <- ntbox::ellipsoid_selection(env_train = pg_etrain,
                                      env_test = pg_etest,
                                      env_vars = env_vars,
                                      level = level,
                                      nvarstest = nvarstest,
                                      env_bg = env_bg,
                                      omr_criteria= omr_criteria,
                                      proc = proc)

## -----------------------------------------------------------------------------------------
##      **** Starting model selection process ****
## -----------------------------------------------------------------------------------------
## 
## A total number of 56 models will be created for combinations of 8 variables taken by 3 
## 
## A total number of 56 models will be created for combinations of 8 variables taken by 5 
## 
## A total number of 28 models will be created for combinations of 8 variables taken by 6 
## 
## -----------------------------------------------------------------------------------------
##   **A total number of 140 models will be tested **
## 
## -----------------------------------------------------------------------------------------
##   6 models passed omr_criteria for train data
##   18 models passed omr_criteria for test data
##   2 models passed omr_criteria for train and test data

Let’s see the first 20 rows of the results

head(e_selct,20)

fitted_vars	nvars	om_rate_train	om_rate_test	bg_prevalence	env_bg_aucratio	mean_omr_train_test	rank_by_omr_train_test	rank_omr_aucratio
bio01,bio02,bio07	3	0.0420168	0.0392157	0.1708667	1.675877	0.0406162	1	1
bio01,bio02,bio12	3	0.0588235	0.0588235	0.2062457	1.657898	0.0588235	4	2
bio01,bio03,bio12	3	0.0756303	0.0392157	0.1730408	1.719728	0.0574230	2	3
bio07,bio14,bio18	3	0.0756303	0.0392157	0.5028165	1.372321	0.0574230	3	4
bio01,bio07,bio12	3	0.0672269	0.0588235	0.1707679	1.738229	0.0630252	5	5
bio01,bio02,bio03	3	0.0672269	0.0588235	0.2378694	1.625587	0.0630252	6	6
bio01,bio03,bio14	3	0.0924370	0.0392157	0.1452713	1.787052	0.0658263	7	7
bio01,bio07,bio14	3	0.0756303	0.0588235	0.1571302	1.730370	0.0672269	8	8
bio01,bio12,bio18	3	0.0756303	0.0588235	0.2091116	1.659776	0.0672269	9	9
bio01,bio02,bio18	3	0.0588235	0.0784314	0.2051586	1.666558	0.0686275	10	10
bio01,bio03,bio18	3	0.0840336	0.0588235	0.1554501	1.761996	0.0714286	11	11
bio02,bio14,bio18	3	0.0840336	0.0588235	0.4999506	1.334296	0.0714286	12	12
bio01,bio03,bio07	3	0.0672269	0.0784314	0.2144481	1.623151	0.0728291	13	13
bio01,bio12,bio14	3	0.1092437	0.0392157	0.1656290	1.784394	0.0742297	14	14
bio02,bio03,bio14	3	0.0924370	0.0588235	0.4443127	1.382246	0.0756303	15	15
bio01,bio02,bio07,bio12,bio14	5	0.1092437	0.0588235	0.1018875	1.801916	0.0840336	16	16
bio01,bio02,bio03,bio12,bio14	5	0.1092437	0.0588235	0.1163158	1.788790	0.0840336	17	17
bio07,bio12,bio14	3	0.1092437	0.0588235	0.2198834	1.640416	0.0840336	18	18
bio01,bio02,bio14	3	0.0924370	0.0784314	0.1987351	1.655143	0.0854342	19	19
bio02,bio12,bio14	3	0.0924370	0.0784314	0.2652436	1.564441	0.0854342	20	20

With the following lines of code, I am going to display the model in the first row of the table

# Best ellipsoid model for "omr_criteria" 
bestvarcomb <- stringr::str_split(e_selct$fitted_vars,",")[[1]]

# Ellipsoid model (environmental space)
best_mod <- ntbox::cov_center(pg_etrain[,bestvarcomb],
                              mve = T,
                              level = 0.99,
                              vars = 1:length(bestvarcomb))


# Projection model in geographic space

mProj <- ntbox::ellipsoidfit(wc[[bestvarcomb]],
                             centroid = best_mod$centroid,
                             covar = best_mod$covariance,
                             level = 0.99,size = 3)
raster::plot(mProj$suitRaster)
points(pg[,c("longitude","latitude")],pch=20,cex=0.5)

References

Osorio-Olvera, Luis, C. Yañez-Arenas, E. Martínez-Meyer, and A. T. Peterson. 2020. “Relationships between population densities and niche?centroid distances in North American birds.” Ecol. Lett. 23: 555–64. https://doi.org/10.1111/ele.13453.

Peterson, A. Townsend, Monica Papes, and Jorge Soberon. 2008. “Rethinking receiver operating characteristic analysis applications in ecological niche modeling.” Ecol. Modell. 213 (1): 63–72.

Van Aelst, Stefan, and Peter Rousseeuw. 2009. “Minimum volume ellipsoid.” Wiley Interdiscip. Rev. Comput. Stat. 1 (1): 71–82. https://doi.org/10.1002/wics.19.