ellipsoid_selection: Performs variable selection for ellipsoid models

Performs variable selection for ellipsoid models according to omission rates in the environmental space.

ellipsoid_selection(
  env_train,
  env_test = NULL,
  env_vars,
  nvarstest,
  level = 0.95,
  mve = TRUE,
  env_bg = NULL,
  omr_criteria,
  parallel = FALSE,
  comp_each = 100,
  proc = FALSE,
  proc_iter = 100,
  rseed = TRUE
)

Arguments

env_train	A data frame with the environmental training data.
env_test	A data frame with the environmental testing data. The default is NULL. If given, the selection process will also report the p-value of a binomial test.
env_vars	A vector with the names of environmental variables to be used in the selection process.
nvarstest	A vector indicating the number of variables used to fit the ellipsoids during model selection. Models with different numbers of variables can be tested in the same run (e.g. nvarstest=c(3,6)).
level	Proportion of points to be included in the ellipsoids. This parameter is equivalent to the error (E) proposed by Peterson et al. (2008).
mve	A logical value. If TRUE, a minimum volume ellipsoid will be computed using the function `cov.rob` of the MASS package. If FALSE, the covariance matrix of the input data will be used.
env_bg	Environmental data to compute the approximated prevalence of the model. The data should be a sample of the environmental layers of the calibration area.
omr_criteria	Omission rate criteria. Value of the omission rate allowed for the selection process. Default NULL; see Details.
parallel	Logical. If TRUE, the computations will be run in parallel. Default FALSE.
comp_each	Number of models to run in each job in the parallel computation. Default 100.
proc	Logical. If TRUE, a partial ROC test will be run. Default FALSE.
proc_iter	Numeric. The total number of iterations for the partial ROC bootstrap. Default 100.
rseed	Logical. Whether or not to set a random seed for the partial ROC bootstrap. Default TRUE.

Value

A data.frame with 5 columns: i) "fitted_vars", the names of the variables that were fitted; ii) "om_rate", the omission rate of the model; iii) "bg_prevalence", the approximated prevalence of the model (see the Details section); iv) the rank value of importance in model selection by omission rate; v) the rank value by prevalence, computed when a value of omr_criteria is passed.
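As an illustration of how a table with these columns might be used, the sketch below builds a small mock result (the values and variable names are invented, not output from ntbox), keeps the models that meet an omission rate criterion, and orders them by prevalence. It assumes that a lower "bg_prevalence" is preferred among the models that pass; that assumption is not stated in this help page.

```r
# Mock result table; values and variable names are hypothetical
res <- data.frame(
  fitted_vars = c("bio1,bio12", "bio1,bio4", "bio4,bio12", "bio1,bio15"),
  om_rate = c(0.05, 0.02, 0.10, 0.06),
  bg_prevalence = c(0.30, 0.45, 0.10, 0.20),
  stringsAsFactors = FALSE
)

omr_criteria <- 0.07

# Keep only the models whose omission rate meets the criterion
passed <- res[res$om_rate <= omr_criteria, ]

# Assumption: among those, prefer the lowest prevalence, then the lowest omission rate
ranked <- passed[order(passed$bg_prevalence, passed$om_rate), ]

# Variable names of the top-ranked model, split from the comma-separated string
best_vars <- strsplit(ranked$fitted_vars[1], ",")[[1]]
```

The "fitted_vars" column stores the variable names as one comma-separated string, so it has to be split before it can be used to subset a data frame or a raster stack.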

Details

Model selection occurs in environmental space (E-space). For each variable combination, the omission rate (omr) in E-space is computed using the function inEllipsoid. The results are ordered by omr; if the user supplies the environmental background "env_bg", an estimated prevalence is also computed and the results are additionally ordered by "bg_prevalence".
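The idea of an omission rate in E-space can be sketched with base R alone. The helper below, omission_rate, is hypothetical and is not the ntbox implementation: it assumes the ellipsoid is defined by the centroid and covariance matrix of the training data, and that a point is inside when its squared Mahalanobis distance does not exceed the chi-square quantile for the chosen level.

```r
# Hypothetical helper (not part of ntbox): proportion of test points
# that fall outside an ellipsoid fitted to the training points.
omission_rate <- function(train, test, level = 0.95) {
  train <- as.matrix(train)
  test <- as.matrix(test)
  centroid <- colMeans(train)
  covar <- stats::cov(train)
  # Squared Mahalanobis distance from each test point to the centroid
  d2 <- stats::mahalanobis(test, center = centroid, cov = covar)
  # Assumed cutoff: chi-square quantile with one degree of freedom per variable
  cutoff <- stats::qchisq(level, df = ncol(train))
  inside <- d2 <= cutoff
  mean(!inside)
}

# Simulated data; column names are placeholders
set.seed(1)
train <- data.frame(bio1 = rnorm(200), bio12 = rnorm(200))
test <- data.frame(bio1 = c(0, 0.5, 10), bio12 = c(0, -0.5, 10))

omr <- omission_rate(train, test, level = 0.95)
```

Two of the three simulated test points lie close to the centroid and one lies far away, so one point in three is omitted.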

The number of variables used to build candidate models is set by the user through the parameter "nvarstest". Model selection will be run in parallel if the user specifies more than one set of combinations and the total number of models to be tested is greater than 500. If "omr_criteria" is given and "bg_prevalence" is available, the results are shown weighting the models that meet the "omr_criteria" by their value of "bg_prevalence". For more details and examples, see the ellipsoid_omr help.
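To get a feel for how many candidate models a given "nvarstest" produces, the count can be worked out before running the selection. The sketch below uses placeholder variable names and assumes that every combination of the requested sizes is a candidate and that jobs are formed by splitting the candidates into groups of "comp_each"; the way ntbox actually splits the work is not described here.

```r
# Placeholder variable names; any character vector works
env_vars <- c("bio1", "bio4", "bio12", "bio15", "bio18", "bio19")
nvarstest <- c(3, 5, 6)

# Number of candidate models: sum of "n choose k" over the requested sizes
n_models <- sum(choose(length(env_vars), nvarstest))

# Enumerate the combinations themselves as a flat list of character vectors
candidates <- unlist(
  lapply(nvarstest, function(k) utils::combn(env_vars, k, simplify = FALSE)),
  recursive = FALSE
)

# Assumed reading of comp_each: number of models per parallel job
comp_each <- 100
n_jobs <- ceiling(n_models / comp_each)
```

With six variables and sizes 3, 5 and 6 there are 20 + 6 + 1 = 27 candidates, well below the 500-model threshold mentioned above. The count grows quickly as variables are added, so it is worth checking before launching a long run.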

References

Peterson, A.T. et al. (2008) Rethinking receiver operating characteristic analysis applications in ecological niche modeling. Ecol. Modell., 213, 63–72.

Examples

if (FALSE) {
# Bioclimatic layers path
wcpath <- list.files(system.file("extdata/bios",
                                package = "ntbox"),
                    pattern = "\\.tif$", full.names = TRUE)
# Bioclimatic layers
wc <- raster::stack(wcpath)
# Occurrence data for the giant hummingbird (Patagona gigas)
pg <- utils::read.csv(system.file("extdata/p_gigas.csv",
                                  package = "ntbox"))
# Split occs in train and test
pgL <- base::split(pg,pg$type)
pg_train <- pgL$train
pg_test <- pgL$test
# Environmental data for training and testing
pg_etrain <- raster::extract(wc,pg_train[,c("longitude",
                                            "latitude")],
                             df=TRUE)
pg_etrain <- pg_etrain[,-1]
pg_etest <- raster::extract(wc,pg_test[,c("longitude",
                                          "latitude")],
                            df=TRUE)
pg_etest <- pg_etest[,-1]

# Non-correlated variables
env_varsL <- ntbox::correlation_finder(cor(pg_etrain),
                                       threshold = 0.8,
                                       verbose = FALSE)
env_vars <- env_varsL$descriptors
# Number of variables to fit ellipsoids (3, 5 and 6)
nvarstest <- c(3,5,6)
# Level
level <- 0.95
# Environmental background to compute the approximated
# prevalence of the prediction
env_bg <- raster::sampleRandom(wc,10000)

# Selection process

e_selct <- ntbox::ellipsoid_selection(env_train = pg_etrain,
                                      env_test = pg_etest,
                                      env_vars = env_vars,
                                      level = level,
                                      nvarstest = nvarstest,
                                      env_bg = env_bg,
                                      omr_criteria=0.07)

# Best ellipsoid model for "omr_criteria" and prevalence
bestvarcomb <- stringr::str_split(e_selct$fitted_vars,",")[[1]]

# Ellipsoid model projection

best_mod <- ntbox::cov_center(pg_etrain[,bestvarcomb],
                              mve = TRUE,
                              level = 0.99,
                              vars = 1:length(bestvarcomb))


# Project the model in geographic space

mProj <- ntbox::ellipsoidfit(wc[[bestvarcomb]],
                             centroid = best_mod$centroid,
                             covar = best_mod$covariance,
                             level = 0.99,size = 3)

raster::plot(mProj$suitRaster)
points(pg[,c("longitude","latitude")],pch=20,cex=0.5)

pg_proc <- ntbox::pROC(continuous_mod = mProj$suitRaster,
                       test_data = pg_test[,c("longitude","latitude")],
                       n_iter = 1000,
                       E_percent = 5,
                       boost_percent = 50, parallel = FALSE)
print(pg_proc$pROC_summary)
}