Conducts a full subsets analysis based on gam(m4) using the list generated by a call to generate_model_set
Usage
fit_model_set(
model.set.list,
max.models = 200,
save.model.fits = TRUE,
parallel = FALSE,
n.cores = 4,
r2.type = "r2.lm.est",
report.unique.r2 = FALSE,
VI.mods = "min.n"
)Arguments
- model.set.list
A list as returned by generate_model_set()
- max.models
The total number of models allowed to be fit and the model fits to be saved during fitting. If the candidate set is bigger than this value, a warning message will be returned.
- save.model.fits
Should all successfully fitted models be saved to list success.models. Defaults to TRUE. Value is overwritten if the number of models n the set is bigger than max.models.
- parallel
A logical value indicating if parallel processing should be used. The default is FALSE.
- n.cores
An integer indicating the number of cores to use if parallel is TRUE. Defaults to 4.
- r2.type
The value to extract from the gam model fit to use as the R squared value. Defaults to r2.lm.est which returns and estimated R squared value based on a linear regression between the observed and predicted values. r2 will return the adjusted R.sq as reported by gam, gamm or gamm4.dev will return the deviance explained as reported by gam or gamm. Note gamm4 does not currently return a deviance.
- report.unique.r2
The estimated null model R2 is subtracted from each model R2 to give an idea of the unique variance explained. This can be useful where null terms are included in the model set.
- VI.mods
The set of models used to calculate summed variable importance scores. Defaults to 'min.n', which uses only the best n models for each variable (n being the minimum number of models any one predictor is present in). Set to 'all' to use all models in the candidate set instead.
Value
A list of the following output files:
mod.data.out - A data.frame that contains the statistics associated with each model fit. This includes AICc and BIC, delta values (e.g. AICc-(min(AICc)), corresponding weight values (Burnham and Anderson 2003), an estimate of the model R2, and a column for each of the included predictor variables containing either 0 (variable not included in the model) or 1 (variable is present in the model). Use of BIC in information theoretic approaches has been heavily criticised because of the inherent assumption of BIC that there is a true model that is represented in the candidate set (Anderson & Burnham 2002). Rather than decide a-priori which model selection tool users should adopt, we supply both as part of the function outputs. To simplify output, only AICc and AICc based model weights, rather than AIC, are included as these are asymptotically equivalent at large sample sizes, and for small sample sizes AICc should be used in any case. Calculating R2 values is non-trivial for mixed models, especially non-gaussian cases (and some argue should not be done at all). We have supplied a range of methods for estimating R2 (r2.type), as in our experience a single method rarely performs adequately across all scenarios.
used.data - A data.frame which is identical to the data.frame initially supplied by the user, but with any hard coded interaction terms appended via cbind.
failed.models - A list of model formula that failed to fit. Ideally the list of failed models should be empty, but when this is not the case interrogating failed.models provides a useful means of troubleshooting. Users can examine which models are not fitting and explore the reasons for this by fitting the failed models outside the full_subsets_gam call based on the listed formula. When a large number of models fail to fit properly it usually indicates poor specification of the initial test.fit or other arguments in the call to full_subsets_gam (such as the inclusion of factor interactions when there are few data within each level of the factor), or that inappropriate variables are being included in the model set.
success.models - A complete list of all successfully fitted model formula. If models were saved, this can be used for multimodel inference and creating model averaged predictions. otherwise the formula can be used to refit the top model set via a call to update of the test fit: update(test.fit,formula=mod.formula[[l]])
variable.importance - A list containing importance scores for each included predictor. To determine the relative importance of each predictor across the whole model set we summed the wi values for all models containing each variable. The higher the combined weights for an explanatory parameter, the more important it is in the analysis (Burnham & Anderson, 2002). An assumption of the use of summed model weights to infer variable importance is that the number of models in which the different predictors are present is uniform. As our function removes models with correlated predictors, this is not always the case. To overcome this issue, the summed variable.importance scores are the summed weights for the best n models, where n is equal to the minimum number of models any one predictor is present in. If you would like the variable importance scores to be based on all models in the set instead, set VI.mods="all".
Details
The function constructs and fits a complete model set based on the supplied arguments. for more information see Fisher R, Wilson SK, Sin TM, Lee AC, Langlois TJ (2018) A simple function for full-subsets multiple regression in ecology with R. Ecology and Evolution https://onlinelibrary.wiley.com/doi/abs/10.1002/ece3.4134
Examples
library(mgcv)
data(case_study1)
use.dat <- case_study1
use.dat$site <- as.factor(use.dat$site)
test.fit <- gam(Herbivore.abundance ~ s(depth, k = 3, bs = "cr") + s(site, bs = "re"),
family = tw(), data = use.dat)
model.set <- generate_model_set(
use.dat = use.dat,
test.fit = test.fit,
pred.vars.cont = c("complexity", "depth"),
pred.vars.fact = "ZONE",
null.terms = "s(site,bs='re')",
max.predictors = 2,
k = 3
)
fit_model_set(model.set, parallel = FALSE)
#>
|
| | 0%
|
|========= | 12%
|
|================== | 25%
|
|========================== | 38%
|
|=================================== | 50%
|
|============================================ | 62%
|
|==================================================== | 75%
|
|============================================================= | 88%
|
|======================================================================| 100%
#> $mod.data.out
#> modname
#> null null
#> complexity complexity
#> depth depth
#> ZONE ZONE
#> ZONE+complexity ZONE+complexity
#> ZONE+depth ZONE+depth
#> ZONE+complexity.by.ZONE ZONE+complexity.by.ZONE
#> ZONE+depth.by.ZONE ZONE+depth.by.ZONE
#> formula
#> null s(site, bs = "re")
#> complexity s(complexity, k = 3, bs = "cr") + s(site, bs = "re")
#> depth s(depth, k = 3, bs = "cr") + s(site, bs = "re")
#> ZONE ZONE + s(site, bs = "re")
#> ZONE+complexity s(complexity, k = 3, bs = "cr") + ZONE + s(site, bs = "re")
#> ZONE+depth s(depth, k = 3, bs = "cr") + ZONE + s(site, bs = "re")
#> ZONE+complexity.by.ZONE s(complexity, by = ZONE, k = 3, bs = "cr") + ZONE + s(site, bs = "re")
#> ZONE+depth.by.ZONE s(depth, by = ZONE, k = 3, bs = "cr") + ZONE + s(site, bs = "re")
#> AICc BIC r2.vals r2.vals.unique edf
#> null 601.8165 613.9550 0.13574 NA 2.93
#> complexity 530.3521 546.5630 0.55313 NA 5.15
#> depth 604.2478 620.2653 0.17579 NA 4.98
#> ZONE 603.0187 616.7110 0.13529 NA 4.02
#> ZONE+complexity 531.1829 548.2208 0.55773 NA 5.94
#> ZONE+depth 604.7738 621.8040 0.17570 NA 5.94
#> ZONE+complexity.by.ZONE 531.6302 549.9272 0.53534 NA 6.83
#> ZONE+depth.by.ZONE 600.4762 619.5704 0.23486 NA 7.56
#> edf.less.1 delta.AICc delta.BIC wi.AICc wi.BIC
#> null 0 71.464 67.392 0.000 0.000
#> complexity 0 0.000 0.000 0.457 0.616
#> depth 0 73.896 73.702 0.000 0.000
#> ZONE 0 72.667 70.148 0.000 0.000
#> ZONE+complexity 0 0.831 1.658 0.302 0.269
#> ZONE+depth 0 74.422 75.241 0.000 0.000
#> ZONE+complexity.by.ZONE 0 1.278 3.364 0.241 0.115
#> ZONE+depth.by.ZONE 0 70.124 73.007 0.000 0.000
#> complexity depth ZONE
#> null 0 0 0
#> complexity 1 0 0
#> depth 0 1 0
#> ZONE 0 0 1
#> ZONE+complexity 1 0 1
#> ZONE+depth 0 1 1
#> ZONE+complexity.by.ZONE 1 0 1
#> ZONE+depth.by.ZONE 0 1 1
#>
#> $failed.models
#> named list()
#>
#> $success.models
#> $success.models$null
#>
#> Family: Tweedie(p=1.469)
#> Link function: log
#>
#> Formula:
#> Herbivore.abundance ~ s(site, bs = "re")
#>
#> Estimated degrees of freedom:
#> 1.93 total = 2.93
#>
#> REML score: 297.6204
#>
#> $success.models$complexity
#>
#> Family: Tweedie(p=1.265)
#> Link function: log
#>
#> Formula:
#> Herbivore.abundance ~ s(complexity, k = 3, bs = "cr") + s(site,
#> bs = "re")
#>
#> Estimated degrees of freedom:
#> 1.73 2.42 total = 5.15
#>
#> REML score: 261.5252
#>
#> $success.models$depth
#>
#> Family: Tweedie(p=1.465)
#> Link function: log
#>
#> Formula:
#> Herbivore.abundance ~ s(depth, k = 3, bs = "cr") + s(site, bs = "re")
#>
#> Estimated degrees of freedom:
#> 1.42 2.57 total = 4.98
#>
#> REML score: 297.5266
#>
#> $success.models$ZONE
#>
#> Family: Tweedie(p=1.469)
#> Link function: log
#>
#> Formula:
#> Herbivore.abundance ~ ZONE + s(site, bs = "re")
#>
#> Estimated degrees of freedom:
#> 2.02 total = 4.02
#>
#> REML score: 297.9798
#>
#> $success.models$`ZONE+complexity`
#>
#> Family: Tweedie(p=1.267)
#> Link function: log
#>
#> Formula:
#> Herbivore.abundance ~ s(complexity, k = 3, bs = "cr") + ZONE +
#> s(site, bs = "re")
#>
#> Estimated degrees of freedom:
#> 1.71 2.23 total = 5.94
#>
#> REML score: 262.176
#>
#> $success.models$`ZONE+depth`
#>
#> Family: Tweedie(p=1.465)
#> Link function: log
#>
#> Formula:
#> Herbivore.abundance ~ s(depth, k = 3, bs = "cr") + ZONE + s(site,
#> bs = "re")
#>
#> Estimated degrees of freedom:
#> 1.47 2.47 total = 5.94
#>
#> REML score: 297.7074
#>
#> $success.models$`ZONE+complexity.by.ZONE`
#>
#> Family: Tweedie(p=1.262)
#> Link function: log
#>
#> Formula:
#> Herbivore.abundance ~ s(complexity, by = ZONE, k = 3, bs = "cr") +
#> ZONE + s(site, bs = "re")
#>
#> Estimated degrees of freedom:
#> 1.78 1.00 2.05 total = 6.83
#>
#> REML score: 261.0315
#>
#> $success.models$`ZONE+depth.by.ZONE`
#>
#> Family: Tweedie(p=1.45)
#> Link function: log
#>
#> Formula:
#> Herbivore.abundance ~ s(depth, by = ZONE, k = 3, bs = "cr") +
#> ZONE + s(site, bs = "re")
#>
#> Estimated degrees of freedom:
#> 1.71 1.00 2.85 total = 7.56
#>
#> REML score: 294.4957
#>
#>
#> $variable.importance
#> $variable.importance$aic
#> $variable.importance$aic$variable.weights.raw
#> complexity depth ZONE
#> 1.000 0.000 0.543
#>
#>
#> $variable.importance$bic
#> $variable.importance$bic$variable.weights.raw
#> complexity depth ZONE
#> 1.000 0.000 0.384
#>
#>
#>