Stacked super-learner ensemble of genomic prediction models

Combines several base models into one predictor by stacking. Each base model's out-of-fold predictions from an inner cross-validation are used to learn a set of non-negative weights constrained to sum to one; the final prediction is the weighted average, using those weights, of the base models refit on all the data. This is the stacked-regression idea of Breiman, generalized as the super learner by van der Laan and colleagues, applied to genomic selection. In practice it tends to perform about as well as, or better than, the best single base model, without requiring that model to be identified in advance.
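The stacking procedure described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the two ridge learners, the simulated data, and the softmax-plus-optim() weight search are hypothetical stand-ins chosen to keep the example self-contained.

```r
set.seed(1)
n <- 60; m <- 20
X <- matrix(rbinom(n * m, 2, 0.5), n, m)            # toy 0/1/2 marker codes
y <- as.numeric(X[, 1:5] %*% rnorm(5) + rnorm(n))   # toy phenotype

# Hypothetical base learners: ridge regressions with different penalties.
# Each takes training data and returns a prediction function.
ridge_fit <- function(lambda) function(X, y) {
  mu <- mean(y); ctr <- colMeans(X)
  Xc <- sweep(X, 2, ctr)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(X)), crossprod(Xc, y - mu))
  function(Xnew) as.numeric(mu + sweep(Xnew, 2, ctr) %*% beta)
}
learners <- list(ridge_1 = ridge_fit(1), ridge_100 = ridge_fit(100))

# Step 1: out-of-fold predictions from a k-fold inner cross-validation.
k <- 5
fold <- sample(rep_len(seq_len(k), n))
oof <- matrix(NA_real_, n, length(learners),
              dimnames = list(NULL, names(learners)))
for (f in seq_len(k)) {
  test <- fold == f
  for (j in seq_along(learners)) {
    predict_j <- learners[[j]](X[!test, , drop = FALSE], y[!test])
    oof[test, j] <- predict_j(X[test, , drop = FALSE])
  }
}

# Step 2: weights that are non-negative and sum to one. The softmax
# parameterization enforces both constraints (an assumption of this sketch;
# a quadratic-programming or NNLS solver would be an alternative).
softmax <- function(theta) { e <- exp(theta - max(theta)); e / sum(e) }
loss <- function(theta) mean((y - oof %*% softmax(theta))^2)
opt <- optim(rep(0, ncol(oof)), loss, method = "BFGS")
weights <- setNames(softmax(opt$par), colnames(oof))

# Step 3: refit every base learner on all the data and average with the weights.
full_fits <- lapply(learners, function(fit) fit(X, y))
ensemble_predict <- function(Xnew) {
  P <- do.call(cbind, lapply(full_fits, function(p) p(Xnew)))
  as.numeric(P %*% weights)
}

round(weights, 3)
```

The weights are learned only from out-of-fold predictions, so a base model that overfits its training data gains no advantage in the weighting step.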

Usage

gs_ensemble(y, geno, base_models = NULL, inner_k = 5, seed = NULL, ...)

Arguments

y: Numeric phenotype vector (length n), no missing values.
geno: Marker matrix (n individuals x m markers, coded 0/1/2, no missing values).
base_models: Character vector of base model names. Defaults to every available model except the ensemble itself.
inner_k: Folds for the inner stacking cross-validation. Default 5.
seed: Optional seed for the inner folds (via withr::with_seed()).
...: Passed to gs_fit() for the base models.

Value

An object of class gs_ensemble (and gs_model): a list with base_names, weights (named, summing to 1), the refit base_fits, and the out-of-fold prediction matrix oof.
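As a rough illustration of how the returned components fit together, the sketch below builds a hand-made stand-in with the documented fields and combines oof with weights to get stacked out-of-fold predictions. The numbers, the second model name "other", and the assumption that the columns of oof are named after base_names are all hypothetical; a real object would come from gs_ensemble().

```r
# Hypothetical stand-in with the documented components (values are made up).
y <- c(1.2, 0.4, -0.3, 0.9, -1.1)
oof <- cbind(gblup = c(1.0, 0.1, -0.2, 0.7, -0.8),
             other = c(0.6, 0.5, -0.6, 1.1, -1.3))
ens <- structure(
  list(base_names = colnames(oof),
       weights    = c(gblup = 0.7, other = 0.3),
       base_fits  = list(gblup = NULL, other = NULL),  # fitted models omitted here
       oof        = oof),
  class = c("gs_ensemble", "gs_model"))

# Stacked out-of-fold prediction: each row of oof averaged with the weights.
stacked_oof <- as.numeric(
  ens$oof[, ens$base_names, drop = FALSE] %*% ens$weights[ens$base_names])

# Agreement of the stacked out-of-fold predictions with the phenotype.
cor(y, stacked_oof)
```

Because oof holds predictions for individuals that were left out during fitting, a summary like this correlation gives a less optimistic view of accuracy than predictions from the refit models on the same data would.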

References

van der Laan, M. J., Polley, E. C. and Hubbard, A. E. (2007) "Super Learner." Statistical Applications in Genetics and Molecular Biology 6, Article 25. doi:10.2202/1544-6115.1309

Examples

sim <- simulate_population(n = 100, m = 300, seed = 1)
ens <- gs_ensemble(sim$pheno, sim$geno, base_models = "gblup", seed = 1)
ens$weights
#> gblup 
#>     1