--- title: "Bootstrap p-values and confidence intervals for regression models" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{boot_summary} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Background p-values can be computed by inverting the corresponding confidence intervals, as described in Section 14.2 of [Thulin (2024)](https://www.modernstatisticswithr.com/mathschap.html#confintequal) and Section 3.12 in [Hall (1992)](https://link.springer.com/book/10.1007/978-1-4612-4384-7). This package contains functions for computing bootstrap p-values in this way. The approach relies on the fact that: - The p-value of the two-sided test for the parameter theta is the smallest alpha such that theta is not contained in the corresponding 1-alpha confidence interval, - For a test of the parameter theta with significance level alpha, the set of values of theta that aren't rejected by the two-sided test (when used as the null hypothesis) is a 1-alpha confidence interval for theta. ## Summaries for regression models Summary tables with confidence intervals and p-values for the coefficients of regression models can be obtained using the `boot_summary` (most models) and `censboot_summary` (models with censored response variables) functions. Currently, the following models are supported: - Linear models fitted using `lm`, - Generalised linear models fitted using `glm` or `MASS::glm.nb`, - Nonlinear models fitted using `nls`, - Robust linear models fitted using `MASS::rlm`, - Ordered logistic or probit regression models fitted (without weights) using `MASS::polr`, - Linear mixed models fitted using `lme4::lmer` or `lmerTest::lmer`, - Generalised linear mixed models fitted using `lme4::glmer`. - Cox PH regression models fitted using `survival::coxph` (using `censboot_summary`). - Accelerated failure time models fitted using `survival::survreg` or `rms::psm` (using `censboot_summary`). - Any regression model such that: `residuals(object, type="pearson")` returns Pearson residuals; `fitted(object)` returns fitted values; `hatvalues(object)` returns the leverages, or perhaps the value 1 which will effectively ignore setting the hatvalues. In addition, the `data` argument should contain no missing values among the columns actually used in fitting the model. A number of examples are available in Chapters 8 and 9 of [Modern Statistics with R](https://www.modernstatisticswithr.com/). Here are some simple examples with a linear regression model for the `mtcars` data: ```{r message=FALSE} library(boot.pval) # Bootstrap summary of a linear model for mtcars: model <- lm(mpg ~ hp + vs, data = mtcars) boot_summary(model) # Use 9999 bootstrap replicates and adjust p-values for # multiplicity using Holm's method: boot_summary(model, R = 9999, adjust.method = "holm") # Use case resampling instead of residual resampling: boot_summary(model, method = "case") # Export results to a gt table: boot_summary(model, R = 9999) |> summary_to_gt() ``` See Davison and Hinkley (1997) for details about residual resampling (the default) and case resampling. ```{r eval = FALSE} # Export results to a Word document: library(flextable) boot_summary(model, R = 9999) |> summary_to_flextable() |> save_as_docx(path = "my_table.docx") ``` And a toy example for a generalised linear mixed model (using a small number of bootstrap repetitions): ```{r eval = FALSE} library(lme4) model <- glmer(TICKS ~ YEAR + (1|LOCATION), data = grouseticks, family = poisson) boot_summary(model, R = 99) ``` ## Speeding up computations For complex models, speed can be greatly improved by using parallelisation. For `lmer` and `glmer` models, this is set using the `parallel` (available options are `"multicore"` and `"snow"`). The number of CPUs to use is set using `ncpus`. ```{r eval = FALSE} model <- glmer(TICKS ~ YEAR + (1|LOCATION), data = grouseticks, family = poisson) boot_summary(model, R = 999, parallel = "multicore", ncpus = 10) ``` For other models, use `ncores`: ```{r eval = FALSE} model <- lm(mpg ~ hp + vs, data = mtcars) boot_summary(model, R = 9999, ncores = 10) ``` ## Survival models Survival regression models should be fitted using the argument `model = TRUE`. A summary table can then be obtained using `censboot_summary`. By default, the table contains exponentiated coefficients (i.e. hazard ratios, in the case of a Cox PH model). ```{r message = FALSE} library(survival) # Weibull AFT model: model <- survreg(formula = Surv(time, status) ~ age + sex, data = lung, dist = "weibull", model = TRUE) # Table with exponentiated coefficients: censboot_summary(model) # Cox PH model: model <- coxph(formula = Surv(time, status) ~ age + sex, data = lung, model = TRUE) # Table with hazard ratios: censboot_summary(model) # Table with original coefficients: censboot_summary(model, coef = "raw") ``` To speed up computations using parallelisation, use the `parallel` and `ncpus` arguments: ```{r eval = FALSE} censboot_summary(model, parallel = "multicore", ncpus = 10) ``` ## References * Davison, A.C. and Hinkley, D.V. (1997) _Bootstrap Methods and Their Application_. Cambridge University Press. * Hall P (1992). _The Bootstrap and Edgeworth Expansion_. Springer, New York. * Thulin, M. (2024) _Modern Statistics with R_. Second edition. Chapman & Hall/CRC Press.