--- title: "Creating custom bootstrap tests" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{custom_tests} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Background p-values can be computed by inverting the corresponding confidence intervals, as described in Section 14.2 of [Thulin (2024)](https://www.modernstatisticswithr.com/mathschap.html#confintequal) and Section 3.12 in [Hall (1992)](https://link.springer.com/book/10.1007/978-1-4612-4384-7). This package contains functions for computing bootstrap p-values in this way. The approach relies on the fact that: - The p-value of the two-sided test for the parameter theta is the smallest alpha such that theta is not contained in the corresponding 1-alpha confidence interval, - For a test of the parameter theta with significance level alpha, the set of values of theta that aren't rejected by the two-sided test (when used as the null hypothesis) is a 1-alpha confidence interval for theta. ## Custom hypothesis tests Bootstrap p-values for hypothesis tests based on `boot` objects can be obtained using the `boot.pval` function. The example below, based in part on code from [Section 7.4.2 of Modern Statistics with R](https://www.modernstatisticswithr.com/modchapter.html#a-general-approach), shows how to implement a bootstrap correlation test. ```{r, echo = FALSE} set.seed(142) ``` First, we create a function for computing the correlation given a bivariate dataset and a vector containing row numbers (the indices of the bootstrap sample). We also include a `method` option to allow the user to choose which type of correlation is used: ```{r} cor_boot <- function(data, row_numbers, method = "pearson") { # Obtain the bootstrap sample: sample <- data[row_numbers,] # Compute and return the statistic for the bootstrap sample: return(cor(sample[[1]], sample[[2]], method = method)) } ``` Let's say that we wish to test the correlation between the variables `hp` and `drat` in the `mtcars` data: ```{r} mtcars |> plot(hp ~ drat, data = _) ``` We subset the data (below I use the base R function `subset` for this; see my post on [base R replacements of `tidyverse` functions](https://mansthulin.se/posts/basetidyverseverbs/) for more on this) and then use `boot` to draw 999 bootstrap samples: ```{r} library(boot) boot_res <- mtcars |> subset(select = c(hp, drat)) |> boot(statistic = cor_boot, R = 999) ``` If we like, we can plot the bootstrap distribution: ```{r} plot(boot_res) ``` And compute confidence intervals: ```{r} boot.ci(boot_res) ``` To compute the p-value corresponding to the different intervals, we can now use `boot.pval`: ```{r} # Compute the bootstrap p-value based on the percentile interval: library(boot.pval) boot.pval(boot_res, type = "perc") ``` For studentized intervals, we also need to estimate the variance of the statistic, e.g. using an inner bootstrap. We create a new function for this, run a new resampling, and compute the studentized p-value: ```{r} cor_boot_student <- function(data, row_numbers, method = "pearson") { sample <- data[row_numbers,] correlation <- cor(sample[[1]], sample[[2]], method = method) inner_boot <- boot(sample, cor_boot, 100) variance <- var(inner_boot$t) return(c(correlation, variance)) } # Run the resampling: boot_res <- mtcars |> subset(select = c(hp, drat)) |> boot(cor_boot_student, 999) # Compute the bootstrap p-value based on the studentized interval: boot.pval(boot_res, type = "stud") ``` That's all there is to it! Note that using `boot.pval` is completely analogous to how we use `boot.ci` from `boot`. ## An example from `boot` The following example is an extensions of that given in the documentation for `boot::boot`: ```{r message = FALSE} # Hypothesis test for the city data # H0: ratio = 1.4 ratio <- function(d, w) sum(d$x * w)/sum(d$u * w) city.boot <- boot(city, ratio, R = 999, stype = "w", sim = "ordinary") boot.pval(city.boot, theta_null = 1.4) # Studentized test for the two sample difference of means problem # using the final two series of the gravity data. diff.means <- function(d, f) { n <- nrow(d) gp1 <- 1:table(as.numeric(d$series))[1] m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1]) m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1]) ss1 <- sum(d[gp1,1]^2 * f[gp1]) - (m1 * m1 * sum(f[gp1])) ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) - (m2 * m2 * sum(f[-gp1])) c(m1 - m2, (ss1 + ss2)/(sum(f) - 2)) } grav1 <- gravity[as.numeric(gravity[,2]) >= 7, ] grav1.boot <- boot(grav1, diff.means, R = 999, stype = "f", strata = grav1[ ,2]) boot.pval(grav1.boot, type = "stud", theta_null = 0) ``` ## References * Davison, A.C. and Hinkley, D.V. (1997) _Bootstrap Methods and Their Application_. Cambridge University Press. * Hall P (1992). _The Bootstrap and Edgeworth Expansion_. Springer, New York. * Thulin, M. (2024) _Modern Statistics with R_. Second edition. Chapman & Hall/CRC Press.