# Getting Started

This guide shows how to use `bbstat` to perform Bayesian bootstrapping with both built-in and custom statistics.

We'll start with a quick example on univariate data, move on to a bivariate case, and then show how to write your own weighted statistic. The goal is to help you see how Bayesian bootstrapping works in practice.

## Installation

You can install `bbstat` from PyPI:

```bash
pip install bbstat
```

Then import what you need:

```python
import numpy as np
from bbstat import bootstrap
```

## Bootstrapping a simple statistic

Let's start with something familiar: estimating the mean of a small dataset. We'll use the Bayesian bootstrap to quantify uncertainty in that mean.

```python
# Sample data: daily coffee consumption (in cups) from a small survey
coffee = np.array([2.0, 3.0, 1.5, 2.5, 3.0, 2.0, 4.0])

# Run the Bayesian bootstrap with 2000 Dirichlet-weighted replicates
distribution = bootstrap(data=coffee, statistic_fn="mean", n_boot=2000, seed=1)

# Summarize the distribution as a posterior mean and 95% credible interval
summary = distribution.summarize(level=0.95)
print(summary)
# BootstrapSummary(mean=2.583..., ci_low=2.057..., ci_high=3.159..., level=0.95)
```

If you'd like cleaner, human-readable output, the `BootstrapSummary.round()` method can automatically round values to a sensible precision based on the width of the credible interval:

```python
print(summary.round())
# BootstrapSummary(mean=2.6, ci_low=2.1, ci_high=3.2, level=0.95)
```

Here the mean estimate is about 2.6 cups per day, with a 95% credible interval of roughly [2.1, 3.2]. The uncertainty reflects variation in the weight each observation could carry in the population, not variation from resampled data points.
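
To make the mechanics concrete, here is a small numpy-only sketch of what a Bayesian bootstrap of the mean does conceptually (this is an illustration, not `bbstat`'s actual internals): each replicate draws a weight vector from a uniform Dirichlet distribution and evaluates the weighted mean under those weights.

```python
import numpy as np

# Conceptual sketch of Bayesian bootstrapping the mean (not bbstat's
# actual internals): draw uniform Dirichlet weight vectors and compute
# one weighted mean per replicate.
rng = np.random.default_rng(1)
coffee = np.array([2.0, 3.0, 1.5, 2.5, 3.0, 2.0, 4.0])

n_boot = 2000
# Each row is one weight vector; entries are non-negative and sum to 1.
weights = rng.dirichlet(np.ones(len(coffee)), size=n_boot)
replicates = weights @ coffee  # one weighted mean per replicate

mean = replicates.mean()
ci_low, ci_high = np.quantile(replicates, [0.025, 0.975])
```

The resulting `mean` and interval land close to the `bbstat` summary above, since both follow the same recipe: randomize the weights, never the data points.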

## Bootstrapping a quantile

You can use any of the built-in weighted statistics the same way. For example, let's estimate the 90th percentile of the same dataset:

```python
distribution = bootstrap(
    data=coffee,
    statistic_fn="quantile",
    fn_kwargs={"quantile": 0.9},
    seed=1,
)

summary = distribution.summarize(level=0.95)
print(summary)
# BootstrapSummary(mean=3.28, ci_low=2.85, ci_high=3.81, level=0.95)
```

The bootstrapped 0.9 quantile is around 3.3 cups, meaning that roughly 90% of respondents in this sample consume about 3.3 cups or fewer per day.
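
If you're curious what a weighted quantile looks like under the hood, here is one common definition, written with numpy only (`bbstat`'s exact implementation may differ, e.g. in interpolation behavior): sort the data, accumulate the weights, and return the first value whose cumulative weight reaches the target quantile.

```python
import numpy as np

# One common weighted-quantile definition (bbstat's exact version may
# differ): the smallest sorted value whose cumulative weight >= quantile.
def weighted_quantile(data, weights, quantile):
    order = np.argsort(data)
    sorted_data = np.asarray(data)[order]
    cum_weights = np.cumsum(np.asarray(weights)[order])
    idx = np.searchsorted(cum_weights, quantile)
    return sorted_data[min(idx, len(sorted_data) - 1)]

coffee = np.array([2.0, 3.0, 1.5, 2.5, 3.0, 2.0, 4.0])
uniform = np.full(len(coffee), 1 / len(coffee))
print(weighted_quantile(coffee, uniform, 0.9))  # 4.0 with uniform weights
```

Under Dirichlet weights, each replicate shifts the cumulative-weight curve, which is exactly what spreads the bootstrap distribution of the quantile.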

## Bivariate example: dependence between variables

For bivariate data, bbstat includes functions such as `"pearson_dependency"` (weighted correlation) and `"mutual_information"` (a nonlinear dependence measure).

Let's look at the relationship between study time and exam score:

```python
# Simulated data: study hours vs exam scores
study_hours = np.array([2, 3, 4, 5, 6, 8, 9])
exam_scores = np.array([60, 65, 70, 72, 78, 85, 90])

data = (study_hours, exam_scores)

# Weighted Pearson correlation via Bayesian bootstrapping
distribution = bootstrap(data=data, statistic_fn="pearson_dependency", n_boot=2000, seed=1)
summary = distribution.summarize(level=0.95).round()
print(summary)
# BootstrapSummary(mean=0.9969, ci_low=0.9911, ci_high=0.9992, level=0.95)
```

This shows a strong positive correlation, and the credible interval indicates high confidence that the true correlation is above 0.99. You could switch to `"mutual_information"` to estimate a nonlinear dependency instead.
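
The formula behind a weighted Pearson correlation is worth seeing once; the sketch below is the standard textbook version, not necessarily identical to what `"pearson_dependency"` does internally. With uniform weights it reduces exactly to the ordinary Pearson correlation.

```python
import numpy as np

# Standard weighted Pearson correlation (bbstat's implementation details
# may differ): weighted covariance over the product of weighted std devs.
def weighted_pearson(x, y, weights):
    w = np.asarray(weights) / np.sum(weights)
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    var_x = np.sum(w * (x - mx) ** 2)
    var_y = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)

study_hours = np.array([2, 3, 4, 5, 6, 8, 9])
exam_scores = np.array([60, 65, 70, 72, 78, 85, 90])
uniform = np.full(len(study_hours), 1 / len(study_hours))
print(weighted_pearson(study_hours, exam_scores, uniform))
```

Because the normalization constants cancel in the ratio, the uniform-weights case matches `np.corrcoef` to floating-point precision.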

## Writing your own weighted statistic

Defining a custom statistic is simple. All functions used with `bootstrap()` must follow this signature:

```python
def custom_statistic(data, weights, **kwargs) -> float:
    ...
```

Here's an example that implements a **weighted geometric mean**, which is not (yet) included among the built-ins but demonstrates how to use the weights properly.

For a set of positive numbers \(x_1, x_2, \dots, x_n > 0\) with associated weights
\(w_1, w_2, \dots, w_n\) such that \(w_i \ge 0\) and \(\sum_{i=1}^n w_i = 1\),
the **weighted geometric mean** is defined as:

\[
\text{GM}_w = \prod_{i=1}^{n} x_i^{w_i} = \exp\Bigg( \sum_{i=1}^{n} w_i \ln x_i \Bigg)
\]

In the Bayesian bootstrap, the weights \(w_i\) are drawn from a Dirichlet distribution:

\[
(w_1, \dots, w_n) \sim \text{Dirichlet}(\alpha_1=1, \dots, \alpha_n=1)
\]

Each bootstrap replicate computes:

\[
\text{GM}_\text{replicate} = \exp\Bigg( \sum_{i=1}^{n} w_i \ln x_i \Bigg)
\]

Repeating this for many replicates produces a posterior-like distribution of the geometric mean.

```python
def weighted_geometric_mean(data, weights):
    """Compute the weighted geometric mean."""
    data = np.asarray(data)
    weights = np.asarray(weights)
    # Avoid log(0): require positive data
    if np.any(data <= 0):
        raise ValueError("Geometric mean requires positive data.")
    log_mean = np.sum(weights * np.log(data))
    return np.exp(log_mean)


data = np.array([1.2, 1.5, 2.0, 2.8, 3.1])
distribution = bootstrap(data=data, statistic_fn=weighted_geometric_mean, n_boot=1500, seed=1)
summary = distribution.summarize().round()
print(summary)
# BootstrapSummary(mean=2.01, ci_low=1.58, ci_high=2.48, level=0.87)
```

The same pattern applies if your statistic takes multiple arrays (e.g., `(x, y)`): the function receives the data and the weights, computes its result, and returns a single float.
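
As a concrete instance of that multi-array pattern, here is a hypothetical custom bivariate statistic, a weighted covariance, that receives its data as a tuple of two arrays (the function name and data layout are illustrative, mirroring the tuple form used in the bivariate example above):

```python
import numpy as np

# Hypothetical custom bivariate statistic: weighted covariance.
# The data argument is a tuple (x, y), as in the bivariate example.
def weighted_covariance(data, weights):
    x, y = (np.asarray(a, dtype=float) for a in data)
    w = np.asarray(weights) / np.sum(weights)
    mx, my = np.sum(w * x), np.sum(w * y)
    return float(np.sum(w * (x - mx) * (y - my)))

x = np.array([2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 2.5, 4.0])
uniform = np.full(len(x), 0.25)
print(weighted_covariance((x, y), uniform))  # 1.1875
```

With uniform weights this matches the biased sample covariance; under Dirichlet weights, each bootstrap replicate perturbs it into a distribution.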

## Common questions and pitfalls

- **Why are the credible intervals sometimes narrow?**
  Bayesian bootstrapping assumes that the observed data already represent the full population support. Uncertainty is only about how much weight each observation should get, not about unseen data. If the sample is small or has heavy tails, results can appear overconfident.
- **Can I get negative weights or resampled data?**
  No. Weights are drawn from a uniform Dirichlet distribution, so they're always non-negative and sum to one. This approach replaces the random resampling of the classical bootstrap rather than supplementing it.
- **What if my statistic ignores the weights?**
  Then it's no longer a Bayesian bootstrap: you're just re-evaluating the same statistic on the same data over and over. Always make sure your custom statistic uses the provided weights.
- **What if my data contain zeros or negative values?**
  That's fine for most statistics, but not all (the geometric mean above is a case in point). Handle such cases carefully or filter the data before applying those statistics.
- **Can I use bbstat for regression or multivariate models?**
  Yes, as long as your statistic can be written as a weighted function of the data, for example a weighted regression slope or a loss-function summary. The Bayesian bootstrap does not assume any specific model form.
- **How does rounding work in `summarize()`?**
  When you call `BootstrapSummary.round()`, the method automatically picks a decimal precision suitable for the width of the credible interval so that the displayed digits reflect the level of uncertainty. You can also set a fixed precision manually if you prefer.
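
To illustrate the regression case mentioned above, a weighted least-squares slope for a single predictor can be written as a custom statistic in the same `(data, weights)` shape (the name is illustrative; this is plain weighted least squares, not a bbstat built-in):

```python
import numpy as np

# Illustrative custom statistic: weighted least-squares slope for one
# predictor, usable with bootstrap() via the (data, weights) signature.
def weighted_slope(data, weights):
    x, y = (np.asarray(a, dtype=float) for a in data)
    w = np.asarray(weights) / np.sum(weights)
    mx, my = np.sum(w * x), np.sum(w * y)
    return float(np.sum(w * (x - mx) * (y - my)) / np.sum(w * (x - mx) ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.0])
uniform = np.full(len(x), 0.25)
print(weighted_slope((x, y), uniform))  # close to 2.0
```

Bootstrapping this statistic yields a distribution over slopes, giving you a credible interval for the regression coefficient without assuming Gaussian errors.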