Commit 9da4573

Merge pull request #6 from cwehmeyer/fix/docs
docs: fix examples, switch theme, add getting-started page, and improve descriptions
2 parents 5b6ce31 + e999aa9

5 files changed: +251, -116 lines

README.md

Lines changed: 49 additions & 34 deletions
@@ -2,21 +2,32 @@
 
 [![PyPI version](https://img.shields.io/pypi/v/bbstat.svg)](https://pypi.org/project/bbstat/)
 [![Python Versions](https://img.shields.io/pypi/pyversions/bbstat.svg)](https://pypi.org/project/bbstat/)
+[![CodeQL](https://github.com/cwehmeyer/bbstat/actions/workflows/github-code-scanning/codeql/badge.svg)](https://github.com/cwehmeyer/bbstat/actions/workflows/github-code-scanning/codeql)
 [![CI](https://github.com/cwehmeyer/bbstat/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/cwehmeyer/bbstat/actions/workflows/ci.yaml)
 [![codecov](https://codecov.io/gh/cwehmeyer/bbstat/branch/main/graph/badge.svg?token=V3QV2DFJ9W)](https://codecov.io/gh/cwehmeyer/bbstat)
 [![Docs](https://img.shields.io/badge/docs-latest-blue.svg)](https://cwehmeyer.github.io/bbstat/)
 
-A lightweight library for Bayesian bootstrapping and statistical evaluation.
+A lightweight library for Bayesian bootstrapping and statistical evaluation, designed for learning, experimentation, and exploring Bayesian nonparametric ideas.
+
+The Bayesian bootstrap (Rubin, 1981) is a simple nonparametric Bayesian method for estimating uncertainty in statistics without assuming a likelihood model. It replaces resampling with random Dirichlet-distributed weights on the observed data, producing a posterior-like distribution for any statistic (mean, quantile, regression, etc.). Results reflect uncertainty in the weights (not in unobserved data) and are asymptotically similar to the classical bootstrap. The method assumes i.i.d. data; results may be overconfident if the sample is small or unrepresentative.
+
+This package implements the core logic of Bayesian bootstrapping in Python, along with a few weighted statistic functions, as a way to learn and experiment with Bayesian nonparametric ideas. It's meant as an educational and exploratory project rather than a production-ready library, but it may be useful for understanding or demonstrating how Bayesian bootstrap inference works in practice.
+
+## Why use this package?
+
+- Learn and experiment with Bayesian bootstrap inference in Python
+- Quickly compute posterior-like uncertainty intervals for arbitrary statistics
+- Extend easily with your own weighted statistic functions
 
 ## Installation
 
-### From PyPI:
+- From PyPI:
 
 ```bash
 pip install bbstat
 ```
 
-### From GitHub source code:
+- From GitHub source code:
 
 ```bash
 git clone https://github.com/cwehmeyer/bbstat.git
@@ -56,53 +67,57 @@ print(summary.round()) # => BootstrapSummary(mean=52000.0, ci_low=47000.0, ci_h
 
 ### `bootstrap(data, statistic_fn, n_boot=1000, ...)`
 
-Performs Bayesian bootstrapping on input `data` using the given statistic.
+Performs Bayesian bootstrapping on `data` using the given statistic.
 
 - `data`: 1D NumPy array, or tuple/list thereof
 - `statistic_fn`: string or callable (e.g., `"mean"`, `"median"`, or custom function)
-- `level`: credible interval (default 0.87)
 - `n_boot`: number of bootstrap samples
 - `seed`: random seed (optional)
 - `blocksize`: number of resamples to allocate in one block
 - `fn_kwargs`: optional dictionary with parameters for `statistic_fn`
 
-Returns a `BootstrapResult` with:
-- `.mean`: estimated statistic value
-- `.ci`: tuple representing lower and upper bounds of the credible interval
-- `.level`: credible level used
-- `.n_boot`: number of bootstraps performed
-- `.estimates`: array of statistic values computed across the bootstrapped posterior samples
+**Parameters**
+
+- `data`: 1D NumPy array, or tuple/list of arrays
+- `statistic_fn`: string or callable (e.g. `"mean"`, `"median"`, or custom function)
+- `n_boot`: number of bootstrap samples
+- `seed`: random seed (optional)
+- `blocksize`: number of resamples processed per block
+- `fn_kwargs`: optional dict of extra parameters for `statistic_fn`
+
+**Returns**
+
+A `BootstrapDistribution` object with:
+
+- `.estimates`: array of bootstrapped statistic values
+- `.summarize(level)`: returns a `BootstrapSummary` with `mean`, `ci_low`, `ci_high`, and `level`
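A minimal usage sketch of this interface (the data and numbers are illustrative only; only the calls documented above are assumed):

```python
import numpy as np
from bbstat import bootstrap

# Illustrative data only
waiting_times = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4])

# Extra statistic parameters go through fn_kwargs (here: the 0.5 quantile, i.e. the median)
distribution = bootstrap(
    data=waiting_times,
    statistic_fn="quantile",
    n_boot=2000,
    seed=42,
    fn_kwargs={"quantile": 0.5},
)

# Collapse the bootstrapped estimates into a point estimate plus credible interval
summary = distribution.summarize(level=0.87)
print(summary.round())
```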
 
 ### Weighted statistic functions included
 
-The module bbstat.statistics provides a number univariate and bivariate weighted statistics:
-- `"entropy"`: `bbstat.statistics.compute_weighted_entropy(data, weights)`
-- `"eta_square_dependency"`: `bbstat.statistics.compute_weighted_eta_square_dependency(data, weights)`
-- `"log_odds"`: `bbstat.statistics.compute_weighted_log_odds(data, weights, state: int)`
-- `"mean"`: `bbstat.statistics.compute_weighted_mean(data, weights)`
-- `"median"`: `bbstat.statistics.compute_weighted_median(data, weights)`
-- `"mutual_information"`: `bbstat.statistics.compute_weighted_mutual_information(data, weights)`
-- `"pearson_dependence"`: `bbstat.statistics.compute_weighted_pearson_dependence(data, weights, ddof: int = 0)`
-- `"percentile"`: `bbstat.statistics.compute_weighted_percentile(data, weights, percentile: float)`
-- `"probability"`: `bbstat.statistics.compute_weighted_probability(data, weights, state: int)`
-- `"quantile"`: `bbstat.statistics.compute_weighted_quantile(data, weights, quantile: float)`
-- `"self_information"`: `bbstat.statistics.compute_weighted_self_information(data, weights, state: int)`
-- `"spearman_depedence"`: `bbstat.statistics.compute_weighted_spearman_depedence(data, weights, ddof: int = 0)`
-- `"std"`: `bbstat.statistics.compute_weighted_std(data, weights, ddof: int = 0)`
-- `"sum"`: `bbstat.statistics.compute_weighted_sum(data, weights)`
-- `"variance"`: `bbstat.statistics.compute_weighted_variance(data, weights, ddof: int = 0)`
-
-If you want to use your own custom functions, please adhere to this pattern
+The module `bbstat.statistics` includes several univariate and bivariate weighted statistics, such as:
+
+- `"mean"`: `compute_weighted_mean(data, weights)`
+- `"median"`: `compute_weighted_median(data, weights)`
+- `"quantile"` / `"percentile"`
+- `"variance"` / `"std"` / `"sum"`
+- `"entropy"` / `"log_odds"` / `"probability"` / `"self_information"`
+- `"pearson_dependence"` / `"spearman_dependence"`
+- `"eta_square_dependency"` / `"mutual_information"`
+
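These names resolve to weighted functions in `bbstat.statistics`, which can also be called directly; a small sketch, assuming the signatures listed above:

```python
import numpy as np
from bbstat.statistics import compute_weighted_mean, compute_weighted_std

values = np.array([1.0, 2.0, 3.0, 4.0])

# With uniform weights, the weighted statistics should reduce to the ordinary ones
weights = np.full(values.size, 1.0 / values.size)

print(compute_weighted_mean(values, weights))         # expected: 2.5
print(compute_weighted_std(values, weights, ddof=0))  # expected: about 1.118
```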
+You can also supply your own functions following this pattern:
 
 ```python
-def custom_statistic(data, weights, *, **kwargs) -> float
+def custom_statistic(data, weights, **kwargs) -> float:
+    ...
 ```
 
-where `data` is
-- a 1D numpy array of length `n_data` or
-- a tuple/list of 1D numpy arrays, each of length `n_data`
+where:
+
+- `data`: 1D NumPy array or tuple/list of 1D arrays
+- `weights`: 1D NumPy array of non-negative values summing to 1
+- `**kwargs`: optional keyword arguments passed via `fn_kwargs`
 
-and `weights` is a 1D numpy array of length `n_data`, with non-negative elements that sum up to one. The function may also take additional parameters which can be supplied via `**kwargs`.
+Custom functions that follow this pattern can be passed to `bootstrap` as `statistic_fn`.
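A hypothetical custom statistic following this pattern (the function name and its `threshold` parameter are made up for illustration):

```python
import numpy as np
from bbstat import bootstrap

def weighted_fraction_above(data, weights, *, threshold=0.0):
    """Weighted fraction of observations strictly above a threshold."""
    data = np.asarray(data)
    weights = np.asarray(weights)
    return float(np.sum(weights[data > threshold]))

values = np.array([0.8, 1.4, 2.1, 2.9, 3.3])
distribution = bootstrap(
    data=values,
    statistic_fn=weighted_fraction_above,
    fn_kwargs={"threshold": 2.0},
    seed=0,
)
print(distribution.summarize().round())
```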
 
 ## License
 
docs/getting_started.md

Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
+# Getting Started
+
+This guide shows how to use `bbstat` to perform Bayesian bootstrapping with both built-in and custom statistics.
+
+We'll start with a quick example on univariate data, move on to a bivariate case, and then show how to write your own weighted statistic. The goal is to help you see how Bayesian bootstrapping works in practice.
+
+## Installation
+
+You can install `bbstat` from PyPI:
+
+```bash
+pip install bbstat
+```
+
+Then import what you need:
+
+```python
+import numpy as np
+from bbstat import bootstrap
+```
+
+## Bootstrapping a simple statistic
+
+Let's start with something familiar: estimating the mean of a small dataset. We'll use the Bayesian bootstrap to quantify uncertainty in that mean.
+
+```python
+# Sample data: daily coffee consumption (in cups) from a small survey
+coffee = np.array([2.0, 3.0, 1.5, 2.5, 3.0, 2.0, 4.0])
+
+# Run the Bayesian bootstrap with 2000 Dirichlet-weighted replicates
+distribution = bootstrap(data=coffee, statistic_fn="mean", n_boot=2000, seed=1)
+
+# Summarize the distribution as a posterior mean and 95% credible interval
+summary = distribution.summarize(level=0.95)
+print(summary)
+# BootstrapSummary(mean=2.583..., ci_low=2.057..., ci_high=3.159..., level=0.95)
+```
+
+If you'd like cleaner, human-readable output, the `BootstrapSummary.round()` method can automatically round values to a sensible precision based on the width of the credible interval:
+
+```python
+print(summary.round())
+# BootstrapSummary(mean=2.6, ci_low=2.1, ci_high=3.2, level=0.95)
+```
+
+Here the mean estimate is about 2.6 cups per day, with a 95% credible interval of roughly [2.1, 3.2]. The uncertainty reflects how much weight each observation could carry in the population, not variation from resampling data points.
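Under the hood, each replicate draws a fresh weight vector from a uniform Dirichlet distribution and applies the weighted statistic. A rough NumPy-only sketch of that idea, reusing the `coffee` array and the `numpy` import from above (an illustration of the principle, not the package's actual implementation):

```python
rng = np.random.default_rng(1)

# One Dirichlet(1, ..., 1) weight vector per replicate; each row sums to 1
weights = rng.dirichlet(np.ones(coffee.size), size=2000)

# Weighted mean per replicate, then an equal-tailed 95% interval
replicates = weights @ coffee
print(replicates.mean(), np.percentile(replicates, [2.5, 97.5]))
```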
+
+## Bootstrapping a quantile
+
+You can use any of the built-in weighted statistics the same way. For example, let's estimate the 90th percentile of the same dataset:
+
+```python
+distribution = bootstrap(
+    data=coffee,
+    statistic_fn="quantile",
+    fn_kwargs={"quantile": 0.9},
+    seed=1,
+)
+
+summary = distribution.summarize(level=0.95)
+print(summary)
+# BootstrapSummary(mean=3.28, ci_low=2.85, ci_high=3.81, level=0.95)
+```
+
+The bootstrapped 0.9 quantile is around 3.3 cups, meaning that about 90% of coffee drinkers in this sample consume 3.3 or fewer cups per day.
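For comparison, the plain unweighted sample quantile is a single number with no uncertainty attached; with NumPy's default interpolation it should land close to the posterior mean above:

```python
# Point estimate only, no credible interval
print(np.quantile(coffee, 0.9))  # roughly 3.4
```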
+
+## Bivariate example: dependence between variables
+
+For bivariate data, `bbstat` includes functions such as `"pearson_dependence"` (weighted correlation) and `"mutual_information"` (a nonlinear dependence measure).
+
+Let's look at the relationship between study time and exam score:
+
+```python
+# Simulated data: study hours vs exam scores
+study_hours = np.array([2, 3, 4, 5, 6, 8, 9])
+exam_scores = np.array([60, 65, 70, 72, 78, 85, 90])
+
+data = (study_hours, exam_scores)
+
+# Weighted Pearson correlation via Bayesian bootstrapping
+distribution = bootstrap(data=data, statistic_fn="pearson_dependence", n_boot=2000, seed=1)
+summary = distribution.summarize(level=0.95).round()
+print(summary)
+# BootstrapSummary(mean=0.9969, ci_low=0.9911, ci_high=0.9992, level=0.95)
+```
+
+This shows a strong positive correlation, and the credible interval indicates high confidence that the true correlation is above 0.99. You could switch to `"mutual_information"` to estimate a nonlinear dependence instead.
+
+## Writing your own weighted statistic
+
+Defining a custom statistic is simple. All functions used with `bootstrap()` must follow this signature:
+
+```python
+def custom_statistic(data, weights, **kwargs) -> float:
+    ...
+```
+
+Here's an example that implements a **weighted geometric mean**, which is not (yet) included among the built-ins but demonstrates how to use the weights properly.
+
+For a set of positive numbers \(x_1, x_2, \dots, x_n > 0\) with associated weights
+\(w_1, w_2, \dots, w_n\) such that \(w_i \ge 0\) and \(\sum_{i=1}^n w_i = 1\),
+the **weighted geometric mean** is defined as:
+
+\[
+\text{GM}_w = \prod_{i=1}^{n} x_i^{w_i} = \exp\Bigg( \sum_{i=1}^{n} w_i \ln x_i \Bigg)
+\]
+
+In the Bayesian bootstrap, the weights \(w_i\) are drawn from a Dirichlet distribution:
+
+\[
+(w_1, \dots, w_n) \sim \text{Dirichlet}(\alpha_1=1, \dots, \alpha_n=1)
+\]
+
+Each bootstrap replicate computes:
+
+\[
+\text{GM}_\text{replicate} = \exp\Bigg( \sum_{i=1}^{n} w_i \ln x_i \Bigg)
+\]
+
+Repeating this for many replicates produces a posterior-like distribution of the geometric mean.
+
+```python
+def weighted_geometric_mean(data, weights):
+    """Compute the weighted geometric mean."""
+    data = np.asarray(data)
+    weights = np.asarray(weights)
+    # Avoid log(0): require positive data
+    if np.any(data <= 0):
+        raise ValueError("Geometric mean requires positive data.")
+    log_mean = np.sum(weights * np.log(data))
+    return np.exp(log_mean)
+
+
+data = np.array([1.2, 1.5, 2.0, 2.8, 3.1])
+distribution = bootstrap(data=data, statistic_fn=weighted_geometric_mean, n_boot=1500, seed=1)
+summary = distribution.summarize().round()
+print(summary)
+# BootstrapSummary(mean=2.01, ci_low=1.58, ci_high=2.48, level=0.87)
+```
+
+The same pattern applies if your statistic takes multiple arrays (e.g., `(x, y)`). The function receives the data and weights, computes its result, and returns a single float.
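A sketch of what such a bivariate statistic can look like, using a weighted covariance (not a built-in; the name and formula are chosen here for illustration):

```python
def weighted_covariance(data, weights):
    """Weighted covariance between two paired 1D arrays."""
    x, y = (np.asarray(a, dtype=float) for a in data)
    weights = np.asarray(weights)
    mean_x = np.sum(weights * x)
    mean_y = np.sum(weights * y)
    return float(np.sum(weights * (x - mean_x) * (y - mean_y)))
```

It can then be passed as `statistic_fn` with `data=(x, y)`, exactly like the univariate example above.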
+
+## Common questions and pitfalls
+
+- **Why are the credible intervals sometimes narrow?**
+  Bayesian bootstrapping assumes that the observed data already represent the full population support. Uncertainty is only about how much weight each observation should get, not about unseen data. If the sample is small or has heavy tails, results can appear overconfident.
+- **Can I get negative weights or resampled data?**
+  No. Weights are drawn from a uniform Dirichlet distribution, so they're always non-negative and sum to one. This approach replaces the random resampling in the classical bootstrap, rather than supplementing it.
+- **What if my statistic ignores the weights?**
+  Then it is not a Bayesian bootstrap anymore and you are just re-evaluating the same statistic repeatedly. Always make sure your custom statistic uses the provided weights.
+- **What if my data contain zeros or negative values?**
+  That's fine for most statistics, but not all (the geometric mean above is a case in point). Handle such cases carefully or filter the data before applying those statistics.
+- **Can I use `bbstat` for regression or multivariate models?**
+  Yes, as long as your statistic can be written as a weighted function of the data, for example a weighted regression slope or a loss-function summary (see the sketch after this list). The Bayesian bootstrap does not assume any specific model form.
+- **How does rounding work in `summarize()`?**
+  When you call `BootstrapSummary.round()`, the method automatically picks a decimal precision suitable for the width of the credible interval so that the displayed digits reflect the level of uncertainty. You can also set a fixed precision manually if you prefer.
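As an example of the regression case mentioned in the list, here is a sketch of a weighted least-squares slope used as a custom statistic (illustrative only; it reuses the `study_hours` and `exam_scores` arrays from earlier on this page):

```python
def weighted_slope(data, weights):
    """Weighted least-squares slope of y on x."""
    x, y = (np.asarray(a, dtype=float) for a in data)
    weights = np.asarray(weights)
    mean_x = np.sum(weights * x)
    mean_y = np.sum(weights * y)
    cov_xy = np.sum(weights * (x - mean_x) * (y - mean_y))
    var_x = np.sum(weights * (x - mean_x) ** 2)
    return float(cov_xy / var_x)

distribution = bootstrap(
    data=(study_hours, exam_scores),
    statistic_fn=weighted_slope,
    n_boot=2000,
    seed=1,
)
print(distribution.summarize(level=0.95).round())
```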

docs/index.md

Lines changed: 1 addition & 76 deletions
@@ -1,76 +1 @@
-# bbstat
-
-Welcome to **bbstat**, a lightweight library for Bayesian bootstrapping and statistical evaluation.
-
-## Features
-
-- Bayesian bootstrap resampling
-- Compute weighted statistics
-- Evaluate uncertainty via credible intervals
-- Easy-to-use and extensible
-
-## Installation
-
-Installation from PyPi:
-
-```bash
-pip install bbstat
-```
-
-Installation from GitHub source code:
-
-```bash
-git clone https://github.com/cwehmeyer/bbstat.git
-cd bbstat
-pip install .
-```
-
-### Optional Extras
-
-This package includes optional dependencies for development, testing, and documentation. To install them from GitHub source:
-
-- For development:
-
-```bash
-pip install '.[dev]'
-```
-
-- For testing:
-
-```bash
-pip install '.[test]'
-```
-
-- For documentation:
-
-```bash
-pip install '.[docs]'
-```
-
-## Getting started
-
-```python
-import numpy as np
-from bbstat import bootstrap
-
-# Data preparation: simulated income for a small population (e.g., a survey of 25 people)
-income = np.array([
-    24_000, 26_000, 28_000, 30_000, 32_000,
-    35_000, 36_000, 38_000, 40_000, 41_000,
-    45_000, 48_000, 50_000, 52_000, 54_000,
-    58_000, 60_000, 62_000, 65_000, 68_000,
-    70_000, 75_000, 80_000, 90_000, 100_000,
-], dtype=np.float64)
-
-# Direct estimate of mean income
-print(np.mean(income)) # => 52280.0
-
-# Bootstrapped distribution of the mean income.
-distribution = bootstrap(data=income, statistic_fn="mean", seed=1)
-print(distribution) # => BootstrapDistribution(mean=52263.8..., size=1000)
-
-# Summarize the bootstrapped distribution of the mean income.
-summary = distribution.summarize(level=0.87)
-print(summary) # => BootstrapSummary(mean=52263.8..., ci_low=46566.8..., ci_high=58453.6..., level=0.87)
-print(summary.round()) # => BootstrapSummary(mean=52000.0, ci_low=47000.0, ci_high=58000.0, level=0.87)
-```
+{% include-markdown "../README.md" %}
