# Getting Started

This guide shows how to use `bbstat` to perform Bayesian bootstrapping with both built-in and custom statistics.

We'll start with a quick example on univariate data, move on to a bivariate case, and then show how to write your own weighted statistic. The goal is to help you see how Bayesian bootstrapping works in practice.

## Installation

You can install `bbstat` from PyPI:

```bash
pip install bbstat
```

Then import what you need:

```python
import numpy as np
from bbstat import bootstrap
```

## Bootstrapping a simple statistic

Let's start with something familiar: estimating the mean of a small dataset. We'll use the Bayesian bootstrap to quantify uncertainty in that mean.

```python
# Sample data: daily coffee consumption (in cups) from a small survey
coffee = np.array([2.0, 3.0, 1.5, 2.5, 3.0, 2.0, 4.0])

# Run the Bayesian bootstrap with 2000 Dirichlet-weighted replicates
distribution = bootstrap(data=coffee, statistic_fn="mean", n_boot=2000, seed=1)

# Summarize the distribution as a posterior mean and 95% credible interval
summary = distribution.summarize(level=0.95)
print(summary)
# BootstrapSummary(mean=2.583..., ci_low=2.057..., ci_high=3.159..., level=0.95)
```

If you'd like cleaner, human-readable output, the `BootstrapSummary.round()` method can automatically round values to a sensible precision based on the width of the credible interval:

```python
print(summary.round())
# BootstrapSummary(mean=2.6, ci_low=2.1, ci_high=3.2, level=0.95)
```

Here the mean estimate is about 2.6 cups per day, with a 95% credible interval of roughly [2.1, 3.2]. The uncertainty reflects variation in the weight each observation could carry in the population, not variation from resampled data points.
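
To make the mechanics concrete, here is a small numpy-only sketch of what a Bayesian bootstrap of the mean does conceptually (this is an illustration, not `bbstat`'s actual internals): each replicate draws a weight vector from a uniform Dirichlet distribution and evaluates the weighted mean under those weights.

```python
import numpy as np

# Conceptual sketch of Bayesian bootstrapping the mean (not bbstat's
# actual internals): draw uniform Dirichlet weight vectors and compute
# one weighted mean per replicate.
rng = np.random.default_rng(1)
coffee = np.array([2.0, 3.0, 1.5, 2.5, 3.0, 2.0, 4.0])

n_boot = 2000
# Each row is one weight vector; entries are non-negative and sum to 1.
weights = rng.dirichlet(np.ones(len(coffee)), size=n_boot)
replicates = weights @ coffee  # one weighted mean per replicate

mean = replicates.mean()
ci_low, ci_high = np.quantile(replicates, [0.025, 0.975])
```

The resulting `mean` and interval land close to the `bbstat` summary above, since both follow the same recipe: randomize the weights, never the data points.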

## Bootstrapping a quantile

You can use any of the built-in weighted statistics the same way. For example, let's estimate the 90th percentile of the same dataset:

```python
distribution = bootstrap(
    data=coffee,
    statistic_fn="quantile",
    fn_kwargs={"quantile": 0.9},
    seed=1,
)

summary = distribution.summarize(level=0.95)
print(summary)
# BootstrapSummary(mean=3.28, ci_low=2.85, ci_high=3.81, level=0.95)
```

The bootstrapped 0.9 quantile is around 3.3 cups, meaning that roughly 90% of respondents in this sample consume about 3.3 cups or fewer per day.
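
If you're curious what a weighted quantile looks like under the hood, here is one common definition, written with numpy only (`bbstat`'s exact implementation may differ, e.g. in interpolation behavior): sort the data, accumulate the weights, and return the first value whose cumulative weight reaches the target quantile.

```python
import numpy as np

# One common weighted-quantile definition (bbstat's exact version may
# differ): the smallest sorted value whose cumulative weight >= quantile.
def weighted_quantile(data, weights, quantile):
    order = np.argsort(data)
    sorted_data = np.asarray(data)[order]
    cum_weights = np.cumsum(np.asarray(weights)[order])
    idx = np.searchsorted(cum_weights, quantile)
    return sorted_data[min(idx, len(sorted_data) - 1)]

coffee = np.array([2.0, 3.0, 1.5, 2.5, 3.0, 2.0, 4.0])
uniform = np.full(len(coffee), 1 / len(coffee))
print(weighted_quantile(coffee, uniform, 0.9))  # 4.0 with uniform weights
```

Under Dirichlet weights, each replicate shifts the cumulative-weight curve, which is exactly what spreads the bootstrap distribution of the quantile.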

## Bivariate example: dependence between variables

For bivariate data, bbstat includes functions such as `"pearson_dependency"` (weighted correlation) and `"mutual_information"` (a nonlinear dependence measure).

Let's look at the relationship between study time and exam score:

```python
# Simulated data: study hours vs exam scores
study_hours = np.array([2, 3, 4, 5, 6, 8, 9])
exam_scores = np.array([60, 65, 70, 72, 78, 85, 90])

data = (study_hours, exam_scores)

# Weighted Pearson correlation via Bayesian bootstrapping
distribution = bootstrap(data=data, statistic_fn="pearson_dependency", n_boot=2000, seed=1)
summary = distribution.summarize(level=0.95).round()
print(summary)
# BootstrapSummary(mean=0.9969, ci_low=0.9911, ci_high=0.9992, level=0.95)
```

This shows a strong positive correlation, and the credible interval indicates high confidence that the true correlation is above 0.99. You could switch to `"mutual_information"` to estimate a nonlinear dependency instead.
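
The formula behind a weighted Pearson correlation is worth seeing once; the sketch below is the standard textbook version, not necessarily identical to what `"pearson_dependency"` does internally. With uniform weights it reduces exactly to the ordinary Pearson correlation.

```python
import numpy as np

# Standard weighted Pearson correlation (bbstat's implementation details
# may differ): weighted covariance over the product of weighted std devs.
def weighted_pearson(x, y, weights):
    w = np.asarray(weights) / np.sum(weights)
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    var_x = np.sum(w * (x - mx) ** 2)
    var_y = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)

study_hours = np.array([2, 3, 4, 5, 6, 8, 9])
exam_scores = np.array([60, 65, 70, 72, 78, 85, 90])
uniform = np.full(len(study_hours), 1 / len(study_hours))
print(weighted_pearson(study_hours, exam_scores, uniform))
```

Because the normalization constants cancel in the ratio, the uniform-weights case matches `np.corrcoef` to floating-point precision.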

## Writing your own weighted statistic

Defining a custom statistic is simple. All functions used with `bootstrap()` must follow this signature:

```python
def custom_statistic(data, weights, **kwargs) -> float:
    ...
```

Here's an example that implements a **weighted geometric mean**, which is not (yet) included among the built-ins but demonstrates how to use the weights properly.

For a set of positive numbers \(x_1, x_2, \dots, x_n > 0\) with associated weights
\(w_1, w_2, \dots, w_n\) such that \(w_i \ge 0\) and \(\sum_{i=1}^n w_i = 1\),
the **weighted geometric mean** is defined as:

\[
\text{GM}_w = \prod_{i=1}^{n} x_i^{w_i} = \exp\Bigg( \sum_{i=1}^{n} w_i \ln x_i \Bigg)
\]

In the Bayesian bootstrap, the weights \(w_i\) are drawn from a Dirichlet distribution:

\[
(w_1, \dots, w_n) \sim \text{Dirichlet}(\alpha_1=1, \dots, \alpha_n=1)
\]

Each bootstrap replicate computes:

\[
\text{GM}_\text{replicate} = \exp\Bigg( \sum_{i=1}^{n} w_i \ln x_i \Bigg)
\]

Repeating this for many replicates produces a posterior-like distribution of the geometric mean.

```python
def weighted_geometric_mean(data, weights):
    """Compute the weighted geometric mean."""
    data = np.asarray(data)
    weights = np.asarray(weights)
    # Avoid log(0): require positive data
    if np.any(data <= 0):
        raise ValueError("Geometric mean requires positive data.")
    log_mean = np.sum(weights * np.log(data))
    return np.exp(log_mean)


data = np.array([1.2, 1.5, 2.0, 2.8, 3.1])
distribution = bootstrap(data=data, statistic_fn=weighted_geometric_mean, n_boot=1500, seed=1)
summary = distribution.summarize().round()
print(summary)
# BootstrapSummary(mean=2.01, ci_low=1.58, ci_high=2.48, level=0.87)
```

The same pattern applies if your statistic takes multiple arrays (e.g., `(x, y)`): the function receives the data and the weights, computes its result, and returns a single float.
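
As a concrete instance of that multi-array pattern, here is a hypothetical custom bivariate statistic, a weighted covariance, that receives its data as a tuple of two arrays (the function name and data layout are illustrative, mirroring the tuple form used in the bivariate example above):

```python
import numpy as np

# Hypothetical custom bivariate statistic: weighted covariance.
# The data argument is a tuple (x, y), as in the bivariate example.
def weighted_covariance(data, weights):
    x, y = (np.asarray(a, dtype=float) for a in data)
    w = np.asarray(weights) / np.sum(weights)
    mx, my = np.sum(w * x), np.sum(w * y)
    return float(np.sum(w * (x - mx) * (y - my)))

x = np.array([2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 2.5, 4.0])
uniform = np.full(len(x), 0.25)
print(weighted_covariance((x, y), uniform))  # 1.1875
```

With uniform weights this matches the biased sample covariance; under Dirichlet weights, each bootstrap replicate perturbs it into a distribution.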

## Common questions and pitfalls

- **Why are the credible intervals sometimes narrow?**
  Bayesian bootstrapping assumes that the observed data already represent the full population support. Uncertainty is only about how much weight each observation should get, not about unseen data. If the sample is small or has heavy tails, results can appear overconfident.
- **Can I get negative weights or resampled data?**
  No. Weights are drawn from a uniform Dirichlet distribution, so they're always non-negative and sum to one. This approach replaces the random resampling of the classical bootstrap rather than supplementing it.
- **What if my statistic ignores the weights?**
  Then it's no longer a Bayesian bootstrap: you're just re-evaluating the same statistic on the same data over and over. Always make sure your custom statistic uses the provided weights.
- **What if my data contain zeros or negative values?**
  That's fine for most statistics, but not all (the geometric mean above is a case in point). Handle such cases carefully or filter the data before applying those statistics.
- **Can I use bbstat for regression or multivariate models?**
  Yes, as long as your statistic can be written as a weighted function of the data, for example a weighted regression slope or a loss-function summary. The Bayesian bootstrap does not assume any specific model form.
- **How does rounding work in `summarize()`?**
  When you call `BootstrapSummary.round()`, the method automatically picks a decimal precision suitable for the width of the credible interval so that the displayed digits reflect the level of uncertainty. You can also set a fixed precision manually if you prefer.
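
To illustrate the regression case mentioned above, a weighted least-squares slope for a single predictor can be written as a custom statistic in the same `(data, weights)` shape (the name is illustrative; this is plain weighted least squares, not a bbstat built-in):

```python
import numpy as np

# Illustrative custom statistic: weighted least-squares slope for one
# predictor, usable with bootstrap() via the (data, weights) signature.
def weighted_slope(data, weights):
    x, y = (np.asarray(a, dtype=float) for a in data)
    w = np.asarray(weights) / np.sum(weights)
    mx, my = np.sum(w * x), np.sum(w * y)
    return float(np.sum(w * (x - mx) * (y - my)) / np.sum(w * (x - mx) ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.0])
uniform = np.full(len(x), 0.25)
print(weighted_slope((x, y), uniform))  # close to 2.0
```

Bootstrapping this statistic yields a distribution over slopes, giving you a credible interval for the regression coefficient without assuming Gaussian errors.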