Merged
Binary file added .DS_Store
Binary file not shown.
44 changes: 18 additions & 26 deletions .github/workflows/ci.yml
@@ -1,44 +1,36 @@
-name: Conda CI
+name: CI

 on: [push, pull_request]

 jobs:
-  build:
+  test:
     runs-on: ubuntu-latest

     steps:
-      - name: Check out repository code
-        uses: actions/checkout@v2
+      - name: Checkout repository
+        uses: actions/checkout@v4

-      - name: Set up Conda
-        uses: conda-incubator/setup-miniconda@v2
+      - name: Set up Python
+        uses: actions/setup-python@v5
         with:
-          auto-activate-base: false
+          python-version: "3.10"

-      - name: Create Conda environment
-        run: conda env create --file env.yml
-
-      - name: Activate Conda environment and install dependencies
+      - name: Upgrade pip
         run: |
-          source $CONDA/bin/activate catomatic
-          pip install -e .
+          python -m pip install --upgrade pip

-      - name: Verify Conda environment
+      - name: Install package + dev dependencies
         run: |
-          source $CONDA/bin/activate catomatic
-          conda info --all
-          conda list
-
-      - name: Set PYTHONPATH
-        run: echo "PYTHONPATH=$PYTHONPATH:$(pwd)/src" >> $GITHUB_ENV
+          pip install .[dev]

-      - name: Run Pytest and Coverage
+      - name: Run tests with coverage
         run: |
-          source $CONDA/bin/activate catomatic
-          pytest --cov=catomatic src/tests/ --cov-report=xml
+          pytest src/tests/ \
+            --cov=catomatic \
+            --cov-report=xml

-      - name: Upload Coverage to Codecov
-        uses: codecov/codecov-action@v2
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v4
         with:
-          files: ./coverage.xml
+          files: coverage.xml
           token: ${{ secrets.CODECOV_TOKEN }}
24 changes: 24 additions & 0 deletions .github/workflows/mypy.yml
@@ -0,0 +1,24 @@
+name: mypy
+
+on: [push, pull_request]
+
+jobs:
+  type-check:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repo
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
+      - name: Install package + dev deps
+        run: |
+          pip install .[dev]
+
+      - name: Run MyPy
+        run: |
+          mypy src/catomatic/. --pretty
151 changes: 138 additions & 13 deletions README.md
@@ -1,9 +1,12 @@
[![codecov](https://codecov.io/gh/fowler-lab/catomatic/branch/ecoff/graph/badge.svg?token=8fnOy6rHCd)](https://codecov.io/gh/fowler-lab/catomatic) [![DOI](https://zenodo.org/badge/801462003.svg)](https://doi.org/10.5281/zenodo.14917920)


# catomatic

Python code that algorithmically builds antimicrobial resistance catalogues of mutations.
catomatic is a Python toolkit for algorithmically constructing antimicrobial resistance (AMR) mutation catalogues directly from variant calls generated by read mapping. Rather than relying on alignment-level pattern matching or predefined resistance motifs, the tool infers resistance associations statistically from observed genotype–phenotype relationships, supporting both binary frequentist and regression-based modelling approaches.

This design is particularly well suited to Mycobacterium species, where resistance is primarily driven by chromosomal point mutations, indels, and complex multi-locus interactions, and where horizontal gene transfer is rare. By operating on mapped mutation data rather than alignment outputs, the framework enables transparent evidence tracking, flexible statistical testing, and reproducible catalogue construction tailored to the evolutionary and genomic characteristics of mycobacteria.

For alignment-based approaches, see AMRverse.

## Introduction

@@ -12,7 +15,7 @@ This repo contains 2 approaches to build resistance catalogues:
1. **Definite defectives (solo-based approach)**
2. **Interval regression**

The first is used in [https://doi.org/10.1101/2025.01.30.635633](https://doi.org/10.1101/2025.01.30.635633), and the second is a Python translation of the method used in [https://doi.org/10.1038/s41467-023-44325-5](https://doi.org/10.1038/s41467-023-44325-5), but is still under development.
The first is used in [https://doi.org/10.1101/2025.01.30.635633](https://doi.org/10.1101/2025.01.30.635633), and the second is a Python translation of the method used in [https://doi.org/10.1038/s41467-023-44325-5](https://doi.org/10.1038/s41467-023-44325-5).

---

@@ -52,20 +55,28 @@ Contingency tables, proportions, p-values, and Wilson confidence intervals are s

## Regression Builder

This method is under development and will be released soon with accompanying documentation.
The Regression Builder implements a mixed-effects interval regression approach to catalogue construction that generates predicted mean MICs. It is suitable when the phenotypes are censored or uncensored MICs.
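
Interval regression treats each MIC as a range rather than a point. A minimal sketch of that idea follows, assuming MICs arrive as strings with optional `<=` or `>` prefixes and that an exact reading means growth stopped somewhere between the previous dilution and the reported one. The function name `mic_to_log2_interval` is hypothetical and is not part of catomatic, whose own handling (including `--censored` and `--tail_dilutions`) may differ.

```python
import math


def mic_to_log2_interval(mic: str, dilution_factor: float = 2.0):
    """Return (lower, upper) bounds on log2(MIC) for one reading.

    Hypothetical helper for illustration only; not catomatic's API.
    """
    mic = mic.strip()
    if mic.startswith("<="):
        # Left-censored: the true MIC is at or below the lowest concentration.
        return (-math.inf, math.log2(float(mic[2:])))
    if mic.startswith(">"):
        # Right-censored: growth at the highest concentration tested.
        return (math.log2(float(mic[1:])), math.inf)
    value = float(mic)
    # Interval-censored: between the previous dilution and the reported one.
    return (math.log2(value / dilution_factor), math.log2(value))


for reading in ["<=0.25", "1", ">8"]:
    print(reading, mic_to_log2_interval(reading))
```

Working on the log2 scale makes a doubling dilution series evenly spaced, which is why the bounds differ by exactly one unit for an uncensored reading when the dilution factor is 2.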

If whole-genome SNPs are provided, agglomerative clustering can assign samples to clusters that enter the model as random effects, controlling for population structure. Any number of fixed effects (such as lineage or lab) can also be defined by supplying additional input columns.
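
To illustrate how a distance threshold turns pairwise SNP distances into cluster labels, here is a small pure-Python sketch. It assumes single-linkage clustering, which is equivalent to joining every pair of samples whose distance is at or below the threshold. The linkage actually used by catomatic is not stated here, and `cluster_by_snp_distance` is a hypothetical name.

```python
def cluster_by_snp_distance(ids, distances, threshold=1.0):
    """Assign a cluster label to each sample id.

    `distances` maps (id_a, id_b) pairs to SNP distances. Samples are merged
    whenever a pair is within `threshold` (single linkage, assumed).
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= threshold:
            parent[find(a)] = find(b)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return {i: labels.setdefault(find(i), len(labels)) for i in ids}


# Toy distances, invented for illustration.
toy_ids = ["s1", "s2", "s3", "s4"]
toy_distances = {
    ("s1", "s2"): 1, ("s2", "s3"): 1, ("s1", "s3"): 2,
    ("s1", "s4"): 10, ("s2", "s4"): 9, ("s3", "s4"): 11,
}
print(cluster_by_snp_distance(toy_ids, toy_distances, threshold=1))
```

The resulting label per sample is the kind of grouping variable a mixed-effects model can use as a random effect.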

As with the BinaryBuilder, catalogues can be exported as JSON objects or Piezo-compatible tables.

---

## Installation

### Using Conda
### Installation from source

We recommend using Conda for environment and dependency management.
From the project directory (after cloning the repository):

```bash
conda env create -f env.yml
conda activate catomatic
pip install .
pip install -e .
```

### PyPI installation

```bash
pip install catomatic
```

## Running catomatic's Binary Builder
@@ -75,7 +86,7 @@ You need two input DataFrames:
- **Samples**: one row per sample, with 'R' or 'S' phenotypes (`UNIQUEID`, `PHENOTYPE`)
- **Mutations**: one row per mutation per sample (`UNIQUEID`, `MUTATION`)

If exporting to Piezo format:
If exporting to Piezo format (`--to_piezo`):

- The `MUTATION` column must follow GARC1 grammar (`gene@mutation`)
- A path to a `wildcards.json` file (containing mutation rules) must be provided
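
As a rough illustration of the two input tables, the sketch below writes minimal CSV text with the column names listed above, using only the standard library. The sample IDs and mutations are invented toy values, and `write_csv` is a hypothetical helper, not part of catomatic.

```python
import csv
import io


def write_csv(rows, fieldnames):
    """Serialise a list of dict rows to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One row per sample, with an 'R' or 'S' phenotype (toy data).
samples_csv = write_csv(
    [
        {"UNIQUEID": "sample1", "PHENOTYPE": "R"},
        {"UNIQUEID": "sample2", "PHENOTYPE": "S"},
    ],
    fieldnames=["UNIQUEID", "PHENOTYPE"],
)

# One row per mutation per sample, written as gene@mutation (toy data).
mutations_csv = write_csv(
    [
        {"UNIQUEID": "sample1", "MUTATION": "rpoB@S450L"},
        {"UNIQUEID": "sample2", "MUTATION": "gyrA@E21Q"},
    ],
    fieldnames=["UNIQUEID", "MUTATION"],
)

print(samples_csv)
print(mutations_csv)
```

Saving these strings to files gives inputs of the shape expected by `--samples` and `--mutations`.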
@@ -118,7 +129,7 @@ After installation, the simplest way to run the catomatic catalogue builder is v
#### Export to JSON

```bash
python -m catomatic binary \
catomatic binary \
--samples path/to/samples.csv \
--mutations path/to/mutations.csv \
--to_json \
@@ -128,7 +139,7 @@ python -m catomatic binary \
#### Export to Piezo format

```bash
python -m catomatic binary \
catomatic binary \
--samples path/to/samples.csv \
--mutations path/to/mutations.csv \
--to_piezo \
@@ -160,14 +171,128 @@ python -m catomatic binary \
| `--tails` | `str` | Tail type for statistical test. One of: `one`, `two`. Optional. Defaults to `two`. |
| `--strict_unlock` | `flag` | If set, disables classification of susceptible (`S`) mutations unless statistically confident. |

## Running catomatic's Regression Builder

You need two input DataFrames:

- **Samples**: one row per sample, with an MIC column (`UNIQUEID`, `MIC`)
- **Mutations**: one row per mutation per sample (`UNIQUEID`, `MUTATION`)

If exporting to Piezo format (`--to_piezo`):

- The `MUTATION` column must follow GARC1 grammar (`gene@mutation`)
- A path to a `wildcards.json` file (containing mutation rules) must be provided
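
Before a Piezo export it can help to confirm that every entry in the `MUTATION` column at least has the `gene@mutation` shape. The check below is a deliberately loose sketch: it tests only for a single `@` with text on both sides and does not implement the full GARC1 grammar. `looks_like_garc1` is a hypothetical name, and the example mutations are illustrative.

```python
def looks_like_garc1(mutation: str) -> bool:
    """Loose shape check for 'gene@mutation' (not a full GARC1 parser)."""
    gene, sep, change = mutation.partition("@")
    # Require exactly one '@' with a non-empty gene and a non-empty change.
    return sep == "@" and bool(gene) and bool(change) and "@" not in change


candidates = ["rpoB@S450L", "S450L", "rpoB@", "rpoB@S450L@extra"]
malformed = [m for m in candidates if not looks_like_garc1(m)]
print(malformed)
```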

### Python/Jupyter Example

```python
from catomatic.RegressionCatalogue import RegressionBuilder

# fit the model to generate mutation effects
model, effects = RegressionBuilder(samples=samples_df, mutations=mutations_df).predict_effects()

# classify effects and generate a catalogue (requires an ecoff)
catalogue = RegressionBuilder(samples=samples_df, mutations=mutations_df).build(ecoff=1.0)

# View dictionary version
cat_dict = catalogue.return_catalogue()

# Convert to Piezo-compatible format
catalogue_df = catalogue.build_piezo(
genbank_ref='...',
catalogue_name='...',
version='...',
drug='...',
wildcards='path/to/wildcards.json'
)

# Optionally export to CSV
catalogue.to_piezo(
genbank_ref='...',
catalogue_name='...',
version='...',
drug='...',
wildcards='path/to/wildcards.json',
outfile='path/to/output.csv'
)
```

### CLI

As with BinaryBuilder, RegressionBuilder can be run from the command line:

#### Export to JSON

```bash
catomatic regression \
--samples path/to/samples.csv \
--mutations path/to/mutations.csv \
--ecoff 1.0 \
--to_json \
--outfile path/to/output/catalogue.json
```

#### Export to Piezo format

```bash
catomatic regression \
--samples path/to/samples.csv \
--mutations path/to/mutations.csv \
--ecoff 1.0 \
--to_piezo \
--outfile path/to/output/catalogue.csv \
--genbank_ref '...' \
--catalogue_name '...' \
--version '...' \
--drug '...' \
--wildcards path/to/wildcards.json
```

### CLI Parameters (Regression Builder)

| Parameter | Type | Description & default |
| -------------------- | ------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `--samples` | `str` | Path to the samples file (CSV). **Required**. |
| `--mutations` | `str` | Path to the mutations file (CSV). **Required**. |
| `--genes` | `str[]` | List of RAV genes. Required when non-RAV genes appear in the mutations table (e.g. when clustering SNP distances). Default: `[]`. |
| `--dilution_factor` | `int` | Dilution factor used in processing. Default: `2`. |
| `--censored` | `flag` | Treat phenotype data as censored. Default: `False`. |
| `--tail_dilutions` | `int` | Tail dilutions to use for uncensored data. Default: `1`. |
| `--frs` | `float` | Fraction Read Support threshold. Default: `None`. |
| `--ecoff` | `float` | Epidemiological cutoff value for classification. If `None`, it will be computed. Default: `None`. |
| `--b_bounds` | `float,float` | Bounds for beta (fixed-effect) coefficients. Two floats: `(min max)`. Default: `(None, None)`. |
| `--u_bounds` | `float,float` | Bounds for random-effect coefficients. Two floats: `(min max)`. Default: `(None, None)`. |
| `--s_bounds` | `float,float` | Bounds for sigma (residual variance). Two floats: `(min max)`. Default: `(None, None)`. |
| `--p` | `float` | Significance / confidence level. Default: `0.95`. |
| `--fixed_effects` | `str[]` | Column names to include as fixed effects. Default: `None`. |
| `--random_effects` | `flag` | Perform SNP clustering and include cluster as a random effect. Default: `False`. |
| `--cluster_distance` | `float` | Distance threshold for SNP clustering. Default: `1`. |
| `--outfile` | `str` | Path to save output JSON or Piezo file. Required with `--to_json` or `--to_piezo`. |
| `--options` | `dict` | Options passed to `scipy.optimize.minimize`. Default: `None`. |
| `--L2_penalties` | `dict` | Regularisation penalties for fixed and random effects. Default: `None`. |
| `--to_json` | `flag` | Export the resulting catalogue to JSON format. |
| `--to_piezo` | `flag` | Export the resulting catalogue to Piezo-compatible CSV format. |
| `--genbank_ref` | `str` | GenBank reference string for Piezo export. Required with `--to_piezo`. |
| `--catalogue_name` | `str` | Name of the catalogue. Required with `--to_piezo`. |
| `--version` | `str` | Catalogue version. Required with `--to_piezo`. |
| `--drug` | `str` | Drug associated with the mutations. Required with `--to_piezo`. |
| `--wildcards` | `str` | Path to JSON file containing wildcard mutation rules. Required with `--to_piezo`. |
| `--grammar` | `str` | Grammar used in the catalogue. Default: `GARC1`. |
| `--values` | `str` | Values used for predictions in the catalogue. Default: `RUS`. |
| `--for_piezo` | `flag` | If set, enables Piezo-specific placeholder rows. Omit if not exporting to Piezo. Default: `False`. |

### Notes

- When using post-hoc rule updates via `.update()`, you must provide wildcards and set `replace=True` if you intend to override existing entries.
- For Piezo export, placeholder entries are inserted automatically if needed to satisfy parser requirements (R, S, and U must be represented).
- The EVIDENCE column includes contingency tables, proportions, confidence intervals, and p-values, and may optionally include sample IDs if `record_ids=True`.
- Building a catalogue with the regression builder, as currently implemented, requires an ecoff, because the predicted effects are compared against the background to assign an R/S/U label.
- To calculate predicted effects only, call `RegressionBuilder.predict_effects()` in Python.
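
For readers unfamiliar with the Wilson confidence intervals recorded in the binary builder's evidence, the standard Wilson score interval for a proportion can be sketched as below. This is the textbook formula, not code taken from catomatic; the function name and the default two-sided 95% critical value are assumptions made for the example.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion.

    `z` is the standard normal critical value (default: two-sided 95%).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z * z / total
    centre = (p_hat + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total))
    ) / denom
    return centre - half_width, centre + half_width


# Toy example: a mutation seen in 10 samples, 8 of them resistant.
lower, upper = wilson_interval(8, 10)
print(f"{lower:.3f} to {upper:.3f}")
```

Unlike the simple normal approximation, the Wilson interval stays within 0 and 1 and behaves sensibly for the small counts typical of rare mutations.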

## Citation

If you use catomatic in your research, please cite:

- https://doi.org/10.1101/2025.01.30.635633
- https://doi.org/10.1099/mgen.0.001429