MicroSplit


Overview

This repository implements MicroSplit for Cell Painting data, providing modular workflows optimized for large-scale image analysis and HPC environments. It extends the original MicroSplit-reproducibility repository with data handling, metadata integration, and scalable processing pipelines for Cell Painting.

What is MicroSplit?

MicroSplit is a deep learning-based computational multiplexing technique that images multiple cellular structures within a single fluorescent channel, allowing increased throughput, reduced acquisition time, and lower light exposure. The method uses a hierarchical (ladder) variational autoencoder (LVAE) with lateral context and is implemented in the CAREamics library. Using MicroSplit, Cell Painting assays can be redesigned so that multiple cellular structures are imaged in a single fluorescent channel and then computationally unmixed before image analysis.
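
At a shape level, MicroSplit maps one superimposed acquisition to one image per target structure. A minimal, purely illustrative sketch of that idea (not the CAREamics API):

import numpy as np

# Illustrative only: a superimposed channel is (conceptually) the sum of the
# individual structure images, plus acquisition noise.
structures = {name: np.random.rand(512, 512) for name in ["DNA", "RNA", "ER"]}
superimposed = sum(structures.values())  # the single channel acquired at the scope

# MicroSplit learns the inverse mapping:
#   one (H, W) input channel -> a (C, H, W) stack, one image per structure.
# unmixed = trained_microsplit_model(superimposed)  # hypothetical call
# assert unmixed.shape == (3, 512, 512)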

Workflow Architecture

Phase 1: Dataset Creation

Location: src/microsplit_reproducibility/workflows/dataset_creation/

Fetches raw images from JUMP data sources, combines channels, and structures data for training:

  • JUMPDatasetBuilder: Main interface for dataset creation
  • Dataset-specific functions: create_orf_dataset(), create_crispr_dataset(), create_compound_dataset(), create_pilot_dataset()
  • Handles: AWS S3 integration via jump-portrait, metadata parsing, channel normalization, TIFF export
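
Image retrieval goes through jump-portrait. A minimal fetch sketch adapted from the jump-portrait documentation; the identifier values below are illustrative:

from jump_portrait.fetch import get_jump_image

# Illustrative identifiers; substitute values from the JUMP metadata tables.
source, batch, plate, well = "source_4", "2021_08_23_Batch7", "BR00126114", "A01"
channel, site, correction = "DNA", "1", "Orig"

img = get_jump_image(source, batch, plate, well, channel, site, correction)
print(img.shape, img.dtype)  # a single-channel 2D array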

Phase 2: Noise Model Creation

Location: src/microsplit_reproducibility/workflows/noise_model/

Loads data for CAREamics Noise2Void training:

  • load_data_for_noise_model(): Single function to prepare data from Phase 1 output
  • Integrates directly with CAREamics N2V workflow

Phase 3: Training

Location: src/microsplit_reproducibility/datasets/JUMP.py

  • get_train_val_data(): Loads Phase 1 TIFFs for model training
  • Configuration via config factories
  • Training handled by CAREamics + PyTorch Lightning

Phase 4: Prediction

Location: src/microsplit_reproducibility/workflows/prediction/, examples/2D/JUMP/HPC/

Applies trained models to generate predictions:

  • Metadata mapping and result storage
  • Prepares output for downstream image analysis with CellProfiler

Installation

Important

A GPU is required for training. Pre-trained models can be used for inference without GPU access.

Installation takes 5-10 minutes with an existing Conda/Mamba setup.

Environment Setup

  1. Create a Python environment (Python 3.10 recommended):
    mamba create -n microsplit python=3.10
    mamba activate microsplit

Tip

For Apple Silicon (M1/M2/M3), use:

CONDA_SUBDIR=osx-arm64 conda create -n microsplit python=3.10
conda activate microsplit
conda config --env --set subdir osx-arm64
  2. Install PyTorch following the official instructions for your system.

  3. Verify GPU access:

    # NVIDIA CUDA
    python -c "import torch; print([torch.cuda.get_device_properties(i) for i in range(torch.cuda.device_count())])"
    
    # Apple Silicon
    python -c "import torch; import platform; print(platform.processor() in ('arm', 'arm64') and torch.backends.mps.is_available())"
  4. Install this repository:

    git clone https://github.com/diyasrivas/JUMP-MicroSplit.git
    cd JUMP-MicroSplit
    pip install .

Quick Start

Phase 1: Create Dataset

from pathlib import Path

# Import paths are assumed from the repository layout
# (src/microsplit_reproducibility/workflows/dataset_creation/); adjust as needed.
from microsplit_reproducibility.workflows.dataset_creation import (
    Channel,
    JUMPDatasetBuilder,
    create_crispr_dataset,
    load_crispr_profiles,
    select_crispr_by_gene,
)

genes = ["TP53", "KRAS", "EGFR", "BRAF", "MYC"]
channels = ["DNA", "RNA", "ER", "AGP", "Mito"]

# JUMPDatasetBuilder is the main interface for dataset creation.
builder = JUMPDatasetBuilder(
    dataset_type="crispr",
    channels=channels,
    output_dir="./crispr_5gene_data"
)

# Fetch CRISPR perturbation profiles, then build one dataset per gene.
profiles = load_crispr_profiles(source="source_4")
for gene in genes:
    crispr_ids = select_crispr_by_gene(profiles, gene)
    create_crispr_dataset(
        crispr_ids=crispr_ids,
        channels=[Channel[ch] for ch in channels],
        output_dir=Path(f"./crispr_5gene_data/{gene}"),
        images_per_perturbation=10
    )

Phase 2: Create Noise Models

from careamics import CAREamist
from careamics.config import create_n2v_configuration

# Import path assumed from the repository layout (workflows/noise_model/).
from microsplit_reproducibility.workflows.noise_model import load_data_for_noise_model

channels = ["DNA", "RNA", "ER", "AGP", "Mito"]

for channel in channels:
    # Load single-channel data (from the Phase 1 output) for N2V training
    noise_data = load_data_for_noise_model(
        dataset_dir="./crispr_5gene_data",
        channels=[channel],
        max_images=None  # use all available images
    )
    config = create_n2v_configuration(
        experiment_name=f"crispr_{channel.lower()}_noise_model",
        data_type="array",
        axes="YX",
        patch_size=[64, 64],
        batch_size=64,
        num_epochs=10
    )

    # Train one Noise2Void model per channel and save it for Phase 3.
    careamist = CAREamist(source=config)
    careamist.train(train_source=noise_data)
    careamist.save(f"./noise_models/{channel.lower()}_n2v_model")

Phase 3: Train MicroSplit Model

from torch.utils.data import DataLoader

# Import paths below are assumed from the repository layout; adjust to the
# actual config-factory and training modules in this repository.
from microsplit_reproducibility.configs.factory import (
    get_data_configs,
    get_microsplit_parameters,
)
from microsplit_reproducibility.datasets import create_train_val_datasets
from microsplit_reproducibility.datasets.JUMP import get_train_val_data
from microsplit_reproducibility.training import train_microsplit_model

target_channels = ["DNA", "RNA", "ER", "AGP", "Mito"]
dataset_dir = "./crispr_5gene_data"
noise_models_dir = "./noise_models"

# Build train/val/test data configurations for the target channels.
train_data_config, val_data_config, test_data_config = get_data_configs(
    channel_idx_list=target_channels
)

# Experiment parameters point at the per-channel noise models from Phase 2.
experiment_params = get_microsplit_parameters(
    nm_path=noise_models_dir,
    channel_idx_list=target_channels,
    batch_size=8
)

train_dset, val_dset, test_dset, data_stats = create_train_val_datasets(
    datapath=dataset_dir,
    train_config=train_data_config,
    val_config=val_data_config,
    test_config=test_data_config,
    load_data_func=get_train_val_data
)

train_dloader = DataLoader(train_dset, batch_size=8, shuffle=True)
val_dloader = DataLoader(val_dset, batch_size=8, shuffle=False)

# Training is handled by CAREamics + PyTorch Lightning under the hood.
train_microsplit_model(
    train_dloader=train_dloader,
    val_dloader=val_dloader,
    data_stats=data_stats,
    experiment_params=experiment_params,
    num_epochs=50,
    checkpoint_dir="./checkpoints/crispr_5gene_data"
)

Phase 4: Prediction and metadata mapping

from pathlib import Path

# Import path assumed from the repository layout (workflows/prediction/).
from microsplit_reproducibility.workflows.prediction import (
    predict_and_evaluate,
    create_test_metadata_mapping,
    generate_cellprofiler_loaddata_csv,
)

dataset_dir = Path("./crispr_5gene_data")
prediction_dir = Path("./predictions/crispr_5gene_data")
channels = ["DNA", "RNA", "ER", "AGP", "Mito"]

predict_and_evaluate(
    dataset_dir=dataset_dir,
    checkpoint_dir="./checkpoints/crispr_5gene_data",
    prediction_dir=prediction_dir,
    noise_models_dir="./noise_models",
    channels=channels,
    mmse_count=50  # number of samples averaged for the MMSE estimate
)

# Map each prediction back to the original JUMP metadata.
metadata_df = create_test_metadata_mapping(
    original_metadata_csv=dataset_dir / "original_metadata.csv",
    prediction_dir=prediction_dir,
    channels=channels,
    output_csv="metadata_mapping.csv"
)

# Build the LoadData CSV that CellProfiler consumes.
generate_cellprofiler_loaddata_csv(
    metadata_mapping_df=metadata_df,
    prediction_dir=prediction_dir,
    channels=channels,
    output_csv="cellprofiler_input.csv"
)
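
For reference, CellProfiler's LoadData module expects one row per image set, with FileName_<Channel> and PathName_<Channel> columns alongside metadata columns. A minimal sketch of that layout (the exact columns written by generate_cellprofiler_loaddata_csv may differ; file names here are hypothetical):

import pandas as pd

# Standard LoadData layout: one row per site, FileName_/PathName_ per channel.
loaddata = pd.DataFrame([{
    "Metadata_Well": "A01",
    "Metadata_Site": "1",
    "FileName_DNA": "A01_s1_DNA.tiff",
    "PathName_DNA": "./predictions/crispr_5gene_data",
    "FileName_RNA": "A01_s1_RNA.tiff",
    "PathName_RNA": "./predictions/crispr_5gene_data",
    # ... one FileName_/PathName_ pair for each remaining channel
}])
loaddata.to_csv("cellprofiler_input_example.csv", index=False)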

HPC Batch Processing (Optional)

We strongly recommend running training and prediction on an HPC cluster when processing multiple channel combinations or large datasets. Remember to configure job arrays and resource allocation in the SLURM scripts to match your HPC environment.

Training and prediction:

cd examples/2D/JUMP/HPC
sbatch train_all_combinations.sh   # train all channel combinations
sbatch 5channels_predictions.sh    # run predictions with the 5-channel models

After running MicroSplit on the Cell Painting data, we can evaluate how the MicroSplit-predicted images perform against the original data, from pre-processing through profile creation. We use CellProfiler v4.2.8 to run two analysis pipelines adapted from the cpg0000-jump-pilot experiment pipelines, and then perform downstream analysis on the resulting profiles. The tools for analysing and comparing Cell Painting data, as well as the full image analysis workflow, can be found in the CellPaintMONO repository.
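
As one illustrative downstream check (a generic sketch, not the CellPaintMONO workflow; file names and metadata columns are hypothetical), the two profile tables can be aligned on their metadata and compared feature by feature:

import pandas as pd

# Hypothetical profile CSVs produced by the two CellProfiler runs.
orig = pd.read_csv("profiles_original.csv")
pred = pd.read_csv("profiles_microsplit.csv")

# Align wells, then correlate each morphological feature across the two runs.
merged = orig.merge(pred, on=["Metadata_Plate", "Metadata_Well"],
                    suffixes=("_orig", "_pred"))
features = [c[:-5] for c in merged.columns
            if c.endswith("_orig") and not c.startswith("Metadata")]
corr = {f: merged[f + "_orig"].corr(merged[f + "_pred"]) for f in features}
print(pd.Series(corr).sort_values().head())  # lowest-agreement features first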

Tested Systems

System 1

  • OS: Red Hat Enterprise Linux 8.10
  • GPU: NVIDIA A40-16Q, 16GB
  • CUDA: 12.4

System 2

  • OS: macOS 14.1
  • GPU: Apple M3, 16GB

System 3

  • OS: Windows 10 Enterprise
  • GPU: NVIDIA RTX A3000, 6GB
  • CUDA: 12.3

Troubleshooting

Problem: NVIDIA driver version error
Solution: Downgrade PyTorch:

pip3 install torch==2.2 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Problem: Apple Silicon GPU test returns False
Solution: Install PyTorch via pip (not conda) and make sure you are using a macOS arm64 (Apple Silicon) build of Anaconda/Mamba.

License

BSD-3-Clause License - see LICENSE for details.
