This repository contains hands-on examples for processing large-scale scientific data in the cloud using:
- Dataplug: A lightweight, client-side Python framework for efficient partitioning of unstructured scientific data stored in object storage (like Amazon S3), enabling elastic cloud processing.
- Lithops: Serverless framework for scalable parallel processing.
🚀 Quick Start (Recommended): Use pyrun.cloud
This tutorial is designed to run seamlessly on pyrun.cloud, a cloud-based JupyterLab platform with:
✅ Pre-installed dependencies
✅ Auto-configured Lithops backend
✅ Direct support for Dataplug and serverless workflows
🟢 No setup required — just launch the notebooks and start experimenting!
Notebook: dataplug_example.ipynb
This notebook shows how to:
- Load a FASTA file from an S3 bucket using CloudObject.from_s3
- Explore metadata (e.g., number of sequences)
- Preprocess and split the file into chunks
- Partition the data for analysis
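The idea behind the preprocess and partition steps can be sketched in plain Python. This is only an illustration of index-based partitioning, not Dataplug's actual implementation: the helper names `index_fasta` and `partition_ranges` are hypothetical, and the sample data is held in memory rather than in S3.

```python
from typing import List, Tuple


def index_fasta(data: bytes) -> List[int]:
    """Return the byte offset of every line that starts a record ('>')."""
    offsets = []
    pos = 0
    for line in data.splitlines(keepends=True):
        if line.startswith(b">"):
            offsets.append(pos)
        pos += len(line)
    return offsets


def partition_ranges(
    offsets: List[int], total_size: int, num_chunks: int
) -> List[Tuple[int, int]]:
    """Group consecutive records into at most num_chunks byte ranges [start, end)."""
    if not offsets:
        return []
    # Never create more chunks than there are records.
    num_chunks = max(1, min(num_chunks, len(offsets)))
    per_chunk, extra = divmod(len(offsets), num_chunks)
    ranges = []
    first = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional record each.
        last = first + per_chunk + (1 if i < extra else 0)
        start = offsets[first]
        end = offsets[last] if last < len(offsets) else total_size
        ranges.append((start, end))
        first = last
    return ranges


# Tiny in-memory FASTA sample (three records).
sample = b">seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\nTT\n"

offsets = index_fasta(sample)          # the "preprocess" step: build an index once
ranges = partition_ranges(offsets, len(sample), num_chunks=2)
chunks = [sample[start:end] for start, end in ranges]
```

Because the index is built once, the same file can be re-partitioned into a different number of chunks without scanning it again, and every chunk starts at a record boundary.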
Run it on pyrun or locally with:
jupyter notebook dataplug_example.ipynb

Notebook: dataplug_lithops.ipynb
This notebook demonstrates how to scale the same processing logic to the cloud using Lithops:
- Partition the FASTA file with co.partition(...)
- Apply process_fasta_partition to each slice
- Launch parallel processing with lithops.FunctionExecutor
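The map pattern in these steps can be tried locally with the standard library before moving to the cloud. The sketch below is an assumption-laden stand-in: the body of `process_fasta_partition` is hypothetical (the notebook's version may differ and would typically receive a Dataplug data slice rather than a string), and a `ThreadPoolExecutor` takes the place of `lithops.FunctionExecutor`. With Lithops, the equivalent calls would be the executor's `map` followed by `get_result`.

```python
from concurrent.futures import ThreadPoolExecutor


def process_fasta_partition(fasta_text: str) -> dict:
    """Hypothetical per-slice work: count sequences, bases, and G/C bases."""
    num_sequences = 0
    num_bases = 0
    gc = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            num_sequences += 1
        else:
            seq = line.upper()
            num_bases += len(seq)
            gc += seq.count("G") + seq.count("C")
    return {"sequences": num_sequences, "bases": num_bases, "gc": gc}


# Two in-memory slices standing in for the partitions of a FASTA file.
slices = [
    ">seq1\nACGT\nACGT\n>seq2\nGGCC\n",
    ">seq3\nTTAA\nTT\n",
]

# Local stand-in for the serverless executor: one task per slice.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial = list(pool.map(process_fasta_partition, slices))

# Reduce the per-slice results into file-wide totals.
totals = {key: sum(r[key] for r in partial) for key in ("sequences", "bases", "gc")}
```

Keeping the per-slice function free of any executor-specific code is what lets the same logic move from a local pool to a serverless backend.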
Run it on pyrun or locally with:
jupyter notebook dataplug_lithops.ipynb

✅ The integration between Dataplug and Lithops is native: no code changes are needed to go from local to serverless!
If you prefer to run the notebooks locally instead of on pyrun, follow these steps:
pip install git+https://github.com/CLOUDLAB-URV/dataplug
pip install lithops

To execute functions in the cloud (AWS, IBM Cloud, Azure, etc.), you’ll need to configure your Lithops backend manually.
You can follow the official guide here:
👉 https://github.com/lithops-cloud/lithops#configuration
Create a .lithops_config file with your credentials and backend options.
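As a starting point, the file can be generated from Python. The sketch below writes a minimal configuration that selects Lithops' `localhost` backend and storage, which is useful for a smoke test without cloud credentials. The helper name `write_lithops_config` is hypothetical, and the keys required for a real cloud backend are not shown here; take those from the official guide.

```python
from pathlib import Path

# Minimal YAML selecting the localhost backend and storage.
# Cloud backends need additional sections (credentials, region, etc.)
# that are deliberately omitted; see the Lithops configuration guide.
LOCALHOST_CONFIG = """\
lithops:
    backend: localhost
    storage: localhost
"""


def write_lithops_config(directory: str = ".") -> Path:
    """Write a .lithops_config file into `directory` and return its path."""
    path = Path(directory) / ".lithops_config"
    if path.exists():
        # Avoid silently overwriting a file that may hold real credentials.
        raise FileExistsError(f"{path} already exists")
    path.write_text(LOCALHOST_CONFIG, encoding="utf-8")
    return path
```

Calling `write_lithops_config()` from the directory that holds the notebooks creates the file there. Once real credentials are added, keep the file out of version control.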
Requirements:
- Python 3.10 or higher
- Access to an S3-compatible storage (e.g., AWS S3, MinIO)
- Internet connection
- Cloud credentials (automatically set in pyrun, or configured manually for local runs)
This code is part of the PyRun-SciPy2025 tutorial series for scientific computing in the cloud.