This tutorial shows how to launch and monitor Python experiments using experimaestro.
We will use experimaestro in a basic setup, performing hyper-parameter search for a deep learning model (here, a CNN trained on the MNIST dataset).
In this situation, one usually has to run many trainings with various hyper-parameters. This can rapidly become a pain when you need to:
- Launch training jobs in parallel on your cluster, and monitor them in real time.
- Track the hyper-parameters already tested, and avoid launching a job with the same parameters twice.
- Save results in unique, non-conflicting directories for further loading and analysis.
By the end of this tutorial, you will understand the core structure of experimaestro and be able to launch and monitor your own experiments. This tutorial relies on git and uv; both need to be installed.
- Clone this repository:

```sh
git clone https://github.com/experimaestro/experimaestro-demo.git && cd experimaestro-demo
```

Experimaestro automatically creates folders for all the tasks it runs. We still have to specify where those folders will be created! There are two solutions:
Workspace settings are stored in `$HOME/.config/experimaestro/settings.yaml` (see the documentation at https://experimaestro-python.readthedocs.io/en/latest/settings/).
This repository contains a default settings.yaml for quick testing.
- 💡 Tip: the following command writes the default config above to your `settings.yaml` if it doesn't already exist.

```sh
FILE="$HOME/.config/experimaestro/settings.yaml"; if [ ! -f "$FILE" ]; then cat ./xpm_settings.yaml > "$FILE"; else echo "$FILE already exists!"; fi
```

If no `settings.yaml` is provided, you can still specify where you want your experiments to run with the following:
```sh
mkdir $HOME/experiments
uv run experimaestro run-experiment --workdir $HOME/experiments ...
```

We will now have a closer look at the key files of this demo repository. In short, the `mnist_xp` folder contains:
- `learn.py` contains the core code: a CNN configuration and the learning and evaluation tasks
- `data.py` contains the code that provides the dataset
- `experiment.py` orchestrates the experiment
- `params.yaml` contains the configuration parameters for our experiment
We will also point out the most important objects that allow us to run the experiment.
`learn.py` contains the code that defines a CNN model and specifies how to learn and evaluate an image classification model.
The most important concept at this stage is the configuration object, which serves as a structured template defining parameters and settings for tasks and experiments.
Let's see a first configuration that defines a CNN model:

```python
class CNN(Config, nn.Module):
    """Defines a CNN model"""

    n_layers: Param[int] = 2
    """Number of hidden layers"""

    hidden_dim: Param[int] = 64
    """Number of hidden units"""

    kernel_size: Param[int] = 3
    """Kernel size of the CNN"""

    ...
```

A configuration is characterized by:
- Deriving from the `experimaestro.Config` class.
- Defining experimental parameters (number of layers, hidden dimension, kernel size) that can change the outcome of an experiment.
- Using docstrings to document the parameters, which can be used to generate documentation when the number of experimental configurations becomes high (e.g. the documentation of a cross-scorer in the IR library).
This configuration can later be used as a dataclass when configuring the experiments, e.g. `CNN.C(n_layers=3)`. For each instance of a configuration, we can compute a unique identifier that changes only if one or more experimental parameters change. For instance, `CNN.C()` and `CNN.C(n_layers=2)` have the same identifier, which is different from that of `CNN.C(n_layers=1)`.
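To make this concrete, here is a minimal sketch, assuming `learn.py` is importable as `mnist_xp.learn` and using the `__identifier__()` accessor described in the experimaestro documentation:

```python
from mnist_xp.learn import CNN

cnn_default = CNN.C()             # uses the default n_layers=2
cnn_explicit = CNN.C(n_layers=2)  # sets the default value explicitly
cnn_shallow = CNN.C(n_layers=1)

# Identifiers depend only on experimental parameter values:
print(cnn_default.__identifier__())   # same as cnn_explicit's identifier
print(cnn_explicit.__identifier__())
print(cnn_shallow.__identifier__())   # differs, since n_layers changed
```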
Another important type of object is the Task. Tasks correspond to the code that can be run, e.g. on a SLURM cluster or locally, to perform a part of the experiment, e.g. learning the CNN model from data. Let us take a closer look at some bits of the code:
```python
class Learn(Task):
    """Learn to classify an image into a pre-defined set of classes"""

    parameters_path: Meta[Path] = field(default_factory=PathGenerator("parameters.pth"))
    """Path to store the model parameters"""

    data: Param[LabelledImages]
    """Train data are labelled images"""

    model: Param[CNN]
    """The model we are training"""

    ...

    def execute(self):
        ...
```

You can notice that tasks are specific types of configurations. You can also notice that parameters can be other configurations, which makes it easy to compose experimental components, as in the sketch below.
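For instance, a minimal sketch of such a composition (assuming the classes are importable as below; the `root` path is a placeholder):

```python
from pathlib import Path

from mnist_xp.data import MNISTLabelledImages
from mnist_xp.learn import CNN, Learn

root = Path("/tmp/mnist")  # placeholder location for the MNIST files

# The data and model parameters of the Learn task are themselves configurations
task = Learn.C(
    data=MNISTLabelledImages.C(root=root, train=True),
    model=CNN.C(n_layers=3, hidden_dim=128),
)
```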
The main difference between `Config` and `Task` is the `execute` method: the latter contains the code to be run when the task is executed by the task scheduler. Another important thing in the above example is the `Meta[Path]` annotation: meta-parameters such as `parameters_path` are not part of the identifier computation, and their value is generated automatically (here through `PathGenerator`).
Please have a look at the Task documentation for more details.
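To give an idea of what happens at execution time, here is a hypothetical, simplified `execute` body (the actual implementation lives in `mnist_xp/learn.py`). Inside `execute`, fields such as `data`, `model` and `parameters_path` hold concrete values:

```python
import torch
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

def execute(self):
    # Build a concrete torchvision dataset from the data configuration
    dataset = self.data.torchvision_dataset(transform=ToTensor())
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.Adam(self.model.parameters())

    for images, labels in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(self.model(images), labels)
        loss.backward()
        optimizer.step()

    # parameters_path was generated by experimaestro (see PathGenerator above)
    torch.save(self.model.state_dict(), self.parameters_path)
```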
The first code in `data.py` defines a labelled image dataset, which is a light abstraction over the datasets we manipulate in our experiment. For this purpose, we use the datamaestro library, which is tightly integrated with experimaestro.
```python
class LabelledImages(Base, ABC):
    @abstractmethod
    def torchvision_dataset(self, **kwargs) -> VisionDataset:
        ...
```

The `datamaestro.Base` class is the central class from which any dataset class should derive.
Leveraging torchvision, we then define the MNIST data with the following code:
```python
class MNISTLabelledImages(LabelledImages):
    root: Meta[Path]
    train: Param[bool]

    def torchvision_dataset(self, **kwargs) -> VisionDataset:
        return MNIST(self.root, train=self.train, **kwargs)


def download_mnist(context: Context, root: Path, force=False):
    logging.info("Downloading in %s", root)
    for train in [False, True]:
        MNIST(root, train=train, download=True)


@custom_download("root", download_mnist)
@dataset(id="com.lecun.mnist")
def mnist(root: Path) -> Supervised[LabelledImages, None, LabelledImages]:
    """This corresponds to a dataset with an ID `com.lecun.mnist`"""
    return Supervised(
        train=MNISTLabelledImages.C(root=root, train=True),
        test=MNISTLabelledImages.C(root=root, train=False),
    )
```

💡 Going further: datamaestro can help manage many datasets by providing a unified interface to them, as well as many utilities to download files from various sources. Examples include datamaestro-image, which contains e.g. the MNIST dataset, and datamaestro-text, which contains NLP and IR datasets. Feel free to contribute!
See how MNIST is defined in datamaestro-image:
```python
from datamaestro_image.data import ImageClassification, LabelledImages, Base, IDXImage
from datamaestro.download.single import filedownloader
from datamaestro.definitions import dataset
from datamaestro.data.tensor import IDX


@filedownloader(
    "train_images.idx",
    "https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz",
)
@filedownloader(
    "train_labels.idx",
    "https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz",
)
@filedownloader(
    "test_images.idx",
    "https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz",
)
@filedownloader(
    "test_labels.idx",
    "https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz",
)
@dataset()
def mnist(train_images, train_labels, test_images, test_labels) -> ImageClassification:
    """The MNIST database

    The MNIST database of handwritten digits, available from this page, has a
    training set of 60,000 examples, and a test set of 10,000 examples. It is a
    subset of a larger set available from NIST. The digits have been
    size-normalized and centered in a fixed-size image.
    """
    return ImageClassification(
        train=LabelledImages(
            images=IDXImage(path=train_images), labels=IDX(path=train_labels)
        ),
        test=LabelledImages(
            images=IDXImage(path=test_images), labels=IDX(path=test_labels)
        ),
    )
```

`experiment.py` describes the experiment; an experiment may load and launch several Tasks.
There are two key elements in experiment.py:
As noted above, the configuration object is the most important concept in experimaestro: it acts as a structured way to provide the details necessary to execute a task or a series of tasks.
💡 Have you noticed? Under the hood, the `Task` that we defined above is itself a configuration! This enables experimaestro to create a unique Task ID for each set of `Param`s.
Here, we first define a ConfigurationBase object that describes our experiment settings.
```python
from experimaestro import ConfigurationBase

...

@configuration
class Configuration(ConfigurationBase):
    # experimaestro Task parameters
    ## Model
    n_layers: int = 3  # number of hidden layers
    ...
    # Misc
    launcher: str = """duration=3h & cuda(mem=4G)*1 & cpu(cores=2)"""
    ...
```

This class describes the configuration needed to run our experiment; the values for this configuration are read from `params.yaml` (described below).
Please have a look at the Config documentation for more details.
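For context, an experiment file essentially exposes a `run` function that receives the parsed configuration. A minimal sketch, assuming the `ExperimentHelper` entry point and the `find_launcher` import path described in the experimaestro documentation:

```python
from experimaestro.experiments import ExperimentHelper
from experimaestro.launcherfinder import find_launcher

def run(helper: ExperimentHelper, cfg: Configuration):
    """Called by `experimaestro run-experiment` with the values from params.yaml"""
    gpulauncher = find_launcher(cfg.launcher, tags=["slurm"])
    ...  # create and submit the tasks, as shown below
```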
When operating on clusters with workload managers like Slurm, experimaestro can manage the submission of your Tasks. This can be done easily with a configuration file (which specifies how to launch a task given some specifications) and the `find_launcher` function.
Here, we specified in the configuration above which hardware constraints we need for our Task: training a CNN on MNIST. We can then find a launcher with:
```python
...
gpulauncher = find_launcher(cfg.launcher, tags=["slurm"])
```

You can have a look at the Launchers documentation for more details.
Now we are ready to launch our tasks! We use the `submit` method, one call per combination of hyper-parameters:

```python
for n_layer in cfg.n_layers:
    for hidden_dim in cfg.hidden_dim:
        for kernel_size in cfg.kernel_size:
            task = TrainOnMNIST(...)
            ...
            task.submit(launcher=gpulauncher)  # Submit the Task to Slurm
            ...
```

`params.yaml` contains the values of the parameters that will be used to run our experiment:

```yaml
id: MNIST_train
title: "MNIST training"
description: "Training a simple CNN model on MNIST"
# what experiment file to run
file: experiment # will run experiment.py
# Launcher configuration: what resources we need to run a Task
launcher: "duration=1h & cuda(mem=4G)*1 & cpu(cores=4)"
# Experimental parameters
hidden_dim: # number of hidden units in the model
- 32
n_layers: # number of layers in the model
- 1
- 2
kernel_size: # kernel size of the convolutions
- 3
# Training parameters
epochs: 1 # number of epochs to train
lr: 1e-3 # learning rate
batch_size: 64 # batch size
n_val: 100 # number of steps between validation and logging
```

We will launch one job for each possible combination of `hidden_dim`, `n_layers` and `kernel_size`; here, 1 × 2 × 1 = 2 jobs, as the small sketch below enumerates.
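A plain-Python illustration of that grid, independent of experimaestro:

```python
from itertools import product

hidden_dims = [32]      # hidden_dim values from params.yaml
n_layers_list = [1, 2]  # n_layers values
kernel_sizes = [3]      # kernel_size values

# One job per combination: 1 * 2 * 1 = 2 jobs
for hidden_dim, n_layers, kernel_size in product(hidden_dims, n_layers_list, kernel_sizes):
    print(f"hidden_dim={hidden_dim} n_layers={n_layers} kernel_size={kernel_size}")
```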
Now that everything is set up, we can run our experiment:

```sh
uv run experimaestro run-experiment mnist_xp/params.yaml
```

Now experimaestro will:
- Lock the experiment `MNIST_train` so that you cannot relaunch it while this one is running.
- Run `experiment.py` with the configuration values read in `params.yaml`.
- For each Task submitted in the experiment:
  - A unique hash ID is created depending on the parameters given to the task. This ensures that you don't run the Task with the same params twice.
  - A folder is created in `workspace/jobs/task-id`; it will be the working directory for the Task. Here you can find the logs and the outputs of the running Task.
Your jobs are now launched. You can display all launched jobs with the following command:
```sh
uv run experimaestro jobs list --tags
```

It yields:
```
RUNNING task.trainonmnist/12feeb6cb9f5f7aad5d0fdcaac5ee057673f7c82485126d8903ecc119b567451 MNIST_train n_layers=1 hidden_dim=32 kernel_size=3
RUNNING task.trainonmnist/c9420a1de91830ff467397bd3e1aa592535eac931c9dff7efbad7c0e759c0be3 MNIST_train n_layers=2 hidden_dim=32 kernel_size=3
```

💡 Bonus: for better readability, we used tags when creating our Tasks, and used the `--tags` flag in the command above; this is why the parameter values are displayed. See the docs for more details. A sketch of how tags can be set is shown below.
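A minimal sketch of how such tags could be set, assuming the `tag(name, value)` helper of experimaestro configurations (the exact call sites in `experiment.py` may differ):

```python
task = TrainOnMNIST(...)  # as in the loop above
# Hypothetical sketch: tag() attaches a human-readable label to a configuration,
# which `jobs list --tags` then displays next to the job ID
task.tag("n_layers", n_layer)
task.tag("hidden_dim", hidden_dim)
task.tag("kernel_size", kernel_size)
task.submit(launcher=gpulauncher)
```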
To do:
- Add details for results processing.
- Detail how to install the launcher file as well.