4 changes: 4 additions & 0 deletions Makefile
@@ -17,6 +17,10 @@ CPPFLAGS := -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DPYXIS_VERSION=\"$(PYXIS_VER)\" $
CFLAGS := -std=gnu11 -O2 -g -Wall -Wunused-variable -fstack-protector-strong -fpic $(CFLAGS)
LDFLAGS := -Wl,-znoexecstack -Wl,-zrelro -Wl,-znow $(LDFLAGS)

ifneq ($(strip $(SLURM_ROOT)),)
CPPFLAGS += -I$(SLURM_ROOT)/include
endif

C_SRCS := common.c args.c pyxis_slurmstepd.c pyxis_slurmd.c pyxis_srun.c pyxis_alloc.c pyxis_dispatch.c config.c enroot.c importer.c
C_OBJS := $(C_SRCS:.c=.o)

168 changes: 168 additions & 0 deletions PLAN_CONTAINER_CACHE.md
@@ -0,0 +1,168 @@
# Plan: `--container-cache` (Module 3A) — Persistent RootFS Reuse with Pyxis + Enroot

This file is a working plan/spec for implementing and validating **Module 3A** in Pyxis:

- New user-facing flag: `srun --container-cache`
- Goal: reuse the **unpacked** Enroot rootfs across jobs on the **same node** to achieve near-1s warm starts
- Constraints:
  - Must work with the existing cluster cleanup behavior (Epilog deletes `pyxis_${SLURM_JOB_ID}*`)
  - **Disallow** `--container-writable` and `--container-save` in cache mode (to avoid unsafe cross-job state)
  - Must not require users to encode digest/version into `--container-name`

---

## Background / Why this exists

On our cluster:
- Cold start (`--container-name` first use) costs ~14–18s/node
- Warm start can be ~1s **only if** the unpacked rootfs persists
- Today rootfs does **not** persist across jobs because:
  - job-scoped naming (`pyxis_${JOBID}_...`) and
  - `/etc/slurm/epilog.d/70-enroot-container-cleanup.sh` removing `pyxis_${SLURM_JOB_ID}*`

So we need a **Pyxis-native** mechanism to:
1) generate stable cache identities and
2) ensure the resulting rootfs directories do not match job-scoped cleanup patterns.

---

## Design summary

### User interface

- **Flag:** `--container-cache`
  - Also available via env: `PYXIS_CONTAINER_CACHE=1`
- When set:
  - Requires `--container-image`
  - Rejects `--container-writable`
  - Rejects `--container-save`
  - Forces read-only rootfs (`ENROOT_ROOTFS_WRITABLE=n`)
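
The policy above can be sketched as a single check. This is a minimal illustration, not the plugin's code: `struct cache_request`, its field names, and `cache_mode_check` are hypothetical, and the error strings are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of the relevant options; field names are illustrative only. */
struct cache_request {
	int cache;          /* --container-cache or PYXIS_CONTAINER_CACHE=1 */
	const char *image;  /* --container-image, or NULL */
	int writable;       /* --container-writable */
	const char *save;   /* --container-save, or NULL */
};

/* Returns NULL if the combination is allowed, otherwise a short error message. */
static const char *cache_mode_check(const struct cache_request *r)
{
	assert(r != NULL);

	if (!r->cache)
		return (NULL);  /* cache mode off: nothing to enforce here */
	if (r->image == NULL)
		return ("--container-cache requires --container-image");
	if (r->writable)
		return ("--container-cache is incompatible with --container-writable");
	if (r->save != NULL)
		return ("--container-cache is incompatible with --container-save");
	return (NULL);
}
```

A caller would report the returned message and fail the step when it is non-NULL. Forcing `ENROOT_ROOTFS_WRITABLE=n` is a separate step and is not shown here.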

### How caching works

1) **Stable cache key** is derived from the image identity:
   - For `.sqsh` paths: `abs_path + mtime + size` (fast; avoids hashing huge files)
   - For non-path images: hash the image string (future improvement: OCI digest)

2) **Stable container name** is auto-generated from the cache key:
   - Example prefix: `pyxis_cache_u<uid>_<hash>`
   - **Important:** In cache mode, naming must be **non-job-scoped** (no jobid prefix).

3) **Reuse behavior**
   - If the container already exists: Pyxis reuses the filesystem and skips `enroot create`
   - Otherwise: Pyxis creates it once (cold path), then it persists

### Cache directory layout

- **Required config for cache mode** (plugstack arg):
  - `container_cache_data_path=/raid/containers/data`
- **Per-user cache directory**:
  - Pyxis creates/uses `<base>/<uid>` with mode `0700` and ownership `<uid>:<gid>`
  - For cache mode, Pyxis sets `ENROOT_DATA_PATH=<base>/<uid>` for all Enroot calls
- **Cached rootfs directory name**:
  - `pyxis_cache_u<uid>_<hash>`
  - Full path: `<base>/<uid>/pyxis_cache_u<uid>_<hash>`
- **Locking + "last used" signal**:
  - Pyxis creates `<rootfs>/.pyxis_cache_lock`
    - jobs hold a **shared** lock for the job lifetime
    - GC tries an **exclusive non-blocking** lock; if it can’t lock, the entry is treated as in-use
  - Each time a cached rootfs is used, Pyxis `touch`es the rootfs directory to update its `mtime`
  - GC uses this `mtime` as the LRU timestamp (no reliance on filesystem `atime`)
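
The lock and "last used" protocol above could look roughly like this on Linux. This is a sketch that assumes `flock(2)` is the locking primitive (the plan does not say which call is used); `cache_lock` and `cache_touch` are hypothetical helper names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open <rootfs>/.pyxis_cache_lock (creating it if needed)
 * and take a lock on it. `op` is LOCK_SH for a job, or LOCK_EX | LOCK_NB for GC.
 * Returns the open fd (the lock is held until the fd is closed), or -1 on
 * failure, including the case where a non-blocking request would have to wait.
 */
static int cache_lock(const char *rootfs, int op)
{
	char path[PATH_MAX];
	int fd, ret;

	assert(rootfs != NULL);

	ret = snprintf(path, sizeof(path), "%s/.pyxis_cache_lock", rootfs);
	if (ret < 0 || ret >= (int)sizeof(path))
		return (-1);

	fd = open(path, O_RDONLY | O_CREAT | O_CLOEXEC, 0600);
	if (fd < 0)
		return (-1);

	if (flock(fd, op) < 0) {
		close(fd);
		return (-1);
	}
	return (fd);
}

/* Hypothetical helper: set the rootfs directory mtime to "now" (the LRU timestamp). */
static int cache_touch(const char *rootfs)
{
	return (utimensat(AT_FDCWD, rootfs, NULL, 0));
}
```

A job would call `cache_lock(rootfs, LOCK_SH)` followed by `cache_touch(rootfs)` and keep the fd open until the job ends. GC would call `cache_lock(rootfs, LOCK_EX | LOCK_NB)` and treat `-1` as "in use".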

### GC / LRU eviction (global across users)

Goal: prevent jobs failing due to the cache filesystem being full.

- **When GC runs**:
  - Opportunistic, at job start in cache mode **only if** we’re about to create a new cached rootfs (cold path)
  - Reuse path should not trigger GC
- **Watermarks** (admin-configurable):
  - `container_cache_gc_high=85` (default): start evicting when used% is at or above high
  - `container_cache_gc_low=80` (default): stop evicting when used% drops below low
- **Global serialization**:
  - GC takes an exclusive lock on `<base>/pyxis-container-cache-gc.lock` so multiple jobs don’t evict concurrently
- **Candidate selection (LRU)**:
  - Scan all user dirs: `<base>/*/pyxis_cache_*`
  - Sort candidates by directory `mtime` (oldest first)
- **Eviction loop**:
  - For each candidate, try to acquire an **exclusive non-blocking** lock on `<candidate>/.pyxis_cache_lock`
    - if the lock is acquired: recursively delete the candidate directory
    - if not lockable: skip (in-use)
  - Re-check used% and stop once below `container_cache_gc_low`
- **Cross-user behavior**:
  - Because `<base>/<uid>` is `0700`, GC must run in a privileged SPANK hook so it can traverse/evict other users’ entries.
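
The watermark test and LRU ordering above can be sketched as follows. This covers only the decision logic, not the directory scan or deletion. All names are hypothetical, and the "used%" definition (non-free blocks over total blocks from `statvfs(3)`) is an assumption; the plan does not say how usage is measured.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/statvfs.h>
#include <time.h>

/* Hypothetical GC candidate: one cached rootfs directory. */
struct gc_candidate {
	const char *path;  /* <base>/<uid>/pyxis_cache_u<uid>_<hash> */
	time_t mtime;      /* directory mtime, used as the LRU timestamp */
};

/* Used space of the filesystem holding `path` as an integer percentage, or -1 on error. */
static int fs_used_percent(const char *path)
{
	struct statvfs st;

	assert(path != NULL);

	if (statvfs(path, &st) < 0 || st.f_blocks == 0)
		return (-1);
	return ((int)(((st.f_blocks - st.f_bfree) * 100) / st.f_blocks));
}

/* qsort comparator: oldest mtime first. */
static int gc_oldest_first(const void *a, const void *b)
{
	const struct gc_candidate *ca = a, *cb = b;

	return ((ca->mtime > cb->mtime) - (ca->mtime < cb->mtime));
}

/* Start evicting when usage is at or above the high watermark. */
static int gc_should_start(int used, int high)
{
	return (used >= high);
}

/* Stop evicting once usage has dropped below the low watermark. */
static int gc_should_stop(int used, int low)
{
	return (used < low);
}
```

An eviction loop built on these would sort the candidates with `qsort(..., gc_oldest_first)`, walk them in order, and call `fs_used_percent` after each deletion until `gc_should_stop` holds.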

---

## Code changes (high-level)

### 1) Add new flag and env var plumbing
- `args.h`: add `int container_cache;`
- `args.c`:
  - register `--container-cache`
  - parse `PYXIS_CONTAINER_CACHE`

### 2) Implement cache mode in `pyxis_slurmstepd.c`
- During `slurm_spank_user_init`:
  - validate incompatible flags (`--container-writable`, `--container-save`)
  - compute stable name
  - force `container_scope=global` for cache mode
  - derive a per-user cache directory from `container_cache_data_path` (`<base>/<uid>`)

- During Enroot execution (`enroot_set_env`):
  - set `ENROOT_DATA_PATH` for cache mode (`<base>/<uid>`)

- During container create:
  - run GC if needed (global `/raid` thresholds)
  - touch/lock cached entries to prevent eviction while in-use

### 3) Config knobs
Extend `config.[ch]` to parse:
- `container_cache_data_path=...`
- `container_cache_gc_high=...`
- `container_cache_gc_low=...`
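
One possible way to parse the two watermark knobs strictly is sketched below. It uses `strtol` so that trailing garbage such as `85abc` is rejected. `parse_percent_opt` is a hypothetical helper and is not part of this change set; the `[1, 99]` range mirrors the validation done in `config.c`.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: if `arg` starts with `key` (e.g. "container_cache_gc_high="),
 * parse the remainder as an integer in [1, 99] and store it in *out.
 * Returns 1 if parsed, 0 if `arg` is a different option, -1 if the value is invalid.
 */
static int parse_percent_opt(const char *arg, const char *key, int *out)
{
	size_t klen;
	char *end;
	long v;

	assert(arg != NULL && key != NULL && out != NULL);

	klen = strlen(key);
	if (strncmp(arg, key, klen) != 0)
		return (0);

	errno = 0;
	v = strtol(arg + klen, &end, 10);
	if (errno != 0 || end == arg + klen || *end != '\0' || v < 1 || v > 99)
		return (-1);

	*out = (int)v;
	return (1);
}
```

The cross-check that `low` is strictly less than `high` still has to happen after all options are read, since the two knobs can appear in either order.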

---

## Testing plan (cluster)

### Automated (BATS)
- Ensure Slurm env is set (cluster-specific; adjust paths as needed):

```bash
export SLURM_ROOT=/cm/local/apps/slurm/24.11
export PATH="$SLURM_ROOT/bin:$SLURM_ROOT/sbin:$PATH"
export LD_LIBRARY_PATH="$SLURM_ROOT/lib64:${LD_LIBRARY_PATH:-}"
export SLURM_CONF=/etc/slurm/slurm.conf
```

- Run: `bats tests/container_cache.bats`
- Covers: policy enforcement, stable naming/layout under `<base>/<uid>`, read-only enforcement, env var enablement, and GC behavior (including cross-user eviction) when usage is above the configured high watermark.
- If needed: `PYXIS_TEST_SQSH_IMAGE=/path/to/image.sqsh bats tests/container_cache.bats`

### Functional correctness (manual)
1) Cold create:
   - `srun --container-cache --container-image=<image> ...`
   - expect: rootfs directory `<base>/<uid>/pyxis_cache_u<uid>_<hash>` is created
2) Warm reuse (separate job, same node):
   - run the same command on the same node
   - expect: the same cached rootfs is reused (near-1s startup)

### Cleanup compatibility
- Verify cached directories are **non-job-scoped** (e.g. `pyxis_cache_u<uid>_*`) and do **not** match Epilog `pyxis_${JOBID}*` patterns.

### GC/LRU (manual)
- Trigger GC by filling the cache filesystem above `container_cache_gc_high`, then create new cached rootfs entries.
- Confirm oldest caches are evicted first and locked/in-use caches are not evicted.

---

## Open items / future improvements

- Use image digest for OCI images (instead of hashing only the image string)
- More robust last-used tracking (avoid relying on directory `mtime`)
- Expand test coverage for concurrency/stress scenarios on a multi-node cluster

---


33 changes: 33 additions & 0 deletions args.c
@@ -17,6 +17,7 @@ static struct plugin_args pyxis_args = {
	.workdir = NULL,
	.container_name = NULL,
	.container_name_flags = NULL,
	.container_cache = -1,
	.container_save = NULL,
	.mount_home = -1,
	.remap_root = -1,
@@ -31,6 +32,7 @@ static int spank_option_image(int val, const char *optarg, int remote);
static int spank_option_mount(int val, const char *optarg, int remote);
static int spank_option_workdir(int val, const char *optarg, int remote);
static int spank_option_container_name(int val, const char *optarg, int remote);
static int spank_option_container_cache(int val, const char *optarg, int remote);
static int spank_option_container_save(int val, const char *optarg, int remote);
static int spank_option_container_mount_home(int val, const char *optarg, int remote);
static int spank_option_container_remap_root(int val, const char *optarg, int remote);
@@ -69,6 +71,15 @@ struct spank_option spank_opts[] =
		"If a container with this name already exists, the existing container is used and the import is skipped.",
		1, 0, spank_option_container_name
	},
	{
		"container-cache",
		NULL,
		"[pyxis] enable persistent container root filesystem caching. "
		"When set, pyxis derives a stable container name from the image identity and attempts to reuse "
		"the container filesystem across jobs on the same node. "
		"Incompatible with --container-writable and --container-save.",
		0, 1, spank_option_container_cache
	},
	{
		"container-save",
		"PATH",
@@ -185,6 +196,13 @@ void pyxis_args_check_environment_variables(spank_t sp)
	if (env_val != NULL && pyxis_args.container_name == NULL)
		spank_option_container_name(0, env_val, 0);

	env_val = get_env_var(sp, "PYXIS_CONTAINER_CACHE", buf, sizeof(buf));
	if (env_val != NULL && pyxis_args.container_cache == -1) {
		ret = parse_bool(env_val);
		if (ret >= 0)
			spank_option_container_cache(ret, NULL, 0);
	}

	env_val = get_env_var(sp, "PYXIS_CONTAINER_SAVE", buf, sizeof(buf));
	if (env_val != NULL && pyxis_args.container_save == NULL)
		spank_option_container_save(0, env_val, 0);
@@ -490,6 +508,19 @@ static int spank_option_container_name(int val, const char *optarg, int remote)
	return (rv);
}

static int spank_option_container_cache(int val, const char *optarg, int remote)
{
	(void)optarg;
	(void)remote;

	/* Slurm may call us multiple times with the same value. */
	if (pyxis_args.container_cache == val)
		return (0);

	pyxis_args.container_cache = val;
	return (0);
}

static int spank_option_container_save(int val, const char *optarg, int remote)
{
	if (optarg == NULL || *optarg == '\0') {
@@ -630,6 +661,8 @@ struct plugin_args *pyxis_args_register(spank_t sp)
bool pyxis_args_enabled(void)
{
	if (pyxis_args.image == NULL && pyxis_args.container_name == NULL) {
		if (pyxis_args.container_cache == 1)
			return (true);
		if (pyxis_args.mounts_len > 0)
			slurm_error("pyxis: ignoring --container-mounts because neither --container-image nor --container-name is set");
		if (pyxis_args.workdir != NULL)
1 change: 1 addition & 0 deletions args.h
@@ -17,6 +17,7 @@ struct plugin_args {
	char *workdir;
	char *container_name;
	char *container_name_flags;
	int container_cache;
	char *container_save;
	int mount_home;
	int remap_root;
27 changes: 27 additions & 0 deletions config.c
@@ -3,6 +3,7 @@
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <slurm/spank.h>
@@ -24,6 +25,8 @@ int pyxis_config_parse(struct plugin_config *config, int ac, char **av)
{
	int ret;
	const char *optarg;
	const char *cache_data_prefix = "container_cache_data_path=";
	const size_t cache_data_prefix_len = sizeof("container_cache_data_path=") - 1;

	memset(config, 0, sizeof(*config));

@@ -38,6 +41,9 @@ int pyxis_config_parse(struct plugin_config *config, int ac, char **av)
	config->sbatch_support = true;
	config->use_enroot_load = false;
	config->importer_path[0] = '\0';
	config->container_cache_data_path[0] = '\0';
	config->container_cache_gc_high = 85;
	config->container_cache_gc_low = 80;

	for (int i = 0; i < ac; ++i) {
		if (strncmp("runtime_path=", av[i], 13) == 0) {
@@ -88,11 +94,32 @@ int pyxis_config_parse(struct plugin_config *config, int ac, char **av)
				slurm_error("pyxis: importer: path too long: %s", optarg);
				return (-1);
			}
		} else if (strncmp(cache_data_prefix, av[i], cache_data_prefix_len) == 0) {
			optarg = av[i] + cache_data_prefix_len;
			ret = snprintf(config->container_cache_data_path, sizeof(config->container_cache_data_path), "%s", optarg);
			if (ret < 0 || ret >= (int)sizeof(config->container_cache_data_path)) {
				slurm_error("pyxis: container_cache_data_path: path too long: %s", optarg);
				return (-1);
			}
		} else if (strncmp("container_cache_gc_high=", av[i], 24) == 0) {
			optarg = av[i] + 24;
			config->container_cache_gc_high = atoi(optarg);
		} else if (strncmp("container_cache_gc_low=", av[i], 23) == 0) {
			optarg = av[i] + 23;
			config->container_cache_gc_low = atoi(optarg);
		} else {
			slurm_error("pyxis: unknown configuration option: %s", av[i]);
			return (-1);
		}
	}

	if (config->container_cache_gc_high < 1 || config->container_cache_gc_high > 99 ||
	    config->container_cache_gc_low < 1 || config->container_cache_gc_low > 99 ||
	    config->container_cache_gc_low >= config->container_cache_gc_high) {
		slurm_error("pyxis: invalid container cache GC configuration: high=%d low=%d",
			    config->container_cache_gc_high, config->container_cache_gc_low);
		return (-1);
	}

	return (0);
}
3 changes: 3 additions & 0 deletions config.h
@@ -20,6 +20,9 @@ struct plugin_config {
	bool sbatch_support;
	bool use_enroot_load;
	char importer_path[PATH_MAX];
	char container_cache_data_path[PATH_MAX];
	int container_cache_gc_high;
	int container_cache_gc_low;
};

int pyxis_config_parse(struct plugin_config *config, int ac, char **av);