logtap

PyPI · Tests · License: GPL v3 · Python 3.10+

tail -f for GPU clouds. Survives disconnects, aggregates multi-node.

Stop losing your training logs when SSH drops. Watch from anywhere, reconnect seamlessly.

The Problem

You're training a model on RunPod, Vast.ai, or Lambda. You SSH in, start training, and:

  • Your terminal disconnects after an hour
  • You lose visibility into what's happening
  • You resort to tmux hacks just to keep logs alive
  • Multi-node training means logs scattered across machines

The Solution

# On your GPU instance
pip install logtap
logtap collect &
python train.py 2>&1 | logtap ingest run1

# From your laptop (or phone)
logtap tail run1 --follow

# Connection drops... reconnects automatically
# "reconnected (missed 0 lines)"

Quickstart: RunPod / Vast.ai

On the GPU instance (run the collector in the background or in a separate shell):

pip install logtap
export LOGTAP_API_KEY=secret
logtap collect --port 8000

Start training and stream logs:

python train.py 2>&1 | logtap ingest run1 --tag node=$(hostname)

From your laptop:

export LOGTAP_SERVER=http://<gpu-ip>:8000
export LOGTAP_API_KEY=secret
logtap tail run1 --follow

Disconnect, close your terminal, or switch networks. Re-run logtap tail anytime to resume where you left off.

Works the same on RunPod, Vast.ai, Lambda, and any ephemeral GPU cloud.

Features

  • Survives Disconnects - Resume from where you left off with cursor-based streaming
  • Pipe-Friendly - Works with any training script via stdin
  • Multi-Node Ready - Tag runs with node=gpu1 and filter/aggregate
  • Zero Infra - No database, no complex setup, just pip install
  • Lightweight - <50MB memory, append-only file storage

Why not tmux / mosh?

tmux and mosh help keep SSH sessions alive. logtap solves a different problem.

  • SSH can still drop (web terminals, proxies, idle timeouts)
  • tmux doesn't aggregate logs across machines
  • tmux can't be viewed from another device without SSH
  • tmux sessions die when ephemeral instances stop

logtap streams logs over HTTP:

  • survives disconnects
  • resumes without gaps
  • aggregates multi-node training via tags
  • works from anywhere (no SSH required)

You can still use tmux. You just don't have to rely on it.

Quick Start

1. Install

pip install logtap

2. Start Collector (on GPU instance)

logtap collect --api-key secret

3. Pipe Your Training Logs

python train.py 2>&1 | logtap ingest run1 --api-key secret

4. Tail From Anywhere

export LOGTAP_SERVER=http://your-gpu-ip:8000
export LOGTAP_API_KEY=secret

logtap tail run1 --follow

CLI Commands

Command              Description
logtap collect       Start collector server (accepts ingested runs)
logtap ingest <run>  Pipe stdin to collector
logtap tail <run>    Tail a run, with --follow for streaming
logtap runs          List active runs
logtap doctor        Check server connectivity and diagnose issues
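
For example, after starting a collector you can confirm it is reachable and see which runs it currently holds (assuming LOGTAP_SERVER and LOGTAP_API_KEY are exported as described under Environment Variables below):

# Verify connectivity and configuration
logtap doctor

# List the runs the collector knows about
logtap runs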

Ingest Options

# Auto-generate run name
python train.py | logtap ingest

# Add tags for multi-node
python train.py | logtap ingest run1 --tag node=gpu1 --tag rank=0

# Quiet mode (no status messages)
python train.py | logtap ingest run1 --quiet

Tail Options

# Follow mode (like tail -f)
logtap tail run1 --follow

# Resume from specific cursor (survives disconnects!)
logtap tail run1 --follow --since 5000

# Filter by tag
logtap tail run1 --tag node=gpu1

# Output formats
logtap tail run1 --output jsonl | jq '.line'

Collector Options

logtap collect \
  --port 8000 \
  --api-key secret \
  --data-dir ~/.logtap/runs \
  --max-disk-mb 5000 \
  --retention-hours 72

Multi-Node Training

Tag each node and aggregate:

# Node 1
python train.py | logtap ingest run1 --tag node=gpu1

# Node 2
python train.py | logtap ingest run1 --tag node=gpu2

# Watch all nodes
logtap tail run1 --follow

# Watch specific node
logtap tail run1 --follow --tag node=gpu1

Environment Variables

Variable         Default                  Description
LOGTAP_SERVER    http://localhost:8000    Collector URL
LOGTAP_API_KEY   -                        API key for auth

Set these to avoid typing --server and --api-key every time.
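
For a one-off command you can pass the same values as flags instead; this sketch assumes --server and --api-key are accepted by tail the same way they are by the collect, ingest, and legacy-mode examples elsewhere in this README:

# Equivalent to the tail command above, with explicit flags instead of env vars
logtap tail run1 --follow --server http://<gpu-ip>:8000 --api-key secret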

How It Works

  1. Collector writes logs to append-only files with cursor tracking
  2. Ingest streams stdin over HTTP chunked POST
  3. Tail uses SSE (Server-Sent Events) with resume support
  4. Reconnect passes ?since=<cursor> to continue without gaps

No database. No message queue. Just files and HTTP.
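
To see the resume mechanism directly, you can hit the stream endpoint with curl. The endpoint and the since/follow query parameters come from the API table below; the header name and the follow=true value here are assumptions about this collector's conventions:

# Follow run1 over SSE, resuming from cursor 5000 (-N disables curl buffering).
# "X-API-Key" and "follow=true" are assumed; adjust to your collector's auth scheme.
curl -N "http://<gpu-ip>:8000/runs/run1/stream?since=5000&follow=true" \
  -H "X-API-Key: secret"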

API Endpoints

For scripting or custom integrations:

Endpoint                  Description
POST /runs/{id}/ingest    Stream lines (chunked POST)
GET  /runs/{id}/stream    SSE stream with ?since=&follow=
GET  /runs/{id}/query     Query with ?from=&to=&search=
GET  /runs                List runs
GET  /health              Health check with capabilities
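
A rough sketch of scripting against these endpoints from the shell. Only the paths and query parameters come from the table above; the auth header name and the JSON response format (hence jq) are assumptions:

# Check the collector and its advertised capabilities
curl "http://<gpu-ip>:8000/health"

# Search run1 for lines matching "loss"
curl -H "X-API-Key: secret" \
  "http://<gpu-ip>:8000/runs/run1/query?search=loss" | jq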

Legacy: Static File Mode

logtap also works as a simple remote log viewer (the original use case):

# On server with log files
logtap serve --log-dir /var/log

# From client
logtap tail syslog --server http://myserver:8000 --follow
logtap query auth.log --regex "Failed password"

Security

  • API Key Auth - Optional but recommended for production
  • Path Traversal Protection - Comprehensive defense with symlink-safe containment checks (see SECURITY.md)
  • ReDoS Protection - Uses google-re2 for guaranteed linear-time regex matching
  • Read-Only by Default - Collector only writes to its data directory
  • Input Validation - Rejects control characters, NUL bytes, and malicious path patterns

Development

git clone https://github.com/cainky/logtap.git
cd logtap

# Install with uv
uv sync --extra dev

# Run tests
uv run pytest

# Run collector in dev mode
uv run logtap collect --reload

License

GPL v3 - see LICENSE

Author

Kyle Cain - @cainky
