HELIOS: High-throughput Event-Level Infrastructure for Optimized Scheduling
A distributed event processing framework for scientific computing workloads.
HELIOS is a containerized batch processing system designed for parallel analysis of event-based datasets. Inspired by computing workflows at large research facilities, the system provides a control plane for job orchestration and an execution plane for stateless worker processes.
The framework emphasizes reproducibility, fault isolation, and infrastructure-driven scalability rather than domain-specific analytics or real-time processing.
High-energy physics experiments and large-scale scientific facilities generate massive volumes of event data that require parallel processing for efficient analysis. HELIOS models the essential components of such workflows:
- Independent batch job submission and tracking
- Parallel execution across multiple compute units
- Deterministic result aggregation
- Container-based reproducibility
- Clear separation of control and compute logic
API Service - FastAPI-based REST interface for job management
- Job submission and lifecycle tracking
- Progress monitoring and result aggregation
- System health and statistics endpoints
Job Manager - In-memory orchestrator managing job state transitions
- State machine: `submitted → running → completed | failed` (sketched after this component list)
- Chunk assignment and progress tracking
- Resource allocation coordination
Worker Processes - Stateless compute units processing event subsets
- Independent chunk processing
- Partial statistics computation
- Deterministic output generation
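As an illustration of the control plane, the job lifecycle can be modeled as a small in-memory state machine. The following is a minimal sketch assuming a UUID-keyed job record; the actual fields and methods in job_manager.py may differ.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class JobStatus(str, Enum):
    SUBMITTED = "submitted"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Job:
    """In-memory job record tracked by the Job Manager (illustrative fields)."""
    input_file: str
    num_workers: int
    method: str
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: JobStatus = JobStatus.SUBMITTED
    chunks_done: int = 0


class JobManager:
    """Tracks jobs and enforces the submitted → running → completed | failed flow."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}

    def submit(self, input_file: str, num_workers: int, method: str) -> Job:
        job = Job(input_file=input_file, num_workers=num_workers, method=method)
        self._jobs[job.job_id] = job
        return job

    def mark_running(self, job_id: str) -> None:
        self._jobs[job_id].status = JobStatus.RUNNING

    def mark_finished(self, job_id: str, ok: bool) -> None:
        self._jobs[job_id].status = JobStatus.COMPLETED if ok else JobStatus.FAILED
```

Keeping all job state in a single process keeps the orchestrator simple, at the cost of losing job history on restart, consistent with the in-memory job state management noted under current scope.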
Input Dataset → Job Submission → Chunk Distribution → Parallel Workers → Result Aggregation
| Concept | HELIOS Component |
|---|---|
| Detector event data | CSV event records |
| Batch job | REST API job submission |
| Computing node | Worker process |
| Job scheduler | Job Manager |
| Partial results | Per-chunk statistics |
| Final reconstruction | Aggregated metrics |
- Batch-oriented job execution model
- Data-parallel event processing with configurable parallelism
- Two execution backends:
  - Multiprocessing: shared-memory parallelism on a single node
  - Subprocess: process-isolated execution with intermediate artifacts
- Asynchronous job execution with real-time progress tracking
- RESTful API for programmatic job control
- Docker-based deployment for reproducibility
Event records contain:
- `event_id` - Unique event identifier
- `timestamp` - Event occurrence time
- `sensor_id` - Detector/sensor identifier
- `energy` - Measured energy value
- `status` - Event classification
Computed statistics (a per-chunk computation sketch follows the list):
- Total event count
- Status distribution
- Per-sensor event counts
- Average energy
- High-energy event frequency
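For illustration, the per-chunk partial statistics can be produced in a single pass over the assigned CSV rows. The sketch below uses only the standard library; the column names follow the event record schema above, while the high-energy threshold of 100.0 and the function name are assumptions rather than the actual worker.py logic.

```python
import csv
from collections import Counter


def compute_chunk_stats(csv_path: str, start: int, count: int,
                        high_energy_threshold: float = 100.0) -> dict:
    """Compute partial statistics for `count` events beginning at row `start`."""
    status_counts = Counter()
    sensor_counts = Counter()
    total = 0
    energy_sum = 0.0
    high_energy = 0

    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            if i < start:
                continue
            if i >= start + count:
                break
            total += 1
            status_counts[row["status"]] += 1
            sensor_counts[row["sensor_id"]] += 1
            energy = float(row["energy"])
            energy_sum += energy
            if energy > high_energy_threshold:
                high_energy += 1

    return {
        "total": total,
        "status_counts": dict(status_counts),
        "sensor_counts": dict(sensor_counts),
        "energy_sum": energy_sum,  # kept as a sum so chunks can be merged exactly
        "high_energy_events": high_energy,
    }
```

Returning an energy sum instead of a per-chunk average lets the aggregator compute the global average exactly, which is what keeps result aggregation deterministic.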
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | System health check |
| GET | `/health` | Detailed service status |
| POST | `/jobs/submit` | Submit processing job |
| GET | `/jobs/{job_id}/status` | Query job state |
| GET | `/jobs/{job_id}/result` | Retrieve results |
| GET | `/jobs` | List all jobs |
| DELETE | `/jobs/{job_id}` | Cancel running job |
| GET | `/stats` | System statistics |
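The endpoints can also be driven programmatically. Below is a minimal Python client sketch using the requests library against the routes above; the polling interval and helper name are illustrative, not part of the API.

```python
import time

import requests

BASE_URL = "http://localhost:8000"


def run_job(input_file: str, num_workers: int = 4,
            method: str = "multiprocessing", poll_seconds: float = 1.0) -> dict:
    """Submit a job, poll its status, and return the aggregated result."""
    resp = requests.post(f"{BASE_URL}/jobs/submit", json={
        "input_file": input_file,
        "num_workers": num_workers,
        "method": method,
    })
    resp.raise_for_status()
    job_id = resp.json()["job_id"]

    # Poll until the job reaches a terminal state.
    while True:
        status = requests.get(f"{BASE_URL}/jobs/{job_id}/status").json()["status"]
        if status in ("completed", "failed"):
            break
        time.sleep(poll_seconds)

    return requests.get(f"{BASE_URL}/jobs/{job_id}/result").json()


if __name__ == "__main__":
    print(run_job("data/raw/events.csv"))
```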
- Python 3.11+
- Docker and Docker Compose
Local development:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload --port 8000
```

Docker deployment:

```bash
# Build containers
docker-compose build

# Start API service
docker-compose up api

# Run in background
docker-compose up -d api

# View logs
docker-compose logs -f api
```

Generate a sample dataset:

```bash
python generate_data.py
```

Creates synthetic event data at `data/raw/events.csv`.
Submit a job:

```bash
curl -X POST http://localhost:8000/jobs/submit \
  -H "Content-Type: application/json" \
  -d '{
    "input_file": "data/raw/events.csv",
    "num_workers": 4,
    "method": "multiprocessing"
  }'
```

Response:

```json
{
  "job_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "submitted",
  "message": "Job submitted successfully"
}
```

Check job status:

```bash
curl http://localhost:8000/jobs/{job_id}/status
```

Retrieve results:

```bash
curl http://localhost:8000/jobs/{job_id}/result
```

Example output:
```json
{
  "job_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "completed",
  "results": {
    "total": 1000000,
    "status_counts": {
      "active": 750000,
      "inactive": 250000
    },
    "sensor_counts": {
      "S001": 333333,
      "S002": 333334,
      "S003": 333333
    },
    "avg_energy": 75.42,
    "high_energy_events": 234100,
    "chunks_processed": 4,
    "method": "multiprocessing",
    "num_workers": 4
  }
}
```

Multiprocessing backend: uses Python's `multiprocessing.Pool` for shared-memory parallelism.
Characteristics:
- Low overhead
- Optimal for CPU-bound tasks
- Single-node execution
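Below is a minimal sketch of how a multiprocessing backend of this shape can fan chunks out with `multiprocessing.Pool` and fold the partial results back together. It assumes a `compute_chunk_stats` helper like the one sketched earlier and does not reflect the exact contents of compute.py.

```python
from multiprocessing import Pool

# Assumed helper: per-chunk statistics function like the earlier sketch
# (the module name is hypothetical).
from compute import compute_chunk_stats


def merge_stats(partials: list[dict]) -> dict:
    """Fold per-chunk partial statistics into a single result."""
    merged = {"total": 0, "status_counts": {}, "sensor_counts": {},
              "energy_sum": 0.0, "high_energy_events": 0}
    for p in partials:
        merged["total"] += p["total"]
        merged["energy_sum"] += p["energy_sum"]
        merged["high_energy_events"] += p["high_energy_events"]
        for key in ("status_counts", "sensor_counts"):
            for name, n in p[key].items():
                merged[key][name] = merged[key].get(name, 0) + n
    merged["avg_energy"] = merged["energy_sum"] / merged["total"] if merged["total"] else 0.0
    return merged


def run_multiprocessing(csv_path: str, total_events: int, num_workers: int) -> dict:
    """Split the event range into num_workers chunks and process them in parallel."""
    chunk = -(-total_events // num_workers)  # ceiling division
    args = [(csv_path, i * chunk, chunk) for i in range(num_workers)]
    with Pool(processes=num_workers) as pool:
        partials = pool.starmap(compute_chunk_stats, args)
    return merge_stats(partials)
```

Note that `Pool` requires the chunk function to be importable at module level so it can be pickled into the worker processes.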
Configuration:
```json
{
  "method": "multiprocessing",
  "num_workers": 4
}
```

Subprocess backend: launches independent worker processes with stronger isolation.
Characteristics:
- Process-level fault isolation
- Intermediate result artifacts
- Models distributed execution semantics
Configuration:
```json
{
  "method": "subprocess",
  "num_workers": 4
}
```
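Below is a minimal sketch of a subprocess-style launcher that starts one worker.py process per chunk and reads back the intermediate JSON artifacts. The positional arguments mirror the standalone worker invocation shown in the testing section below, but the real CLI and aggregation logic may differ.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_subprocess_backend(csv_path: str, total_events: int, num_workers: int,
                           out_dir: str = "data/processed/chunks") -> list[dict]:
    """Launch isolated worker processes and collect their per-chunk JSON results."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    chunk = -(-total_events // num_workers)  # ceiling division

    procs = []
    outputs = []
    for chunk_id in range(num_workers):
        out_path = Path(out_dir) / f"chunk_{chunk_id}.json"
        outputs.append(out_path)
        # Positional args mirror: worker.py <input> <start> <count> <chunk_id> <output>
        cmd = [sys.executable, "worker.py", csv_path,
               str(chunk_id * chunk), str(chunk), str(chunk_id), str(out_path)]
        procs.append(subprocess.Popen(cmd))

    for p in procs:
        if p.wait() != 0:
            raise RuntimeError(f"worker exited with code {p.returncode}")

    return [json.loads(path.read_text()) for path in outputs]
```

Because each worker writes its partial result to its own file, a crashed worker cannot corrupt the state of the others; the launcher only aggregates artifacts from workers that exited cleanly.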
```
helios/
├── scripts/
│ ├── data/
│ │ ├── raw/ # Input datasets
│ │ └── processed/ # Processing artifacts
│ ├── main.py # FastAPI application
│ ├── job_manager.py # Job lifecycle management
│ ├── compute.py # Parallel processing engine
│ ├── worker.py # Worker process implementation
│ ├── generate_data.py # Dataset generator
│ ├── test_api.py # Integration tests
│ ├── Dockerfile # API container image
│ ├── Dockerfile.worker # Worker container image
│ ├── docker-compose.yml # Local orchestration
│ └── requirements.txt # Python dependencies
└── README.md
```
- Worker count should match available CPU cores for optimal throughput
- Data chunking limits memory footprint per worker
- Subprocess execution accepts extra I/O overhead in exchange for stronger fault isolation
- Throughput scales roughly linearly with worker count for CPU-bound workloads, up to the number of available cores
- I/O- or network-bound workloads may exhibit sub-linear scaling
Run the integration tests:

```bash
python test_api.py
```

Run a worker via the Docker test profile:

```bash
docker-compose --profile worker-test up worker
```

Run a worker standalone on a single chunk:

```bash
python worker.py data/raw/events.csv 0 1000 0 data/processed/chunks/test.json
```

Job submission parameters:
- `input_file` - Path to input CSV dataset
- `num_workers` - Parallel worker count (default: 4)
- `method` - Execution backend: `multiprocessing` or `subprocess`
Environment variables:
- `HELIOS_DATA_DIR` - Data directory path
- `HELIOS_LOG_LEVEL` - Logging verbosity
- `HELIOS_MAX_WORKERS` - Maximum concurrent workers
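A minimal sketch of reading these variables at startup; the defaults are assumptions, not documented values.

```python
import os

# Defaults here are illustrative; the service may use different ones.
DATA_DIR = os.environ.get("HELIOS_DATA_DIR", "data")
LOG_LEVEL = os.environ.get("HELIOS_LOG_LEVEL", "INFO")
MAX_WORKERS = int(os.environ.get("HELIOS_MAX_WORKERS", str(os.cpu_count() or 4)))


def clamp_workers(requested: int) -> int:
    """Cap a job's requested worker count at the configured maximum."""
    return max(1, min(requested, MAX_WORKERS))
```

Capping per-job worker counts at `HELIOS_MAX_WORKERS` lines up with the performance note above about matching worker count to available CPU cores.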
- Single-node execution with multiprocessing and subprocess backends
- In-memory job state management
- Local filesystem for data and intermediate results
- RESTful API for job control and monitoring
- Docker-based containerization for reproducibility
This project draws inspiration from batch processing systems used at:
- CERN Computing Infrastructure
- Large Hadron Collider experiments
- Distributed computing workflows in high-energy physics
MIT