From 447b01988110d11fcd5c9d8c8fafd6c388ab8f8d Mon Sep 17 00:00:00 2001 From: Nick Rhodes Date: Fri, 9 May 2025 14:21:01 +0100 Subject: [PATCH 1/5] Main structure, WIP more content needed --- book/scheduling-submission.md | 244 +++++++++++++++++++++++++++++++--- 1 file changed, 225 insertions(+), 19 deletions(-) diff --git a/book/scheduling-submission.md b/book/scheduling-submission.md index 2abb4aa..ae26000 100644 --- a/book/scheduling-submission.md +++ b/book/scheduling-submission.md @@ -1,29 +1,235 @@ -# Session 5: Job Scheduling & Submission +# Session 5: Introduction to Job Scheduling and Batch Jobs -## Overview of HPC job scheduling systems -### General background -### Slurm -## Job Scripts -### Structure of job scripts -### resource requests -### submission command +In this session, you'll explore: +* What job schedulers are and their importance +* How to write and submit batch job scripts +* Key SLURM commands for managing your tasks +* Working with output and error files +* How to use modules to manage environments -## exercise +--- -- What is a scheduler -- Fair use +## What Is a Job Scheduler? πŸ€– -- Again, divided up into mini tabset "presentations" +A **job scheduler** is a software that manages and prioritizes compute jobs. -- https://cautious-happiness-oz2284n.pages.github.io/#/basic-slurm-configuration -- https://cautious-happiness-oz2284n.pages.github.io/#/basic-slurm-job-scripts -- https://arcdocs.leeds.ac.uk/aire/usage/job_type.html -- https://arcdocs.leeds.ac.uk/aire/usage/job_example.html +### Key Points to Explore: +* What do you think a job scheduler does? Write down your thoughts. +* How does a scheduler help in maximizing cluster resources? -## Practical - write and submit a simple job script, monitor its progress, and learn to interpret feedback +**Task:** Review the following diagram on job scheduling: -- Create/share example job and job script -https://arctraining.github.io/rc-slides/hpc1.html#/submit-a-serial-python-job +* **User β†’ Scheduler β†’ Compute Nodes** +πŸ”— [More on Aire's job scheduler](https://arcdocs.leeds.ac.uk/aire/usage/job_scheduler.html) + +--- + +## What Are Batch Jobs? πŸ“ + +**Batch jobs** are non-interactive tasks that are scheduled and run on the cluster. + +### Key Concepts to Understand: + +* How do batch jobs differ from interactive sessions? +* What does the **fair-share policy** mean? + +**Self-Reflection:** Think about how batch jobs are queued and prioritized on the system. Why is the fair-share policy important? + +--- + +## Writing Job Scripts ✍️ + +To create a job script, you’ll need to: + +* Edit with `nano`, `vim`, or `emacs` (Beginner? Try `nano`). +* Always include `#SBATCH` directives at the top. + +### Key Steps to Try: + +1. Write a simple script with these directives: `--time`, `--mem`, `--cpus-per-task`. +2. What happens if you forget a necessary directive, like `--time`? + +**Tip:** If you wrote your script on Windows, run `dos2unix` to avoid submission errors. + +πŸ”— [Writing scripts on Aire](https://arcdocs.leeds.ac.uk/aire/usage/job_type.html#writing-job-scripts) + +--- + +## Simple Serial Job πŸš€ + +### Example Job Script: + +```bash +#!/bin/bash +#SBATCH --job-name=basic_job +#SBATCH --time=01:00:00 +#SBATCH --mem=1G +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=1 +``` + +**Activity:** Create and submit a simple job using the script above. What do you expect the output to look like? + +--- + +## Viewing & Cancelling Jobs πŸ”βœ‚οΈ + +**Learn how to monitor jobs:** + +* Use `squeue` to list running jobs. 
+* Example output: + +```bash +JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) +12345 general basic_job user R 00:02 1 node01 +``` + +### Challenge: + +* Run `squeue` and identify the status of your current job. +* Try canceling a job using `scancel `. What happens? + +πŸ”— [Monitoring jobs](https://arcdocs.leeds.ac.uk/aire/usage/job_type.html#using-the-queue) + +--- + +## Task Arrays Simplified πŸ” + +**Task arrays** allow you to submit multiple jobs with similar scripts, each identified by a unique task ID. + +### Try It: + +1. Write a script with the following: + +```bash +#SBATCH --array=1-3 +#SBATCH --output=array_%A_%a.out +``` + +2. Run `sbatch` with your task array script. +3. Check the output files for each task. + +--- + +## Interactive Sessions πŸ’» + +**Interactive sessions** are useful for real-time testing and debugging. + +### Self-Test: + +1. Start an interactive session: `srun --pty -t01:00 bash` +2. Inside the session, load a module, e.g., `module load python`. +3. Exit the session and reflect on the limitations. + +**Reflection:** Why are interactive sessions not recommended for long-term tasks? + +--- + +## Working with Output and Error Files πŸ–₯️ + +Every job generates two files: + +* **Standard Output** (`slurm-.out`) +* **Standard Error** (`slurm-.err`) + +### Activity: + +1. Submit a job script. +2. Check the output and error files. +3. Try redirecting them to specific locations using `#SBATCH --output=/path/to/output_file`. + +--- + +## High Memory & GPU Nodes πŸ§ πŸ’» + +If your job requires **high memory** or **GPU nodes**, request them using `--mem` or `--gres=gpu`. + +### Key Challenge: + +1. Request high memory for your job by adding `#SBATCH --mem=256G`. +2. Try requesting a GPU with `#SBATCH --gres=gpu:1`. + +**Reflection:** When would you need high memory or GPU resources? + +πŸ”— [High Memory and GPU Nodes](https://arcdocs.leeds.ac.uk/aire/usage/high_memory_gpu.html) + +--- + +## SLURM Options at a Glance πŸ“Š + +| Option | Purpose | +| ----------------- | ------------------------ | +| `--time` | Runtime (required) | +| `--cpus-per-task` | Cores per task | +| `--mem` | Memory per node | +| `--partition` | Select queue (e.g., gpu) | + +### Challenge: + +* Review the table and write down which options you think are most important for your work. + +--- + +## Recap & Quiz 🎯 + +**Test Your Knowledge:** + +1. What does a job scheduler do? +2. Which flag sets the number of CPU cores per task? +3. How do you cancel a job with ID 42? +4. What’s the default memory allocation per job? + +--- + +## Exercise πŸ”¬ + +**Create a job script** that: + +1. Requests 2 cores and 4 GB RAM +2. Runs for 30 minutes +3. Is part of a 5-task array +4. Loads the Python module + +**Check Your Solution:** Compare your script with the example provided. + +```bash +#!/bin/bash +#SBATCH --job-name=capstone +#SBATCH --time=00:30:00 +#SBATCH --cpus-per-task=2 +#SBATCH --mem=4G +#SBATCH --array=1-5 +#SBATCH --output=capstone_%A_%a.out +#SBATCH --error=capstone_%A_%a.err + +module load python + +echo "Job ID: $SLURM_JOB_ID" +echo "Task ID: $SLURM_ARRAY_TASK_ID" +hostname +module list +``` + +--- + +## Closing Thoughts πŸ” + +You’ve now learned how to: + +* Write and submit batch jobs +* Monitor job status and cancel jobs +* Utilize high memory and GPU nodes + +For further exploration, revisit the documentation linked throughout this session, and practice writing and submitting scripts. 
+ +πŸ“š For full documentation, visit: [arcdocs.leeds.ac.uk/aire](https://arcdocs.leeds.ac.uk/aire) + +--- + +### **Next Steps:** + +1. Keep practicing by submitting your own job scripts and experimenting with SLURM options. +2. Try using advanced SLURM features as you become more familiar with the system. \ No newline at end of file From 5e4393b0b5dd08aec2ae523e222380cfa5495c4f Mon Sep 17 00:00:00 2001 From: Nick Rhodes Date: Mon, 2 Jun 2025 22:29:24 +0100 Subject: [PATCH 2/5] more detail on background info, added a quiz, add exercise answers --- book/scheduling-submission.md | 514 ++++++++++++++++++++++++++-------- 1 file changed, 394 insertions(+), 120 deletions(-) diff --git a/book/scheduling-submission.md b/book/scheduling-submission.md index ae26000..018c432 100644 --- a/book/scheduling-submission.md +++ b/book/scheduling-submission.md @@ -1,235 +1,509 @@ # Session 5: Introduction to Job Scheduling and Batch Jobs -In this session, you'll explore: +In this session, you will learn: -* What job schedulers are and their importance +* What a job scheduler is and why it is used * How to write and submit batch job scripts -* Key SLURM commands for managing your tasks -* Working with output and error files -* How to use modules to manage environments +* How to monitor, manage, and cancel jobs +* How to use modules to set up software environments +* How to request high memory and GPU resources +* (Optional) How to submit task arrays --- -## What Is a Job Scheduler? πŸ€– +## Background: What is a Job Scheduler? -A **job scheduler** is a software that manages and prioritizes compute jobs. +High Performance Computing (HPC) systems are shared by many users, each submitting their own jobs β€” code they want to run using the cluster's compute power. -### Key Points to Explore: +A **job scheduler** is the system that: +- Organizes when and where jobs run +- Allocates the requested resources (CPU cores, memory, GPUs) +- Ensures fair access to shared resources for all users -* What do you think a job scheduler does? Write down your thoughts. -* How does a scheduler help in maximizing cluster resources? +Schedulers make decisions based on: +- What resources your job requests (e.g., how many cores, how much memory) +- How long your job will run (your time limit) +- Current system load +- Fair-share policies (giving all users fair access over time) -**Task:** Review the following diagram on job scheduling: +**Without a scheduler**, users would have to manually coordinate access to thousands of CPUs β€” impractical and chaotic. -* **User β†’ Scheduler β†’ Compute Nodes** +--- + +## SLURM: The Scheduler on Aire + +At Leeds, the **SLURM** scheduler (Simple Linux Utility for Resource Management) manages all jobs on the Aire cluster. + +When you submit a job: +1. You describe what you need (e.g., CPUs, memory, time) in a **job script**. +2. You submit the job to SLURM with `sbatch`. +3. SLURM places your job in a **queue**. +4. When enough resources are available and your job’s priority is high enough, SLURM starts the job on suitable **compute nodes**. 
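+
+For example, a minimal job script and its submission might look like this. This is only a sketch: the file name and resource values are illustrative, and fuller, tested examples follow in the hands-on section below.
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=my_job       # name shown in the queue
+#SBATCH --time=00:10:00         # maximum runtime requested
+#SBATCH --mem=1G                # memory requested
+#SBATCH --cpus-per-task=1       # CPU cores requested
+
+echo "Hello from $(hostname)"   # the work the job actually does
+```
+
+You would then hand the script to SLURM with `sbatch my_job.sh`, and it waits in the queue until the requested resources are free.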
+ +--- + +### How Jobs Flow Through the System + +```plaintext +Your Job Script + β”‚ + β–Ό + SLURM Scheduler + β”‚ + β”œβ”€β”€ Queues Jobs + β”œβ”€β”€ Prioritizes Jobs + β”œβ”€β”€ Allocates Resources + β–Ό + Compute Nodes (Run the job) + β”‚ + β–Ό + Output Files +```` + +--- -πŸ”— [More on Aire's job scheduler](https://arcdocs.leeds.ac.uk/aire/usage/job_scheduler.html) +### Common Job States + +| **State** | **Meaning** | +| ----------- | ------------------------------------------ | +| `PENDING` | Job is waiting for resources | +| `RUNNING` | Job is actively running on compute nodes | +| `COMPLETED` | Job finished successfully | +| `FAILED` | Job failed (e.g., errors, exceeded limits) | +| `CANCELLED` | Job was manually stopped (e.g., by user) | + +You can monitor job states with the `squeue` command. --- -## What Are Batch Jobs? πŸ“ +### Why Do Jobs Wait? + +Not every job runs immediately. Reasons include: -**Batch jobs** are non-interactive tasks that are scheduled and run on the cluster. +* Not enough CPUs/memory free +* Higher priority jobs ahead of yours +* Fair-share adjustment (users who have used less recently get higher priority) +* Your job is requesting rare resources (e.g., GPU nodes) + +--- -### Key Concepts to Understand: +### How You Interact with the Scheduler -* How do batch jobs differ from interactive sessions? -* What does the **fair-share policy** mean? +| **Action** | **Command** | +| -------------- | ---------------------- | +| Submit a job | `sbatch my_job.sh` | +| View your jobs | `squeue -u ` | +| Cancel a job | `scancel ` | -**Self-Reflection:** Think about how batch jobs are queued and prioritized on the system. Why is the fair-share policy important? +You will learn these commands and write your first job script in the next sections. --- -## Writing Job Scripts ✍️ +## Why Use Batch Jobs? -To create a job script, you’ll need to: +Batch jobs allow you to: -* Edit with `nano`, `vim`, or `emacs` (Beginner? Try `nano`). -* Always include `#SBATCH` directives at the top. +* Set up your work once +* Submit it to the scheduler +* Log out and let it run unattended +* Automatically capture outputs and errors into files -### Key Steps to Try: +This is essential for longer jobs that run for hours or days β€” you don't need to stay logged in. -1. Write a simple script with these directives: `--time`, `--mem`, `--cpus-per-task`. -2. What happens if you forget a necessary directive, like `--time`? +--- + +## Summary -**Tip:** If you wrote your script on Windows, run `dos2unix` to avoid submission errors. +* A job scheduler manages who runs jobs and when on an HPC cluster. +* SLURM is the scheduler used on Aire. +* You write a job script to describe what you need. +* SLURM queues, prioritizes, and runs your job on available compute nodes. +* You interact with SLURM via simple commands like `sbatch`, `squeue`, and `scancel`. + +--- -πŸ”— [Writing scripts on Aire](https://arcdocs.leeds.ac.uk/aire/usage/job_type.html#writing-job-scripts) +# Hands-On Practical --- -## Simple Serial Job πŸš€ +## Hands-On: Write and Submit a Simple Job -### Example Job Script: +**Exercise:** +Write a batch script requesting: + +* 2 CPUs +* 4 GB memory +* 30 minutes runtime +* Load the Python module +* Run a simple command like `hostname` + +Submit the script. 
+ +**Answer:** ```bash #!/bin/bash -#SBATCH --job-name=basic_job -#SBATCH --time=01:00:00 -#SBATCH --mem=1G -#SBATCH --ntasks=1 -#SBATCH --cpus-per-task=1 +#SBATCH --job-name=simple_job +#SBATCH --time=00:30:00 +#SBATCH --mem=4G +#SBATCH --cpus-per-task=2 +#SBATCH --output=simple_output_%j.out +#SBATCH --error=simple_error_%j.err + +module load python + +hostname +``` + +Submit using: + +```bash +sbatch simple_job.sh ``` -**Activity:** Create and submit a simple job using the script above. What do you expect the output to look like? +**Expected Output:** + +* You will receive a job submission message in the terminal like: + + ``` + Submitted batch job 123456 + ``` + +* After the job completes, check the output file `simple_output_123456.out` in your current working directory. + +* **Contents of Output File:** + + ``` + nodeXYZ.arc.leeds.ac.uk + ``` + + (Your job ran `hostname`, so you get the compute node name.) --- -## Viewing & Cancelling Jobs πŸ”βœ‚οΈ +## Monitoring Jobs + +Check job status: -**Learn how to monitor jobs:** +```bash +squeue -u +``` -* Use `squeue` to list running jobs. -* Example output: +Cancel a job: ```bash -JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) -12345 general basic_job user R 00:02 1 node01 +scancel ``` -### Challenge: +**Exercise:** +Submit a job and use `squeue` to monitor it. Cancel the job once it starts running. + +**Answer:** + +1. Submit job: `sbatch simple_job.sh` +2. Monitor job: `squeue -u ` +3. Cancel job: `scancel ` + +**Expected Output:** + +* `squeue` shows your job in the queue: -* Run `squeue` and identify the status of your current job. -* Try canceling a job using `scancel `. What happens? + ``` + JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) + 123456 general simple_job user1 PD 0:00 1 (Priority) + ``` -πŸ”— [Monitoring jobs](https://arcdocs.leeds.ac.uk/aire/usage/job_type.html#using-the-queue) + (Status `PD` means Pending; `R` means Running.) + +* After `scancel`, job disappears from `squeue` list. --- -## Task Arrays Simplified πŸ” +## Interactive Jobs + +Interactive jobs are useful for quick testing or debugging: + +```bash +srun --pty --time=01:00:00 bash +``` -**Task arrays** allow you to submit multiple jobs with similar scripts, each identified by a unique task ID. +**Exercise:** -### Try It: +1. Start an interactive session. +2. Load the Python module. +3. Run `hostname`. +4. Exit the session. -1. Write a script with the following: +**Answer:** ```bash -#SBATCH --array=1-3 -#SBATCH --output=array_%A_%a.out +srun --pty --time=01:00:00 bash +module load python +hostname +exit ``` -2. Run `sbatch` with your task array script. -3. Check the output files for each task. +**Expected Output:** + +* Terminal will change β€” you’ll have a shell prompt on a compute node. + +* After running `hostname`, you’ll see the node name: + + ``` + nodeXYZ.arc.leeds.ac.uk + ``` + +* `exit` returns you to the login node. --- -## Interactive Sessions πŸ’» +## Output and Error Files -**Interactive sessions** are useful for real-time testing and debugging. +SLURM creates: -### Self-Test: +* `slurm-.out` β€” standard output +* `slurm-.err` β€” standard error (if specified separately) -1. Start an interactive session: `srun --pty -t01:00 bash` -2. Inside the session, load a module, e.g., `module load python`. -3. Exit the session and reflect on the limitations. +Control output filenames: -**Reflection:** Why are interactive sessions not recommended for long-term tasks? 
+```bash +#SBATCH --output=/path/to/output_file.out +#SBATCH --error=/path/to/error_file.err +``` ---- +**Exercise:** +Modify your script to redirect output and error to specific files. -## Working with Output and Error Files πŸ–₯️ +**Answer:** -Every job generates two files: +```bash +#!/bin/bash +#SBATCH --job-name=simple_job +#SBATCH --time=00:30:00 +#SBATCH --mem=4G +#SBATCH --cpus-per-task=2 +#SBATCH --output=/home//my_output.out +#SBATCH --error=/home//my_error.err + +module load python + +hostname +``` -* **Standard Output** (`slurm-.out`) -* **Standard Error** (`slurm-.err`) +**Expected Output:** -### Activity: +* After job completion, you will find: -1. Submit a job script. -2. Check the output and error files. -3. Try redirecting them to specific locations using `#SBATCH --output=/path/to/output_file`. + * `/home//my_output.out` β€” contains node name from `hostname`. + * `/home//my_error.err` β€” should be empty if no errors. --- -## High Memory & GPU Nodes πŸ§ πŸ’» +## High Memory and GPU Requests -If your job requires **high memory** or **GPU nodes**, request them using `--mem` or `--gres=gpu`. +For high-memory jobs: -### Key Challenge: +```bash +#SBATCH --mem=256G +``` -1. Request high memory for your job by adding `#SBATCH --mem=256G`. -2. Try requesting a GPU with `#SBATCH --gres=gpu:1`. +For GPU jobs: -**Reflection:** When would you need high memory or GPU resources? +```bash +#SBATCH --gres=gpu:1 +``` -πŸ”— [High Memory and GPU Nodes](https://arcdocs.leeds.ac.uk/aire/usage/high_memory_gpu.html) +**Exercise:** +Update your batch script to request: ---- +* 256 GB memory +* 1 GPU + +**Answer:** + +```bash +#!/bin/bash +#SBATCH --job-name=highmem_gpu_job +#SBATCH --time=00:30:00 +#SBATCH --mem=256G +#SBATCH --cpus-per-task=2 +#SBATCH --gres=gpu:1 +#SBATCH --output=highmem_output_%j.out +#SBATCH --error=highmem_error_%j.err -## SLURM Options at a Glance πŸ“Š +module load python -| Option | Purpose | -| ----------------- | ------------------------ | -| `--time` | Runtime (required) | -| `--cpus-per-task` | Cores per task | -| `--mem` | Memory per node | -| `--partition` | Select queue (e.g., gpu) | +hostname +``` -### Challenge: +Submit using: -* Review the table and write down which options you think are most important for your work. +```bash +sbatch highmem_gpu_job.sh +``` ---- +**Expected Output:** + +* Terminal submission message: -## Recap & Quiz 🎯 + ``` + Submitted batch job 123457 + ``` +* Output file `highmem_output_123457.out` will contain: -**Test Your Knowledge:** + ``` + nodeXYZ.arc.leeds.ac.uk + ``` -1. What does a job scheduler do? -2. Which flag sets the number of CPU cores per task? -3. How do you cancel a job with ID 42? -4. What’s the default memory allocation per job? + (Node assigned may be a GPU node.) --- -## Exercise πŸ”¬ +## (Optional) Task Arrays -**Create a job script** that: +Arrays allow submitting multiple similar jobs efficiently. -1. Requests 2 cores and 4 GB RAM -2. Runs for 30 minutes -3. Is part of a 5-task array -4. Loads the Python module +Example script: -**Check Your Solution:** Compare your script with the example provided. +```bash +#!/bin/bash +#SBATCH --job-name=array_job +#SBATCH --array=1-3 +#SBATCH --output=array_output_%A_%a.out + +module load python + +python my_script.py $SLURM_ARRAY_TASK_ID +``` + +Submit: + +```bash +sbatch array_script.sh +``` + +**Exercise (Optional):** +Write a batch script to submit a task array with 3 tasks, each printing its task ID. 
+ +**Answer:** ```bash #!/bin/bash -#SBATCH --job-name=capstone -#SBATCH --time=00:30:00 -#SBATCH --cpus-per-task=2 -#SBATCH --mem=4G -#SBATCH --array=1-5 -#SBATCH --output=capstone_%A_%a.out -#SBATCH --error=capstone_%A_%a.err +#SBATCH --job-name=array_example +#SBATCH --array=1-3 +#SBATCH --output=array_%A_%a.out module load python -echo "Job ID: $SLURM_JOB_ID" echo "Task ID: $SLURM_ARRAY_TASK_ID" hostname -module list ``` +Submit using: + +```bash +sbatch array_example.sh +``` + +**Expected Output:** + +* Three output files: + + ``` + array__1.out + array__2.out + array__3.out + ``` +* Each file contains: + + ``` + Task ID: 1 + nodeXYZ.arc.leeds.ac.uk + ``` + + (or `2`, `3` β€” depending on the task.) + --- -## Closing Thoughts πŸ” +## Further Reading -You’ve now learned how to: +* [Aire Job Scheduling Guide](https://arcdocs.leeds.ac.uk/aire/usage/job_scheduler.html) +* [Writing Job Scripts](https://arcdocs.leeds.ac.uk/aire/usage/job_type.html) +* [Interactive Jobs](https://arcdocs.leeds.ac.uk/aire/usage/job_type.html#interactive) -* Write and submit batch jobs -* Monitor job status and cancel jobs -* Utilize high memory and GPU nodes +--- -For further exploration, revisit the documentation linked throughout this session, and practice writing and submitting scripts. +**Next Steps:** -πŸ“š For full documentation, visit: [arcdocs.leeds.ac.uk/aire](https://arcdocs.leeds.ac.uk/aire) +* Practice writing and submitting simple job scripts. +* Experiment with resource requests. +* Explore more advanced SLURM features as needed. --- -### **Next Steps:** +## **Recap Quiz** + +**Q1.** +What is the purpose of a job scheduler on an HPC system? + +* A) To speed up internet connections +* B) To manually assign jobs to users +* C) To allocate compute resources and manage job queues +* D) To monitor user emails + +> **Answer:** +> **C) To allocate compute resources and manage job queues** + +--- + +**Q2.** +Which scheduler is used on the Aire HPC system? + +* A) PBS +* B) SLURM +* C) LSF +* D) Grid Engine -1. Keep practicing by submitting your own job scripts and experimenting with SLURM options. -2. Try using advanced SLURM features as you become more familiar with the system. \ No newline at end of file +> **Answer:** +> **B) SLURM** + +--- + +**Q3.** +What does the job state `PENDING` mean? + +* A) The job is actively running on a node +* B) The job is waiting for available resources +* C) The job has completed successfully +* D) The job was cancelled by the user + +> **Answer:** +> **B) The job is waiting for available resources** + +--- + +**Q4.** +Which command would you use to submit a job script to SLURM? + +* A) `srun` +* B) `squeue` +* C) `sbatch` +* D) `scancel` + +> **Answer:** +> **C) sbatch** + +--- + +**Q5.** +Why is it important to set a time limit (`--time`) in your job script? 
+ +* A) It makes the job run faster +* B) It helps SLURM schedule jobs more efficiently +* C) It avoids the job being cancelled for overrun +* D) Both B and C + +> **Answer:** +> **D) Both B and C** + +--- From 4a2bc766152b62d068cdb0568ac4edfad9d5441e Mon Sep 17 00:00:00 2001 From: Nick Rhodes Date: Wed, 4 Jun 2025 07:24:34 +0100 Subject: [PATCH 3/5] Refinements --- book/best-practices-troubleshooting.md | 344 ++++++++++++++++++++++++- book/scheduling-submission.md | 4 +- book/wrap-up.md | 184 ++++++++++++- 3 files changed, 521 insertions(+), 11 deletions(-) diff --git a/book/best-practices-troubleshooting.md b/book/best-practices-troubleshooting.md index a48f591..2dd3463 100644 --- a/book/best-practices-troubleshooting.md +++ b/book/best-practices-troubleshooting.md @@ -1,10 +1,342 @@ -# Session 6: HPC Best Practices & Troubleshooting +# **Session 6: HPC Best Practices & Troubleshooting** +## **Learning Outcomes** -## Common issues with job submission and system usage +By the end of this session, you will be able to: -## Strategies for error diagnosis and resource optimization +* Recognize common job submission and system usage issues. +* Apply systematic troubleshooting techniques. +* Optimize resource usage and avoid common pitfalls. +* Navigate documentation effectively (arcdocs, Google). +* Submit an effective support ticket when needed. -## Guidance and Support -### arcdocs (and google) -### submit a ticket \ No newline at end of file +--- + +## **Background / Introduction** + +High Performance Computing (HPC) systems are powerful but complex environments. Errors and failures are common, especially for new users. Learning how to systematically troubleshoot saves time, reduces frustration, and improves your productivity. + +You are encouraged to resolve problems independently before seeking help. This session will provide practical tools and workflows to diagnose issues and know when and how to escalate. + +--- + +## **Common Issues with Job Submission and System Usage** + +* **Job stuck in PENDING**: + + * Insufficient resources available. + * Requested rare resources (e.g., GPUs, high-memory). + * User priority is low (fair-share scheduling). + +* **Job fails immediately**: + + * Syntax error in script. + * Incorrect module loaded or missing environment setup. + * File paths incorrect or inaccessible. + +* **Job exceeds time/memory limits**: + + * Underestimated resource requirements. + * Infinite loops or runaway jobs. + +* **Storage problems**: + + * Disk quota exceeded. + * Writing to non-scratch areas with insufficient space. + +* **Permission errors**: + + * Incorrect file/directory permissions. + * Attempt to write to read-only directories. + +--- + +## **Strategies for Error Diagnosis and Resource Optimization** + +A methodical approach helps: + +1. **Read the Error Message**: + + * First line of defense. + * Look for keywords like `Segmentation fault`, `Permission denied`, `Out of memory`. + +2. **Check Output and Error Logs**: + + * Look at `.out` and `.err` files created by SLURM. + * Search for unusual termination messages. + +3. **Validate the Job Script**: + + * Are directives correct (`#SBATCH`)? + * Paths to modules, data files correct? + +4. **Use Monitoring Tools**: + + * `squeue` β€” check job status. + * `scontrol show job ` β€” detailed job info. + * `sacct` β€” view accounting data after job completes. + * `seff ` β€” summarize efficiency (CPU and memory). + +5. **Resource Requests**: + + * Adjust memory (`--mem`), CPUs (`--cpus-per-task`), or time (`--time`). 
+ * Test with smaller jobs first. + +6. **Minimal Reproducible Example**: + + * Simplify the problem. + * Remove unnecessary steps to isolate the cause. + +7. **Restart Strategy**: + + * If a job crashes, ensure it can restart from checkpoints. + +--- + +## **Bonus Tips** + +\::::{admonition} πŸ” Troubleshooting Tips + +* **Always check the `.err` file** first β€” many runtime errors are logged there. +* **Use `seff `** after job completion to quickly check if memory and CPUs were used efficiently. +* **Google Smartly**: Put error messages in **quotes** to search exact phrases. +* **Save working job scripts** β€” version control isn't just for code. + \:::: + +--- + +## **Sample Error Log Snippet** + +Typical `.err` file: + +``` +Traceback (most recent call last): + File "big_simulation.py", line 42, in + import numpy +ModuleNotFoundError: No module named 'numpy' + +srun: error: node1234: task 0: Exited with exit code 1 +``` + +**Interpretation**: + +* Top error: Python cannot find the `numpy` module β€” indicates missing environment/module. +* Bottom error: SLURM shows that task 0 exited abnormally. + +--- + +## **Guidance and Support** + +### **Using arcdocs and Google Effectively** + +* **arcdocs**: + + * [Aire HPC Documentation](https://arcdocs.leeds.ac.uk/) + * Search with clear keywords (e.g., "job submission error", "SLURM memory limit"). + +* **Google**: + + * Copy error messages *verbatim* into search. + * Use quotes for exact matches. + +**Example**: + +> Error: `srun: error: Unable to allocate resources: Requested node configuration is not available` +> Google: `"srun: error: Unable to allocate resources: Requested node configuration is not available" HPC SLURM` + +--- + +### **Submitting a Support Ticket** + +Only escalate if: + +* You have attempted basic troubleshooting. +* The issue persists and blocks your work. + +**How to Write a Good Ticket**: + +1. **Clear Description**. +2. **Job Details** β€” job script, output/error logs. +3. **Environment Info** β€” loaded modules, software versions. +4. **What You’ve Tried**. + +**Example**: + +> **Subject**: Job Failing with Out of Memory β€” Aire HPC +> +> **Description**: +> I'm submitting a job with 32 cores and 128GB memory. It fails with "oom-killer" message after \~1 hour. +> +> **Script**: +> +> ```bash +> #SBATCH --cpus-per-task=32 +> #SBATCH --mem=128G +> #SBATCH --time=2:00:00 +> ``` +> +> **Modules Loaded**: +> +> * python/3.8 +> * mpi/openmpi-4.1 +> +> **Error Logs**: +> `Out of memory: Kill process 12345 (python) score 1234 or sacrifice child` +> +> **Steps Tried**: +> +> * Reduced number of cores. +> * Increased memory request to 160GB (still fails). + +--- + +## **Exercises** + +### Exercise 1: Diagnose a Failed Job + +You submitted: + +```bash +#!/bin/bash +#SBATCH --job-name=test_fail +#SBATCH --time=00:30:00 +#SBATCH --mem=2G +#SBATCH --cpus-per-task=4 +#SBATCH --output=test_output.out +#SBATCH --error=test_error.err + +module load python +python big_simulation.py +``` + +Error file contains: + +``` +ModuleNotFoundError: No module named 'numpy' +``` + +**Task**: Identify the issue and suggest a fix. + +--- + +### Exercise 2: Find the Right Documentation + +Error: + +``` +srun: error: Unable to create job step +``` + +**Task**: Use arcdocs or Google to find possible causes and solutions. + +--- + +### Exercise 3: Draft a Support Ticket + +Error: +`slurmstepd: error: Exceeded job memory limit` +Job requested 8GB memory. Simulation needs \~20GB. + +**Task**: Draft a support ticket reporting the issue. 
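+
+**Tip:** Before drafting the ticket, capture the job's accounting data so you can include it as evidence. A minimal sketch using `sacct` and `scontrol` (job ID `987654` is just an example):
+
+```bash
+# Summarise what the failed job requested and actually used
+sacct -j 987654 --format=JobID,JobName,State,ExitCode,Elapsed,ReqMem,MaxRSS
+
+# Detailed scheduler view, available while the job is still in SLURM's recent records
+scontrol show job 987654
+```
+
+Pasting this output into the ticket gives the support team the runtime and memory picture straight away.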
+ +--- + +## **Answers / Expected Outputs** + +### Exercise 1 Answer + +**Issue**: + +* The Python environment lacks the `numpy` module. + +**Fix**: + +* Load a Python module with `numpy` or install it. + +Example: + +```bash +module load python/3.10 +pip install --user numpy +``` + +--- + +### Exercise 2 Answer + +**Search Terms**: + +> `SLURM srun Unable to create job step` + +**Solution**: + +* Insufficient resources or a mismatch between `srun` and job allocation. + +--- + +### Exercise 3 Answer + +**Support Ticket Draft**: + +> **Subject**: Exceeded Job Memory Limit β€” Job ID 987654 +> +> **Description**: +> Simulation failed with memory limit error. +> +> **Job Script**: +> +> ```bash +> #SBATCH --mem=8G +> #SBATCH --time=2:00:00 +> ``` +> +> **Error Log**: +> `slurmstepd: error: Exceeded job memory limit` +> +> **Modules Loaded**: +> +> * python/3.8 +> +> **Steps Tried**: +> +> * Reviewed simulation memory needs (\~20GB). + +--- + +## **Recap Quiz** + +**Q1.** +What is the first step you should take when your HPC job fails? + +> **Answer:** C) Read the error message and check logs. + +**Q2.** +Which of the following is *NOT* a good practice for troubleshooting HPC jobs? + +> **Answer:** B) Ignore error logs and focus on the job script. + +**Q3.** +Where can you find official documentation for the Aire HPC system? + +> **Answer:** B) arcdocs + +**Q4.** +When is it appropriate to submit a support ticket? + +> **Answer:** B) After attempting troubleshooting and collecting relevant information. + +**Q5.** +What should a good support ticket *always* include? + +> **Answer:** B) Full job script, error logs, and description of troubleshooting steps. + +--- + +## **Next Steps** + +* Practice troubleshooting job failures. +* Explore arcdocs for documentation. +* Practice drafting clear support tickets. +* Experiment with optimizing resource requests. + +> **Pro Tip**: Systematic troubleshooting and good communication can drastically reduce time to resolve HPC issues. \ No newline at end of file diff --git a/book/scheduling-submission.md b/book/scheduling-submission.md index 018c432..0229b99 100644 --- a/book/scheduling-submission.md +++ b/book/scheduling-submission.md @@ -427,9 +427,9 @@ sbatch array_example.sh ## Further Reading -* [Aire Job Scheduling Guide](https://arcdocs.leeds.ac.uk/aire/usage/job_scheduler.html) +* [Aire Job Scheduling Guide](https://arcdocs.leeds.ac.uk/aire/system/job_scheduler.html) * [Writing Job Scripts](https://arcdocs.leeds.ac.uk/aire/usage/job_type.html) -* [Interactive Jobs](https://arcdocs.leeds.ac.uk/aire/usage/job_type.html#interactive) +* [Job Examples](https://arcdocs.leeds.ac.uk/aire/usage/job_example.html) --- diff --git a/book/wrap-up.md b/book/wrap-up.md index 9a4cc5e..448bc9a 100644 --- a/book/wrap-up.md +++ b/book/wrap-up.md @@ -1,5 +1,183 @@ + # Session 7: Wrap Up -## Recap -## Further Guidance -## Q&A/Discussion \ No newline at end of file +In this final session, we’ll consolidate what you’ve learned throughout the course and point you towards further resources to continue developing your skills with Aire and HPC more generally. + +--- + +## Recap of Key Concepts + +Let’s quickly revisit the main topics covered in this course: + +* **What is HPC?** + + * High Performance Computing allows multiple processors (cores) to work together to solve large problems faster. + * HPC clusters like **Aire** are built from many nodes, each with powerful CPUs (and sometimes GPUs). + * Parallelism is key to exploiting HPC systems effectively. 
+ +* **Logging on and Linux Basics** + + * Access Aire through SSH, either directly from the campus network or via VPN/jumphost if off-campus. + * Linux command-line skills are critical for interacting with the system. + * Key commands: `ls`, `cd`, `pwd`, `cp`, `wget`, `rm`. + +* **Storage on Aire** + + * Understand storage types: **home**, **scratch**, and **shared storage**. + * Efficiently move data and check quotas. + +* **Modules and Software** + + * Software is accessed through **modules**. + * Learned to load, list, swap modules. + * Other strategies: Spack, EasyBuild, self-build, containers. + +* **Job Scheduling and Batch Jobs** + + * **SLURM** scheduler manages all jobs on Aire. + * Wrote batch scripts, submitted jobs, monitored queues. + * Job states: `PENDING`, `RUNNING`, `COMPLETED`. + * Hands-on practice with interactive sessions and task arrays. + +* **Best Practices and Troubleshooting** + + * Diagnosing job errors, optimizing resource requests. + * Start small (test jobs), scale up once validated. + * Using arcdocs, Google, and support tickets for help. + +--- + +## Further Guidance and Next Steps + +### Explore Advanced Topics + +* **Parallel Programming:** MPI, OpenMP, GPU computing. +* **Workflow Management:** Snakemake, Nextflow, job dependencies. +* **Performance Optimization:** Profiling, benchmarking. + +### Recommended Resources + +* [ARC Documentation Portal](https://arcdocs.leeds.ac.uk/aire/welcome.html) +* [HPC Carpentry Lessons](https://carpentries-incubator.github.io/hpc-intro/) +* [High Performance Python Course](https://arc.leeds.ac.uk/courses/swd6-high-performance-python/) + +### Stay Connected + +* [Research Computing Training Calendar](https://arc.leeds.ac.uk/courses/) +* [Submit a Support Ticket](https://it.leeds.ac.uk) if you need assistance. + +--- + +## Final Q\&A / Discussion + +Use this time to: + +* Ask any outstanding questions. +* Share challenges you encountered and solutions you found. +* Request demos or deeper dives into specific topics. +* Discuss how to apply HPC in your research area. + +### Q\&A Prompts + +* *What was the most challenging part ?* +* *What would you like to use Aire for in your workflows?* +* *Is there a particular software package you'd like to or using?* +* *Any lingering questions?* + +--- + +## Bonus Tips for Continuing Success + +* **Start small**: Always begin with a minimal test case to confirm your setup before scaling up. +* **Version control your scripts**: Use Git or another version control system to track your job scripts and code. +* **Resource efficiency**: Request only what you need β€” excessive resource requests can delay your job. +* **Learn by doing**: Schedule time to practice; the more batch scripts you write, the more confident you'll become. +* **Stay informed**: Subscribe to mailing lists or notifications from the HPC service for updates and downtime notices. + +\::: + +--- + +## **Recap Quiz** + +**Q1.** +What is the main advantage of using a HPC system like Aire? + +* A) Larger storage capacity +* B) Faster problem solving through parallel processing +* C) Access to free software licenses +* D) Automatic data backups + +> **Answer:** +> **B) Faster problem solving through parallel processing** + +--- + +**Q2.** +Which command would you use to submit a batch job on Aire? + +* A) `srun` +* B) `ssh` +* C) `sbatch` +* D) `scp` + +> **Answer:** +> **C) sbatch** + +--- + +**Q3.** +Where should large, temporary files for active computations be stored? 
+ +* A) Home directory +* B) Scratch storage +* C) Admin node +* D) Login node + +> **Answer:** +> **B) Scratch storage** + +--- + +**Q4.** +If your job is stuck in `PENDING` state, what might be the cause? + +* A) Syntax error in your script +* B) Not enough available resources +* C) The login node is overloaded +* D) You forgot to save the output + +> **Answer:** +> **B) Not enough available resources** + +--- + +**Q5.** +What’s a good first step if your job fails with an unknown error? + +* A) Immediately resubmit it +* B) Open a support ticket +* C) Check the error log and output files +* D) Assume the cluster is broken + +> **Answer:** +> **C) Check the error log and output files** + +--- + +## Where to Go Next: Mini-Roadmap + +Here’s how you can continue your HPC journey: + +| **Step** | **What to Do** | +| ------------------------- | --------------------------------------------------------------------- | +| **1. Practice** | Regularly submit small jobs, explore module loading, refine scripts. | +| **2. Advanced Courses** | Take courses on parallel programming (MPI/OpenMP) or HPC Python. | +| **3. Optimize** | Learn profiling and benchmarking tools to make your code faster. | +| **4. Real Projects** | Apply your skills to real research problems, scale up carefully. | +| **5. Community** | Join research computing forums, HPC mailing lists, and attend events. | +| **6. Mentorship/Support** | Reach out to HPC support teams early if stuck β€” avoid wasting cycles. | + +\:::{admonition} Final Tip +The best way to learn HPC is by *doing*. Start small, break things, fix them, and gradually scale up your work. Every issue you encounter is a learning opportunity. +\::: From 50d746f7dbb5346a99b7f1cbbcddc29aaef82b76 Mon Sep 17 00:00:00 2001 From: Nick Rhodes Date: Wed, 4 Jun 2025 07:35:20 +0100 Subject: [PATCH 4/5] Fixes to Wrap Up --- book/wrap-up.md | 23 +++++++---------------- 1 file changed, 7 insertions(+), 16 deletions(-) diff --git a/book/wrap-up.md b/book/wrap-up.md index 448bc9a..ea9eeb5 100644 --- a/book/wrap-up.md +++ b/book/wrap-up.md @@ -55,16 +55,11 @@ Let’s quickly revisit the main topics covered in this course: * **Workflow Management:** Snakemake, Nextflow, job dependencies. * **Performance Optimization:** Profiling, benchmarking. -### Recommended Resources +### Useful links * [ARC Documentation Portal](https://arcdocs.leeds.ac.uk/aire/welcome.html) -* [HPC Carpentry Lessons](https://carpentries-incubator.github.io/hpc-intro/) -* [High Performance Python Course](https://arc.leeds.ac.uk/courses/swd6-high-performance-python/) - -### Stay Connected - * [Research Computing Training Calendar](https://arc.leeds.ac.uk/courses/) -* [Submit a Support Ticket](https://it.leeds.ac.uk) if you need assistance. +* [Submit a Support Ticket](https://it.leeds.ac.uk/it?id=sc_cat_item&sys_id=7587b2530f675f00a82247ece1050eda) if you need assistance. --- @@ -86,7 +81,7 @@ Use this time to: --- -## Bonus Tips for Continuing Success +## Tips * **Start small**: Always begin with a minimal test case to confirm your setup before scaling up. * **Version control your scripts**: Use Git or another version control system to track your job scripts and code. @@ -94,8 +89,6 @@ Use this time to: * **Learn by doing**: Schedule time to practice; the more batch scripts you write, the more confident you'll become. * **Stay informed**: Subscribe to mailing lists or notifications from the HPC service for updates and downtime notices. 
-\::: - --- ## **Recap Quiz** @@ -172,12 +165,10 @@ Here’s how you can continue your HPC journey: | **Step** | **What to Do** | | ------------------------- | --------------------------------------------------------------------- | | **1. Practice** | Regularly submit small jobs, explore module loading, refine scripts. | -| **2. Advanced Courses** | Take courses on parallel programming (MPI/OpenMP) or HPC Python. | -| **3. Optimize** | Learn profiling and benchmarking tools to make your code faster. | +| **2. More Courses** | Take courses on Git, R, Python, HPC2. | | **4. Real Projects** | Apply your skills to real research problems, scale up carefully. | -| **5. Community** | Join research computing forums, HPC mailing lists, and attend events. | -| **6. Mentorship/Support** | Reach out to HPC support teams early if stuck β€” avoid wasting cycles. | +| **5. Community** | Join research computing on MS Teams, attend events. | +| **6. Mentorship/Support** | Reach out to HPC support teams EARLY if stuck β€” avoid wasting cycles. | -\:::{admonition} Final Tip +## Final Tip The best way to learn HPC is by *doing*. Start small, break things, fix them, and gradually scale up your work. Every issue you encounter is a learning opportunity. -\::: From fef27fcdc5ad6044ef8f99f9d76060913a6fdcad Mon Sep 17 00:00:00 2001 From: Nick Rhodes Date: Wed, 4 Jun 2025 08:01:10 +0100 Subject: [PATCH 5/5] Fixes to Troubleshooting --- book/best-practices-troubleshooting.md | 135 +------------------------ 1 file changed, 5 insertions(+), 130 deletions(-) diff --git a/book/best-practices-troubleshooting.md b/book/best-practices-troubleshooting.md index 2dd3463..451ef5b 100644 --- a/book/best-practices-troubleshooting.md +++ b/book/best-practices-troubleshooting.md @@ -75,7 +75,6 @@ A methodical approach helps: * `squeue` β€” check job status. * `scontrol show job ` β€” detailed job info. * `sacct` β€” view accounting data after job completes. - * `seff ` β€” summarize efficiency (CPU and memory). 5. **Resource Requests**: @@ -95,13 +94,11 @@ A methodical approach helps: ## **Bonus Tips** -\::::{admonition} πŸ” Troubleshooting Tips +### πŸ” Troubleshooting Tips * **Always check the `.err` file** first β€” many runtime errors are logged there. -* **Use `seff `** after job completion to quickly check if memory and CPUs were used efficiently. * **Google Smartly**: Put error messages in **quotes** to search exact phrases. * **Save working job scripts** β€” version control isn't just for code. - \:::: --- @@ -131,7 +128,7 @@ srun: error: node1234: task 0: Exited with exit code 1 * **arcdocs**: - * [Aire HPC Documentation](https://arcdocs.leeds.ac.uk/) + * [Aire HPC Documentation](https://arcdocs.leeds.ac.uk/aire) * Search with clear keywords (e.g., "job submission error", "SLURM memory limit"). * **Google**: @@ -148,10 +145,7 @@ srun: error: node1234: task 0: Exited with exit code 1 ### **Submitting a Support Ticket** -Only escalate if: - -* You have attempted basic troubleshooting. -* The issue persists and blocks your work. +* Only escalate if you have attempted basic troubleshooting. **How to Write a Good Ticket**: @@ -188,120 +182,6 @@ Only escalate if: > * Reduced number of cores. > * Increased memory request to 160GB (still fails). 
---- - -## **Exercises** - -### Exercise 1: Diagnose a Failed Job - -You submitted: - -```bash -#!/bin/bash -#SBATCH --job-name=test_fail -#SBATCH --time=00:30:00 -#SBATCH --mem=2G -#SBATCH --cpus-per-task=4 -#SBATCH --output=test_output.out -#SBATCH --error=test_error.err - -module load python -python big_simulation.py -``` - -Error file contains: - -``` -ModuleNotFoundError: No module named 'numpy' -``` - -**Task**: Identify the issue and suggest a fix. - ---- - -### Exercise 2: Find the Right Documentation - -Error: - -``` -srun: error: Unable to create job step -``` - -**Task**: Use arcdocs or Google to find possible causes and solutions. - ---- - -### Exercise 3: Draft a Support Ticket - -Error: -`slurmstepd: error: Exceeded job memory limit` -Job requested 8GB memory. Simulation needs \~20GB. - -**Task**: Draft a support ticket reporting the issue. - ---- - -## **Answers / Expected Outputs** - -### Exercise 1 Answer - -**Issue**: - -* The Python environment lacks the `numpy` module. - -**Fix**: - -* Load a Python module with `numpy` or install it. - -Example: - -```bash -module load python/3.10 -pip install --user numpy -``` - ---- - -### Exercise 2 Answer - -**Search Terms**: - -> `SLURM srun Unable to create job step` - -**Solution**: - -* Insufficient resources or a mismatch between `srun` and job allocation. - ---- - -### Exercise 3 Answer - -**Support Ticket Draft**: - -> **Subject**: Exceeded Job Memory Limit β€” Job ID 987654 -> -> **Description**: -> Simulation failed with memory limit error. -> -> **Job Script**: -> -> ```bash -> #SBATCH --mem=8G -> #SBATCH --time=2:00:00 -> ``` -> -> **Error Log**: -> `slurmstepd: error: Exceeded job memory limit` -> -> **Modules Loaded**: -> -> * python/3.8 -> -> **Steps Tried**: -> -> * Reviewed simulation memory needs (\~20GB). - ---- ## **Recap Quiz** @@ -311,21 +191,16 @@ What is the first step you should take when your HPC job fails? > **Answer:** C) Read the error message and check logs. **Q2.** -Which of the following is *NOT* a good practice for troubleshooting HPC jobs? - -> **Answer:** B) Ignore error logs and focus on the job script. - -**Q3.** Where can you find official documentation for the Aire HPC system? > **Answer:** B) arcdocs -**Q4.** +**Q3.** When is it appropriate to submit a support ticket? > **Answer:** B) After attempting troubleshooting and collecting relevant information. -**Q5.** +**Q4.** What should a good support ticket *always* include? > **Answer:** B) Full job script, error logs, and description of troubleshooting steps.