diff --git a/book/best-practices-troubleshooting.md b/book/best-practices-troubleshooting.md
index a48f591..451ef5b 100644
--- a/book/best-practices-troubleshooting.md
+++ b/book/best-practices-troubleshooting.md
@@ -1,10 +1,217 @@
-# Session 6: HPC Best Practices & Troubleshooting
+# **Session 6: HPC Best Practices & Troubleshooting**
+
+## **Learning Outcomes**
-## Common issues with job submission and system usage
+
+By the end of this session, you will be able to:
-## Strategies for error diagnosis and resource optimization
+
+* Recognize common job submission and system usage issues.
+* Apply systematic troubleshooting techniques.
+* Optimize resource usage and avoid common pitfalls.
+* Navigate documentation effectively (arcdocs, Google).
+* Submit an effective support ticket when needed.
-## Guidance and Support
-### arcdocs (and google)
-### submit a ticket
\ No newline at end of file
+
+---
+
+## **Background / Introduction**
+
+High Performance Computing (HPC) systems are powerful but complex environments. Errors and failures are common, especially for new users. Learning how to troubleshoot systematically saves time, reduces frustration, and improves your productivity.
+
+You are encouraged to resolve problems independently before seeking help. This session provides practical tools and workflows for diagnosing issues, and guidance on when and how to escalate.
+
+---
+
+## **Common Issues with Job Submission and System Usage**
+
+* **Job stuck in PENDING**:
+
+  * Insufficient resources available.
+  * Requested rare resources (e.g., GPUs, high-memory nodes).
+  * User priority is low (fair-share scheduling).
+
+* **Job fails immediately**:
+
+  * Syntax error in the script.
+  * Incorrect module loaded or missing environment setup.
+  * File paths incorrect or inaccessible.
+
+* **Job exceeds time/memory limits**:
+
+  * Underestimated resource requirements.
+  * Infinite loops or runaway jobs.
+
+* **Storage problems**:
+
+  * Disk quota exceeded.
+  * Writing to non-scratch areas with insufficient space.
+
+* **Permission errors**:
+
+  * Incorrect file/directory permissions.
+  * Attempt to write to read-only directories.
+
+---
+
+## **Strategies for Error Diagnosis and Resource Optimization**
+
+A methodical approach helps:
+
+1. **Read the Error Message**:
+
+   * Your first line of defense.
+   * Look for keywords like `Segmentation fault`, `Permission denied`, `Out of memory`.
+
+2. **Check Output and Error Logs**:
+
+   * Look at the `.out` and `.err` files created by SLURM.
+   * Search for unusual termination messages.
+
+3. **Validate the Job Script**:
+
+   * Are the directives correct (`#SBATCH`)?
+   * Are the paths to modules and data files correct?
+
+4. **Use Monitoring Tools**:
+
+   * `squeue` — check job status.
+   * `scontrol show job <job_id>` — detailed job info.
+   * `sacct` — view accounting data after a job completes.
+
+5. **Resource Requests**:
+
+   * Adjust memory (`--mem`), CPUs (`--cpus-per-task`), or time (`--time`).
+   * Test with smaller jobs first.
+
+6. **Minimal Reproducible Example**:
+
+   * Simplify the problem.
+   * Remove unnecessary steps to isolate the cause.
+
+7. **Restart Strategy**:
+
+   * If a job crashes, ensure it can restart from checkpoints.
+
+---
+
+## **Bonus Tips**
+
+### 🔍 Troubleshooting Tips
+
+* **Always check the `.err` file** first — many runtime errors are logged there.
+* **Google Smartly**: Put error messages in **quotes** to search for exact phrases.
+* **Save working job scripts** — version control isn't just for code.
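+
+Building on the monitoring tools above, here is a minimal sketch of how you might check whether a finished job hit its memory or time limit. The job ID `123456` is a placeholder, and the exact columns available can vary with the SLURM version and site configuration.
+
+```bash
+# Summarise a finished job: state, exit code, runtime, requested memory and peak usage
+sacct -j 123456 --units=G \
+  --format=JobID,JobName,State,ExitCode,Elapsed,ReqMem,MaxRSS
+
+# If the optional 'seff' helper is installed, it prints a one-screen efficiency report
+seff 123456
+```
+
+A `State` of `OUT_OF_MEMORY` or `TIMEOUT`, or a `MaxRSS` close to `ReqMem`, usually means the resource request needs adjusting rather than the code.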
+
+---
+
+## **Sample Error Log Snippet**
+
+A typical `.err` file:
+
+```
+Traceback (most recent call last):
+  File "big_simulation.py", line 42, in <module>
+    import numpy
+ModuleNotFoundError: No module named 'numpy'
+
+srun: error: node1234: task 0: Exited with exit code 1
+```
+
+**Interpretation**:
+
+* Top error: Python cannot find the `numpy` module — this indicates a missing environment/module.
+* Bottom error: SLURM reports that task 0 exited abnormally.
+
+---
+
+## **Guidance and Support**
+
+### **Using arcdocs and Google Effectively**
+
+* **arcdocs**:
+
+  * [Aire HPC Documentation](https://arcdocs.leeds.ac.uk/aire)
+  * Search with clear keywords (e.g., "job submission error", "SLURM memory limit").
+
+* **Google**:
+
+  * Copy error messages *verbatim* into the search box.
+  * Use quotes for exact matches.
+
+**Example**:
+
+> Error: `srun: error: Unable to allocate resources: Requested node configuration is not available`
+> Google: `"srun: error: Unable to allocate resources: Requested node configuration is not available" HPC SLURM`
+
+---
+
+### **Submitting a Support Ticket**
+
+* Only escalate if you have attempted basic troubleshooting.
+
+**How to Write a Good Ticket**:
+
+1. **Clear Description**.
+2. **Job Details** — job script, output/error logs.
+3. **Environment Info** — loaded modules, software versions.
+4. **What You've Tried**.
+
+**Example**:
+
+> **Subject**: Job Failing with Out of Memory — Aire HPC
+>
+> **Description**:
+> I'm submitting a job with 32 cores and 128 GB memory. It fails with an "oom-killer" message after ~1 hour.
+>
+> **Script**:
+>
+> ```bash
+> #SBATCH --cpus-per-task=32
+> #SBATCH --mem=128G
+> #SBATCH --time=2:00:00
+> ```
+>
+> **Modules Loaded**:
+>
+> * python/3.8
+> * mpi/openmpi-4.1
+>
+> **Error Logs**:
+> `Out of memory: Kill process 12345 (python) score 1234 or sacrifice child`
+>
+> **Steps Tried**:
+>
+> * Reduced the number of cores.
+> * Increased the memory request to 160GB (still fails).
+
+## **Recap Quiz**
+
+**Q1.**
+What is the first step you should take when your HPC job fails?
+
+> **Answer:** Read the error message and check the logs.
+
+**Q2.**
+Where can you find official documentation for the Aire HPC system?
+
+> **Answer:** arcdocs.
+
+**Q3.**
+When is it appropriate to submit a support ticket?
+
+> **Answer:** After attempting troubleshooting and collecting the relevant information.
+
+**Q4.**
+What should a good support ticket *always* include?
+
+> **Answer:** The full job script, error logs, and a description of the troubleshooting steps taken.
+
+---
+
+## **Next Steps**
+
+* Practice troubleshooting job failures.
+* Explore arcdocs for documentation.
+* Practice drafting clear support tickets.
+* Experiment with optimizing resource requests.
+
+> **Pro Tip**: Systematic troubleshooting and good communication can drastically reduce the time it takes to resolve HPC issues.
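+
+As a practical follow-on, the sketch below collects most of the evidence a good ticket needs in one place. The job ID and file name are placeholders — substitute your own, and adjust the paths to match your job script.
+
+```bash
+# Software environment you submitted with
+module list
+
+# Accounting summary for the failed job (placeholder job ID)
+sacct -j 123456 --format=JobID,State,ExitCode,Elapsed,ReqMem,MaxRSS
+
+# The last lines of the error log are usually the most informative part
+tail -n 20 my_error.err
+```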
\ No newline at end of file
diff --git a/book/scheduling-submission.md b/book/scheduling-submission.md
index 2abb4aa..0229b99 100644
--- a/book/scheduling-submission.md
+++ b/book/scheduling-submission.md
@@ -1,29 +1,509 @@
-# Session 5: Job Scheduling & Submission
+# Session 5: Introduction to Job Scheduling and Batch Jobs
-## Overview of HPC job scheduling systems
-### General background
-### Slurm
-## Job Scripts
-### Structure of job scripts
-### resource requests
-### submission command
+
+In this session, you will learn:
+
+* What a job scheduler is and why it is used
+* How to write and submit batch job scripts
+* How to monitor, manage, and cancel jobs
+* How to use modules to set up software environments
+* How to request high memory and GPU resources
+* (Optional) How to submit task arrays
-## exercise
+
+---
-- What is a scheduler
-- Fair use
-- Again, divided up into mini tabset "presentations"
+
+## Background: What is a Job Scheduler?
-- https://cautious-happiness-oz2284n.pages.github.io/#/basic-slurm-configuration
-- https://cautious-happiness-oz2284n.pages.github.io/#/basic-slurm-job-scripts
-- https://arcdocs.leeds.ac.uk/aire/usage/job_type.html
-- https://arcdocs.leeds.ac.uk/aire/usage/job_example.html
+
+High Performance Computing (HPC) systems are shared by many users, each submitting their own jobs — code they want to run using the cluster's compute power.
+
+A **job scheduler** is the system that:
+
+- Organizes when and where jobs run
+- Allocates the requested resources (CPU cores, memory, GPUs)
+- Ensures fair access to shared resources for all users
+
+Schedulers make decisions based on:
+
+- What resources your job requests (e.g., how many cores, how much memory)
+- How long your job will run (your time limit)
+- Current system load
+- Fair-share policies (giving all users fair access over time)
-## Practical - write and submit a simple job script, monitor its progress, and learn to interpret feedback
+
+**Without a scheduler**, users would have to manually coordinate access to thousands of CPUs — impractical and chaotic.
-- Create/share example job and job script
-https://arctraining.github.io/rc-slides/hpc1.html#/submit-a-serial-python-job
+
+---
+
+## SLURM: The Scheduler on Aire
+
+At Leeds, the **SLURM** scheduler (Simple Linux Utility for Resource Management) manages all jobs on the Aire cluster.
+
+When you submit a job:
+
+1. You describe what you need (e.g., CPUs, memory, time) in a **job script**.
+2. You submit the job to SLURM with `sbatch`.
+3. SLURM places your job in a **queue**.
+4. When enough resources are available and your job's priority is high enough, SLURM starts the job on suitable **compute nodes**.
+
+---
+
+### How Jobs Flow Through the System
+
+```plaintext
+Your Job Script
+      │
+      ▼
+ SLURM Scheduler
+      │
+      ├── Queues Jobs
+      ├── Prioritizes Jobs
+      ├── Allocates Resources
+      ▼
+ Compute Nodes (Run the job)
+      │
+      ▼
+ Output Files
+```
+
+---
+
+### Common Job States
+
+| **State**   | **Meaning**                                |
+| ----------- | ------------------------------------------ |
+| `PENDING`   | Job is waiting for resources               |
+| `RUNNING`   | Job is actively running on compute nodes   |
+| `COMPLETED` | Job finished successfully                  |
+| `FAILED`    | Job failed (e.g., errors, exceeded limits) |
+| `CANCELLED` | Job was manually stopped (e.g., by user)   |
+
+You can monitor job states with the `squeue` command.
+
+---
+
+### Why Do Jobs Wait?
+
+Not every job runs immediately.
+Reasons include:
+
+* Not enough CPUs/memory free
+* Higher-priority jobs ahead of yours
+* Fair-share adjustment (users who have used less recently get higher priority)
+* Your job is requesting rare resources (e.g., GPU nodes)
+
+---
+
+### How You Interact with the Scheduler
+
+| **Action**     | **Command**            |
+| -------------- | ---------------------- |
+| Submit a job   | `sbatch my_job.sh`     |
+| View your jobs | `squeue -u <username>` |
+| Cancel a job   | `scancel <job_id>`     |
+
+You will learn these commands and write your first job script in the next sections.
+
+---
+
+## Why Use Batch Jobs?
+
+Batch jobs allow you to:
+
+* Set up your work once
+* Submit it to the scheduler
+* Log out and let it run unattended
+* Automatically capture outputs and errors into files
+
+This is essential for longer jobs that run for hours or days — you don't need to stay logged in.
+
+---
+
+## Summary
+
+* A job scheduler manages who runs jobs, and when, on an HPC cluster.
+* SLURM is the scheduler used on Aire.
+* You write a job script to describe what you need.
+* SLURM queues, prioritizes, and runs your job on available compute nodes.
+* You interact with SLURM via simple commands like `sbatch`, `squeue`, and `scancel`.
+
+---
+
+# Hands-On Practical
+
+---
+
+## Hands-On: Write and Submit a Simple Job
+
+**Exercise:**
+Write a batch script requesting:
+
+* 2 CPUs
+* 4 GB memory
+* 30 minutes runtime
+* Load the Python module
+* Run a simple command like `hostname`
+
+Submit the script.
+
+**Answer:**
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=simple_job
+#SBATCH --time=00:30:00
+#SBATCH --mem=4G
+#SBATCH --cpus-per-task=2
+#SBATCH --output=simple_output_%j.out
+#SBATCH --error=simple_error_%j.err
+
+module load python
+
+hostname
+```
+
+Submit using:
+
+```bash
+sbatch simple_job.sh
+```
+
+**Expected Output:**
+
+* You will receive a job submission message in the terminal like:
+
+  ```
+  Submitted batch job 123456
+  ```
+
+* After the job completes, check the output file `simple_output_123456.out` in your current working directory.
+
+* **Contents of Output File:**
+
+  ```
+  nodeXYZ.arc.leeds.ac.uk
+  ```
+
+  (Your job ran `hostname`, so you get the compute node name.)
+
+---
+
+## Monitoring Jobs
+
+Check job status:
+
+```bash
+squeue -u <username>
+```
+
+Cancel a job:
+
+```bash
+scancel <job_id>
+```
+
+**Exercise:**
+Submit a job and use `squeue` to monitor it. Cancel the job once it starts running.
+
+**Answer:**
+
+1. Submit the job: `sbatch simple_job.sh`
+2. Monitor the job: `squeue -u <username>`
+3. Cancel the job: `scancel <job_id>`
+
+**Expected Output:**
+
+* `squeue` shows your job in the queue:
+
+  ```
+  JOBID  PARTITION  NAME        USER   ST  TIME  NODES  NODELIST(REASON)
+  123456 general    simple_job  user1  PD  0:00  1      (Priority)
+  ```
+
+  (Status `PD` means Pending; `R` means Running.)
+
+* After `scancel`, the job disappears from the `squeue` list.
+
+---
+
+## Interactive Jobs
+
+Interactive jobs are useful for quick testing or debugging:
+
+```bash
+srun --pty --time=01:00:00 bash
+```
+
+**Exercise:**
+
+1. Start an interactive session.
+2. Load the Python module.
+3. Run `hostname`.
+4. Exit the session.
+
+**Answer:**
+
+```bash
+srun --pty --time=01:00:00 bash   # starts a shell on a compute node
+module load python                # run inside the interactive session
+hostname
+exit
+```
+
+**Expected Output:**
+
+* The terminal prompt will change — you'll have a shell on a compute node.
+
+* After running `hostname`, you'll see the node name:
+
+  ```
+  nodeXYZ.arc.leeds.ac.uk
+  ```
+
+* `exit` returns you to the login node.
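+
+The interactive example above only sets a time limit, so the session receives default values for everything else. A rough sketch of requesting explicit resources interactively (assuming the default partition and account are appropriate for your work):
+
+```bash
+# Interactive shell with 2 CPUs and 4 GB of memory, for up to one hour
+srun --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash
+
+# Once the prompt appears you are on a compute node
+module load python
+python --version   # quick sanity check that the environment works
+exit               # return to the login node
+```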
+
+---
+
+## Output and Error Files
+
+SLURM creates:
+
+* `slurm-<job_id>.out` — standard output
+* `slurm-<job_id>.err` — standard error (if specified separately)
+
+Control the output filenames with:
+
+```bash
+#SBATCH --output=/path/to/output_file.out
+#SBATCH --error=/path/to/error_file.err
+```
+
+**Exercise:**
+Modify your script to redirect output and error to specific files.
+
+**Answer:**
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=simple_job
+#SBATCH --time=00:30:00
+#SBATCH --mem=4G
+#SBATCH --cpus-per-task=2
+#SBATCH --output=/home/<username>/my_output.out
+#SBATCH --error=/home/<username>/my_error.err
+
+module load python
+
+hostname
+```
+
+**Expected Output:**
+
+* After job completion, you will find:
+
+  * `/home/<username>/my_output.out` — contains the node name from `hostname`.
+  * `/home/<username>/my_error.err` — should be empty if there were no errors.
+
+---
+
+## High Memory and GPU Requests
+
+For high-memory jobs:
+
+```bash
+#SBATCH --mem=256G
+```
+
+For GPU jobs:
+
+```bash
+#SBATCH --gres=gpu:1
+```
+
+**Exercise:**
+Update your batch script to request:
+
+* 256 GB memory
+* 1 GPU
+
+**Answer:**
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=highmem_gpu_job
+#SBATCH --time=00:30:00
+#SBATCH --mem=256G
+#SBATCH --cpus-per-task=2
+#SBATCH --gres=gpu:1
+#SBATCH --output=highmem_output_%j.out
+#SBATCH --error=highmem_error_%j.err
+
+module load python
+
+hostname
+```
+
+Submit using:
+
+```bash
+sbatch highmem_gpu_job.sh
+```
+
+**Expected Output:**
+
+* Terminal submission message:
+
+  ```
+  Submitted batch job 123457
+  ```
+
+* Output file `highmem_output_123457.out` will contain:
+
+  ```
+  nodeXYZ.arc.leeds.ac.uk
+  ```
+
+  (The node assigned may be a GPU node.)
+
+---
+
+## (Optional) Task Arrays
+
+Task arrays allow you to submit multiple similar jobs efficiently.
+
+Example script:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=array_job
+#SBATCH --array=1-3
+#SBATCH --output=array_output_%A_%a.out
+
+module load python
+
+python my_script.py $SLURM_ARRAY_TASK_ID
+```
+
+Submit:
+
+```bash
+sbatch array_script.sh
+```
+
+**Exercise (Optional):**
+Write a batch script to submit a task array with 3 tasks, each printing its task ID.
+
+**Answer:**
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=array_example
+#SBATCH --array=1-3
+#SBATCH --output=array_%A_%a.out
+
+module load python
+
+echo "Task ID: $SLURM_ARRAY_TASK_ID"
+hostname
+```
+
+Submit using:
+
+```bash
+sbatch array_example.sh
+```
+
+**Expected Output:**
+
+* Three output files:
+
+  ```
+  array_<job_id>_1.out
+  array_<job_id>_2.out
+  array_<job_id>_3.out
+  ```
+
+* Each file contains:
+
+  ```
+  Task ID: 1
+  nodeXYZ.arc.leeds.ac.uk
+  ```
+
+  (or `2`, `3` — depending on the task.)
+
+---
+
+## Further Reading
+
+* [Aire Job Scheduling Guide](https://arcdocs.leeds.ac.uk/aire/system/job_scheduler.html)
+* [Writing Job Scripts](https://arcdocs.leeds.ac.uk/aire/usage/job_type.html)
+* [Job Examples](https://arcdocs.leeds.ac.uk/aire/usage/job_example.html)
+
+---
+
+**Next Steps:**
+
+* Practice writing and submitting simple job scripts.
+* Experiment with resource requests.
+* Explore more advanced SLURM features as needed.
+
+---
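+
+Before scaling a GPU job up, it can be worth a short test that the requested GPU is actually visible to your job. This is a minimal sketch: it assumes the GPU nodes provide NVIDIA's `nvidia-smi` tool and that SLURM exports `CUDA_VISIBLE_DEVICES`, which may differ on your system.
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=gpu_check
+#SBATCH --time=00:05:00
+#SBATCH --gres=gpu:1
+#SBATCH --output=gpu_check_%j.out
+
+# Which device(s) did SLURM allocate? (set by SLURM's GPU support on most systems)
+echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
+
+# List the allocated GPU(s), assuming NVIDIA drivers are installed on the node
+nvidia-smi
+```
+
+If `nvidia-smi` reports no devices, the resource request (or the partition the job landed on) is the first thing to check.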
+
+## **Recap Quiz**
+
+**Q1.**
+What is the purpose of a job scheduler on an HPC system?
+
+* A) To speed up internet connections
+* B) To manually assign jobs to users
+* C) To allocate compute resources and manage job queues
+* D) To monitor user emails
+
+> **Answer:**
+> **C) To allocate compute resources and manage job queues**
+
+---
+
+**Q2.**
+Which scheduler is used on the Aire HPC system?
+
+* A) PBS
+* B) SLURM
+* C) LSF
+* D) Grid Engine
+
+> **Answer:**
+> **B) SLURM**
+
+---
+
+**Q3.**
+What does the job state `PENDING` mean?
+
+* A) The job is actively running on a node
+* B) The job is waiting for available resources
+* C) The job has completed successfully
+* D) The job was cancelled by the user
+
+> **Answer:**
+> **B) The job is waiting for available resources**
+
+---
+
+**Q4.**
+Which command would you use to submit a job script to SLURM?
+
+* A) `srun`
+* B) `squeue`
+* C) `sbatch`
+* D) `scancel`
+
+> **Answer:**
+> **C) `sbatch`**
+
+---
+
+**Q5.**
+Why is it important to set a time limit (`--time`) in your job script?
+
+* A) It makes the job run faster
+* B) It helps SLURM schedule jobs more efficiently
+* C) It helps avoid the job being cancelled for overrunning its allocation
+* D) Both B and C
+
+> **Answer:**
+> **D) Both B and C**
+
+---
diff --git a/book/wrap-up.md b/book/wrap-up.md
index 9a4cc5e..ea9eeb5 100644
--- a/book/wrap-up.md
+++ b/book/wrap-up.md
@@ -1,5 +1,174 @@
+
 # Session 7: Wrap Up
-## Recap
-## Further Guidance
-## Q&A/Discussion
\ No newline at end of file
+
+In this final session, we'll consolidate what you've learned throughout the course and point you towards further resources to continue developing your skills with Aire and HPC more generally.
+
+---
+
+## Recap of Key Concepts
+
+Let's quickly revisit the main topics covered in this course:
+
+* **What is HPC?**
+
+  * High Performance Computing allows multiple processors (cores) to work together to solve large problems faster.
+  * HPC clusters like **Aire** are built from many nodes, each with powerful CPUs (and sometimes GPUs).
+  * Parallelism is key to exploiting HPC systems effectively.
+
+* **Logging on and Linux Basics**
+
+  * Access Aire through SSH, either directly from the campus network or via VPN/jumphost if off-campus.
+  * Linux command-line skills are critical for interacting with the system.
+  * Key commands: `ls`, `cd`, `pwd`, `cp`, `wget`, `rm`.
+
+* **Storage on Aire**
+
+  * Understand the storage types: **home**, **scratch**, and **shared storage**.
+  * Move data efficiently and check quotas.
+
+* **Modules and Software**
+
+  * Software is accessed through **modules**.
+  * You learned to load, list, and swap modules.
+  * Other strategies: Spack, EasyBuild, self-build, containers.
+
+* **Job Scheduling and Batch Jobs**
+
+  * The **SLURM** scheduler manages all jobs on Aire.
+  * You wrote batch scripts, submitted jobs, and monitored queues.
+  * Job states: `PENDING`, `RUNNING`, `COMPLETED`.
+  * Hands-on practice with interactive sessions and task arrays.
+
+* **Best Practices and Troubleshooting**
+
+  * Diagnosing job errors and optimizing resource requests.
+  * Start small (test jobs), scale up once validated.
+  * Using arcdocs, Google, and support tickets for help.
+
+---
+
+## Further Guidance and Next Steps
+
+### Explore Advanced Topics
+
+* **Parallel Programming:** MPI, OpenMP, GPU computing.
+* **Workflow Management:** Snakemake, Nextflow, job dependencies.
+* **Performance Optimization:** Profiling, benchmarking.
+
+### Useful Links
+
+* [ARC Documentation Portal](https://arcdocs.leeds.ac.uk/aire/welcome.html)
+* [Research Computing Training Calendar](https://arc.leeds.ac.uk/courses/)
+* [Submit a Support Ticket](https://it.leeds.ac.uk/it?id=sc_cat_item&sys_id=7587b2530f675f00a82247ece1050eda) if you need assistance.
+
+---
+
+## Final Q&A / Discussion
+
+Use this time to:
+
+* Ask any outstanding questions.
+* Share challenges you encountered and solutions you found.
+* Request demos or deeper dives into specific topics.
+* Discuss how to apply HPC in your research area.
+
+### Q&A Prompts
+
+* *What was the most challenging part of the course?*
+* *What would you like to use Aire for in your workflows?*
+* *Is there a particular software package you'd like to try, or are already using?*
+* *Any lingering questions?*
+
+---
+
+## Tips
+
+* **Start small**: Always begin with a minimal test case to confirm your setup before scaling up.
+* **Version control your scripts**: Use Git or another version control system to track your job scripts and code.
+* **Resource efficiency**: Request only what you need — excessive resource requests can delay your job.
+* **Learn by doing**: Schedule time to practice; the more batch scripts you write, the more confident you'll become.
+* **Stay informed**: Subscribe to mailing lists or notifications from the HPC service for updates and downtime notices.
+
+---
+
+## **Recap Quiz**
+
+**Q1.**
+What is the main advantage of using an HPC system like Aire?
+
+* A) Larger storage capacity
+* B) Faster problem solving through parallel processing
+* C) Access to free software licenses
+* D) Automatic data backups
+
+> **Answer:**
+> **B) Faster problem solving through parallel processing**
+
+---
+
+**Q2.**
+Which command would you use to submit a batch job on Aire?
+
+* A) `srun`
+* B) `ssh`
+* C) `sbatch`
+* D) `scp`
+
+> **Answer:**
+> **C) `sbatch`**
+
+---
+
+**Q3.**
+Where should large, temporary files for active computations be stored?
+
+* A) Home directory
+* B) Scratch storage
+* C) Admin node
+* D) Login node
+
+> **Answer:**
+> **B) Scratch storage**
+
+---
+
+**Q4.**
+If your job is stuck in the `PENDING` state, what might be the cause?
+
+* A) Syntax error in your script
+* B) Not enough available resources
+* C) The login node is overloaded
+* D) You forgot to save the output
+
+> **Answer:**
+> **B) Not enough available resources**
+
+---
+
+**Q5.**
+What's a good first step if your job fails with an unknown error?
+
+* A) Immediately resubmit it
+* B) Open a support ticket
+* C) Check the error log and output files
+* D) Assume the cluster is broken
+
+> **Answer:**
+> **C) Check the error log and output files**
+
+---
+
+## Where to Go Next: Mini-Roadmap
+
+Here's how you can continue your HPC journey:
+
+| **Step**                  | **What to Do**                                                           |
+| ------------------------- | ------------------------------------------------------------------------ |
+| **1. Practice**           | Regularly submit small jobs, explore module loading, refine scripts.     |
+| **2. More Courses**       | Take courses on Git, R, Python, HPC2.                                    |
+| **3. Real Projects**      | Apply your skills to real research problems, scale up carefully.         |
+| **4. Community**          | Join the research computing community on MS Teams, attend events.        |
+| **5. Mentorship/Support** | Reach out to the HPC support teams EARLY if stuck — avoid wasting cycles. |
+
+## Final Tip
+
+The best way to learn HPC is by *doing*. Start small, break things, fix them, and gradually scale up your work. Every issue you encounter is a learning opportunity.