'
-```
-
-Example:
-
-```sh
-python hello_world.py --rp_server_api --test_input '{"input": {"name": "Runpod"}}'
-```
-
-You can combine these arguments to create a highly customized local testing environment. Here's an example that uses multiple options:
-
-```sh
-python hello_world.py --rp_server_api --rp_log_level DEBUG --rp_debugger --rp_api_port 8080 --rp_api_concurrency 2 --test_input '{"input": {"name": "Advanced Tester"}}'
-```
-
-This command:
-
-1. Starts the local server
-2. Sets the log level to DEBUG for maximum information
-3. Enables the debugger
-4. Uses port 8080 for the API server
-5. Sets up 2 concurrent workers
-6. Provides a test input directly in the command
-
-## Conclusion
-
-These advanced options for local testing with the Runpod Python SDK give you fine-grained control over your development environment. By mastering these tools, you can ensure your serverless functions are robust and ready for deployment to the Runpod cloud.
-
-In the next lesson, we'll explore how to structure more complex handlers to tackle advanced use cases in your serverless applications.
diff --git a/serverless/development/dual-mode-worker.mdx b/serverless/development/dual-mode-worker.mdx
index a3c201a3..ddb54a95 100644
--- a/serverless/development/dual-mode-worker.mdx
+++ b/serverless/development/dual-mode-worker.mdx
@@ -1,53 +1,35 @@
---
-title: "Build a dual-mode Serverless worker"
-sidebarTitle: "Build a dual-mode worker"
-description: "Create a flexible Serverless worker that supports a Pod-first development workflow."
+title: "Pod-first development"
+description: "Develop on a Pod before deploying to Serverless for faster iteration."
---
-Developing machine learning and AI applications often requires powerful GPUs, making local development of API endpoints challenging. A typical development workflow for [Serverless](/serverless/overview) would be to write your handler code, deploy it directly to a Serverless endpoint, send endpoint requests to test, debug using worker logs, and repeat.
+Developing machine learning applications often requires powerful GPUs, making local development challenging. Instead of repeatedly deploying to Serverless for testing, you can develop on a Pod first and then deploy the same Docker image to Serverless when ready.
-This can have signifcant drawbacks, such as:
-
-* **Slow iteration**: Each deployment requires a new build and test cycle, which can be time-consuming.
-* **Limited visibility**: Logs and errors are not always easy to debug, especially when running in a remote environment.
-* **Resource constraints**: Your local machine may not have the necessary resources to test your application.
-
-This tutorial shows how to build a "Pod-first" development environment: creating a flexible, dual-mode Docker image that can be deployed as either a Pod or a Serverless worker.
-
-Using this method, you'll leverage a [Pod](/pods/overview)—a GPU instance ideal for interactive development, with tools like Jupyter Notebooks and direct IDE integration—as your cloud-based development machine. The Pod will be deployed with a flexible Docker base, allowing the same container image to be seamlessly deployed to a Serverless endpoint.
-
-This workflow lets you develop and thoroughly test your application using a containerized Pod environment, ensuring it works correctly. Then, when you're ready to deploy to production, you can deploy it instantly to Serverless.
-
-Follow the steps below to create a worker image that leverages this flexibility, allowing for faster iteration and more robust deployments.
+This "Pod-first" workflow lets you develop and test interactively in a GPU environment, then seamlessly transition to Serverless for production. You'll use a Pod as your cloud-based development machine with tools like Jupyter Notebooks and SSH, catching issues early before deploying to Serverless.
-
-To get a basic dual-mode worker up and running immediately, you can [clone this repository](https://github.com/justinwlin/Runpod-GPU-And-Serverless-Base) and use it as a base.
-
+To get started quickly, you can [clone this repository](https://github.com/justinwlin/Runpod-GPU-And-Serverless-Base) for a ready-to-use dual-mode worker base.
## What you'll learn
-In this tutorial you'll learn how to:
+In this guide you'll learn how to:
-* Set up a project for a dual-mode Serverless worker.
-* Create a handler file (`handler.py`) that adapts its behavior based on a user-specified environment variable.
-* Write a startup script (`start.sh`) to manage different operational modes.
-* Build a Docker image designed for flexibility.
-* Understand and utilize the "Pod-first" development workflow.
-* Deploy and test your worker in both Pod and Serverless environments.
+- Set up a project for a dual-mode Serverless worker.
+- Create a handler that adapts based on an environment variable.
+- Write a startup script to manage different operational modes.
+- Build a Docker image that works in both Pod and Serverless environments.
+- Deploy and test your worker in both environments.
## Requirements
-* You've [created a Runpod account](/get-started/manage-accounts).
-* You've installed [Python 3.x](https://www.python.org/downloads/) and [Docker](https://docs.docker.com/get-started/get-docker/) on your local machine and configured them for your command line.
-* Basic understanding of Docker concepts and shell scripting.
+- You've [created a Runpod account](/get-started/manage-accounts).
+- You've installed [Python 3.x](https://www.python.org/downloads/) and [Docker](https://docs.docker.com/get-started/get-docker/) and configured them for your command line.
+- Basic understanding of Docker concepts and shell scripting.
## Step 1: Set up your project structure
-First, create a directory for your project and the necessary files.
-
- Open your terminal and run the following commands:
+Create a directory for your project and the necessary files:
```sh
mkdir dual-mode-worker
@@ -62,13 +44,13 @@ This creates:
- `Dockerfile`: Instructions to build your Docker image.
- `requirements.txt`: A file to list Python dependencies.
-## Step 2: Create the `handler.py` file
+## Step 2: Create the handler
-This Python script will contain your core logic. It will check for a user-specified environment variable `MODE_TO_RUN` to determine whether to run in Pod or Serverless mode.
+This Python script will check for a `MODE_TO_RUN` environment variable to determine whether to run in Pod or Serverless mode.
Add the following code to `handler.py`:
-```python
+```python handler.py
import os
import asyncio
import runpod
@@ -111,7 +93,7 @@ Key features:
## Step 3: Create the `start.sh` script
-This script will be the entrypoint for your Docker container. It reads the `MODE_TO_RUN` environment variable and configures the container accordingly.
+The `start.sh` script serves as the entrypoint for your Docker container and manages different operational modes. It reads the `MODE_TO_RUN` environment variable and configures the container accordingly.
Add the following code to `start.sh`:
@@ -210,18 +192,17 @@ export_env_vars
echo "Start script(s) finished"
sleep infinity
-
```
-Key features:
+
+Here are some key features of this script:
+
* `case $MODE_TO_RUN in ... esac`: This structure directs the startup based on the mode.
* `serverless` mode: Executes `handler.py`, which then starts the Runpod Serverless worker. `exec` replaces the shell process with the Python process.
* `pod` mode: Starts up the JupyterLab server for Pod development, then runs `sleep infinity` to keep the container alive so you can connect to it (e.g., via SSH or `docker exec`). You would then manually run `python /app/handler.py` inside the Pod to test your handler logic.
## Step 4: Create the `Dockerfile`
-This file defines how to build your Docker image.
-
-Add the following content to `Dockerfile`:
+Create a `Dockerfile` that includes your handler and startup script:
```dockerfile
# Use an official Runpod base image
@@ -283,7 +264,9 @@ RUN ls -la $WORKSPACE_DIR/start.sh
# depot build -t justinrunpod/pod-server-base:1.0 . --push --platform linux/amd64
CMD $WORKSPACE_DIR/start.sh
```
-Key features:
+
+Key features of this `Dockerfile`:
+
* `FROM runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04`: Starts with a Runpod base image that comes with nginx, runpodctl, and other helpful base packages.
* `ARG WORKSPACE_DIR=/workspace` and `ENV WORKSPACE_DIR=${WORKSPACE_DIR}`: Allows the workspace directory to be set at build time.
* `WORKDIR $WORKSPACE_DIR`: Sets the working directory to the value of `WORKSPACE_DIR`.
@@ -300,21 +283,21 @@ Instead of building and pushing your image via Docker Hub, you can also [deploy
-Now, build your Docker image and push it to a container registry like Docker Hub.
+Now you're ready to build your Docker image and push it to Docker Hub:
- Build your Docker image, replacing `[YOUR_USERNAME]` with your Docker Hub username and choosing a suitable image name:
+ Build your Docker image, replacing `YOUR_USERNAME` with your Docker Hub username and choosing a suitable image name:
```sh
- docker build --platform linux/amd64 --tag [YOUR_USERNAME]/dual-mode-worker .
+ docker build --platform linux/amd64 --tag YOUR_USERNAME/dual-mode-worker .
```
The `--platform linux/amd64` flag is important for compatibility with Runpod's infrastructure.
```sh
- docker push [YOUR_USERNAME]/dual-mode-worker:latest
+ docker push YOUR_USERNAME/dual-mode-worker:latest
```
@@ -330,38 +313,40 @@ Now that you've finished building our Docker image, let's explore how you would
Deploy the image to a Pod by following these steps:
-1. Go to the [Pods page](https://www.runpod.io/console/pods) in the Runpod console and click **Create Pod**.
-2. Select an appropriate GPU for your workload (see [Choose a Pod](/pods/choose-a-pod) for guidance).
-3. Under **Pod Template**, select **Edit Template**.
-4. Under **Container Image**, enter `[YOUR_USERNAME]/dual-mode-worker:latest`.
-5. Under **Public Environment Variables**, select **Add environment variable**. Set variable key to **`MODE_TO_RUN`** and the value to **`pod`**.
-6. Click **Set Overrides**, then deploy your Pod.
-
-After [connecting to the Pod](/pods/connect-to-a-pod), navigate to `/app` and run your handler directly:
-
-```sh
-python handler.py
-```
-
-This will execute the Pod-specific test harness in your `handler.py`, giving you immediate feedback. You can edit `handler.py` within the Pod and re-run it for rapid iteration.
+1. Navigate to the [Pods page](https://www.runpod.io/console/pods) in the Runpod console.
+2. Click **Deploy**.
+3. Select your preferred GPU.
+4. Under **Container Image**, enter `YOUR_USERNAME/dual-mode-worker:latest`.
+5. Under **Public Environment Variables**, select **Add environment variable** and add:
+ - Key: `MODE_TO_RUN`
+ - Value: `pod`
+6. Click **Deploy**.
+
+Once your Pod is running, you can:
+- [Connect via the web terminal, JupyterLab, or SSH](/pods/connect-to-a-pod) to test your handler interactively.
+- Debug and iterate on your code.
+- Test GPU-specific operations.
+- Edit `handler.py` within the Pod and re-run it for rapid iteration.
## Step 7: Deploy to a Serverless endpoint
Once you're confident with your `handler.py` logic tested in Pod mode, you're ready to deploy your dual-mode worker to a Serverless endpoint.
-1. Go to the [Serverless section](https://www.runpod.io/console/serverless) of the Runpod console.
+1. Navigate to the [Serverless page](https://www.runpod.io/console/serverless) in the Runpod console.
2. Click **New Endpoint**.
3. Click **Import from Docker Registry**.
-4. In the **Container Image** field, enter your Docker image URL: `docker.io/[YOUR_USERNAME]/dual-mode-worker:latest`, then click *Next****.
-5. Under **Environment Variables**, set `MODE_TO_RUN` to `serverless`.
-6. Configure GPU, workers, and other settings as needed.
-7. Select **Create Endpoint**.
+4. In the **Container Image** field, enter your Docker image URL: `docker.io/YOUR_USERNAME/dual-mode-worker:latest`, then click **Next**.
+5. Under **Environment Variables**, add:
+ - Key: `MODE_TO_RUN`
+ - Value: `serverless`
+6. Configure your endpoint settings (GPU type, workers, etc.).
+7. Click **Create Endpoint**.
-The *same* image is used, but `start.sh` will now direct it to run in Serverless mode, starting the `runpod.serverless.start` worker.
+The *same* image will be used for your workers, but `start.sh` will now direct them to run in Serverless mode, using the `runpod.serverless.start` function to process requests.
## Step 8: Test your endpoint
-After deploying your endpoint in to Serverless mode, you can test it with the following steps:
+After deploying your endpoint in Serverless mode, you can test it by sending API requests to your endpoint.
1. Navigate to your endpoint's detail page in the Runpod console.
2. Click the **Requests** tab.
@@ -398,14 +383,14 @@ Congratulations! You've successfully built, deployed, and tested a dual-mode Ser
1. Deploy your initial Docker image to a Runpod Pod, ensuring `MODE_TO_RUN` is set to `pod` (or rely on the Dockerfile default).
2. [Connect to your Pod](/pods/connect-to-a-pod) (via SSH or web terminal).
3. Navigate to the `/app` directory.
- 4. As you develop, install any necessary Python packages (`pip install [PACKAGE_NAME]`) or system dependencies (`apt-get install [PACKAGE_NAME]`).
+ 4. As you develop, install any necessary Python packages (`pip install PACKAGE_NAME`) or system dependencies (`apt-get install PACKAGE_NAME`).
5. Iterate on your `handler.py` script. Test your changes frequently by running `python handler.py` directly in the Pod's terminal. This will execute the test harness you defined in the `elif MODE_TO_RUN == "pod":` block, giving you immediate feedback.
Once you're satisfied with a set of changes and have new dependencies:
1. Add new Python packages to your `requirements.txt` file.
- 2. Add system installation commands (e.g., `RUN apt-get update && apt-get install -y [PACKAGE_NAME]`) to your `Dockerfile`.
+ 2. Add system installation commands (e.g., `RUN apt-get update && apt-get install -y PACKAGE_NAME`) to your `Dockerfile`.
3. Ensure your updated `handler.py` is saved.
@@ -425,13 +410,4 @@ Congratulations! You've successfully built, deployed, and tested a dual-mode Ser
-This iterative loop—write your handler, update the Docker image, test in Pod mode, then deploy to Serverless—allows for rapid development and debugging of your Serverless workers.
-
-## Next steps
-
-Now that you've mastered the dual-mode development workflow, you can:
-
-* [Explore advanced handler functions.](/serverless/workers/handler-functions)
-* [Learn about sending requests programmatically via API or SDKs.](/serverless/endpoints/send-requests)
-* [Understand endpoint configurations for performance and cost optimization.](/serverless/endpoints/endpoint-configurations)
-* [Deep dive into local testing and development.](/serverless/development/local-testing)
+This iterative loop (write your handler, update the Docker image, test in Pod mode, then deploy to Serverless) enables you to rapidly develop and debug your Serverless workers.
\ No newline at end of file
diff --git a/serverless/development/environment-variables.mdx b/serverless/development/environment-variables.mdx
index 74d0dae4..6a40cb33 100644
--- a/serverless/development/environment-variables.mdx
+++ b/serverless/development/environment-variables.mdx
@@ -1,91 +1,248 @@
---
-title: "Use environment variables"
+title: "Environment variables"
+description: "Configure your Serverless workers with environment variables."
---
-Incorporating environment variables into your Handler Functions is a key aspect of managing external resources like S3 buckets.
+Environment variables let you configure your workers without hardcoding credentials or settings in your code. They're ideal for managing API keys, service URLs, feature flags, and other configuration that changes between development and production.
-This section focuses on how to use environment variables to facilitate the uploading of images to an S3 bucket using Runpod Handler Functions.
+## How environment variables work
-You will go through the process of writing Python code for the uploading and setting the necessary environment variables in the Web interface.
+Environment variables are set in the Runpod console and are available to your handler at runtime through `os.environ`. Your handler can read these variables to configure its behavior.
-## Prerequisites
+### Access environment variables in your handler
-* Ensure the Runpod Python library is installed: `pip install runpod`.
-* Have an image file named `image.png` in the Docker container's working directory.
+```python
+import os
+import runpod
-## Python Code for S3 Uploads
+def handler(job):
+ # Read an environment variable
+ api_key = os.environ.get("API_KEY")
+ service_url = os.environ.get("SERVICE_URL", "https://default-url.com")
+
+ # Use the configuration
+ result = call_external_service(service_url, api_key)
+ return {"output": result}
-Let's break down the steps to upload an image to an S3 bucket using Python:
+runpod.serverless.start({"handler": handler})
+```
-1. **Handler Function for S3 Upload**: Here's an example of a handler function that uploads `image.png` to an S3 bucket and returns the image URL:
+## Set environment variables
- ```python
- from runpod.serverless.utils import rp_upload
- import runpod
+Set environment variables in the Runpod console when creating or editing your endpoint:
+1. Navigate to your endpoint in the [Runpod console](https://www.runpod.io/console/serverless).
+2. Click on the **Settings** tab.
+3. Scroll to the **Environment Variables** section.
+4. Add your variables as key-value pairs.
+5. Click **Save** to apply the changes.
- def handler(job):
- image_url = rp_upload.upload_image(job["id"], "./image.png")
- return [image_url]
+## Build-time vs runtime variables
+There are two types of environment variables:
- runpod.serverless.start({"handler": handler})
- ```
+### Build-time variables
-2. **Packaging Your Code**: Follow the guidelines in [Worker Image Creation](/serverless/workers/deploy) for packaging and deployment.
+Build-time variables are set in your Dockerfile using the `ENV` instruction. These are baked into your Docker image during the build:
-### Setting Environment Variables for S3
+```dockerfile
+FROM runpod/base:0.4.0-cuda11.8.0
-Using environment variables securely passes the necessary credentials and configurations to your serverless function:
+# Build-time environment variables
+ENV MODEL_NAME="llama-2-7b"
+ENV DEFAULT_TEMPERATURE="0.7"
-1. **Accessing Environment Variables Setting**: In the template creation/editing interface of your pod, navigate to the bottom section where you can set environment variables.
+COPY handler.py /handler.py
+CMD ["python", "-u", "/handler.py"]
+```
-2. **Configuring S3 Variables**: Set the following key variables for your S3 bucket:
+Build-time variables are useful for:
+- Default configuration values.
+- Values that rarely change.
+- Non-sensitive information.
- * `BUCKET_ENDPOINT_URL`
- * `BUCKET_ACCESS_KEY_ID`
- * `BUCKET_SECRET_ACCESS_KEY`
+### Runtime variables
-Ensure that your `BUCKET_ENDPOINT_URL` includes the bucket name. For example: `https://your-bucket-name.nyc3.digitaloceanspaces.com` | `https://your-bucket-name.nyc3.digitaloceanspaces.com`
+Runtime variables are set in the Runpod console and can be changed without rebuilding your image. These override build-time variables with the same name.
-## Testing your API
+Runtime variables are useful for:
+- API keys and secrets.
+- Environment-specific configuration (dev, staging, prod).
+- Values that change frequently.
+- Sensitive information that shouldn't be in your image.
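+
+Your handler reads runtime variables exactly the same way it reads build-time defaults; whichever value is present in the worker's environment wins. Here's a quick sketch that reuses the variable names from the Dockerfile above:
+
+```python
+import os
+
+# A runtime variable set in the console overrides the build-time ENV
+# default with the same name; otherwise the baked-in value is used.
+model_name = os.environ.get("MODEL_NAME", "llama-2-7b")
+temperature = float(os.environ.get("DEFAULT_TEMPERATURE", "0.7"))
+```
+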
-Finally, test the serverless function to confirm that it successfully uploads images to your S3 bucket:
+## Common use cases
-1. **Making a Request**: Make a POST request to your API endpoint with the necessary headers and input data. Remember, the input must be a JSON item:
+### API keys and secrets
- ```java
- import requests
+Store sensitive credentials as runtime environment variables:
- endpoint = "https://api.runpod.ai/v2/xxxxxxxxx/run"
- headers = {"Content-Type": "application/json", "Authorization": "Bearer XXXXXXXXXXXXX"}
- input_data = {"input": {"inp": "this is an example input"}}
+```python
+import os
+import runpod
+import requests
+
+def handler(event):
+ # Read API keys from environment
+ openai_key = os.environ.get("OPENAI_API_KEY")
+ anthropic_key = os.environ.get("ANTHROPIC_API_KEY")
+
+ # Use them in your code
+ if not openai_key:
+ return {"error": "OPENAI_API_KEY not configured"}
+
+ # Your API call here
+ result = call_openai(openai_key, event["input"]["prompt"])
+ return {"output": result}
- response = requests.post(endpoint, json=input_data, headers=headers)
- ```
+runpod.serverless.start({"handler": handler})
+```
-2. **Checking the Output**: Make a GET request to retrieve the job status and output. Here’s an example of how to do it:
+
+Never hardcode API keys or secrets in your code. Always use environment variables to keep credentials secure.
+
- ```csharp
- response = requests.get(
- "https://api.runpod.ai/v2/xxxxxxxxx/status/" + response.json()["id"],
- headers=headers,
- )
- response.json()
- ```
+### S3 bucket configuration
- The response should include the URL of the uploaded image on completion:
+Configure S3 or S3-compatible storage for uploading results:
- ```json
- {
- "delayTime": 86588,
- "executionTime": 1563,
- "id": "e3d2e250-ea81-4074-9838-1c52d006ddcf",
- "output": [
- "https://your-bucket.s3.us-west-004.backblazeb2.com/your-image.png"
- ],
- "status": "COMPLETED"
- }
- ```
+```python
+import os
+import runpod
+from runpod.serverless.utils import rp_upload
+
+def handler(event):
+ # S3 credentials are read from environment variables:
+ # - BUCKET_ENDPOINT_URL
+ # - BUCKET_ACCESS_KEY_ID
+ # - BUCKET_SECRET_ACCESS_KEY
+
+ # Process your input
+ result_image_path = generate_image(event["input"]["prompt"])
+
+ # Upload to S3
+ image_url = rp_upload.upload_image(event["id"], result_image_path)
+
+ return {"output": {"image_url": image_url}}
+
+runpod.serverless.start({"handler": handler})
+```
+
+Set these variables in the Runpod console:
+- `BUCKET_ENDPOINT_URL`: Your bucket endpoint (e.g., `https://your-bucket.s3.us-west-2.amazonaws.com`)
+- `BUCKET_ACCESS_KEY_ID`: Your access key ID
+- `BUCKET_SECRET_ACCESS_KEY`: Your secret access key
+
+
+The `BUCKET_ENDPOINT_URL` should include your bucket name in the URL.
+
+
+### Feature flags
+
+Use environment variables to enable or disable features:
+
+```python
+import os
+import runpod
+
+def handler(event):
+ # Read feature flags
+ enable_caching = os.environ.get("ENABLE_CACHING", "false").lower() == "true"
+ enable_logging = os.environ.get("ENABLE_LOGGING", "true").lower() == "true"
+
+ if enable_logging:
+ print(f"Processing request: {event['id']}")
+
+ # Your processing logic
+ result = process_input(event["input"], use_cache=enable_caching)
+
+ return {"output": result}
+
+runpod.serverless.start({"handler": handler})
+```
+
+### Model configuration
+
+Configure model parameters without changing code:
+
+```python
+import os
+import runpod
+
+def handler(event):
+ # Read model configuration from environment
+ model_name = os.environ.get("MODEL_NAME", "default-model")
+ max_tokens = int(os.environ.get("MAX_TOKENS", "1024"))
+ temperature = float(os.environ.get("TEMPERATURE", "0.7"))
+
+ # Use configuration in your model
+ result = generate_text(
+ model=model_name,
+ prompt=event["input"]["prompt"],
+ max_tokens=max_tokens,
+ temperature=temperature
+ )
+
+ return {"output": result}
+
+runpod.serverless.start({"handler": handler})
+```
+
+## Best practices
+
+### Use defaults
+
+Always provide default values for non-critical environment variables:
+
+```python
+# Good: Provides a default
+service_url = os.environ.get("SERVICE_URL", "https://api.example.com")
+
+# Good: Fails explicitly if missing
+api_key = os.environ.get("API_KEY")
+if not api_key:
+ raise ValueError("API_KEY environment variable is required")
+```
+
+### Validate on startup
+
+Validate critical environment variables when your handler starts:
+
+```python
+import os
+import runpod
+
+# Validate environment variables on startup
+required_vars = ["API_KEY", "SERVICE_URL"]
+missing_vars = [var for var in required_vars if not os.environ.get(var)]
+
+if missing_vars:
+ raise ValueError(f"Missing required environment variables: {', '.join(missing_vars)}")
+
+def handler(event):
+ # Your handler logic here
+ pass
+
+runpod.serverless.start({"handler": handler})
+```
+
+### Document your variables
+
+Document the environment variables your handler expects in your README:
+
+```markdown
+## Environment Variables
+
+| Variable | Required | Default | Description |
+|----------|----------|---------|-------------|
+| `API_KEY` | Yes | N/A | Your API key for the external service |
+| `SERVICE_URL` | No | `https://api.example.com` | The service endpoint URL |
+| `MAX_WORKERS` | No | `4` | Maximum number of concurrent workers |
+```
+
+### Separate secrets from config
+
+Use different approaches for secrets vs configuration:
+- **Secrets**: Only set as runtime variables in the Runpod console.
+- **Configuration**: Can use build-time defaults with runtime overrides.
-By following these steps, you can effectively use environment variables to manage S3 bucket credentials and operations within your Runpod Handler Functions. This approach ensures secure, scalable, and efficient handling of external resources in your serverless applications.
diff --git a/serverless/development/error-handling.mdx b/serverless/development/error-handling.mdx
new file mode 100644
index 00000000..836c2924
--- /dev/null
+++ b/serverless/development/error-handling.mdx
@@ -0,0 +1,112 @@
+---
+title: "Error handling"
+sidebarTitle: "Error handling"
+description: "Implement robust error handling for your Serverless workers."
+---
+
+Robust error handling is essential for production Serverless workers. It prevents your worker from crashing silently and ensures that useful error messages are returned to the user, making debugging significantly easier.
+
+## Basic error handling
+
+The simplest way to handle errors is to wrap your handler logic in a `try...except` block. This ensures that even if your logic fails, the worker remains stable and returns a readable error message.
+
+```python
+import runpod
+
+def handler(job):
+ try:
+        job_input = job["input"]
+
+        # Replace process_input() with your own handler logic
+        result = process_input(job_input)
+
+ return {"output": result}
+ except KeyError as e:
+ return {"error": f"Missing required input: {str(e)}"}
+ except Exception as e:
+ return {"error": f"An error occurred: {str(e)}"}
+
+runpod.serverless.start({"handler": handler})
+```
+
+## Structured error responses
+
+For more complex applications, you should return consistent error objects. This allows the client consuming your API to programmatically handle different types of errors, such as [validation failures](/serverless/development/validation) versus unexpected server errors.
+
+```python
+import runpod
+import traceback
+
+def handler(job):
+ try:
+ # Validate input
+ if "prompt" not in job.get("input", {}):
+ return {
+ "error": {
+ "type": "ValidationError",
+ "message": "Missing required field: prompt",
+ "details": "The 'prompt' field is required in the input object"
+ }
+ }
+
+ prompt = job["input"]["prompt"]
+ result = process_prompt(prompt)
+ return {"output": result}
+
+ except ValueError as e:
+ return {
+ "error": {
+ "type": "ValueError",
+ "message": str(e),
+ "details": "Invalid input value provided"
+ }
+ }
+ except Exception as e:
+ # Log the full traceback for debugging
+ print(f"Unexpected error: {traceback.format_exc()}")
+ return {
+ "error": {
+ "type": "UnexpectedError",
+ "message": "An unexpected error occurred",
+ "details": str(e)
+ }
+ }
+
+runpod.serverless.start({"handler": handler})
+```
+
+## Timeout handling
+
+For long-running operations, it is best practice to implement timeout logic within your handler. This prevents a job from hanging indefinitely and consuming credits without producing a result. The example below uses `signal.SIGALRM`, which is available in the Linux environments your workers run in (but not on Windows if you run the handler locally).
+
+```python
+import runpod
+import signal
+
+class TimeoutError(Exception):
+ pass
+
+def timeout_handler(signum, frame):
+ raise TimeoutError("Operation timed out")
+
+def handler(job):
+ try:
+ # Set a timeout (e.g., 60 seconds)
+ signal.signal(signal.SIGALRM, timeout_handler)
+ signal.alarm(60)
+
+ # Your processing code here
+ result = long_running_operation(job["input"])
+
+ # Cancel the timeout
+ signal.alarm(0)
+
+ return {"output": result}
+
+ except TimeoutError:
+ return {"error": "Request timed out after 60 seconds"}
+ except Exception as e:
+ return {"error": str(e)}
+
+runpod.serverless.start({"handler": handler})
+```
\ No newline at end of file
diff --git a/serverless/development/local-testing.mdx b/serverless/development/local-testing.mdx
index ed8dd1c8..c792cb5c 100644
--- a/serverless/development/local-testing.mdx
+++ b/serverless/development/local-testing.mdx
@@ -1,177 +1,200 @@
---
-title: "Test locally"
+title: "Local testing"
+description: "Test your Serverless handlers locally before deploying to production."
---
-When developing your Handler Function for Runpod serverless, it's crucial to test it thoroughly in a local environment before deployment. The Runpod SDK provides multiple ways to facilitate this local testing, allowing you to simulate various scenarios and inputs without consuming cloud resources.
+Testing your handler locally before deploying saves time and helps you catch issues early. The Runpod SDK provides multiple ways to test your handler function without consuming cloud resources.
-## Custom Inputs
+## Basic testing
-The simplest way to test your Handler Function is by passing a custom input directly when running your Python file.
-
-This method is ideal for quick checks and iterative development.
+The simplest way to test your handler is by running it directly with test input.
### Inline JSON
-You can pass inline json to your function to test its response.
-
-Assuming your handler function is in a file named `your_handler.py`, you can test it like this:
+Pass test input directly via the command line:
-
-
```sh
-python your_handler.py \
- --test_input '{"input": {"prompt": "The quick brown fox jumps"}}'
+python handler.py --test_input '{"input": {"prompt": "Hello, world!"}}'
```
-
-
-
-Add the following file to your project and run the command.
-
-```py
-import runpod
-
+This runs your handler with the specified input and displays the output in your terminal.
-def handler(event):
- """
- This is a sample handler function that echoes the input
- and adds a greeting.
- """
- try:
- # Extract the prompt from the input
- prompt = event["input"]["prompt"]
+### Test file
- result = f"Hello! You said: {prompt}"
+For more complex or reusable test inputs, create a `test_input.json` file in the same directory as your handler:
- # Return the result
- return {"output": result}
- except Exception as e:
- # If there's an error, return it
- return {"error": str(e)}
+```json test_input.json
+{
+ "input": {
+ "prompt": "This is a test input from a JSON file"
+ }
+}
+```
+Run your handler without any arguments:
-# Start the serverless function
-runpod.serverless.start({"handler": handler})
+```sh
+python handler.py
```
-
+The SDK automatically detects and uses the `test_input.json` file.
-
+
+If you provide both a `test_input.json` file and the `--test_input` flag, the command-line input takes precedence.
+
-This command runs your handler with the specified input, allowing you to verify the output and behavior quickly.
+## Local API server
-### JSON file
+For more comprehensive testing, start a local API server that simulates your Serverless endpoint. This lets you send HTTP requests to test your handler as if it were deployed.
-For more complex or reusable test inputs, you can use a `test_input.json` file.
+Start the local server:
-This approach allows you to easily manage and version control your test cases.
+```sh
+python handler.py --rp_serve_api
+```
-Create a file named `test_input.json` in the same directory as your `your_handler.py` file. For example:
+This starts a FastAPI server on `http://localhost:8000`.
-```json
-{
- "input": {
- "prompt": "This is a test input from JSON file"
- }
-}
-```
+### Send requests to the server
-2. Run your handler using the following command:
+Once your local server is running, send HTTP `POST` requests from another terminal to test your function:
```sh
-python your_handler.py
+curl -X POST http://localhost:8000/runsync \
+ -H "Content-Type: application/json" \
+ -d '{"input": {"prompt": "Hello, world!"}}'
```
-When you run this command, the script will automatically detect and use the `test_input.json` file if it exists.
+
+The `/run` endpoint only returns a fake request ID without executing your code, since async mode requires communication with Runpod's system. For local testing, use `/runsync` to execute your handler and get results immediately.
+
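+If you prefer to test from Python instead of curl, you can send the same request with the `requests` library (a minimal sketch, assuming `requests` is installed and the local server is running on the default port):
+
+```python
+import requests
+
+# Send a synchronous request to the local /runsync endpoint
+response = requests.post(
+    "http://localhost:8000/runsync",
+    json={"input": {"prompt": "Hello, world!"}},
+    timeout=60,
+)
+print(response.json())
+```
+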
-3. The output will indicate that it's using the input from the JSON file:
+## Testing concurrency
+
+To test how your handler performs under parallel execution, use the `--rp_api_concurrency` flag to set the number of concurrent workers.
+
+This command starts your local server with 4 concurrent workers:
```sh
---- Starting Serverless Worker | Version 1.6.2 ---
-INFO | Using test_input.json as job input.
-DEBUG | Retrieved local job: {'input': {'prompt': 'This is a test from JSON file'}, 'id': 'local_test'}
-INFO | local_test | Started.
-DEBUG | local_test | Handler output: {'output': 'Hello! You said: This is a test from JSON file'}
-DEBUG | local_test | run_job return: {'output': {'output': 'Hello! You said: This is a test from JSON file'}}
-INFO | Job local_test completed successfully.
-INFO | Job result: {'output': {'output': 'Hello! You said: This is a test from JSON file'}}
-INFO | Local testing complete, exiting.
+python main.py --rp_serve_api --rp_api_concurrency 4
```
-Using `test_input.json` is particularly helpful when:
+
+When using `--rp_api_concurrency` with a value greater than 1, your main file must be named `main.py` for proper FastAPI integration. If your file has a different name, rename it to `main.py` before running with multiple workers.
+
+
+### Testing concurrent requests
-* You have complex input structures that are cumbersome to type in the command line.
-* You want to maintain a set of test cases that you can easily switch between.
-* You're collaborating with a team and want to share standardized test inputs.
+Send multiple requests simultaneously to test concurrency:
-
+```bash
+for i in {1..10}; do
+ curl -X POST http://localhost:8000/runsync \
+ -H "Content-Type: application/json" \
+ -d '{"input": {}}' &
+done
+```
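+
+You can also drive concurrent requests from Python with a thread pool, which makes it easier to collect and inspect the responses (a sketch that assumes the `requests` library is installed and the local server is running):
+
+```python
+import concurrent.futures
+import requests
+
+def send_request(i):
+    # Each call hits the local /runsync endpoint started with --rp_serve_api
+    response = requests.post(
+        "http://localhost:8000/runsync",
+        json={"input": {"request_number": i}},
+        timeout=60,
+    )
+    return response.json()
+
+# Send 10 requests in parallel and gather the results
+with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
+    results = list(pool.map(send_request, range(10)))
+
+print(results)
+```
+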
-If you provide a test input via the command line (`--test_input` argument), it will override the `test_input.json` file. This allows for flexibility in your testing process.
+### Handling concurrency in your code
-
+If your handler uses shared state (like global variables), use proper synchronization to avoid race conditions:
-## Local Test Server
+```python
+import runpod
+from threading import Lock
-For more comprehensive testing, especially when you want to simulate HTTP requests to your serverless function, you can launch a local test server. This server provides an endpoint that you can send requests to, mimicking the behavior of a deployed serverless function.
+counter = 0
+counter_lock = Lock()
-To start the local test server, use the `--rp_serve_api` flag:
-```sh
-python your_handler.py --rp_serve_api
+def handler(event):
+ global counter
+ with counter_lock:
+ counter += 1
+ return {"counter": counter}
+
+
+runpod.serverless.start({"handler": handler})
```
-This command starts a FastAPI server on your local machine, accessible at `http://localhost:8000`.
+## Debugging
+
+### Log levels
-### Customizing the Local Server
+Control the verbosity of console output with the `--rp_log_level` flag:
-You can further customize the local server using additional flags:
+```sh
+python handler.py --rp_serve_api --rp_log_level DEBUG
+```
+
+Available log levels:
+- `ERROR`: Only show error messages.
+- `WARN`: Show warnings and errors.
+- `INFO`: Show general information, warnings, and errors.
+- `DEBUG`: Show all messages, including detailed debug information.
-* `--rp_api_port`: Specify a custom port (default is 8000)
-* `--rp_api_host`: Set the host address (default is "localhost")
-* `--rp_api_concurrency`: Set the number of worker processes (default is 1)
+### Enable the debugger
-Example:
+Use the `--rp_debugger` flag for detailed troubleshooting:
```sh
-python main.py \
- --rp_serve_api \
- --rp_api_port 8080 \
- --rp_api_concurrency 4
+python handler.py --rp_serve_api --rp_debugger
```
-This starts the server on port `8080` with 4 worker processes.
+This enables the Runpod debugger, which provides additional diagnostic information to help you troubleshoot issues.
-### Sending Requests to the Local Server
+## Server configuration
-Once your local server is running, you can send HTTP POST requests to test your function. Use tools like `curl` or Postman, or write scripts to automate your tests.
+Customize the local API server with these flags:
-Example using `curl`:
+### Port
+
+Set a custom port (default is 8000):
```sh
-curl -X POST http://localhost:8000/runsync \
- -H "Content-Type: application/json" \
- -d '{"input": {"prompt": "The quick brown fox jumps"}}'
+python handler.py --rp_serve_api --rp_api_port 8080
```
-
+### Host
-When testing locally, the /run endpoint only returns a fake requestId without executing your code, as async mode requires communication with our system. This is why you can’t check job status using /status. For local testing, use /runsync. To test async functionality, you’ll need to deploy your app on our platform.
+Set the hostname (default is "localhost"):
-
+```sh
+python handler.py --rp_serve_api --rp_api_host 0.0.0.0
+```
+
+
+Setting `--rp_api_host` to `0.0.0.0` allows connections from other devices on the network. This can be useful for testing but may have security implications.
+
-## Advanced testing options
+## Flag reference
-The Runpod SDK offers additional flags for more advanced testing scenarios:
+Here's a complete reference of all available flags for local testing:
-* `--rp_log_level`: Control log verbosity (options: ERROR, WARN, INFO, DEBUG)
-* `--rp_debugger`: Enable the Runpod debugger for detailed troubleshooting
+| Flag | Description | Default | Example |
+|------|-------------|---------|---------|
+| `--rp_serve_api` | Starts the local API server | N/A | `--rp_serve_api` |
+| `--rp_api_port` | Sets the server port | 8000 | `--rp_api_port 8080` |
+| `--rp_api_host` | Sets the server hostname | "localhost" | `--rp_api_host 0.0.0.0` |
+| `--rp_api_concurrency` | Sets concurrent workers | 1 | `--rp_api_concurrency 4` |
+| `--rp_log_level` | Controls log verbosity | INFO | `--rp_log_level DEBUG` |
+| `--rp_debugger` | Enables the debugger | Disabled | `--rp_debugger` |
+| `--test_input` | Provides test input as JSON | N/A | `--test_input '{"input": {}}'` |
-Example:
+## Combined example
+
+You can combine multiple flags to create a customized local testing environment:
```sh
-python your_handler.py --rp_serve_api --rp_log_level DEBUG --rp_debugger
+python handler.py --rp_serve_api \
+ --rp_api_port 8080 \
+ --rp_api_concurrency 4 \
+ --rp_log_level DEBUG \
+ --rp_debugger
```
-Local testing is a crucial step in developing robust and reliable serverless functions for Runpod. By utilizing these local testing options, you can catch and fix issues early, optimize your function's performance, and ensure a smoother deployment process.
-
-For more detailed information on local testing and advanced usage scenarios, refer to our [blog post](https://blog.runpod.io/workers-local-api-server-introduced-with-runpod-python-0-9-13/) and the other tutorials in this documentation.
+This command:
+- Starts the local API server on port 8080.
+- Uses 4 concurrent workers.
+- Sets the log level to `DEBUG` for maximum information.
+- Enables the debugger for troubleshooting.
\ No newline at end of file
diff --git a/serverless/development/logs.mdx b/serverless/development/logs.mdx
index aa578e41..a3471950 100644
--- a/serverless/development/logs.mdx
+++ b/serverless/development/logs.mdx
@@ -1,6 +1,6 @@
---
-title: "Logs"
-sidebarTitle: "Logs"
+title: "Logs and monitoring"
+sidebarTitle: "Logs and monitoring"
description: "Access and manage logs for Serverless endpoints and workers."
---
@@ -8,7 +8,7 @@ description: "Access and manage logs for Serverless endpoints and workers."
-Runpod provides comprehensive logging capabilities for Serverless endpoints and workers to help you monitor, debug, and troubleshoot your applications. Understanding the different types of logs and their persistence characteristics is crucial for effective application management.
+Runpod provides comprehensive logging capabilities for Serverless endpoints and workers to help you monitor, debug, and troubleshoot your applications.
## Endpoint logs
@@ -256,6 +256,78 @@ if __name__ == "__main__":
runpod.serverless.start({"handler": handler})
```
+## Structured logging
+
+Outputting structured logs in a machine-readable format (typically JSON) makes it easier to parse, search, and analyze logs programmatically. This is especially useful when exporting logs to external services or analyzing large volumes of logs.
+
+### JSON logging example
+
+```python
+import logging
+import json
+import runpod
+
+def setup_structured_logger():
+ """
+ Configure a logger that outputs JSON-formatted logs.
+ """
+ logger = logging.getLogger("runpod_worker")
+ logger.setLevel(logging.DEBUG)
+
+    # Create a handler that writes log records to the console
+    handler = logging.StreamHandler()
+
+    # No formatter is needed; log_json() below passes pre-formatted JSON strings
+    logger.addHandler(handler)
+
+ return logger
+
+logger = setup_structured_logger()
+
+def log_json(level, message, **kwargs):
+    """
+    Log a structured JSON message through the configured logger.
+    """
+    log_entry = {
+        "level": level,
+        "message": message,
+        **kwargs
+    }
+    logger.log(getattr(logging, level, logging.INFO), json.dumps(log_entry))
+
+def handler(event):
+ request_id = event.get("id", "unknown")
+
+ try:
+ log_json("INFO", "Processing request", request_id=request_id, input_keys=list(event.get("input", {}).keys()))
+
+ # Replace process_input() with your own processing logic
+ result = process_input(event["input"])
+
+ log_json("INFO", "Request completed", request_id=request_id, execution_time_ms=123)
+
+ return {"output": result}
+ except Exception as e:
+ log_json("ERROR", "Request failed", request_id=request_id, error=str(e), error_type=type(e).__name__)
+ return {"error": str(e)}
+
+runpod.serverless.start({"handler": handler})
+```
+
+This produces logs like:
+
+```json
+{"level": "INFO", "message": "Processing request", "request_id": "abc123", "input_keys": ["prompt", "max_length"]}
+{"level": "INFO", "message": "Request completed", "request_id": "abc123", "execution_time_ms": 123}
+```
+
+### Benefits of structured logging
+
+- **Easier parsing**: JSON logs can be easily parsed by log aggregation tools.
+- **Better search**: Search for specific fields like `request_id` or `error_type`.
+- **Analytics**: Analyze trends, patterns, and metrics from log data.
+- **Integration**: Export to external services like Datadog, Splunk, or Elasticsearch.
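+
+Because each log line is a standalone JSON object, you can also filter and analyze saved logs with a few lines of Python. Here's a quick sketch that counts failed requests (the `worker.log` filename is just an example):
+
+```python
+import json
+
+# Collect every JSON-formatted line from a saved log file
+entries = []
+with open("worker.log") as f:
+    for line in f:
+        line = line.strip()
+        if line.startswith("{"):
+            entries.append(json.loads(line))
+
+# Filter on any structured field, such as the log level
+failed = [entry for entry in entries if entry.get("level") == "ERROR"]
+print(f"{len(failed)} of {len(entries)} logged requests failed")
+```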
+
### Accessing stored logs
To access logs stored in network volumes:
diff --git a/serverless/development/optimization.mdx b/serverless/development/optimization.mdx
new file mode 100644
index 00000000..9490b955
--- /dev/null
+++ b/serverless/development/optimization.mdx
@@ -0,0 +1,63 @@
+---
+title: "Optimize your workers"
+sidebarTitle: "Optimization"
+description: "Implement strategies to reduce latency and cost for your Serverless workers."
+---
+
+Optimizing your Serverless workers involves a cycle of measuring performance with [benchmarking](/serverless/development/benchmarking), identifying bottlenecks, and tuning your [endpoint configurations](/serverless/endpoints/endpoint-configurations). This guide covers specific strategies to reduce startup times and improve throughput.
+
+## Optimization overview
+
+Effective optimization requires making conscious tradeoffs between cost, speed, and model size.
+
+To ensure high availability during peak traffic, you should select multiple GPU types in your configuration rather than relying on a single hardware specification. When choosing hardware, a single high-end GPU is generally preferable to multiple lower-tier cards, as the superior memory bandwidth and newer architecture often yield better inference performance than parallelization across weaker cards. When choosing multiple [GPU types](/references/gpu-types), you should select the [GPU categories](/serverless/endpoints/endpoint-configurations#gpu-configuration) that are most likely to be available in your desired data centers.
+
+For latency-sensitive applications, utilizing active workers is the most effective way to eliminate cold starts. You should also configure your [max workers](/serverless/endpoints/endpoint-configurations#max-workers) setting with approximately 20% headroom above your expected concurrency. This buffer ensures that your endpoint can handle sudden load spikes without throttling requests or hitting capacity limits.
+
+Your architectural choices also significantly impact performance. Whenever possible, bake your models directly into the Docker image to leverage the high-speed local NVMe storage of the host machine. If you utilize [network volumes](/storage/network-volumes) for larger datasets, remember that this restricts your endpoint to specific data centers, which effectively shrinks your pool of available compute resources.
+
+
+## Reducing worker startup times
+
+
+There are two key metrics to consider when optimizing your workers:
+
+ - **Delay time**: The time spent waiting for a worker to become available. This includes the cold start time if a new worker needs to be spun up.
+ - **Execution time**: The time the GPU takes to actually process the request once the worker has received the job.
+
+
+Try [benchmarking your workers](/serverless/development/benchmarking) to measure these metrics.
+
+
+**Delay time** consists of:
+
+ - **Initialization time**: The time spent downloading the Docker image.
+ - **Cold start time**: The time spent loading the model into memory.
+
+If your delay time is high, use these strategies to reduce it.
+
+
+If your worker's cold start time exceeds the default 7-minute limit, the system may mark it as unhealthy. You can extend this limit by setting the `RUNPOD_INIT_TIMEOUT` environment variable (e.g. `RUNPOD_INIT_TIMEOUT=800` for 800 seconds).
+
+
+### Embed models in Docker images
+
+For production environments, package your ML models directly within your worker container image instead of downloading them in your handler function. This strategy places models on the worker's high-speed local storage (SSD/NVMe), dramatically reducing the time needed to load models into GPU memory. Note that extremely large models (500GB+) may still require network volume storage.
+
+### Use network volumes during development
+
+For flexibility during development, save large models to a [network volume](/storage/network-volumes) using a Pod or one-time handler, then mount this volume to your Serverless workers. While network volumes offer slower model loading compared to embedding models directly, they can speed up your workflow by enabling rapid iteration and seamless switching between different models and configurations.
+
+### Maintain active workers
+
+Set [active worker counts](/serverless/endpoints/endpoint-configurations#active-workers) above zero to completely eliminate cold starts. These workers remain ready to process requests instantly and cost up to 30% less when idle compared to standard (flex) workers.
+
+You can estimate the optimal number of active workers using the formula: `(Requests per Minute × Request Duration) / 60`. For example, with 6 requests per minute taking 30 seconds each, you would need 3 active workers to handle the load without queuing.
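+
+Here's that estimate expressed as a small helper function:
+
+```python
+def estimate_active_workers(requests_per_minute: float, request_duration_seconds: float) -> float:
+    # (Requests per Minute × Request Duration) / 60
+    return requests_per_minute * request_duration_seconds / 60
+
+print(estimate_active_workers(6, 30))  # 3.0 active workers
+```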
+
+### Optimize scaling parameters
+
+Fine-tune your [auto-scaling configuration](/serverless/endpoints/endpoint-configurations#auto-scaling-type) for more responsive worker provisioning. Lowering the queue delay threshold to 2-3 seconds (default 4) or decreasing the request count threshold allows the system to respond more swiftly to traffic fluctuations.
+
+### Increase maximum worker limits
+
+Set a higher [max worker](/serverless/endpoints/endpoint-configurations#max-workers) limit to ensure your Docker images are pre-cached across multiple compute nodes and data centers. This proactive approach eliminates image download delays during scaling events, significantly reducing startup times.
\ No newline at end of file
diff --git a/serverless/development/overview.mdx b/serverless/development/overview.mdx
index 66bbeae4..3aa61544 100644
--- a/serverless/development/overview.mdx
+++ b/serverless/development/overview.mdx
@@ -1,137 +1,116 @@
---
-title: "Local server flags"
+title: "Serverless development"
+sidebarTitle: "Overview"
+description: "Test, debug, and optimize your Serverless applications."
---
-When developing Runpod Serverless functions, it's crucial to test them thoroughly before deployment. The Runpod SDK provides a powerful local testing environment that allows you to simulate your Serverless endpoints right on your development machine. This local server eliminates the need for constant Docker container rebuilds, uploads, and endpoint updates during the development and testing phase.
-
-To facilitate this local testing environment, the Runpod SDK offers a variety of flags that allow you to customize your setup. These flags enable you to:
-
-* Configure the server settings (port, host, concurrency)
-* Control logging verbosity
-* Enable debugging features
-* Provide test inputs
-
-By using these flags, you can create a local environment that closely mimics the behavior of your functions in the Runpod cloud, allowing for more accurate testing and smoother deployments.
-
-This guide provides a comprehensive overview of all available flags, their purposes, and how to use them effectively in your local testing workflow.
-
-## Basic usage
-
-To start your local server with additional flags, use the following format:
-
-```sh
-python your_function.py [flags]
-```
-
-Replace `your_function.py` with the name of your Python file containing the Runpod handler.
-
-## Available flags
-
-### --rp\_serve\_api
-
-Starts the API server for local testing.
-
-**Usage**:
-
-```sh
-python your_function.py --rp_serve_api
-```
-
-### --rp\_api\_port
-
-Sets the port number for the FastAPI server.
-
-**Default**: 8000
-
-**Usage**:
-
-```sh
-python your_function.py --rp_serve_api --rp_api_port 8080
+When developing for Runpod Serverless, you'll typically write a handler function, test it locally, and then deploy it to production. This guide introduces the development workflow and tools that help you test, debug, and optimize your Serverless applications effectively.
+
+## Development lifecycle
+
+The typical workflow starts with writing your handler function. Your handler receives an event object with input data and returns a response. Once you have a handler function, test it locally using the Runpod SDK's testing environment. You can test with inline JSON inputs, use a local API server, or simulate concurrency, all without actually deploying your code and incurring charges.
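+
+In its simplest form, a handler looks something like this (the greeting logic is just a placeholder for your own processing code):
+
+```python
+import runpod
+
+def handler(event):
+    # The request payload is available under the "input" key
+    name = event["input"].get("name", "world")
+    return {"output": f"Hello, {name}!"}
+
+runpod.serverless.start({"handler": handler})
+```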
+
+When your handler is working correctly, package it into a Docker image and deploy it to a Serverless endpoint. Your worker will auto-scale based on demand. Once deployed, use logs, metrics, and SSH access to troubleshoot issues and optimize performance in production.
+
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#5D29F0','primaryTextColor':'#fff','primaryBorderColor':'#874BFF','lineColor':'#AE6DFF','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#AE6DFF', 'fontSize':'15px','fontFamily':'font-inter'}}}%%
+
+flowchart TD
+ Start([Write handler function]) --> Test[Test handler locally with the Runpod SDK]
+
+ Test --> Check{Tests pass?}
+
+ Check -->|" No "| Fix[Fix code & debug]
+
+ Fix --> Test
+
+ Check -->|" Yes "| Package[Package worker as a Docker image]
+
+ Package --> Deploy[Deploy worker image to Runpod Serverless]
+
+ subgraph Production [Production]
+ Deploy --> Running[Workers auto-scale based on demand]
+ Running --> Monitor[Monitor logs and metrics, SSH into workers for live debugging]
+ end
+
+ Monitor -.-> Start
+
+ style Start fill:#5D29F0,stroke:#874BFF,color:#FFFFFF,stroke-width:2px
+ style Test fill:#874BFF,stroke:#AE6DFF,color:#FFFFFF,stroke-width:2px
+ style Check fill:#1B0656,stroke:#874BFF,color:#FFFFFF,stroke-width:2px
+ style Fix fill:#FFC01F,stroke:#FF6214,color:#000000,stroke-width:2px
+ style Package fill:#5D29F0,stroke:#874BFF,color:#FFFFFF,stroke-width:2px
+ style Deploy fill:#5D29F0,stroke:#874BFF,color:#FFFFFF,stroke-width:2px
+ style Running fill:#5D29F0,stroke:#874BFF,color:#FFFFFF,stroke-width:2px
+ style Monitor fill:#FCB1FF,stroke:#AE6DFF,color:#000000,stroke-width:2px
+ style Production fill:#100433,stroke:#874BFF,color:#FFFFFF,stroke-width:2px
+
+ linkStyle default stroke-width:2px
```
+
-Setting `--rp_api_host` to `0.0.0.0` allows connections from other devices on the network, which can be useful for testing but may have security implications.
+
+For faster iteration and debugging of GPU-intensive applications, you can develop on a Pod first before deploying to Serverless. This "Pod-first" workflow gives you direct access to the GPU environment with tools like Jupyter Notebooks and SSH, letting you iterate faster than deploying repeatedly to Serverless. Learn more in [Pod-first development](/serverless/development/dual-mode-worker).
+
-### --rp\_api\_concurrency
+## Local testing
-Sets the number of concurrent workers for the FastAPI server.
+The Runpod SDK provides a comprehensive local testing environment:
-**Default**: 1
+- **Basic testing**: Run your handler with inline JSON or test files.
+- **Local API server**: Simulate HTTP requests to your Serverless endpoint.
+- **Concurrency testing**: Test how your handler performs under parallel execution.
+- **Debug mode**: Enable detailed logging and troubleshooting output.
-**Usage**:
+Learn more in [Local testing](/serverless/development/local-testing).
-```sh
-python your_function.py --rp_serve_api --rp_api_concurrency 4
-```
+## Error handling
-
+Implement robust error handling to ensure your workers remain stable and return useful error messages.
-When using `--rp_api_concurrency` with a value greater than 1, ensure your main file is named `main.py` for proper FastAPI integration.
+Learn more in [Error handling](/serverless/development/error-handling).
-
+## SDK utilities
-### --rp\_api\_host
+The Runpod SDK includes helper functions to make your handlers more robust:
-Sets the hostname for the FastAPI server.
+- **Input validation**: Validate request data against a schema.
+- **Cleanup utilities**: Automatically remove temporary files after processing.
-**Default**: "localhost"
+Learn more in [Validate inputs](/serverless/development/validation) and [Clean up files](/serverless/development/cleanup).
-**Usage**:
+## Benchmarking and optimization
-```sh
-python your_function.py --rp_serve_api --rp_api_host 0.0.0.0
-```
+Optimize your workers for performance and cost:
-### --rp\_log\_level
+- **Benchmark response times**: Measure cold start and execution time.
+- **Optimize your workers**: Reduce startup and execution times.
-Controls the verbosity of console output.
+Learn more in the [Benchmarking](/serverless/development/benchmarking) and [Optimization](/serverless/development/optimization) guides.
-**Options**: `ERROR` | `WARN` | `INFO` | `DEBUG`
+## Pod-first development
-**Usage**:
+For faster iteration and debugging of GPU-intensive applications, develop on a Pod first, then deploy the same Docker image to Serverless. This workflow provides:
-```sh
-python your_function.py --rp_serve_api --rp_log_level DEBUG
-```
-
-### --rp\_debugger
+- Interactive development with Jupyter Notebooks.
+- Direct SSH access to the GPU environment.
+- Faster iteration compared to deploying repeatedly to Serverless.
-Enables the Runpod debugger for troubleshooting. The `--rp_debugger` flag is particularly useful when you need to step through your code for troubleshooting.
+Learn more in [Pod-first development](/serverless/development/dual-mode-worker).
-**Usage**:
-
-```sh
-python your_function.py --rp_serve_api --rp_debugger
-```
+## Debugging and observability
-### --test\_input
+Runpod provides several tools for debugging and monitoring:
-Provides test input data for your function, formatted as JSON.
+- **Logs**: View real-time and historical logs from your workers.
+- **Metrics**: Monitor execution time, delay time, and resource usage.
+- **SSH access**: Connect directly to running workers for live debugging.
-**Usage**:
-
-```sh
-python your_function.py --rp_serve_api \
- --test_input '{"input": {"key": "value"}}'
-```
-
-The `--test_input` flag is an alternative to using a `test_input.json` file. If both are present, the command-line input takes precedence.
-
-## Combined flags
-
-You can combine multiple flags to customize your local testing environment.
-
-For example:
-
-```sh
-python main.py --rp_serve_api \
- --rp_api_port 8080 \
- --rp_api_concurrency 4 \
- --rp_log_level DEBUG \
- --test_input '{"input": {"key": "value"}}'
-```
+Learn more in [Logs and monitoring](/serverless/development/logs) and [Connect to workers with SSH](/serverless/development/ssh-into-workers).
-This command starts the local server on port `8080` with 4 concurrent workers, sets the log level to `DEBUG`, and provides test input data.
+## Environment variables
-These flags provide powerful tools for customizing your local testing environment. By using them effectively, you can simulate various scenarios, debug issues, and ensure your Serverless functions are robust and ready for deployment to the Runpod cloud.
+Use environment variables to configure your workers without hardcoding credentials or settings in your code. Environment variables are set in the Runpod console and are available to your handler at runtime.
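+
+For example, a handler can read its configuration with `os.environ`; the `MODEL_NAME` and `HF_TOKEN` variable names here are hypothetical:
+
+```python
+import os
+
+import runpod
+
+# Values come from the endpoint's environment variables, not from code.
+MODEL_NAME = os.environ.get("MODEL_NAME", "default-model")
+HF_TOKEN = os.environ.get("HF_TOKEN")  # e.g. a private model or registry token
+
+
+def handler(event):
+    return {"model": MODEL_NAME, "token_configured": HF_TOKEN is not None}
+
+
+runpod.serverless.start({"handler": handler})
+```
+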
-For more detailed information on each flag and advanced usage scenarios, refer to the individual tutorials in this documentation.
+Learn more in [Environment variables](/serverless/development/environment-variables).
diff --git a/serverless/development/ssh-into-workers.mdx b/serverless/development/ssh-into-workers.mdx
index 3ec7e303..719bc56f 100644
--- a/serverless/development/ssh-into-workers.mdx
+++ b/serverless/development/ssh-into-workers.mdx
@@ -1,14 +1,14 @@
---
-title: "SSH into running workers"
+title: "Connect to workers with SSH"
sidebarTitle: "SSH into workers"
-description: "Connect to your Serverless workers via SSH for debugging and troubleshooting."
+description: "SSH into running workers for debugging and troubleshooting."
---
-SSH into running workers to debug endpoints in development and production. By connecting to a worker, you can inspect logs, file systems, and environment variables in real-time.
+You can connect directly to running workers via SSH for debugging and troubleshooting. By connecting to a worker, you can inspect logs, file systems, and environment variables in real-time.
## Generate an SSH key and add it to your Runpod account
-Before you can SSH into a worker, you'll need to generate an SSH key and add it to your Runpod account.
+Before you can connect to a worker, you'll need to generate an SSH key and add it to your Runpod account.
@@ -60,7 +60,7 @@ Before you can SSH into a worker, you'll need to generate an SSH key and add it
4. Under **Worker configuration**, set **Active workers** to 1 or more.
5. Click **Save** to apply the changes.
- This ensures at least one worker remains running at all times, eliminating cold start delays and allowing you to SSH in.
+ This ensures at least one worker remains running at all times, allowing you to SSH in without your worker being automatically scaled down.
diff --git a/serverless/development/test-response-times.mdx b/serverless/development/test-response-times.mdx
deleted file mode 100644
index d4baf490..00000000
--- a/serverless/development/test-response-times.mdx
+++ /dev/null
@@ -1,45 +0,0 @@
----
-title: "Test response time"
----
-
-When setting up an API, you have several options available at different price points and resource allocations. You can select a single option if you would prefer to only use one price point, or select a preference order between the pools that will allocate your requests accordingly.
-
-
-
-
-
-The option that will be most cost effective for you will be based on your use case and your tolerance for task run time. Each situation will be different, so when deciding which API to use, it's worth it to do some testing to not only find out how long your tasks will take to run, but how much you might expect to pay for each task.
-
-To find out how long a task will take to run, select a single pool type as shown in the image above. Then, you can send a request to the API through your preferred method. If you're unfamiliar with how to do so or don't have your own method, then you can use a free option like [reqbin.com](https://reqbin.com/) to send an API request to the Runpod severs.
-
-The URLs to use in the API will be shown in the My APIs screen:
-
-
-
-
-
-On reqbin.com, enter the Run URL of your API, select POST under the dropdown, and enter your API key that was given when you created the key under [Settings](https://www.console.runpod.io/serverless/user/settings)(if you do not have it saved, you will need to return to Settings and create a new key). Under Content, you will also need to give it a basic command (in this example, we've used a Stable Diffusion prompt).
-
-
-
-
-
-
-
-
-
-Send the request, and it will give you an ID for the request and notify you that it is processing. You can then swap the URL in the request field with the Status address and add the ID to the end of it, and click Send.
-
-
-
-
-
-It will return a Delay Time and an Execution Time, denoted in milliseconds. The Delay Time should be extremely minimal, unless the API process was spun up from a cold start, then a sizable delay is expected for the first request sent. The Execution Time is how long the GPU took to actually process the request once it was received. It may be a good idea to send a number of tests so you can get a min, max, and average run time -- five tests should be an adequate sample size.
-
-
-
-
-
-You can then switch the GPU pool above to a different pool and repeat the process.
-
-What will ultimately be right for your use case will be determined by how long you can afford to let the process run. For heavier jobs, a task on a slower GPU will be likely be more cost-effective with a tradeoff of speed. For simpler tasks, there may also be diminishing returns on how fast the task that can be run that may not be significantly improved by selecting higher-end GPUs. Experiment to find the best balance for your scenario.
diff --git a/serverless/development/validation.mdx b/serverless/development/validation.mdx
new file mode 100644
index 00000000..f419d31c
--- /dev/null
+++ b/serverless/development/validation.mdx
@@ -0,0 +1,103 @@
+---
+title: "Validate inputs"
+sidebarTitle: "Validate inputs"
+description: "Validate handler inputs using the Runpod SDK schema validator."
+---
+
+The Runpod SDK includes a built-in validation utility that ensures your handler receives data in the correct format before processing begins. Validating inputs early helps catch errors immediately and prevents your worker from crashing due to unexpected or malformed data types.
+
+## Import the validator
+
+To use the validation features, import the `validate` function from the utils module:
+
+```python
+from runpod.serverless.utils.rp_validator import validate
+```
+
+## Define a schema
+
+You define your validation rules using a dictionary where each key represents an expected input field. The schema specifies each field's expected type, whether it's required, its default value, and any constraints on the incoming data.
+
+```python
+schema = {
+ "text": {
+ "type": str,
+ "required": True,
+ },
+ "max_length": {
+ "type": int,
+ "required": False,
+ "default": 100,
+ "constraints": lambda x: x > 0,
+ },
+}
+```
+
+The schema supports several configuration keys:
+- `type` (required): Expected input type (e.g., `str`, `int`, `float`, `bool`).
+- `required` (default: `False`): Whether the field is required.
+- `default` (default: `None`): Default value if input is not provided.
+- `constraints` (optional): A lambda function that returns `True` or `False` to validate the value.
+
+## Validate input in your handler
+
+When implementing validation in your handler, pass the input object and your schema to the `validate` function. The function returns a dictionary containing either an `errors` key or a `validated_input` key.
+
+```python
+import runpod
+from runpod.serverless.utils.rp_validator import validate
+
+schema = {
+ "text": {
+ "type": str,
+ "required": True,
+ },
+ "max_length": {
+ "type": int,
+ "required": False,
+ "default": 100,
+ "constraints": lambda x: x > 0,
+ },
+}
+
+
+def handler(event):
+ try:
+ # Validate the input against the schema
+ validated_input = validate(event["input"], schema)
+
+ # Check for validation errors
+ if "errors" in validated_input:
+ return {"error": validated_input["errors"]}
+
+ # Access the sanitized inputs
+ text = validated_input["validated_input"]["text"]
+ max_length = validated_input["validated_input"]["max_length"]
+
+ result = text[:max_length]
+ return {"output": result}
+ except Exception as e:
+ return {"error": str(e)}
+
+
+runpod.serverless.start({"handler": handler})
+```
+
+## Test the validator
+
+You can test your validation logic locally without deploying. Save your handler code and run it via the command line with the `--test_input` flag.
+
+```sh
+python your_handler.py --test_input '{"input": {"text": "Hello, world!", "max_length": 5}}'
+```
+
+Alternatively, you can define your test case in a JSON file saved alongside your handler to simulate a real request.
+
+```json test_input.json
+{
+ "input": {
+ "text": "The quick brown fox jumps over the lazy dog",
+ "max_length": 50
+ }
+}
+```
\ No newline at end of file
diff --git a/serverless/development/validator.mdx b/serverless/development/validator.mdx
deleted file mode 100644
index b7d26eb0..00000000
--- a/serverless/development/validator.mdx
+++ /dev/null
@@ -1,99 +0,0 @@
----
-title: "Input validation"
----
-
-Runpod's validator utility ensures robust execution of serverless workers by validating input data against a defined schema.
-
-To use it, import the following to your Python file:
-
-```py
-from runpod.serverless.utils.rp_validator import validate
-```
-
-The `validate` function takes two arguments:
-
-* the input data
-* the schema to validate against
-
-## Schema Definition
-
-Define your schema as a nested dictionary with these possible rules for each input:
-
-* `required` (default: `False`): Marks the type as required.
-* `default` (default: `None`): Default value if input is not provided.
-* `type` (required): Expected input type.
-* `constraints` (optional): for example, a lambda function returning `true` or `false`.
-
-## Example Usage
-
-```python
-import runpod
-from runpod.serverless.utils.rp_validator import validate
-
-schema = {
- "text": {
- "type": str,
- "required": True,
- },
- "max_length": {
- "type": int,
- "required": False,
- "default": 100,
- "constraints": lambda x: x > 0,
- },
-}
-
-
-def handler(event):
- try:
- validated_input = validate(event["input"], schema)
- if "errors" in validated_input:
- return {"error": validated_input["errors"]}
-
- text = validated_input["validated_input"]["text"]
- max_length = validated_input["validated_input"]["max_length"]
-
- result = text[:max_length]
- return {"output": result}
- except Exception as e:
- return {"error": str(e)}
-
-
-runpod.serverless.start({"handler": handler})
-```
-
-## Testing
-
-Save as `your_handler.py` and test using:
-
-
-
-```sh
-python your_handler.py
-```
-
-Or with inline input:
-
-```sh
-python your_handler.py --test_input '{"input": {"text": "Hello, world!", "max_length": 5}}'
-```
-
-
-
-
-Create `test_input.json`:
-
-```json
-{
- "input": {
- "text": "The quick brown fox jumps over the lazy dog",
- "max_length": 50
- }
-}
-```
-
-
-
-
-
-This approach allows early detection of input errors, preventing issues from unexpected or malformed inputs.
diff --git a/serverless/endpoints/endpoint-configurations.mdx b/serverless/endpoints/endpoint-configurations.mdx
index 121851af..963bdf11 100644
--- a/serverless/endpoints/endpoint-configurations.mdx
+++ b/serverless/endpoints/endpoint-configurations.mdx
@@ -1,267 +1,93 @@
---
-title: "Endpoint settings and optimization guide"
+title: "Endpoint settings"
sidebarTitle: "Endpoint settings"
-description: "Configure your endpoints to optimize for performance, cost, and reliability."
+description: "Reference guide for all Serverless endpoint settings and parameters."
---
import GPUTable from '/snippets/serverless-gpu-pricing-table.mdx';
-This guide explains all available settings and best practices for configuring your Serverless endpoints.
+This guide details the configuration options available for Runpod Serverless endpoints. These settings control how your endpoint scales, how it utilizes hardware, and how it manages request lifecycles.
-
-
-
+## General configuration
-## Endpoint name
+### Endpoint name
-The name you assign to your endpoint for easy identification in your dashboard. This name is only visible to you and doesn't affect the endpoint ID used for API calls.
+The name assigned to your endpoint helps you identify it within the Runpod console. This is a local display name and does not impact the endpoint ID used for API requests.
-## Endpoint type
+### Endpoint type
-Choose between two endpoint types based on your workload requirements:
+Select the architecture that best fits your application's traffic pattern:
-**Queue based endpoints** are well-suited for long-running requests, batch processing, or asynchronous tasks. They process requests through a queueing system that guarantees execution and provides built-in retry mechanisms. These endpoints are easy to implement using [handler functions](/serverless/workers/handler-functions), and are ideal for workloads that can be processed asynchronously.
+**Queue based endpoints** utilize a built-in queueing system to manage requests. They are ideal for asynchronous tasks, batch processing, and long-running jobs where immediate synchronous responses are not required. These endpoints provide guaranteed execution and automatic retries for failed requests. Queue based endpoints are implemented using [handler functions](/serverless/workers/handler-functions).
-**Load balancing endpoints** are best for high-throughput or low-latency workloads, or non-standard request/response patterns. They route requests directly to worker HTTP servers, bypassing the queue for faster response times. These endpoints support custom REST API paths and are ideal for real-time applications requiring immediate processing.
+**Load balancing endpoints** route traffic directly to available workers, bypassing the internal queue. They are designed for high-throughput, low-latency applications that require synchronous request/response cycles, such as real-time inference or custom REST APIs. For implementation details, see [Load balancing endpoints](/serverless/load-balancing/overview).
-For detailed information about load balancing endpoints, see [Load balancing endpoints](/serverless/load-balancing/overview).
+### GPU configuration
-## GPU configuration
+This setting determines the hardware tier your workers will utilize. You can select multiple GPU categories to create a prioritized list. Runpod attempts to allocate the first category in your list. If that hardware is unavailable, it automatically falls back to the subsequent options. Selecting multiple GPU types significantly improves endpoint availability during periods of high demand.
-Choose one or more GPU categories (organized by memory) for your endpoint in order of preference. Runpod prioritizes allocating the first category in your list and falls back to subsequent GPUs if your first choice is unavailable.
+
-The following GPU categories are available:
-
-
-
-
-
-Selecting multiple GPU types improves availability, especially for high-demand GPUs.
-
-
-
-## Worker configuration
+## Worker scaling
### Active workers
-Sets the minimum number of workers that remain running at all times. Setting this at one or higher eliminates cold start delays for faster response times. Active workers incur charges immediately, but receive up to 30% discount from regular pricing.
-
-Default: 0
-
-
-
-For workloads with long cold start times, consider using active workers to eliminate startup delays. You can estimate the optimal number by:
-
-1. Measuring your requests per minute during typical usage.
-2. Calculating average request duration in seconds.
-3. Using the formula: Active Workers = (Requests per Minute × Request Duration) / 60
-
-For example, with 6 requests per minute taking 30 seconds each: 6 × 30 / 60 = 3 active workers.
-
-Even a small number of active workers can significantly improve performance for steady traffic patterns while maintaining cost efficiency.
-
-
+This setting defines the minimum number of workers that remain warm and ready to process requests at all times. Setting this to 1 or higher eliminates cold starts for the initial wave of requests. Active workers incur charges even when idle, but they receive a 20-30% discount compared to on-demand workers.
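+
+To estimate how many active workers you need, multiply your requests per minute by your average request duration in seconds and divide by 60. For example, 6 requests per minute at 30 seconds each works out to 6 × 30 / 60 = 3 active workers.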
### Max workers
-The maximum number of concurrent workers your endpoint can scale to.
-
-Default: 3
-
-
-
-Setting max workers to 1 restricts your deployment to a single machine, creating potential bottlenecks if that machine becomes unavailable.
-
-We recommend setting your max worker count approximately 20% higher than your expected maximum concurrency. This headroom allows for smoother scaling during traffic spikes and helps prevent request throttling.
-
-
+This setting controls the maximum number of concurrent instances your endpoint can scale to. This acts as a safety limit for costs and a cap on concurrency. We recommend setting your max worker count approximately 20% higher than your expected maximum concurrency. This buffer allows for smoother scaling during traffic spikes.
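+
+For example, if you expect at most 10 concurrent requests, a max worker count of 12 gives you roughly that 20% headroom.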
### GPUs per worker
-The number of GPUs assigned to each worker instance.
-
-Default: 1
+This defines how many GPUs are assigned to a single worker instance. The default is 1. When choosing between several lower-tier GPUs and fewer high-end GPUs, you should generally prefer high-end GPUs with a lower GPU count per worker: high-end cards offer faster memory and newer architectures, multi-GPU configurations add parallel-processing overhead, and machines with multiple free GPUs are harder to find than machines with a single available GPU.
-
+### Auto-scaling type
-When choosing between multiple lower-tier GPUs or fewer high-end GPUs, you should generally prioritize high-end GPUs with lower GPU count per worker when possible.
+This setting determines the logic used to scale workers up and down.
-- High-end GPUs typically offer faster memory speeds and newer architectures, improving model loading and inference times.
-- Multi-GPU configurations introduce parallel processing overhead that can offset performance gains.
-- Higher GPU-per-worker requirements can reduce availability, as finding machines with multiple free GPUs is more challenging than locating single available GPUs.
+**Queue delay** scaling adds workers based on wait times. If requests sit in the queue for longer than a defined threshold (default 4 seconds), the system provisions new workers. This is best for workloads where slight delays are acceptable in exchange for higher utilization.
-
+**Request count** scaling is more aggressive. It adjusts worker numbers based on the total volume of pending and active work. The formula used is `Math.ceil((requestsInQueue + requestsInProgress) / scalerValue)`. Use a scaler value of 1 for maximum responsiveness, or increase it to scale more conservatively. This strategy is recommended for LLM workloads or applications with frequent, short requests.
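+
+For example, with 10 requests in the queue, 2 in progress, and a scaler value of 4, the endpoint targets `Math.ceil((10 + 2) / 4) = 3` workers.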
-## Timeout settings
+## Lifecycle and timeouts
### Idle timeout
-The amount of time that a worker continues running after completing a request. You're still charged for this time, even if the worker isn't actively processing any requests.
-
-By default, the idle timeout is set to 5 seconds to help avoid frequent start/stop cycles and reduce the likelihood of cold starts. Setting a longer idle timeout can help minimize cold starts for intermittent traffic, but it may also increase your costs.
-
-When configuring idle timeout, start by matching it to your average cold start time to reduce startup delays. For workloads with extended cold starts, consider longer idle timeouts to minimize repeated initialization costs.
-
-
-
-That idle timeout is only effective when using [queue delay scaling](#queue-delay). Be cautious with high timeout values, as workers with constant traffic may never reach the idle state necessary to scale down properly.
-
-
+The idle timeout determines how long a worker remains active after completing a request before shutting down. While a worker is idle, you are billed for the time, but the worker remains "warm," allowing it to process subsequent requests immediately. The default is 5 seconds.
### Execution timeout
-The maximum time a job can run before automatic termination. This prevents runaway jobs from consuming excessive resources. You can turn off this setting, but we highly recommend keeping it on.
-
-Default: 600 seconds (10 minutes)
-Maximum: 24 hours (can be extended using job TTL)
-
-
-
-We strongly recommend enabling execution timeout for all endpoints. Set the timeout value to your typical request duration plus a 10-20% buffer. This safeguard prevents unexpected or faulty requests from running indefinitely and consuming unnecessary resources.
-
-
+The execution timeout acts as a failsafe to prevent runaway jobs from consuming infinite resources. It specifies the maximum duration a single job is allowed to run before being forcibly terminated. We strongly recommend keeping this enabled. The default is 600 seconds (10 minutes), and it can be extended up to 24 hours.
### Job TTL (time-to-live)
-The maximum time a job remains in the queue before automatic termination.
-
-Default: 86,400,000 milliseconds (24 hours)
-Minimum: 10,000 milliseconds (10 seconds)
-
-See [Execution policies](/serverless/endpoints/send-requests#execution-policies) for more information.
+This setting defines how long a job request remains valid in the queue before expiring. If a worker does not pick up the job within this window, the system discards it. The default is 24 hours.
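+
+You can also override the TTL for an individual job by appending a `ttl` parameter when checking its status. For example, `https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}?ttl=6000` tells the system to remove that job's result after 6 seconds.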
-
+## Performance features
-You can use the `/status` operation to configure the time-to-live (TTL) for an individual job by appending a TTL parameter when checking the status of a job. For example, `https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}?ttl=6000` sets the TTL for the job to 6 seconds. Use this when you want to tell the system to remove a job result sooner than the default retention time.
+### FlashBoot
-
+FlashBoot reduces cold start times by retaining worker state for a short period after a worker spins down, allowing the system to revive it much faster than a standard fresh boot. FlashBoot is most effective on endpoints with consistent traffic, where workers frequently cycle between active and idle states, and it comes at no additional cost.
-## FlashBoot
+### Model (optional)
-FlashBoot is Runpod's solution for reducing the average cold-start times on your endpoint. It works by retaining worker resources for some time after they're no longer in use, so they can be rebooted quickly. When your endpoint has consistent traffic, your workers have a higher chance of benefiting from FlashBoot for faster spin-ups. However, if your endpoint isn't receiving frequent requests, FlashBoot has fewer opportunities to optimize performance. There is no additional cost associated with FlashBoot.
-
-
-
-The effectiveness of FlashBoot increases exponentially with higher request volumes and worker counts, making it ideal for busy production endpoints. For endpoints with fewer than 3 workers, FlashBoot's overhead may exceed its benefits.
-
-
-
-## Model (optional)
-
-You can select from a list of [cached models](/serverless/endpoints/model-caching) using the **Model (optional)** field. Selecting a model signals the system to place your workers on host machines that contain the selected model, resulting in faster cold starts and significant cost savings.
+The Model field allows you to select from a list of [cached models](/serverless/endpoints/model-caching). When selected, Runpod schedules your workers on host machines that already have these large model files pre-loaded. This significantly reduces the network time required to download models during initialization.
## Advanced settings
-When configuring advanced settings, remember that each constraint (data center, storage, CUDA version, GPU type) may limit resource availability. For maximum availability and reliability, select all data centers and CUDA versions, and avoid network volumes unless your workload specifically requires them.
-
### Data centers
-Control which data centers can deploy and cache your workers. Allowing multiple data centers improves availability, while using a network volume restricts your endpoint to a single data center.
-
-Default: All data centers
-
-
-
-For the highest availability, allow all data centers (i.e., keep the default setting in place) and avoid using network volumes unless necessary.
-
-
+You can restrict your endpoint to specific geographical regions. For maximum reliability and availability, we recommend allowing all data centers. Restricting this list decreases the pool of available GPUs your endpoint can draw from.
### Network volumes
-Attach persistent storage to your workers. [Network volumes](/storage/network-volumes) have higher latency than local storage, and restrict workers to the data center containing your volume. However, they can be very useful for sharing large models or data between workers on an endpoint.
-
-### Auto-scaling type
-
-#### Queue delay
-
-Adds workers based on request wait times.
-
-The queue delay scaling strategy adjusts worker numbers based on request wait times. Workers are added if requests spend more than X seconds in the queue, where X is a threshold you define. By default, this threshold is set at 4 seconds.
-
-#### Request count
-
-The request count scaling strategy adjusts worker numbers according to total requests in the queue and in progress. It automatically adds workers as the number of requests increases, ensuring tasks are handled efficiently.
-
-Total workers formula: `Math.ceil((requestsInQueue + requestsInProgress) / 4)`
-
-
-
-**Optimizing your auto-scaling strategy:**
-
-- For maximum responsiveness, use "request count" with a scaler value of 1 to provision workers immediately for each incoming request.
-- LLM workloads with frequent, short requests typically perform better with "request count" scaling.
-- For gradual scaling, increase the request count scaler value to provision workers more conservatively.
-- Use queue delay when you want workers to remain available briefly after request completion to handle follow-up requests.
-- With long cold start times, favor conservative scaling to minimize the performance and cost impacts of frequent worker initialization.
-
-
-
-### Expose HTTP/TCP ports
-
-Enables direct communication with your worker via its public IP and port. This can be useful for real-time applications requiring minimal latency, such as [WebSocket applications](https://github.com/runpod-workers/worker-websocket).
-
-### Enabled GPU types
-
-Here you can specify which [GPU types](/references/gpu-types) to use within your selected GPU size categories. By default, all GPU types are enabled.
+[Network volumes](/storage/network-volumes) provide persistent storage that survives worker restarts. While they enable data sharing between workers, they introduce network latency and restrict your endpoint to the specific data center where the volume resides. Use network volumes only if your workload specifically requires shared persistence or datasets larger than the container limit.
### CUDA version selection
-Specify which CUDA versions can be used with your workload to ensures your code runs on compatible GPU hardware. Runpod will match your workload to GPU instances with the selected CUDA versions.
-
-
+This filter ensures your workers are scheduled on host machines with compatible drivers. While you should select the version your code requires, we recommend also selecting all newer versions. CUDA is generally backward compatible, and selecting a wider range of versions increases the pool of available hardware.
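+
+For example, if your code requires CUDA 12.4, also enable 12.5, 12.6, and any later versions.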
-CUDA versions are generally backward compatible, so we recommend that you check for the version you need and any higher versions. For example, if your code requires CUDA 12.4, you should also try running it on 12.5, 12.6, and so on.
-
-Limiting your endpoint to just one or two CUDA versions can significantly reduce GPU availability. Runpod continuously updates GPU drivers to support the latest CUDA versions, so keeping more CUDA versions selected gives you access to more resources.
-
-
-
-## Reducing worker startup times
-
-There are two primary factors that impact worker start times:
-
-1. **Worker initialization time:** Worker initialization occurs when a Docker image is downloaded to a new worker. This takes place after you create a new endpoint, adjust worker counts, or deploy a new worker image. Requests that arrive during initialization face delays, as a worker must be fully initialized before it can start processing.
-
-2. **Cold start:** A cold start occurs when a worker is revived from an idle state. Cold starts can get very long if your handler code loads large ML models (several gigabytes to hundreds of gigabytes) into GPU memory.
-
-
-
-If your worker's cold start time exceeds the default 7-minute limit (which can occur when loading large models), the system may mark it as unhealthy. To prevent this, you can extend the cold start timeout by setting the `RUNPOD_INIT_TIMEOUT` environment variable. For example, setting `RUNPOD_INIT_TIMEOUT=800` allows up to 800 seconds (13.3 minutes) for revival.
-
-
-
-Use these strategies to reduce worker startup times:
-
-1. **Embed models in Docker images:** Package your ML models directly within your worker container image instead of downloading them in your handler function. This strategy places models on the worker's high-speed local storage (SSD/NVMe), dramatically reducing the time needed to load models into GPU memory. This approach is optimal for production environments, though extremely large models (500GB+) may require network volume storage.
-
-2. **Store large models on network volumes:** For flexibility during development, save large models to a [network volume](/storage/network-volumes) using a Pod or one-time handler, then mount this volume to your Serverless workers. While network volumes offer slower model loading compared to embedding models directly, they can speed up your workflow by enabling rapid iteration and seamless switching between different models and configurations.
-
-3. **Maintain active workers:** Set active worker counts above zero to completely eliminate cold starts. These workers remain ready to process requests instantly and cost up to 30% less when idle compared to standard (flex) workers.
-
-4. **Extend idle timeouts:** Configure longer idle periods to preserve worker availability between requests. This strategy prevents premature worker shutdown during temporary traffic lulls, ensuring no cold starts for subsequent requests.
-
-5. **Optimize scaling parameters:** Fine-tune your auto-scaling configuration for more responsive worker provisioning:
- - Lower queue delay thresholds to 2-3 seconds (default 4).
- - Decrease request count thresholds to 2-3 (default 4).
-
- These refinements create a more agile scaling system that responds swiftly to traffic fluctuations.
-
-6. **Increase maximum worker limits:** Set higher maximum worker capacities to ensure your Docker images are pre-cached across multiple compute nodes and data centers. This proactive approach eliminates image download delays during scaling events, significantly reducing startup times.
-
-## Best practices summary
+### Expose HTTP/TCP ports
-- **Understand optimization tradeoffs** and make conscious tradeoffs between cost, speed, and model size.
-- **Start conservative** with max workers and scale up as needed.
-- **Monitor throttling** and adjust max workers accordingly.
-- **Use active workers** for latency-sensitive applications.
-- **Select multiple GPU types** to improve availability.
-- **Choose appropriate timeouts** based on your workload characteristics.
-- **Consider data locality** when using network volumes.
-- **Avoid setting max workers to 1** to prevent bottlenecks.
-- **Plan for 20% headroom** in max workers to handle load spikes.
-- **Prefer high-end GPUs with lower GPU count** for better performance.
-- **Set execution timeout** to prevent runaway processes.
-- **Match auto-scaling strategy** to your workload patterns.
-- **Embed models in Docker images** when possible for faster loading.
-- **Extend idle timeouts** to prevent frequent cold starts.
-- **Consider disabling FlashBoot** for endpoints with few workers or infrequent traffic.
+Enabling this option exposes the public IP and port of the worker, allowing for direct external communication. This is required for applications that need persistent connections, such as WebSockets.
diff --git a/serverless/overview.mdx b/serverless/overview.mdx
index 087ad5c7..4fd1fe9a 100644
--- a/serverless/overview.mdx
+++ b/serverless/overview.mdx
@@ -180,6 +180,10 @@ When deploying models on Serverless endpoints, follow this order of preference:
3. [Use network volumes](/serverless/storage/network-volumes): You can use network volumes to store models and other files that need to persist between workers. Models loaded from network storage are slower than cached or baked models, so you should only use this option when the preceeding approaches don't fit your needs.
+## Development lifecycle
+
+When developing a Serverless application, you'll typically start by writing a handler function, testing it locally, and then deploying it to production. To learn more about testing, error handling, monitoring, and optimization, see [Serverless development](/serverless/development/overview).
+
## Next steps
Ready to get started with Runpod Serverless?