diff --git a/README.md b/README.md index f5add47..ec67a0c 100644 --- a/README.md +++ b/README.md @@ -3,6 +3,7 @@ Finetune Controller is a robust and flexible system designed to manage and streamline the fine-tuning of machine learning models on Kubernetes, particularly within OpenShift clusters. This project leverages modern tools and workflows, enabling efficient development and deployment processes for AI-driven applications. ### Features + - Local Development: Get started quickly with a streamlined setup process using uv, a high-performance Python package and project manager. - OpenShift Integration: Simplify deployment and scaling with OpenShift-specific configurations and GPU support for intensive workloads. - MongoDB Backend: Seamlessly connect to a local or cluster-based MongoDB database. @@ -11,12 +12,15 @@ Finetune Controller is a robust and flexible system designed to manage and strea ## Getting Started If the cluster is already set up continue else follow the cluster setup instructions [here](#setup-openshift-cluster) + ### Prereqs -1. Recommend using [uv](https://github.com/astral-sh/uv), *an extremely fast Python package and project manager* - ```shell - pip install uv - ``` +1. Recommend using [uv](https://github.com/astral-sh/uv), _an extremely fast Python package and project manager_ + + ```shell + pip install uv + ``` + 2. A container engine such as Docker or Podman - ### Install + 1. Create virtual environment and install dependencies - ```shell - uv sync - ``` -2. Start a local developement mongo database *(or connect to one on cluster with port-forward)* + ```shell + uv sync + ``` - Local - ```shell - docker run -d --rm --name mongodb \ - -e MONGODB_INITDB_ROOT_USERNAME="default-user" \ - -e MONGODB_INITDB_ROOT_PASSWORD="admin123456789" \ - -e MONGODB_INITDB_DATABASE="finetune" \ - -p 27017:27017 \ - mongodb/mongodb-community-server:latest - ``` +2. Start a local developement mongo database _(or connect to one on cluster with port-forward)_ - you can port-forward this connection to your local machine - ```shell - oc port-forward service/mongodb-community-server 27017:27017 -n - ``` + Local + + ```shell + docker run -d --rm --name mongodb \ + -e MONGODB_INITDB_ROOT_USERNAME="default-user" \ + -e MONGODB_INITDB_ROOT_PASSWORD="admin123456789" \ + -e MONGODB_INITDB_DATABASE="finetune" \ + -p 27017:27017 \ + mongodb/mongodb-community-server:latest + ``` + + you can port-forward this connection to your local machine + + ```shell + oc port-forward service/mongodb-community-server 27017:27017 -n + ``` 3. Connect to the Openshift cluster with the cli login command `oc login`. If cluster not already set up follow [these](#setup-openshift-cluster) steps 4. Create a project level `.env` file (see `.env.example`) and update the variables. - ```shell - cp .env.example .env - ``` + + ```shell + cp .env.example .env + ``` 5. Make sure the virtual environment is activated and start the local finetuning controller application. - ```shell - source .venv/bin/activate - uvicorn app.main:app --reload - ``` + ```shell + source .venv/bin/activate + + uvicorn app.main:app --reload + ``` This will: + - Start MongoDB with the required configuration - Build and start the FastAPI server - Make the application available at http://localhost:8000 ### Development and Contributing + Setup [pre-commit](https://pre-commit.com/#install) to keep linting and code styling up to standard. + ```shell uv sync pre-commit install @@ -94,57 +106,69 @@ pre-commit install ## Setup OpenShift Cluster Resources ### Create default project + Name can be descriptive for these examples we will use `finetune-controller` + ```shell oc new-project finetune-controller ``` ### Create Kubeflow project + ```shell oc new-project kubeflow ``` ### Install Kubeflow training operator + + ```shell kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1" ``` ### Install Kueue + > Requires Kubernetes 1.29 or newer Follow the latest [docs](https://kueue.sigs.k8s.io/docs/installation/) Install a released version + ```shell kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.1/manifests.yaml ``` To wait for Kueue to be fully available, run: + ```shell kubectl wait deploy/kueue-controller-manager -nkueue-system --for=condition=available --timeout=5m ``` Restart pods + ```shell kubectl delete pods -lcontrol-plane=controller-manager -nkueue-system ``` First update the namepspace for the crd LocalQueue object in [default-user-queue.yaml](/crds/kueue/default-user-queue.yaml). default namepsace: "default" + ```shell yq e '.metadata.namespace = "finetune-controller"' -i crds/kueue/default-user-queue.yaml ``` Apply the default CRD config for Kueue or update by following their docs + ```shell kubectl apply -f crds/kueue/ ``` ### Install mongodb server -Example configuration. *do properly configure for production* +Example configuration. _do properly configure for production_ + ```shell oc new-app -e MONGODB_INITDB_ROOT_USERNAME="default-user" -e MONGODB_INITDB_ROOT_PASSWORD="admin123456789" -e MONGODB_INITDB_DATABASE="finetune" mongodb/mongodb-community-server:latest --namespace finetune-controller ``` @@ -165,9 +189,11 @@ oc set env deployment/ml-pipeline-ui DISABLE_GKE_METADATA=true ``` --> ### Add GPU nodes to ROSA cluster + Go to your cluster on redhat console [admin dashboard](https://console.redhat.com/openshift/cluster-list). Add a machine pool of your choosing with the following configuration: Taints + ``` key: nvidia.com/gpu value: @@ -175,6 +201,7 @@ effect: NoSchedule ``` Node Labels + ``` Key: cluster-api/accelerator Value: @@ -183,6 +210,7 @@ Value: ### Setup AWS Secret Example aws config + ```yaml # aws_credentials.yaml apiVersion: v1 @@ -198,6 +226,7 @@ type: Opaque ``` Example for base 64 command in terminal + ```bash echo -n "VALUE" | base64 ``` @@ -205,6 +234,7 @@ echo -n "VALUE" | base64 ### Setup Pull secrets Example docker pull secret config + ```yaml # pull_secret.yaml apiVersion: v1 @@ -214,10 +244,10 @@ kind: Secret metadata: name: cr-pull-secret type: kubernetes.io/dockerconfigjson - ``` Apply these secrets + ```shell oc apply -f aws-credentials.yaml -n finetune-controller ``` @@ -225,35 +255,46 @@ oc apply -f aws-credentials.yaml -n finetune-controller ## Install Finetune Controller On OpenShift 1. Create a `.env.production` file and update the defaults. For this example set `MONGODB_URL=mongodb://mongodb-community-server.finetune-controller.svc.cluster.local:27017` - ```shell - cp .env.example .env.production - ``` + + ```shell + cp .env.example .env.production + ``` 2. create the application - ```shell - oc new-app --strategy=docker --binary --name finetune-controller --env-file=".env.production" --namespace finetune-controller - ``` + + ```shell + oc new-app --strategy=docker --binary --name finetune-controller --env-file=".env.production" --namespace finetune-controller + ``` 3. expose services and patch tls config - ```shell - oc expose deployment/finetune-controller --port=8000 - oc expose svc/finetune-controller --port=8000 - oc patch route finetune-controller --type=merge -p '{"spec":{"tls":{"termination":"edge"}}}' - ``` + + ```shell + oc expose deployment/finetune-controller --port=8000 + oc expose svc/finetune-controller --port=8000 + oc patch route finetune-controller --type=merge -p '{"spec":{"tls":{"termination":"edge"}}}' + ``` 4. add cluster role binding permissions to the application 5. start a build - ```shell - oc start-build finetune-controller --from-dir=. --namespace=finetune-controller - ``` + ```shell + oc start-build finetune-controller --from-dir=. --namespace=finetune-controller + ``` ## Manually Publish Updates To Finetune Controller + Publish From current project + ```shell ./scripts/publish.sh ``` + Publish From git ~HEAD + ```shell ./scripts/publish_git.sh ``` + +# Adding Finetuning Models + +look at [setting up a finetune model](/docs/setup_models.md) diff --git a/docs/setup_models.md b/docs/setup_models.md new file mode 100644 index 0000000..1215cc1 --- /dev/null +++ b/docs/setup_models.md @@ -0,0 +1,457 @@ +# Setting Up Finetuning Models + +This guide explains how to add and configure custom finetuning models in the finetune-controller project. + +## Table of Contents + +1. [Model Architecture Overview](#model-architecture-overview) +2. [Creating a Custom Model](#creating-a-custom-model) +3. [Model Registration](#model-registration) +4. [Directory Structure](#directory-structure) +5. [Configuration Options](#configuration-options) +6. [Best Practices](#best-practices) +7. [Example Walkthrough](#example-walkthrough) + +## Model Architecture Overview + +The finetune-controller uses a modular architecture where each finetuning model is defined as a Python class that inherits from `BaseFineTuneModel`. The system automatically discovers and registers models from the `app/models/custom/` directory. + +### Key Components + +- **BaseFineTuneModel**: Base class that all finetuning models must inherit from +- **TrainingArguments**: Configuration class for model-specific training parameters +- **Model Registration**: Automatic discovery and registration system +- **Dynamic Loading**: Runtime loading of custom models + +## Creating a Custom Model + +To create a custom finetuning model, follow these steps: + +### Step 1: Create Training Configuration Class + +First, define a configuration class that inherits from `TrainingArguments`: + +```python +from app.models.base.finetuning import TrainingArguments, Field + +class MyModelConfig(TrainingArguments): + """Configuration parameters for your custom model""" + + # Define model-specific parameters with defaults + batch_size: int = Field( + default=32, + description="Size of each batch during training" + ) + learning_rate: float = Field( + default=0.001, + description="Learning rate for the optimizer" + ) + epochs: int = Field( + default=10, + description="Number of training epochs" + ) + # Add more parameters as needed +``` + +### Step 2: Create Model Class + +Define your model class inheriting from `BaseFineTuneModel`: + +```python +from app.models.base.finetuning import ( + BaseFineTuneModel, + TrainingFramework, + TrainingTask, + Field, + TrainingResources, + TrainingDataset, +) + +class MyCustomModel(BaseFineTuneModel): + """Custom finetuning model specification""" + + # Required fields + name: str = "MyCustomModel" # Display name in frontend + inference_name: str | None = "MyCustomModel" # Must match inference service name + description: str = "Description of your custom model" + project_url: str = "https://github.com/your-org/your-model" + image: str = "your-registry/your-model:latest" # Container image + command: list[str] = [ + "/bin/bash", + "-c", + "python train.py", # Your training script + ] + + # Model metadata + framework: TrainingFramework = TrainingFramework.PYTORCH # or TENSORFLOW + task: TrainingTask = TrainingTask.CLASSIFICATION # or other task types + + # Dataset configuration + dataset_info: TrainingDataset = TrainingDataset( + description="Description of expected dataset format", + dataset_required=True # Set to False if no dataset needed + ) + + # Resource requirements + resources: TrainingResources = TrainingResources( + requests={"cpu": 2, "memory": "4Gi"}, + limits={"cpu": 4, "memory": "8Gi"} + ) + + # GPU configuration + accelerator_count: int = Field( + default=1, + ge=0, + description="Number of GPU devices per worker" + ) + + # Model promotion path (S3 storage path) + promotion_path: str = Field( + default="domain/algorithm_name/application", + description="S3 path format: domain/algorithm_name/algorithm_application" + ) + + # Training configuration + training_arguments: MyModelConfig = MyModelConfig() + + def run_cmd(self) -> list[str]: + """Convert model properties to command arguments""" + cmd = self.command.copy() + args = [] + + # Add training arguments + args.append(f"--batch-size={self.training_arguments.batch_size}") + args.append(f"--learning-rate={self.training_arguments.learning_rate}") + args.append(f"--epochs={self.training_arguments.epochs}") + + # Required mount paths (always include these) + args.append(f"--dataset_path={self.dataset_mount}") + args.append(f"--checkpoint_path={self.checkpoint_mount}") + + # Combine command with arguments + cmd[-1] += " " + " ".join(args) + return cmd +``` + +### Step 3: Place in Custom Directory + +Save your model file in the `app/models/custom/` directory: + +``` +app/models/custom/my_custom_model.py +``` + +## Model Registration + +The system automatically discovers and registers models through the following process: + +### Automatic Discovery + +1. **Loading Process**: The `load_model_modules()` function in `app/jobs/registered_models.py` scans the `app/models/custom/` directory +2. **Dynamic Import**: Uses `load_models_from_directory()` from `app/models/model_loader.py` to import model classes +3. **Registration**: Adds discovered models to the `JOB_MANIFESTS` dictionary using the model's `name` field as the key + +### Registration Code Flow + +```python +# In registered_models.py +def load_model_modules(): + # Load custom models from directory + custom_models_dir = Path(__file__).parent.parent / "models" / "custom" + custom_models = load_models_from_directory(str(custom_models_dir)) + + # Register each model + for model_name, model_class in custom_models.items(): + name = model_class.model_fields.get("name").get_default() + if name not in JOB_MANIFESTS: + JOB_MANIFESTS[name] = model_class +``` + +### Manual Registration (Alternative) + +You can also manually register models by adding them to the `JOB_MANIFESTS` dictionary: + +```python +from app.models.custom.my_custom_model import MyCustomModel + +JOB_MANIFESTS = { + "MyCustomModel": MyCustomModel, + # ... other models +} +``` + +## Directory Structure + +``` +app/ +├── models/ +│ ├── base/ +│ │ └── finetuning.py # Base classes +│ ├── examples/ +│ │ └── mnist.py # Example model +│ └── custom/ # Your custom models go here +│ ├── __init__.py +│ ├── .gitkeep +│ └── your_model.py # Your custom model files +├── jobs/ +│ └── registered_models.py # Model registration +└── ... +``` + +## Configuration Options + +### Required Fields + +| Field | Type | Description | +| ----------- | ------------------- | --------------------------------------------------------------------------- | +| `name` | `str` | Display name in frontend (must be unique, min 4 chars, alphanumeric + .\_@) | +| `image` | `str` | Container image with your training code | +| `command` | `list[str]` | Command to execute training | +| `framework` | `TrainingFramework` | ML framework (PYTORCH, TENSORFLOW) | +| `task` | `TrainingTask` | Task type (CLASSIFICATION, REGRESSION, MULTITASK_CLASSIFICATION) | + +### Optional Fields + +| Field | Type | Default | Description | +| ---------------------- | ------------------- | ------------------------------------------------- | ------------------------------------------------------------------------------------ | +| `description` | `str` | `""` | Model description | +| `project_url` | `str` | `""` | Project repository URL | +| `inference_name` | `str \| None` | `None` | Name for inference service integration. Allows for many to one inference service | +| `image_pull_secret` | `str \| None` | `None` | Secret name for pulling private container images | +| `checkpoint_mount` | `str` | `"/data/artifacts"` | Mount point for storing results. Best not to change this. | +| `dataset_mount` | `str` | `"/data/dataset"` | Mount point for storing dataset. Best not to change this. | +| `dataset_info` | `TrainingDataset` | `TrainingDataset()` | Dataset configuration (see TrainingDataset fields below) | +| `device_types` | `list[str]` | `["cpu"]` | Node type to run on based on taint toleration. Default 'cpu' (normal) worker node | +| `resources` | `TrainingResources` | `{"requests": {"cpu": 2, "memory": "1Gi"}}` | CPU/Memory requirements | +| `accelerator_count` | `int` | `1` | Number of GPU devices per worker (minimum 1) | +| `cluster_nodes` | `int` | `1` | Total number of workers for training (minimum 1) | +| `store_asset_patterns` | `list[str]` | `["*.json", "*.yaml", "*.csv", "*.pt", "*.ckpt"]` | Pattern match a list of files to store | +| `promotion_path` | `str` | `""` | S3 prefix to upload artifacts. Format: `domain/algorithm_name/algorithm_application` | + +### TrainingDataset Configuration + +The `TrainingDataset` class defines dataset requirements and metadata: + +| Field | Type | Default | Description | +| ------------------ | ------ | ------- | ---------------------------------------------- | +| `description` | `str` | `""` | Description of the expected dataset format | +| `dataset_required` | `bool` | `False` | Whether a dataset is required for training | +| `dataset_name` | `str` | `""` | Name of dataset file from API (set at runtime) | + +Example: + +```python +dataset_info: TrainingDataset = TrainingDataset( + description="Expects CSV files with columns: image_path, label", + dataset_required=True +) +``` + +### TrainingResources Configuration + +The `TrainingResources` class defines CPU and memory requirements: + +| Field | Type | Required | Description | +| ---------- | ----------------------- | -------- | ---------------------------------- | +| `requests` | `dict[str, str \| int]` | Yes | Minimum resource requirements | +| `limits` | `dict[str, str \| int]` | No | Maximum resource limits (optional) | + +Example: + +```python +resources: TrainingResources = TrainingResources( + requests={"cpu": 4, "memory": "2Gi"}, + limits={"cpu": 8, "memory": "4Gi"} +) +``` + +Common resource specifications: + +- **CPU**: Integer or string (e.g., `2`, `"2"`, `"2000m"` for 2 cores) +- **Memory**: String with unit (e.g., `"1Gi"`, `"2048Mi"`, `"2G"`) + +### Training Arguments + +Define custom training parameters by creating a class that inherits from `TrainingArguments`: + +```python +class CustomConfig(TrainingArguments): + # Each parameter must have a default value and type annotation + param_name: type = Field( + default=default_value, + description="Parameter description", + # Optional Field constraints: + ge=0, # Greater than or equal + le=100, # Less than or equal + # ... other pydantic Field options + ) +``` + +## Best Practices + +### 1. Model Naming + +- Use descriptive, unique names +- Follow PascalCase for class names +- Use consistent naming between `name` and `inference_name` + +### 2. Container Images + +- Include all dependencies in your container image +- Use specific version tags, avoid `:latest` in production +- Test your container image independently before integration + +### 3. Resource Management + +- Set appropriate resource requests and limits +- Consider GPU requirements realistically +- Test resource usage with representative workloads + +### 4. Parameter Configuration + +- Provide sensible defaults for all parameters +- Include helpful descriptions +- Use appropriate Field constraints + +### 5. Command Generation + +- Always include dataset and checkpoint mount paths +- Handle boolean flags appropriately +- Ensure argument parsing matches your training script + +### 6. Error Handling + +- Test model loading with the `__main__` block pattern +- Validate configurations before deployment +- Include logging for debugging + +## Example Walkthrough + +Let's examine the MNIST example to understand the complete implementation: + +### MNIST Configuration Class + +```python +class MNISTConfig(TrainingArguments): + """Model Params for MNIST Finetune Job""" + + batch_size: int = Field(default=64, description="Size of each batch during training") + test_batch_size: int = Field(default=1000, description="Size of each batch during testing") + epochs: int = Field(default=1, description="Number of epochs for training") + lr: float = Field(default=1.0, description="Learning rate for the optimizer") + gamma: float = Field(default=0.7, description="Learning rate step gamma for scheduler") + no_cuda: bool = Field(default=False, description="Disable CUDA (use CPU instead of GPU)") + seed: int = Field(default=1, description="Random seed for reproducibility") + log_interval: int = Field(default=10, description="How many batches to wait before logging training status") + save_model: bool = Field(default=False, description="Whether to save the trained model") +``` + +### MNIST Model Class + +```python +class MNIST(BaseFineTuneModel): + """Finetune Job Spec for MNIST""" + + name: str = "MNIST" + inference_name: str | None = "MNIST" + description: str = "Example MNIST model for fine-tuning" + project_url: str = "https://github.com/acceleratedscience/model-foobar" + image: str = "quay.io/brian_duenas/mnist:latest" + command: list[str] = ["/bin/bash", "-c", "python mnist_training_script.py"] + + framework: TrainingFramework = TrainingFramework.PYTORCH + task: TrainingTask = TrainingTask.CLASSIFICATION + + dataset_info: TrainingDataset = TrainingDataset( + description="MNIST model does not expect a dataset", + dataset_required=False + ) + + resources: TrainingResources = TrainingResources( + requests={"cpu": 4, "memory": "1Gi"}, + limits={"cpu": 8, "memory": "2Gi"} + ) + + accelerator_count: int = Field(default=0, ge=1, description="Number of gpu devices to use for training per worker") + + promotion_path: str = Field( + default="molecules/mnist/mnist_test", + description="s3 path to upload artifacts. Based on Inference path `domain/algorithm_name/algorithm_application`" + ) + + training_arguments: MNISTConfig = MNISTConfig() +``` + +### Command Generation + +```python +def run_cmd(self) -> list[str]: + """Converts model properties to command arguments""" + cmd = self.command.copy() + args = [] + + # Handle boolean flags + if self.training_arguments.no_cuda: + args.append("--no-cuda") + if self.training_arguments.save_model: + args.append("--save-model") + + # Add value parameters + args.append(f"--batch-size={self.training_arguments.batch_size}") + args.append(f"--test-batch-size={self.training_arguments.test_batch_size}") + args.append(f"--epochs={self.training_arguments.epochs}") + args.append(f"--lr={self.training_arguments.lr}") + args.append(f"--seed={self.training_arguments.seed}") + args.append(f"--log-interval={self.training_arguments.log_interval}") + args.append(f"--gamma={self.training_arguments.gamma}") + + # Required mount paths + args.append(f"--dataset_path={self.dataset_mount}") + args.append(f"--checkpoint_path={self.checkpoint_mount}") + + # Combine with base command + cmd[-1] += " " + " ".join(args) + return cmd +``` + +### Testing Your Model + +Include a test block to validate your model definition: + +```python +if __name__ == "__main__": + # Test that the model definition correctly loads + from pprint import pprint + + model = MNIST(training_arguments={"epochs": 2}, description="test model load") + pprint(model.model_dump()) +``` + +## Troubleshooting + +### Common Issues + +1. **Model Not Appearing**: Check that your model file is in `app/models/custom/` and properly inherits from `BaseFineTuneModel` + +2. **Import Errors**: Ensure all required imports are available and your model class is properly defined + +3. **Name Conflicts**: Each model must have a unique `name` field value + +4. **Container Issues**: Verify your container image exists and contains the necessary training code + +5. **Resource Problems**: Check that your resource requests are reasonable for your cluster + +### Debugging + +Enable debug logging to see model registration details: + +```python +import logging +logging.getLogger("app.jobs.registered_models").setLevel(logging.DEBUG) +``` + +Run the model test block to validate your configuration: + +```bash +python -m app.models.custom.your_model +```