RFC: ComfyUI Training Modules #24

yoland68 · 2025-03-02T00:16:21Z

RFC: ComfyUI Training Modules

Start Date: 2025-03-01
Target Major Version: TBD

Summary

This RFC proposes the addition of training capabilities to ComfyUI, enabling users to create and fine-tune LoRA (Low-Rank Adaptation) models directly through the ComfyUI interface. The proposal includes a set of node implementations for loading image datasets, training LoRAs, visualizing training progress, and saving trained models.

Basic example

The basic workflow would allow users to:

Load an image dataset
Train a LoRA on these images
Save the resulting LoRA
Visualize training loss

Motivation

Currently, users who want to create custom LoRA models need to:

Use external tools and scripts for training, which often require command-line expertise
Set up specialized environments for training
Manually move the trained models between systems

Adding training capabilities directly to ComfyUI would:

Simplify the training workflow: Users can train models in the same interface where they use them
Increase accessibility: Users without programming experience can customize models
Enable rapid iteration: Users can train a model and immediately test it in the same interface
Provide visual feedback: Real-time visualization of the training process
Maintain workflow continuity: The entire model creation, training, and inference pipeline can be represented as a unified workflow

Detailed design

The implementation consists of four main components:

1. Image Dataset Loading

Two nodes are proposed for loading image datasets:

LoadImageSetNode: Loads individual images selected by the user
LoadImageSetFromFolderNode: Loads all images from a specified folder

These nodes offer options for handling images of different sizes (stretch, crop, pad) and prepare the images for training.

import folder_paths  # ComfyUI module that resolves input, output, and model directories


class LoadImageSetFromFolderNode:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "folder": (folder_paths.get_input_subfolders(), {"tooltip": "The folder to load images from."})
            },
            "optional": {
                "resize_method": (
                    ["None", "Stretch", "Crop", "Pad"],
                    {"default": "None"},
                ),
            }
        }

    RETURN_TYPES = ("IMAGE",)
    FUNCTION = "load_images"
    CATEGORY = "loaders"
    EXPERIMENTAL = True
    DESCRIPTION = "Loads a batch of images from a directory for training."
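
The three resize options can be read as different ways of mapping a source size onto a target size. The sketch below is not the node's implementation; it only computes the geometry, assuming that Stretch ignores the aspect ratio, Crop scales the image to cover the target and trims the overflow around the center, and Pad scales the image to fit inside the target and centers it. The names `ResizePlan` and `resize_plan` are hypothetical and do not appear in the RFC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResizePlan:
    """Hypothetical result type: how to place a scaled image on the target canvas."""
    scaled_size: tuple[int, int]  # (width, height) after scaling
    offset: tuple[int, int]       # (x, y) of the scaled image on the canvas


def resize_plan(src: tuple[int, int], dst: tuple[int, int], method: str) -> ResizePlan:
    """Compute scaling and placement for one image (assumed semantics, see lead-in)."""
    src_w, src_h = src
    dst_w, dst_h = dst
    if method == "None":
        # Leave the image untouched.
        return ResizePlan((src_w, src_h), (0, 0))
    if method == "Stretch":
        # Fill the target exactly, ignoring aspect ratio.
        return ResizePlan((dst_w, dst_h), (0, 0))
    if method == "Crop":
        # Scale until the image covers the target; overflow is trimmed.
        scale = max(dst_w / src_w, dst_h / src_h)
    elif method == "Pad":
        # Scale until the image fits inside the target; the rest is padding.
        scale = min(dst_w / src_w, dst_h / src_h)
    else:
        raise ValueError(f"unknown resize method: {method!r}")
    w, h = round(src_w * scale), round(src_h * scale)
    # A negative offset means that many pixels are cropped on that side.
    return ResizePlan((w, h), ((dst_w - w) // 2, (dst_h - h) // 2))


print(resize_plan((200, 100), (100, 100), "Crop"))
print(resize_plan((200, 100), (100, 100), "Pad"))
```

Keeping the geometry separate from the pixel operations makes the three modes easy to compare: they differ only in how the scale factor is chosen.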

2. LoRA Training Node

The TrainLoraNode is the core component that handles the training process:

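import folder_paths  # ComfyUI module that resolves model directories
from comfy.comfy_types import IO  # enum of ComfyUI socket type names


class TrainLoraNode: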
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "model": (IO.MODEL, {"tooltip": "The model to train the LoRA on."}),
                "vae": (IO.VAE, {"tooltip": "The VAE model to use for encoding images for training."}),
                "positive": (IO.CONDITIONING, {"tooltip": "The positive conditioning to use for training."}),
                "image": (IO.IMAGE, {"tooltip": "The image or image batch to train the LoRA on."}),
                "batch_size": (IO.INT, {"default": 1, "min": 1, "max": 10000, "step": 1}),
                "steps": (IO.INT, {"default": 50, "min": 1, "max": 1000}),
                "learning_rate": (IO.FLOAT, {"default": 0.0003, "min": 0.0000001, "max": 1.0, "step": 0.00001}),
                "rank": (IO.INT, {"default": 8, "min": 1, "max": 128}),
                "optimizer": (["Adam", "AdamW", "SGD", "RMSprop"], {"default": "Adam"}),
                "loss_function": (["MSE", "L1", "Huber", "SmoothL1"], {"default": "MSE"}),
                "seed": (IO.INT, {"default": 0, "min": 0, "max": 0xFFFFFFFFFFFFFFFF}),
                "training_dtype": (["bf16", "fp32"], {"default": "bf16"}),
                "existing_lora": (folder_paths.get_filename_list("loras") + ["[None]"], {"default": "[None]"}),
            },
        }

    RETURN_TYPES = (IO.MODEL, IO.LORA_MODEL, IO.LOSS_MAP, IO.INT)
    RETURN_NAMES = ("model_with_lora", "lora", "loss", "steps")
    FUNCTION = "train"
    CATEGORY = "training"
    EXPERIMENTAL = True
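
The `loss_function` choices above can be illustrated with plain Python. The sketch below assumes the four names follow the usual PyTorch definitions (mean reduction, `delta` and `beta` defaulting to 1.0); the RFC does not state this. The `LOSS_FUNCTIONS` table and the function names are hypothetical.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1(pred: Vector, target: Vector) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def huber(pred: Vector, target: Vector, delta: float = 1.0) -> float:
    """Quadratic for small errors, linear (slope delta) for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        e = abs(p - t)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(pred)


def smooth_l1(pred: Vector, target: Vector, beta: float = 1.0) -> float:
    """Huber divided by beta; identical to Huber when beta == delta == 1."""
    total = 0.0
    for p, t in zip(pred, target):
        e = abs(p - t)
        total += 0.5 * e * e / beta if e < beta else e - 0.5 * beta
    return total / len(pred)


# Hypothetical lookup table keyed by the option strings of the node.
LOSS_FUNCTIONS: dict[str, Callable[[Vector, Vector], float]] = {
    "MSE": mse,
    "L1": l1,
    "Huber": huber,
    "SmoothL1": smooth_l1,
}

pred, target = [0.0, 2.0], [0.5, 0.0]
for name, fn in LOSS_FUNCTIONS.items():
    print(name, fn(pred, target))
```

With both thresholds at 1.0, Huber and SmoothL1 give the same value, so under these assumed definitions the two options would differ only if their thresholds were set differently.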

The training process:

Takes a batch of images and encodes them using a VAE
Sets up LoRA layers for all eligible weights in the model
Configures an optimizer and loss function based on user selections
Performs gradient-based training for the specified number of steps
Returns the model with LoRA applied, the LoRA weights, a map of training losses, and the total training steps
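
The core of these steps (freeze the base weights, add low-rank factors, update only the factors by gradient descent, record the loss per step) can be sketched on a toy problem. The example below deliberately swaps the diffusion model and VAE for a single linear layer fitted with MSE and plain gradient descent, so it shows the mechanics only. The function `train_lora`, the zero initialization of the "up" factor, and all hyperparameter values are assumptions, not taken from the RFC.

```python
import random

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def axpy(a: Matrix, b: Matrix, scale: float) -> Matrix:
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def train_lora(base: Matrix, inputs: Matrix, targets: Matrix,
               rank: int = 1, steps: int = 500, lr: float = 0.1, seed: int = 0):
    """Fit (base + up @ down) to the data while keeping `base` frozen."""
    rng = random.Random(seed)
    n_out, n_in = len(base), len(base[0])
    # Assumed init: small random "down" factor, zero "up" factor,
    # so training starts from the unmodified base model.
    down = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(rank)]
    up = [[0.0] * rank for _ in range(n_out)]
    x, y = transpose(inputs), transpose(targets)  # one column per sample
    n = n_out * len(inputs)
    losses = []
    for _ in range(steps):
        weight = axpy(base, matmul(up, down), 1.0)   # effective weight
        err = axpy(matmul(weight, x), y, -1.0)       # prediction - target
        losses.append(sum(e * e for row in err for e in row) / n)  # MSE
        # Gradient of the MSE with respect to the low-rank update (up @ down).
        grad_delta = [[2.0 * g / n for g in row] for row in matmul(err, transpose(x))]
        grad_up = matmul(grad_delta, transpose(down))
        grad_down = matmul(transpose(up), grad_delta)
        up = axpy(up, grad_up, -lr)       # only the LoRA factors are updated
        down = axpy(down, grad_down, -lr)
    return up, down, losses


base = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [[1.5, 0.0], [0.5, 1.0], [2.0, 1.0]]  # base plus a rank-1 change
up, down, losses = train_lora(base, inputs, targets)
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

The returned `losses` list plays the role of the loss output described above, and `up` and `down` correspond to the LoRA weights that would be saved.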

3. Model Saving Node

The SaveLoRA node enables users to save their trained LoRA models:

from comfy.comfy_types import IO  # enum of ComfyUI socket type names


class SaveLoRA:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "lora": (IO.LORA_MODEL, {"tooltip": "The LoRA model to save."}),
                "prefix": (IO.STRING, {"default": "trained_lora"}),
            },
            "optional": {
                "steps": (IO.INT, {"forceInput": True}),
            },
        }

    RETURN_TYPES = ()
    FUNCTION = "save"
    CATEGORY = "loaders"
    EXPERIMENTAL = True
    OUTPUT_NODE = True

The node saves the LoRA weights in SafeTensors format, with a filename that includes the number of training steps and a timestamp.
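
One possible way to build such a filename is sketched below. The exact pattern is not specified in the RFC, so the `_steps` suffix, the timestamp format, and the helper name `lora_filename` are all assumptions.

```python
from datetime import datetime
from typing import Optional


def lora_filename(prefix: str, steps: Optional[int] = None,
                  now: Optional[datetime] = None) -> str:
    """Build '<prefix>[_<steps>_steps]_<timestamp>.safetensors' (assumed pattern)."""
    now = now or datetime.now()
    parts = [prefix]
    if steps is not None:
        # `steps` is optional on the node, so it is optional here too.
        parts.append(f"{steps}_steps")
    parts.append(now.strftime("%Y%m%d-%H%M%S"))
    return "_".join(parts) + ".safetensors"


print(lora_filename("trained_lora", 50, datetime(2025, 3, 1, 12, 30, 0)))
```

Passing the timestamp in as a parameter keeps the helper deterministic in tests; the weights themselves could then be written under that name with a SafeTensors writer.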

4. Training Visualization Node

The LossGraphNode visualizes the training progress:

from comfy.comfy_types import IO  # enum of ComfyUI socket type names


class LossGraphNode:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "loss": (IO.LOSS_MAP, {"default": {}}),
                "filename_prefix": (IO.STRING, {"default": "loss_graph"}),
            },
        }

    RETURN_TYPES = ()
    FUNCTION = "plot_loss"
    OUTPUT_NODE = True
    CATEGORY = "training"
    EXPERIMENTAL = True
    DESCRIPTION = "Plots the loss graph and saves it to the output directory."

This node generates a graph showing the training loss over time, providing visual feedback on the training process.
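
The essential step in drawing such a graph is mapping each recorded loss to a pixel position. The sketch below assumes the losses arrive as a plain list with one value per step; how they are stored inside `LOSS_MAP` is not specified in the RFC. The function name `loss_polyline` and the canvas defaults are hypothetical.

```python
def loss_polyline(losses: list[float], width: int = 800, height: int = 480,
                  margin: int = 40) -> list[tuple[float, float]]:
    """Map per-step losses to (x, y) pixel coordinates, with y growing downward."""
    if not losses:
        return []
    lo, hi = min(losses), max(losses)
    span = (hi - lo) or 1.0                 # avoid dividing by zero on a flat curve
    plot_w, plot_h = width - 2 * margin, height - 2 * margin
    last = max(len(losses) - 1, 1)          # avoid dividing by zero for one point
    return [
        # The highest loss lands at the top margin, the lowest at the bottom.
        (margin + plot_w * i / last, margin + plot_h * (1.0 - (v - lo) / span))
        for i, v in enumerate(losses)
    ]


print(loss_polyline([1.0, 0.5, 0.0], width=100, height=100, margin=10))
```

The resulting point list can be handed to any drawing library that accepts a polyline, which keeps the scaling logic independent of the image backend.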

Supporting Components

The implementation also includes several support classes:

TrainSampler: A custom sampler that performs gradient updates during the sampling process
LoraDiff and BiasDiff: Weight wrapper classes that apply LoRA adaptations to model weights
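
The idea behind such weight wrappers can be sketched as follows: the original weight stays untouched, and the wrapper returns the weight plus a low-rank correction when called. The classes below are illustrative stand-ins, not the proposed `LoraDiff` and `BiasDiff`; the `alpha / rank` scaling is the common LoRA convention and is an assumption here, as are all names.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoraDiffSketch:
    """Hypothetical wrapper: weight -> weight + (alpha / rank) * up @ down."""

    def __init__(self, up: Matrix, down: Matrix, alpha: float):
        self.up = up                     # shape: out_features x rank
        self.down = down                 # shape: rank x in_features
        self.scale = alpha / len(down)   # rank is the number of rows of `down`

    def __call__(self, weight: Matrix) -> Matrix:
        delta = matmul(self.up, self.down)
        return [[w + self.scale * d for w, d in zip(wr, dr)]
                for wr, dr in zip(weight, delta)]


class BiasDiffSketch:
    """Hypothetical wrapper: bias -> bias + learned offset."""

    def __init__(self, offset: list[float]):
        self.offset = offset

    def __call__(self, bias: list[float]) -> list[float]:
        return [b + o for b, o in zip(bias, self.offset)]


lora = LoraDiffSketch(up=[[1.0], [2.0]], down=[[0.5, -1.0]], alpha=2.0)
print(lora([[1.0, 0.0], [0.0, 1.0]]))
print(BiasDiffSketch([0.1, -0.1])([1.0, 1.0]))
```

Because the wrapper never modifies the base weight, removing the LoRA is as simple as no longer applying the wrapper.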

Drawbacks

Resource Consumption: Training is computationally intensive and may strain systems with limited resources
UI Responsiveness: Long training processes could make the ComfyUI interface less responsive
Complexity: Adding training capabilities increases the complexity of the ComfyUI codebase
Learning Curve: Users may need to understand more ML concepts to effectively use the training features

Adoption strategy

Experimental Flag: Initially release nodes with the EXPERIMENTAL = True flag to indicate the developing nature of the feature
Documentation: Provide comprehensive documentation and tutorial workflows
Gradual Feature Addition: Start with basic LoRA training and expand to other training types based on user feedback
Default Parameters: Set sensible defaults to help users get started without deep ML knowledge

Unresolved questions

Memory Management: How will the system handle memory during training, especially for large models and datasets?
Checkpoint Frequency: Should the system automatically save checkpoints during training to prevent loss of progress?
Training Interruption: How should the system handle interrupted training sessions?
Hyperparameter Optimization: Should the system provide tools for automatically finding optimal hyperparameters?
Multi-GPU Support: How will training utilize multiple GPUs if available?
Integration with Existing Workflows: How can trained models be seamlessly integrated into existing inference workflows?
Performance Metrics: Should additional metrics beyond loss be tracked and visualized?
Dataset Preparation: Should the system provide more tools for dataset curation and augmentation?

Implementation Plan

Phase 1: Basic LoRA Training

Initial implementation of the nodes described in this RFC.

Phase 2: Enhanced Features

Checkpoint saving during training
More advanced training visualizations
Support for additional training techniques (e.g., DreamBooth, and control-model training such as Control LoRA and IP-Adapter (IPA))

Phase 3: Workflow Integration

Templates for common training scenarios
Integration with model merging and inference workflows
Advanced dataset management tools

Phase 4: Model Format

A new model format to improve memory management and metadata handling for models in ComfyUI

Links

Important: Do NOT comment on this PR. Please use the discussion thread linked above to provide feedback, as it provides branched discussions that are easier to follow. This also makes the edit history of the PR clearer.

Create 0000-trainer.md RFC

2176c57

github-actions bot added the pending (when the RFC is still in comments phase) and rfc labels Mar 2, 2025

github-actions bot locked as resolved and limited conversation to collaborators Mar 2, 2025

Add details 0000-trainer.md

a49d920

yoland68 closed this Mar 2, 2025
