
Unified Video Editing with Temporal Reasoner

👁️ See → 🧠 Reason → ✏️ Edit

🚀 A Chain-of-Frames video editing method enabling temporal reasoning and 4× video-length extrapolation with just 50k training pairs!

ArXiv | Project Page | Hugging Face Model | Hugging Face Demo | Hugging Face Dataset

Xiangpeng Yang¹, Ji Xie², Yiyuan Yang¹, Yan Huang¹, Min Xu¹, Qiang Wu¹
¹University of Technology Sydney, ²Zhejiang University

💿 Introduction

[Demo video: VideoCoF-720p_compressed.mp4]

🔥 News

  • 2026.01.02: 📚 We released the full VideoCoF-50k training dataset! Check it out at Hugging Face Datasets (see the download sketch after this list).
  • 2025.12.13: 🚀 We released a 4-step fast inference script (~20-30s per video) and launched the Hugging Face demo! Please try it at Hugging Face Spaces.
  • 2025.12.09: Paper available on arXiv.
  • 2025.12.08: Released the inference code and the VideoCoF-50k weights.
  • 2025.12.06: 🔥 Project Page and README updated!
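
A minimal sketch for downloading the dataset with the Hugging Face CLI (the dataset repo id below is a guess for illustration; check the Hugging Face Datasets page for the exact name):

```bash
# NOTE: "XiangpengYang/VideoCoF-50k" is a hypothetical repo id — confirm it on the dataset page
hf download XiangpengYang/VideoCoF-50k --repo-type dataset --local-dir videocof-50k
```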

📑 Table of Contents

  • 🔧 Quick Start
  • 🏆 Model Zoo
  • 🎭 Results
  • 🚧 TODO
  • 🙏 Acknowledgments
  • 📜 License
  • 📮 Contact
  • 📄 Citation

🔧 Quick Start

Online Demo: Try out our enhanced VideoCoF demo on Hugging Face Spaces here!

  1. Clone the repository:

    git clone https://github.com/videocof/VideoCoF.git
    cd VideoCoF
  2. Install dependencies:

    # 1. Create and activate a conda environment
    conda create -n videocof python=3.10
    conda activate videocof
    
    # 2. Install PyTorch (Choose version compatible with your CUDA)
    # For standard GPUs (CUDA 12.1):
    pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
    
    # For Hopper GPUs (e.g., H100/H800) requiring fast inference:
    # pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
    
    # 3. Install other dependencies
    pip install -r requirements.txt
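
    After installation, a quick sanity check (a minimal sketch; nothing here is specific to VideoCoF) confirms that PyTorch sees your GPU and the expected CUDA build:

    ```bash
    # Print the PyTorch version, the CUDA version it was built against, and GPU visibility
    python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
    ```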

    Note on Flash Attention: We recommend FlashAttention-3 (currently in beta) for optimal performance, especially on NVIDIA H100/H800 GPUs. If you are using these GPUs, please follow the official FlashAttention-3 installation guide after installing a compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).
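
    For reference, a minimal sketch of the FlashAttention-3 beta install (the repository URL and `hopper` subdirectory follow the official guide at the time of writing; defer to the linked instructions if they change):

    ```bash
    # FlashAttention-3 (beta) is built from the `hopper` subdirectory of the official repo
    git clone https://github.com/Dao-AILab/flash-attention.git
    cd flash-attention/hopper
    python setup.py install
    ```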

  3. Download Models:

    Wan-2.1-T2V-14B Pretrained Weights:

    ```bash
    git lfs install
    git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
    
    # Or using huggingface-cli:
    # hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
    ```
    

    VideoCoF Checkpoint & Acceleration LoRA:

    ```bash
    git lfs install
    git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight
    
    # Or using huggingface-cli:
    # hf download XiangpengYang/VideoCoF --local-dir videocof_weight
    
    # Download Acceleration LoRA (FusionX)
    wget -P videocof_weight https://huggingface.co/MonsterMMORPG/Wan_GGUF/resolve/main/Wan2.1_Text_to_Video_14B_FusionX_LoRA.safetensors
    ```
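
    If the commands above are run as-is, the working directory should contain the following (directory names are taken directly from the commands; verify against the actual downloads):

    ```bash
    ls Wan2.1-T2V-14B     # Wan-2.1 pretrained weights
    ls videocof_weight    # VideoCoF checkpoint + Wan2.1_Text_to_Video_14B_FusionX_LoRA.safetensors
    ```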
    
  4. Inference:

    🚀 We provide fast 4-step inference (default, ~30s/video on H100) using an acceleration LoRA.

    For single inference tasks:

    # Object Removal
    sh scripts/obj_rem.sh
    
    # Object Addition
    sh scripts/obj_add.sh
    
    # Object Swap
    sh scripts/obj_swap.sh
    
    # Local Style Transfer
    sh scripts/local_style.sh

    For parallel inference:

    sh scripts/parallel_infer.sh
  5. Gradio Demo:

    Launch the Gradio interface for interactive testing:

    # Ensure Wan2.1-T2V-14B (model_name), videocof_weight, and the DMD LoRA are in the current directory or properly referenced
    python examples/app.py

    The demo supports fast inference (~30s per video) online.
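
    To serve the demo on a specific host or port, you can use Gradio's standard environment variables (a sketch assuming examples/app.py keeps Gradio's default launch settings):

    ```bash
    # Serve on all interfaces at port 7860; both variables are standard Gradio overrides
    GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python examples/app.py
    ```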

๐Ÿ† Model Zoo

Our models are available on Hugging Face:

| Model Name | Description | Link |
| --- | --- | --- |
| VideoCoF-Base | Base model trained on 50k video pairs | Hugging Face |

๐Ÿญ Results

Why Do We Need Reasoning Before Editing?

Current video editing methods typically follow two paths:

  1. Expert models: Rely on external masks for precision but sacrifice unification.
  2. Unified in-context learning models: Mask-free but often struggle with spatial accuracy due to the lack of explicit cues.

VideoCoF bridges this gap by predicting reasoning tokens before generating the target video tokens.

Key Capabilities

  1. Seeing, Reasoning, Editing: VideoCoF adopts a "seeing, reasoning, editing" approach, ensuring edits are applied accurately to the intended targets.
  2. Length Extrapolation: Trained on only 50k video pairs (33 frames each), VideoCoF demonstrates robust multi-shot editing and length generalization (e.g., 4× length extrapolation, editing videos roughly four times longer than its training clips).
  3. Diverse Editing Tasks: Supports fine-grained (instance- and part-level, spatially aware) Object Removal, Object Addition, Object Swap, and Local Style Transfer.

Gallery Highlights

Please refer to our Project Page for the full gallery.

  • Object Removal: Remove people or objects based on text prompts.
  • Object Addition: Add elements like animals, objects, or people.
  • Object Swap: Change specific attributes or objects.
  • Local Style Transfer: Modify textures, materials, or colors.

🚧 TODO

  • [x] Release paper.
  • [x] Release inference code and weights.
  • [x] Release Hugging Face demo (~30s per video online); try it at Hugging Face Spaces.
  • [x] Release videocof-50k training data.
  • [ ] Release training code.

🙏 Acknowledgments

We thank the authors of related works and the open-source communities behind VideoX-Fun and Wan for their contributions.

📜 License

This project is licensed under the Apache License 2.0.

📮 Contact

For any questions, please feel free to reach out to the author Xiangpeng Yang (@knightyxp) at knightyxp@gmail.com or Xiangpeng.Yang@student.uts.edu.au.

📄 Citation

If you find this work useful for your research, please consider citing:

@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
โญ **If you find this project helpful, please consider giving it a star!** โญ

โญ๏ธ Star History

[Star History Chart]