¹University of Technology Sydney, ²Zhejiang University
Demo video: VideoCoF-720p_compressed.mp4
- 2026.01.02: We released the full VideoCoF-50k training dataset! Check it out at Hugging Face Datasets.
- 2025.12.13: We released a 4-step fast inference script (~20-30s per video) and launched the Hugging Face demo! Please try it at Hugging Face Spaces.
- 2025.12.09: Paper available on arXiv.
- 2025.12.08: Released the inference code and the VideoCoF-50k weights.
- 2025.12.06: Project Page and README updated!
- Quick Start
- Model Zoo
- Results
- TODO
- Acknowledgments
- License
- Contact
- Citation
Online Demo: Try out our enhanced VideoCoF demo on Hugging Face Spaces here!
- Clone the repository:

```bash
git clone https://github.com/videocof/VideoCoF.git
cd VideoCoF
```
- Install dependencies:

```bash
# 1. Create and activate a conda environment
conda create -n videocof python=3.10
conda activate videocof

# 2. Install PyTorch (choose a version compatible with your CUDA)
# For standard GPUs (CUDA 12.1):
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# For Hopper GPUs (e.g., H100/H800) requiring fast inference:
# pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 3. Install other dependencies
pip install -r requirements.txt
```
Note on Flash Attention: We recommend using FlashAttention-3 (currently beta) for optimal performance, especially on NVIDIA H100/H800 GPUs. If you are using these GPUs, please follow the official FlashAttention-3 installation guide after installing the compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).
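For reference, a typical source install of FlashAttention-3 looks like the sketch below; the beta currently lives in the `hopper` directory of the official repository, so please verify against the official guide before running, as the steps may change.

```bash
# Sketch of a FlashAttention-3 (beta) source install for Hopper GPUs (H100/H800).
# Steps mirror the official repository layout at the time of writing; check
# https://github.com/Dao-AILab/flash-attention for the current instructions.
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install
```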
- Download Models:

Wan-2.1-T2V-14B Pretrained Weights:

```bash
git lfs install
git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
# Or using huggingface-cli:
# hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
```

VideoCoF Checkpoint & Acceleration LoRA:

```bash
git lfs install
git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight
# Or using huggingface-cli:
# hf download XiangpengYang/VideoCoF --local-dir videocof_weight

# Download Acceleration LoRA (FusionX)
wget -P videocof_weight https://huggingface.co/MonsterMMORPG/Wan_GGUF/resolve/main/Wan2.1_Text_to_Video_14B_FusionX_LoRA.safetensors
```
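After the downloads, both model directories should sit in the repository root (using the directory names from the commands above); a quick sanity check:

```bash
# Quick sanity check that both model directories exist in the repo root.
# Names match the clone/download commands above; adjust if you used other paths.
ls Wan2.1-T2V-14B      # Wan-2.1 T2V 14B pretrained weights
ls videocof_weight     # VideoCoF checkpoint + FusionX acceleration LoRA
```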
- Inference:

We provide fast 4-step inference (default, ~30s/video on H100) using the acceleration LoRA.
For single inference tasks:
```bash
# Object Removal
sh scripts/obj_rem.sh

# Object Addition
sh scripts/obj_add.sh

# Object Swap
sh scripts/obj_swap.sh

# Local Style Transfer
sh scripts/local_style.sh
```
For parallel inference:
```bash
sh scripts/parallel_infer.sh
```
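To restrict parallel inference to a subset of GPUs, the standard CUDA_VISIBLE_DEVICES environment variable can be set in front of the script; this assumes scripts/parallel_infer.sh distributes work across the visible devices.

```bash
# Optional: limit parallel inference to specific GPUs via the standard CUDA env var.
# (Assumes scripts/parallel_infer.sh dispatches jobs across the visible devices.)
CUDA_VISIBLE_DEVICES=0,1 sh scripts/parallel_infer.sh
```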
- Gradio Demo:

Launch the Gradio interface for interactive testing:

```bash
# Ensure Wan2.1-T2V-14B (model_name), videocof_weight and the DMD LoRA are in the
# current directory or properly referenced
python examples/app.py
```

The demo supports fast inference (~30s per video) online.
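To expose the demo on your LAN or change the port, the standard Gradio environment variables can be set at launch; this assumes examples/app.py does not already pin server_name/server_port in its launch() call.

```bash
# Optional: bind the demo to all interfaces / a custom port via standard Gradio env vars.
# (Assumes examples/app.py does not hard-code server_name/server_port in launch().)
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python examples/app.py
```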
Our models are available on Hugging Face:
| Model Name | Description | Link |
|---|---|---|
| VideoCoF-Base | Base model trained on 50k video pairs | Hugging Face |
Current video editing methods typically follow two paths:
- Expert models: Rely on external masks for precision but sacrifice unification across tasks.
- Unified in-context learning models: Mask-free but often struggle with spatial accuracy due to the lack of explicit cues.
VideoCoF bridges this gap by predicting reasoning tokens before generating the target video tokens.
- Seeing, Reasoning, Editing: VideoCoF adopts a "seeing, reasoning, editing" approach, ensuring edits are applied accurately to the intended targets.
- Length Extrapolation: Trained on only 50k video pairs (33 frames each), VideoCoF demonstrates robust multi-shot editing and length generalization (e.g., 4× length extrapolation).
- Diverse Editing Tasks: Supports fine-grained (instance- and part-level, spatially aware) Object Removal, Object Addition, Object Swap, and Local Style Transfer.
Please refer to our Project Page for the full gallery.
- Object Removal: Remove people or objects based on text prompts.
- Object Addition: Add elements like animals, objects, or people.
- Object Swap: Change specific attributes or objects.
- Local Style Transfer: Modify textures, materials, or colors.
- Release paper.
- Release inference code and weights.
- Release Hugging Face demo (~30s per video online), try it at Hugging Face Spaces.
- Release videocof-50k training data.
- Release training code.
We thank the authors of related works and the open-source communities of VideoX-Fun and Wan for their contributions.
This project is licensed under the Apache License 2.0.
For any questions, please feel free to reach out to the author Xiangpeng Yang (@knightyxp) via email: knightyxp@gmail.com or Xiangpeng.Yang@student.uts.edu.au.
If you find this work useful for your research, please consider citing:
@article{yang2025videocof,
title={Unified Video Editing with Temporal Reasoner},
author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
journal={arXiv preprint arXiv:2512.07469},
year={2025}
}