Frame In-N-Out: Unbounded Controllable Image-to-Video Generation (NeurIPS 2025)

Paper | Website

We propose Frame In-N-Out, a controllable Image-to-Video generation Diffusion Transformer in which objects can enter or exit the scene along user-specified motion trajectories and identity (ID) references. Our method introduces a new dataset curation pipeline, an evaluation protocol, and a motion-controllable, identity-preserving, unbounded-canvas Video Diffusion Transformer to achieve Frame In and Frame Out in the cinematic domain.

🔥 Update | 👀 Visualization | 🔧 Installation | ⚡ Test | 🤗 Model Zoo | 🧩 Dataset Curation | 🔥 Train | 💻 Evaluation

Update 🔥🔥🔥

  • Release the paper
  • Release the paper weights (CogVideoX-5B Stage 1 Motion + Stage 2 Motion with In-N-Out capability)
  • Release the improved model weights (Wan2.2-5B at higher resolution, with more datasets and improved curation)
  • Gradio App demo
  • Release the Evaluation Code and Metrics
  • Release the Training Code with a short sample dataset
  • Release Arbitrary Resolution-trained Wan2.2 weights, denoted as V1.6
  • HF Space Demo
  • Release the Pre-Processing Code

⭐ If you like Frame In-N-Out, please help ⭐⭐star⭐⭐ this repo. Thanks! 🤗

Brief Intro Video 👀


overview.mp4

Wan2.2-5B on Frame In-N-Out:

(Example videos: Visual Canvas | ID Reference | Wan2.2-based Frame In-N-Out v1.5)

Installation 🔧

conda create -n FINO python=3.10
conda activate FINO
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
conda install ffmpeg
pip install -r requirements.txt
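
A quick sanity check (illustrative only; the exact versions printed depend on your environment) confirms that the installed PyTorch build can see the GPU:

# Quick environment sanity check (illustrative only)
import torch
print(torch.__version__)          # expect 2.5.1
print(torch.version.cuda)         # expect 12.4 for the cu124 wheels
print(torch.cuda.is_available())  # should print True on a CUDA-enabled machine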

Fast Inference ⚡⚡⚡

The interactive Gradio demo can be launched with:

  python app.py

The online Gradio demo is available here. With enable_model_cpu_offload, generation uses about 21 GB of GPU memory on average (peak: 26 GB), at a slight cost in speed. The app creates a folder starting with tmp_app_example_ in your local directory, where you can find all conditions (including the segmented ID reference and the padded generation result on the full canvas).

NOTE: This will automatically download the pretrained weights to the HF cache and use our released V1.6 Wan2.2-5B weights by default, which handle arbitrary resolutions better than the V1.5 version.

NOTE: We recommend enabling the public URL option (Running on public URL), which is more stable than the local URL.
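
For reference, enable_model_cpu_offload is the standard diffusers mechanism that keeps only the active sub-module on the GPU, which is how the app stays within the memory budget above. A minimal sketch follows; the pipeline class and weight path are placeholders, not the exact objects used in app.py:

# Minimal sketch of CPU offloading in diffusers; the model path is a placeholder, not the actual FrameINO repo id
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("path/to/FrameINO-weights", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trades some speed for a much lower peak GPU memory footprint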

Model Zoo 🤗

We provide two sets of weights for Frame In-N-Out. The first is what we used in our paper, based on CogVideoX-I2V-5B. This summer, Wan2.2-TI2V-5B was released, which is also suitable for full fine-tuning on an A100, so we made some modifications to its architecture and training process and introduce that version here as well. Training code is released.

For the v1.5 version, we re-curated the dataset: we improved the scene-cut selection mechanism (keeping more cases instead of directly filtering them out), replaced CUT3R with SpatialTrackerV2 (a state-of-the-art 3D camera estimation model), and added a stronger motion-filtering mechanism, clear-start detection, and other small changes. We also dropped the WebVid dataset from training (it carries a watermark) and introduced part of OpenS2V.

Model | Description | Huggingface
CogVideoX-I2V-5B (Stage 1 - Motion Control) | Paper Weight v1.0 | Download
CogVideoX-I2V-5B (Stage 2 - Motion + In-N-Out Control) | Paper Weight v1.0 | Download
Wan2.2-TI2V-5B (Stage 1 - Motion Control) | New Weight v1.5 on 704P | Download
Wan2.2-TI2V-5B (Stage 2 - Motion + In-N-Out Control) | New Weight v1.5 on 704P | Download
Wan2.2-TI2V-5B (Stage 2 - Motion + In-N-Out Control) | New Weight v1.6 on Arbitrary Resolution | Download
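
If you prefer to fetch a checkpoint manually rather than relying on the automatic download to the HF cache, huggingface_hub can do it; the repo id below is a placeholder, substitute the one linked from the table:

# Manually fetch a checkpoint (the repo id is a placeholder; use the one from the table above)
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="uva-cv-lab/<weight-repo-from-the-table>", local_dir="weights/FrameINO")
print("Checkpoint downloaded to", local_dir)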

Dataset Curation 🧩

The preprocessing code is located in the preprocess/ subdirectory.

For a quick, small mini-dataset (the demo training dataset), you can download it with:

  # We recommend setting --local-dir to FrameINO_data, which is the default directory assumed in most files
  hf download uva-cv-lab/FrameINO_data --repo-type dataset --local-dir FrameINO_data

This dataset includes 300 training videos and the corresponding CSV label files (text prompt, motion trajectory, filtering criteria) for data loading, as well as 20 videos for validation during training. The evaluation datasets for both the Frame In and Frame Out benchmarks can also be found there.

NOTE: Please log in to Hugging Face and agree to the gated access terms on our dataset page.
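
As an illustration of how the CSV label files could be inspected before training (the file name and columns below are hypothetical; check the downloaded FrameINO_data folder for the actual schema):

# Illustrative inspection of a label CSV; the file name below is hypothetical
import pandas as pd

labels = pd.read_csv("FrameINO_data/some_label_file.csv")  # replace with an actual csv from the download
print(labels.columns.tolist())   # inspect the real schema: text prompt, motion trajectory, filtering criteria, ...
print(len(labels), "labeled clips")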

Train 🔥

We provide a short sample training dataset (~300 videos) for illustration purposes only; the full dataset needs to be prepared by yourself.

The training dataloader differs slightly from what is stated in the paper. We also modified and trained a Wan2.2-5B version: we found Wan2.2-5B to be a promising model and spent considerable time after the submission optimizing the training and, more importantly, the curation stage. We prefer the version presented below, and future releases will be based on it.

Stage 1 Motion Training

For Wan2.2-5B:

# 1 GPU
python train_code/train_wan_motion.py

# 4 GPUs (our experiment setting). Change XXXXX to your port (e.g., 32214)
accelerate launch --config_file config/accelerate_config_4GPU.json --main_process_port XXXXX train_code/train_wan_motion.py

For CogVideoX:

# 1 GPU
python train_code/train_cogvideox_motion.py

# 4 GPUs (our experiment setting). Change XXXXX to your port (e.g., 32214)
accelerate launch --config_file config/accelerate_config_4GPU.json --main_process_port XXXXX train_code/train_cogvideox_motion.py

Use --use_8BitAdam True to enable 8-bit Adam (depending on your hardware support)
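
For context, accelerate launch spawns one process per GPU according to the config file, and the Accelerator object inside the training script handles device placement and gradient synchronization. Below is a stripped-down, self-contained sketch of that pattern; it is not the actual train_wan_motion.py, just the general accelerate recipe:

# Stripped-down sketch of the accelerate training pattern (not the actual training script)
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # picks up num_processes / mixed precision from the launch config
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=8)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward() under distributed / mixed-precision training
    optimizer.step()
    optimizer.zero_grad()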

Stage 2 Frame In-N-Out Training (Motion + Unbounded Canvas + ID Reference)

For Wan2.2-5B:

# 1 GPU
python train_code/train_wan_motion_FrameINO.py    

# 4 GPUs (our experiment setting). Change XXXXX to your port (e.g., 32214)
accelerate launch --config_file config/accelerate_config_4GPU.json --main_process_port XXXXX train_code/train_wan_motion_FrameINO.py

For CogVideoX:

# 1 GPU
python train_code/train_cogvideox_motion_FrameINO.py    

# 4 GPUs (our experiment setting). Change XXXXX to your port (e.g., 32214)
accelerate launch --config_file config/accelerate_config_4GPU.json --main_process_port XXXXX train_code/train_cogvideox_motion_FrameINO.py

Use --use_8BitAdam True to enable 8-bit Adam (depending on your hardware support)

Evaluation 💻

The evaluation dataloader is slightly different from the training dataloader above; in this stage it follows our paper setting (together with the v1.0 paper weights). The evaluation dataset can be downloaded following the instructions in Dataset Curation.

For Frame In:

  python test_code/run_cogvideox_FrameIn_mass_evaluation.py

Please check the Frequently Changed Setting section inside the code to confirm your settings are aligned (e.g., the pretrained model path and the evaluation dataset).

For Frame Out:

  python test_code/run_cogvideox_FrameOut_mass_evaluation.py

Again, check the Frequently Changed Setting section inside the code to confirm your settings are aligned (e.g., the pretrained model path and the evaluation dataset).

For the evaluation metrics, we provide our modified versions of Trajectory Error, Video Segmentation Mean Absolute Error, Relative DINO matching, and a VLM-judged In-N-Out success rate. Check evaluation/mass_evalution.py and modify the settings there (e.g., the number of frames, paths, and the Frame In/Out metrics) as needed.
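
As a rough illustration of what a trajectory-error style metric measures, the sketch below computes the mean L2 distance between predicted and ground-truth point tracks; this is a generic version, not the modified metric implemented in evaluation/mass_evalution.py:

# Generic trajectory-error sketch (mean L2 distance between tracks); NOT the modified metric from this repo
import numpy as np

def traj_error(pred_tracks, gt_tracks):
    # pred_tracks, gt_tracks: arrays of shape (num_points, num_frames, 2) in pixel coordinates
    return float(np.linalg.norm(pred_tracks - gt_tracks, axis=-1).mean())

rng = np.random.default_rng(0)
gt = rng.uniform(0, 704, size=(4, 49, 2))   # 4 tracked points over 49 frames
print(traj_error(gt + 2.0, gt))             # a constant (+2, +2) px offset gives sqrt(8) ≈ 2.83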

Disclaimer

This project is released for academic use only. We disclaim responsibility for the distribution of the model weights and sample data. Users are solely liable for their actions. The project contributors are not legally affiliated with, nor accountable for, users' behavior.

📚 Citation

@article{wang2025frame,
  title={Frame In-N-Out: Unbounded Controllable Image-to-Video Generation},
  author={Wang, Boyang and Chen, Xuweiyi and Gadelha, Matheus and Cheng, Zezhou},
  journal={arXiv preprint arXiv:2505.21491},
  year={2025}
}

🤗 Acknowledgment

The current version of Frame In-N-Out is built on diffusers. We thank the authors for sharing their awesome codebase.
