We propose Frame In-N-Out, a controllable Image-to-Video Diffusion Transformer in which objects can enter or exit the scene along user-specified motion trajectories, guided by identity (ID) references. Our method introduces a new dataset curation pipeline, an evaluation protocol, and a motion-controllable, identity-preserving, unbounded-canvas Video Diffusion Transformer to achieve Frame In and Frame Out in the cinematic domain.
Update | Visualization | Installation | Test | Model Zoo | Dataset Curation | Train | Evaluation
- Release the paper
- Release the paper weights (CogVideoX-5B Stage 1 Motion + Stage 2 Motion with In-N-Out capability)
- Release the improved model weights (Wan2.2-5B at higher resolution, with more datasets and improved curation)
- Gradio App demo
- Release the Evaluation Code and Metrics
- Release the Training Code with a short sample dataset
- Release Arbitrary Resolution-trained Wan2.2 weights, denoted as V1.6
- HF Space Demo
- Release the Pre-Processing Code
If you like Frame In-N-Out, please help star this repo. Thanks!
Overview video: overview.mp4
Wan2.2-5B on Frame In-N-Out:
| Visual Canvas | ID Reference | Wan2.2-based Frame In-N-Out (v1.5) |
|---|---|---|
conda create -n FINO python=3.10
conda activate FINO
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
conda install ffmpeg
pip install -r requirements.txt
The Gradio interactive demo can be launched with:
python app.py
The online Gradio demo is available here. Running the app takes about 21 GB of GPU memory on average (peak: 26 GB) with enable_model_cpu_offload (at a slight speed cost). A folder whose name starts with tmp_app_example_ will be created in your local folder; it contains all conditions (including the segmented ID and the padded generation result on the full canvas).
NOTE: This will automatically download the pretrained weights to the HF cache and use our released V1.6 Wan2.2-5B weights by default, which handle arbitrary resolutions better than the V1.5 version.
NOTE: We recommend enabling the Running on public URL option, which is more stable than the local URL.
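For reference, here is a minimal sketch of the memory/offload trade-off mentioned above. It uses the stock diffusers CogVideoX image-to-video pipeline as a stand-in; the actual app.py wraps our customized Frame In-N-Out pipeline, and the input path below is hypothetical.

```python
# Minimal sketch of model CPU offload (a stand-in for illustration, not app.py itself):
# only the active sub-module stays on the GPU, trading some speed for lower peak memory.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import load_image, export_to_video

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # the memory-saving switch referenced above

image = load_image("first_frame.png")  # hypothetical input image
frames = pipe(image=image, prompt="a person walks into the frame", num_frames=49).frames[0]
export_to_video(frames, "out.mp4", fps=8)
```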
We provide two families of weights for Frame In-N-Out. The first is the one used in our paper, based on CogVideoX-I2V-5B. More recently, Wan2.2-TI2V-5B was released, which can also be fully fine-tuned on A100 GPUs, so we adapted our architecture and training process to it and introduce those weights here as well. Training code is released.
For the v1.5 version, we re-curated the dataset: we improved the scene-cut selection mechanism (keeping more clips instead of filtering them out directly), replaced CUT3R with SpatialTrackerV2 (a state-of-the-art 3D camera estimation model), added a stronger motion-filtering mechanism and clean-start detection, and made further small changes. We also dropped the WebVid dataset from training (it carries a watermark) and added part of OpenS2V.
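To give a concrete flavor of what a motion-filtering step in such a curation pipeline can look like, here is a generic sketch based on point-track displacement. It is an illustration only; the released code in preprocess/ implements its own, more involved criteria.

```python
# Generic sketch of a motion-filtering curation step (illustrative assumption,
# not the released preprocess/ code): discard clips whose tracked points barely move.
import numpy as np

def keep_clip(tracks: np.ndarray, min_mean_disp: float = 8.0) -> bool:
    """tracks: (num_points, num_frames, 2) array of tracked (x, y) positions.
    Keep the clip only if points move enough on average from first to last frame."""
    disp = np.linalg.norm(tracks[:, -1] - tracks[:, 0], axis=-1)  # per-point displacement in pixels
    return float(disp.mean()) >= min_mean_disp

# Example: a clip with 64 nearly static tracked points gets filtered out.
static_tracks = np.random.randn(64, 49, 2) * 0.5 + 100.0
print(keep_clip(static_tracks))  # -> False
```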
| Model | Description | Huggingface |
|---|---|---|
| CogVideoX-I2V-5B (Stage 1 - Motion Control) | Paper Weight v1.0 | Download |
| CogVideoX-I2V-5B (Stage 2 - Motion + In-N-Out Control) | Paper Weight v1.0 | Download |
| Wan2.2-TI2V-5B (Stage 1 - Motion Control) | New Weight v1.5 on 704P | Download |
| Wan2.2-TI2V-5B (Stage 2 - Motion + In-N-Out Control) | New Weight v1.5 on 704P | Download |
| Wan2.2-TI2V-5B (Stage 2 - Motion + In-N-Out Control) | New Weight v1.6 on Arbitrary Resolution | Download |
The preprocessing code is located in the preprocess/ subdirectory.
For a quick-start mini dataset (demo training dataset), you can download it with:
# Recommend to set --local-dir as FrameINO_data, which is the default fixed dir in most files
hf download uva-cv-lab/FrameINO_data --repo-type dataset --local-dir FrameINO_data
This dataset includes 300 training videos and the corresponding CSV label files (text prompt, motion trajectory, filtering criteria) for data loading, as well as 20 validation videos used during training. The evaluation datasets for both the Frame In and Frame Out benchmarks can also be found there.
NOTE: Please log in to our Hugging Face page and agree to the gated access terms.
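If you prefer to fetch the dataset from Python (e.g., inside a setup script), the Hugging Face Hub client provides an equivalent to the hf download command above. This sketch assumes you have already accepted the gated-access terms and are logged in (via huggingface-cli login or an HF_TOKEN environment variable).

```python
# Python equivalent of the `hf download` command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="uva-cv-lab/FrameINO_data",
    repo_type="dataset",
    local_dir="FrameINO_data",  # default directory expected by most scripts in this repo
)
```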
Note that the short sample training dataset (~300 videos) is provided only for illustration and as an example; the full dataset needs to be prepared by yourself.
The dataloader used in training differs slightly from what is stated in the paper. We also modified and trained a Wan2.2-5B version: after the submission we found Wan2.2-5B to be a promising base model and spent considerable time optimizing the training and, more importantly, the curation stage. We prefer the version presented below, and future releases will be based on it.
For Wan2.2-5B:
# 1 GPU
python train_code/train_wan_motion.py
# 4GPU (Our experiment Setting). Change the XXXXX to your port (like 32214)
accelerate launch --config_file config/accelerate_config_4GPU.json --main_process_port XXXXX train_code/train_wan_motion.py
For CogVideoX:
# 1 GPU
python train_code/train_cogvideox_motion.py
# 4GPU (Our experiment Setting). Change the XXXXX to your port (like 32214)
accelerate launch --config_file config/accelerate_config_4GPU.json --main_process_port XXXXX train_code/train_cogvideox_motion.py
Use --use_8BitAdam True to enable 8-bit Adam (depending on your hardware support).
For Wan2.2-5B:
# 1 GPU
python train_code/train_wan_motion_FrameINO.py
# 4GPU (Our experiment Setting). Change the XXXXX to your port (like 32214)
accelerate launch --config_file config/accelerate_config_4GPU.json --main_process_port XXXXX train_code/train_wan_motion_FrameINO.py
For CogVideoX:
# 1 GPU
python train_code/train_cogvideox_motion_FrameINO.py
# 4GPU (Our Experiment Setting). Change the XXXXX to your port (like 32214)
accelerate launch --config_file config/accelerate_config_4GPU.json --main_process_port XXXXX train_code/train_cogvideox_motion_FrameINO.py
Use --use_8BitAdam True to enable 8-bit Adam (depending on your hardware support); see the sketch below for what such a switch typically does.
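As a hedged illustration of the --use_8BitAdam flag: a common way such a switch is wired (and what we assume here, not a copy of train_code/*.py) is to swap the standard AdamW optimizer for the 8-bit AdamW from bitsandbytes, which reduces optimizer-state memory.

```python
# Illustrative optimizer switch; hyperparameters below are placeholders, not the repo's values.
import torch

def build_optimizer(params, lr=1e-5, use_8bit_adam=False):
    if use_8bit_adam:
        import bitsandbytes as bnb  # requires `pip install bitsandbytes` and CUDA support
        return bnb.optim.AdamW8bit(params, lr=lr, betas=(0.9, 0.95), weight_decay=1e-4)
    return torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.95), weight_decay=1e-4)

# Usage (hypothetical model):
# optimizer = build_optimizer(model.parameters(), lr=1e-5, use_8bit_adam=True)
```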
The evaluation dataloader differs slightly from the training dataloader above; it follows the setting in our paper (and uses the v1.0 paper weights). The evaluation dataset can be downloaded following the instructions in Dataset Curation.
For Frame In:
python test_code/run_cogvideox_FrameIn_mass_evaluation.py
Please check the Frequently Changed Setting section inside the code to double-check that your setup is aligned (e.g., the pretrained model path and the evaluation dataset).
For Frame Out:
python test_code/run_cogvideox_FrameOut_mass_evaluation.py
Please check the Frequently Changed Setting section inside the code to double-check that your setup is aligned (e.g., the pretrained model path and the evaluation dataset).
For the evaluation metrics, we provide our modified versions of Trajectory Error, Mean Absolute Error on video segmentation, relative DINO matching, and a VLM-judged In-N-Out success rate. Check evaluation/mass_evalution.py and adjust the settings there (e.g., number of frames, paths, and which metrics to run for Frame In/Out) as needed.
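For intuition, a generic trajectory-error metric of this kind is the mean Euclidean distance between the tracked trajectory in the generated video and the user-specified target trajectory. The sketch below is a simplified illustration, not the modified metric implemented in evaluation/mass_evalution.py.

```python
# Simplified trajectory-error illustration: mean Euclidean distance between a
# tracked trajectory and the target trajectory (the released metric differs in details).
import numpy as np

def traj_error(pred_traj: np.ndarray, target_traj: np.ndarray) -> float:
    """pred_traj, target_traj: (num_frames, 2) arrays of (x, y) points."""
    assert pred_traj.shape == target_traj.shape
    return float(np.linalg.norm(pred_traj - target_traj, axis=-1).mean())

# Example with dummy 49-frame trajectories:
pred = np.cumsum(np.random.randn(49, 2), axis=0)
target = np.cumsum(np.random.randn(49, 2), axis=0)
print(f"Trajectory error: {traj_error(pred, target):.2f} px")
```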
This project is released for academic use only. We disclaim responsibility for the distribution of the model weights and sample data. Users are solely liable for their actions; the project contributors are not legally affiliated with, nor accountable for, users' behavior.
@article{wang2025frame,
title={Frame In-N-Out: Unbounded Controllable Image-to-Video Generation},
author={Wang, Boyang and Chen, Xuweiyi and Gadelha, Matheus and Cheng, Zezhou},
journal={arXiv preprint arXiv:2505.21491},
year={2025}
}
The current version of Frame In-N-Out is built on diffusers. We appreciate the authors for sharing their awesome codebase.






