LiveTalk enables real-time multimodal interactive avatar video generation through an improved on-policy distillation approach. By distilling bidirectional diffusion models into causal, few-step autoregressive models, LiveTalk achieves over a 20× speedup, enabling a seamless real-time interactive experience.
- Real-Time Generation: Achieves 24.82 FPS throughput with 0.33s first-frame latency
- Multimodal Conditioning: Supports text, image, and audio inputs for flexible avatar control
- Efficient Inference: Reduces inference time from ~83s to real-time through 4-step diffusion distillation
- Multi-Turn Coherence: Demonstrates competitive performance against models like Veo3 and Sora2 on multi-round interaction benchmarks
- End-to-End System: Provides integration with audio language models for conversational AI applications
We tested this repo on the following setup:
- NVIDIA GPU with at least 24 GB memory (RTX 4090, A800, and H800 are tested)
- Linux operating system
- 64 GB RAM
Clone the LiveTalk repository:
git clone https://github.com/GAIR-NLP/livetalk.git
cd livetalk
Clone the OmniAvatar repository from within the LiveTalk repository:
git clone https://github.com/Omni-Avatar/OmniAvatar
Apply the patches to OmniAvatar:
bash scripts/add_patch.sh
Create a conda environment and install dependencies:
conda create -n livetalk python=3.10 -y
conda activate livetalk
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
python setup.py develop
Download the required model checkpoints:
# Download Wan2.1 base model
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir pretrained_checkpoints/Wan2.1-T2V-1.3B
# Download LiveTalk model checkpoint
huggingface-cli download GAIR/LiveTalk-1.3B-V0.1 --local-dir-use-symlinks False --local-dir pretrained_checkpoints/LiveTalk-1.3B-V0.1
# Download Wav2Vec2 model for audio processing
huggingface-cli download facebook/wav2vec2-base-960h --local-dir-use-symlinks False --local-dir pretrained_checkpoints/wav2vec2
The resulting file structure will be:
livetalk
|-- ...
|-- OmniAvatar
|-- pretrained_checkpoints
| |-- LiveTalk-1.3B-V0.1
| |-- Wan2.1-T2V-1.3B
| |-- wav2vec2
We currently offer a simple script for running video inference; it does not yet support streaming input and output. Inference requires approximately 20 GB of GPU memory.
Execute the inference script with your configuration:
bash ./scripts/inference.sh
Input Requirements:
- Image: Reference image of the person (JPG/PNG format)
- Audio: Speech audio file (WAV format, 16 kHz sample rate recommended; see the resampling sketch below)
- Text Prompt: Description of the desired video characteristics
Output:
- High-quality video at 16 FPS
- Audio synchronized with lip movements
- Duration specified by the video_duration parameter
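Since the Wav2Vec2 encoder used for audio conditioning expects 16 kHz speech, it is worth resampling inputs recorded at other rates before inference. A minimal preprocessing sketch, assuming librosa and soundfile are available in your environment (they are not necessarily pinned in requirements.txt):

import librosa
import soundfile as sf

def resample_to_16k(in_path: str, out_path: str) -> None:
    # librosa resamples on load when sr is given; mono=True downmixes stereo.
    audio, sr = librosa.load(in_path, sr=16000, mono=True)
    sf.write(out_path, audio, sr)  # write a 16 kHz mono WAV

resample_to_16k("speech_input.wav", "speech_16k.wav")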
Our approach addresses challenges in distilling multimodal video diffusion models:
Challenge: Self Forcing suffers from visual artifacts (flickering, black frames, quality degradation) when combined with multimodal conditioning.
Our Solution:
- Curated Multimodal Conditions: High-quality reference images (super-resolution, semantic consistency) and motion-focused text prompts
- Converged ODE Initialization: Extended training (20k steps) for a robust starting point
- Aggressive Optimization: 2× learning rate and CFG=6 to maximize learning within the limited DMD window
These improvements eliminate training instability and deliver high-quality results across perceptual metrics, audio-visual synchronization, and aesthetic quality.
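To make the recipe concrete, the core of DMD-style on-policy distillation is a gradient equal to the difference between a frozen teacher score and an online fake-score critic, evaluated on noised versions of the student's own generations. The sketch below is a conceptual illustration only, not the released training code; the model interfaces (student.generate, teacher_score, fake_score) and the add_noise helper are assumptions, while cfg_scale=6 follows the recipe above:

import torch
import torch.nn.functional as F

def dmd_generator_loss(student, teacher_score, fake_score, cond, cfg_scale=6.0):
    # On-policy: distill on samples drawn from the few-step student itself.
    x = student.generate(cond)
    # Noise the sample at a random diffusion timestep.
    t = torch.randint(0, 1000, (x.shape[0],), device=x.device)
    noise = torch.randn_like(x)
    x_t = add_noise(x, noise, t)  # standard forward-diffusion noising (assumed helper)
    with torch.no_grad():
        # Teacher score with classifier-free guidance (CFG=6 per the recipe above).
        s_real = teacher_score(x_t, t, cond, cfg_scale=cfg_scale)
        # Critic score, trained online to track the student's output distribution.
        s_fake = fake_score(x_t, t, cond)
    # DMD gradient: (s_fake - s_real) points samples toward the teacher distribution;
    # the detached-target MSE makes autograd apply exactly this gradient to x.
    grad = s_fake - s_real
    return 0.5 * F.mse_loss(x, (x - grad).detach())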
Built on the distilled few-step multimodal diffusion model, our system turns the recipe into an end-to-end, real-time talking avatar pipeline:
- Thinker / Talker (Audio LM): A streaming audio language model takes user text or audio as input and produces speech responses in real time
- Performer (Few-Step Diffusion): Our 4-step causal video diffusion model generates video in block-wise AR fashion (3 latent frames per block), conditioned on (1) streaming audio, (2) a reference avatar image, and (3) motion-focused text prompts
- KV Cache & Streaming: The clean KV cache from previous blocks is prefilled to maintain temporal coherence while enabling low-latency, block-by-block video streaming synchronized to the audio
- Long-Form Identity Preservation: We adopt Anchor-Heavy Identity Sinks (AHIS), reserving part of the KV window as fixed “identity anchors” while using a smaller rolling window for context, which stabilizes appearance over minutes-long interactions (see the KV-cache sketch after this list)
- Parallel Denoising & Decoding: Diffusion denoising and VAE decoding run in a pipelined manner so that generation stays ahead of playback, avoiding stalls and achieving real-time rendering (a schematic sketch appears at the end of this section)
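Below is a minimal sketch of how AHIS-style KV management and the block-wise autoregressive loop could fit together. This is our schematic reading of the design, not the repository's implementation; the block counts and the denoise_block, emit, and audio_stream names are all hypothetical:

from collections import deque

class AHISKVCache:
    # Anchor-Heavy Identity Sinks: the earliest blocks' KV entries become
    # permanent identity anchors; later blocks share a small rolling window.
    # Sizes below are illustrative, not the released configuration.
    def __init__(self, num_anchor_blocks=4, rolling_blocks=8):
        self.anchors = []                            # fixed "identity sink" KV entries
        self.rolling = deque(maxlen=rolling_blocks)  # recent-context KV entries
        self.num_anchor_blocks = num_anchor_blocks

    def append(self, block_kv):
        if len(self.anchors) < self.num_anchor_blocks:
            self.anchors.append(block_kv)  # earliest blocks are kept forever
        else:
            self.rolling.append(block_kv)  # oldest context falls out of the window

    def context(self):
        # A new block attends to the anchors first, then the rolling window.
        return self.anchors + list(self.rolling)

# Block-wise autoregressive loop (3 latent frames per block, per the system above);
# denoise_block stands in for the 4-step causal diffusion model.
cache = AHISKVCache()
for audio_chunk in audio_stream:  # hypothetical streaming audio source
    block_kv, latents = denoise_block(audio_chunk, kv_context=cache.context())
    cache.append(block_kv)
    emit(latents)                 # hand off to VAE decoding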
Together with the improved on-policy distillation recipe, this system delivers high-fidelity, lip-synced avatar videos with sub-second first-frame latency, supporting natural multi-turn multimodal interaction.
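Likewise, the parallel denoising and decoding described above can be pictured as a two-stage producer/consumer pipeline in which denoising stays a block ahead of VAE decoding. A schematic sketch with Python threads and a bounded queue (the actual system presumably overlaps GPU work more carefully; run_4step_denoiser, vae_decode, play, and audio_blocks are placeholders):

import queue
import threading

latent_q = queue.Queue(maxsize=2)  # small buffer keeps denoising ~1 block ahead

def denoise_worker(blocks):
    for block in blocks:
        latents = run_4step_denoiser(block)  # hypothetical few-step diffusion call
        latent_q.put(latents)                # blocks if the decoder falls behind
    latent_q.put(None)                       # end-of-stream sentinel

def decode_worker():
    while True:
        latents = latent_q.get()
        if latents is None:
            break
        frames = vae_decode(latents)         # hypothetical VAE decoder call
        play(frames)                         # push frames to the output stream

t = threading.Thread(target=denoise_worker, args=(audio_blocks,))
t.start()
decode_worker()  # decoding overlaps denoising of the next block
t.join()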
This codebase builds upon:
- Self Forcing for on-policy distillation framework
- CausVid for autoregressive video diffusion
- Wan2.1 and OmniAvatar for multimodal video diffusion
This project is released under the Apache 2.0 license.
If you find this codebase useful for your research, please cite our paper:
@article{livetalk2025,
title={LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation},
author={Chern, Ethan and Hu, Zhulin and Tang, Bohao and Su, Jiadi and Chern, Steffi and Deng, Zhijie and Liu, Pengfei},
journal={arXiv preprint},
year={2025}
}
