Audit: World Models - Lorin Achey and Carson Kohlbrenner #48

Open
cKohl10 wants to merge 40 commits into main from audit/cKohl10-lorinachey-world-models

Conversation

cKohl10 (Collaborator) commented Feb 4, 2026

This technical audit explores the current state of the art in world models: their architecture, robotics use cases, and limitations. It uses GAIA-1, Genie, TesserAct, and Cosmos as case studies.

github-actions bot commented Feb 4, 2026

🚀 Preview Deployed

Your preview is ready for review!

🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/48/textbook/audits/staging/Architecture-Fig2-TesserAct.png/

Review Checklist

  • LaTeX equations render correctly
  • All sections are complete per the template
  • References are formatted properly
  • Figures/diagrams display correctly

Next Steps

  1. Review your rendered content using the preview link above
  2. Tag @crheckman when ready for instructor review
  3. Push updates to auto-refresh the preview

This preview will be removed when the PR is closed.

cKohl10 and others added 20 commits February 3, 2026 23:56
cKohl10 (Collaborator, Author) commented Feb 10, 2026

@crheckman I believe all the final changes have been made and are ready for your review

@crheckman (Collaborator) left a comment

first half of reading period.

Classical simulators such as Isaac Sim \[1\] and MuJoCo \[2\] capture the physical dynamics necessary for training embodied agents; however, the hardcoded dynamics used in such simulators are not practical for large-scale data generation of nuanced physical phenomena and realistic rendering.
World models (also referred to as World Foundation Models or WFMs) offer an alternative, data-driven approach to simulation and future state prediction that can capture more nuanced physical phenomena and render realistic video/image outputs.
World models are trained to capture the underlying spatial and temporal dynamics in images and video to predict future states of the environment.
In this document, we will look at four prevalent world models: GAIA-1 \[3\], Genie \[4\], TesserAct \[5\], and Cosmos \[6\].


## Architecture

Each world model analyzed in this document fundamentally learns to predict the spatio-temporal dynamics of static frames.

Collaborator:

None of them consume video as context?

## Architecture

Each world model analyzed in this document fundamentally learns to predict the spatio-temporal dynamics of static frames.
Each model follows the encoder-decoder formulation where an encoder $\mathcal{E}$ ingests input frames $x$ from time $t=0:T$ and encodes them into latent tokens $z_{0:T}$, a dynamics model $\text{DYN}$ predicts the next latent tokens $z_{T+1:T+K}$, and a decoder $\mathcal{D}$ reconstructs the frames at time $t>T$.
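
To make that shared formulation concrete, here is a minimal sketch of the encoder-dynamics-decoder rollout (module choices, shapes, and the GRU dynamics stand-in are illustrative assumptions, not the architecture of any of the four models):

```python
import torch
import torch.nn as nn

class WorldModelSketch(nn.Module):
    """Toy encoder -> dynamics -> decoder rollout, for illustration only."""

    def __init__(self, latent_dim: int = 256, frame_dim: int = 3 * 64 * 64):
        super().__init__()
        # E: flattened frames x_{0:T} -> latent tokens z_{0:T}
        self.encoder = nn.Linear(frame_dim, latent_dim)
        # DYN: predicts future latents (stand-in for an autoregressive
        # transformer or diffusion model over latent tokens)
        self.dynamics = nn.GRU(latent_dim, latent_dim, batch_first=True)
        # D: latent tokens -> reconstructed frames for t > T
        self.decoder = nn.Linear(latent_dim, frame_dim)

    def forward(self, frames: torch.Tensor, horizon: int = 4) -> torch.Tensor:
        z = self.encoder(frames)             # z_{0:T}, shape (B, T, latent_dim)
        _, h = self.dynamics(z)              # summarize the context window
        z_t, preds = z[:, -1:, :], []
        for _ in range(horizon):             # roll out z_{T+1:T+K}
            z_t, h = self.dynamics(z_t, h)
            preds.append(self.decoder(z_t))  # decode each predicted latent
        return torch.cat(preds, dim=1)       # predicted frames, (B, K, frame_dim)

model = WorldModelSketch()
context = torch.randn(2, 8, 3 * 64 * 64)     # batch of 2 clips, 8 context frames
future = model(context, horizon=4)           # 4 predicted future frames
```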

Collaborator:

Are none of them fully autoregressive? (ingest context, create latent vector, and autoregressively decode subsequent frames)? If not, and assuming such an architecture has been tried by anyone, seems like there may be an explanation (computational savings, stability, ...).


### Features

<table>

Collaborator:

I think there's a way to render tables in Markdown with less ... HTML. Consider revising for reviewability's sake.

Collaborator:

You should be able to hit this repo (not just the mdx file) with your favorite AI code assistant and help out in cleaning some of this up.

<td><strong>Cosmos</strong></td>
<td><strong>14 Billion</strong> (Cosmos-Predict1-14B variant)</td>
<td><strong>~20 Million hours</strong> of raw video ($10^8$ video clips)</td>
<td>10,000 H100 GPUs (for 3 months)</td>

Collaborator:

😱

were they really training for this long

Contributor:

Llama's 405B model took 2 months to train on 16,000 H100s too

</tr>
<tr>
<td><strong>TesserAct</strong></td>
<td><em>Not specified in sources</em></td>

Collaborator:

The model is built on CogVideoX-5B. https://github.com/UMass-Embodied-AGI/TesserAct/blob/main/doc/usage.md, 30GB of weights.


Tokenization is a critical component for world models as it compresses high-dimensional image data into a lower-dimensional latent space that the world model can efficiently reason over.
The naive approach of sectioning images into patches and flattening them into vectors is often insufficient for capturing the complex spatial and temporal relationships in image data efficiently enough for practical use of a world model.
State-of-the-art world models instead use a variety of **discrete** and **continuous** tokenization approaches as follows:
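
Before walking through them, a toy illustration of what the two terms mean (made-up shapes, not any specific model's tokenizer): a **continuous** token is the encoder's real-valued latent vector used directly, while a **discrete** token snaps that latent to the nearest entry of a learned codebook and keeps only the integer index.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))                       # toy RGB frame

# Naive approach: cut the frame into 8x8 patches and flatten each into a vector.
patches = frame.reshape(8, 8, 8, 8, 3).swapaxes(1, 2).reshape(64, -1)  # (64, 192)

# Continuous tokens: a learned encoder maps each patch to a real-valued latent
# (faked here with a random projection) that the dynamics model consumes directly.
projection = rng.normal(size=(192, 16))
continuous_tokens = patches @ projection              # (64, 16) real-valued vectors

# Discrete tokens: each latent is snapped to its nearest entry in a learned
# codebook (VQ-style) and represented only by that entry's integer index.
codebook = rng.normal(size=(512, 16))
distances = np.linalg.norm(continuous_tokens[:, None, :] - codebook[None], axis=-1)
discrete_tokens = distances.argmin(axis=1)            # (64,) integers in [0, 512)
print(continuous_tokens.shape, discrete_tokens[:8])
```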

Collaborator:

Define a "continuous token."

<td><strong>GAIA-1</strong></td>
<td><strong>Multimodal understanding</strong> and disentanglement of static and dynamic driving elements like pedestrians and road layouts.</td>
<td>Potential for <strong>sampling errors</strong> (loops or OOD artifacts) if autoregressive sampling strategies are not carefully tuned.</td>
<td>It uses a <strong>unified representation</strong> for video, text, and actions, but relies on a diffusion decoder to correct temporal inconsistencies in its latent predictions.</td>

Collaborator:

briefly expand on why this might be an issue (i.e. sampling errors)

<td><strong>Cosmos</strong></td>
<td><strong>14 Billion</strong> (Cosmos-Predict1-14B variant)</td>
<td><strong>~20 Million hours</strong> of raw video ($10^8$ video clips)</td>
<td>10,000 H100 GPUs (for 3 months)</td>

Contributor:

Might be useful to mention how much compute is required to actually run these, if that's mentioned anywhere in the paper.

**Information Decay.** The tokenizer compresses 3.5M bits to 7,488 bits (470×).
Sub-pixel depth gradients, high-frequency textures, precise object boundaries, and small/distant objects may fall below tokenization resolution.
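
As a sanity check on those numbers (assuming GAIA-1's reported setup of 288×512 RGB input frames tokenized into 576 discrete tokens from an 8,192-entry vocabulary, i.e. 13 bits per token):

```python
# Back-of-the-envelope check of the ~470x compression figure.
raw_bits = 288 * 512 * 3 * 8    # 8-bit RGB frame: 3,538,944 bits (~3.5M)
token_bits = 576 * 13           # 576 tokens x log2(8192) bits = 7,488 bits
print(raw_bits / token_bits)    # ~472x
```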

**The Semantic-Motor Gap.** GAIA-1 outputs video frames, not control commands.

Contributor:

Was their target in the paper to use this world model for training purposes? Or was it to actually control vehicles in real time? If they are addressing it as a limitation, I'm wondering what their original intentions were..


**If this paper were a technical proposal at Zoox/Tesla, would I sign off?**

**For Production: CONDITIONAL NO**

Contributor:

How would this be used in production? To generate possible future sequences in real time (maybe for MPC)? Or would it be used offline for RL fine-tuning of a policy? The feasibility might be different depending on the use case.

---

# Technical Paper Audits: World Models

Collaborator:

I would appreciate a brief history of world models dating back to the 80s/90s. This paper gives a good introduction: World Models

This dispels the idea that world models are a new thing.

<tr>
<td><strong>Genie</strong></td>
<td><strong>Unsupervised learning</strong> of interactive environments from massive, action-free Internet video corpora.</td>
<td>Limited to <strong>16 frames of memory</strong> and an inference speed of approximately <strong>1 FPS</strong>.</td>

Contributor:

It looks like most of these models have much lower inference speeds compared to a standard simulator. Is there another way to get around this lower speed, like using larger batches in parallel, or is the seemingly better data just worth the speed trade-off?

@crheckman (Collaborator) left a comment

second half of reading period

<tr>
<td><strong>Genie</strong></td>
<td><strong>Unsupervised learning</strong> of interactive environments from massive, action-free Internet video corpora.</td>
<td>Limited to <strong>16 frames of memory</strong> and an inference speed of approximately <strong>1 FPS</strong>.</td>

Collaborator:

16 frames of memory is a disaster. Is this not somewhere they can make use of mRoPE and long-context training? Or is the only bottleneck the real-time inference requirement?

<tr>
<td><strong>Cosmos</strong></td>
<td>Providing a <strong>highly scalable platform</strong> for Physical AI with state-of-the-art reconstruction quality.</td>
<td>Models still struggle with perfect <strong>physics adherence</strong> and object permanence in certain edge cases.</td>

Collaborator:

physics adherence -> no models can actually adhere to physics, so if it's stated here, are the hallucinations/violations obvious?


* **Data and observability limits**: embodied, contact-rich interactions are underrepresented in large-scale datasets, and video-only observations cannot capture hidden state (e.g., forces, friction), limiting physics-faithful rollouts \[11\].
* **Physical consistency failures**: long-horizon generations can violate object permanence and contact dynamics, making some models unreliable as safety-critical simulators \[6\].
* **Weak closed-loop evidence**: GAIA-1 is a driving-focused generator rather than a deployable controller and is not evaluated in closed-loop autonomy \[3\].

Collaborator:

I think you need to mention something about computational efficiency too.

We need to address the computational feasibility of the WFM in the control loop. If we run the WFM and VLA in parallel for predictive control, the inference latency of current generative architectures (diffusion/autoregressive) makes real-time operation impossible.

Furthermore, if we quantize or reduce sampling steps to force real-time performance, we risk washing out the variance in the simulation. This creates a 'mean-seeking' world model that fails to represent the dangerous edge cases our VLA actually needs to plan against. We'll also end up with compounding simulation drift!

They can also serve as a "pre-trained" initialization to address **data scarcity** in real-world robotics.
* **Safe Policy Training:** By pairing a WFM with a reward model, agents can gain proficiency through **reinforcement learning** in a simulated environment that faithfully adheres to physical laws.
* **Planning and Model-Predictive Control (MPC):** Robots can use world models to simulate multiple potential future states based on different action sequences, executing only the path that maximizes the predicted reward (a minimal sketch follows this list).
* **Synthetic Data Generation for Sim2Real:** WFMs can generate massive amounts of synthetic video data, including metadata like **depth or semantic maps**, to bridge the gap between simulation and real-world deployment.
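
A minimal sketch of that MPC loop with random-shooting action sampling (`world_model` and `reward_model` below are hypothetical toy placeholders, not any released WFM interface):

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Toy stand-in for a learned WFM rollout: propagate state under actions."""
    for a in actions:
        state = 0.9 * state + a                  # placeholder dynamics
    return state

def reward_model(state: np.ndarray) -> float:
    """Toy task reward: negative distance to a goal at the origin."""
    return -float(np.linalg.norm(state))

def mpc_plan(state, horizon=10, n_candidates=256, action_dim=2):
    # Sample candidate action sequences, imagine each rollout with the world
    # model, and execute only the first action of the best-scoring sequence.
    candidates = rng.normal(size=(n_candidates, horizon, action_dim))
    scores = [reward_model(world_model(state, seq)) for seq in candidates]
    return candidates[int(np.argmax(scores))][0]

first_action = mpc_plan(np.array([1.0, -2.0]))
print(first_action)
```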

Collaborator:

I think you must also mention something about the practical impossibility of modeling phenomena like non-specular reflection, radiative diffusion, granular media, and other physical phenomena in classical simulators, phenomena that these models can pretty faithfully reconstruct at scale. This means we can "observe" edge case phenomena at a much higher frequency than we would casually encounter them in the world, and build models that understand them using these newly generated datasets.

@krusnim (Contributor) left a comment

Nice audit. I focused on the section on Genie for my comments.


## **3. Data & Scaling**
Genie follows the scaling laws typical of Large Language Models (LLMs).
* **Dataset (Platformers)**: Constructed by filtering 55M clips down to a high-quality "Curated" set of **6.8M clips (30,000 hours)**. Filtering distractor items like menu screens or streamer faces was found to be more beneficial than raw data quantity.

Contributor:

I don't understand why Genie used this platformer data. (Are platformers specifically what Genie is "for?") They boast that this "generalizes beyond gaming to robotic manipulation," but that seems very suspect to me, unless they threw out the platformer data entirely and just used RT-1's dataset for that experiment. In which case, why lead with the platformer data?

Contributor:

Reading the paper I see now that the RT-1 version is a separate model. So the generality they're boasting is of the approach, not of a singular model - might want to make that slightly clearer.

## **5. Critical Synthesis & Sign-Off**

### **5.1 Load-Bearing Assumptions**
* **Semantic Consistency**: The model assumes that a learned latent action (e.g., "Latent Action 1") will consistently map to the same semantic behavior (e.g., "Jump") across vastly different visual textures—an assumption that holds empirically across game genres and robotic scenes.

Contributor:

Yeah, I struggle to understand how the paper gets around this limitation. Action [jump] for a platformer and for a robot are vastly different.

The model treats interactive environment generation as a **next-token prediction task**, where future states are conditioned on inferred latent actions.

### **2.2 Video Tokenization**
The **ST-ViViT tokenizer** (200M parameters) utilizes a **VQ-VAE** \[5\] with ST-transformer blocks in both the encoder and decoder.
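
For reference, the core of any VQ-VAE bottleneck like this one is the nearest-codebook lookup sketched earlier, trained with the straight-through gradient estimator so the encoder can learn through the discrete step. A generic sketch (not the ST-ViViT implementation):

```python
import torch

def vector_quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """Generic VQ bottleneck: snap continuous latents to their nearest codes.

    z_e: (N, D) continuous encoder outputs; codebook: (K, D) learned codes.
    Returns the quantized latents (with straight-through gradients) and the
    discrete token indices the dynamics model is trained on.
    """
    distances = torch.cdist(z_e, codebook)   # (N, K) pairwise distances
    indices = distances.argmin(dim=1)        # discrete tokens
    z_q = codebook[indices]                  # quantized latents
    # Straight-through estimator: the forward pass uses z_q, but gradients
    # flow back to the encoder as if quantization were the identity.
    return z_e + (z_q - z_e).detach(), indices

codebook = torch.randn(1024, 32, requires_grad=True)
z_e = torch.randn(8, 32, requires_grad=True)
z_q, tokens = vector_quantize(z_e, codebook)
```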

Contributor:

Maybe worth clarifying that (from my understanding) they actually use two VQ-VAEs: one for video tokenization and one for action tokenization.


## **3. Data & Scaling**
Genie follows the scaling laws typical of Large Language Models (LLMs).
* **Dataset (Platformers)**: Constructed by filtering 55M clips down to a high-quality "Curated" set of **6.8M clips (30,000 hours)**. Filtering distractor items like menu screens or streamer faces was found to be more beneficial than raw data quantity.

Contributor:

Also, would like to know how they performed dataset filtering, if they mentioned it. Platformer video seems pretty recognizable in comparison to other content so it seems there are some tricks they could use.


### **4.3 The Video-Only Assumption**
The fundamental technical thesis of Genie is that **ground-truth action labels are unnecessary for learning world models**.
By discarding the LAM encoder at inference and allowing a user to index the learned VQ codebook, Genie proves that internet-scale video provides enough causal structure to ground an agent's "understanding" of a world.

Contributor:

Without action data, how does Genie differentiate between the agent and the (rest of the) environment?


### 2.2 Image Tokenizer (0.3B parameters)

**Architecture**: Fully convolutional 2D U-Net encoder-decoder with vector quantization

Collaborator:

Why did they opt to use a U-net instead of a newer architecture (e.g. Transformer)?

## **5. Critical Synthesis & Sign-Off**

### **5.1 Load-Bearing Assumptions**
* **Semantic Consistency**: The model assumes that a learned latent action (e.g., "Latent Action 1") will consistently map to the same semantic behavior (e.g., "Jump") across vastly different visual textures—an assumption that holds empirically across game genres and robotic scenes.

Contributor:

In your presentation, you mentioned that the model only had 8 latent actions. I'm curious if there's a constraint here where the number of actions that work for 2D gaming is inherently not enough to transition into 3D robotics (even though this is a deliberate design choice to enable fully unsupervised learning!)


10 participants