Audit: World Models - Lorin Achey and Carson Kohlbrenner #48
Conversation
@crheckman I believe all the final changes have been made and are ready for your review
crheckman left a comment:
first half of reading period.
| Classical simulators such as Isaac Sim \[1\] and Mujoco \[2\] capture the physical dynamics necessary for training embodied agents; however, the hard-coded dynamics used in such simulators are not practical for large-scale data generation of nuanced physical phenomena and realistic rendering.
| World models (also referred to as World Foundation Models or WFMs) offer an alternative, data-driven approach to simulation and future-state prediction that can capture more nuanced physical phenomena and render realistic video/image outputs.
| World models are trained to capture the underlying spatial and temporal dynamics in images and video to predict future states of the environment.
| In this document, we will look at four prevalent world models: GAIA-1 \[3\], Genie \[4\], TesserAct \[5\], and Cosmos \[6\].
nit: Oops, you missed one! https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simulation
| ## Architecture
| Each world model analyzed in this document fundamentally learns to predict the spatio-temporal dynamics of static frames.
None of them consume video as context?
| Each model follows the encoder-decoder formulation where an encoder $\mathcal{E}$ ingests input frames $x$ from time $t=0:T$ and encodes them into latent tokens $z_{0:T}$, a dynamics model $\text{DYN}$ predicts the next latent tokens $z_{T+1:T+K}$, and a decoder $\mathcal{D}$ reconstructs the frames at time $t>T$.
Are none of them fully autoregressive? (ingest context, create latent vector, and autoregressively decode subsequent frames)? If not, and assuming such an architecture has been tried by anyone, seems like there may be an explanation (computational savings, stability, ...).
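To make the $\mathcal{E}$ / $\text{DYN}$ / $\mathcal{D}$ factoring concrete, here is a minimal sketch of a rollout loop. The linear maps, dimensions, and function names are illustrative stand-ins, not components from any of the four papers:

```python
# Minimal sketch of the encoder / dynamics / decoder factoring (toy stand-ins).
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, LATENT_DIM = 256, 32            # toy dimensions
W_enc = rng.normal(size=(LATENT_DIM, FRAME_DIM)) * 0.01
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.01
W_dec = rng.normal(size=(FRAME_DIM, LATENT_DIM)) * 0.01

def encode(frames):                        # E: x_{0:T} -> z_{0:T}
    return frames @ W_enc.T

def dynamics(z_context):                   # DYN: z_{0:T} -> z_{T+1}
    return np.tanh(z_context[-1] @ W_dyn.T)

def decode(z):                             # D: z_t -> x_hat_t
    return z @ W_dec.T

def rollout(frames, horizon=4):
    """Predict `horizon` future frames from T context frames."""
    z = list(encode(frames))
    preds = []
    for _ in range(horizon):
        z_next = dynamics(np.stack(z))     # predict the next latent token
        z.append(z_next)
        preds.append(decode(z_next))       # reconstruct the future frame
    return np.stack(preds)

context = rng.normal(size=(8, FRAME_DIM))  # T = 8 context frames
future = rollout(context, horizon=4)       # x_hat_{T+1 : T+4}
print(future.shape)                        # (4, 256)
```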
| ### Features
| <table>
I think there's a way to render tables in Markdown with less ... HTML. Consider revising for reviewability's sake.
You should be able to hit this repo (not just the mdx file) with your favorite AI code assistant and help out in cleaning some of this up.
| <td><strong>Cosmos</strong></td>
| <td><strong>14 Billion</strong> (Cosmos-Predict1-14B variant)</td>
| <td><strong>~20 Million hours</strong> of raw video ($10^8$ video clips)</td>
| <td>10,000 H100 GPUs (for 3 months)</td>
😱
Were they really training for this long?
Llama's 405B model took 2 months to train on 16,000 H100s too.
| </tr>
| <tr>
| <td><strong>TesserAct</strong></td>
| <td><em>Not specified in sources</em></td>
The model is built on CogVideoX-5B. https://github.com/UMass-Embodied-AGI/TesserAct/blob/main/doc/usage.md, 30GB of weights.
| Tokenization is a critical component for world models as it compresses high-dimensional image data into a lower-dimensional latent space that the world model can efficiently reason over.
| The naive approach of sectioning images into patches and flattening them into vectors is often insufficient for capturing the complex spatial and temporal relationships in image data efficiently enough for practical use of a world model.
| State-of-the-art world models instead use a variety of **discrete** and **continuous** tokenization approaches as follows:
Define a "continuous token."
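One way to make the distinction concrete (responding to the comment above): a discrete token is an integer index into a finite learned codebook (as in a VQ-VAE), while a continuous token is the real-valued latent vector itself. The sketch below uses made-up sizes and a random codebook purely for illustration:

```python
# Illustrative contrast between a discrete and a continuous token for one patch.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))          # 512 learned codes, 16-dim each
patch_embedding = rng.normal(size=16)          # encoder output for one patch

# Discrete token: the *index* of the nearest codebook entry (an integer symbol).
discrete_token = int(np.argmin(np.linalg.norm(codebook - patch_embedding, axis=1)))

# Continuous token: the real-valued latent vector itself (optionally sampled
# from a predicted Gaussian, as in a VAE-style continuous tokenizer).
mu, log_var = patch_embedding, np.full(16, -2.0)
continuous_token = mu + np.exp(0.5 * log_var) * rng.normal(size=16)

print(discrete_token)          # e.g. 137  -> one of 512 possible symbols
print(continuous_token.shape)  # (16,)     -> a point in R^16
```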
| <td><strong>GAIA-1</strong></td>
| <td><strong>Multimodal understanding</strong> and disentanglement of static and dynamic driving elements like pedestrians and road layouts.</td>
| <td>Potential for <strong>sampling errors</strong> (loops or OOD artifacts) if autoregressive sampling strategies are not carefully tuned.</td>
| <td>It uses a <strong>unified representation</strong> for video, text, and actions, but relies on a diffusion decoder to correct temporal inconsistencies in its latent predictions.</td>
briefly expand on why this might be an issue (i.e. sampling errors)
| <td><strong>Cosmos</strong></td>
| <td><strong>14 Billion</strong> (Cosmos-Predict1-14B variant)</td>
| <td><strong>~20 Million hours</strong> of raw video ($10^8$ video clips)</td>
| <td>10,000 H100 GPUs (for 3 months)</td>
Might be useful to mention how much compute is required to actually run these, if that's mentioned anywhere in the paper.
| **Information Decay.** The tokenizer compresses 3.5M bits to 7,488 bits (470×).
| Sub-pixel depth gradients, high-frequency textures, precise object boundaries, and small/distant objects may fall below tokenization resolution.
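As a quick sanity check of the 470× figure, assuming GAIA-1's reported settings of 288×512 RGB input at 8 bits per channel and 576 tokens per frame from an 8,192-entry vocabulary:

```python
# Sanity check of the ~470x compression claim under the assumed GAIA-1 settings.
import math

input_bits = 288 * 512 * 3 * 8                 # 3,538,944 ≈ 3.5M bits per frame
token_bits = 576 * math.log2(8192)             # 576 tokens * 13 bits = 7,488 bits
print(input_bits, token_bits, input_bits / token_bits)  # ratio ≈ 472, i.e. ~470x
```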
| **The Semantic-Motor Gap.** GAIA-1 outputs video frames, not control commands.
Was their target in the paper to use this world model for training purposes? Or was it to actually control vehicles in real time? If they are addressing it as a limitation, I'm wondering what their original intentions were.
| **If this paper were a technical proposal at Zoox/Tesla, would I sign off?**
| **For Production: CONDITIONAL NO**
How would this be used in production? To generate possible future sequences in real time (maybe for MPC)? Or would it be used for RL offline for finetuning a policy? The feasibility might be different depending on the use case.
| ---
| # Technical Paper Audits: World Models
I would appreciate a brief history of world models dating back to the 80s/90s. This paper gives a good introduction: World Models
This dispels the idea that world models are a new thing.
| <tr>
| <td><strong>Genie</strong></td>
| <td><strong>Unsupervised learning</strong> of interactive environments from massive, action-free Internet video corpora.</td>
| <td>Limited to <strong>16 frames of memory</strong> and an inference speed of approximately <strong>1 FPS</strong>.</td>
It looks like most of these models have much lower inference speeds compared to a standard simulator. Is there another way to get around this lower speed, like using larger batches in parallel, or is the seemingly better data just worth the speed trade-off?
crheckman left a comment:
second half of reading period
| <tr>
| <td><strong>Genie</strong></td>
| <td><strong>Unsupervised learning</strong> of interactive environments from massive, action-free Internet video corpora.</td>
| <td>Limited to <strong>16 frames of memory</strong> and an inference speed of approximately <strong>1 FPS</strong>.</td>
16 frames of memory is a disaster. Is this not somewhere they can make use of mRoPE and long-context training? Or is the only bottleneck real-time inference?
| <tr>
| <td><strong>Cosmos</strong></td>
| <td>Providing a <strong>highly scalable platform</strong> for Physical AI with state-of-the-art reconstruction quality.</td>
| <td>Models still struggle with perfect <strong>physics adherence</strong> and object permanence in certain edge cases.</td>
physics adherence -> no models can actually adhere to physics, so if it's stated here, are the hallucinations/violations obvious?
| * **Data and observability limits**: embodied, contact-rich interactions are underrepresented in large-scale datasets, and video-only observations cannot capture hidden state (e.g., forces, friction), limiting physics-faithful rollouts \[11\].
| * **Physical consistency failures**: long-horizon generations can violate object permanence and contact dynamics, making some models unreliable as safety-critical simulators \[6\].
| * **Weak closed-loop evidence**: GAIA-1 is a driving-focused generator rather than a deployable controller and is not evaluated in closed-loop autonomy \[3\].
I think you need to mention something about computational efficiency too.
We need to address the computational feasibility of the WFM in the control loop. If we run the WFM and VLA in parallel for predictive control, the inference latency of current generative architectures (diffusion/autoregressive) makes real-time operation impossible.
Furthermore, if we quantize or reduce sampling steps to force real-time performance, we risk washing out the variance in the simulation. This creates a 'mean-seeking' world model that fails to represent the dangerous edge cases our VLA actually needs to plan against. On top of that, we'll end up with compounding simulation drift!
| They can also serve as a "pre-trained" initialization to address **data scarcity** in real-world robotics.
| * **Safe Policy Training:** By pairing a WFM with a reward model, agents can gain proficiency through **reinforcement learning** in a simulated environment that faithfully adheres to physical laws.
| * **Planning and Model-Predictive Control (MPC):** Robots can use world models to simulate multiple potential future states based on different action sequences, executing only the path that maximizes the predicted reward (see the sketch below).
| * **Synthetic Data Generation for Sim2Real:** WFMs can generate massive amounts of synthetic video data, including metadata like **depth or semantic maps**, to bridge the gap between simulation and real-world deployment.
I think you must also mention something about the practical impossibility of modeling phenomena like non-specular reflection, radiative diffusion, granular media, and other physical phenomena that these models can pretty faithfully reconstruct at scale. This means we can "observe" edge case phenomena at a much higher frequency than we would casually encounter them in the world, and build models that understand them using these newly generated datasets.
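To illustrate the MPC bullet above, here is a minimal random-shooting planner wrapped around a world model. `world_model` and `reward` are toy stand-ins for a WFM rollout and a learned reward model, not any paper's implementation:

```python
# Minimal random-shooting MPC around a (stand-in) learned world model.
import numpy as np

rng = np.random.default_rng(0)
goal = np.array([1.0, 0.5])

def world_model(state, action):
    return state + 0.1 * action               # stand-in dynamics, not a real WFM

def reward(state):
    return -np.linalg.norm(state - goal)      # stand-in learned reward model

def plan(state, horizon=5, n_candidates=256):
    """Sample action sequences, roll them out in imagination, keep the best."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, 2))
        s, total = state, 0.0
        for a in actions:                      # imagined rollout in the world model
            s = world_model(s, a)
            total += reward(s)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action                   # MPC: execute only the first action

state = np.zeros(2)
for _ in range(20):                            # replan at every control step
    state = world_model(state, plan(state))    # (in reality: apply to the robot)
print(state)                                   # should have moved toward `goal`
```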
krusnim left a comment:
Nice audit. I focused on the section on Genie for my comments.
| ## **3. Data & Scaling**
| Genie follows the scaling laws typical of Large Language Models (LLMs).
| * **Dataset (Platformers)**: Constructed by filtering 55M clips down to a high-quality "Curated" set of **6.8M clips (30,000 hours)**. Filtering distractor items like menu screens or streamer faces was found to be more beneficial than raw data quantity.
I don't understand why Genie used this platformer data. (Are platformers specifically what Genie is "for?") They boast that this "generalizes beyond gaming to robotic manipulation," but that seems very suspect to me, unless they threw out the platformer data entirely and just used RT-1's dataset for that experiment. In which case, why lead with the platformer data?
Reading the paper I see now that the RT-1 version is a separate model. So the generality they're boasting is of the approach, not of a singular model - might want to make that slightly clearer.
| ## **5. Critical Synthesis & Sign-Off**
| ### **5.1 Load-Bearing Assumptions**
| * **Semantic Consistency**: The model assumes that a learned latent action (e.g., "Latent Action 1") will consistently map to the same semantic behavior (e.g., "Jump") across vastly different visual textures—an assumption that holds empirically across game genres and robotic scenes.
Yeah, I struggle to understand how the paper gets around this limitation. Action [jump] for a platformer and for a robot are vastly different.
| The model treats interactive environment generation as a **next-token prediction task**, where future states are conditioned on inferred latent actions.
| ### **2.2 Video Tokenization**
| The **ST-ViViT tokenizer** (200M parameters) utilizes a **VQ-VAE** \[5\] with ST-transformer blocks in both the encoder and decoder.
Maybe worth clarifying that (from my understanding) they actually use two VQ-VAEs: one for video tokenization and one for action tokenization.
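A rough sketch of the next-token-prediction setup described above, with frame tokens from the video tokenizer and a latent-action ID from the LAM combined into one autoregressive target. The interleaving layout and sizes are assumptions for illustration, not Genie's exact format:

```python
# Illustrative next-token-prediction training example: predict frame t+1's tokens
# from the tokens of frames 0..t plus the latent action inferred at step t.
import numpy as np

rng = np.random.default_rng(0)
T, TOKENS_PER_FRAME = 16, 20 * 20            # context length, tokens per frame
VIDEO_VOCAB, N_LATENT_ACTIONS = 1024, 8      # sizes of the two VQ codebooks

frame_tokens = rng.integers(0, VIDEO_VOCAB, size=(T, TOKENS_PER_FRAME))
latent_actions = rng.integers(0, N_LATENT_ACTIONS, size=T - 1)  # inferred by LAM

def build_training_example(frame_tokens, latent_actions, t):
    """Input: flattened tokens of frames 0..t plus the latent action at step t.
    Target: the tokens of frame t+1 (a next-token prediction objective)."""
    context = np.concatenate([frame_tokens[: t + 1].ravel(), [latent_actions[t]]])
    target = frame_tokens[t + 1]
    return context, target

context, target = build_training_example(frame_tokens, latent_actions, t=3)
print(context.shape, target.shape)           # (4*400 + 1,) and (400,)
```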
| ## **3. Data & Scaling**
| Genie follows the scaling laws typical of Large Language Models (LLMs).
| * **Dataset (Platformers)**: Constructed by filtering 55M clips down to a high-quality "Curated" set of **6.8M clips (30,000 hours)**. Filtering distractor items like menu screens or streamer faces was found to be more beneficial than raw data quantity.
Also, would like to know how they performed dataset filtering, if they mentioned it. Platformer video seems pretty recognizable in comparison to other content so it seems there are some tricks they could use.
| ### **4.3 The Video-Only Assumption**
| The fundamental technical thesis of Genie is that **ground-truth action labels are unnecessary for learning world models**.
| By discarding the LAM encoder at inference and allowing a user to index the learned VQ codebook, Genie proves that internet-scale video provides enough causal structure to ground an agent's "understanding" of a world.
Without action data, how does Genie differentiate between the agent and the (rest of the) environment?
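For intuition, a sketch of the inference-time loop described above: the LAM encoder is discarded, and a user (or policy) picks one of the learned latent-action IDs at each step. `dynamics_model` and `decode_frame` are hypothetical stand-ins, not Genie's actual API:

```python
# Sketch of inference-time control via user-indexed latent actions (stand-ins).
import numpy as np

rng = np.random.default_rng(0)
N_LATENT_ACTIONS, TOKENS_PER_FRAME, VOCAB = 8, 400, 1024

def dynamics_model(frame_token_history, action_id):
    # Stand-in: a real model autoregressively predicts the next frame's tokens
    # conditioned on prior frame tokens and the chosen latent action ID.
    return rng.integers(0, VOCAB, size=TOKENS_PER_FRAME)

def decode_frame(tokens):
    return tokens.reshape(20, 20)            # stand-in for the tokenizer's decoder

history = [rng.integers(0, VOCAB, size=TOKENS_PER_FRAME)]  # prompt frame
for step in range(5):
    user_action = step % N_LATENT_ACTIONS    # e.g. keyboard input mapped to an ID
    next_tokens = dynamics_model(np.stack(history), user_action)
    history.append(next_tokens)
    frame = decode_frame(next_tokens)        # render for the user, then repeat
print(len(history), frame.shape)             # 6 frames, (20, 20)
```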
| ### 2.2 Image Tokenizer (0.3B parameters)
| **Architecture**: Fully convolutional 2D U-Net encoder-decoder with vector quantization
Why did they opt to use a U-net instead of a newer architecture (e.g. Transformer)?
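For reference, a compact sketch of a convolutional encoder-decoder with a vector-quantized bottleneck, in the spirit of the architecture described above. Channel counts, depths, and the omission of U-Net skip connections are assumptions of this example, not the paper's configuration:

```python
# Minimal convolutional encoder-decoder with a VQ bottleneck (illustrative only).
import torch
import torch.nn as nn

class VQConvAutoencoder(nn.Module):
    def __init__(self, codebook_size=1024, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(  # 3x64x64 -> latent_dim x 16 x 16
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 3, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(  # latent_dim x 16 x 16 -> 3x64x64
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def quantize(self, z):
        # Nearest-codebook lookup with a straight-through gradient estimator.
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)              # (B*H*W, C)
        ids = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        zq = self.codebook(ids).view(b, h, w, c).permute(0, 3, 1, 2)
        zq = z + (zq - z).detach()                               # straight-through
        return zq, ids.view(b, h, w)

    def forward(self, x):
        z = self.encoder(x)
        zq, ids = self.quantize(z)
        return self.decoder(zq), ids

x = torch.randn(2, 3, 64, 64)
recon, token_ids = VQConvAutoencoder()(x)
print(recon.shape, token_ids.shape)  # (2, 3, 64, 64), (2, 16, 16)
```

A plausible (though not stated) reason to prefer convolutions over a Transformer here is that a per-frame tokenizer mainly needs local spatial compression, which convolutions handle cheaply at high resolution.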
| ## **5. Critical Synthesis & Sign-Off**
| ### **5.1 Load-Bearing Assumptions**
| * **Semantic Consistency**: The model assumes that a learned latent action (e.g., "Latent Action 1") will consistently map to the same semantic behavior (e.g., "Jump") across vastly different visual textures—an assumption that holds empirically across game genres and robotic scenes.
In your presentation, you mentioned that the model only had 8 latent actions. I'm curious if there's a constraint here where the number of actions that works for 2D gaming is inherently not enough to transition into 3D robotics (even though the small action space is a deliberate design choice to enable fully unsupervised learning!).
This technical audit explores the current state of the art for world models: their architecture, robotics use cases, and limitations. It focuses on GAIA-1, Genie, TesserAct, and Cosmos as case studies.