Audit: World Models - Lorin Achey and Carson Kohlbrenner #48
Conversation
@crheckman I believe all the final changes have been made and are ready for your review
crheckman left a comment:
first half of reading period.
| Classical simulators such as Isaac Sim \[1\] and Mujoco \[2\] capture the physical dynamics necessary for training embodied agents; however, the hard-coded dynamics used in such simulators are not practical for large-scale data generation of nuanced physical phenomena and realistic rendering.
| World models (also referred to as World Foundation Models or WFMs) offer an alternative, data-driven approach to simulation and future-state prediction that can capture more nuanced physical phenomena and render realistic video/image outputs.
| World models are trained to capture the underlying spatial and temporal dynamics in images and video to predict future states of the environment.
| In this document, we will look at four prevalent world models: GAIA-1 \[3\], Genie \[4\], TesserAct \[5\], and Cosmos \[6\].
nit: Oops, you missed one! https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simulation
| ## Architecture
| Each world model analyzed in this document fundamentally learns to predict the spatio-temporal dynamics of static frames.
None of them consume video as context?
| Each model follows the encoder-decoder formulation where an encoder $\mathcal{E}$ ingests input frames $x$ from time $t=0:T$ and encodes them into latent tokens $z_{0:T}$, a dynamics model $\text{DYN}$ predicts the next latent tokens $z_{T+1:T+K}$, and a decoder $\mathcal{D}$ reconstructs the frames at time $t>T$.
Are none of them fully autoregressive? (ingest context, create latent vector, and autoregressively decode subsequent frames)? If not, and assuming such an architecture has been tried by anyone, seems like there may be an explanation (computational savings, stability, ...).
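To make the $\mathcal{E}$ / $\text{DYN}$ / $\mathcal{D}$ factoring concrete, here is a minimal sketch of a rollout loop. The linear maps, dimensions, and function names are illustrative stand-ins, not components from any of the four papers:

```python
# Minimal sketch of the encoder / dynamics / decoder factoring (toy stand-ins).
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, LATENT_DIM = 256, 32            # toy dimensions
W_enc = rng.normal(size=(LATENT_DIM, FRAME_DIM)) * 0.01
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.01
W_dec = rng.normal(size=(FRAME_DIM, LATENT_DIM)) * 0.01

def encode(frames):                        # E: x_{0:T} -> z_{0:T}
    return frames @ W_enc.T

def dynamics(z_context):                   # DYN: z_{0:T} -> z_{T+1}
    return np.tanh(z_context[-1] @ W_dyn.T)

def decode(z):                             # D: z_t -> x_hat_t
    return z @ W_dec.T

def rollout(frames, horizon=4):
    """Predict `horizon` future frames from T context frames."""
    z = list(encode(frames))
    preds = []
    for _ in range(horizon):
        z_next = dynamics(np.stack(z))     # predict the next latent token
        z.append(z_next)
        preds.append(decode(z_next))       # reconstruct the future frame
    return np.stack(preds)

context = rng.normal(size=(8, FRAME_DIM))  # T = 8 context frames
future = rollout(context, horizon=4)       # x_hat_{T+1 : T+4}
print(future.shape)                        # (4, 256)
```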
| ### Features
| <table>
I think there's a way to render tables in Markdown with less ... HTML. Consider revising for reviewability's sake.
You should be able to hit this repo (not just the mdx file) with your favorite AI code assistant and help out in cleaning some of this up.
| <td><strong>Cosmos</strong></td>
| <td><strong>14 Billion</strong> (Cosmos-Predict1-14B variant)</td>
| <td><strong>~20 Million hours</strong> of raw video ($10^8$ video clips)</td>
| <td>10,000 H100 GPUs (for 3 months)</td>
😱
Were they really training for this long?
Llama's 405B model took 2 months to train on 16,000 H100s too.
| </tr>
| <tr>
| <td><strong>TesserAct</strong></td>
| <td><em>Not specified in sources</em></td>
The model is built on CogVideoX-5B. https://github.com/UMass-Embodied-AGI/TesserAct/blob/main/doc/usage.md, 30GB of weights.
| Tokenization is a critical component for world models as it compresses high-dimensional image data into a lower-dimensional latent space that the world model can efficiently reason over.
| The naive approach of sectioning images into patches and flattening them into vectors is often insufficient for capturing the complex spatial and temporal relationships in image data efficiently enough for practical use of a world model.
| State-of-the-art world models instead use a variety of **discrete** and **continuous** tokenization approaches as follows:
Define a "continuous token."
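One way to make the distinction concrete (responding to the comment above): a discrete token is an integer index into a finite learned codebook (as in a VQ-VAE), while a continuous token is the real-valued latent vector itself. The sketch below uses made-up sizes and a random codebook purely for illustration:

```python
# Illustrative contrast between a discrete and a continuous token for one patch.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))          # 512 learned codes, 16-dim each
patch_embedding = rng.normal(size=16)          # encoder output for one patch

# Discrete token: the *index* of the nearest codebook entry (an integer symbol).
discrete_token = int(np.argmin(np.linalg.norm(codebook - patch_embedding, axis=1)))

# Continuous token: the real-valued latent vector itself (optionally sampled
# from a predicted Gaussian, as in a VAE-style continuous tokenizer).
mu, log_var = patch_embedding, np.full(16, -2.0)
continuous_token = mu + np.exp(0.5 * log_var) * rng.normal(size=16)

print(discrete_token)          # e.g. 137  -> one of 512 possible symbols
print(continuous_token.shape)  # (16,)     -> a point in R^16
```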
| <td><strong>GAIA-1</strong></td>
| <td><strong>Multimodal understanding</strong> and disentanglement of static and dynamic driving elements like pedestrians and road layouts.</td>
| <td>Potential for <strong>sampling errors</strong> (loops or OOD artifacts) if autoregressive sampling strategies are not carefully tuned.</td>
| <td>It uses a <strong>unified representation</strong> for video, text, and actions, but relies on a diffusion decoder to correct temporal inconsistencies in its latent predictions.</td>
briefly expand on why this might be an issue (i.e. sampling errors)
| <td><strong>Cosmos</strong></td>
| <td><strong>14 Billion</strong> (Cosmos-Predict1-14B variant)</td>
| <td><strong>~20 Million hours</strong> of raw video ($10^8$ video clips)</td>
| <td>10,000 H100 GPUs (for 3 months)</td>
Might be useful to mention how much compute is required to actually run these, if that's mentioned anywhere in the paper.
| **Information Decay.** The tokenizer compresses 3.5M bits to 7,488 bits (470×).
| Sub-pixel depth gradients, high-frequency textures, precise object boundaries, and small/distant objects may fall below tokenization resolution.
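As a quick sanity check of the 470× figure, assuming GAIA-1's reported settings of 288×512 RGB input at 8 bits per channel and 576 tokens per frame from an 8,192-entry vocabulary:

```python
# Sanity check of the ~470x compression claim under the assumed GAIA-1 settings.
import math

input_bits = 288 * 512 * 3 * 8                 # 3,538,944 ≈ 3.5M bits per frame
token_bits = 576 * math.log2(8192)             # 576 tokens * 13 bits = 7,488 bits
print(input_bits, token_bits, input_bits / token_bits)  # ratio ≈ 472, i.e. ~470x
```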
| **The Semantic-Motor Gap.** GAIA-1 outputs video frames, not control commands.
Was their target in the paper to use this world model for training purposes? Or was it to actually control vehicles in real time? If they are addressing it as a limitation, I'm wondering what their original intentions were.
| **If this paper were a technical proposal at Zoox/Tesla, would I sign off?**
| **For Production: CONDITIONAL NO**
How would this be used in production? To generate possible future sequences in real time (maybe for MPC)? Or would it be used for RL offline for finetuning a policy? The feasibility might be different depending on the use case.
| ---
| # Technical Paper Audits: World Models
I would appreciate a brief history of world models dating back to the 80s/90s. This paper gives a good introduction: World Models
This dispels the idea that world models are a new thing.
| <tr>
| <td><strong>Genie</strong></td>
| <td><strong>Unsupervised learning</strong> of interactive environments from massive, action-free Internet video corpora.</td>
| <td>Limited to <strong>16 frames of memory</strong> and an inference speed of approximately <strong>1 FPS</strong>.</td>
It looks like most of these models have much lower inference speeds compared to a standard simulator. Is there another way to get around this lower speed, like using larger batches in parallel, or is the seemingly better data just worth the speed trade-off?
crheckman left a comment:
second half of reading period
| <tr>
| <td><strong>Genie</strong></td>
| <td><strong>Unsupervised learning</strong> of interactive environments from massive, action-free Internet video corpora.</td>
| <td>Limited to <strong>16 frames of memory</strong> and an inference speed of approximately <strong>1 FPS</strong>.</td>
16 frames of memory is a disaster. Is this not somewhere they can make use of mRoPE and long-context training? Or is the only bottleneck real-time inference?
| <tr>
| <td><strong>Cosmos</strong></td>
| <td>Providing a <strong>highly scalable platform</strong> for Physical AI with state-of-the-art reconstruction quality.</td>
| <td>Models still struggle with perfect <strong>physics adherence</strong> and object permanence in certain edge cases.</td>
physics adherence -> no models can actually adhere to physics, so if it's stated here, are the hallucinations/violations obvious?
| * **Data and observability limits**: embodied, contact-rich interactions are underrepresented in large-scale datasets, and video-only observations cannot capture hidden state (e.g., forces, friction), limiting physics-faithful rollouts \[11\].
| * **Physical consistency failures**: long-horizon generations can violate object permanence and contact dynamics, making some models unreliable as safety-critical simulators \[6\].
| * **Weak closed-loop evidence**: GAIA-1 is a driving-focused generator rather than a deployable controller and is not evaluated in closed-loop autonomy \[3\].
I think you need to mention something about computational efficiency too.
We need to address the computational feasibility of the WFM in the control loop. If we run the WFM and VLA in parallel for predictive control, the inference latency of current generative architectures (diffusion/autoregressive) makes real-time operation impossible.
Furthermore, if we quantize or reduce sampling steps to force real-time performance, we risk washing out the variance in the simulation. This creates a 'mean-seeking' world model that fails to represent the dangerous edge cases our VLA actually needs to plan against. On top of that, we'll end up with compounding simulation drift!
| They can also serve as a "pre-trained" initialization to address **data scarcity** in real-world robotics.
| * **Safe Policy Training:** By pairing a WFM with a reward model, agents can gain proficiency through **reinforcement learning** in a simulated environment that faithfully adheres to physical laws.
| * **Planning and Model-Predictive Control (MPC):** Robots can use world models to simulate multiple potential future states based on different action sequences, executing only the path that maximizes the predicted reward (see the sketch below).
| * **Synthetic Data Generation for Sim2Real:** WFMs can generate massive amounts of synthetic video data, including metadata like **depth or semantic maps**, to bridge the gap between simulation and real-world deployment.
I think you must also mention something about the practical impossibility of modeling phenomena like non-specular reflection, radiative diffusion, granular media, and other physical phenomena that these models can pretty faithfully reconstruct at scale. This means we can "observe" edge case phenomena at a much higher frequency than we would casually encounter them in the world, and build models that understand them using these newly generated datasets.
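To illustrate the MPC bullet above, here is a minimal random-shooting planner wrapped around a world model. `world_model` and `reward` are toy stand-ins for a WFM rollout and a learned reward model, not any paper's implementation:

```python
# Minimal random-shooting MPC around a (stand-in) learned world model.
import numpy as np

rng = np.random.default_rng(0)
goal = np.array([1.0, 0.5])

def world_model(state, action):
    return state + 0.1 * action               # stand-in dynamics, not a real WFM

def reward(state):
    return -np.linalg.norm(state - goal)      # stand-in learned reward model

def plan(state, horizon=5, n_candidates=256):
    """Sample action sequences, roll them out in imagination, keep the best."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, 2))
        s, total = state, 0.0
        for a in actions:                      # imagined rollout in the world model
            s = world_model(s, a)
            total += reward(s)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action                   # MPC: execute only the first action

state = np.zeros(2)
for _ in range(20):                            # replan at every control step
    state = world_model(state, plan(state))    # (in reality: apply to the robot)
print(state)                                   # should have moved toward `goal`
```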
krusnim left a comment:
Nice audit. I focused on the section on Genie for my comments.
| ## **3. Data & Scaling**
| Genie follows the scaling laws typical of Large Language Models (LLMs).
| * **Dataset (Platformers)**: Constructed by filtering 55M clips down to a high-quality "Curated" set of **6.8M clips (30,000 hours)**. Filtering distractor items like menu screens or streamer faces was found to be more beneficial than raw data quantity.
I don't understand why Genie used this platformer data. (Are platformers specifically what Genie is "for?") They boast that this "generalizes beyond gaming to robotic manipulation," but that seems very suspect to me, unless they threw out the platformer data entirely and just used RT-1's dataset for that experiment. In which case, why lead with the platformer data?
Reading the paper I see now that the RT-1 version is a separate model. So the generality they're boasting is of the approach, not of a singular model - might want to make that slightly clearer.
| ## **5. Critical Synthesis & Sign-Off**
| ### **5.1 Load-Bearing Assumptions**
| * **Semantic Consistency**: The model assumes that a learned latent action (e.g., "Latent Action 1") will consistently map to the same semantic behavior (e.g., "Jump") across vastly different visual textures—an assumption that holds empirically across game genres and robotic scenes.
Yeah, I struggle to understand how the paper gets around this limitation. Action [jump] for a platformer and for a robot are vastly different.
| The model treats interactive environment generation as a **next-token prediction task**, where future states are conditioned on inferred latent actions.
| ### **2.2 Video Tokenization**
| The **ST-ViViT tokenizer** (200M parameters) utilizes a **VQ-VAE** \[5\] with ST-transformer blocks in both the encoder and decoder.
Maybe worth clarifying that (from my understanding) they actually use two VQ-VAEs: one for video tokenization and one for action tokenization.
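A rough sketch of the next-token-prediction setup described above, with frame tokens from the video tokenizer and a latent-action ID from the LAM combined into one autoregressive target. The interleaving layout and sizes are assumptions for illustration, not Genie's exact format:

```python
# Illustrative next-token-prediction training example: predict frame t+1's tokens
# from the tokens of frames 0..t plus the latent action inferred at step t.
import numpy as np

rng = np.random.default_rng(0)
T, TOKENS_PER_FRAME = 16, 20 * 20            # context length, tokens per frame
VIDEO_VOCAB, N_LATENT_ACTIONS = 1024, 8      # sizes of the two VQ codebooks

frame_tokens = rng.integers(0, VIDEO_VOCAB, size=(T, TOKENS_PER_FRAME))
latent_actions = rng.integers(0, N_LATENT_ACTIONS, size=T - 1)  # inferred by LAM

def build_training_example(frame_tokens, latent_actions, t):
    """Input: flattened tokens of frames 0..t plus the latent action at step t.
    Target: the tokens of frame t+1 (a next-token prediction objective)."""
    context = np.concatenate([frame_tokens[: t + 1].ravel(), [latent_actions[t]]])
    target = frame_tokens[t + 1]
    return context, target

context, target = build_training_example(frame_tokens, latent_actions, t=3)
print(context.shape, target.shape)           # (4*400 + 1,) and (400,)
```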
| ## **3. Data & Scaling**
| Genie follows the scaling laws typical of Large Language Models (LLMs).
| * **Dataset (Platformers)**: Constructed by filtering 55M clips down to a high-quality "Curated" set of **6.8M clips (30,000 hours)**. Filtering distractor items like menu screens or streamer faces was found to be more beneficial than raw data quantity.
Also, would like to know how they performed dataset filtering, if they mentioned it. Platformer video seems pretty recognizable in comparison to other content so it seems there are some tricks they could use.
| ### **4.3 The Video-Only Assumption**
| The fundamental technical thesis of Genie is that **ground-truth action labels are unnecessary for learning world models**.
| By discarding the LAM encoder at inference and allowing a user to index the learned VQ codebook, Genie proves that internet-scale video provides enough causal structure to ground an agent's "understanding" of a world.
Without action data, how does Genie differentiate between the agent and the (rest of the) environment?
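For intuition, a sketch of the inference-time loop described above: the LAM encoder is discarded, and a user (or policy) picks one of the learned latent-action IDs at each step. `dynamics_model` and `decode_frame` are hypothetical stand-ins, not Genie's actual API:

```python
# Sketch of inference-time control via user-indexed latent actions (stand-ins).
import numpy as np

rng = np.random.default_rng(0)
N_LATENT_ACTIONS, TOKENS_PER_FRAME, VOCAB = 8, 400, 1024

def dynamics_model(frame_token_history, action_id):
    # Stand-in: a real model autoregressively predicts the next frame's tokens
    # conditioned on prior frame tokens and the chosen latent action ID.
    return rng.integers(0, VOCAB, size=TOKENS_PER_FRAME)

def decode_frame(tokens):
    return tokens.reshape(20, 20)            # stand-in for the tokenizer's decoder

history = [rng.integers(0, VOCAB, size=TOKENS_PER_FRAME)]  # prompt frame
for step in range(5):
    user_action = step % N_LATENT_ACTIONS    # e.g. keyboard input mapped to an ID
    next_tokens = dynamics_model(np.stack(history), user_action)
    history.append(next_tokens)
    frame = decode_frame(next_tokens)        # render for the user, then repeat
print(len(history), frame.shape)             # 6 frames, (20, 20)
```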
| ### 2.2 Image Tokenizer (0.3B parameters)
| **Architecture**: Fully convolutional 2D U-Net encoder-decoder with vector quantization
Why did they opt to use a U-net instead of a newer architecture (e.g. Transformer)?
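For reference, a compact sketch of a convolutional encoder-decoder with a vector-quantized bottleneck, in the spirit of the architecture described above. Channel counts, depths, and the omission of U-Net skip connections are assumptions of this example, not the paper's configuration:

```python
# Minimal convolutional encoder-decoder with a VQ bottleneck (illustrative only).
import torch
import torch.nn as nn

class VQConvAutoencoder(nn.Module):
    def __init__(self, codebook_size=1024, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(  # 3x64x64 -> latent_dim x 16 x 16
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 3, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(  # latent_dim x 16 x 16 -> 3x64x64
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def quantize(self, z):
        # Nearest-codebook lookup with a straight-through gradient estimator.
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)              # (B*H*W, C)
        ids = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        zq = self.codebook(ids).view(b, h, w, c).permute(0, 3, 1, 2)
        zq = z + (zq - z).detach()                               # straight-through
        return zq, ids.view(b, h, w)

    def forward(self, x):
        z = self.encoder(x)
        zq, ids = self.quantize(z)
        return self.decoder(zq), ids

x = torch.randn(2, 3, 64, 64)
recon, token_ids = VQConvAutoencoder()(x)
print(recon.shape, token_ids.shape)  # (2, 3, 64, 64), (2, 16, 16)
```

A plausible (though not stated) reason to prefer convolutions over a Transformer here is that a per-frame tokenizer mainly needs local spatial compression, which convolutions handle cheaply at high resolution.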
| ## **5. Critical Synthesis & Sign-Off**
| ### **5.1 Load-Bearing Assumptions**
| * **Semantic Consistency**: The model assumes that a learned latent action (e.g., "Latent Action 1") will consistently map to the same semantic behavior (e.g., "Jump") across vastly different visual textures—an assumption that holds empirically across game genres and robotic scenes.
In your presentation, you mentioned that the model only had 8 latent actions. I'm curious if there's a constraint here where the number of actions that works for 2D gaming is inherently not enough to transition into 3D robotics (even though the small action space is a deliberate design choice to enable fully unsupervised learning!).
This technical audit explores the current state of the art for world models: their architecture, robotics use cases, and limitations. It focuses on GAIA-1, Genie, TesserAct, and Cosmos as case studies.