Training in a World That Does Not Exist Yet

World Models as Production Infrastructure for Embodied AI

Figure: A detailed miniature model of a modern city with illuminated buildings and roads.

A robot policy ships to a customer warehouse. It has never seen that warehouse. It has never seen any warehouse. The training environment was generated, frame by frame, by a learned model of how warehouses look and how forklifts and pallets and overhead lights behave. That same generator is in the deployment pipeline, used to evaluate new policies before they touch real hardware, and it is patched, versioned, and rolled back like any other piece of infrastructure.

A few years ago this was a research demonstration. By 2026 it is closer to a load-bearing dependency for embodied AI teams.

A world model is, in the working definition used across recent literature, a predictive model of how an environment evolves under actions. A policy queries it the way a planner queries a physics engine, and it returns a plausible next frame or next state. The current generation of these models is trained on internet-scale video and large robot-trajectory corpora, then fine-tuned on the specific environment a given robot will operate in. The role they play in a production pipeline has expanded from "useful for sample-efficient research" to "the thing the policy is mostly trained against."
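
The "policy queries it" relationship can be made concrete with a small interface. What follows is a minimal sketch, not any particular library's API: `WorldModel`, `ToyCartModel`, and `rollout` are hypothetical names, and the hand-written cart dynamics stand in for what would in practice be a learned network.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

# Hypothetical state for illustration: (position, velocity) of a cart.
State = Tuple[float, float]


class WorldModel(Protocol):
    """Anything that predicts the next state from a state and an action."""

    def predict(self, state: State, action: float) -> State: ...


@dataclass
class ToyCartModel:
    """Stand-in for a learned model: a cart pushed by a force for dt seconds."""

    dt: float = 0.1

    def predict(self, state: State, action: float) -> State:
        pos, vel = state
        vel = vel + action * self.dt   # the action changes velocity
        pos = pos + vel * self.dt      # velocity changes position
        return (pos, vel)


def rollout(model: WorldModel, state: State, actions: List[float]) -> List[State]:
    """Query the model step by step, the way a planner queries a physics engine."""
    states = [state]
    for action in actions:
        state = model.predict(state, action)
        states.append(state)
    return states


trajectory = rollout(ToyCartModel(dt=0.5), (0.0, 0.0), [1.0, 1.0])
print(trajectory[-1])
```

The policy never needs to know whether `predict` is backed by equations or by weights, which is part of why a learned model can slot into the place a simulator used to occupy.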

The Algorithm Becomes the Substrate

The canonical demonstration that world models work as general training environments is DreamerV3, published in Nature in 2025. A single configuration of the algorithm learns 150-plus diverse tasks and, in the most cited result, collects diamonds in Minecraft from scratch without human data or curricula ^[1]. The recipe is straightforward at the conceptual level. A compact world model is learned from interaction. A policy is trained by imagining trajectories inside that model. Real interaction is used sparingly, mostly to keep the model honest. What made the result production-relevant was less the imagination loop than the operational story around it. A fixed set of hyperparameters worked across 150 tasks, which meant an engineering team could plug the algorithm into a new environment without the months of tuning that earlier reinforcement learning recipes required.
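
The three-step recipe can be illustrated at toy scale. This sketch is not DreamerV3, which learns a latent dynamics model and trains an actor-critic inside it. It assumes a one-dimensional environment with a single unknown gain, a least-squares "world model", and a policy chosen by grid search over imagined rollouts. Every name and constant here is made up for illustration.

```python
import random

TRUE_GAIN = 0.5  # property of the real environment, hidden from the agent


def real_step(x: float, a: float) -> float:
    """The real environment: expensive to query, so it is used sparingly."""
    return x + TRUE_GAIN * a


def fit_gain(transitions):
    """World model: least-squares estimate of g in x' - x = g * a."""
    num = sum(a * (x2 - x) for x, a, x2 in transitions)
    den = sum(a * a for _, a, _ in transitions)
    return num / den


def imagined_return(gain: float, k: float, x0: float = 1.0, horizon: int = 10) -> float:
    """Roll the policy a = -k * x forward inside the learned model only."""
    x, total = x0, 0.0
    for _ in range(horizon):
        x = x + gain * (-k * x)
        total -= x * x  # reward: stay close to zero
    return total


def train_in_imagination(gain: float, candidates):
    """Pick the policy parameter with the best imagined return."""
    return max(candidates, key=lambda k: imagined_return(gain, k))


# 1. Collect a handful of real transitions with random actions.
rng = random.Random(0)
transitions, x = [], 1.0
for _ in range(5):
    a = rng.uniform(-1.0, 1.0)
    x_next = real_step(x, a)
    transitions.append((x, a, x_next))
    x = x_next

# 2. Fit the world model. 3. Train the policy without touching the real env.
gain = fit_gain(transitions)
best_k = train_in_imagination(gain, [0.5, 1.0, 1.5, 2.0, 2.5])
print(round(gain, 6), best_k)
```

Five real transitions are enough here only because the toy dynamics have one parameter. The proportions are the point: a few real steps to fit the model, many imagined steps to choose the policy.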

The 2024 NeurIPS spotlight DIAMOND took a parallel path with diffusion-based world models, using the same architectural family that powers modern image and video generators to render the next observation. Visual fidelity matters more than the early world model literature assumed, and a diffusion world model gives the agent enough detail to act on cues that compressed latent models throw away ^[2]. The same paper showed something more provocative for the production conversation. A diffusion world model trained on static Counter-Strike: Global Offensive gameplay footage could stand alone as an interactive game engine, suggesting that world models were no longer just training tools but candidate runtimes.

Google's GameNGen, presented at ICLR 2025, made that suggestion concrete by running the classic game DOOM at over twenty frames per second on a single TPU, with human raters near chance at distinguishing real footage from the simulated rollout ^[3]. DeepMind's Genie line generalized the idea further. A foundation world model trained on thousands of hours of unlabelled gameplay video learned to generate action-controllable environments from a single prompt image, with the action vocabulary itself discovered from the data ^[4]. The lineage moves quickly from "fast Atari simulator" to "general substrate that can be conjured from a still image."

The Production Stack Forms

The phrase "world foundation model" started appearing in industry releases in 2025. NVIDIA's Cosmos platform put it bluntly. Physical AI needs a digital twin of the world before it ever touches the real one. Cosmos ships pre-trained world foundation models with open weights, along with the video curation pipeline, tokenizers, and post-training recipes a team would need to specialize the model to its own robot or driving scenario ^[5]. The pattern is familiar from the language model era. Pre-train a generalist on broad data, post-train on the target deployment, treat the result as infrastructure.

Figure: A miniature train traveling through a tiny village with model trees and buildings.

A diorama is a careful approximation of somewhere. The forklifts and warehouses learned by a world model are too, and the policy that trains there is making decisions about a place that exists only in the weights of another network. The arrangement is closer to model railroading than to a physics engine. Trees are placed because they make the scene legible. Lighting is tuned because the cameras need to see. The geometry is real enough to learn from, and that is the part that matters for the policy.

Wayve's GAIA-2 illustrates how this looks in a specific application. A multi-camera, multi-view latent diffusion model generates spatiotemporally consistent driving footage across UK, US, and German roads, with structured controls for ego-vehicle dynamics, agent placement, weather, and road semantics ^[6]. The use case is the autonomous driving development cycle. Rare scenarios are scarce in real fleets, and a controllable simulator that produces them on demand is more useful than another sensor on another car. GigaWorld-0, from late 2025, applied the same logic to general embodied AI, framing the world model as a "data engine" for vision-language-action policies and reporting that policies trained on its synthetic trajectories improved task success and zero-shot generalization on real robots ^[7].

A 2026 survey of world models in robot learning summarizes the result. World models now serve at least three production roles, sometimes simultaneously. They are training environments for policy learning. They are evaluation harnesses for new policies before deployment. They are synthetic data engines that produce trajectories which would be expensive or unsafe to collect in the real world ^[8]. The same artifact is wearing several hats in the stack, and the hat it is wearing matters when something goes wrong.

The Physics Gap

A simulator that gets the look of an environment right while getting the dynamics wrong is a familiar problem from classical computer graphics. The current world model literature suggests the problem has migrated rather than disappeared. WorldBench, a 2026 diagnostic benchmark from a UCLA-led group, shows that current frontier world models including Cosmos generate visually realistic scene continuations but routinely miss the physical parameters that govern those continuations. A ball follows a believable parabolic trajectory and accelerates downward at the wrong rate. A high-viscosity fluid behaves like a low-viscosity fluid. The visual envelope is plausible. The underlying dynamics are not ^[9].
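
The falling-ball failure is easy to state as a check. The sketch below is not WorldBench's methodology. It assumes the ball's height has already been extracted from the generated frames at a fixed time step (for example by a tracker, which is not shown), estimates acceleration from second differences, and compares it with 9.81 m/s². The trajectories are synthetic.

```python
def estimate_acceleration(heights, dt):
    """Mean second difference of the height samples, divided by dt squared."""
    second_diffs = [
        heights[i + 1] - 2.0 * heights[i] + heights[i - 1]
        for i in range(1, len(heights) - 1)
    ]
    return sum(second_diffs) / len(second_diffs) / dt ** 2


def relative_error(estimate, reference):
    return abs(estimate - reference) / abs(reference)


def free_fall(g, dt, steps, h0=10.0):
    """Synthetic heights of a ball dropped from rest at height h0."""
    return [h0 - 0.5 * g * (i * dt) ** 2 for i in range(steps)]


G = 9.81   # m/s^2, reference value
DT = 0.05  # seconds between frames (assumed known)

real = free_fall(G, DT, 20)
generated = free_fall(6.0, DT, 20)  # a clean parabola that falls too slowly

for name, heights in [("real", real), ("generated", generated)]:
    a = estimate_acceleration(heights, DT)
    print(name, round(a, 3), round(relative_error(a, -G), 3))
```

Both curves are parabolas and would look equally plausible frame by frame. Only the fitted parameter separates them, which is the distinction between the visual envelope and the underlying dynamics.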

Figure: A water drop creating concentric ripples on a dark surface.

The ripple from a single drop is governed by surface tension, viscosity, and impact velocity, in ways that are easy to picture and hard to predict from pixels. The PhysicsMind benchmark made the same point with textbook mechanics, finding that current video generators frequently violate center-of-mass and inertia constraints when asked to continue scenes that involve balance, levers, and rotation. The failure modes are consistent with appearance heuristics rather than internalized physics ^[10]. PhyWorld, also from 2026, attempts to close part of the gap by post-training a video generation world model with direct preference optimization over physics-faithful versus physics-violating continuations, improving benchmark scores without claiming the problem is solved ^[11].
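
For readers unfamiliar with direct preference optimization, the standard per-pair loss from the language-model setting is sketched below. PhyWorld's exact objective is not described here, and a video diffusion model does not expose log-probabilities directly, so a real implementation would need a tractable surrogate for them. The function name, the `beta` default, and the numbers are illustrative assumptions.

```python
import math


def dpo_loss(logp_faithful, logp_violating, ref_logp_faithful, ref_logp_violating, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*     : log-probability of each continuation under the model being trained
    ref_logp_* : the same quantities under a frozen reference model
    The loss is -log(sigmoid(margin)), written in a numerically stable form.
    """
    margin = beta * (
        (logp_faithful - ref_logp_faithful)
        - (logp_violating - ref_logp_violating)
    )
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# No preference learned yet: model and reference agree, loss is log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))

# Model has shifted probability toward the physics-faithful continuation.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the model raises the likelihood of the faithful continuation relative to the reference and lowers that of the violating one. The reference terms anchor the update so the model is not rewarded for drifting arbitrarily far from its starting point.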

From a systems perspective, the gap matters because a policy trained against an inaccurate dynamics model will learn behaviors that exploit those inaccuracies. The classical sim-to-real failure mode showed up when hand-coded simulators got friction wrong or contact wrong, and policies optimized in those simulators developed brittle strategies that broke on real hardware. The current generation of the problem is endogenous to the simulator. The dynamics errors are not local quirks of a particular physics engine but distributed across a learned model that was optimized for visual likelihood rather than physical correctness. The SimDist framework from early 2026 acknowledges the residual gap directly, bootstrapping a latent world model in simulation and then adapting it in the real world with online planning and supervised dynamics learning ^[12]. World models began as a way to avoid the real world. The current best practice quietly bolts a real-world feedback loop back on at the end.

Questions a Practitioner Would Ask

When a piece of software becomes infrastructure, the questions shift from whether it works to how it should be operated. A few of those questions follow naturally from where the world model literature is in 2026.

Versioning becomes nontrivial. If a world model is patched, what is the status of the policies that were trained against the previous version? A trained policy is downstream of a learned simulator in roughly the way a model artifact is downstream of its training data. Replaying a policy's training in a refreshed simulator is closer to a recompilation than to a config change, and the operational tooling for that pattern is thinner than the research literature might suggest. Some teams will be tempted to fine-tune the policy against the new simulator instead, which is faster but leaves the policy carrying an implicit history of overfitting to each world model that happened to be live along the way.
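
One way to make that history explicit is to record, with every policy, the fingerprints of the world models it has been trained or fine-tuned against. The sketch below is a hypothetical bookkeeping scheme, not an existing tool: `fingerprint`, `PolicyRecord`, and `status` are invented names, and short byte strings stand in for serialized weights.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


def fingerprint(weights: bytes) -> str:
    """Short content hash identifying one exact world-model artifact."""
    return hashlib.sha256(weights).hexdigest()[:12]


@dataclass(frozen=True)
class PolicyRecord:
    name: str
    # Fingerprints of every world model this policy was trained or
    # fine-tuned against, oldest first.
    lineage: Tuple[str, ...]


def status(policy: PolicyRecord, live: str) -> str:
    """Classify a policy relative to the world model currently in production."""
    if policy.lineage == (live,):
        return "trained on live model"
    if policy.lineage and policy.lineage[-1] == live:
        return f"fine-tuned across {len(policy.lineage)} model versions"
    return "stale"


v1 = fingerprint(b"world-model-weights-v1")
v2 = fingerprint(b"world-model-weights-v2")

policies = [
    PolicyRecord("picker", (v2,)),       # retrained from scratch on v2
    PolicyRecord("stacker", (v1, v2)),   # trained on v1, fine-tuned on v2
    PolicyRecord("loader", (v1,)),       # never updated
]
for p in policies:
    print(p.name, "->", status(p, live=v2))
```

Keeping the whole lineage, rather than only the latest fingerprint, is what lets a team tell the recompilation case from the accumulated fine-tuning case after the fact.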

Evaluation contamination is a second concern. When the world model is itself learned from production data, a policy evaluated against it is being evaluated against a learned approximation of the environment that generated its own training signal. The shape of this circle is familiar from supervised learning, where train-test overlap inflates measured performance, and the field has spent a long time working out auditing tools. Translating those tools to embodied agents and learned simulators is open work. The Stanford AI Index for 2026 notes that benchmarks for physical-world robotics, multiagent coordination, and tool-using agents remain underdeveloped and harder to standardize than text benchmarks, in part because physical tasks involve unpredictable environments and diverse hardware that resist repeatable scoring ^[13]. A learned simulator that is treated as the canonical evaluation harness inherits all of that fragility and adds its own.

Reliability becomes a shared-fate question. A world model that many teams train against becomes a common point of failure in the embodied AI stack, in the way a shared base model is a common point of failure for downstream language applications. If the simulator's physics is wrong in one regime, every policy trained against it may inherit the same blind spot. The pattern is reminiscent of dataset-level overfitting in computer vision, scaled up from images to environments. Mitigations probably look like the ones that emerged for foundation models, including ensemble evaluation across structurally different simulators, held-out scenarios not used during training, and a real-world sample budget reserved specifically for surfacing failures the learned simulator hides.
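
Two of those mitigations, ensemble evaluation and a reserved real-world budget, combine naturally. The sketch below assumes a policy has already been scored on the same scenarios in two structurally different simulators. It flags scenarios where the simulators disagree and splits a fixed number of real-hardware trials among them. The scenario names, scores, threshold, and even split are illustrative choices, not recommendations from the literature.

```python
def disagreement(results):
    """Spread between the best and worst simulator score for each scenario.

    results: {scenario: {simulator_name: success_rate}}
    """
    return {s: max(r.values()) - min(r.values()) for s, r in results.items()}


def allocate_real_trials(results, budget, threshold=0.2):
    """Spend the real-world budget where the simulators disagree most.

    Returns {scenario: trials}, ordered from largest disagreement to smallest.
    The budget is split evenly; any remainder is left unspent.
    """
    gaps = disagreement(results)
    flagged = sorted(
        (s for s, gap in gaps.items() if gap > threshold),
        key=lambda s: gaps[s],
        reverse=True,
    )
    if not flagged:
        return {}
    per_scenario = budget // len(flagged)
    return {s: per_scenario for s in flagged}


results = {
    "pallet_stack": {"sim_a": 0.90, "sim_b": 0.85},
    "wet_floor": {"sim_a": 0.90, "sim_b": 0.40},
    "low_light": {"sim_a": 0.70, "sim_b": 0.40},
}
plan = allocate_real_trials(results, budget=40)
print(plan)
```

Agreement between simulators is not evidence of correctness if they share training data, so a share of the budget would still need to go to scenarios that no simulator flags.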

What This Suggests

World models are crossing a line that data infrastructure crossed earlier. They are no longer a research artifact that some teams happen to use. They are an operational layer that other layers of the stack assume is there and behaves predictably. The research literature has good answers about the algorithms behind them, and the deployment literature is much thinner about what it means to depend on one.

A few things seem likely from where the field stands now. Pre-trained world foundation models, distributed the way base language models are distributed, will continue to be the entry point for most teams. Post-training on environment-specific data will be where the real engineering happens. Physical faithfulness will improve, partly through better data and partly through post-training methods that score continuations against explicit dynamics. The residual gap to real hardware will not close fully in this generation of the technology, which means the operational practices around sim-to-real adaptation, evaluation auditing, and version control will matter more than the algorithmic frontiers most papers describe.

The honest current state is that the technology is improving faster than the operating practices around it. A world model that many teams train against is also a world model whose mistakes many teams inherit. The interesting question for the next year of this research is less about how good the simulators can get and more about how a team should reason when its policies were trained, evaluated, and shipped against a world that does not exist outside the weights of another network.

References

[1] D. Hafner et al., "Mastering Diverse Control Tasks Through World Models," Nature, vol. 640, pp. 647–653, 2025.
[2] E. Alonso et al., "Diffusion for World Modeling: Visual Details Matter in Atari," in Advances in Neural Information Processing Systems (NeurIPS), 2024.
[3] D. Valevski et al., "Diffusion Models Are Real-Time Game Engines," in Proc. International Conference on Learning Representations (ICLR'25), 2025.
[4] J. Bruce et al., "Genie: Generative Interactive Environments," in Proc. International Conference on Machine Learning (ICML'24), 2024.
[5] NVIDIA et al., "Cosmos World Foundation Model Platform for Physical AI," arXiv, 2025.
[6] L. Russell et al., "GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving," arXiv, 2025.
[7] A. Ye et al., "GigaWorld-0: World Models as Data Engine to Empower Embodied AI," arXiv, 2025.
[8] B. Hou et al., "World Model for Robot Learning: A Comprehensive Survey," arXiv, 2026.
[9] R. Upadhyay et al., "WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models," arXiv, 2026.
[10] C. Mak et al., "PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models," arXiv, 2026.
[11] P. Zhao et al., "PhyWorld: Physics-Faithful World Model for Video Generation," arXiv, 2026.
[12] J. Levy et al., "Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation," arXiv, 2026.
[13] N. Maslej et al., "The 2026 AI Index Report, Chapter 2: Technical Performance," Stanford Institute for Human-Centered AI, 2026.
