The Dream Machines

How AI is learning to simulate our physical world

A young girl sits, not in front of a screen, but within a world of her own making. With a thought, she conjures a cyberpunk metropolis—a sprawling cityscape alive with neon lights and towering skyscrapers. The air is thick with the scent of rain as crowds of people navigate elevated walkways under umbrellas, their reflections shimmering on wet surfaces below. She slips into the body of a luminous koi, diving through this immersive world from an aquatic perspective. The city comes alive around her, its neon glow reflecting off her scales as she swims past towering buildings and floating advertisements. She is not just playing a game; she is living in a dream—a world that responds to her every whim, a world that learns and grows with her. This is not a scene from a distant science fiction novel. This is the future that “dream machines” like Genie 3 are beginning to build, one pixel at a time.

Figure 1. A sample world generated by Genie 3. Clip from @apples_jimmy and @MattMcGill_ on X.

These models aren’t just tools for creating games. They are engines of experience, simulators of reality, and perhaps, the key to unlocking the next stage of artificial general intelligence (AGI). But what does it mean when the line between our dreams and our digital realities begins to blur? In this post, we will explore how foundation world models like Genie are reshaping our digital world and where they might take us next.

The birth of dream machines

For years, AI has dazzled us with its creative abilities, from writing eloquent stories and generating stunning artwork to producing convincing video. But now, with models like Genie, we are witnessing a new kind of breakthrough. Rather than simply creating content to be observed, these models generate worlds that can be explored and shaped in real time. For example, we can now generate a 3D world from a single image, and even interact with it in real time. This shift marks the beginning of what NVIDIA’s Jensen Huang envisioned—a future where every single pixel will be generated, not rendered.

The path to interactive world generation began with a crucial realization: the most sophisticated video generation models were inadvertently learning to simulate reality. When OpenAI unveiled Sora in early 2024, they explicitly positioned it not just as a video generator but as a “world simulator,” claiming that scaling video generation models is a promising path toward building general-purpose simulators of the physical world. What made Sora remarkable wasn’t just its visual fidelity, but its apparent understanding of physical laws. Objects moved with convincing momentum, liquids flowed naturally, and complex interactions emerged without explicit programming. The model had learned these behaviors by observing millions of hours of video, internalizing patterns of how the world works at a level that went far beyond surface appearances.

Figure 2. A video generated by Sora using the prompt: “Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee.”

Google’s Veo 3 pushed these capabilities further, offering unprecedented creative control through reference images, camera movement specifications, and synchronized audio generation. The result was a new genre of AI-generated content, including entirely novel forms like AI ASMR videos that pushed the boundaries of synthetic media.

Yet for all their sophistication, these systems shared a fundamental limitation that highlighted the next frontier. You could watch their generated worlds, but you couldn’t inhabit them: while models like Sora and Veo can generate stunning, immersive scenes, they lack the interactivity to let users freely explore or alter the environment in real time. This gap between observation and interaction represents one of the most significant challenges in AI today: how do we move from systems that generate convincing simulations to systems that generate inhabitable realities? The answer lies in so-called “world models,” AI systems that don’t just generate plausible content, but maintain consistent internal representations of how worlds work.

What is a “world model”?

Before we dive deeper, let’s clarify what we mean by a “world model.”

A world model is a system that can simulate the dynamics of an environment.

In other words, it is a model that can predict how actions change states and how the environment evolves over time. Perhaps the best way to understand world models is to consider how humans operate. As Jay Wright Forrester, a pioneer of system dynamics, observed:

“The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system.”

To understand this better, consider the following intuitive example:

Imagine you’re playing a baseball game. You have mere milliseconds to decide how to swing—less time than it takes for visual signals to travel from eyes to brain. Yet professional players consistently make contact. How? Their brains have developed predictive models that can anticipate where and when the ball will arrive, allowing for subconscious, reflexive responses based on internal simulations of the ball’s trajectory.

AI world models also use similar principles to simulate our physical world. They learn the “rules” not through explicit programming, but by observing countless examples of how things behave. For instance, a world model might discover that water flows downward and around obstacles, objects cast shadows that change with lighting, and characters maintain consistent appearances from different angles.

How to build a world model?

The classic world model, proposed by David Ha and Jürgen Schmidhuber, consists of three key components that work together to create and navigate simulated realities:

The Vision (V) component takes high-dimensional observations and encodes them into compact, meaningful representations. This is similar to how our brains process visual information: we can recognize objects even when they are partially occluded or seen in different lighting.

The Memory (M) component learns temporal patterns and predicts future states based on past experience, much like how our brains anticipate what will happen next from what we have seen before.

The Controller (C) component maps the current compressed state and the predicted future to actions, just as our brains plan what to do based on where we are and where we expect to be.


Figure 3. Overview of a world model architecture showing the interaction between the Vision (V), Memory (M), and Controller (C) components.
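
To make these three roles concrete, here is a minimal PyTorch sketch of one perceive-act-predict step in a V-M-C world model. The network sizes, module choices, and the toy observation below are illustrative assumptions, not the architecture from the original paper.

```python
# A minimal sketch of the classic V-M-C world model loop.
# All sizes and module choices are illustrative, not the original hyperparameters.
import torch
import torch.nn as nn

class Vision(nn.Module):
    """V: compress a high-dimensional observation into a small latent code z."""
    def __init__(self, obs_dim=64 * 64 * 3, z_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))

    def forward(self, obs):
        return self.encoder(obs)

class Memory(nn.Module):
    """M: predict the next latent state from the current latent and action."""
    def __init__(self, z_dim=32, action_dim=3, hidden=128):
        super().__init__()
        self.rnn = nn.GRUCell(z_dim + action_dim, hidden)
        self.to_z = nn.Linear(hidden, z_dim)

    def forward(self, z, action, h):
        h = self.rnn(torch.cat([z, action], dim=-1), h)
        return self.to_z(h), h

class Controller(nn.Module):
    """C: map the current latent and memory state to an action."""
    def __init__(self, z_dim=32, hidden=128, action_dim=3):
        super().__init__()
        self.policy = nn.Linear(z_dim + hidden, action_dim)

    def forward(self, z, h):
        return torch.tanh(self.policy(torch.cat([z, h], dim=-1)))

# One step of the perceive -> act -> predict loop.
V, M, C = Vision(), Memory(), Controller()
obs = torch.rand(1, 3, 64, 64)   # a fake camera frame
h = torch.zeros(1, 128)          # M's hidden (memory) state
z = V(obs)                       # V compresses the observation
action = C(z, h)                 # C picks an action
z_pred, h = M(z, action, h)      # M predicts the next latent state
```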

Learning inside dreams

Perhaps the most remarkable capability of world models is the ability to “learn inside dreams.” Instead of learning in the real world, an agent can learn to perform tasks entirely within the simulated environment generated by its own world model. The process works like this:

  1. The world model observes the real environment and learns its dynamics.
  2. The controller trains by taking actions in this learned simulation, experiencing consequences and rewards without ever touching the actual environment.
  3. The trained policy transfers back to reality.

This approach offers a clear advantage over traditional simulators:

Whereas traditional simulators rely on people manually programming how things move and interact, including rare or complex situations, world models learn these behaviors by analyzing real-world data. This lets them capture details that humans might overlook or find too hard to describe.
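
Below is a minimal sketch of what training inside the dream can look like: a controller is optimized purely on latent rollouts produced by a frozen, already-trained dynamics model and reward head. Every module, size, and the loop itself are simplified stand-ins for illustration rather than any particular published setup.

```python
# "Learning inside dreams": optimize the controller on imagined rollouts only.
# The dynamics, reward, and policy networks below are toy stand-ins.
import torch
import torch.nn as nn

z_dim, action_dim, horizon = 32, 3, 15

dynamics = nn.Linear(z_dim + action_dim, z_dim)   # frozen learned world model (M)
reward_model = nn.Linear(z_dim, 1)                # frozen learned reward head
policy = nn.Sequential(nn.Linear(z_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))

for p in list(dynamics.parameters()) + list(reward_model.parameters()):
    p.requires_grad_(False)                       # only the controller learns

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(100):
    z = torch.randn(16, z_dim)                    # batch of imagined start states
    total_reward = 0.0
    for t in range(horizon):                      # roll out entirely in latent space
        action = torch.tanh(policy(z))
        z = dynamics(torch.cat([z, action], dim=-1))
        total_reward = total_reward + reward_model(z).mean()
    loss = -total_reward                          # maximize the imagined return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained policy can then be transferred back to the real environment.
```

Because every imagined step in this sketch is differentiable, the controller can be updated by plain gradient descent on the imagined return without ever querying the real environment.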

A whole new world

Building on the foundational insights from world models research, DeepMind’s Genie represents a significant leap forward. While the original world models work focused on learning compressed representations for efficient control in constrained environments, Genie scales this vision to create photorealistic, explorable worlds that respond to human input in real-time.

How Genie works

Unlike traditional game engines that rely on hand-coded physics and pre-designed assets, or video models that generate fixed sequences, Genie learns to create controllable environments entirely from observing unlabeled internet videos (over 200,000 hours of publicly available gaming footage), without being explicitly taught anything about the environments. Genie consists of three main components that work together to enable interactive world generation:

Video tokenizer: converts raw video frames into compressed discrete tokens that capture both spatial and temporal patterns. Rather than processing each frame independently, this component uses a spatiotemporal approach that understands how visual elements change over time. It compresses 16×16 pixel patches across multiple frames into discrete tokens, learning to represent not just what objects look like, but how they move and change. This compression is crucial: it reduces the computational burden while preserving the essential dynamics needed for interactive control.

Latent action model: discovers and learns a discrete action space entirely from observing video transitions, without any action labels. This component looks at pairs of consecutive video frames and learns to infer what “action” must have occurred to cause the transition from frame A to frame B.

Dynamics model: generates the next frame tokens given the current state and a chosen latent action. When a user selects an action, the dynamics model predicts how the world should change, generating new video tokens that maintain visual and physical consistency with the previous frame. It uses an autoregressive architecture based on MaskGIT, which generates video tokens in parallel rather than sequentially.


Figure 4. Genie takes in $T$ frames of video as input, tokenizes them into discrete tokens $\mathbf{z}$ via the video tokenizer, and infers the latent actions $\tilde{\mathbf{a}}$ between each frame with the latent action model. Both are then passed to the dynamics model to generate predictions for the next $T$ frames in an iterative manner.

What makes Genie remarkable is how these components learn to work together without explicit supervision. The system watches millions of video transitions and automatically discovers that certain types of changes occur repeatedly—characters moving in different directions, jumping, interacting with objects. It learns to represent these as discrete latent actions. Simultaneously, the dynamics model learns to predict what happens when each type of action is taken in different contexts. It develops an understanding of physics, object interactions, and environmental consistency. All three components are trained together, creating a feedback loop where better action recognition improves dynamics prediction, and better dynamics prediction enables more precise action discovery.
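
The following toy sketch illustrates the flavor of this interactive loop: the current frame is represented as discrete tokens, a chosen discrete latent action is embedded and added to them, and a small transformer predicts the next frame’s tokens. Every module, dimension, and the greedy decoding step here are simplified assumptions for illustration; this is not Genie’s actual tokenizer, latent action model, or MaskGIT dynamics model.

```python
# A highly simplified sketch of an interactive, action-conditioned frame predictor.
import torch
import torch.nn as nn

vocab_size, num_actions, tokens_per_frame, d_model = 1024, 8, 16 * 16, 128

frame_embed  = nn.Embedding(vocab_size, d_model)    # stand-in for tokenizer output embeddings
action_embed = nn.Embedding(num_actions, d_model)   # one of the discrete latent actions
dynamics = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
to_logits = nn.Linear(d_model, vocab_size)

def predict_next_frame(frame_tokens, action_id):
    """Predict next-frame token ids given current frame tokens and a latent action."""
    x = frame_embed(frame_tokens) + action_embed(action_id)[:, None, :]
    logits = to_logits(dynamics(x))                  # (batch, tokens, vocab)
    return logits.argmax(dim=-1)                     # greedy decoding for simplicity

# One interaction step: the "user" presses latent action 3.
current = torch.randint(0, vocab_size, (1, tokens_per_frame))
next_tokens = predict_next_frame(current, torch.tensor([3]))
print(next_tokens.shape)  # torch.Size([1, 256]); a tokenizer decoder would map these back to pixels
```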

Beyond 2D worlds

The original Genie’s transformation of 2D sprite-based games into interactive, explorable worlds was just the beginning. By late 2024, DeepMind had set its sights on a far more ambitious target: scaling these insights to create fully three-dimensional, photorealistic worlds that could rival modern game engines in visual quality while surpassing them in creative flexibility. Just eight months after the original Genie captured the world’s imagination with its 2D interactive environments, DeepMind unveiled Genie 2—a foundation world model that represents one of the most significant advances in AI-generated interactive content to date. Where Genie transformed simple 2D videos into playable experiences, Genie 2 creates rich, three-dimensional worlds from nothing more than a single prompt image.


Figure 5. Overview of the diffusion world model used in Genie 2.

While the original Genie operated on discrete video tokens in 2D space, Genie 2 employs an autoregressive latent diffusion model trained on a massive dataset of 3D game videos. This hybrid approach combines the sequential prediction capabilities of autoregressive models with the high-quality generation of diffusion models. The system processes video through a sophisticated autoencoder that maps high-resolution 3D scenes into a compressed latent space. Within this space, a large transformer dynamics model—similar in structure to large language models but adapted for spatial-temporal prediction—learns to generate coherent sequences of 3D environments. The use of classifier-free guidance during inference allows for precise control over action execution, ensuring that user inputs translate reliably into desired environmental changes.
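
As a rough illustration of how classifier-free guidance steers generation toward the user’s input, here is a minimal sketch that blends a conditional and an unconditional prediction from a toy denoiser. The denoiser, the one-hot action encoding, and the guidance scale are assumptions for illustration only; they are not Genie 2’s published implementation.

```python
# A minimal sketch of classifier-free guidance for action conditioning.
import torch
import torch.nn as nn

latent_dim, action_dim = 64, 8

class Denoiser(nn.Module):
    """Toy stand-in for a diffusion denoiser conditioned on an action vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + action_dim, 256), nn.SiLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, x, action):
        return self.net(torch.cat([x, action], dim=-1))

denoiser = Denoiser()
x = torch.randn(1, latent_dim)                           # noisy latent of the next frame
action = torch.zeros(1, action_dim); action[0, 2] = 1.0  # user input as a one-hot vector
null_action = torch.zeros(1, action_dim)                 # "no conditioning" input

guidance_scale = 3.0
eps_cond = denoiser(x, action)        # prediction with the action
eps_uncond = denoiser(x, null_action) # prediction without it
# Guided prediction: push the sample toward the action-conditional direction.
eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Intuitively, a larger guidance scale makes the generated frames follow the user’s action more strictly, at the cost of some diversity.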

One of Genie 2’s most impressive capabilities is its ability to transform real-world photographs into interactive 3D environments. Show it a picture of a forest path, and it generates a navigable woodland where grass sways in the wind and leaves rustle overhead. Provide an image of a rushing river, and it creates a dynamic aquatic environment with flowing water and realistic fluid dynamics. This capability suggests that Genie 2 has developed sophisticated scene understanding that goes beyond simple pattern matching. The model appears to infer the three-dimensional structure of scenes, the likely physics governing environmental elements, and the potential interaction affordances—all from a single static image.

Figure 6. An environment concept by Max Cant transformed into a 3D world by Genie 2.

Interactive 3D worlds

More recently, DeepMind released Genie 3, which represents the next evolution in interactive world generation. While Genie 2 demonstrated the ability to create 3D environments from single images, Genie 3 transforms these capabilities into truly real-time, high-fidelity interactive experiences that approach the quality and responsiveness of modern game engines.

Perhaps Genie 3’s most impressive advancement is its visual memory, which keeps objects, textures, and even text consistent for up to a minute. Turn away from a scene and look back: the world remains exactly as you left it, with objects in their previous positions and environmental details intact. This consistency enables longer storytelling sessions, complex navigation tasks, and meaningful interaction with persistent world elements. It is the difference between a fleeting dream and a stable reality you can truly inhabit.

Figure 7. A demonstration of Genie 3’s visual memory, where the world remains consistent even when the camera is turned away.

Genie 3 also introduces promptable world events: you can instantly transform the world (e.g., change the weather, add a character, or trigger an event) using natural language. These changes integrate seamlessly into the ongoing experience without breaking immersion or requiring scene resets. This capability also enables the generation of “what if” scenarios that agents can learn from to handle unforeseen events.

Figure 8. With Genie 3, we can use natural language prompts like “spawn a brown bear” to trigger events in the world.

The progression from Genie 1 to Genie 3 is mind-blowing considering that the timeframe was only about 1.5 years! Here’s a table comparing the features of the three generations:

| Feature | Genie 1 | Genie 2 | Genie 3 |
| --- | --- | --- | --- |
| Resolution | Low (2D sprites) | 360p | 720p |
| Control | Basic 2D actions | Limited keyboard/mouse actions | Navigation + promptable world events |
| Interaction latency | Not real-time | Not real-time | Real-time |
| Interaction horizon | Few seconds | 10–20 seconds | Multiple minutes |
| Visual memory | Minimal consistency | Minimal; scenes changed quickly | Remembers objects and details for ~1 minute |
| Scene consistency | 2D sprite coherence | Frequent visual shifts in 3D | Stable, believable 3D environments |

Waking up to a new reality

We stand at the threshold of a transformative era, one where the ancient human dream of creation becomes as accessible as natural language. Dream machines like Genie represent more than a technological achievement: they herald a fundamental shift in how we conceive of digital creation, learning, and experience.

Yet perhaps the most exciting part is what we cannot yet imagine. We are likely seeing only a glimpse of what becomes possible when anyone can conjure interactive worlds from imagination alone; as Tim Rocktäschel tweeted, “we have only scratched the surface of what can be done with prompting and post-training of foundational world models.” The challenges ahead are substantial and real: as DeepMind suggests, computational costs and accessibility barriers are not the only obstacles, and there are also ethical concerns about authenticity and potential misuse. Still, history suggests a familiar pattern: “the most transformative technologies initially seem impossible, then inevitable.” Dream machines appear to be following this well-trodden path, moving rapidly from research curiosity to practical capability.

The question is not whether this future will arrive, but how we will shape it as it emerges. The dream machines are awakening, offering unprecedented creative possibilities while challenging fundamental assumptions about reality, creativity, and human-AI collaboration. As we stand at this inflection point, we have the opportunity, and the responsibility, to guide this technology toward applications that amplify human creativity, accelerate learning, and expand the boundaries of what we can experience and achieve together.

If you could dream up any world and bring it to life, what would you create?

Citation

If you find this post useful, please cite it as:

Suwandi, R. C. (Aug 2025). The Dream Machines. Posterior Update. https://richardcsuwandi.github.io/blog/2025/dream-machines/.

Or in BibTeX format:

@article{suwandi2025dream,
    title   = "The Dream Machines",
    author  = "Suwandi, Richard Cornelius",
    journal = "Posterior Update",
    year    = "2025",
    month   = "Aug",
    url     = "https://richardcsuwandi.github.io/blog/2025/dream-machines/"
}