The Science of Intelligent Exploration

Why we need to re-center exploration in AI

One of the most thought-provoking moments for me at ICML 2025 didn’t come from a new architecture or a scaling law. It emerged from a simple, unsettling question: What happens when AI stops exploring? Recent breakthroughs in AI—especially in LLMs—have been fueled not by curiosity, but by curation. By training on vast amounts of human-generated data, these models bypass the messy, uncertain process of active exploration. Instead, they absorb a pre-digested version of our collective knowledge, effectively “pre-exploring” the world through the lens of what’s already been written. (Eric Jang once called these models excellent “data sponges”: they are remarkably good at memorizing vast amounts of data, and they can do so quickly by training with batch sizes in the tens of thousands.) While these models can recombine, paraphrase, and simulate, they rarely discover new things. This is the core mission of the Exploration in AI Today (EXAIT) Workshop at ICML 2025: to confront the quiet crisis of over-exploitation and re-center exploration as the engine of progress in modern AI. Because whether it’s a robot learning to walk, a recommender system fighting filter bubbles (the phenomenon where users are only ever shown familiar content), or an AI searching for a drug in a vast space of possible molecules, the path to breakthroughs isn’t paved by more data alone. In this post, we will explore the science of intelligent exploration, from the basics of novelty search to the cutting edge of open-endedness.

EXAIT Workshop

Figure 1. List of research questions at the EXAIT Workshop at ICML 2025.

Embracing the unexpected

Let’s start with a counterintuitive concept that flips traditional optimization on its head: novelty search. Imagine trying to solve a maze by obsessively chasing the exit, only to hit dead end after dead end. Now imagine wandering the maze, seeking out new paths regardless of the goal—and stumbling upon the exit by accident. This is the essence of novelty search, a paradigm that prioritizes exploring new behaviors over optimizing for a specific objective. The novelty of a new solution is measured by its distance (typically Euclidean) from previously discovered behaviors in a so-called behavior characterization (BC) space. A BC is a set of features that describe how an agent behaves: for a robot, this might be the sequence of positions it visits; for a neural network, it could be the activation patterns it produces. The algorithm maintains an archive of all discovered behaviors and calculates novelty as:

\[\text{Novelty}(x) = \frac{1}{k} \sum_{i=1}^{k} \text{distance}(x, x_i)\]

where the $x_i$ are the $k$ nearest neighbors of $x$ in the archive. This encourages agents to venture into unexplored regions of behavior space, creating a diverse portfolio of solutions.
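As a concrete illustration, here is a minimal sketch of this computation, assuming behaviors are encoded as fixed-length vectors (e.g., a robot’s final $(x, y)$ position in the maze); the function and variable names are illustrative, not taken from any particular library:

```python
import numpy as np

def novelty(x, archive, k=15):
    """Mean distance from behavior x to its k nearest neighbors in the archive."""
    if not archive:
        return float("inf")  # an empty archive makes every behavior maximally novel
    dists = np.linalg.norm(np.asarray(archive) - np.asarray(x), axis=1)
    k = min(k, len(dists))
    return float(np.sort(dists)[:k].mean())

# Behaviors are added to the archive only if they are sufficiently novel
archive, threshold = [], 1.0
for candidate in [np.array([0.0, 0.0]), np.array([3.0, 4.0])]:
    if novelty(candidate, archive) > threshold:
        archive.append(candidate)
```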

In a classic maze experiment, algorithms chasing rewards failed to escape complex mazes, getting trapped in local optima. But those maximizing novelty—exploring diverse paths without fixating on the goal—succeeded. Why? Because chasing ambitious objectives can lead to deception, where the objective function becomes a false compass. Consider the “deceptive maze” where the path to the goal requires initially moving away from it. Traditional fitness-based search gets trapped in dead ends that appear promising (high fitness) but lead nowhere. Novelty search, by ignoring the goal entirely, naturally explores the entire maze and discovers the true path.

Fitness vs Novelty

Figure 2. Novelty search vs. fitness-based search in a maze.

The stepping stones to success often look nothing like the goal itself. For example, the path from abacuses to laptops involved seemingly unrelated innovations like electricity and vacuum tubes. A striking demonstration of this is Picbreeder, a platform where users collaboratively evolve images by simply picking whatever looks interesting, with no fixed objective. In one famous case, an image of a car emerged from a user who had been evolving an alien face: the eyes gradually turned into wheels. No one who set out to find a car directly ever did. The lesson? Ignoring the objective can sometimes get you there faster.

Picbreeder

Figure 3. What Picbreeder shows: the stepping stones almost never resemble the final product! You can only find things by not looking for them.

But novelty alone isn’t enough. What if we could balance exploration with quality? That’s where quality diversity comes in.

Beyond a single solution

While novelty search embraces exploration, quality diversity (QD) takes it a step further by seeking diverse solutions that are also high-performing. Instead of finding a single “best” solution, QD algorithms like MAP-Elites or Go-Explore illuminate the entire space of possibilities, collecting a portfolio of solutions that solve a problem in different ways. MAP-Elites discretizes the behavior space into a grid (the “map”), with each cell representing a unique combination of behavioral features. The algorithm seeks to fill each cell with the highest-performing solution found for that behavior type, creating a diverse “archive” of elite solutions. The process is elegantly simple (a minimal code sketch follows the list below):

  1. Initialize: Create an empty map with predefined behavioral dimensions
  2. Generate: Create new solutions through mutation or crossover
  3. Evaluate: Measure both performance (fitness) and behavior characteristics
  4. Place: Assign each solution to its corresponding map cell
  5. Select: Keep only the best-performing solution in each cell
  6. Repeat: Continue until the map is sufficiently filled
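Here is a minimal sketch of that loop, assuming the reader supplies the problem-specific pieces: `evaluate` (fitness), `behavior` (a descriptor normalized to $[0, 1]$ per dimension), `random_solution`, and `mutate`. All of these names are placeholders rather than an official API:

```python
import random
import numpy as np

def map_elites(evaluate, behavior, random_solution, mutate,
               grid_shape=(10, 10), iterations=10_000):
    """Minimal MAP-Elites loop: keep one elite (solution, fitness) per behavior cell."""
    archive = {}  # cell index -> (solution, fitness)
    for _ in range(iterations):
        if archive:
            parent, _ = random.choice(list(archive.values()))  # pick an existing elite
            child = mutate(parent)                              # generate by mutation
        else:
            child = random_solution()                           # bootstrap the map
        fitness = evaluate(child)                               # measure performance
        bc = np.clip(behavior(child), 0.0, 1.0 - 1e-9)          # behavior descriptor in [0, 1)
        cell = tuple((bc * np.array(grid_shape)).astype(int))   # place into its grid cell
        if cell not in archive or fitness > archive[cell][1]:   # keep only the best per cell
            archive[cell] = (child, fitness)
    return archive
```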

MAP-Elites

Figure 4. MAP-Elites in action.

The result is a comprehensive “atlas” of high-quality solutions across the entire behavioral landscape. This connects to the concept of **illumination**: the goal is not just to find good solutions, but to understand the entire fitness landscape. The approach answers the question: “What is the best possible performance achievable for each way of solving this problem?” By explicitly maintaining diversity, QD algorithms also prevent the population from collapsing to a single solution type, a failure mode known as premature convergence in optimization.

One of QD’s most striking successes came in robotics, where MAP-Elites generated a large map of diverse walking gaits for a six-legged robot. When the robot loses a leg, it can rapidly search this pre-computed archive for a gait that still works despite the damage, recovering in under two minutes instead of re-learning to walk from scratch.

Another notable algorithm in the QD family is Go-Explore. It works in two phases: first, it returns to previously archived states and explores onward from them, remembering every promising state it encounters—even those that seem irrelevant at first. Then, in a second phase, it robustifies these solutions to ensure they work reliably in the real, noisy environment. By explicitly separating exploration from robustification, Go-Explore was able to crack Montezuma’s Revenge, discovering not just one way to win, but mapping out a constellation of valuable approaches.
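A minimal sketch of the exploration phase is below. It assumes a Gym-style, deterministic emulator that exposes hypothetical `get_state`/`set_state` methods for saving and restoring its exact state (as Go-Explore does with Atari), and a placeholder `cell_fn` that maps an observation to a coarse, discretized “cell”:

```python
import random

def go_explore_phase1(env, cell_fn, iterations=10_000, explore_steps=100):
    """Sketch of Go-Explore's exploration phase: return to an archived cell, then explore."""
    obs = env.reset()
    archive = {cell_fn(obs): (env.get_state(), 0.0)}  # cell -> (restorable state, best score)
    for _ in range(iterations):
        # Go: pick an archived cell and restore the emulator to that exact state
        state, score = archive[random.choice(list(archive.keys()))]
        env.set_state(state)
        # Explore: take random actions and archive any new or better cells discovered
        for _ in range(explore_steps):
            obs, reward, done, _ = env.step(env.action_space.sample())
            score += reward
            cell = cell_fn(obs)
            if cell not in archive or score > archive[cell][1]:
                archive[cell] = (env.get_state(), score)
            if done:
                break
    return archive
```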

Yet, QD still operates within a finite, predefined domain. What if we could go beyond finding what’s possible and start inventing new possibilities? This brings us to open-ended algorithms.

Towards endless innovation

Open-ended algorithms aim to mimic the boundless creativity of natural evolution or human culture. Unlike traditional algorithms that converge on a solution, open-ended systems diverge, endlessly generating new challenges and solving them. The goal? To keep learning and innovating, no matter how much time or compute is available.

Interested readers can refer to my previous post for a comprehensive overview of open-endedness.

A prime example of modern open-endedness is the Paired Open-Ended Trailblazer (POET) algorithm. POET creates a dynamic ecosystem of tasks and agents, where each agent evolves to tackle new challenges generated by the system itself. The process is as follows (a simplified sketch of one iteration appears after the list):

  1. Environment Generation: Create new training environments by mutating existing ones (e.g., changing terrain difficulty, adding obstacles)
  2. Agent Training: Each environment trains its own population of agents using standard RL
  3. Transfer Evaluation: Regularly test agents on environments other than their native ones
  4. Selective Transfer: Move high-performing agents to environments where they can contribute
  5. Environment Selection: Keep only environments that satisfy a minimal criterion (not too easy, not impossibly hard)
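Below is a heavily simplified sketch of one POET-style iteration. All helper functions (`mutate_env`, `train`, `evaluate`) and the thresholds defining the minimal criterion are placeholders the reader would supply; the real algorithm maintains populations with evolution strategies and more careful transfer tests:

```python
import random

def poet_iteration(pairs, mutate_env, train, evaluate,
                   min_score=10.0, max_score=200.0, capacity=20):
    """One iteration over a list of (environment, agent) pairs, POET-style."""
    # 1. Environment generation: mutate existing environments to propose new ones
    candidates = [mutate_env(env) for env, _ in random.sample(pairs, k=min(3, len(pairs)))]
    for env in candidates:
        # 5. Environment selection: keep only environments satisfying the minimal
        #    criterion (some current agent does neither too poorly nor too well)
        scores = [evaluate(env, agent) for _, agent in pairs]
        if min_score < max(scores) < max_score and len(pairs) < capacity:
            pairs.append((env, pairs[scores.index(max(scores))][1]))  # seed with best agent
    # 2. Agent training: optimize each agent on its paired environment
    pairs = [(env, train(env, agent)) for env, agent in pairs]
    # 3-4. Transfer: if another pair's agent does better on an environment, adopt it
    for i, (env, agent) in enumerate(pairs):
        best_agent = max((a for _, a in pairs), key=lambda a: evaluate(env, a))
        if evaluate(env, best_agent) > evaluate(env, agent):
            pairs[i] = (env, best_agent)
    return pairs
```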

The most interesting part lies in the co-evolutionary arms race: as agents get better, environments become more challenging; as environments become harder, agents must develop more sophisticated strategies. Open-endedness is about more than solving problems—it’s about creating a system that generates its own problems and learns from them. This brings us to the next frontier: AI-generating algorithms (AI-GAs).

A path to general intelligence

In his 2019 paper, Jeff Clune proposed AI-GAs as a path to AGI, built on three pillars:

  1. Meta-learning architectures: Automatically designing neural network structures tailored to specific tasks.
  2. Meta-learning learning algorithms: Evolving the rules of learning itself, like how gradients are updated.
  3. Generating effective learning environments: Creating diverse, challenging environments to train AI systems.

Meta-learning architectures

Traditional neural architecture search (NAS) focuses on finding good architectures for specific datasets. AI-GA approaches go further by evolving architectures that can quickly adapt to new tasks: instead of hand-designing architectures, we let the search process discover designs optimized for specific problem classes or computational constraints.

Meta-learning learning algorithms

This involves evolving not just what the network learns, but how it learns, such as the update rules themselves.

Generating effective learning environments

Traditional AI training has relied on fixed datasets or hand-crafted environments. While this approach has enabled progress, it is fundamentally limited: hand-coding environments is brittle, and it is notoriously difficult to define what makes a task “interesting” or “useful” for learning. Early attempts to automate environment generation often relied on simple, hand-crafted heuristics.

But such heuristics miss the nuanced understanding of what makes a problem genuinely interesting or valuable for developing intelligence. A recent example that addresses this is OMNI. OMNI uses foundation models (FMs) to propose and implement new reinforcement learning tasks that maximize agent learning progress and align with human intuitions about what is “interesting.” The core idea is to use the FM’s broad knowledge (FMs are trained on vast internet data and implicitly understand what humans find interesting—they’ve read our blogs, tweets, and papers, after all) to guide the creation of a diverse and ever-expanding set of environments.

OMNI

Figure 5. OMNI combines a learning-progress auto-curriculum with a model of interestingness to train an RL agent in a task-conditioned manner.

Despite this, OMNI is still fundamentally limited by the scope of its environment generator, which is typically confined to a narrow, predefined distribution of tasks. This restricts the true potential of open-ended learning, which aspires to create agents capable of tackling an unbounded variety of challenges. The grand vision of open-endedness in AI is to continuously generate and solve increasingly complex and diverse tasks, much like the creative explosion seen in biological evolution and human culture. Achieving this would require algorithms that can operate within a truly vast—ideally infinite—space of possible environments. A key concept here is Darwin Completeness.

Darwin Completeness is the ability of an environment generator to, in principle, create any possible learning environment. This means not just tweaking parameters within a fixed simulator, but being able to generate entirely new worlds, rules, and reward structures.

OMNI-EPIC is a recent framework that takes a major step toward Darwin Completeness. It augments the OMNI approach by leveraging foundation models not just to select the next interesting and learnable task, but also to generate the code for entirely new environments and reward functions. In OMNI-EPIC, the FM can write Python code to specify new simulated worlds, define novel reward and termination conditions, and even modify or create new simulators if needed. This enables OMNI-EPIC to, in principle, generate any computable environment—ranging from physical obstacle courses to logic puzzles or even quests in virtual worlds.
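To make the loop concrete, here is a rough, hypothetical sketch of an OMNI-EPIC-style iteration. The `fm` callable stands in for a foundation-model API, `train_and_eval` for an RL training run that reports learning progress, and the `make_env` entry point expected in the generated code is an assumption of this sketch rather than part of the published system:

```python
def omni_epic_style_loop(fm, train_and_eval, archive, iterations=10):
    """Hypothetical OMNI-EPIC-style loop: an FM writes environment code, an agent trains on it."""
    for _ in range(iterations):
        # Ask the FM for a new task that is interesting and learnable, conditioned
        # on descriptions of the environments generated so far
        prompt = (
            "Write Python defining make_env() for a new, interesting, learnable task.\n"
            "Existing tasks:\n" + "\n".join(task["description"] for task in archive)
        )
        env_code = fm(prompt)             # FM returns Python source for the environment
        namespace = {}
        exec(env_code, namespace)         # turn the generated source into runnable code
        env = namespace["make_env"]()     # assumed entry point of the generated code
        progress = train_and_eval(env)    # train an agent and measure its learning progress
        archive.append({"description": env_code, "learning_progress": progress})
    return archive
```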

OMNI-EPIC

Figure 6. Examples of environments generated by OMNI-EPIC. All of these were generated from only three initial seed environments!

Another notable advance in this direction is Genie, a foundation world model developed by Google DeepMind. Genie, and especially its latest version Genie 2, represents a significant leap forward in the automatic generation of diverse, interactive environments for both human and AI agents. Genie 2 is designed to generate an endless variety of action-controllable, playable 3D environments from a single prompt image. Unlike earlier world models that were limited to narrow domains or 2D settings, Genie 2 can create rich, fully interactive 3D worlds with emergent properties such as object interactions, complex character animations, realistic physics (including gravity, water, smoke, and lighting effects), and dynamic environmental responses. These environments are not just visually diverse; they are also physically consistent and can be explored and manipulated by agents in real time. A key feature of Genie 2 is its ability to rapidly prototype new interactive experiences: researchers and designers can prompt it with concept art, drawings, or even real-world images, and the model will generate a corresponding 3D world that can be immediately explored by an agent. This enables a new workflow for environment design, where creative ideas can be quickly tested and iterated upon, dramatically accelerating the pace of research and development.

Figure 7. From concept art and drawings to fully interactive environments.

To showcase the power of Genie 2, DeepMind paired it with SIMA, a generalist agent capable of following natural language instructions and acting within a wide range of Genie-generated environments. SIMA can be given high-level goals—such as “open the blue door” or “go up the stairs”—and will control an avatar using keyboard and mouse inputs to accomplish these tasks, even in worlds it has never seen before.

Figure 8. SIMA can follow natural language instructions in an unseen environment. The environment is generated via a single prompt image using Imagen and turned into a 3D world by Genie 2.

The combination of a powerful environment generator and an agent forms a virtuous cycle: as the environment generator creates new worlds, the agent must adapt and learn, and their progress can be used to further refine both the agent and the environment generation process.

Exploration is the future

One way to understand the urgency of re-centering exploration in AI is through the lens of the emerging Software² paradigm. While traditional deep learning (Software 2.0) focuses on learning from vast, static datasets, Software² envisions a new generation of AI systems that actively seek out and generate their own training data. This shift—from passively absorbing curated data to actively exploring and producing new, informative experiences—places exploration at the heart of progress. In this view, the ability of an AI to decide what data to learn from, and to continually expand its own learning environment, becomes a critical driver of generality and innovation. As we move toward more open-ended and self-improving AI, the science of exploration is poised to become the central engine of advancement.

Software²

Figure 9. Software² rests on a form of generalized exploration for active data collection. Unlike existing notions of exploration in RL and SL (where it takes the form of active learning), generalized exploration seeks the most informative samples from the full data space.

Closely related is the vision articulated in The Era of Experience, which argues that the next leap in AI will come not from scaling up static data, but from enabling agents to learn through rich, interactive experiences. In this new era, AI systems will continually generate, seek out, and learn from novel experiences—mirroring the way humans and animals learn by engaging with the world. Exploration, therefore, is not just a technical detail, but the foundation of a new paradigm where experience itself becomes the primary driver of intelligence.

Era of Experience

Figure 10. We are currently transitioning from the Era of Data to the Era of Experience.

Takeaways

Intelligent exploration lies at the heart of discovery, creativity, and adaptation—across science, innovation, and AI. We have just seen that breakthroughs rarely come from following a single, well-trodden path. Instead, they emerge from venturing into the unknown, embracing diversity, and allowing for serendipity and surprise. As AI evolves, the most capable and resilient systems will be those that do more than optimize known patterns—they will actively seek the adjacent possible, generate novel experiences, and expand the frontiers of knowledge. To build truly general and self-improving systems, we must elevate exploration to a first-class principle in AI design. The future belongs to those who explore. Let us design AI that does the same.

Citation

If you find this post useful, please cite it as:

Suwandi, R. C. (Jul 2025). The Science of Intelligent Exploration. Posterior Update. https://richardcsuwandi.github.io/blog/2025/exploration-in-ai/.

Or in BibTeX format:

@article{suwandi2025explorationai,
    title   = "The Science of Intelligent Exploration",
    author  = "Suwandi, Richard Cornelius",
    journal = "Posterior Update",
    year    = "2025",
    month   = "Jul",
    url     = "https://richardcsuwandi.github.io/blog/2025/exploration-in-ai/"
}