Mark Williams
Dec 27, 2025
Reinforcement Learning
Visualization of the optimization gap between best experience and learned policy

Thinkata's previous insight on entropy collapse examined how reinforcement learning agents can fail by exploring too little. Temperature tuning and adaptive entropy scheduling help ensure agents discover good solutions during training. But what happens after discovery? A growing body of research reveals a troubling answer: deep RL agents routinely find excellent strategies during training, then fail to reliably reproduce them. They resemble a dart player who can see the bullseye perfectly but watches throw after throw land wide of the mark. The target is clear. The execution is not. This phenomenon suggests that the challenge facing deep RL extends beyond exploration into fundamental questions about how neural networks learn from experience.

The Optimization Gap

Traditional analysis frames reinforcement learning success as primarily an exploration problem. Find the right states and actions, the reasoning goes, and the agent will learn to reproduce them. Recent research challenges this assumption by introducing a practical metric called the optimization gap [1]. By comparing the best trajectories an agent ever generates against its final learned policy, researchers can decompose performance into what the agent discovered versus what it actually learned. A trajectory is a complete sequence of states, actions, and rewards from the start of an episode to its end.
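
To make the metric concrete, here is a minimal sketch of how the gap could be tracked in practice, assuming training-episode returns and evaluation returns are already being logged. The function name and the numbers are illustrative, not taken from [1].

```python
import numpy as np

def optimization_gap(episode_returns: list[float],
                     eval_returns: list[float]) -> float:
    """Rough proxy for the optimization gap: compare the best return the
    agent ever experienced during training against the average return of
    the final learned policy."""
    best_experienced = max(episode_returns)       # what the agent discovered
    final_policy = float(np.mean(eval_returns))   # what it actually learned
    return best_experienced - final_policy

# Example: the agent once scored 950 during training, but the trained
# policy averages only about 310 on evaluation episodes.
train_returns = [120.0, 430.0, 950.0, 610.0, 500.0]
eval_returns = [290.0, 305.0, 335.0]
print(f"optimization gap = {optimization_gap(train_returns, eval_returns):.1f}")
# A large, persistent gap suggests exploitation, not discovery, is the bottleneck.
```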

The findings are sobering. Across multiple environments and algorithms, deep RL methods exploit only about half of the good experience they generate. In Atari environments, two popular algorithms, DQN and PPO, achieve roughly 30% of their best experienced performance [1]. DQN, or Deep Q-Network, learns by estimating the value of each action. PPO, or Proximal Policy Optimization, learns by directly adjusting the probability of taking each action. Despite their different approaches, both exhibit the same failure pattern. The pattern resembles a student who occasionally aces practice tests but consistently underperforms on the real exam. The knowledge exists somewhere, but retrieving it reliably proves difficult.

The optimization gap persists even in environments where exploration is not the limiting factor [1]. This indicates that optimization difficulties represent a distinct bottleneck. Improving exploration algorithms may yield diminishing returns if the underlying learning problem remains unsolved. Perhaps most counterintuitively, when researchers replaced standard networks with deeper ResNet architectures, the gap widened rather than narrowed [1]. ResNets use skip connections that allow information to bypass layers, enabling training of much deeper networks than was previously possible. Larger networks discover higher-value trajectories but struggle even more to distill them into reliable policies. Simply adding parameters does not address fundamental learning difficulties.

Why Networks Forget How to Learn

The optimization gap reflects a deeper phenomenon called plasticity loss, where neural networks gradually lose their ability to adapt to new information. This degradation has been identified under various names in recent literature, including capacity loss, implicit under-parameterization, and the primacy bias [7]. Plasticity functions like wet clay. Early in training, the network can be molded easily. But as training continues, the clay hardens, and reshaping becomes increasingly difficult.

Visualization of primacy bias in neural networks

First Impressions That Stick

Primacy bias describes how agents overfit to early experiences [2]. Because deep RL trains on progressively growing datasets, networks risk becoming locked into patterns established during initial exploration. These patterns behave like stubborn first impressions, causing later evidence to be ignored no matter how informative. The replay buffer is a memory bank that stores past experiences so the agent can learn from them multiple times. While designed to stabilize learning by breaking correlations between consecutive experiences, the replay buffer can actually reinforce early biases by repeatedly presenting outdated experiences long after the agent has moved on.
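
For readers unfamiliar with the mechanism, a minimal uniform-sampling replay buffer looks something like the sketch below. The class is purely illustrative rather than taken from any specific implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO replay buffer with uniform sampling."""

    def __init__(self, capacity: int = 100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling means transitions gathered by an early, poorly
        # trained policy keep reappearing in gradient updates until they
        # are finally pushed out of the deque.
        return random.sample(self.storage, batch_size)
```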

Sleeping on the Job

Dormant neurons represent another pathway to failure [3]. During training, an increasing fraction of neurons become inactive, like employees who have quietly stopped contributing to the project. Research shows this phenomenon is exacerbated by the non-stationary nature of RL training. Unlike supervised learning, where the training data remains fixed, RL agents generate their own data through interaction with the environment. As the policy improves, the states and actions the agent encounters change, creating a constantly shifting target for the network to track. The agent effectively shrinks its own workforce over time, reducing the network's capacity to represent complex behaviors.
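
A rough way to watch this happen is to track the fraction of near-silent units in each hidden layer, in the spirit of the dormancy score used in [3]. The threshold and layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def dormant_fraction(layer_output: torch.Tensor, tau: float = 0.025) -> float:
    """Fraction of tau-dormant units in one hidden layer: a unit counts as
    dormant when its mean absolute activation over a batch is tiny relative
    to the layer-wide average."""
    scores = layer_output.abs().mean(dim=0)        # per-unit activity, shape (num_units,)
    normalized = scores / (scores.mean() + 1e-8)   # relative to the layer mean
    return (normalized <= tau).float().mean().item()

# Example with a random batch through a small ReLU layer
layer = nn.Sequential(nn.Linear(8, 64), nn.ReLU())
activations = layer(torch.randn(256, 8))
print(f"dormant fraction: {dormant_fraction(activations):.2%}")
```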

Visualization of dormant neurons during training

The Mathematical Signature

Plasticity loss leaves measurable traces in network geometry. Research connects this degradation to the empirical Neural Tangent Kernel, or NTK [4]. The NTK is a mathematical tool that characterizes how a network's outputs change in response to weight updates. It reveals the directions in which the network can easily learn versus directions where learning is difficult. As training progresses, the rank of learned features collapses. Rank measures the effective dimensionality of the network's representations, essentially how many independent patterns the network can distinguish. When rank collapses, the network represents information using fewer and fewer independent dimensions. The effect mirrors a camera lens gradually narrowing its focus until it can only see one thing clearly, losing the ability to capture the full scene.
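
One common proxy for this collapse is an effective-rank measure over a batch of features, in the spirit of [4]: count how many singular values are needed to capture nearly all of the feature matrix's spectral mass. The delta threshold and the toy example below are assumptions for illustration.

```python
import torch

@torch.no_grad()
def effective_rank(features: torch.Tensor, delta: float = 0.01) -> int:
    """Number of singular values needed to capture (1 - delta) of the total
    spectral mass of a (batch x feature_dim) feature matrix."""
    singular_values = torch.linalg.svdvals(features)
    cumulative = torch.cumsum(singular_values, dim=0) / singular_values.sum()
    return int((cumulative < 1.0 - delta).sum().item()) + 1

# A feature matrix whose rows mostly point in the same direction has a low
# effective rank even though it nominally has 256 dimensions.
base = torch.randn(1, 256)
collapsed = base + 0.01 * torch.randn(512, 256)
print(effective_rank(collapsed))   # close to 1: the representation has collapsed
```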

This rank collapse creates highly correlated gradients, meaning the network responds similarly to very different inputs [5]. The loss landscape also sharpens, trapping the network in narrow valleys that are difficult to escape. The loss landscape is a surface defined by error values across all possible weight configurations. Smooth, wide valleys allow the optimizer to find good solutions easily, while sharp, narrow valleys make optimization treacherous.
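
Gradient correlation can be probed directly by computing gradients on two unrelated batches and comparing their directions. The sketch below is a simplified diagnostic, not the exact measurement protocol of [5].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flat_grad(model: nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Flatten the gradient of `loss` with respect to all model parameters."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# If gradients computed on two very different input batches point in nearly
# the same direction, the network is responding almost identically to them.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
x1, x2 = torch.randn(32, 16), torch.randn(32, 16)
g1 = flat_grad(model, model(x1).pow(2).mean())
g2 = flat_grad(model, model(x2).pow(2).mean())
print(F.cosine_similarity(g1, g2, dim=0).item())   # values near 1.0 indicate collapse
```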

These pathologies emerge from the interaction between bootstrapping and gradient descent optimization. Bootstrapping is a technique where value estimates are updated using other value estimates rather than waiting for true outcomes. For example, an agent might update its estimate of the value of being in state A based on its estimate of the value of state B, which is itself an estimate. This creates a chain of interdependent approximations. This combination is unique to reinforcement learning and explains why plasticity loss is more severe in RL than in supervised learning settings.
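
A one-step bootstrapped target makes the chain of estimates explicit. The sketch below follows the standard Q-learning form and is illustrative rather than tied to any specific paper cited here.

```python
import torch

def td_target(reward: torch.Tensor,
              next_q_values: torch.Tensor,
              done: torch.Tensor,
              gamma: float = 0.99) -> torch.Tensor:
    """One-step bootstrapped target: the value of state A is regressed toward
    reward + gamma * (estimated value of state B). The target is itself a
    network output, so errors can propagate through the chain."""
    best_next = next_q_values.max(dim=1).values   # estimate, not a true outcome
    return reward + gamma * (1.0 - done) * best_next

# Example: a single transition with reward 1.0 that does not end the episode
reward = torch.tensor([1.0])
next_q = torch.tensor([[0.2, 0.7, 0.4]])
done = torch.tensor([0.0])
print(td_target(reward, next_q, done))   # 1.0 + 0.99 * 0.7 = 1.693
```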

Recent work has further connected plasticity loss to policy churn, the phenomenon where network outputs change unpredictably for states not included in the current training batch [8]. As the NTK rank decreases, churn intensifies, creating a vicious cycle where learning becomes increasingly unstable.
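
Churn can be estimated by checking how many held-out states change their greedy action across a single update. The sketch below uses a random perturbation as a stand-in for that update and is not the exact protocol of [8].

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def policy_churn(q_net_before: nn.Module,
                 q_net_after: nn.Module,
                 held_out_states: torch.Tensor) -> float:
    """Fraction of held-out states whose greedy action changes across an update."""
    actions_before = q_net_before(held_out_states).argmax(dim=1)
    actions_after = q_net_after(held_out_states).argmax(dim=1)
    return (actions_before != actions_after).float().mean().item()

net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
net_after = copy.deepcopy(net)
with torch.no_grad():
    for p in net_after.parameters():
        p.add_(0.05 * torch.randn_like(p))   # stand-in for one gradient update
print(policy_churn(net, net_after, torch.randn(1000, 4)))
```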

Practical Solutions

Researchers have proposed several interventions that target different aspects of plasticity loss.

Periodic network resetting combats primacy bias by refreshing learned representations [2]. By periodically reinitializing the later layers of the network while keeping the replay buffer intact, agents can escape patterns locked in during early training. This approach has proven effective across both discrete and continuous control domains, clearing those stubborn first impressions.
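
In code, a reset amounts to reinitializing the last few layers in place. The sketch below assumes a simple nn.Sequential value network and an arbitrary choice of how many layers to refresh; the reset schedule is left to the practitioner.

```python
import torch.nn as nn

def reset_final_layers(network: nn.Sequential, num_layers_to_reset: int = 2) -> None:
    """Reinitialize the last few linear layers in place, in the spirit of [2],
    while the replay buffer (not shown) is left untouched."""
    reset_count = 0
    for module in reversed(list(network.children())):
        if isinstance(module, nn.Linear):
            module.reset_parameters()   # fresh random weights for this layer
            reset_count += 1
            if reset_count == num_layers_to_reset:
                break

# Example: every N environment steps, call reset_final_layers(q_network)
q_network = nn.Sequential(nn.Linear(8, 256), nn.ReLU(),
                          nn.Linear(256, 256), nn.ReLU(),
                          nn.Linear(256, 4))
reset_final_layers(q_network, num_layers_to_reset=2)
```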

Recycling dormant neurons maintains network expressivity throughout training [3]. The ReDo method identifies neurons that have become inactive and reinitializes them, effectively waking up the sleeping workforce. This preserves the network's capacity to represent diverse behaviors without requiring full network resets.
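
A simplified version of the recycling step might look like the sketch below, which reinitializes the incoming weights of dormant units and zeroes their outgoing weights so recycled units start out neutral. The threshold and layer sizes are illustrative, and the full ReDo procedure in [3] has additional details.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def recycle_dormant(hidden: nn.Linear, out: nn.Linear,
                    activations: torch.Tensor, tau: float = 0.025) -> int:
    """ReDo-style sketch: find dormant units of `hidden` from a batch of its
    post-ReLU activations, give them fresh incoming weights, and zero their
    outgoing weights in the next layer."""
    scores = activations.abs().mean(dim=0)
    dormant = scores / (scores.mean() + 1e-8) <= tau             # boolean mask per unit
    fresh = nn.Linear(hidden.in_features, hidden.out_features)   # source of new weights
    hidden.weight[dormant] = fresh.weight[dormant]               # new incoming weights
    hidden.bias[dormant] = 0.0
    out.weight[:, dormant] = 0.0                                 # recycled units start silent
    return int(dormant.sum().item())

# Example usage after a forward pass through hidden + ReLU
hidden, out = nn.Linear(8, 64), nn.Linear(64, 4)
acts = torch.relu(hidden(torch.randn(256, 8)))
print(recycle_dormant(hidden, out, acts), "units recycled")
```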

Architectural interventions can also help. Weight decay is a regularization technique that adds a penalty for large weight values, encouraging the network to use smaller, more distributed representations. Layer normalization standardizes the activations within each layer, preventing any single feature from dominating. Together, these techniques preserve the conditioning of feature matrices, keeping the camera lens from narrowing too far [6].
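
A minimal sketch of this combination uses layer normalization after each hidden layer and decoupled weight decay via AdamW; the sizes and hyperparameters are placeholders, not recommendations from [6].

```python
import torch
import torch.nn as nn

# Value network with LayerNorm after each hidden layer, trained with
# decoupled weight decay to discourage large, dominant weights.
value_net = nn.Sequential(
    nn.Linear(8, 256), nn.LayerNorm(256), nn.ReLU(),
    nn.Linear(256, 256), nn.LayerNorm(256), nn.ReLU(),
    nn.Linear(256, 1),
)
optimizer = torch.optim.AdamW(value_net.parameters(), lr=3e-4, weight_decay=1e-2)
```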

Network pruning involves removing unnecessary connections from the network. Counterintuitively, making the network smaller can improve plasticity by reducing gradient correlation and maintaining effective rank [7]. A leaner network may learn more flexibly than a bloated one.
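
A quick way to experiment with this is magnitude pruning on the linear layers. The 50% sparsity level below is an arbitrary illustration, not a setting taken from [7].

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Remove the smallest-magnitude connections from each linear layer.
network = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 4))
for module in network.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # make the pruning permanent
```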

What This Means

The key insight for practitioners is this: when performance plateaus despite apparent exploration success, the bottleneck may be exploitation rather than discovery. Monitoring the gap between best-episode returns and average policy returns can reveal whether optimization difficulties are limiting progress. A large and persistent gap indicates that simply running more episodes or adding exploration bonuses will not help.

The optimization gap research suggests that both continuous control and language model training may be limited not by exploration but by exploitation. Future work may need to focus as much on the distillation problem, how to reliably encode good solutions into neural networks, as on the search problem. For now, periodic resetting, dormant neuron recycling, and careful regularization offer paths forward when agents discover success but struggle to remember it.

References

  1. G. Berseth, "Is Exploration or Optimization the Problem for Deep Reinforcement Learning?," arXiv, 2025, [Online]
  2. E. Nikishin, M. Schwarzer, P. D'Oro, P.-L. Bacon, A. Courville, "The Primacy Bias in Deep Reinforcement Learning," arXiv, 2022, [Online]
  3. G. Sokar, R. Agarwal, P. S. Castro, U. Evci, "The Dormant Neuron Phenomenon in Deep Reinforcement Learning," arXiv, 2023, [Online]
  4. A. Kumar, R. Agarwal, D. Ghosh, S. Levine, "Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning," arXiv, 2021, [Online]
  5. C. Lyle, Z. Zheng, E. Nikishin, B. Pires, R. Pascanu, W. Dabney, "Understanding Plasticity in Neural Networks," arXiv, 2023, [Online]
  6. C. Lyle, Z. Zheng, H. Giber, C. Sherstan, R. Pascanu, W. Dabney, "Disentangling the Causes of Plasticity Loss in Neural Networks," arXiv, 2024, [Online]
  7. Z. Abbas, R. Zhao, J. Modayil, A. White, M. White, "Loss of Plasticity in Continual Deep Reinforcement Learning," arXiv, 2023, [Online]
  8. Y. Tang, G. Berseth, "Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn," arXiv, 2025, [Online]
