Experts All the Way Down
Recursive Hierarchical Gating and the Gap Between a 1994 Idea and Today's Composed Models
A mixture-of-experts model holds many specialized sub-networks, called experts, and a small gating network that decides which of them handle a given input. In the usual design the gate fires once. It looks at a token, picks a few experts, and the rest of the layer sits out, a setup covered in an earlier overview of mixture-of-experts. The idea worth examining keeps the gate but changes what happens after it decides. Rather than pick an expert and stop, the gate splits the incoming context into categories and forwards each piece to a more specialized system, and that system may carry its own gate, which splits again, to whatever depth helps. The decision becomes a tree rather than a single fork, and only at the leaves does an expert actually answer.
The structure branches the way a tree divides a trunk into limbs and limbs into twigs, each split finer than the last, with an expert waiting only at the final tip. Two features set it apart from a standard routing layer. The split is recursive rather than a single step, and the thing being routed is a chunk of meaning, a document or a sub-task, rather than an individual token. Whether that structure earns its added complexity, measured against a single flat model or a single flat routing step, is an open question with a longer history behind it than the recent interest suggests.
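A minimal Python sketch of that recursive dispatch follows, under loud assumptions. The `Gate` and `Leaf` classes, the keyword classifiers, and the expert names are all hypothetical stand-ins, and each gate here makes a hard choice per chunk rather than a learned or soft one. It is meant only to show the shape of the control flow: a gate whose chosen branch may itself be a gate.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, Union


@dataclass
class Leaf:
    """A leaf holds an expert: any callable that answers a chunk of context."""
    expert: Callable[[str], str]


@dataclass
class Gate:
    """An internal node: a classifier plus one child per category label."""
    classify: Callable[[str], str]
    children: Dict[str, "Node"]


Node = Union[Leaf, Gate]


def dispatch(node: Node, chunk: str,
             path: Tuple[str, ...] = ()) -> Tuple[Tuple[str, ...], str]:
    """Descend through gates until a leaf answers; return (path taken, answer)."""
    if isinstance(node, Leaf):
        return path, node.expert(chunk)
    label = node.classify(chunk)
    return dispatch(node.children[label], chunk, path + (label,))


# Hypothetical keyword classifiers standing in for real category models.
def top_gate(chunk: str) -> str:
    return "code" if ("def " in chunk or "{" in chunk) else "prose"


def code_gate(chunk: str) -> str:
    return "python" if "def " in chunk else "other"


tree = Gate(top_gate, {
    "prose": Leaf(lambda c: "prose expert"),
    "code": Gate(code_gate, {          # the chosen branch is itself a gated system
        "python": Leaf(lambda c: "python expert"),
        "other": Leaf(lambda c: "generic code expert"),
    }),
})

print(dispatch(tree, "def f(): return 1"))
print(dispatch(tree, "A short note about routing."))
```

The returned path records every category decision on the way down, which makes the depth of the tree visible: the prose chunk stops after one split, while the Python chunk passes through two gates before an expert answers.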
A Soft Decision Tree, Circa 1994
The recursive version is not new. Jordan and Jacobs described it in 1994 as a tree-structured architecture in which gating networks sit at the branch points and experts sit at the leaves [1]. Each gate partitions the input space, but the boundaries are soft. Like ink dropped into water, adjacent regions spread into a shared zone instead of stopping at a clean line, so an input can belong partly to several branches at once rather than landing in exactly one. The same construction repeats at every level, which yields a tree of arbitrary depth. They named the result a hierarchical mixture of experts and fit the whole thing with the expectation-maximization algorithm, a standard method for learning models that contain hidden structure. Their own phrase remains the clearest handle on the idea: a soft decision tree, in which each node asks a fuzzy question and the answers blend instead of committing to one path.
That 1994 model routed fixed-length numeric inputs through simple linear experts, a long way from variable-length natural language. The worked examples ran two levels deep, though the authors noted the method extends to arbitrary depth [1]. Whether the soft-partition mathematics survives the move to language, where the input is a long sequence with no fixed dimension and the categories are semantic rather than geometric, is the part that does not carry over for free.
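The two-level case can be sketched in a few lines of Python. This is a forward pass only, with softmax gates and linear experts; the weights are made-up numbers chosen for illustration, the function names are hypothetical, and the EM fitting procedure from [1] is not shown. In symbols it computes y = sum_i g_i(x) * sum_j g_{j|i}(x) * (u_ij . x), where the g terms are gate probabilities and u_ij is the weight vector of the expert at leaf (i, j).

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw gate scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def hme_forward(x, top_gate, sub_gates, experts) -> Tuple[float, List[float]]:
    """Return the blended prediction and the weight each leaf received."""
    g_top = softmax([dot(v, x) for v in top_gate])
    y, leaf_weights = 0.0, []
    for i, g_i in enumerate(g_top):
        g_sub = softmax([dot(v, x) for v in sub_gates[i]])
        for j, g_j in enumerate(g_sub):
            weight = g_i * g_j              # soft membership of x in leaf (i, j)
            leaf_weights.append(weight)
            y += weight * dot(experts[i][j], x)
    return y, leaf_weights


# Made-up parameters: two branches, two linear experts under each.
x = [1.0, 2.0]
top_gate = [[0.0, 0.0], [0.0, 0.0]]                      # zero scores: even split
sub_gates = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
experts = [[[1.0, 0.0], [2.0, 0.0]], [[3.0, 0.0], [4.0, 0.0]]]

y, leaf_weights = hme_forward(x, top_gate, sub_gates, experts)
print(y, leaf_weights)
```

With all gate scores at zero, every leaf receives weight 0.25 and the prediction is the plain average of the four expert outputs. Making one gate score much larger pushes nearly all the weight down one branch, at which point the soft tree behaves almost like an ordinary hard decision tree.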
Expert Layers Versus Expert Models
A distinction matters before going further. In a transformer mixture-of-experts the experts are feed-forward blocks inside one network, trained together, sharing a backbone, and routing picks among parts of a single model, which is the common case a recent survey of the area documents [2]. A different design treats each expert as a whole, separately trained, separately deployable model and puts a router in front of the collection. The literature calls this composition of experts, a model of models rather than a layer of them.
One such system uses a single router over a pool of expert language models and reaches the quality of a much larger model while keeping average active parameters low, around 31 billion on one benchmark [3]. Its routing runs in two steps: a category router first sorts the prompt into one of a fixed set of categories, then a lookup maps that category to the best expert [3]. That is close to the category-splitting idea, yet the routing still resolves to a single expert in one pass. A second design encodes each expert model as a special token in a controller model's vocabulary, so choosing an expert looks like generating the next token, and it reports a gain of a few percent over earlier multi-model methods [4]. Useful as these are, the router in each is a flat dispatch to one of several models. The recursive part, a chosen branch that is itself another gated system, is absent.
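For contrast with a recursive gate, the two-step flat routing can be sketched as a classifier followed by a table lookup. Everything named here is an assumption made for illustration: the category labels, the expert names, and the keyword classifier are hypothetical and are not taken from the system in [3].

```python
from typing import Callable, Dict

# Hypothetical category-to-expert table; the names are illustrative only.
CATEGORY_TO_EXPERT: Dict[str, str] = {
    "code": "expert-code",
    "math": "expert-math",
    "general": "expert-general",
}


def categorize(prompt: str) -> str:
    """Stand-in for a learned category router: crude keyword matching."""
    text = prompt.lower()
    if "function" in text or "compile" in text:
        return "code"
    if "integral" in text or "prove" in text:
        return "math"
    return "general"


def route(prompt: str,
          classifier: Callable[[str], str] = categorize,
          table: Dict[str, str] = CATEGORY_TO_EXPERT,
          default: str = "expert-general") -> str:
    """Step 1: pick a category. Step 2: look up the expert for it."""
    return table.get(classifier(prompt), default)


print(route("Why does this function not compile?"))
print(route("Plan a weekend trip."))
```

The lookup always ends at an expert name. Nothing in the table can point to another router, which is the sense in which the dispatch is flat.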
Hierarchical, Recursive, and the Difference
Two recent lines of work carry the words hierarchical and recursive, and both deserve pinning down, because neither is the recursive category dispatch sketched above. One groups a model's experts and applies routing control at two coupled levels, balancing traffic across groups while encouraging specialization within them, and reports a modest perplexity gain and much better expert balance at the seven-billion scale [5]. That is hierarchy inside one model's router, still operating on tokens. The other reuses a single shared stack of layers several times and lets a lightweight router decide how many passes each token takes, which saves parameters and compute [6]. Its recursion is over depth of computation, how often a token revisits the same block, not over which specialized model handles which category of content. Recursive computation and recursive routing are easy to conflate and are not the same idea.
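The second kind of recursion, over depth of computation, can be made concrete with a toy sketch. The assumptions are heavy: tokens are single floats rather than vectors, `shared_block` is an arbitrary arithmetic function, and `depth_router` is a hand-written rule, all hypothetical. It is not the method of [6], only an illustration of one shared block being reused a router-chosen number of times per token.

```python
from typing import Callable, List, Sequence, Tuple


def recursive_depth_forward(tokens: Sequence[float],
                            shared_block: Callable[[float], float],
                            depth_router: Callable[[float], int],
                            max_depth: int = 3) -> List[Tuple[float, int]]:
    """Apply the same block to each token as many times as the router asks."""
    results = []
    for h in tokens:
        passes = max(1, min(depth_router(h), max_depth))  # clamp to [1, max_depth]
        for _ in range(passes):
            h = shared_block(h)                           # same parameters every pass
        results.append((h, passes))
    return results


# Toy stand-ins: one shared transformation and a rule-based depth choice.
def shared_block(h: float) -> float:
    return 0.5 * h + 1.0


def depth_router(h: float) -> int:
    return 3 if abs(h) > 1.0 else 1


print(recursive_depth_forward([0.0, 4.0], shared_block, depth_router))
```

Every token passes through the same block, so no content is ever sent to a different specialist. The router changes how much computation a token receives, not which model computes it.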
Whether the Extra Structure Pays
A prior question hides under all of this. If routing by meaning is valuable, do large models already do it on their own? A 2025 study probed several open mixture-of-experts models and found clear, statistically significant evidence that routing is sensitive to semantics, with expert overlap rising when meaning is preserved and falling when it changes, an effect strongest in the middle layers and growing with model size [7]. The behavior looks learned and emergent rather than designed in. If category-like specialization arises on its own during ordinary training, a hand-built category gate has to justify itself against a baseline that already routes semantically without being asked to.
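One way to picture an expert-overlap measurement is a set comparison between the experts selected for two inputs at each layer. The sketch below uses Jaccard overlap, which is an assumption made for illustration and may differ from the metric used in [7]. The routing records are invented expert ids that exist only to exercise the function, not measurements from any model.

```python
from typing import Dict, Set


def expert_overlap(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap between two sets of selected expert ids."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def layer_overlap(route_a: Dict[int, Set[int]],
                  route_b: Dict[int, Set[int]]) -> Dict[int, float]:
    """Compare two routing records layer by layer (layer index -> expert ids)."""
    return {layer: expert_overlap(route_a[layer], route_b[layer])
            for layer in route_a if layer in route_b}


# Invented routing records for two inputs, two layers, top-2 experts each.
input_a = {0: {1, 4}, 1: {2, 5}}
input_b = {0: {1, 7}, 1: {2, 5}}

print(layer_overlap(input_a, input_b))
```

Computing the overlap per layer rather than as a single number is what allows a comparison across depth, such as asking whether middle layers behave differently from early or late ones.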
The statistical theory is encouraging but conditional. A recent analysis of hierarchical mixtures shows that the choice of gating function changes the outcome, that the familiar softmax gate creates parameter interactions which slow expert convergence, and that a different gating function removes them and sharpens specialization [8]. A hierarchical gate can provably help, in other words, but only under the right design, and the wrong gate blunts the advantage the structure was meant to provide.
The simple version of the pattern already ships, though rarely more than two levels deep. A production audio assistant uses a lightweight intent classifier to route each query to one of several specialized models (speech recognition, speaker identification, music tagging), then lets a small language model assemble the answer, and at the routing step the cheap classifier beats a large model [9]. Nesting these systems further, so that a classifier's chosen branch is itself a classifier-plus-experts system, is uncommon in published work. The closest the agent-orchestration literature comes is a hierarchical scheme that decomposes a task with a planning agent, instantiates specialized worker agents per sub-task, and searches over their arrangements, with double-digit accuracy gains reported on reasoning benchmarks [10]. The decomposition there is genuinely multi-level, though it splits tasks rather than routing categories of context to standing expert models.
What This Suggests
The thirty-year arc is tidy in outline and unfinished in substance. The recursive, tree-structured gate was written down in 1994, and the modern pieces exist in scattered form: composition across whole models, hierarchical control inside one model, recursion over compute, emergent semantic routing, and conditional theory about when gating helps. What is missing is the join: a system that splits context by category and dispatches recursively across separately specialized models, more than two levels deep, with evidence that it beats a flat router. A recursive gate that dispatches to separate models would also inherit hard questions about where each leaf actually runs, since every leaf is a model that has to be served somewhere. The appeal of the idea is its plainness: a small classifier asking a question, then asking it again. Whether that plainness holds once the splits are semantic, the experts are full models, and the tree is more than shallow is still closer to a promising hypothesis than a settled result.
References
[1] M. I. Jordan and R. A. Jacobs, "Hierarchical Mixtures of Experts and the EM Algorithm," Neural Computation, vol. 6, no. 2, pp. 181–214, 1994.
[2] W. Cai et al., "A Survey on Mixture of Experts in Large Language Models," arXiv, 2024.
[3] S. Jain et al., "Composition of Experts: A Modular Compound AI System Leveraging Large Language Models," arXiv, 2024.
[4] Z. Chai et al., "An Expert is Worth One Token: Synergizing Multiple Expert LLMs as Generalist via Expert Token Routing," arXiv, 2024.
[5] G. Molodtsov et al., "Hierarchical Mixture-of-Experts with Two-Stage Optimization," arXiv, 2026.
[6] S. Bae et al., "Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation," arXiv, 2025.
[7] M. L. Olson et al., "Probing Semantic Routing in Large Mixture-of-Expert Models," in Findings of the Association for Computational Linguistics (EMNLP), 2025.
[8] H. Nguyen et al., "On Expert Estimation in Hierarchical Mixture of Experts: Beyond Softmax Gating Functions," arXiv, 2024.
[9] V. Naveen et al., "Comprehensive Audio Query Handling System with Integrated Expert Models and Contextual Understanding," arXiv, 2024.
[10] Z. Hou et al., "HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems," arXiv, 2025.