
Why AI Models Work When Theory Suggests They Shouldn’t

Using a Bayesian framework helps explain ‘double descent.’

In recent years, something unexpected has been happening in artificial intelligence. Modern AI appears to be breaking a rule that statisticians have preached for nearly a century: Keep models in a Goldilocks zone. They should be complex enough to capture patterns in the data but still simple enough that they don’t become too tailored to their training examples.

But is it possible the rule isn’t actually being broken? Chicago Booth’s Nicholas Polson and George Mason University’s Vadim Sokolov find that modern AI’s success can be understood within established Bayesian statistical principles.

Traditionally, model performance has followed a predictable U-shaped curve. At first, increasing a model’s size reduces test error—the error a model makes when applied to unseen data. But as the model becomes more complex and more tightly fitted to its training data, the test-error rate starts rising again. Practitioners aim to find the sweet spot for model complexity at the bottom of the U, where test error is lowest.

As its complexity increases, a model can eventually reach the interpolation threshold, where the number of its parameters—think of them as dials on a recording studio’s mixing board, each shaping a particular element of how the model processes data—equals the number of training examples. At this point, the model essentially memorizes its training data and, by conventional logic, should fail when applied to new data. It acts like a student who memorizes practice test questions but never learns the underlying concepts.
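A small numerical sketch makes the memorization point concrete (the setup and numbers below are illustrative, not taken from the research): a polynomial given as many coefficients as there are training points reproduces those points exactly, yet its predictions away from them can stray far from the underlying pattern.

    import numpy as np

    # Illustrative sketch (numbers chosen for demonstration, not from the paper):
    # a degree-7 polynomial has 8 coefficients, exactly as many parameters as the
    # 8 noisy training points below, i.e., the interpolation threshold.
    rng = np.random.default_rng(1)
    x_train = np.linspace(0.0, 1.0, 8)
    y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.normal(size=8)  # noisy samples

    coeffs = np.polyfit(x_train, y_train, deg=7)          # memorizes the data
    train_err = np.max(np.abs(np.polyval(coeffs, x_train) - y_train))

    x_grid = np.linspace(0.0, 1.0, 200)                   # unseen inputs
    test_err = np.max(np.abs(np.polyval(coeffs, x_grid) - np.sin(2 * np.pi * x_grid)))

    print(f"max error on training points:  {train_err:.2e}")   # essentially zero
    print(f"max error off training points: {test_err:.2f}")    # typically far larger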

But recent advances in AI technology have added even more complexity, pushing models far beyond the interpolation threshold, and researchers are observing that some of these models, unexpectedly, see their error rate fall a second time—a phenomenon known as “double descent.” It was first formally documented in 2019 by a team of researchers including University of California at San Diego’s Mikhail Belkin working with linear regression models, and was later observed in some generative AI systems. The phenomenon baffles researchers because it seems to conflict with Occam’s razor, the idea that simpler explanations are usually better.
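The pattern can be reproduced in miniature. The sketch below, a generic random-features experiment in the spirit of this line of research rather than the documented study’s exact setup, fits minimum-norm least squares on random features of growing width: test error climbs as the parameter count approaches the number of training examples, then typically falls a second time well past it.

    import numpy as np

    # Generic random-features sketch (illustrative setup, not the documented
    # study's protocol). A minimum-norm least-squares fit on p random ReLU
    # features traces out both descents as p grows past the number of
    # training examples.
    rng = np.random.default_rng(0)
    n_train, n_test, d = 30, 500, 5

    def make_data(n):
        X = rng.normal(size=(n, d))
        y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)  # noisy target
        return X, y

    X_tr, y_tr = make_data(n_train)
    X_te, y_te = make_data(n_test)

    for p in [5, 10, 20, 28, 30, 32, 40, 80, 200, 1000]:
        errs = []
        for _ in range(20):                        # average over feature draws
            W = rng.normal(size=(d, p))            # random projection
            F_tr = np.maximum(X_tr @ W, 0.0)       # ReLU features, width p
            F_te = np.maximum(X_te @ W, 0.0)
            beta = np.linalg.pinv(F_tr) @ y_tr     # minimum-norm least squares
            errs.append(np.mean((F_te @ beta - y_te) ** 2))
        print(f"p = {p:5d}   mean test MSE = {np.mean(errs):.3f}")
    # Test error typically peaks near p = n_train (the interpolation threshold)
    # and falls a second time as p grows far beyond it.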

Polson and Sokolov argue, using Bayesian statistical methods, that this seeming paradox makes mathematical sense when viewed through the right analytical framework.

What is double descent in AI models?

The Bayesian approach that Polson and Sokolov used is a 250-year-old method that treats probability as a measure of belief. Unlike classical statistics, which considers probability to be the long-run frequency of an occurrence, Bayesian statistics starts with assumptions about which parameter values are likely and updates those beliefs as data arrive.
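A coin-flipping example, illustrative rather than drawn from the paper, shows that updating process in its simplest form: start with a prior belief about the coin’s bias and revise it after every flip.

    # Illustrative Bayesian updating (not from the paper): belief about a coin's
    # bias, encoded as a Beta distribution, is revised after each observed flip.
    alpha, beta = 1.0, 1.0            # Beta(1, 1) prior: every bias equally plausible
    flips = [1, 1, 0, 1, 1, 0, 1]     # observed data, 1 = heads, 0 = tails

    for flip in flips:
        alpha += flip                 # heads adds evidence to alpha ...
        beta += 1 - flip              # ... tails adds evidence to beta
        print(f"after observing {flip}: estimated chance of heads = {alpha / (alpha + beta):.2f}")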

Polson and Sokolov’s key insight is that Bayesian methods naturally implement Occam’s razor through a form of automatic quality control, penalizing unnecessary flexibility. (Flexibility is the model’s ability to approximate complex patterns by allowing its fitted function to bend and curve while preserving signals from the data.) They contend that complex models that overcome these penalties and still outperform their simpler counterparts represent genuinely superior solutions, not statistical flukes.

Notably, the Bayesian approach solves two problems at once: it identifies which level of complexity is best and estimates the optimal parameter values within that level. Classical approaches usually handle these issues separately.

The researchers suggest that this works through prior specification—making assumptions about reasonable parameter values before seeing the data. In their framework, basic model features might be given loose constraints while more complex features are tightly restricted. Whether double descent appears depends critically on these starting assumptions.

Polson and Sokolov provide a simple example to illustrate how this works. Imagine there are four data points: -1, 3, 7, 11. Two models could both explain this pattern perfectly. The first might assume an arithmetic sequence (adding 4 each time) using just two parameters, and the second might use a cubic equation with four parameters. Both fit the data, but the Bayesian approach automatically favors the simpler arithmetic model with fewer parameters.

Why? The cubic model spreads its probability across a vastly larger space of possible parameter combinations. The researchers suggest that a mathematical penalty for complexity happens automatically through marginal likelihood calculations (a measure of the probability that a particular model will produce the observed data), implementing Occam’s razor without human intervention. While this example uses simple polynomial models, Polson and Sokolov also laid out how their Bayesian framework could extend to high-dimensional neural network regression, demonstrating a potential connection between their theory and contemporary deep-learning approaches.
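That comparison can be written out directly. The sketch below uses illustrative Gaussian priors on the coefficients and an illustrative noise level (neither is specified in the research); integrating the parameters out gives each model’s marginal likelihood in closed form, and the two-parameter arithmetic model comes out ahead of the four-parameter cubic even though both fit the four points exactly.

    import numpy as np

    # Sketch of the evidence comparison for the four-point example. The prior
    # scale and noise level are illustrative assumptions, not values from the
    # paper. With a Gaussian prior w ~ N(0, tau^2 I) and Gaussian noise,
    # integrating out w gives y ~ N(0, sigma^2 I + tau^2 Phi Phi^T), whose log
    # density at the observed y is the log marginal likelihood (evidence).
    x = np.array([1.0, 2.0, 3.0, 4.0])     # position in the sequence
    y = np.array([-1.0, 3.0, 7.0, 11.0])   # the four observed data points

    def log_evidence(Phi, y, prior_std=10.0, noise_std=1.0):
        n = len(y)
        C = noise_std**2 * np.eye(n) + prior_std**2 * (Phi @ Phi.T)
        sign, logdet = np.linalg.slogdet(C)
        quad = y @ np.linalg.solve(C, y)
        return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))

    Phi_arithmetic = np.column_stack([np.ones_like(x), x])   # 2 parameters
    Phi_cubic = np.column_stack([x**k for k in range(4)])    # 4 parameters

    print("log evidence, arithmetic model:", round(log_evidence(Phi_arithmetic, y), 2))
    print("log evidence, cubic model:     ", round(log_evidence(Phi_cubic, y), 2))
    # Both models can reproduce -1, 3, 7, 11 exactly, but the cubic spreads its
    # prior probability over a much larger space of parameter combinations, so
    # its marginal likelihood comes out lower: an automatic Occam's razor.

Tighten or loosen those prior scales and the comparison shifts, which is why the starting assumptions described above matter for whether double descent appears.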

This framework may also explain why traditional statistical methods struggle with double descent: classical statistics often relies on complexity measures that ignore the role smart starting assumptions play in keeping models in check.

Viewing high-performing large-scale AI models through a Bayesian lens suggests they may be succeeding in ways consistent with sophisticated, hierarchical quality control rather than in violation of Occam’s razor. When the data warrant it, more complex models that use their extra flexibility well can overcome the marginal-likelihood penalty and outperform the best simple model at the bottom of the U-shaped curve.

Polson and Sokolov’s work focuses on simpler mathematical models rather than the complex deep-learning systems and massive neural networks where researchers most prominently observe double descent. Nevertheless, the framework suggests this puzzling phenomenon isn’t as mysterious as it first appeared. Their work offers a way to interpret at least some of what’s happening through traditional Bayesian principles. Future research could develop practical applications from their theoretical foundation.

