The ‘Professor of Uncertainty’ on AI
Chicago Booth’s Veronika Ročková uses statistical methods that exploit the randomness in AI responses.
Anyone who is a fan of detective stories, or has mediated a playground fight, knows that the truth can be extracted from differing accounts of the same event. In Agatha Christie’s Five Little Pigs, detective Hercule Poirot listens to five different accounts of the same day and weaves them into a composite picture that reveals the killer.
In a way, artificial intelligence is like the characters in the novel, as it gives different replies to the same question, seemingly at random. And researchers are finding that, like Poirot, they can extract valuable signals from AI’s inconsistent outputs and use them to give statistical models a head start.
By applying a statistical method, University of Chicago PhD student Sean O’Hagan and Chicago Booth’s Veronika Ročková weave AI’s varied predictions together with verified answers. The outcome isn’t a single definitive answer but a probability map that shows the range of possible outcomes and how likely each one is. This kind of forecast gives decision-makers a clearer sense of where the evidence is strong and where uncertainty remains, the researchers explain.
Large language models such as OpenAI’s GPT-4o work a bit like improvisational storytellers: Ask the same question twice, and you may get slightly different answers. Feed the model a list of symptoms for a skin condition, as the researchers did, and its output might offer varying diagnoses depending on how the question is phrased.
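To see that variability directly, one can send the same prompt many times and tally the answers. Here is a minimal sketch, assuming the OpenAI Python SDK and an API key in the environment; the prompt, symptom list, and model choice are illustrative, not the researchers’ exact setup:

```python
# Minimal sketch: sample the same prompt repeatedly and count the answers.
# Assumes the OpenAI Python SDK (v1); prompt and model are illustrative.
from collections import Counter
from openai import OpenAI

client = OpenAI()
prompt = ("A patient presents with scaling, itching, and erythema. "
          "Name the single most likely skin condition.")

answers = []
for _ in range(20):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # nonzero temperature keeps the sampling stochastic
    )
    answers.append(resp.choices[0].message.content.strip())

# The spread of answers, not any single one, is the raw material here.
print(Counter(answers))
```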
Rather than treat that variability as a flaw, O’Hagan and Ročková use it as a source of extra information to help traditional statistical models learn faster from limited real-world data. In many fields, collecting expert-labeled data—such as dermatologists’ diagnoses—is expensive, so researchers often work with small datasets. But unlabeled data are often abundant, and AI can help fill in the gaps.
Here’s the catch: AI output is uncertain. The model might be confident, or just confidently wrong. The randomness in AI outputs creates a spread of possible answers—capturing useful information about uncertainty. These AI labels aren’t perfect, but they reflect patterns drawn from the LLM’s vast training data that can be harnessed to analyze real data.
To make use of them, the researchers turn to a Bayesian model—a framework that operates like a detective starting with a hunch and updating it as new clues emerge.
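The updating logic can be seen in miniature with the simplest Bayesian model, a Beta-Binomial update; all of the numbers below are invented for illustration:

```python
# Minimal sketch of Bayesian updating: a Beta prior ("the hunch") updated
# by Binomial evidence ("the clues"). All numbers are invented.
from scipy import stats

prior_a, prior_b = 2, 2          # mild prior hunch centered at 50 percent
yes, no = 9, 3                   # new evidence: 9 of 12 cases point one way

posterior = stats.beta(prior_a + yes, prior_b + no)
print(f"posterior mean: {posterior.mean():.2f}")
print(f"90% credible interval: {posterior.interval(0.90)}")
```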
Traditional Bayesian models often begin with fixed assumptions, such as the expectation that data will follow a bell-shaped curve when viewed graphically. That can be limiting. Instead, O’Hagan and Ročková use a more flexible statistical technique called a Dirichlet process prior, which allows the model to start with an AI-sketched profile that can follow any shape and adapt freely as real data are introduced. Their detective no longer begins every case convinced “the butler did it”; the investigation stays open to the evidence.
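A Dirichlet process can be simulated directly through its stick-breaking construction. The sketch below uses an illustrative standard-normal base measure and concentration parameter; in the researchers’ setup, the process is centered on AI-generated data instead:

```python
# Sketch of one random draw from a Dirichlet process prior via
# stick-breaking. Base measure and concentration alpha are illustrative.
import numpy as np

rng = np.random.default_rng(0)
alpha, n_atoms = 5.0, 200

sticks = rng.beta(1.0, alpha, size=n_atoms)                # break the stick
weights = sticks * np.cumprod(np.concatenate(([1.0], 1 - sticks[:-1])))
atoms = rng.normal(size=n_atoms)                           # base-measure draws

# (atoms, weights) define one random distribution; it can take almost any
# shape, which is what makes the prior so flexible.
sample = rng.choice(atoms, size=1_000, p=weights / weights.sum())
print(f"mean {sample.mean():.2f}, sd {sample.std():.2f}")
```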
Bayesian analysis with the Dirichlet process prior is like a detective investigating a case where each real datapoint represents testimony from a credible eyewitness, while each AI-generated datapoint represents testimony from an informant of questionable reliability. The detective weighs all the testimony, but gives much more credence to the eyewitnesses than to the informants.
The researchers apply different weightings to the testimony of the less reliable informants and keep the highest weight that still leaves them about 90 percent sure the real answer sits somewhere inside the model’s predicted range of outcomes.
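One stylized way to see that calibration is a simulation in which the truth is known, so coverage can be checked exactly. Every number below (the true rate, sample sizes, and the AI’s error rate) is invented, and the weighted Beta posterior is a simplified stand-in for the researchers’ model:

```python
# Stylized calibration loop: for each candidate weight on the AI labels,
# simulate many small studies with a known truth, form a weighted Beta
# posterior, and check how often the 90% interval covers the truth.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
truth, n_real, n_ai, ai_error = 0.30, 20, 200, 0.15

def coverage(weight, reps=2000):
    hits = 0
    for _ in range(reps):
        real = rng.random(n_real) < truth            # expert labels
        ai = rng.random(n_ai) < truth
        ai ^= rng.random(n_ai) < ai_error            # AI labels are noisier
        a = 1 + real.sum() + weight * ai.sum()
        b = 1 + (~real).sum() + weight * (~ai).sum()
        lo, hi = stats.beta(a, b).interval(0.90)
        hits += lo <= truth <= hi
    return hits / reps

# Rule of thumb: keep the largest weight whose coverage stays near 0.90.
for w in (1.0, 0.5, 0.2, 0.1, 0.05, 0.0):
    print(f"weight={w:4.2f}  coverage={coverage(w):.2f}")
```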
With its panel of eyewitnesses, O’Hagan and Ročková’s detective isn’t singularly focused on guessing the most likely culprit. The researchers’ approach is about understanding the entire crime scene. The model produces probabilities across all suspects—like the skin-disease predictions. Applied to Five Little Pigs, it might produce a 35 percent chance that stockbroker Philip Blake was the murderer, a 28 percent chance that it was the governess, Cecilia Williams, and so on. As new clues arrive, those numbers shift and home in on the most likely culprit.
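In miniature, that probability map is a posterior over categories. Philip Blake and Cecilia Williams come from Christie’s novel; the other labels and all of the “clue counts” below are invented for illustration:

```python
# Toy probability map over suspects via a Dirichlet-multinomial update.
# Names beyond the two from the novel, and all counts, are invented.
import numpy as np

suspects = ["Philip Blake", "Cecilia Williams", "Suspect 3",
            "Suspect 4", "Suspect 5"]
prior = np.ones(len(suspects))           # open-minded: no favorite suspect
clues = np.array([7, 5, 3, 2, 1])        # how strongly clues implicate each

posterior = prior + clues
for name, p in zip(suspects, posterior / posterior.sum()):
    print(f"{name:18s} {p:.0%}")
```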
While the AI-generated data create a good starting picture for the Bayesian model, they’re not completely trustworthy. Each synthetic label is treated as a clue—useful, but not definitive. As more real data come in, the story reshapes itself.
Their approach also offers a speed advantage. Traditional Bayesian models often rely on slow, step-by-step sampling. O’Hagan and Ročková instead draw random weights for the data and solve weighted optimization problems independently, allowing parallel processing that delivers all results simultaneously.
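What they describe resembles a weighted likelihood bootstrap: each posterior draw re-weights the data with random Dirichlet weights and solves its own optimization problem, so draws run independently. Here is a minimal sketch for a simple mean; the researchers solve richer weighted optimization problems, but the parallel structure is the same:

```python
# Minimal sketch in the spirit of a weighted likelihood bootstrap. Each
# draw uses fresh random Dirichlet weights; the "optimization" here is
# just a weighted mean, standing in for richer weighted problems.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

data = np.random.default_rng(2).normal(loc=1.5, size=100)  # toy dataset

def one_draw(seed):
    rng = np.random.default_rng(seed)
    w = rng.dirichlet(np.ones(len(data)))   # random weights over data points
    return np.average(data, weights=w)      # the weighted optimum

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:     # all draws computed in parallel
        draws = list(pool.map(one_draw, range(1000)))
    print(np.percentile(draws, [5, 95]))    # approximate 90% interval
```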
The researchers tested their method on two real-world datasets: one involving the diagnosis of skin conditions and another using images from space. In the first, they attempted to diagnose six conditions using about 150 dermatologist-labeled examples from the UC Irvine Machine Learning Repository and roughly 220 ChatGPT predictions. When the AI input was moderately weighted, their method boosted accuracy by 5 percentage points over using only expert-labeled data. But when the AI’s influence became too strong, accuracy dropped—underscoring the importance of careful calibration.
The second dataset classified whether galaxies exhibited spiral arms, a feature that is useful for understanding star formation. Using fewer than 1,000 human-labeled images from the citizen science initiative Galaxy Zoo 2 and 15,000 labeled by a computer vision model, their method nearly halved the width of the 90 percent credible interval (the Bayesian version of a confidence interval) for estimating the fraction of spiral galaxies—and did so without losing coverage. It produced nearly the same answer, with twice the precision.
Unlike earlier methods that tweak a single AI estimate, this approach delivers a range of possible outcomes with probabilities attached, something that can be used in further analysis. It doesn’t rely on one set of fixed AI predictions or assume the data must follow a fixed pattern. Instead, it starts with a flexible guess that adapts as more information comes in, folding in the randomness of AI predictions. The only setting to adjust—the weighting parameter—is easy to fine-tune, and if it’s miscalibrated, the model quickly signals that something’s not right.
The research suggests this approach could benefit any field with limited labeled data and plenty of unlabeled examples. Because the AI-generated labels don’t have to be perfect, the Bayesian framework automatically reduces their weight when the real data indicate they’re off. The method is robust and adaptable, making it practical even when expert data are scarce.
Sean O’Hagan and Veronika Ročková, “AI-Powered Bayesian Inference,” preprint, arXiv:2502.19231, May 2025.