(gentle music)
Narrator: When experts build predictive algorithms, whether for football scores, election results, or commute times, they rely on standard definitions of accuracy. A common one is minimizing squared error, which penalizes large mistakes more than small ones. This helps models avoid big misses, but often at the cost of rarely hitting the bullseye. But how are everyday users of these models judging their performance? That’s the question Chicago Booth’s Berkeley Dietvorst set out to explore.
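The trade-off the narrator describes can be sketched in a few lines. This is a hedged, illustrative example, not anything from the study itself: the error values for the two hypothetical models are made up, and `squared_error` is just the standard sum-of-squared-errors objective.

```python
# Illustrative sketch: minimizing squared error trades the bullseye
# for avoiding big misses. The error values are invented for illustration.

def squared_error(errors):
    """Total squared error over a set of prediction mistakes."""
    return sum(e ** 2 for e in errors)

# Model A: consistently a little off (never exact, never far off).
consistent = [2, 2, 2, 2, 2]
# Model B: usually dead-on, but occasionally way off.
mostly_exact = [0, 0, 0, 0, 6]

print(squared_error(consistent))    # 20
print(squared_error(mostly_exact))  # 36
```

Under squared error, Model A (20) beats Model B (36), even though Model B is exactly right four times out of five.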
Berkeley Dietvorst: We wanted to figure out—and it’s really been a long stream of research that I’ve been doing over years that’s kind of culminated in this question—how do laypeople define different levels of performance? So how would a layperson feel about being off by zero versus one versus two versus four versus 20? When I say a layperson, I just mean someone who hasn’t had any formal statistical training—so someone who either wants to make predictions themselves or use a model to make predictions, but hasn’t really been trained in prediction or how experts think about predictions, and is just using their own intuition and their own feelings for how they would like a prediction to go. And what we found that was pretty interesting is that the way that laypeople define this is that they feel there’s a really big difference between being perfect versus off by one. And that kind of makes intuitive sense when you talk to people, right? Being perfect versus off by one is the difference between being right and wrong, right? People feel there’s a giant difference between being right versus wrong. On the other hand, they don’t feel as big a difference between being off by four versus five. You know, if you made a prediction yesterday that was off by four versus five for the temperature or a sports game, you might feel that those are really similar. (gentle music continues) We’re often building models that adopt the objective of “just don’t make a big mistake, but it doesn’t matter if you’re exactly right.” Whereas people are thinking, “I want to be exactly right, but if I make a mistake, it doesn’t matter if it’s a really big or just a kind of big mistake. I don’t care about those differences.” So the interesting thing, then, when we reflect on it is that the way people are thinking about prediction performance is kind of the opposite of how models are thinking about performance.
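The lay intuition Dietvorst describes—a big penalty for being wrong at all, then near-indifference between mistake sizes—can be sketched as its own loss function. The functional form here is entirely an assumption made for illustration; the study doesn’t prescribe these particular numbers.

```python
# Hedged sketch of a "lay" loss: being off at all costs a lot,
# and additional error barely matters. The 1 + 0.05*e form is an
# assumption chosen purely to illustrate the shape described.

def lay_loss(errors):
    """Total lay-style loss: large jump from exact to off-by-one, then nearly flat."""
    return sum(0 if e == 0 else 1 + 0.05 * e for e in errors)

# Same two hypothetical models as before.
consistent = [2, 2, 2, 2, 2]    # never exact
mostly_exact = [0, 0, 0, 0, 6]  # right four times out of five

print(lay_loss(consistent))    # 5.5
print(lay_loss(mostly_exact))  # 1.3
```

Under this loss the preference flips: the mostly-exact model (1.3) now beats the consistent one (5.5)—the opposite of the squared-error ranking, which is exactly the mismatch the interview is pointing at.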
Narrator: If laypeople and developers judge the performance of predictive models so differently, does the way these models are constructed need to change? (gentle music continues)
Berkeley Dietvorst: So the reason this is really relevant is that a lot of algorithms today are actually consumer-facing products. So there’s tons of times where we’re going to Google Maps and having it estimate how long it’ll take to drive somewhere. We might be going to someone like Nate Silver’s account and learning what’s gonna happen in an election, or what’s the score of a sports game going to be? And if we’re building these models to give people predictions that they like that maximize their interests, then we actually wanna learn how they feel about prediction error and give them a prediction that accomplishes their objectives. So if it’s not a statistician or an expert that’s using the model, then it might not matter so much how a statistician feels about the model’s performance, right? We should be building the model for the layperson.
Narrator: What if building predictive models is more like designing a consumer product than solving a math problem? (gentle music continues)
Berkeley Dietvorst: I think a really good analogy would be, we might make a lot of choices when we’re building a minivan that a race-car driver wouldn’t like at all, but that doesn’t matter because we’re not trying to sell the minivan to a race-car driver, right? We’re building the minivan with a certain target market in mind. We wanna learn about their preferences and build it the way they want it. And so this is a way of saying a lot of these algorithms, we should really start thinking about them as products, where we’re learning about the user’s preferences and then actually building the algorithm to do what the user wants instead of looking at statistics or experts and building an algorithm for them. (gentle music continues)
Narrator: If we want algorithms to truly serve the people who use them, we need to rethink not just how to measure their success, but who we’re measuring them for. After all, even the best-designed model can miss the mark if it’s aimed at the wrong audience. (gentle music continues)