AI Works Better When It’s a Little Bit Human

Mimicking human biases about probability may improve how artificial intelligence learns.

When people think about how machines process information, they tend to think of a cold, rational operation. But training artificial intelligence with a measure of human misperception might be the key to making it smarter and cheaper.

Princeton PhD student Sijia Liu, Stanford PhD student Niklas Muennighoff, and Chicago Booth’s Kawin Ethayarajh looked into the difference between the two broad approaches AI developers currently use to align their models after pretraining is complete. They find that the better-performing class of methods serendipitously reflects human biases about probabilities—an insight that explains why the industry’s priciest training techniques work so well. This finding allowed the researchers to create an approach that they contend matches the quality of the expensive method at a fraction of the cost.

Training a state-of-the-art language model can cost more in a week than a small startup spends in a year. A growing fraction of that expense comes from alignment—a process that trains the model using feedback signals on its outputs—after it has absorbed the huge datasets that form the foundation of its learning. Alignment is what shapes raw model capabilities into outputs that are actually useful and appropriate, whether in the context of safety (can we prevent the model from being used for hacking?) or simply of correctness (is the model doing the right mathematical reasoning?).

Alignment can be broken down into two broad types: offline, in which the model learns from a fixed dataset that teaches it how to behave, and online, in which the model generates outputs, receives automated feedback from a scoring system, generates more outputs using that feedback, and repeats. It’s like training chefs by having them cook dish after dish from scratch and critiquing them after each one instead of simply giving them cookbooks to read.
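To make the distinction concrete, here is a minimal sketch of the two regimes in Python. The helper functions (generate, score, update) are hypothetical stand-ins for a model’s sampling, scoring, and gradient-update steps, not any particular library’s API.

```python
def offline_alignment(model, fixed_dataset, update):
    # Offline: learn from a dataset collected ahead of time and never refreshed.
    for prompt, response, label in fixed_dataset:
        model = update(model, prompt, response, label)
    return model

def online_alignment(model, prompts, generate, score, update, rounds=3):
    # Online: the model samples fresh outputs, an automated scorer grades them,
    # and the model is updated on its own latest behavior -- then repeat.
    for _ in range(rounds):
        for prompt in prompts:
            response = generate(model, prompt)  # fresh sample each round
            reward = score(prompt, response)    # automated feedback
            model = update(model, prompt, response, reward)
    return model
```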

Online alignment methods are slow and costly but yield better results than offline methods do. Conventional wisdom credits the constant flow of fresh data that the model generates during training, but Liu, Muennighoff, and Ethayarajh’s research suggests online alignment succeeds at least partly because it accidentally forces language models to learn in a way that mirrors human psychology.

Humans systematically distort probability. Someone might overestimate the chance of winning a massive lottery jackpot while simultaneously underestimating the chance of earning a modest return on an index fund. Behavioral economists call this “probability weighting,” a central concept in the behavioral framework known as prospect theory, which holds that people overweight extreme, rare outcomes and underweight common, highly probable ones.
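That distortion has a standard mathematical form. The short sketch below implements the probability-weighting function from Tversky and Kahneman’s 1992 formulation of prospect theory, with their estimated curvature of roughly 0.61 for gains; it is included only to illustrate the shape of the bias, and the researchers’ exact parameterization may differ.

```python
def weight(p: float, gamma: float = 0.61) -> float:
    """Tversky-Kahneman (1992) probability weighting function.
    For gamma < 1 it overweights small probabilities and underweights
    large ones -- the inverse-S-shaped distortion described above."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

# A one-in-a-million jackpot feels bigger than it is; a 95 percent
# chance feels smaller.
print(weight(1e-6))  # ~2.2e-4, far above the true 1e-6
print(weight(0.95))  # ~0.79, below the true 0.95
```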

“Crucially, this distortion is systematic—not random—and the shape of the distortion is consistent across humans,” Ethayarajh says. “If, instead of thinking about the amount of money that might be gained, we think about the amount of information that might be gained, we can frame alignment through the lens of behavioral economics.”

A less expensive way to train a model

Aligning a model—training it to give answers that people judge as helpful, accurate, and appropriate—with pricier, continuously generated online data typically produces better performance than with cheaper, fixed offline data. But the researchers’ approach, which they call “humanline” because it adjusts training to reflect human biases, can make offline alignment comparable to the online method. The results hold regardless of the training algorithm used.

Applying that frame, the researchers find evidence that online alignment distorts probabilities just as humans do. Much as winning a jackpot occupies an outsize position in our minds, certain model outputs play an outsize role during training.

Liu, Muennighoff, and Ethayarajh’s key insight is that if the online alignment method succeeds by accidentally mimicking human perceptual biases, those biases could be deliberately incorporated into any training approach. The key to effective alignment may be as much about how well the method matches human perception as about where its data come from.

On the basis of this idea, they developed “humanline,” a design modification that applies two procedural changes to existing alignment techniques.

Most alignment methods compare the model as it evolves during training to a fixed reference version of the original model before training began. Humanline instead regularly updates the reference model so that it is closer to the model being trained and never too “stale.” This syncing mirrors how humans evaluate quality against current expectations rather than a fixed historical standard.
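As a rough illustration, assuming a training loop in which the loss compares the current model against a frozen reference (the function names and the sync interval here are hypothetical, not the paper’s specification), the syncing might look like this:

```python
import copy

def align_with_syncing(model, batches, step, sync_every=100):
    ref_model = copy.deepcopy(model)  # frozen reference at the start
    for i, batch in enumerate(batches):
        if i > 0 and i % sync_every == 0:
            # Humanline-style refresh: pull the reference toward the
            # current model so it never grows too "stale."
            ref_model = copy.deepcopy(model)
        model = step(model, ref_model, batch)  # loss compares model vs. ref
    return model
```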

The second change humanline applies is something called asymmetric clipping. This entails putting uneven limits on how much the model can adjust its predictions during training. It forces the model to mimic the way humans overweight some probabilities and underweight others.
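Many alignment methods bound how far the trained model’s output probabilities may drift from the reference model’s by clipping the ratio between the two; a symmetric version uses the same margin on both sides. Here is a hedged sketch of the asymmetric variant, with illustrative margins that are not the paper’s values:

```python
def asymmetric_clip(ratio, eps_low=0.2, eps_high=0.4):
    """Clip the model-to-reference probability ratio with uneven bounds.
    Unequal margins let updates move further in one direction than the
    other, echoing how humans overweight some probabilities and
    underweight others."""
    lower, upper = 1.0 - eps_low, 1.0 + eps_high
    return max(lower, min(ratio, upper))

print(asymmetric_clip(1.8))  # 1.4: upward moves capped at +40 percent
print(asymmetric_clip(0.5))  # 0.8: downward moves capped at -20 percent
```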

In the researchers’ experiments, online training consistently outperformed offline training by 30–60 percent. Applying humanline to offline methods closed the entire gap with online methods while running, in some cases, more than six times faster.

On instruction-following tasks, where prompts like “Write a professional email” received completions scored by a reward model, the baseline offline approach outperformed OpenAI’s GPT-4 Turbo model just over 15 percent of the time. But the offline + humanline approach notched a roughly 25 percent rate, consistent with the performance of standard online training.

For math reasoning, where the data contained verifiable right-wrong answers, both standard online training and the humanline variant of offline alignment reached approximately 59 percent accuracy on the MATH500 benchmark, a set of 500 math problems used to assess AI models’ capacity for mathematical problem-solving. However, the humanline approach sampled new training problems 64 times less frequently. This dramatically reduced computational requirements while still maintaining performance.

However, the researchers stress that these results showing parity with online alignment reflect an empirical regularity rather than a formal guarantee. Data quality remains crucial, and their premise—that prospect theory extends to AI outputs—is but one possible explanation, albeit one supported by experimental results.

Ethayarajh notes that GLM-5, a widely discussed coding model from China, recently used a scalable combination of online and humanline alignment: a training approach that works with data generated by a slightly older variant of the model rather than fully static offline data. If future research shows that humanline works broadly, it could make AI training faster and cheaper, thereby enabling more businesses to use their own data to build and regularly update language models tailored to their needs.
