
Do AI Detectors Work Well Enough to Trust?

Researchers developed a policy framework for evaluating AI detection tools.

Generative artificial intelligence has set off a tremendous amount of excitement, speculation, and anxiety thanks to its ability to convincingly mimic human work, including human writing. Although a machine that writes like a person is useful in many applications, an inability to discern human from AI writing can also create real problems: Students can avoid learning, lawyers can cite bogus case law, and journalists can publish misleading information.

Accusing someone of using AI inappropriately when they haven’t can have lasting reputational consequences; failing to identify AI-generated work can skew evaluations of genuinely human work. This conundrum has inspired a cottage industry of companies that claim to help users consistently tell the difference between AI and human writing. But how useful are they?

Research from Chicago Booth principal researcher Brian Jabarian and Booth’s Alex Imas evaluated consumer tools for identifying AI-generated text. Their results not only demonstrate the viability of AI writing detectors, but also suggest a data-driven method for schools, employers, and others to implement such tools in their own institutional settings.

The researchers built a dataset of about 2,000 human-written passages spanning six mediums: blogs, consumer reviews, news articles, novels, restaurant reviews, and résumés. They then used four popular large language models to generate AI versions of the content, with prompts designed to elicit text similar to the originals.

They used the passages to test three commercial detectors and one open-source model, evaluating each on its rate of false negatives (identifying AI text as human) and false positives (flagging human writing as AI). The study evaluated the detectors across different lengths of writing: long (résumés and excerpts from novels, roughly 1,000 words), medium (articles and blogs, 200–500 words), short (reviews, 80–130 words), and very short (passages under 50 words). Because shorter passages give the detectors less text to analyze, they tend to be harder to classify correctly.

Jabarian and Imas also explored how the detectors’ results changed when they adjusted the level of certainty the tools needed in order to identify writing as AI-generated. A higher threshold means a lower tolerance for false positives; dialing that number up or down reflects the reality that some organizations will be more reluctant than others to incorrectly flag human work as AI-generated.
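To make that tradeoff concrete, the sketch below assumes a detector that returns a suspicion score between 0 and 1 for each passage and flags anything at or above a chosen threshold; the scores, labels, and thresholds are invented for illustration and do not come from the study.

```python
import numpy as np

# Hypothetical detector scores in [0, 1]: higher = more likely AI-generated.
# Labels: 1 = AI-generated, 0 = human-written. All values are illustrative.
scores = np.array([0.05, 0.10, 0.35, 0.55, 0.60, 0.88, 0.91, 0.97])
labels = np.array([0,    0,    0,    1,    0,    1,    1,    1])

def error_rates(scores, labels, threshold):
    """Flag a passage as AI when its score meets the threshold, then compute
    the false positive rate (human flagged as AI) and the false negative
    rate (AI text passed off as human)."""
    flagged = scores >= threshold
    fp_rate = np.mean(flagged[labels == 0])   # human passages wrongly flagged
    fn_rate = np.mean(~flagged[labels == 1])  # AI passages missed
    return fp_rate, fn_rate

# Raising the threshold trades false positives for false negatives.
for t in (0.5, 0.7, 0.9):
    fp, fn = error_rates(scores, labels, t)
    print(f"threshold={t:.1f}  false positives={fp:.0%}  false negatives={fn:.0%}")
```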

The detectors’ performance varied depending on the length of the writing, the LLM used, and the decision threshold, but some patterns emerged. All three commercial tools—GPTZero, Originality.ai, and Pangram—showed at least a reasonably discerning eye for AI text on medium-length and long pieces of writing but lost accuracy on passages under 50 words. The open-source detector, RoBERTa, performed substantially worse than all three commercial alternatives—in many cases, its accuracy was close to that of random guessing—leading the researchers to conclude that it is “unsuitable for high-stakes applications.”

Four detectors get put to the test 

The researchers tested three commercial and one open-source detector on various texts that were either human written or AI generated. Three of the four kept the rate of incorrectly marking human-written text as AI generated to around 2 percent or less. Pangram did best, making almost no mistakes.

One method they used to evaluate the tools was to determine the likelihood of a detector being more suspicious of a randomly selected piece of AI writing than a randomly selected piece of human writing. By this metric, Pangram’s accuracy was 100 percent for most models and types of writing, and never lower than 99.8 percent. Originality.ai scored slightly lower, and GPTZero slightly lower still, though its accuracy remained at 96 percent even on short passages.
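The metric described here corresponds to the usual probabilistic reading of the area under the ROC curve. The short sketch below computes it directly as a pairwise comparison of hypothetical suspicion scores; the numbers are made up for illustration.

```python
import numpy as np

# Illustrative suspicion scores from a hypothetical detector.
human_scores = np.array([0.02, 0.10, 0.15, 0.30])  # human-written passages
ai_scores    = np.array([0.40, 0.75, 0.90, 0.99])  # AI-generated passages

def pairwise_accuracy(ai_scores, human_scores):
    """Probability that a randomly chosen AI passage receives a higher
    suspicion score than a randomly chosen human passage (ties count half)."""
    ai = ai_scores[:, None]      # shape (n_ai, 1)
    hu = human_scores[None, :]   # shape (1, n_human)
    wins = (ai > hu).mean()      # fraction of AI-vs-human pairs ranked correctly
    ties = (ai == hu).mean()
    return wins + 0.5 * ties

print(f"pairwise accuracy: {pairwise_accuracy(ai_scores, human_scores):.1%}")
```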

All three commercial tools kept false positive rates below 1 percent, with Pangram’s the lowest—essentially 0 across most decision thresholds. False negative rates were higher, coming in between roughly 0 percent and 2 percent for GPTZero and between 2 percent and 4 percent for Pangram. Originality.ai’s false negatives were higher still: between 10 percent and 40 percent, depending on the model.

Given that every detector is at least slightly imperfect, organizations still have to evaluate for themselves if and how to use them, trading off the potential for AI misuse against the risk of false accusations. To guide these choices, the researchers propose a “policy cap” framework that lets institutions set a strict tolerance for false positives (for instance, no more than 0.5 percent of human writing flagged as AI) and then compare detectors under that same standard.
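As a rough sketch of how an institution might apply such a cap, the code below calibrates a hypothetical detector on held-out data: it keeps only thresholds whose false positive rate stays at or under the cap and then picks the one that misses the least AI text. The function name, data, and conventions are assumptions for illustration, not the researchers’ implementation.

```python
import numpy as np

def calibrate_to_cap(scores, labels, fp_cap=0.005):
    """Among candidate thresholds, keep those whose false positive rate on
    held-out human text stays at or below the cap, then choose the one with
    the lowest false negative rate. Labels: 1 = AI-generated, 0 = human."""
    human = scores[labels == 0]
    ai = scores[labels == 1]
    best = None
    for t in np.unique(scores):
        fp_rate = np.mean(human >= t)
        if fp_rate > fp_cap:
            continue                      # violates the institution's cap
        fn_rate = np.mean(ai < t)
        if best is None or fn_rate < best[2]:
            best = (t, fp_rate, fn_rate)
    return best                           # (threshold, FP rate, FN rate)

# Hypothetical validation data; detectors can then be compared on the
# false negative rate each achieves under the same false positive cap.
scores = np.array([0.01, 0.04, 0.20, 0.65, 0.70, 0.93, 0.96, 0.99])
labels = np.array([0,    0,    0,    0,    1,    1,    1,    1])
print(calibrate_to_cap(scores, labels, fp_cap=0.005))
```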

Right now, the researchers say, Pangram is “the only AI detector maintaining policy-grade levels on our main metrics when evaluated on all four generative AI models.” However, they also warn that AI detection is a rapidly evolving field, and that performance will likely vary as detectors, LLMs, “humanizer” tools (which rewrite AI text to evade detection), and users compete in a “technical arms race.” They suggest that regular performance audits should be undertaken and their results published in order to help organizations make sure their use of detectors is both effective and fair.
