For the approximately 450,000 people released annually from US prisons, the chances are high that they’ll be back in a cell within five to 10 years. About 61 percent of state prisoners who were released in 2008 returned to prison within 10 years for a parole or probation violation or a new sentence, find Matthew R. Durose and Leonardo Antenangeli, both formerly at the Bureau of Justice Statistics. In another analysis, the same researchers report that about 70 percent of state prisoners released in 2012 were rearrested within five years, and nearly 50 percent ended up back in prison within that time.
These statistics tell the collective story of people who are struggling—and whose futures may turn on whether or not they are accepted into a diversion program. Such programs are meant to direct people who commit low-level offenses away from the road to incarceration and keep convicted criminals from returning to prison. Increasingly, the humans making the acceptance and rejection decisions are helped by machine-run algorithms that can sort through thousands of data points to predict who is most likely to succeed in such a program.
In recent decades, many county court systems and municipalities have begun using algorithms to guide this and other criminal-justice decisions. These algorithms, which can inform choices such as where to put police or whether someone accused of a crime should be let out on bail, execute a set of rules that may include basic statistical analysis or more sophisticated techniques from machine learning (ML)—a branch of artificial intelligence that learns patterns from data. Their use has raised concerns from the American Civil Liberties Union, among others. Unlike when they’re to blame for a bad hiring decision or a failed ad-targeting campaign, algorithmic errors in policing or sentencing can destroy lives. (For more, read “Law and Order and Data.”)
The stakes are just as high for decisions about diversion programs, where resources are often limited, and the goal is to identify the people most likely to complete treatment and be rehabilitated. Provide services to someone who goes on to commit a violent crime, and you’ve failed in your duty to protect public safety. Send someone to prison who would have thrived in treatment, and you might condemn them to a lifetime of cycling through the system.
Recognizing both pitfalls and opportunities, teams of researchers are employing techniques from criminology, machine learning, and operations research with the goal of making such systems fairer and more reliable. They’re applying queueing theory to better design diversion programs. They’re using ML techniques to uncover hidden patterns in data that help algorithms understand an offender’s personal circumstances. They’re even employing equilibrium analysis techniques to understand the long-term dynamics between offenders, communities, and policymakers, revealing how short-sighted policies can worsen community welfare over time. The key to all of the projects, says Purdue’s Pengyi Shi, is to “promote responsible use of AI.”
The work to lower rates of recidivism has the potential to benefit not only the individuals directly affected but also taxpayers. The United States incarcerates more people than any other country, according to the Prison Policy Initiative, which puts the number at approximately 2 million. The nonpartisan, nonprofit organization estimates that funding prisons, jails, parole, and probation costs at least $80 billion a year.
Diversion programs represent just a small piece of the community-justice system: About 2 percent of felony criminal cases in state courts result in diversion or are deferred, according to a 2020 report from the National Center for State Courts. But improved decision-making tools have the potential to make a big difference, both for state budgets and perhaps especially for individuals who want to get their lives back on track. “If we do more to reduce recidivism, we get safer communities,” says Chicago Booth’s Amy Ward.
The issue of imperfection
Diversion programs vary in design and quality, but in general they attempt to address the root causes of criminal behavior, including substance use disorder, mental-health issues, and a lack of job skills.
In a typical program, a case manager assesses potential candidates for their risk of reoffending and determines what kind of help they need. This is, in some ways, a challenge familiar to researchers who study service operations, in which customers may receive different treatment depending on their status. For example, a high-mileage frequent flier contacting an airline’s call center may have priority status and so wait less time for an agent than others. Resource constraints prevent providing such red-carpet treatment to everyone, and increasingly, ML techniques are being used to identify the optimal recipients—the people most likely to be or become loyal patrons, and therefore the most profitable customers. Criminal justice faces similar challenges, notes Ward. “In order to use their limited resources as efficiently as possible, diversion programs must assess which individuals to give priority status to when making admission decisions,” she says.
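To make the analogy concrete, here is a minimal sketch, in Python, of priority-based admission to a capacity-limited program. The candidate IDs and “predicted benefit” scores are invented for illustration and are not drawn from any real program or from the researchers’ models.

```python
import heapq

# Toy illustration: admit candidates to a capacity-limited program in order of a
# hypothetical predicted probability of completing it, the way a call center
# serves its priority customers first.
candidates = [
    ("A", 0.82),  # (id, predicted probability of completing the program)
    ("B", 0.41),
    ("C", 0.67),
    ("D", 0.55),
]

capacity = 2  # only two open slots this month

# heapq is a min-heap, so push negative scores to pop the highest score first.
heap = [(-score, cid) for cid, score in candidates]
heapq.heapify(heap)

admitted = [heapq.heappop(heap)[1] for _ in range(min(capacity, len(heap)))]
print("Admitted:", admitted)  # Admitted: ['A', 'C']
```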
The case manager’s assessment process could be partially automated, suggests research by University of Illinois PhD student Bingxuan Li, Shi, and Ward. The researchers, working with Adult Redeploy Illinois, a program that offers people convicted of minor offenses community-based rehabilitation instead of prison time, taught a model to think like an experienced case manager and make the same expert-level inferences. But their research also illustrates some of the ways algorithms are imperfect—and emphasizes the need for a framework for how human judgment should be part of the process. (Read “The Best Experts May Be People-Assisted Machines.”)
Operations researchers generally cannot experiment with how policy changes affect real people and instead use theoretical models and available data to run thousands of what-if scenarios. In doing this, research can help identify how and why algorithms stumble. When Li, Shi, and Ward taught AI to think like a case manager, they used a large language model, a type of generative AI susceptible to hallucinations. At one point, the model confidently claimed that households in a certain zip code had a median income of $200,000 and 99.9 percent employment, when the actual figures were $72,280 and 64 percent.
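One general safeguard against such hallucinations, sketched below with made-up census figures and a hypothetical tolerance, is to cross-check any model-generated statistic against an authoritative lookup table before it informs a decision. This is an illustrative pattern, not a description of the researchers’ pipeline.

```python
# Hypothetical guardrail: compare LLM-generated neighborhood statistics with an
# authoritative lookup table before using them. The census figures, zip code,
# and tolerance below are illustrative, not from the study.
census = {"60601": {"median_income": 72280, "employment_rate": 0.64}}

def validate(zip_code: str, llm_claim: dict, rel_tol: float = 0.10) -> bool:
    """Return True only if every claimed figure is within rel_tol of ground truth."""
    truth = census.get(zip_code)
    if truth is None:
        return False  # no ground truth available; do not trust the claim
    return all(abs(llm_claim[k] - v) <= rel_tol * abs(v) for k, v in truth.items())

# The kind of confident-but-wrong claim described above would be rejected:
print(validate("60601", {"median_income": 200000, "employment_rate": 0.999}))  # False
print(validate("60601", {"median_income": 70000, "employment_rate": 0.66}))    # True
```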
Data proved another complicating factor. The researchers wanted to predict candidates’ risk levels, namely the probability that someone would reoffend. They had access to a variety of data including criminal history, demographics, employment status, and housing stability. But unobserved factors also play a role in whether or not someone will reoffend, and it’s impossible to identify every possible factor that would need to be taken into account and recorded. As a result, ML predictions are inherently subject to error.
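A small simulation makes the point about unobserved factors. In the synthetic example below, reoffending depends partly on a variable the classifier never sees, so its accuracy plateaus no matter how much observed data it gets. All features, coefficients, and outcomes are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic illustration of irreducible prediction error: the outcome depends
# partly on an unobserved factor missing from every dataset.
rng = np.random.default_rng(0)
n = 5000
observed = rng.normal(size=(n, 3))   # stand-ins for criminal history, employment, housing
unobserved = rng.normal(size=n)      # factor no agency records
logits = observed @ np.array([1.0, -0.8, 0.5]) + 1.5 * unobserved
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(observed, y)
print(f"Accuracy on observed features only: {model.score(observed, y):.2f}")
# Accuracy stays well below 1.0 because the model cannot see the hidden factor.
```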
Li, Hebrew University of Jerusalem’s Antonio Castellanos, Shi, and Ward studied staffing requirements for a diversion program and encountered another data challenge. They had detailed information that included each participant’s age, county, criminal history, and race—and wanted to predict the length of time each participant would stay in the program before either graduating, dropping out, or being removed (perhaps after reoffending).
The problem lay in the nature of the data. Participants with identical observable characteristics showed widely varying program durations. Some stayed for weeks while others remained for one year or more. The categorical data couldn’t be used to distinguish between individuals in ways that would enable accurate length-of-stay predictions, a shortcoming that the researchers confirmed via statistical tests.
So they adopted a different approach for the length-of-stay determination, grouping past participants by key characteristics and sampling from historical data patterns. This practical solution worked better, which illustrates another limitation of ML: Even sophisticated models cannot derive predictive signals when useful patterns can’t be found in the data.
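The group-and-sample idea can be sketched in a few lines. The records below are simulated stand-ins, not Adult Redeploy Illinois data, and the grouping variables are chosen only for illustration.

```python
import pandas as pd

# Minimal sketch of group-and-sample: rather than regressing length of stay on
# features that carry little signal, draw a stay length from the historical
# distribution of similar past participants.
history = pd.DataFrame({
    "county": ["Cook", "Cook", "Cook", "Peoria", "Peoria"],
    "prior_felony": [True, True, False, False, True],
    "days_in_program": [45, 400, 210, 120, 365],
})

def sample_length_of_stay(county: str, prior_felony: bool, rng_seed: int = 0) -> int:
    """Sample a plausible length of stay from past participants in the same group."""
    group = history[(history.county == county) & (history.prior_felony == prior_felony)]
    if group.empty:
        group = history  # fall back to the full history if the group is unseen
    return int(group["days_in_program"].sample(1, random_state=rng_seed).iloc[0])

print(sample_length_of_stay("Cook", True))
```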
Data scarcity and bias
Data are crucial to the success or failure of an AI system, and unlike their counterparts at tech companies, researchers working on problems in the criminal-justice system don’t have the luxury of gathering detailed behavioral data about millions of users. For starters, they often work with anonymized datasets, which makes it impossible to link in additional information from social media or psychological profiling.
Such constraints force algorithms to make predictions based on relatively crude proxies such as age at first offense for developmental factors, educational attainment for cognitive capacity, standardized risk scores for complex psychological profiles, and zip-code demographics for socioeconomic status. These factors may correlate with success or failure in a rehabilitation program but are far from perfect predictors. The result is a system that works reasonably well for clear-cut cases but struggles with the nuanced, complex situations that make up most criminal-justice decisions.
Booth PhD student Zhiqiang Zhang, Shi, and Ward’s 2025 research on diversion-program admissions finds this problem to be particularly acute for borderline cases, which involve candidates whose risk scores fall near the decision boundary between admittance and denial. It’s intuitive that someone with unstable housing and no job is at higher risk of committing a crime than someone with stable housing and full-time employment. But what about someone with a part-time job who has temporary housing? These complex cases involve people who don’t fit neatly into high-risk or low-risk categories. They are exactly the situations where the research indicates we may not want to rely on an algorithm to make the decision but instead exercise human judgment.
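A simple routing rule captures the spirit of that recommendation, though the threshold and margin below are hypothetical rather than taken from the paper.

```python
# Illustrative routing rule (not the paper's exact policy): let the algorithm
# decide clear-cut cases and refer borderline risk scores to a human reviewer.
THRESHOLD = 0.5   # scores below this suggest admitting to the diversion program
MARGIN = 0.1      # scores within this band of the threshold are "borderline"

def route(risk_score: float) -> str:
    if abs(risk_score - THRESHOLD) < MARGIN:
        return "refer to human case manager"
    return "admit" if risk_score < THRESHOLD else "deny"

for score in (0.2, 0.48, 0.55, 0.9):
    print(score, "->", route(score))
# 0.2 -> admit; 0.48 and 0.55 -> refer to human case manager; 0.9 -> deny
```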
There’s also an observational blind spot: Researchers see the outcomes of people who were admitted to programs but don’t have any information about how rejected applicants might have performed. What if algorithms consistently chose the wrong participants? Could the people they rejected have had better success rates than those who were admitted?
Greater diversity in training data would help address that weakness, but again there’s an ethical roadblock. The researchers’ simulations involve real data, and “you can’t put hardened criminals in a program just to gather data,” says Ward. An experiment that passed over a promising candidate in favor of a less-promising one would be hard to justify, particularly if the latter were to reoffend while enrolled in a program—a public-safety concern and a public-relations nightmare.
Algorithms also face a well-documented bias problem. The training data they use can create unfair disparities in how groups are treated. AI might identify patterns that systematically disadvantage certain groups while favoring others. For instance, if a model tasked with admission decisions were to learn that individuals with tattoos had a higher tendency than others to relapse into criminal behavior, it could inadvertently and unfairly penalize people from the South Pacific, where tattooing is traditionally valued, while favoring those from religions such as Islam and Judaism, which traditionally prohibit tattoos.
AI could also mistake correlation for causality. Just because more people with tattoos committed crimes does not mean that tattoos caused the people to commit crimes. There could be an unobserved factor at play, such as that individuals with higher intrinsic susceptibility to violence are both more likely to commit crimes and more likely to get a tattoo. What’s more, using biased data to train algorithms creates a feedback loop that amplifies inequality over time.
One possible remedy would be to impose fairness constraints, such as incorporating criteria requiring algorithms to output similar proportions of positive outcomes for various communities, particularly around gender, race, or socioeconomic status. A model could be written so that if 10 percent of White defendants were admitted to a diversion program, 10 percent of Black defendants and 10 percent of Asian defendants would receive that same positive outcome. However, these types of criteria are often imposed without a full understanding of the underlying group dynamics (in the case of the above example, how different groups may react differently to the 10 percent admission rate), which research—by the Ohio State University’s Xueru Zhang and Mahdi Khalili, Bilkent University’s Cem Tekin, and University of Michigan’s Mingyan Liu—has shown can have unintended and likely negative consequences.
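A minimal sketch of such a parity constraint, using simulated scores and hypothetical group labels, picks a separate cutoff for each group so that all groups are admitted at the same target rate.

```python
import numpy as np

# Demographic-parity sketch with simulated data: admit the top 10 percent of
# each group, which equalizes admission rates but requires different cutoffs.
rng = np.random.default_rng(1)
scores = {
    "group_a": rng.beta(2, 5, size=1000),  # hypothetical predicted success probabilities
    "group_b": rng.beta(4, 4, size=1000),
}
target_rate = 0.10

for group, s in scores.items():
    cutoff = np.quantile(s, 1 - target_rate)   # top 10 percent of each group
    admit_rate = np.mean(s >= cutoff)
    print(f"{group}: cutoff={cutoff:.2f}, admit rate={admit_rate:.2f}")
# Both groups end up with a ~0.10 admission rate, but the cutoffs differ,
# which is exactly the kind of trade-off the research examines.
```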
In a 2022 paper about routing, Zhang, Shi, and Ward considered this thorny issue of fairness. If you force equal outcomes, you’re essentially ignoring real differences that might exist between groups, not because of inherent traits but because of circumstances. Say people with tattoos are less affluent than people without them, on average. Over time you would likely see a lower proportion of tattooed applicants succeed because they had less support.
The cost of fairness
This raises a question about fairness and the associated trade-offs. In practice, if judges or caseworkers knew certain people from groups were more likely to complete treatment successfully, they could reduce crime more effectively by giving those groups preferential access to a program. But this efficiency gain would come at the expense of fairness. Is it acceptable to give some people an edge in accessing services and additional resources on the basis of group characteristics rather than individual merit?
Zhang, Shi, and Ward don’t directly address the question of fairness, but they can quantify how much it would cost a theoretical queueing system to treat all groups equally versus optimizing for the best overall outcomes. To do so, they used a theoretical, mathematical “survival” model often employed to predict patient outcomes in medical research. The model involves two customer groups with different risk distributions and different responses to intensive versus standard services.
The researchers tested two policies: a “fair” approach that gives both groups equal chances at intensive services, and a potentially “unfair” policy that prioritizes the group more likely to succeed when receiving such services. The cost of fairness can be substantial—in some scenarios, fair policies reduced system efficiency by more than 15 percent, the researchers find. Even more troubling, policies designed to be fair in the short run can exacerbate unfairness in the long run. This reflects a broader issue: metrics that look good after one year may look poor after 10 years. If a policy leads to most defendants being put behind bars for 18 months, the one-year metrics may look great because no one has recidivated. But after 10 years, without attempts to address root-cause issues through rehabilitation or community support, recidivism may be far worse.
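A back-of-the-envelope simulation, with invented completion probabilities rather than the paper’s survival model, shows how the efficiency gap between a fair policy and a prioritized one can be quantified.

```python
import numpy as np

# Stylized fairness-versus-efficiency comparison using made-up numbers.
rng = np.random.default_rng(2)
n_slots = 100
# Hypothetical completion probabilities when given intensive services:
p_success = {"group_1": 0.70, "group_2": 0.50}

def completions(slots_by_group):
    """Simulate how many admitted participants complete the program."""
    return sum(rng.binomial(slots, p_success[g]) for g, slots in slots_by_group.items())

fair = {"group_1": n_slots // 2, "group_2": n_slots // 2}   # equal access
prioritized = {"group_1": n_slots, "group_2": 0}            # favor likelier completers

print("Fair policy completions:       ", completions(fair))
print("Prioritized policy completions:", completions(prioritized))
# The prioritized policy completes more cases on average (0.70 vs. 0.60 per slot)
# but shuts one group out entirely; quantifying that gap is the "cost of fairness."
```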
Optimizing for long-term good
In another study, still in its early stages, Shi and Ward, this time with Booth principal researcher Chuwen Zhang, employed an equilibrium approach similar to the kind used to study the long-term behavior of epidemics. The approach allowed them to treat resource allocation like a chess match, examining the interactions among three key players: an offender, a policymaker, and a member of the general public.
In their model, each individual makes decisions based on the others’ moves. The offender is less likely to commit a crime if provided support when reentering society. The policymaker allocates limited resources—probation officers, prison beds, or slots in diversion programs—to reduce crime while maintaining public support. Meanwhile, public attitudes toward crime and punishment influence both policy implementation and effectiveness.
Their theoretical framework suggests there’s an ongoing struggle: When the justice system changes its approach, such as by increasing probation-officer staffing or expanding diversion programs, people involved in it are likely to adjust their behavior as a result.
Rather than examine the immediate effects of these algorithm-informed policies, the researchers focused on long-term equilibrium, or what happens once everyone has figured out how to respond and the system reaches a new balance. They find that well-intentioned policies can backfire unexpectedly—as would be the case with locking up everyone for 18 months. Understanding how the whole system evolves over time is crucial for designing effective policies, the researchers write. When evaluating an algorithm, we should ask what happens when we follow it in practice, across different groups and over years, to see the effects on long-term public safety and well-being. Their framework lays the groundwork for developing a tool to do this.
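The flavor of that equilibrium analysis can be sketched with a stylized fixed-point iteration. Every functional form and coefficient below is invented for illustration; the researchers’ actual model is far richer.

```python
# Stylized fixed-point iteration in the spirit of the equilibrium framework:
# crime responds to reentry support, support responds to public opinion and
# crime, and public opinion responds to crime.
def crime_rate(support: float) -> float:
    return max(0.0, 0.5 - 0.4 * support)    # more support, less reoffending

def public_support(crime: float) -> float:
    return max(0.0, 1.0 - 1.2 * crime)      # crime erodes support for programs

def policy_support(crime: float, opinion: float) -> float:
    return 0.5 * opinion * (1 - crime)      # policymaker balances both pressures

support = 0.2  # initial share of resources devoted to reentry support
for step in range(50):
    crime = crime_rate(support)
    opinion = public_support(crime)
    support = policy_support(crime, opinion)

print(f"Equilibrium: support={support:.2f}, crime={crime:.2f}, opinion={opinion:.2f}")
# The point is not the numbers but the method: evaluate a policy where the
# system settles, not just in its first year.
```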
Models reflect values
An emphasis on long-term thinking leads to this crucial consideration: We shouldn’t be focusing on fully automating criminal-justice decisions but instead be determining where we can trust algorithms and when humans should intervene. Research suggests automation works best at the extremes: when program capacity is either scarce or abundant, the decision boundaries are clearer. Human oversight proves most valuable in the middle range, where admission thresholds are fuzzier.
But technical decisions reflect profound value judgments. For example, emphasizing short-term savings versus long-term societal benefits can produce dramatically different recommendations from identical data. There’s no inherently optimal policy; the model simply reflects whatever values humans build into the system.
As policymakers, program directors, and researchers weigh the benefits of fairness versus efficiency, or algorithmic versus human decision-making, the big picture is that the status quo perpetuates cycles of incarceration and reoffending. Thoughtfully redesigned systems could help break these cycles while protecting communities. Used well, algorithms could help make this a reality. Used badly, they risk entrenching existing inequalities under the guise of scientific legitimacy.
Early results offer genuine hope. A spokesperson for the Adult Redeploy Illinois program says that “the evidence-informed practices employed by ARI sites, such as cognitive behavioral therapy, have been shown to reduce recidivism rates by 20 percent or more in some cases.” Sending someone to the Illinois Department of Corrections costs $49,000 per year, whereas the average ARI intervention costs $5,000. “For state fiscal year 2025, ARI reported an estimated $83 million in total costs avoided,” according to the spokesperson, who says that “programs can benefit from the use of tools that support objective decision-making and effective resource allocation.”
Could AI tools help make the financial impact even bigger? And could they keep more people from returning to prison? “The promise of AI is to improve our lives,” says Ward. “Let’s hold it to that standard.”
- Matthew R. Durose and Leonardo Antenangeli, “Recidivism of Prisoners Released in 34 States in 2012: A 5-Year Follow-Up Period (2012–2017),” Bureau of Justice Statistics report, July 2021.
- Bingxuan Li, Antonio Castellanos, Pengyi Shi, and Amy Ward, “Combining Machine Learning and Queueing Theory for Data-Driven Incarceration-Diversion Program Management,” Proceedings of the AAAI Conference on Artificial Intelligence, March 2024.
- Bingxuan Li, Pengyi Shi, and Amy Ward, “Latent Feature Mining for Predictive Model Enhancement with Large Language Models,” Preprint, arXiv, October 2024, arXiv: 2410.04347.
- Chuwen Zhang, Pengyi Shi, and Amy Ward, “Better Resource Allocations in the Criminal Justice System: Optimizing for the Long-term Good,” Research in progress.
- Xueru Zhang, Mahdi Khalili, Cem Tekin, and Mingyan Liu, “Group Retention When Using Machine Learning in Sequential Decision Making: The Interplay Between User Dynamics and Fairness,” Advances in Neural Information Processing Systems 32, 2019.
- Zhiqiang Zhang, Pengyi Shi, and Amy Ward, “Admission Decisions Under Imperfect Classification: An Application in Criminal Justice,” Working paper, May 2025.
- ———, “Routing for Fairness and Efficiency in a Queueing Model with Reentry and Continuous Customer Classes,” 2022 American Control Conference, June 2022.