Center for Applied AI's Faculty Research
The Center for Applied AI is proud to share research from our talented faculty members and affiliates.
Faculty + Research Papers
Liang, Tengyuan: Professor of Econometrics and Statistics and William Ladany Faculty Fellow, Chicago Booth
We study Langevin dynamics for recovering the planted signal in the spiked matrix model. We provide a path-wise characterization of the overlap between the output of the Langevin algorithm and the planted signal. This overlap is characterized in terms of a self-consistent system of integro-differential equations, usually referred to as the Crisanti-Horner-Sommers-Cugliandolo-Kurchan (CHSCK) equations in the spin-glass literature. As a second contribution, we derive an explicit formula for the limiting overlap in terms of the signal-to-noise ratio and the injected noise in the diffusion. As an upshot, this uncovers a sharp phase transition—in one regime, the limiting overlap is strictly positive, while in the other, the injected noise overcomes the signal, and the limiting overlap is zero.
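For orientation, a rough rendering of the setup in our own notation (not necessarily the paper's): the data are a rank-one signal buried in symmetric noise, the algorithm runs Langevin diffusion on the associated energy landscape, and the tracked quantity is the overlap with the planted signal.

```latex
% Illustrative notation only
Y \;=\; \frac{\lambda}{N}\, v v^{\top} + W,
\qquad
dX_t \;=\; -\nabla H(X_t)\, dt + \sqrt{2\beta^{-1}}\, dB_t,
\qquad
m_N(t) \;=\; \frac{1}{N}\,\langle X_t, v\rangle,
```

with λ the signal-to-noise ratio, W a symmetric noise matrix, H the Hamiltonian induced by the model, β⁻¹ the injected noise in the diffusion, and m_N(t) the overlap whose large-N limit the CHSCK equations characterize.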
Ludwig, Jens: Edwin A. and Betty L. Bergman Distinguished Service Professor, UChicago Harris School of Public Policy
Algorithms (in some form) are already widely used in the criminal justice system. We draw lessons from this experience for what is to come for the rest of society as machine learning diffuses. We find economists and other social scientists have a key role to play in shaping the impact of algorithms, in part through improving the tools used to build them.
Most empirical policy work focuses on causal inference. We argue an important class of policy problems does not require causal inference but instead requires predictive inference. Solving these "prediction policy problems" requires more than simple regression techniques, since these are tuned to generating unbiased estimates of coefficients rather than minimizing prediction error. We argue that new developments in the field of "machine learning" are particularly useful for addressing these prediction problems. We use an example from health policy to illustrate the large potential social welfare gains from improved prediction.
The law forbids discrimination. But the ambiguity of human decision-making often makes it extraordinarily hard for the legal system to know whether anyone has actually discriminated. To understand how algorithms affect discrimination, we must therefore also understand how they affect the problem of detecting discrimination. By one measure, algorithms are fundamentally opaque, not just cognitively but even mathematically. Yet for the task of proving discrimination, processes involving algorithms can provide crucial forms of transparency that are otherwise unavailable. These benefits do not happen automatically. But with appropriate requirements in place, the use of algorithms will make it possible to more easily examine and interrogate the entire decision process, thereby making it far easier to know whether discrimination has occurred. By forcing a new level of specificity, the use of algorithms also highlights, and makes transparent, central tradeoffs among competing values. Algorithms are not only a threat to be regulated; with the right safeguards in place, they have the potential to be a positive force for equity.
While hypothesis testing is a highly formalized activity, hypothesis generation remains largely informal. We propose a procedure that uses machine learning algorithms—and their capacity to notice patterns that people might not—to generate novel hypotheses about human behavior. We illustrate the procedure with a concrete empirical application: pre-trial decisions by judges. We begin with a striking fact. An algorithmic model reveals that a single factor explains nearly half of the predictable variation in who judges choose to jail: the pixels in the defendant’s mugshot. The mugshot remains highly predictive even after controlling for race, skin color, demographics, and facial features previously emphasized by psychologists. Moreover, human judgments about who will be jailed—based on the mugshots—do significantly worse than the algorithm’s. What, then, has the algorithm discovered about who judges choose to jail? To answer this question, we build a communication procedure that allows people to see what the algorithm “sees.” We find that subjects using the procedure appear to understand the algorithm: they are able to articulate facial features that turn out to explain the algorithm’s predictions. These novel features also explain the actual choices judges make: defendants with these features are jailed at significantly higher rates. Though our results are specific, our approach is general. The modern world produces troves of high-dimensional data, e.g., from cell phones, satellites, online behavior, news headlines, corporate filings, and high-frequency time series. Our framework provides a way to produce novel interpretable hypotheses from high-dimensional data such as these, a way to marry the predictive power of machine learning with human intuition. A central tenet of our paper is that hypothesis generation is in and of itself a valuable activity, and we hope this encourages future work in this “pre-scientific” stage of science.
Concerns that algorithms may discriminate against certain groups have led to numerous efforts to 'blind' the algorithm to race. We argue that this intuitive perspective is misleading and may do harm. Our primary result is exceedingly simple, yet often overlooked. A preference for fairness should not change the choice of estimator. Equity preferences can change how the estimated prediction function is used (e.g., different thresholds for different groups), but the function itself should not change. We show in an empirical example for college admissions that the inclusion of variables such as race can increase both equity and efficiency.
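A minimal sketch of the point, with hypothetical variable names: the prediction function is estimated once on all available inputs, and any equity preference enters only through how its scores are used, for example group-specific admission cutoffs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def admit(X, y, X_new, group_new, thresholds):
    """One estimator for everyone; equity enters only via group-specific cutoffs."""
    model = LogisticRegression(max_iter=1000).fit(X, y)     # same function for all groups
    scores = model.predict_proba(X_new)[:, 1]               # predicted success probability
    cutoffs = np.array([thresholds[g] for g in group_new])  # e.g., {"A": 0.6, "B": 0.5}
    return scores >= cutoffs
```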
We calculate the social return on algorithmic interventions (specifically their Marginal Value of Public Funds) across multiple domains of interest to economists—regulation, criminal justice, medicine, and education. Though these algorithms are different, the results are similar and striking. Each one has an MVPF of infinity: not only does it produce large benefits, it provides a “free lunch.” We do not take these numbers to mean these interventions ought necessarily to be scaled, but rather that much more R&D should be devoted to developing and carefully evaluating algorithmic solutions to policy problems.
Evaluating whether machines improve on human performance is one of the central questions of machine learning. However, there are many domains where the data is selectively labeled, in the sense that the observed outcomes are themselves a consequence of the existing choices of the human decision-makers. For instance, in the context of judicial bail decisions, we observe the outcome of whether a defendant fails to return for their court appearance only if the human judge decides to release the defendant on bail. This selective labeling makes it harder to evaluate predictive models as the instances for which outcomes are observed do not represent a random sample of the population. Here we propose a novel framework for evaluating the performance of predictive models on selectively labeled data. We develop an approach called contraction which allows us to compare the performance of predictive models and human decision-makers without resorting to counterfactual inference. Our methodology harnesses the heterogeneity of human decision-makers and facilitates effective evaluation of predictive models even in the presence of unmeasured confounders (unobservables) which influence both human decisions and the resulting outcomes. Experimental results on real world datasets spanning diverse domains such as health care, insurance, and criminal justice demonstrate the utility of our evaluation metric in comparing human decisions and machine predictions.
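A minimal sketch of the contraction idea as we read it, with hypothetical column names; it assumes quasi-random assignment of cases to decision-makers.

```python
import numpy as np
import pandas as pd

def contraction_failure_rate(df, lenient_judge, target_release_rate):
    """Evaluate a risk model against a stricter judge's release rate using only
    cases the most lenient judge actually released (so outcomes are observed).

    Hypothetical columns: judge_id, released (0/1), failed (0/1, defined only if
    released == 1), model_risk (the model's predicted failure risk).
    """
    cases = df[df["judge_id"] == lenient_judge]
    released = cases[cases["released"] == 1].sort_values("model_risk")

    # "Contract" the lenient judge's release set: keep only the model's
    # lowest-risk releases until the overall release rate matches the target.
    n_keep = min(int(np.floor(target_release_rate * len(cases))), len(released))
    kept = released.head(n_keep)

    # Every kept case was in fact released, so its failure outcome is observed.
    return kept["failed"].mean()
```

Comparing this failure rate with the observed failure rate of stricter judges operating at the same release rate pits the model against human decision-makers without imputing outcomes for jailed defendants.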
Can machine learning improve human decision-making? Bail decisions provide a good test case. Millions of times each year, judges make jail-or-release decisions that hinge on a prediction of what a defendant would do if released. The concreteness of the prediction task combined with the volume of data available makes this a promising machine-learning application. Yet comparing the algorithm to judges proves complicated. First, the available data are generated by prior judge decisions. We only observe crime outcomes for released defendants, not for those whom judges detained. This makes it hard to evaluate counterfactual decision rules based on algorithmic predictions. Second, judges may have a broader set of preferences than the variable the algorithm predicts; for instance, judges may care specifically about violent crimes or about racial inequities. We deal with these problems using different econometric strategies, such as quasi-random assignment of cases to judges. Even accounting for these concerns, our results suggest potentially large welfare gains: one policy simulation shows crime reductions up to 24.7% with no change in jailing rates, or jailing rate reductions up to 41.9% with no increase in crime rates. Moreover, all categories of crime, including violent crimes, show reductions; these gains can be achieved while simultaneously reducing racial disparities. These results suggest that while machine learning can be valuable, realizing this value requires integrating these tools into an economic framework: being clear about the link between predictions and decisions; specifying the scope of payoff functions; and constructing unbiased decision counterfactuals.
Economists have become increasingly interested in studying the nature of production functions in social policy applications, with the goal of improving productivity. Traditionally, models have assumed workers are homogeneous inputs. However, in practice, substantial variability in productivity means the marginal productivity of labor depends substantially on which new workers are hired, which requires not an estimate of a causal effect but rather a prediction. We demonstrate that there can be large social welfare gains from using machine learning tools to predict worker productivity, using data from two important applications: police hiring and teacher tenure decisions.
Misra, Sanjog: Charles H. Kellstadt Professor of Marketing
We study deep neural networks and their use in semiparametric inference. We establish novel rates of convergence for deep feedforward neural nets. Our new rates are sufficiently fast (in some cases minimax optimal) to allow us to establish valid second-step inference after first-step estimation with deep learning, a result also new to the literature. Our estimation rates and semiparametric inference results handle the current standard architecture: fully connected feedforward neural networks (multi-layer perceptrons), with the now-common rectified linear unit activation function and a depth explicitly diverging with the sample size. We discuss other architectures as well, including fixed-width, very deep networks. We establish nonasymptotic bounds for these deep nets for a general class of nonparametric regression-type loss functions, which includes as special cases least squares, logistic regression, and other generalized linear models. We then apply our theory to develop semiparametric inference, focusing on causal parameters for concreteness, such as treatment effects, expected welfare, and decomposition effects. Inference in many other semiparametric contexts can be readily obtained. We demonstrate the effectiveness of deep learning with a Monte Carlo analysis and an empirical application to direct mail marketing.
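As a hedged illustration of the two-step structure (not the paper's code, and omitting the sample-splitting and architecture choices its theory covers): fit the nuisance functions with ReLU feedforward nets, then plug them into a standard doubly robust score for the average treatment effect.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor, MLPClassifier

def aipw_ate(X, T, Y, hidden=(64, 64)):
    """Illustrative two-step estimator: step 1 fits nuisance functions with ReLU
    feedforward nets; step 2 averages the doubly robust (AIPW) score."""
    # First step: outcome regressions by treatment arm and the propensity score.
    mu1 = MLPRegressor(hidden_layer_sizes=hidden, activation="relu", max_iter=2000)
    mu0 = MLPRegressor(hidden_layer_sizes=hidden, activation="relu", max_iter=2000)
    ps = MLPClassifier(hidden_layer_sizes=hidden, activation="relu", max_iter=2000)
    mu1.fit(X[T == 1], Y[T == 1])
    mu0.fit(X[T == 0], Y[T == 0])
    ps.fit(X, T)
    e = np.clip(ps.predict_proba(X)[:, 1], 1e-3, 1 - 1e-3)
    m1, m0 = mu1.predict(X), mu0.predict(X)

    # Second step: influence-function average and its standard error.
    psi = m1 - m0 + T * (Y - m1) / e - (1 - T) * (Y - m0) / (1 - e)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(psi))
```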
This paper proposes a procedure for assessing sensitivity of inferential conclusions for functionals of sparse high-dimensional models following model selection. The proposed procedure is called targeted under-smoothing. Functionals considered include dense functionals that may depend on many or all elements of the high-dimensional parameter vector. The sensitivity analysis is based on systematic enlargements of an initially selected model. By varying the enlargements, one can conduct sensitivity analysis about the strength of empirical conclusions to model selection mistakes. We illustrate the procedure’s performance through simulation experiments and two empirical examples.
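A rough sketch of the kind of enlargement exercise described (our construction; the enlargement rule and tuning values are illustrative, not the paper's): start from a lasso-selected model and track how the target coefficient moves as the selected set is systematically enlarged.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def sensitivity_path(X, y, target_idx, extra_sizes=(0, 5, 10, 20)):
    """Refit the target coefficient over systematically enlarged models."""
    lasso = Lasso(alpha=0.1).fit(X, y)
    selected = set(np.flatnonzero(lasso.coef_)) | {target_idx}
    # Rank the remaining variables (here: by absolute correlation with y) and
    # enlarge the selected model by the top-k candidates for each k.
    remaining = [j for j in np.argsort(-np.abs(np.corrcoef(X.T, y)[-1, :-1]))
                 if j not in selected]
    estimates = {}
    for k in extra_sizes:
        cols = sorted(selected | set(remaining[:k]))
        beta = LinearRegression().fit(X[:, cols], y).coef_
        estimates[k] = beta[cols.index(target_idx)]
    return estimates   # how much does the conclusion move under enlargement?
```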
Mullainathan, Sendhil: Roman Family University Professor of Computation and Behavioral Science
In a wide array of areas, algorithms are matching and surpassing the performance of human experts, leading to consideration of the roles of human judgment and algorithmic prediction in these domains. The discussion around these developments, however, has implicitly equated the specific task of prediction with the general task of automation. We argue here that automation is broader than just a comparison of human versus algorithmic performance on a task; it also involves the decision of which instances of the task to give to the algorithm in the first place. We develop a general framework that poses this latter decision as an optimization problem, and we show how basic heuristics for this optimization problem can lead to performance gains even on heavily-studied applications of AI in medicine. Our framework also serves to highlight how effective automation depends crucially on estimating both algorithmic and human error on an instance-by-instance basis, and our results show how improvements in these error estimation problems can yield significant gains for automation as well.
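A minimal sketch of the instance-level delegation heuristic the framework formalizes (hypothetical inputs): given per-instance estimates of human and algorithmic error, route each case to whichever is expected to err less, optionally under a cap on how many cases the algorithm may handle.

```python
import numpy as np

def delegate(alg_error, human_error, algorithm_capacity=None):
    """Route each instance to the lower expected-error decision-maker.

    alg_error, human_error: per-instance estimated error rates (same length).
    algorithm_capacity: optional max number of instances given to the algorithm.
    """
    gain = human_error - alg_error          # expected error saved by automating
    to_algorithm = gain > 0
    if algorithm_capacity is not None and to_algorithm.sum() > algorithm_capacity:
        # Under a capacity constraint, automate the instances with the largest gains.
        keep = np.argsort(-gain)[:algorithm_capacity]
        to_algorithm = np.zeros_like(to_algorithm)
        to_algorithm[keep] = True
    return to_algorithm                      # boolean mask: True -> algorithm decides
```

The sketch makes the paper's point concrete: the value of automation hinges on estimating both algorithmic and human error instance by instance, not on a single head-to-head performance comparison.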
Most empirical policy work focuses on causal inference. We argue an important class of policy problems does not require causal inference but instead requires predictive inference. Solving these "prediction policy problems" requires more than simple regression techniques, since these are tuned to generating unbiased estimates of coefficients rather than minimizing prediction error. We argue that new developments in the field of "machine learning" are particularly useful for addressing these prediction problems. We use an example from health policy to illustrate the large potential social welfare gains from improved prediction.
While hypothesis testing is a highly formalized activity, hypothesis generation remains largely informal. We propose a procedure that uses machine learning algorithms—and their capacity to notice patterns that people might not—to generate novel hypotheses about human behavior. We illustrate the procedure with a concrete empirical application: pre-trial decisions by judges. We begin with a striking fact. An algorithmic model reveals that a single factor explains nearly half of the predictable variation in who judges choose to jail: the pixels in the defendant’s mugshot. The mugshot remains highly predictive even after controlling for race, skin color, demographics, and facial features previously emphasized by psychologists. Moreover, human judgments about who will be jailed—based on the mugshots—do significantly worse than the algorithm’s. What, then, has the algorithm discovered about who judges choose to jail? To answer this question, we build a communication procedure that allows people to see what the algorithm “sees.” We find that subjects using the procedure appear to understand the algorithm: they are able to articulate facial features that turn out to explain the algorithm’s predictions. These novel features also explain the actual choices judges make: defendants with these features are jailed at significantly higher rates. Though our results are specific, our approach is general. The modern world produces troves of high-dimensional data, e.g., from cell phones, satellites, online behavior, news headlines, corporate filings, and high-frequency time series. Our framework provides a way to produce novel interpretable hypotheses from high-dimensional data such as these, a way to marry the predictive power of machine learning with human intuition. A central tenet of our paper is that hypothesis generation is in and of itself a valuable activity, and we hope this encourages future work in this “pre-scientific” stage of science.
How effective are physicians at diagnosing heart attacks? To answer this question, we contrast physician testing decisions with a machine learning model of risk. When the two deviate, we use actual health outcome data to judge whether the algorithm or the physician was right. We find physicians over-test: tests that are predictably useless are still performed. At the same time, physicians also under-test: many predicted high-risk patients are untested and then suffer adverse health events (including death) at high rates. A natural experiment using shift-to-shift testing variation confirms these findings: increasing testing improves health and reduces mortality, but only for patients flagged as high-risk by the algorithm. The simultaneous existence of over- and under-testing cannot easily be explained by incentives alone, and instead suggests errors. We provide suggestive evidence on the psychology behind these errors: (i) physicians use too simple a model of risk, suggesting bounded rationality; (ii) they over-weight salient information; and (iii) they over-weight symptoms that are representative or stereotypical of heart attack. Together, these results suggest the need for health care models and policies to incorporate not just physician incentives, but also physician mistakes.
Obermeyer, Ziad: Blue Cross of California Distinguished Associate Professor of Health Policy and Management, Berkeley School of Public Health
Health systems rely on commercial prediction algorithms to identify and help patients with complex health needs. We show that a widely used algorithm, typical of this industry-wide approach and affecting millions of patients, exhibits significant racial bias: At a given risk score, Black patients are considerably sicker than White patients, as evidenced by signs of uncontrolled illnesses. Remedying this disparity would increase the percentage of Black patients receiving additional help from 17.7 to 46.5%. The bias arises because the algorithm predicts health care costs rather than illness, but unequal access to care means that we spend less money caring for Black patients than for White patients. Thus, despite health care cost appearing to be an effective proxy for health by some measures of predictive accuracy, large racial biases arise. We suggest that the choice of convenient, seemingly effective proxies for ground truth can be an important source of algorithmic bias in many contexts.
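A minimal sketch of the kind of check described (hypothetical column names): within each risk-score bin, compare a direct health measure across groups; if one group is sicker at the same score, the score is mis-calibrated for the need it is being used to target.

```python
import pandas as pd

def health_by_score_bin(df, n_bins=10):
    """Hypothetical columns: risk_score (algorithm output), race, and
    chronic_conditions (a direct illness measure). If the score were a good
    proxy for need, equal scores should mean roughly equal health."""
    df = df.assign(score_bin=pd.qcut(df["risk_score"], n_bins, labels=False))
    return (df.groupby(["score_bin", "race"])["chronic_conditions"]
              .mean()
              .unstack("race"))
```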
Most empirical policy work focuses on causal inference. We argue an important class of policy problems does not require causal inference but instead requires predictive inference. Solving these "prediction policy problems" requires more than simple regression techniques, since these are tuned to generating unbiased estimates of coefficients rather than minimizing prediction error. We argue that new developments in the field of "machine learning" are particularly useful for addressing these prediction problems. We use an example from health policy to illustrate the large potential social welfare gains from improved prediction.
The law forbids discrimination. But the ambiguity of human decision-making often makes it extraordinarily hard for the legal system to know whether anyone has actually discriminated. To understand how algorithms affect discrimination, we must therefore also understand how they affect the problem of detecting discrimination. By one measure, algorithms are fundamentally opaque, not just cognitively but even mathematically. Yet for the task of proving discrimination, processes involving algorithms can provide crucial forms of transparency that are otherwise unavailable. These benefits do not happen automatically. But with appropriate requirements in place, the use of algorithms will make it possible to more easily examine and interrogate the entire decision process, thereby making it far easier to know whether discrimination has occurred. By forcing a new level of specificity, the use of algorithms also highlights, and makes transparent, central tradeoffs among competing values. Algorithms are not only a threat to be regulated; with the right safeguards in place, they have the potential to be a positive force for equity.
How effective are physicians at diagnosing heart attacks? To answer this question, we contrast physician testing decisions with a machine learning model of risk. When the two deviate, we use actual health outcome data to judge whether the algorithm or the physician was right. We find physicians over-test: tests that are predictably useless are still performed. At the same time, physicians also under-test: many predicted high-risk patients are untested and then suffer adverse health events (including death) at high rates. A natural experiment using shift-to-shift testing variation confirms these findings: increasing testing improves health and reduces mortality, but only for patients flagged as high-risk by the algorithm. The simultaneous existence of over- and under-testing cannot easily be explained by incentives alone, and instead suggests errors. We provide suggestive evidence on the psychology behind these errors: (i) physicians use too simple a model of risk, suggesting bounded rationality; (ii) they over-weight salient information; and (iii) they over-weight symptoms that are representative or stereotypical of heart attack. Together, these results suggest the need for health care models and policies to incorporate not just physician incentives, but also physician mistakes.
Pope, Devin: Steven G. Rothmeier Professor of Behavioral Science and Economics, and Robert King Steel Faculty Fellow
The stalling of COVID-19 vaccination rates threatens public health. To increase vaccination rates, governments across the world are considering the use of monetary incentives. Here we present evidence about the effect of guaranteed payments on COVID-19 vaccination uptake. We ran a large preregistered randomized controlled trial (with 8286 participants) in Sweden and linked the data to population-wide administrative vaccination records. We found that modest monetary payments of 24 US dollars (200 Swedish kronor) increased vaccination rates by 4.2 percentage points (P = 0.005), from a baseline rate of 71.6%. By contrast, behavioral nudges increased stated intentions to become vaccinated but had only small and not statistically significant impacts on vaccination rates. The results highlight the potential of modest monetary incentives to raise vaccination rates.
Equal access to voting is a core feature of democratic government. Using data from millions of smartphone users, we quantify a racial disparity in voting wait times across a nationwide sample of polling places during the 2016 U.S. presidential election. Relative to entirely-white neighborhoods, residents of entirely-black neighborhoods waited 29% longer to vote and were 74% more likely to spend more than 30 minutes at their polling place. This disparity holds when comparing predominantly white and black polling places within the same states and counties, and survives numerous robustness and placebo tests. We shed light on the mechanism for these results and discuss how geospatial data can be an effective tool to both measure and monitor these disparities going forward.
Rockova, Veronika: Professor of Econometrics and Statistics, and James S. Kemper Foundation Faculty Scholar
In the absence of explicit or tractable likelihoods, Bayesians often resort to approximate Bayesian computation (ABC) for inference. Our work bridges ABC with deep neural implicit samplers based on generative adversarial networks (GANs) and adversarial variational Bayes. Both ABC and GANs compare aspects of observed and fake data to simulate from posteriors and likelihoods, respectively. We develop a Bayesian GAN (B-GAN) sampler that directly targets the posterior by solving an adversarial optimization problem. B-GAN is driven by a deterministic mapping learned on the ABC reference by conditional GANs. Once the mapping has been trained, iid posterior samples are obtained by filtering noise at a negligible additional cost. We propose two post-processing local refinements using (1) data-driven proposals with importance reweighting, and (2) variational Bayes. We support our findings with frequentist-Bayesian results, showing that the typical total variation distance between the true and approximate posteriors converges to zero for certain neural network generators and discriminators. Our findings on simulated data show highly competitive performance relative to some of the most recent likelihood-free posterior simulators.
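A compressed sketch of the pipeline as we read it (architectures, dimensions, and training settings are placeholders): build an ABC reference table by simulating (θ, x) pairs from the prior and the simulator, train a conditional generator adversarially against a discriminator of (θ, x) pairs, then draw approximately iid posterior samples by pushing fresh noise through the trained mapping at the observed data.

```python
import torch
import torch.nn as nn

def simulate_reference(prior_sample, simulator, n):
    # ABC reference table: (theta, x) pairs from the prior and the simulator.
    thetas = torch.stack([prior_sample() for _ in range(n)])
    xs = torch.stack([simulator(t) for t in thetas])
    return thetas, xs

class CondNet(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, d_out))
    def forward(self, *args):
        return self.net(torch.cat(args, dim=-1))

def train_bgan(thetas, xs, d_z=8, steps=2000):
    d_theta, d_x = thetas.shape[1], xs.shape[1]
    G = CondNet(d_z + d_x, d_theta)            # generator: (noise, data) -> theta
    D = CondNet(d_theta + d_x, 1)              # discriminator: (theta, data) -> logit
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()
    ones, zeros = torch.ones(len(xs), 1), torch.zeros(len(xs), 1)
    for _ in range(steps):
        fake = G(torch.randn(len(xs), d_z), xs)
        # Discriminator step: real (theta, x) pairs vs generated pairs.
        d_loss = bce(D(thetas, xs), ones) + bce(D(fake.detach(), xs), zeros)
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
        # Generator step: fool the discriminator.
        g_loss = bce(D(G(torch.randn(len(xs), d_z), xs), xs), ones)
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
    return G

def posterior_samples(G, x_obs, n, d_z=8):
    # Once trained, iid draws are cheap: push fresh noise through the mapping.
    with torch.no_grad():
        return G(torch.randn(n, d_z), x_obs.reshape(1, -1).expand(n, -1))
```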
For a Bayesian, the task of defining the likelihood can be as perplexing as the task of defining the prior. We focus on situations when the parameter of interest has been emancipated from the likelihood and is linked to data directly through a loss function. We survey existing work on both Bayesian parametric inference with Gibbs posteriors as well as Bayesian non-parametric inference. We then highlight recent bootstrap computational approaches to approximating loss-driven posteriors. In particular, we focus on implicit bootstrap distributions defined through an underlying push-forward mapping. We investigate iid samplers from approximate posteriors that pass random bootstrap weights through a trained generative network. After training the deep-learning mapping, the simulation cost of such iid samplers is negligible. We compare the performance of these deep bootstrap samplers with exact bootstrap as well as MCMC on several examples (including support vector machines and quantile regression). We also provide theoretical insights into bootstrap posteriors by drawing upon connections to model mis-specification.
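For orientation, a minimal sketch of the exact weighted bootstrap that the deep samplers are benchmarked against (our construction, with quantile regression as the example loss); the samplers described above amortize this loop by training a generative network that maps bootstrap weights directly to parameter draws.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def weighted_bootstrap_quantreg(X, y, tau=0.5, n_draws=500, seed=0):
    """Exact weighted bootstrap for a quantile-regression loss: re-solve the
    weighted problem for every draw of random (exponential) bootstrap weights."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        w = rng.exponential(size=len(y))              # random bootstrap weights
        fit = QuantileRegressor(quantile=tau, alpha=0.0).fit(X, y, sample_weight=w)
        draws.append(np.concatenate(([fit.intercept_], fit.coef_)))
    return np.array(draws)        # draws from the loss-driven (bootstrap) posterior
```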
Tan, Chenhao: Assistant Professor of Computer Science, UChicago
Changing someone's opinion is arguably one of the most important challenges of social interaction. The underlying process proves difficult to study: it is hard to know how someone's opinions are formed and whether and how someone's views shift. Fortunately, ChangeMyView, an active community on Reddit, provides a platform where users present their own opinions and reasoning, invite others to contest them, and acknowledge when the ensuing discussions change their original views. In this work, we study these interactions to understand the mechanisms behind persuasion.
We find that persuasive arguments are characterized by interesting patterns of interaction dynamics, such as participant entry-order and degree of back-and-forth exchange. Furthermore, by comparing similar counterarguments to the same opinion, we show that language factors play an essential role. In particular, the interplay between the language of the opinion holder and that of the counterargument provides highly predictive cues of persuasiveness. Finally, since even in this favorable setting people may not be persuaded, we investigate the problem of determining whether someone's opinion is susceptible to being changed at all. For this more difficult task, we show that stylistic choices in how the opinion is expressed carry predictive power.
Consider a person trying to spread an important message on a social network. He/she can spend hours trying to craft the message. Does it actually matter? While there has been extensive prior work looking into predicting popularity of social-media content, the effect of wording per se has rarely been studied since it is often confounded with the popularity of the author and the topic. To control for these confounding factors, we take advantage of the surprising fact that there are many pairs of tweets containing the same URL and written by the same user but employing different wording. Given such pairs, we ask: which version attracts more retweets? This turns out to be a more difficult task than predicting popular topics. Still, humans can answer this question better than chance (but far from perfectly), and the computational methods we develop can do better than both an average human and a strong competing method trained on non-controlled data.
Todorov, Alex: Leon Carroll Marshall Professor of Behavioral Science and Richard Rosett Faculty Fellow
Research in person and face perception has broadly focused on group-level consensus that individuals hold when making judgments of others (e.g., “X type of face looks trustworthy”). However, a growing body of research demonstrates that individual variation is larger than shared, stimulus-level variation for many social trait judgments. Despite this insight, little research to date has focused on building and explaining individual models of face perception. Studies and methodologies that have examined individual models have been limited in the visualizations they can reliably produce, yielding either noisy and blurry images or computer-avatar representations. Methods that produce such low-fidelity visual representations inhibit generalizability because the stimuli are clearly computer-generated and manipulated. In the present work, we introduce a novel paradigm to visualize individual models of face judgments by leveraging state-of-the-art computer vision methods. Our proposed method can produce a set of photorealistic face images that correspond to an individual's mental representation of a specific attribute across a variety of attribute intensities. We provide a proof-of-concept study that examines perceived trustworthiness/untrustworthiness and masculinity/femininity. We close with a discussion of future work to substantiate our proposed method.
We quickly and irresistibly form impressions of what other people are like based solely on how their faces look. These impressions have real-life consequences ranging from hiring decisions to sentencing decisions. We model and visualize the perceptual bases of facial impressions in the most comprehensive fashion to date, producing photorealistic models of 34 perceived social and physical attributes (e.g., trustworthiness and age). These models leverage and demonstrate the utility of deep learning in face evaluation, allowing for 1) generation of an infinite number of faces that vary along these perceived attribute dimensions, 2) manipulation of any face photograph along these dimensions, and 3) prediction of the impressions any face image may evoke in the general (mostly White, North American) population.
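A heavily simplified sketch of the general recipe such models follow (our reading; the function names are placeholders and the face generator itself is assumed, not shown): learn a linear map from a generator's latent codes to human attribute ratings, then manipulate a face by moving its latent code along the learned direction before decoding it.

```python
import numpy as np
from sklearn.linear_model import Ridge

def attribute_direction(latents, ratings):
    """latents: (n, d) generator latent codes; ratings: (n,) human judgments of
    one attribute (e.g., trustworthiness) for the corresponding faces."""
    model = Ridge(alpha=1.0).fit(latents, ratings)
    return model.coef_ / np.linalg.norm(model.coef_)   # unit direction in latent space

def manipulate(latent, direction, strength):
    # Shift a face's latent code along the attribute direction; decoding the
    # shifted code with the face generator yields the manipulated image.
    return latent + strength * direction
```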
Xiu, Dacheng: Professor of Econometrics and Statistics
We conduct inference on volatility with noisy high-frequency data. We assume the observed transaction price follows a continuous-time Itô-semimartingale, contaminated by a discrete-time moving-average noise process associated with the arrival of trades. We estimate volatility, defined as the quadratic variation of the semimartingale, by maximizing the likelihood of a misspecified moving-average model, with its order selected based on an information criterion. Our inference is uniformly valid over a large class of noise processes whose magnitude and dependence structure vary with sample size. We show that the convergence rate of our estimator dominates n^{1/4} as noise vanishes, and is determined by the selected order of noise dependence when noise is sufficiently small. Our implementation guarantees positive estimates in finite samples.
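In rough notation of our own (for orientation only): the observed price is the efficient semimartingale price plus moving-average microstructure noise, and the estimand is the integrated variance.

```latex
% Illustrative notation only
Y_{i\Delta_n} \;=\; X_{i\Delta_n} + U_i,
\qquad
dX_t \;=\; b_t\, dt + \sigma_t\, dW_t,
\qquad
U_i \;=\; \sum_{j=0}^{q} \theta_j\, \varepsilon_{i-j},
\qquad
\text{estimand: } \int_0^T \sigma_t^2\, dt.
```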
Standard estimators of risk premia in linear asset pricing models are biased if some priced factors are omitted. We propose a three-pass method to estimate the risk premium of an observable factor, which is valid even when not all factors in the model are specified or observed. The risk premium of the observable factor can be identified regardless of the rotation of the other control factors if together they span the true factor space. Our approach uses principal components of test asset returns to recover the factor space and additional regressions to obtain the risk premium of the observed factor.
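A minimal numpy sketch of the three passes as described (our implementation choices on demeaning and normalization are assumptions): extract latent factors from test asset returns by principal components, estimate their risk premia cross-sectionally, then recover the observed factor's premium from its time-series exposures to the latent factors.

```python
import numpy as np

def three_pass_risk_premium(R, g, p):
    """R: (T, N) excess returns on test assets; g: (T,) observable factor;
    p: number of latent factors to extract."""
    Rd = R - R.mean(axis=0)

    # Pass 1: PCA on returns -> latent factor series V (T, p) and loadings B (N, p).
    U, S, Vt = np.linalg.svd(Rd, full_matrices=False)
    V = U[:, :p] * S[:p]
    B = Vt[:p].T

    # Pass 2: cross-sectional regression of average returns on loadings
    # gives the latent factors' risk premia gamma (p,).
    gamma, *_ = np.linalg.lstsq(B, R.mean(axis=0), rcond=None)

    # Pass 3: time-series regression of the observed factor on the latent
    # factors gives its exposures eta (p,); its risk premium is eta @ gamma.
    eta, *_ = np.linalg.lstsq(V - V.mean(axis=0), g - g.mean(), rcond=None)
    return eta @ gamma
```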