There are data about practically everything these days, and they can be used to try to answer any number of questions. Do clinical trials really show a drug works? Can surveys really signal who’s going to win the next election? Can a financial manager really predict a winning portfolio?

As powerful as data are, adjustments made for missing information—the people who drop out of drug trials, the questions people don’t answer in polling, incomplete corporate financial reports—may dramatically skew the results of predictive models, according to University of Bonn’s Joachim Freyberger and Björn Höppner (a PhD student), Washington University in St. Louis’s Andreas Neuhierl, and Chicago Booth’s Michael Weber.

They propose an improved method for handling missing data and tested it against two popular existing approaches in a practical application: predicting stock returns. The results indicate that their method provides a consistent edge.

To compare the three methods, the researchers obtained a database of US stock and balance-sheet data from 1978 to 2021. The data set started out with 2.4 million observations, or rows, each with 82 variables covering trading volume, accounting information, momentum indicators, and the like. As is the case with many data sets, it wasn’t complete: some rows didn’t have values for all 82 variables.

The first of the two widely used methods, the “complete cases” approach, drops all incomplete observations, although this violates a cardinal rule of data analysis: “Thou shalt not throw data away.” This approach required that the researchers exclude rows of data where any information was missing—for example, if a stock was missing trading volume for one month, the complete cases method required dropping all the data collected for that stock that month. After the researchers did this, just 10 percent of the data remained. Most of the dropped instances were missing values for five variables or fewer.
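To make the cost of the complete cases approach concrete, here is a minimal sketch in plain Python. The rows, field names, and values are invented for illustration and are not taken from the researchers' data set; `None` stands in for a missing value.

```python
# Hypothetical toy data: one row per stock-month, None marks a missing value.
rows = [
    {"stock": "A", "month": "2020-01", "volume": 1.2, "book_to_market": 0.8},
    {"stock": "B", "month": "2020-01", "volume": None, "book_to_market": 0.5},
    {"stock": "C", "month": "2020-01", "volume": 0.7, "book_to_market": None},
    {"stock": "D", "month": "2020-01", "volume": 2.1, "book_to_market": 1.1},
]
FEATURES = ["volume", "book_to_market"]


def complete_cases(rows, features):
    """Keep only the rows in which every listed feature is present."""
    return [r for r in rows if all(r[f] is not None for f in features)]


kept = complete_cases(rows, FEATURES)
share_kept = len(kept) / len(rows)  # fraction of observations that survive
```

Even in this tiny example, a single missing value per row is enough to discard half the observations, which mirrors how rows missing only a few of 82 variables were lost in the study.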

The other well-known method, “mean imputation,” keeps all of the observations but creates biases. It replaces missing values with an average of all the data set’s existing data points for a given variable and month. But the missing data might include extreme values that could make a significant difference in prediction models. For example, say there’s a housing database, but most of the high-end houses in it are sold by a realtor who never lists the square footage. If analysts replaced the missing data with the average square footage of all housing, they would most likely undershoot and skew their model’s predictions of market values.
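The housing example can be sketched in a few lines of Python. Everything here is hypothetical: the `mean_impute` helper, the field names, and the numbers, including the luxury home's true size of 5,000 square feet, which the analyst would not actually observe.

```python
from collections import defaultdict
from statistics import mean


def mean_impute(rows, features, by="month"):
    """Replace None with the mean of the observed values for the same
    feature within the same group (here, the same month)."""
    observed = defaultdict(list)
    for r in rows:
        for f in features:
            if r[f] is not None:
                observed[(r[by], f)].append(r[f])
    filled = []
    for r in rows:
        new = dict(r)  # copy so the input rows are left untouched
        for f in features:
            if new[f] is None and observed[(r[by], f)]:
                new[f] = mean(observed[(r[by], f)])
        filled.append(new)
    return filled


# Hypothetical listings; the luxury home's square footage was never recorded.
houses = [
    {"id": "h1", "month": "2020-01", "sqft": 1000},
    {"id": "h2", "month": "2020-01", "sqft": 1200},
    {"id": "h3", "month": "2020-01", "sqft": 1400},
    {"id": "luxury", "month": "2020-01", "sqft": None},
]
TRUE_LUXURY_SQFT = 5000  # assumed for illustration only

filled = mean_impute(houses, ["sqft"])
imputed = filled[3]["sqft"]
shortfall = TRUE_LUXURY_SQFT - imputed  # how far the imputed value undershoots
```

The imputed value is simply the average of the three ordinary homes, so any model trained on the filled-in data would treat the luxury home as far smaller than it is.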

Filling a data gap

The researchers’ method of accounting for missing values outperformed the widely used “complete cases” and “mean imputation” approaches in predicting stock returns.

Freyberger, Höppner, Neuhierl, and Weber’s method fills in the blanks by first grouping observations with similar patterns of missing data and then taking the ones with complete data to estimate the missing values. The cases with complete data and those with estimated data are recombined into one data set and employed by a regression model.
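The article does not spell out the estimator the researchers use, so the following is only a rough sketch of the general idea: group rows by which values are missing, then borrow information from complete rows to fill the gaps. A simple nearest-neighbor average is swapped in as the estimation step, and the function names, toy data, and parameter `k` are all assumptions made for illustration.

```python
from collections import defaultdict


def missing_pattern(row, features):
    """Tuple of booleans: True where the feature is missing."""
    return tuple(row[f] is None for f in features)


def impute_by_pattern(rows, features, k=2):
    """Group rows by missing-data pattern, then fill each gap with the
    average over the k complete rows closest on the observed features.
    Distances are unscaled; real data would usually be standardized first."""
    complete = [r for r in rows if not any(missing_pattern(r, features))]
    groups = defaultdict(list)
    for i, r in enumerate(rows):
        groups[missing_pattern(r, features)].append(i)

    result = [dict(r) for r in rows]  # copies, in the original order
    for pattern, indices in groups.items():
        observed = [f for f, miss in zip(features, pattern) if not miss]
        missing = [f for f, miss in zip(features, pattern) if miss]
        if not missing or not complete:
            continue  # nothing to fill, or nothing to learn from
        for i in indices:
            r = rows[i]
            nearest = sorted(
                complete,
                key=lambda c: sum((c[f] - r[f]) ** 2 for f in observed),
            )[:k]
            for f in missing:
                result[i][f] = sum(c[f] for c in nearest) / len(nearest)
    return result


# Hypothetical data with two different missing-data patterns.
data = [
    {"id": "A", "x": 1.0, "y": 10.0, "z": 100.0},
    {"id": "B", "x": 2.0, "y": 20.0, "z": 200.0},
    {"id": "C", "x": 3.0, "y": 30.0, "z": 300.0},
    {"id": "D", "x": 2.1, "y": None, "z": 210.0},
    {"id": "E", "x": 1.1, "y": 11.0, "z": None},
]
filled = impute_by_pattern(data, ["x", "y", "z"], k=1)
```

Unlike mean imputation, the filled-in value here depends on what is known about each row, so a row that resembles large-value observations receives a large estimate. The recombined list of complete and estimated rows is what a downstream regression model would then consume.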

In simulations using the researchers’ method for handling missing data, a portfolio that went long the 100 stocks with the highest predicted return (according to a linear model) and short the 100 stocks with the lowest predicted return earned about 52 percent. This handily beat the 11 percent and 49 percent returns achieved by the portfolios using the complete cases and mean imputation methods, respectively. Portfolios using the researchers’ approach also outperformed those using the other two methods in terms of the return received for the amount of risk taken. The Sharpe ratio (measuring risk-adjusted returns) was 1.79, compared with 1.19 and 1.66, respectively, for the other two methods.
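For readers unfamiliar with the two measures, here is a small sketch of how a long-short return and a Sharpe ratio can be computed. Equal weighting within each leg, a zero risk-free rate, no annualization, and all of the numbers are assumptions made for illustration; the article does not describe the researchers' exact portfolio construction.

```python
from statistics import mean, stdev


def long_short_return(predicted, realized, n):
    """Equal-weighted return from going long the n stocks with the highest
    predicted return and short the n stocks with the lowest."""
    ranked = sorted(predicted, key=predicted.get)  # lowest prediction first
    shorts, longs = ranked[:n], ranked[-n:]
    return mean(realized[s] for s in longs) - mean(realized[s] for s in shorts)


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation
    (per period, not annualized)."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess)


# Hypothetical one-period predictions and outcomes for four stocks.
predicted = {"A": 0.03, "B": 0.01, "C": -0.02, "D": 0.00}
realized = {"A": 0.05, "B": 0.02, "C": -0.01, "D": 0.01}
spread = long_short_return(predicted, realized, n=1)  # long A, short C

# Hypothetical series of periodic portfolio returns.
ratio = sharpe_ratio([0.02, 0.04, 0.06])
```

A higher Sharpe ratio means more return per unit of volatility, which is why the article reports it alongside the raw portfolio returns.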

When a nonlinear model was employed to make return predictions, the outperformance increased for their method, with a 92 percent return versus 11 percent and 86 percent returns for the popular methods. The Sharpe ratio, meanwhile, rose to 2.82, versus 1.29 and 2.44 for the prevailing strategies.

Weber notes that by identifying which of the hundreds of potential return predictors provide sound information, the improved method for managing missing values allows investors to build well-balanced portfolios with high risk-adjusted returns.
