A Better Way for Finance (and Others) to Handle Missing Data

There are data about practically everything these days, and they can be used to try to answer any number of questions. Do clinical trials really show a drug works? Can surveys really signal who’s going to win the next election? Can a financial manager really predict a winning portfolio?

As powerful as data are, adjustments made for missing information—the people who drop out of drug trials, the questions people don’t answer in polling, incomplete corporate financial reports—may dramatically skew the results of predictive models, according to University of Bonn’s Joachim Freyberger and Björn Höppner (a PhD student), Washington University in St. Louis’s Andreas Neuhierl, and Chicago Booth’s Michael Weber.

They propose an improved method for handling missing data and tested it against two popular existing approaches in a practical application: predicting stock returns. The results indicate that their method provides a consistent edge.

To compare the three methods, the researchers obtained a database of US stock and balance-sheet data from 1978 to 2021. The data set started out with 2.4 million observations, or rows, each with 82 variables covering trading volume, accounting information, momentum indicators, and the like. As is the case with many data sets, it wasn’t complete: some rows didn’t have values for all 82 variables.

The first of the two widely used methods, the “complete cases” approach, drops all incomplete observations, although this violates a cardinal rule of data analysis: “Thou shalt not throw data away.” This approach required that the researchers exclude rows of data where any information was missing—for example, if a stock was missing trading volume for one month, the complete cases method required dropping all the data collected for that stock that month. After the researchers did this, just 10 percent of the data remained. Most of the dropped instances were missing values for five variables or fewer.
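To make the cost of the complete cases approach concrete, here is a minimal sketch in Python. The four-row panel and its field names (`volume`, `momentum`, `book_to_market`) are invented for illustration and are not taken from the researchers’ data set.

```python
# Toy panel: each row is one stock-month; None marks a missing value.
# All names and numbers here are hypothetical.
rows = [
    {"stock": "A", "volume": 1.2, "momentum": 0.3, "book_to_market": 0.8},
    {"stock": "B", "volume": None, "momentum": 0.1, "book_to_market": 0.5},
    {"stock": "C", "volume": 0.7, "momentum": None, "book_to_market": 1.1},
    {"stock": "D", "volume": 0.9, "momentum": -0.2, "book_to_market": None},
]


def complete_cases(rows):
    """Keep only the rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


kept = complete_cases(rows)
share_kept = len(kept) / len(rows)
print(f"{len(kept)} of {len(rows)} rows survive ({share_kept:.0%})")
```

Each dropped row here is missing only one value, yet three-quarters of the toy panel disappears, which mirrors the pattern the researchers describe at much larger scale.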

The other well-known method, “mean imputation,” keeps all of the observations but creates biases. It replaces missing values with an average of all the data set’s existing data points for a given variable and month. But the missing data might include extreme values that could make a significant difference in prediction models. For example, say there’s a housing database, but most of the high-end houses in it are sold by a realtor who never lists the square footage. If analysts replaced the missing data with the average square footage of all housing, they would most likely undershoot and skew their model’s predictions of market values.
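The housing example can be sketched in a few lines of Python. The listings, prices, and “true” square footages below are made up for illustration, and the sketch averages over the whole data set rather than within each month, as the stock application would.

```python
from statistics import mean

# Hypothetical listings: square footage is missing for the high-end homes.
listings = [
    {"price": 200_000, "sqft": 1_000},
    {"price": 250_000, "sqft": 1_400},
    {"price": 300_000, "sqft": 1_800},
    {"price": 900_000, "sqft": None},    # assumed true value: 4,500
    {"price": 1_100_000, "sqft": None},  # assumed true value: 5,200
]


def mean_impute(rows, field):
    """Return copies of the rows with gaps in `field` set to the observed mean."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]


imputed = mean_impute(listings, "sqft")
for row in imputed:
    print(row)
```

Both expensive homes are assigned 1,400 square feet, the average of the three modest ones, so any model fit to the imputed data would understate how much size drives price.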

Filling a data gap

The researchers’ method of accounting for missing values outperformed the widely used “complete cases” and “mean imputation” approaches in predicting stock returns.

Freyberger, Höppner, Neuhierl, and Weber’s method fills in the blanks by first grouping observations with similar patterns of missing data and then using the observations with complete data to estimate the missing values. The cases with complete data and those with estimated data are then recombined into one data set, which serves as the input to a regression model.
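The general idea of grouping by missing pattern and borrowing from complete observations can be sketched as follows. This is not the researchers’ estimator: it swaps in a simple one-predictor least-squares fit as a stand-in, and the two variables `x` and `y` and all values are hypothetical.

```python
from collections import defaultdict

# Toy data: two characteristics per observation; None marks a gap.
rows = [
    {"x": 1.0, "y": 2.0},
    {"x": 2.0, "y": 4.0},
    {"x": 3.0, "y": 6.0},
    {"x": 4.0, "y": None},
    {"x": None, "y": 10.0},
]


def missing_pattern(row):
    """Tuple of the field names that are missing in this row."""
    return tuple(sorted(k for k, v in row.items() if v is None))


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


# Step 1: group observations by which fields they lack.
groups = defaultdict(list)
for row in rows:
    groups[missing_pattern(row)].append(row)
complete = groups[()]

# Step 2: for each incomplete pattern, fit on the complete rows and fill the gap.
filled = [dict(r) for r in complete]
for pattern, members in groups.items():
    if len(pattern) != 1:
        continue  # complete rows, or patterns this toy sketch does not handle
    target = pattern[0]
    predictor = "x" if target == "y" else "y"
    a, b = fit_line([r[predictor] for r in complete], [r[target] for r in complete])
    for r in members:
        filled.append(dict(r, **{target: a + b * r[predictor]}))

# Step 3: `filled` now holds the complete and estimated rows in one data set.
for row in filled:
    print(row)
```

Unlike mean imputation, the filled-in value depends on what is observed for that row, so an observation with an unusually large `x` receives an unusually large estimate of `y`.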

In simulations that used the researchers’ method for handling missing data, a portfolio that was long the 100 stocks with the highest predicted return (according to a linear model) and short the 100 stocks with the lowest predicted return earned about 52 percent. This handily beat the 11 percent and 49 percent returns achieved by the portfolios using the complete cases and mean imputation methods, respectively. Portfolios using the researchers’ approach also outperformed those using the other two methods in terms of the return received for the amount of risk taken. The Sharpe ratio (measuring risk-adjusted returns) was 1.79, compared with 1.19 and 1.66 for the other two methods.
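For readers unfamiliar with the mechanics, a long-short return and a Sharpe ratio can be computed roughly as below. Equal weighting, monthly periods, a zero risk-free rate, and every number in the example are assumptions made for illustration; the article does not specify how the researchers constructed their portfolios.

```python
from statistics import mean, stdev


def long_short_return(predicted, realized, k):
    """Equal-weighted return from buying the k stocks with the highest
    predicted return and shorting the k stocks with the lowest."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    short_leg, long_leg = order[:k], order[-k:]
    long_avg = mean(realized[i] for i in long_leg)
    short_avg = mean(realized[i] for i in short_leg)
    return long_avg - short_avg


def sharpe_ratio(period_returns, periods_per_year=12):
    """Annualized mean return per unit of volatility (risk-free rate assumed zero)."""
    return mean(period_returns) / stdev(period_returns) * periods_per_year ** 0.5


# Hypothetical one-month cross-section of six stocks.
predicted = [0.03, -0.02, 0.01, 0.05, -0.04, 0.00]
realized = [0.02, -0.01, 0.00, 0.04, -0.03, 0.01]
one_month = long_short_return(predicted, realized, k=2)
print(f"long-short return for the month: {one_month:.2%}")
```

Repeating the first function month by month produces a series of portfolio returns, and passing that series to `sharpe_ratio` gives the kind of risk-adjusted figure quoted above.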

When a nonlinear model was used to predict returns, the gap widened in favor of the researchers’ method, which delivered a 92 percent return versus 11 percent and 86 percent for the popular methods. The Sharpe ratio, meanwhile, rose to 2.82, versus 1.29 and 2.44 for the prevailing strategies.

Weber notes that by identifying which of the hundreds of potential return predictors provide sound information, the improved method for managing missing values allows investors to build well-balanced portfolios with high risk-adjusted returns.

Works Cited

Joachim Freyberger, Björn Höppner, Andreas Neuhierl, and Michael Weber, “Missing Data in Asset Pricing Panels,” NBER working paper, December 2022.
