The Devil’s in the Data

When all that number crunching is not enough.

July 2022



Illustration: Gary Taxali

Last September, a computer-generated Facebook prompt incited public outrage. At the end of a video showing police officers arresting a Black man, an automated suggestion read: “Keep seeing videos about Primates?” 

Facebook apologized for the “unacceptable error,” which was, in fact, more than that: It pointed to a significant challenge for the field of machine learning. The word “primates” was generated by a machine learning model, a tool built using an algorithm that crunches existing data sets (in this case, videos) and then uses that knowledge to classify new information. Such models are limited by the data they’re trained on. In this case, researchers presume, the training data did not include a sufficient number of Black men. Google and Amazon have faced similar blunders. “This is a problem,” says John Duchi, ’06, MS ’07, an associate professor of statistics and of electrical engineering, “and I think people are recognizing exactly how important it is to get good data.”

Machine learning models can automate elements of criminal justice risk assessments, medical diagnoses or wildlife identification. The large data sets they’re trained on are supposed to represent the world at large. “You are picking out some hopefully random—literally, uniformly random—sample of the entire population,” Duchi says. 

But researchers are seeing that randomly selected data never fully reflects real-world populations or settings. After all, no training data set is actually infinite, and there are too many variations in real-world data—even with a set of, say, 14 million training images—to capture all the different possible ways inputs could appear. “These companies get a lot of flack” for relying on limited data, Duchi says, “and they do deserve some flack. But at the same time, we—the statisticians, the machine learning researchers—we haven’t really actually developed the tools to prevent this.”
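The gap Duchi describes can be made concrete with a little arithmetic. In this hypothetical sketch (the rarity rate and sample size are made up for illustration), a variation that appears in 0.1 percent of real-world images is often missing entirely from a uniformly random training sample:

```python
import random

# Hypothetical illustration: even a uniformly random sample can contain
# zero examples of a rare real-world variation.
random.seed(0)
RARE_RATE = 0.001   # assume 0.1% of images show the rare variation
SAMPLE_SIZE = 1_000 # size of each random training sample
TRIALS = 2_000      # number of samples to simulate

# Count how often a random sample misses the rare variation entirely.
misses = 0
for _ in range(TRIALS):
    rare_count = sum(random.random() < RARE_RATE for _ in range(SAMPLE_SIZE))
    if rare_count == 0:
        misses += 1

print(f"Samples with no rare examples: {misses / TRIALS:.0%}")
# Theory predicts 0.999**1000, i.e., roughly 37% of samples miss it.
```

The point is not the specific numbers but the shape of the problem: no finite sample, however fairly drawn, is guaranteed to contain every variation a deployed model will encounter.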

Curating better random data sets is now one of the most active areas of machine learning research, Duchi says, but the work is in its early stages. Meanwhile, machine learning models—warts and all—are already heavily relied upon worldwide. 

In 2018, health care software company Epic debuted an early-detection model to help clinicians identify cases of sepsis, which, according to the Centers for Disease Control and Prevention, affects 1.7 million U.S. adults annually. Nearly 270,000 die as a result. The model—trained on data from 405,000 patient encounters at three health care organizations—was rolled out in hundreds of hospitals across the country. Once deployed, it failed to identify 67 percent of sepsis cases while also flagging patients who did not have sepsis.

“Different hospitals have different equipment, different scanners,” says Pang Wei Koh, ’13, MS ’13, who expects to complete his doctorate in computer science in September. “The doctors may use slightly different protocols. And so the data points from each hospital follow a different distribution.”
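Koh’s point about differing distributions can be sketched with a toy example (all numbers are hypothetical): a simple decision rule tuned to one hospital’s measurements misfires at a second hospital whose equipment reads systematically higher.

```python
import random
import statistics

# Toy sketch of distribution shift between hospitals (hypothetical numbers):
# the same vital sign, measured on differently calibrated equipment,
# follows a different distribution at each site.
random.seed(0)

def readings(mean, sd, n=1000):
    return [random.gauss(mean, sd) for _ in range(n)]

# Hospital A: healthy patients cluster near 100, septic patients near 130.
healthy_a = readings(100, 5)
septic_a = readings(130, 5)

# A naive rule "trained" on Hospital A: flag anything above the midpoint.
threshold = (statistics.mean(healthy_a) + statistics.mean(septic_a)) / 2

# Hospital B's equipment reads about 20 units higher across the board.
healthy_b = readings(120, 5)

false_alarms = sum(x > threshold for x in healthy_b) / len(healthy_b)
print(f"False-alarm rate on Hospital B's healthy patients: {false_alarms:.0%}")
# The threshold learned at Hospital A misfires on Hospital B's shifted data.
```

A real sepsis model is far more complex than a single threshold, but the failure mode is the same: patterns learned from one site’s data distribution do not automatically transfer to another’s.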

Simply gathering better data to improve outcomes is not an easy task. “There are few convenient data sets for researchers to use,” says Koh. “Machine learning is all about learning stuff from the data, and it’s very expensive to go and collect data.” Once you have the data, labeling it is labor- and time-intensive.

To begin to address such shortcomings, Koh and fellow Stanford doctoral student Shiori Sagawa created WILDS, a collection of 10 data sets intentionally curated to be more representative of the real world. The researchers considered how machine learning models might be deployed and sought out additional data with greater variation—time of day, type of camera, location and so forth. Koh hopes the move will urge more machine learning researchers to consider the range of data they use when developing and evaluating models.
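One practice WILDS encourages is reporting performance per deployment domain rather than as a single pooled number, so that weak domains aren’t hidden in an average. A minimal sketch with made-up predictions (the camera names and labels are hypothetical):

```python
from collections import defaultdict

# Hypothetical sketch of per-domain evaluation: report accuracy separately
# for each deployment domain (here, wildlife cameras) instead of pooling,
# so a weak domain isn't masked by a strong one.
predictions = [
    # (domain, predicted label, true label) -- made-up examples
    ("camera_1", "deer", "deer"),
    ("camera_1", "deer", "deer"),
    ("camera_1", "empty", "empty"),
    ("camera_2", "deer", "coyote"),
    ("camera_2", "empty", "coyote"),
    ("camera_2", "coyote", "coyote"),
]

hits = defaultdict(int)
totals = defaultdict(int)
for domain, pred, true in predictions:
    totals[domain] += 1
    hits[domain] += pred == true

for domain in sorted(totals):
    print(f"{domain}: {hits[domain] / totals[domain]:.0%} accuracy")
# Pooled accuracy is 67%, but camera_2 -- a new location -- sits at 33%.
```

Broken out this way, the model that looks fine on average is revealed to fail badly at the camera location it hasn’t seen before.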

“Before Google and Facebook had these episodes, I don’t think people were thinking about it particularly carefully,” says Duchi. “It’s a major challenge of our time—to make sure that we have the right representation of the people we’re going to affect with these models.”

Kali Shiloh is a staff writer at Stanford. Email her at kshiloh@stanford.edu.
