How a diverse dataset can help close the racial pain gap

Jun 3, 2021


Researchers trained an algorithm to predict knee pain better than the decades-old standard

By Melanie Ehrenkranz

Emma Pierson, a senior researcher at Microsoft Research New England, said that her medical collaborator on a recent study shared an accurate, though not very reassuring, fact about pain: “We don’t understand it very well.”

Pierson is a computer scientist developing machine learning solutions to inequality and healthcare. A research paper she published in January alongside other researchers explores pain disparities in underserved populations, specifically looking at osteoarthritis in the knee and how it disproportionately affects people of color. The researchers found that their algorithm detected signs of pain in X-rays that doctors and existing grading systems had missed.

The motivation behind the study was this mysterious pain gap

Pierson said that the basic idea was to train a machine learning algorithm to find any additional signals in the knee X-ray that aren’t being captured by regular risk scores and medical assessments, and to see whether this algorithmic approach could narrow the racial pain gap for knee osteoarthritis and, subsequently, for other medical problems.

The study points out that the correlation between radiographic measures (using X-rays) and pain is contested — people whose X-rays don’t show severe disease might experience severe pain, and vice versa. The current standard system for measuring osteoarthritis is the Kellgren-Lawrence grade (KLG), which was developed more than 50 years ago in white populations. “It’s plausible they are not capturing factors relevant to pain in more diverse populations living and working very differently,” Pierson said. “You take a score that’s 60 years old, yeah it might not capture the full story.”

And the KLG system is just one example of a diagnostic test that fails patients of color. For instance, there’s a kidney test that automatically adjusts scores for Black patients based on a discredited scientific theory on race and genetic differences. Because of this unfounded basis for adjusting diagnostic algorithms, nonwhite patients were more likely to miss out on vital treatments. This system is still prevalent.

Machine learning and healthcare go pretty far back

The concept of machine learning in the diagnostic process is hardly new: a study for “an algorithm to assist in the selection of the most probable diagnosis of a given patient” was published in the National Library of Medicine in 1986. But neglecting to reexamine decades-old medical systems, especially ones built on discredited, racially biased theories or ones that fail to include nonwhite communities at all, perpetuates healthcare inequality and harms vulnerable communities.

“On the one hand, having an algorithm is sort of like the illusion of objective in science,” Dr. Ezemenari M. Obasi, director of the HEALTH Research Institute at the University of Houston who studies health disparities, told Mashable, citing the importance of checks and balances when it comes to how a diagnostic algorithm might disproportionately harm or benefit certain groups. “Otherwise, you’re creating a scientific way of justifying the unequal distribution of resources.”

The outdated algorithm, absent any meaningful checks and balances, is just one possible reason the radiographic measure might fail to identify pain in people of color and yet continue to be widely used. Racial bias in how pain is assessed in patients of color is another possible factor. And so the researchers trained a convolutional neural network to predict the pain score in knees using a diverse dataset from an NIH-funded study.

The dataset had a sample of 4,172 patients in the United States who had, or were at high risk of developing, knee osteoarthritis. The algorithm’s predictions explained more of the variance in pain than KLG did, showing that the X-rays did contain signals for pain that the current system didn’t detect. The researchers attribute the success to the diverse dataset.
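
The comparison above comes down to how much of the variance in reported pain each score accounts for. As a rough illustration only, the sketch below computes that share (R squared) for two made-up scores on synthetic data. Nothing here comes from the study: `visible`, `hidden`, `coarse_grade`, and `learned_score` are hypothetical names, and the "learned score" is simply assumed to pick up a signal the coarse grade cannot see.

```python
import random
import statistics


def r_squared(predictor, outcome):
    """Fraction of variance in outcome explained by a straight-line fit on predictor."""
    mean_x = statistics.fmean(predictor)
    mean_y = statistics.fmean(outcome)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, outcome))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    syy = sum((y - mean_y) ** 2 for y in outcome)
    return (sxy * sxy) / (sxx * syy)


random.seed(0)  # synthetic data only; no values are taken from the study
n = 2000
visible = [random.gauss(0, 1) for _ in range(n)]  # damage a coarse grade can see
hidden = [random.gauss(0, 1) for _ in range(n)]   # signal the coarse grade misses
pain = [v + h + random.gauss(0, 1) for v, h in zip(visible, hidden)]

# Hypothetical 0-4 grade built from the visible signal alone.
coarse_grade = [min(4, max(0, round(v + 2))) for v in visible]
# Hypothetical learned score assumed to capture both signals.
learned_score = [v + h for v, h in zip(visible, hidden)]

r2_grade = r_squared(coarse_grade, pain)
r2_score = r_squared(learned_score, pain)
print(f"coarse grade R^2:  {r2_grade:.2f}")
print(f"learned score R^2: {r2_score:.2f}")
```

In this toy setup the learned score explains more of the variance by construction, which is the pattern the researchers report for their model versus KLG on real X-rays.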

They even retrained the neural network on a non-diverse training set. Both versions were better at detecting pain in X-rays than KLG, but the machine learning models trained on diverse datasets were better at predicting pain and narrowing the racial and socioeconomic pain gap.
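
One way to put a number on "narrowing the pain gap" is to compare the average difference in pain between two groups before and after accounting for a severity score. The sketch below does this on synthetic data and is an assumption about how such a gap could be measured, not the study's own code. The groups, the signals, and the helper names (`slope`, `residuals`, `adjusted_gap`) are all hypothetical.

```python
import random
import statistics


def slope(x, y):
    """Slope of a least-squares line predicting y from x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """What is left of y after removing its straight-line fit on x."""
    b = slope(x, y)
    a = statistics.fmean(y) - b * statistics.fmean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def adjusted_gap(group, pain, severity):
    """Group difference in pain after controlling for severity.

    Equals the group coefficient in a regression of pain on group and
    severity, computed by residualizing both on severity first.
    """
    return slope(residuals(severity, group), residuals(severity, pain))


random.seed(1)  # synthetic data only; no values are taken from the study
n = 2000
group = [i % 2 for i in range(n)]                 # two illustrative groups, 0 and 1
visible = [random.gauss(0, 1) for _ in range(n)]  # signal a coarse grade can see
hidden = [random.gauss(0, 1) + g for g in group]  # missed signal, higher in group 1
pain = [v + h + random.gauss(0, 1) for v, h in zip(visible, hidden)]

coarse_grade = [min(4, max(0, round(v + 2))) for v in visible]  # sees visible only
learned_score = [v + h for v, h in zip(visible, hidden)]        # assumed to see both

raw_gap = slope(group, pain)  # with a 0/1 group this is the difference in mean pain
gap_after_grade = adjusted_gap(group, pain, coarse_grade)
gap_after_score = adjusted_gap(group, pain, learned_score)
print(f"raw gap:                 {raw_gap:.2f}")
print(f"gap after coarse grade:  {gap_after_grade:.2f}")
print(f"gap after learned score: {gap_after_score:.2f}")
```

The smaller the remaining gap after adjusting for a score, the more of the disparity that score accounts for. Here the outcome is built into the synthetic data, so it illustrates the measurement rather than any finding.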

Pierson said that she didn’t go in assuming they would find signals that the existing, conventional scores weren’t capturing, but there were reasons to believe it was possible. Weighing risk against return, the potential impact was high if their algorithmic approach did find undetected signals.

“It’s quite clear empirically that diversity of training set is important,” she said, adding that, when it comes to medicine in the broader context, “you shouldn’t throw all the women out of the study or only do your analysis on white European ancestry.”

Using machine learning to reduce (rather than perpetuate) bias in healthcare settings

Machine learning systems have a harmful and biased track record when it comes to diversity and inclusivity. Because these systems were, until recently, trained largely on predominantly white datasets, their outputs were at best skewed toward certain demographics. At worst, they were racist and perpetuated discrimination.

Pierson’s research illustrated that patients of color in pain have been disproportionately misdiagnosed by a system designed for white populations. That ultimately affects treatment options. Using the algorithmic predictor that was trained on the diverse dataset, more Black patients would be eligible for knee surgery. The study also found that those with the most severe pain were most likely to be taking painkillers, like opioids. The researchers note that knee surgery intervention could help lower opioid use among certain racial and socioeconomic populations, since it would help with their pain.

The researchers don’t see an algorithmic approach as a replacement for humans. Instead, it can be used as a decision aid. So rather than just a human or an algorithm making the final call, the radiologist can look at both the X-ray and the results from the algorithm, to see if they might have missed something.

Pierson also said that their findings show “the potential for more equitably allocating surgery using these algorithmic severity scores.” While not yet a flawless system, these approaches show promise in narrowing racial disparities in pain assessment and treatment options.

“I think there are follow ups along the directions of surgery allocations and decision aids,” Pierson said. “Those are not hypothetical things, those are things I’m actively interested in.”

An unconventional approach to pain or the new standard?

This approach is a bit more unconventional in that it wasn’t trained to do what the doctor does. It was trained to see what doctors and existing systems are missing. Rather than learn from the doctor, the algorithm was learning from the patient. When clinical knowledge is incomplete or inaccurate, you can go beyond the systems in play and learn from the patient directly.

Another important takeaway from this research is that algorithms can be used for pure knowledge discovery. By training an algorithm to read thousands of X-rays, the researchers were able to link certain parts of the image to pain, detections that radiologists had missed. Because of the black box nature of algorithms, it’s unclear exactly what the algorithm is “seeing,” but the approach can be applied to other medical practices with archaic foundations that might not capture the lived experiences of a diverse patient population.

Pierson said that the study was predicated on the existence of this diverse, publicly available dataset with suitable privacy protections. “Without that data collection, the study wouldn’t have been possible,” she said. Part of the onus ultimately falls on data-collection efforts, ones that are inclusive and ethical. This type of study shows that those efforts are not in vain: they can quite literally ease the pain.

Melanie Ehrenkranz is a writer with a focus on tech, culture, power, and the environment. She has been featured in Gizmodo, Vice’s Motherboard, Medium’s OneZero, National Geographic, and more.


Loka is an elite tech team that helps ship fascinating innovations. Our stories give you a peek into what’s now & next for ML & humanity.