AI Training Datasets in Healthcare
Better training datasets are key to unlocking medical AI’s power. These orgs are on the job.
Google researchers developed an algorithm that could predict acute kidney injury up to 48 hours before onset. Given that kidney injury is a leading killer among hospitalized patients, this sounded like great news.
A few months ago, a new study published in Nature revealed that the original algorithm worked well, unless the patient was female.
Why? Because the model was trained on an almost entirely male dataset from the VA.
And this is just one such example. The issues with medical AI’s inaccuracies—and even generative AI’s hallucinations—often come down to the data the models are trained on.
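Problems like this can often be caught before training ever starts with a simple demographic audit of the dataset. As a minimal sketch (the field name, toy cohort, and 30% flagging threshold below are illustrative assumptions, not anyone's actual pipeline):

```python
from collections import Counter

def demographic_shares(records, field):
    """Return each group's fraction of the dataset for a given field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical toy cohort, skewed roughly the way a mostly-male dataset would be.
cohort = [{"sex": "male"}] * 94 + [{"sex": "female"}] * 6

shares = demographic_shares(cohort, "sex")
print(shares)  # {'male': 0.94, 'female': 0.06}

# Flag any group falling well below parity before training on the data.
underrepresented = [g for g, s in shares.items() if s < 0.30]
print(underrepresented)  # ['female']
```

A check this simple won't catch every source of bias, but it makes the most obvious representation gaps visible before a model quietly bakes them in.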
In our feature on foundation models, we discussed how pretrained, reusable models could improve medical AI’s overall efficiency and applicability. However, these models can also reproduce bias represented in the large training datasets.
This brings us to another key issue plaguing healthcare AI: data diversity.
Today, we’re highlighting three initiatives we believe exemplify some of the most exciting work being done to improve our industry’s data—and thus the availability of better, more representative training datasets.
Check them out—and let us know which you’re most interested in.
A startup building better medical imaging libraries
HealthTech startup Gradient Health recently closed their latest funding round. Their big mission? Decreasing health discrepancies. Their means? Creating the world’s largest medical imaging library.
Gradient Health CEO Josh Miller said that, when AI is built on poor data, “you actually increase the discrepancies. We’re failing at the mission of healthcare AI if we let it only work for white guys from the coasts. People from underrepresented backgrounds will actually suffer more discrimination as a result, not less.”
If startups like Gradient deliver on their promise, medical AI innovators will have better data to turn to for training their models. They won’t need to settle for non-representative data.
A bonus for health systems? Gradient offers to help make these hospitals' stored data actionable by de-identifying it and putting it to work. If you've read our feature on dark data, you know why we love that proposal.
When academia and medical devices team up
Startups aren’t the only ones working to improve the availability of diverse medical datasets.
MIT & Philips’ eICU Collaborative Research Database (eICU-CRD) was initially released in 2016. The pair recently announced a significant expansion of the database at HIMSS in April. The expanded database gives AI researchers access to de-identified data from 200,000 patients across 200 hospitals.
And the best part? The database is open access (to anyone with medical research credentials and human subjects research training, of course).
This makes it an invaluable resource for medical AI researchers from a variety of backgrounds—including researchers from low-resource institutions or cash-strapped early-stage startups.
Advocating for data diversity in healthcare
Here’s a data diversity initiative we’re especially excited about: The STANDING Together Project.
The project seeks to develop standards for healthcare AI dataset diversity—so that “no-one is left behind as we seek to unlock the benefits of AI in healthcare,” as their website boldly declares.
The project’s Working Group is made up of healthcare and AI professionals, but it also seeks to involve the public. (Hey, that might be you!)
This study (which you can participate in here) seeks to gather as many opinions as possible about healthcare AI to help STANDING Together shape their data diversity and healthcare AI policy recommendations. The group published an announcement paper about the project in Nature Medicine last September.
In the long term, the project seeks to evaluate the representativeness and usability of existing datasets, map dataset deficiencies in priority disease areas, and help dataset curators overcome these deficiencies.
Inequality is a major issue in healthcare, which poorly designed algorithms have the potential to exacerbate.
But it’s not just AI-enabled discrimination and inequality we have to worry about. Constantly retraining models built on non-representative data is time-consuming and expensive.
Plus, to truly design a medical AI ecosystem that limits bias, we don’t just need to promote diverse datasets. After all, there are many other sources of bias along the way to producing a healthcare algorithm.
To address all these pathways, we need better representation in the field. We need to promote a diverse body of medical AI researchers and innovators. After all, we all have blind spots that can find their way into our work, whether we intend them to or not.
Increasing access to resources—like the eICU-CRD—is one way to help that happen. Likewise, creating more opportunities for medical AI professionals from a variety of backgrounds to enter, and lead, the field must be a priority.
How else can we better promote data diversity and reduce bias in healthcare AI? What other initiatives should we look into? Reply and let us know!