Four Steps to Measure and Mitigate Algorithmic Bias in Healthcare

Artificial intelligence (AI) and machine learning (ML) are increasingly used in healthcare to combat unsustainable spending and produce better outcomes with limited resources, but healthcare organizations (HCOs) must take steps to ensure they are actively mitigating and avoiding algorithmic bias.

This post is part of our health equity series. Please read our overview post, Why Health Equity Matters in 2022, to learn more about how you can help advance health equity.

While AI/ML has the potential to identify and combat disparities, it also has the potential to inadvertently perpetuate and exacerbate health inequities—despite apparent objectivity. In fact, algorithmic bias in healthcare is already pervasive. The Chicago Booth Center for Applied Artificial Intelligence states, “algorithmic bias is everywhere…Biased algorithms are deployed throughout the healthcare system, influencing clinical care, operational workflows, and policy.”

What is Algorithmic Bias?

Algorithmic bias occurs when issues related to AI/ML model design, data, and sampling result in measurably different model performance for different subgroups. This has the potential to systematically produce results that are less favorable to individuals in a particular group compared to others, without any justification. In healthcare, this can lead to inequitable allocation or prioritization of limited resources.

High-profile cases of algorithmic bias have directly propagated racial health disparities. In 2019, Obermeyer et al. published a seminal paper on algorithmic bias in healthcare that found evidence of racial bias in a widely used Optum algorithm that managed 70 million lives. The bias arose because the algorithm inappropriately used health costs as a proxy for health needs. Because Black people face more barriers to accessing care, as well as systemic discrimination, less money is spent on caring for them than on white people. As a result, Optum’s algorithm incorrectly learned that Black members are healthier than equally sick white members.

This systematically disadvantaged Black members and produced worse outcomes for them, while prioritizing white members for care and special programs even though they were less sick than Black members on average. Researchers found that “the algorithm’s bias effectively reduced the proportion of Black patients receiving extra help by more than half, from almost 50 percent to less than 20 percent. Those missing out on extra care potentially faced a greater chance of emergency room visits and hospital stays.”

What Causes Bias?

There are many causes of algorithmic bias, but the majority fall into two broad categories: subgroup invalidity and label choice bias.

Subgroup Invalidity

Subgroup invalidity occurs when AI/ML is predicting an appropriate outcome or measure, but the model does not perform well for particular subgroups. This may be due to poor calibration, meaning a significant difference between the distributions of predicted and actual outcomes for certain subgroups. Generally, subgroup invalidity occurs when AI/ML models are trained on non-diverse populations, or with data that underrepresents the subgroup or fails to include specific risk factors affecting it.

For example, a recent study of pulse oximeter algorithms, which were originally developed in populations that were not racially diverse, demonstrated subgroup invalidity. This measurement technology uses a cold light that shines through a person’s fingertip and makes it appear red. By analyzing the light that passes through the finger, an algorithm determines blood oxygen levels. However, the study found that “Black patients had nearly three times the frequency of occult hypoxemia that was not detected by pulse oximetry as white patients.” In this case, the algorithm performed poorly, likely because it was primarily trained on white people, potentially resulting in worse health outcomes for Black people.

Label Choice Bias

Label choice bias is more common and more difficult to detect than subgroup invalidity. It occurs when the algorithm’s predicted outcome is a proxy variable for the actual outcome it should be predicting. The source of racial bias Obermeyer et al. found in Optum’s algorithm (predicting healthcare utilization costs in an attempt to predict future health needs) is a quintessential example of label choice bias in healthcare. Using cost as a proxy to allocate extra resources or care demonstrates this form of bias because Black people are less likely to receive necessary care due to systemic discrimination, racial biases, and barriers to care. Their healthcare costs are lower as a result, which makes cost a significantly biased proxy for future health needs.

Four Steps to Address Bias

Algorithmic bias in healthcare is not inevitable. Organizations are taking major steps to ensure AI/ML is unbiased, fair, and explainable. The University of Chicago Booth School of Business has developed a playbook to guide HCOs and policy leaders on defining, measuring, and mitigating bias.

While the playbook describes practical ways to mitigate bias in live AI/ML models, the steps below are also helpful if you are just beginning to consider AI/ML. The playbook provides guidance for technical teams working with AI/ML daily, but also describes preventive oversight structures and ways to address bias from the outset.

Step 1: Inventory AI/ML Models

Simply taking an inventory of the AI/ML models your organization is currently using, developing, and considering implementing will enable you to begin assessing them for bias and establishing structural oversight. Your organization may not have a record of deployed models, but either way, talking to a diverse group of stakeholders and decision-makers across business units will help develop your overview and fill in key details about how AI/ML is being used. Gaining a comprehensive understanding of the kinds of decisions your organization is making and the tools being used to support these processes is key to determining the potential for bias.
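As a rough illustration of what one inventory entry might capture, here is a minimal Python sketch. The `ModelRecord` class, its field names, and the two example models are hypothetical and are not taken from the playbook; the point is only that recording both what a model predicts and what the organization actually wants to know makes later bias screening easier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelRecord:
    """One entry in a hypothetical AI/ML inventory (field names are illustrative)."""
    name: str
    owner: str                # accountable business unit or person
    status: str               # "deployed", "in development", or "under consideration"
    decision_supported: str   # the decision the model output informs
    predicted_label: str      # what the model actually predicts
    ideal_outcome: str        # what the organization really wants to know
    vendor: str = "internal"
    subgroups_evaluated: List[str] = field(default_factory=list)

    def needs_label_review(self) -> bool:
        """Flag entries whose predicted label differs from the stated ideal outcome."""
        return self.predicted_label.strip().lower() != self.ideal_outcome.strip().lower()


# Two made-up entries for illustration only.
inventory = [
    ModelRecord(
        name="care-management-prioritization",
        owner="Population Health",
        status="deployed",
        decision_supported="enrollment in a care management program",
        predicted_label="future healthcare cost",
        ideal_outcome="future health needs",
    ),
    ModelRecord(
        name="readmission-risk",
        owner="Quality",
        status="in development",
        decision_supported="post-discharge follow-up calls",
        predicted_label="30-day readmission",
        ideal_outcome="30-day readmission",
        subgroups_evaluated=["race", "sex", "age band"],
    ),
]

# Models predicting a proxy, and models never screened by subgroup.
to_review = [m.name for m in inventory if m.needs_label_review()]
unscreened = [m.name for m in inventory if not m.subgroups_evaluated]
print(to_review)
print(unscreened)
```

A spreadsheet can serve the same purpose. What matters is that every model has an owner and a stated ideal outcome on record.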

A leader in your organization should also be responsible for overseeing algorithmic bias across all departments. Developing a comprehensive inventory that evolves over time will ideally be supported by active governance from someone with insight into high-level strategy and your organization’s goals. Members of your C-suite are great candidates for this role, and your organization should consider hiring a leader with experience in addressing algorithmic bias and advancing health equity if it lacks one.

Critically, this leader should not work in a silo, and should strive to collaborate with teams across the organization while considering diverse perspectives from internal and external stakeholders. They should work closely or form a team with people from diverse backgrounds to ensure appropriate representation and avoid inadvertently introducing their own biases into their analytical work.

Step 2: Evaluate Models for Bias

Evaluate AI/ML models in your inventory for label choice bias and subgroup invalidity by carefully assessing the data used, potential discrepancies with the ideal predicted outcome, and, where possible, the variables used.

You can begin by examining the data for problems related to a lack of diversity and poor performance for underserved groups. Confirm that the data on which the models are trained is representative of the populations to which they will be applied. One-size-fits-all models that do not account for the unique demographics and qualities of the population they are applied to may exacerbate health disparities. You should also carefully consider how subgroups are defined and the fraction of each subgroup relative to the total population. If subgroup definitions differ between the training set and your population, the model may be biased.
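One simple way to check representativeness is to compare each subgroup's share of the training data with its share of the population the model will serve. The sketch below assumes you have a subgroup label for every person in both populations. The function names, the 5-point tolerance, and the toy data are illustrative choices, not values from the playbook.

```python
from collections import Counter


def subgroup_shares(labels):
    """Return each subgroup's fraction of the population."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_gaps(train_labels, target_labels, tolerance=0.05):
    """List subgroups whose share differs between populations by more than `tolerance`.

    A subgroup missing from one population is treated as having a share of 0 there.
    Values are target share minus training share.
    """
    train = subgroup_shares(train_labels)
    target = subgroup_shares(target_labels)
    gaps = {}
    for group in sorted(set(train) | set(target)):
        diff = target.get(group, 0.0) - train.get(group, 0.0)
        if abs(diff) > tolerance:
            gaps[group] = round(diff, 3)
    return gaps


# Toy data: subgroup "B" is a much larger part of the target population
# than of the training data.
train = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
target = ["A"] * 55 + ["B"] * 35 + ["C"] * 7 + ["D"] * 3
print(representation_gaps(train, target))
```

A flagged gap does not prove the model is biased, but it tells you which subgroups deserve the closest look when you measure performance.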

Next, measure performance in each subgroup to ensure models are well calibrated and do not perform worse for any particular group. Good performance across subgroups does not guarantee a lack of bias, but it is a key step to avoiding it.
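A basic version of this check compares the average predicted risk with the observed event rate in each subgroup. The sketch below assumes you can assemble (subgroup, predicted probability, observed outcome) tuples for a validation population; the function name and toy records are hypothetical.

```python
from collections import defaultdict


def calibration_by_subgroup(records):
    """Compare mean predicted risk with the observed event rate in each subgroup.

    `records` is an iterable of (subgroup, predicted_probability, observed_outcome)
    tuples, where observed_outcome is 0 or 1.
    """
    sums = defaultdict(lambda: {"n": 0, "predicted": 0.0, "observed": 0})
    for group, prob, outcome in records:
        s = sums[group]
        s["n"] += 1
        s["predicted"] += prob
        s["observed"] += outcome

    report = {}
    for group, s in sums.items():
        mean_pred = s["predicted"] / s["n"]
        event_rate = s["observed"] / s["n"]
        report[group] = {
            "n": s["n"],
            "mean_predicted": round(mean_pred, 3),
            "observed_rate": round(event_rate, 3),
            # A positive gap means the model under-predicts risk for this subgroup.
            "gap": round(event_rate - mean_pred, 3),
        }
    return report


# Toy records: both subgroups receive the same predicted risk,
# but events are twice as common in subgroup "B".
records = (
    [("A", 0.2, 1)] + [("A", 0.2, 0)] * 4
    + [("B", 0.2, 1)] * 2 + [("B", 0.2, 0)] * 3
)
report = calibration_by_subgroup(records)
for group, row in report.items():
    print(group, row)
```

Overall averages can hide miscalibration within particular risk ranges, so in practice you would likely repeat the comparison within bands of predicted risk and with enough people per subgroup for the rates to be stable.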

Once you’ve ruled out subgroup invalidity, you will also need to evaluate the model for label choice bias. Unfortunately, assessing label choice bias requires domain experience and cannot be automated. You should establish processes to systematically evaluate the potential for label choice bias with multiple stakeholders and also attempt to ensure the labels used in training data align with your organization’s intended use.

If your AI/ML model is predicting a proxy outcome, you will need to concretely define both the proxy and your ideal outcome in the data to compare calibration effectively. For example, “future health needs” is abstract and poorly defined, and cost is not an adequate proxy for it. Instead, you will need a more complex but measurable metric that represents the ideal outcome so you can compare it with the outcome in use. Comparing the outcome in use to a composite assessment of health needs, such as the Charlson Comorbidity Index, will enable you to better evaluate how the model actually performs across subgroups. In practice, proxy measures should be multifactorial and should take into account how model outputs will be used, the data you have available, and input from a diverse group of stakeholders.
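Once an ideal outcome has a concrete measure, one way to look for label choice bias is to ask whether people who receive similar model scores have similar measured need in every subgroup. The sketch below groups people into bands by model score and reports mean need per band and subgroup. The `health_need` value stands in for whatever composite measure you choose, and the function name, band count, and toy rows are all illustrative.

```python
from collections import defaultdict
from statistics import mean


def need_by_score_band(rows, n_bands=4):
    """Mean ideal-outcome measure for each (score band, subgroup) pair.

    `rows` is a list of (subgroup, model_score, health_need) tuples.
    Bands are equal-sized groups of rows ranked by model_score (band 0 = lowest).
    Tied scores may straddle a band boundary.
    """
    ranked = sorted(rows, key=lambda r: r[1])
    band_values = defaultdict(list)
    for rank, (group, _score, need) in enumerate(ranked):
        band = min(rank * n_bands // len(ranked), n_bands - 1)
        band_values[(band, group)].append(need)
    return {key: round(mean(vals), 2) for key, vals in sorted(band_values.items())}


# Synthetic rows: at the same model score, subgroup "B" has higher measured need.
rows = [
    ("A", 0.1, 1), ("B", 0.1, 2), ("A", 0.2, 1), ("B", 0.2, 3),
    ("A", 0.8, 3), ("B", 0.8, 5), ("A", 0.9, 4), ("B", 0.9, 6),
]
result = need_by_score_band(rows, n_bands=2)
for (band, group), avg_need in result.items():
    print(f"band {band}, subgroup {group}: mean need {avg_need}")
```

If one subgroup consistently shows higher need than another within the same score band, the score is understating that subgroup's need, which is the pattern label choice bias produces. Deciding whether the need measure itself is appropriate still requires the domain review described above.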

Finally, evaluate AI/ML models for feature bias, which occurs when the meaning of individual features (variables) in a model differs across subgroups. This can arise due to differences in access to care, how diagnoses are ascertained, and health data collection. As an example, a model making predictions related to respiratory disease may demonstrate feature bias if it uses air quality reports without considering that rural populations may have missing data, as smaller towns are not necessarily required to regularly report air quality to the EPA. Again, healthcare costs are a feature that can introduce significant bias. Using relevant and diverse features can help to achieve high accuracy and validity across different subgroups.
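Missing data is one form of feature bias that is straightforward to measure. The sketch below computes how often a feature is missing in each subgroup, using the air quality example. The field names (`setting`, `air_quality_index`) and the rows are made up for illustration.

```python
def missing_rate_by_subgroup(rows, feature, group_key):
    """Fraction of rows in each subgroup where `feature` is missing (None or absent)."""
    totals, missing = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        if row.get(feature) is None:
            missing[group] = missing.get(group, 0) + 1
    return {g: missing.get(g, 0) / totals[g] for g in totals}


# Toy rows: air quality readings are mostly unavailable for rural members.
rows = [
    {"setting": "urban", "air_quality_index": 42},
    {"setting": "urban", "air_quality_index": 57},
    {"setting": "urban", "air_quality_index": 61},
    {"setting": "urban", "air_quality_index": 38},
    {"setting": "rural", "air_quality_index": None},
    {"setting": "rural", "air_quality_index": None},
    {"setting": "rural", "air_quality_index": 55},
    {"setting": "rural", "air_quality_index": None},
]
rates = missing_rate_by_subgroup(rows, "air_quality_index", "setting")
print(rates)
```

A large difference in missingness suggests the feature carries different information for different subgroups. It does not catch cases where a feature is present but means something different, such as cost, so it complements rather than replaces expert review.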

Step 3: Retrain or Discontinue Biased Models

Biased algorithms directly affect health outcomes and perpetuate inequities, and ideally should not be deployed. If you’ve discovered bias in a deployed model, it must be retrained or discontinued.

To mitigate label choice bias, you may be able to retrain the model using, as your new composite outcome, the same variables you originally used to illustrate the presence of bias. The Chicago Booth School of Business team demonstrated this by replacing cost as the proxy variable for predicting future care needs and retraining “a new candidate model using active chronic conditions as the label, while leaving the rest of the pipeline intact. This simple change doubled the fraction of Black patients in the high-priority group: from 14% to 27%.” They note that this approach represents just one option, and that the best approach will depend on your specific circumstances.
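To judge whether a retrained model changes who gets prioritized, you can compare a subgroup's share of the high-priority group under the old and new scores. The sketch below does not retrain anything; it assumes you already have scores from two candidate models for the same people. The field names, the 50% cutoff, and the rows are synthetic and do not reproduce the playbook's figures.

```python
def subgroup_share_of_top(rows, score_key, group_key, group, top_fraction=0.25):
    """Share of the highest-scoring `top_fraction` of rows that belong to `group`."""
    ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    top = ranked[:k]
    return sum(1 for r in top if r[group_key] == group) / k


# Synthetic scores from two hypothetical models for the same eight people:
# one trained on a cost label, one trained on a chronic-condition label.
rows = [
    {"group": "A", "cost_score": 0.9, "condition_score": 0.6},
    {"group": "A", "cost_score": 0.8, "condition_score": 0.5},
    {"group": "A", "cost_score": 0.7, "condition_score": 0.7},
    {"group": "B", "cost_score": 0.6, "condition_score": 0.9},
    {"group": "B", "cost_score": 0.5, "condition_score": 0.8},
    {"group": "A", "cost_score": 0.4, "condition_score": 0.3},
    {"group": "B", "cost_score": 0.3, "condition_score": 0.4},
    {"group": "B", "cost_score": 0.2, "condition_score": 0.2},
]
share_cost = subgroup_share_of_top(rows, "cost_score", "group", "B", top_fraction=0.5)
share_cond = subgroup_share_of_top(rows, "condition_score", "group", "B", top_fraction=0.5)
print(share_cost, share_cond)
```

The cutoff should match how the score is used in practice, for example the fraction of members your program has capacity to enroll.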

Models will also need to be routinely retrained to avoid becoming biased over time. Models can become biased through feature drift, in which the distribution of AI/ML variables in the target population begins to substantially differ from the distribution in the training population. Feature drift occurs naturally, and can be addressed by regularly monitoring deployed model performance and retraining when appropriate. Additionally, major updates to your population (e.g., a merger with another organization), changes to code sets, or shifts in chronic disease prevalence (e.g., respiratory disease rates during the COVID-19 pandemic) will all require models to be retrained to avoid bias and significant accuracy degradation.
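One common way to quantify feature drift is the population stability index (PSI), which compares how a feature is distributed across bins in the training sample and in a current sample. PSI is our choice for this sketch rather than something the playbook prescribes; the bin edges, the floor for empty bins, and the toy ages are assumptions.

```python
import math


def population_stability_index(expected, actual, bin_edges, floor=1e-6):
    """PSI between a training-time sample and a current sample of one feature.

    `bin_edges` are interior cut points, so values fall into len(bin_edges) + 1 bins.
    Empty bins are floored at `floor` to keep the logarithm defined.
    """
    def bin_shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            idx = sum(v > edge for edge in bin_edges)  # number of edges exceeded
            counts[idx] += 1
        return [max(c / len(values), floor) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Toy example: member ages at training time vs. today, binned at 45 and 65.
training_ages = [30, 40, 50, 60, 70, 80]
current_ages = [30, 50, 60, 70, 75, 80]
no_drift = population_stability_index(training_ages, training_ages, [45, 65])
drift = population_stability_index(training_ages, current_ages, [45, 65])
print(round(no_drift, 4), round(drift, 4))
```

A PSI of zero means the binned distributions match, and larger values mean more drift. Often-quoted rule-of-thumb thresholds (such as 0.1 and 0.25) are conventions rather than guarantees, so choose alert levels that fit your own data, and consider computing PSI within each subgroup as well as overall.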

Sometimes retraining a biased model to resolve label choice bias or subgroup invalidity may not be feasible. This is often the case if your organization is evaluating proprietary, unexplainable “black-box” algorithms from third-party providers. In this case, your organization may be unable to retrain the model at all or evaluate its design in enough depth to realistically assess whether changes meaningfully reduce bias.

If you cannot address and correct the causes of bias in a model, then you should not use it. Instead, you should consider using the first two steps above to begin the process anew and either produce or purchase a new model that avoids label choice bias, is well-calibrated across subgroups, and can be routinely evaluated and modified.

Step 4: Establish Structural Processes for Bias Management

HCOs using AI/ML should establish structural processes for bias management that are directly overseen by an algorithmic bias steward and ideally a dedicated, diverse team formed to address bias. According to the Chicago Booth playbook, organizations should take the following steps:

  • Create a system to report bias concerns. Enable everyone in your organization to formally report algorithmic bias concerns without worrying about any repercussions.
  • Create standardized requirements for documenting algorithms. Detail all relevant information related to your models and expect clear, comprehensive documentation of any models you purchase from a third party.
  • Establish a routine schedule for auditing AI/ML. Consistently monitor, manage, and govern any deployed AI/ML.
  • Consider partnering with external oversight. Involving a third party to audit your AI/ML can help to ensure accountability and provide guidance if bias is detected.
  • Stay on the pulse of AI/ML developments. AI/ML is a rapidly growing field and suggested guidelines are constantly evolving.

ClosedLoop’s Approach to Bias

At ClosedLoop, we are committed to addressing algorithmic bias in healthcare and minimizing the ways it obstructs health equity. Our AI/ML positively impacts tens of millions of lives daily, and our data science platform is specifically designed to help HCOs identify and address bias.

In April of 2021, we won the Centers for Medicare and Medicaid Services (CMS) AI Health Outcomes Challenge — the largest healthcare-focused AI challenge in history. Demonstrating a comprehensive, effective approach to mitigating algorithmic bias was a key part of this $1.6 million challenge, which centered on creating explainable AI solutions to predict adverse health events. Our work during the challenge included feedback from Ziad Obermeyer and David Kent, two authors of seminal papers on bias in healthcare AI. To learn more about how we addressed bias and won the challenge, please watch this webinar led by our CTO, Dave DeCaprio.

Joseph Gartner, our director of data science and professional services, has written a series of posts to help HCOs identify and address algorithmic bias concerns that may widen equity gaps.

Each of his posts explains key concepts, provides best-practice suggestions for healthcare data practitioners, and breaks down why bias in healthcare must be approached differently compared to other industries. In particular, “A New Metric…” explains why disparate impact, the most common metric traditionally used to evaluate algorithmic fairness, is entirely unsuited to healthcare, and provides an alternative metric created by ClosedLoop.

Other members of ClosedLoop leadership have also written at length on addressing algorithmic bias. Carol McCall, our chief health analytics officer, recently published an article on the subject in STAT, and was featured on a MedCity News panel, How to incorporate Robust Bioethics in AI algorithms. In both the article and the panel, she emphasized the importance of explainability and argued that “black-box” models aren’t adequate for healthcare. She asserts that they must insist on “AI algorithms that are fully transparent, deeply explainable, completely traceable, and able to be audited. Anything less is unacceptable.”

Ultimately, AI/ML is not directly at fault for reinforcing systemic biases. Algorithms are designed by humans and told what to predict, what data to use, how to calculate predictions, and what population to make them on. As a result, they can inadvertently reflect and perpetuate the inequities and biases infused in the healthcare system. However, AI/ML is also an incredibly powerful tool for enacting positive change. Not only can we take steps to avoid and prevent algorithmic bias, but with AI/ML we can also unearth inequities across healthcare and leverage the mountains of available data to produce better outcomes and advance health equity.

This post is the final part of our health equity series. If you’re interested in learning more about health equity and what can be done to achieve it, please check out our comprehensive overview post, Why Health Equity Matters in 2022, and our previous posts:


How You Can Develop and Launch a Strategy to Prioritize Health Equity

How COVID-19 Exacerbated Health Disparities

3 Ways Healthcare Organizations Can Advance Health Equity by Addressing Social Determinants of Health