The Importance of Interpretability in Healthcare AI
We need to empathize with the people who use our tools. All too often, these people are overlooked, and AI projects with impressive performance metrics still fail to reach deployment.
Written by Joseph Gartner, Director of Data Science, ClosedLoop
Originally published December 2, 2020. Last updated June 19, 2023. • 8 min read
It’s a Thursday morning. You were up late last night iterating on your customer churn model, vetting new features, tweaking your algorithm’s hyperparameters, and, since no one else was in the office, singing along to Taylor Swift at full blast. It was a perfect evening. Several of the new features added lift, and you struck a great balance between performance and overfitting. You’ve just produced the best ROC curve anyone on the team has ever made for this problem. You roll into the office, excited to share the good news with the folks who make the calls based on your models: the customer success team. To your surprise, they’re not happy with your model and want you to go back to the drawing board.
When situations like this occur, it’s easy to think, “The model is great; they just don’t get it.” Most of us have had this thought before, but it has no place in the mind of a data scientist who wants to succeed. As data scientists, it’s our job to solve problems logically with evidence. Too often, we inadvertently make the most complicated model we can imagine. This approach generally leads to Thursday mornings that feel a lot like the one above. So how can we combat this tendency? I find it’s best to take the opposite approach and start by asking stakeholders simple questions. Here are a few questions I tend to ask:
If you had to make this prediction, how well do you think you’d do?
If you feel you’d do well, what factors would you base your decisions on?
Are we capturing all of these factors in the data?
Stakeholders tend to favor models that reflect their preexisting insights and are therefore easily interpretable. The best initial model will typically account for a small number of features that reflect the intuitions of the folks on the ground. Once you get buy-in, iterating on the model to add new information should be a journey you take together. Aligning with stakeholder intuition is the sugar that helps the medicine of complexity go down.
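As a concrete illustration of this “start simple” approach, here’s a minimal sketch (with entirely hypothetical churn features, not any particular production model): a shallow decision tree trained on a handful of intuitive signals, whose rules stakeholders can read directly.

```python
# A "start simple" baseline: a shallow decision tree on a few intuitive,
# hypothetical features. The printed rules are readable by non-statisticians.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["months_active", "support_tickets", "logins_last_30d"]
X = np.array([[24, 1, 20], [2, 6, 1], [18, 0, 15], [3, 5, 2],
              [30, 2, 25], [1, 7, 0], [12, 1, 10], [4, 8, 1]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # 1 = churned (toy labels)

# max_depth=2 keeps the model small enough to discuss line by line.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))
```

A rule set like this is something you can walk through with the customer success team in one sitting, which makes the later, more complex iterations far easier to sell.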
Empathy Topples Digital Dictators
As data scientists, the output of our work has serious ramifications for other human beings. The tools we create often dictate the workflows of end users, so it’s essential that we understand in detail how those people interact with them. Stakeholders generally lack training in advanced statistics but often rely on the models we create. A small design decision on our end can radically impair their ability to leverage predictions, understand intended workflows, connect with the right person, or even perform the core functions of their jobs, so we must anticipate their needs and perspectives and iterate accordingly.
As models are increasingly used to direct human work, we should aspire to engineer in the qualities of an effective, empathetic leader. Your tool must clearly communicate the reasoning behind its predictions and support stakeholder goals. If the people working with it are unable to interpret its decisions or feel constrained by its suggestions, they will inevitably come to resent it as one would a terrible boss. This comparison may seem extreme, but increasingly, human work is being directed by robot supervisors that can set unrealistic goals and workflow patterns in pursuit of optimization. If you’re working on systems like these, aim to make your robot a John Keating, not a Charles Kingsfield. There’s nothing worse than end users regarding the fruit of your labor as an uncaring dictator.
Boots on the Ground
Empathizing with the people who use our tools is essential. When those people are overlooked, AI projects fail to reach a deployed state despite boasting impressive performance metrics. Developing ML models in hermetically sealed environments is simply an anti-pattern that directly contributes to failure and frustration.
In pursuit of successful deployments predicated on empathy, consider the opposite extreme: spending every waking moment observing firsthand how ML models are impacting decisions. Dr. Chris White put this approach to the test when he dove headfirst into a warzone to improve data-mining tools. In 2010, he deployed to Afghanistan to understand how military personnel in the field were collecting, analyzing, and using data to influence decision making. You might be able to push opaque algorithms onto folks in the white-collar world, but that won’t fly with a Marine unit at a forward outpost.
Understanding the perspective of the boots on the ground made all the difference. Dr. White’s data-mining tools were purpose-built for warfare and incredibly successful. This experience stuck with Dr. White and informed his approach while he oversaw DARPA’s Memex program. The broad purpose of the Memex program was to produce a search engine that included the dark web, organized search results into logical clusters of information, and stored knowledge across user sessions. In practice, the program was building one of the best tools in the world for combatting human trafficking. Dr. White’s experience overseas drove home the importance of usability. Memex’s tools underwent continual user testing, with a binary evaluation standard: does this tool help—yes or no? The program was ultimately deemed a success, and directly resulted in tools that are still widely used today to bolster human trafficking investigations.
Healthcare data scientists have it comparatively easy; instead of dodging bullets in a warzone, we merely have to embed with the folks using our tools in an office or hospital setting. Given the COVID-19 pandemic, this might not be a viable option, but compromises are certainly possible. Before you show stakeholders a set of results, you should be in the habit of interacting with predictions. If a predictive model suggests the opportunity for an intervention with a patient, is it crystal clear why a caregiver should follow through? If you can’t surface decision logic, it’s highly unlikely that caregivers will be receptive to your model’s predictions.
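One lightweight way to surface decision logic is to report the top contributions to an individual prediction. The sketch below, using hypothetical patient features rather than any real clinical model, ranks the coefficient-times-value terms of a linear risk model so a caregiver can see *why* a given patient scored high.

```python
# Hedged sketch: per-patient "reasons" from a linear risk model. Feature names
# and data are made up for illustration; a real model would need validated
# clinical features and calibration.
import numpy as np
from sklearn.linear_model import LogisticRegression

features = ["prior_admissions", "a1c", "missed_appointments"]
X = np.array([[0, 5.2, 0], [3, 9.1, 2], [1, 6.0, 5],
              [0, 5.5, 1], [4, 8.8, 3], [2, 7.5, 0]])
y = np.array([0, 1, 0, 0, 1, 1])  # 1 = adverse outcome (toy labels)

model = LogisticRegression(max_iter=1000).fit(X, y)

def top_reasons(x, k=2):
    # For a linear model, each feature's contribution to the log-odds is
    # simply coefficient * value; rank them to explain this one prediction.
    contrib = model.coef_[0] * x
    order = np.argsort(contrib)[::-1]
    return [(features[i], float(contrib[i])) for i in order[:k]]

print(top_reasons(X[1]))  # the highest-risk patient's leading factors
</Sure>```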
You should know how stakeholders interact with your predictions, and if you or your team make the UI for the predictions, you should participate in user testing sessions. It can be exceedingly challenging to swallow your pride in the face of disappointing feedback, but it’s critically important to see this as an opportunity to improve the tool. In the case that your predictions are embedded in a system that you didn’t create, it’s still important to work with users to understand their interactions with scores. While you might not have the same ability to effect change, you should still be on the hunt for opportunities to make your predictions more actionable.
More Than “Nice to Have”
In a lot of work environments, clear decision logic is a “nice to have” for auditing purposes, but at the end of the day, performance reigns supreme. While this paradigm can cause serious headaches for the people using AI and ML tools, it often plays to the strengths of data scientists. Many of us got into the field because we are mathematically inclined. We like puzzles, we like challenging optimization problems, and we love arguing about the arcane details of algorithms. It can certainly feel like interpretability is just a concession made for interacting with the plebeians, but that’s ultimately a naive perspective. When you’re having thoughts like these, keep in mind the quote often attributed to Einstein: “If you can’t explain it simply, you don’t understand it well enough.”
For a great example of interpretability’s importance, look no further than this study of image classifiers differentiating between dogs and wolves. When the decision logic of these seemingly accurate dog-vs-wolf classifiers was examined, a simple fact was revealed: the neural network had learned to tell snow from grass. It just so happens that most pictures of wolves are taken in snowy climates. Once the decision logic was surfaced, it became apparent that the background, not the animal itself, was the driving factor. This finding improved image classification on the whole, and it probably wouldn’t have happened without an intimate understanding of how the algorithms worked and the need to explain it.
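The spirit of that finding is easy to reproduce. The sketch below uses synthetic tabular data, not the study’s actual images or methods: a “snow” background feature is spuriously correlated with the wolf label, and inspecting the learned weights shows the model leaning on the background rather than the animal.

```python
# Illustrative sketch (not the original study's code): a classifier "cheats"
# on a spurious background feature, which inspecting its weights exposes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)  # 1 = wolf, 0 = dog

# "snow" is almost perfectly correlated with the wolf label (the confounder);
# "animal shape" is only weakly informative and noisy.
snow = (y == 1).astype(float) * 0.95 + rng.normal(0, 0.1, n)
shape = y * 0.3 + rng.normal(0, 0.5, n)
X = np.column_stack([snow, shape])

model = LogisticRegression(max_iter=1000).fit(X, y)
coef_snow, coef_shape = model.coef_[0]
# The snow weight dominates: the model mostly learned the background.
print(f"snow weight: {coef_snow:.2f}, shape weight: {coef_shape:.2f}")
```

The performance metrics alone would look superb here; only looking at the decision logic reveals that the model would fall apart on a wolf photographed in a meadow.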
As data scientists, we should continue to be the nerds we are. Burn the midnight oil, take risks on tricky math, and keep singing “Shake It Off” like no one is listening. But we should also endeavor to be humanists who cherish the opportunity to get new perspectives on our models. In short, learn to love your Thursday mornings; your models will be better for them.