What’s the best way to communicate the accuracy of a predictive model?
This is a question I have often returned to while spending the last 15 years building predictive models for healthcare. Measuring the accuracy of a predictive model is straightforward, since most modeling frameworks make it easy to compute a host of accuracy statistics. The problem is that statistics by themselves don’t communicate the accuracy of a model in a way that helps stakeholders understand how to use it effectively.
The problem shows up in questions like “Is a 0.82 good?” and “Is a 0.85 way better than a 0.82, or are those pretty close?” When a doctor, manager or business user asks these questions, they are trying to understand the model’s accuracy in terms that are relevant to the decisions they need to make. It’s not that they don’t understand what the statistics mean; it’s that accuracy stats aren’t sufficient to answer questions like “How much money will this model save us versus our current approach?” and “Should we invest more in improving accuracy, or is this model good enough?”
This blog post is the first of a two-part series in which I’ll show two simple graphs that I have found to be very effective at communicating model accuracy in healthcare settings. I’ll cover what the graphs are, how they differ in small but important ways from the traditional Receiver Operating Characteristic (ROC) curve, and how we use them to explain model accuracy to stakeholders. I’ll cover the first graph, the Outcome Capture curve, in this post, and the second, the Return On Investment (ROI) graph, in part 2.
The first graph we use is the Outcome Capture curve. It’s very useful in any situation where you are using predictions to choose a group of people to act upon. For example, healthcare providers use predictions of readmission risk to target patients for home health visits after surgery, and insurers use predictions of member churn to target members for re-enrollment incentive programs. In both cases, a predictive model identifies people who are highly likely to have some negative outcome, and the highest-risk people are targeted for some intervention.
For these situations, the most relevant accuracy metric is how good the model is at targeting the interventions. In the readmission example, if you have the ability to intervene on 5% of the patients, then the best way to compare the accuracy of two predictors would be to look at which identifies more readmissions in the top 5% of the population. In most cases, this 5% threshold isn’t fixed. You could potentially intervene on only 3%, or maybe up to 10% of the population. The Outcome Capture curve shows the accuracy for all different levels of the threshold. The graph for a readmission model is shown below:
We draw this graph using a historical test set (not the one used to build the model) where we can compare the model’s predictions versus what actually happened. The horizontal axis represents different percentages of the population that are selected, and the vertical axis indicates the percentage of the total readmissions that occurred in that group. The yellow line on the graph passes through the point at 10 on the horizontal axis and 22 on the vertical axis. This means that the 10% of patients with the highest predicted risk were responsible for 22% of the total readmissions.
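As a rough sketch of the construction just described, the curve can be computed by ranking the test set from highest to lowest predicted risk and accumulating the outcomes. The function name `outcome_capture_curve` and the toy data are illustrative assumptions, not the API of the ClosedLoop.ai Python package.

```python
import numpy as np

def outcome_capture_curve(y_true, y_score):
    """Return (population_fraction, outcome_fraction) for an Outcome Capture curve.

    y_true  : actual outcomes on a held-out test set (1 = readmitted, 0 = not)
    y_score : the model's predicted risk for the same people
    """
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    # Rank people from highest to lowest predicted risk (ties keep input order).
    order = np.argsort(-y_score, kind="stable")
    captured = np.cumsum(y_true[order])
    population_fraction = np.arange(1, len(y_true) + 1) / len(y_true)
    outcome_fraction = captured / captured[-1]
    # Prepend the origin so the curve starts at (0, 0).
    return (np.concatenate([[0.0], population_fraction]),
            np.concatenate([[0.0], outcome_fraction]))

# Toy test set: 10 patients, 4 readmissions (made-up values for illustration).
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
pop, cap = outcome_capture_curve(y_true, y_score)

# cap[i] is the share of all readmissions found in the top pop[i] of patients.
for p, c in zip(pop, cap):
    print(f"top {p:4.0%} of patients -> {c:4.0%} of readmissions")
```

Plotting `pop` on the horizontal axis and `cap` on the vertical axis gives the curve. Tied scores are ordered arbitrarily here, which a production implementation may want to handle more carefully.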
The most important tradeoff to make when using a predictive model to target an intervention is how many interventions to do. Doing more interventions takes more resources, but allows more negative outcomes to be avoided. The Outcome Capture curve directly demonstrates this tradeoff. Instead of presenting the model accuracy as an abstract statistic, it is presented in the context of a key decision that needs to be made regarding the model’s implementation.
Drawing multiple lines on the graph allows us to compare the accuracy of different models. More accurate models will capture a higher percentage of the outcomes (be higher on the graph) from the same proportion of the population. In this graph the yellow line shows predictions from a machine learning model while the black line is from the LACE Index, a rule-based approach to readmission risk. The machine learning model outperforms the LACE Index for any level of intervention. At 10% of the population, the LACE Index captures 16% of readmissions, while the machine learning model captures 22%, a gain of 6 percentage points, or roughly 38% in relative terms.
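To make the comparison concrete, here is a minimal sketch that reads one point off the curve for two models and reports the gap both in percentage points and in relative terms. The helper `capture_at`, the two score vectors, and the 30% threshold are hypothetical and chosen only so the toy numbers are easy to follow.

```python
import numpy as np

def capture_at(y_true, y_score, fraction):
    """Share of all outcomes found in the top `fraction` of people by score."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    n_selected = max(1, int(round(fraction * len(y_true))))
    top = np.argsort(-y_score, kind="stable")[:n_selected]
    return y_true[top].sum() / y_true.sum()

# Toy test set: 10 patients, 4 readmissions, scored by two hypothetical models.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1])
model_a = np.array([0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4, 0.2, 0.1, 0.5])
model_b = np.array([0.9, 0.8, 0.3, 0.7, 0.1, 0.2, 0.4, 0.2, 0.1, 0.5])

capture_a = capture_at(y_true, model_a, 0.3)
capture_b = capture_at(y_true, model_b, 0.3)

absolute_gain = capture_a - capture_b      # difference in share of outcomes captured
relative_gain = capture_a / capture_b - 1  # improvement relative to model B

print(f"model A: {capture_a:.0%}, model B: {capture_b:.0%}")
print(f"absolute gain: {absolute_gain:.0%} points, relative gain: {relative_gain:.0%}")
```

Reporting both figures helps avoid ambiguity, since a relative improvement can sound much larger than the underlying difference in percentage points.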
ROC curves are traditionally used to show model accuracy. The ROC curve below compares the same two readmission models as the Outcome Capture curve above.
You can see that the graphs share a lot of similarities. In both cases the yellow line is above the black one and the overall shapes are similar. In fact, the vertical axes in both graphs are the same. The definition of True Positive Rate (TPR) used in the ROC curve is identical to Outcome Capture for binary predictions. We prefer the term Outcome Capture because it’s more easily interpretable to non-statisticians and extends easily to continuous predictions, which we will cover below. In both graphs, random predictions are represented as a diagonal line from the bottom left to the top right of the graph.
The critical difference between the graphs is the horizontal axis. In the ROC curve, the horizontal axis is the False Positive Rate (FPR), while in the Outcome Capture curve it is a percentage of the population. This is significant because the Outcome Capture curve shows what level of accuracy can be achieved by intervening on different proportions of the population. Determining the number of interventions to do, and consequently the resources required to perform the intervention, is the key tradeoff that needs to be made in implementing the model.
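The two horizontal axes are linked by the outcome rate in the test set: the fraction of the population selected equals prevalence × TPR + (1 − prevalence) × FPR. The sketch below checks this identity on made-up data; all variable names and values are illustrative assumptions.

```python
import numpy as np

# Toy test set (made-up values): 1 = readmitted, 0 = not readmitted.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1], dtype=float)
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

order = np.argsort(-y_score, kind="stable")  # highest predicted risk first
sorted_true = y_true[order]
k = np.arange(1, len(y_true) + 1)            # number of people selected

true_pos = np.cumsum(sorted_true)            # readmissions among the top k
false_pos = k - true_pos                     # non-readmissions among the top k

tpr = true_pos / y_true.sum()                   # vertical axis of both graphs
fpr = false_pos / (len(y_true) - y_true.sum())  # ROC horizontal axis
population_fraction = k / len(y_true)           # Outcome Capture horizontal axis

# The Outcome Capture axis is a prevalence-weighted blend of TPR and FPR.
prevalence = y_true.mean()
reconstructed = prevalence * tpr + (1 - prevalence) * fpr
print(np.allclose(population_fraction, reconstructed))
```

When the outcome is rare, the population fraction stays close to the FPR, which is one reason the two graphs tend to look alike.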
The change in the horizontal axis changes the meaning of the Area Under the Curve (AUC) metric. For an ROC curve, a perfect predictor will have an AUC of 1.0, because its curve passes through the point where the TPR is 1.0 and the FPR is 0.0. The Outcome Capture curve doesn’t have this property, since some non-zero percentage of the population will always have to be included to identify all of the positive outcomes. It is possible to compute an AUC for the Outcome Capture curve, but we prefer to still use ROC AUC when a single accuracy statistic is needed because many people are familiar with it and it has the nice property of 1.0 being a perfect predictor.
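A small sketch illustrates the point. For a binary outcome, a perfect ranking rises in a straight line until the selected fraction equals the outcome rate and is flat afterwards, so the area under the Outcome Capture curve works out to 1 − prevalence / 2 rather than 1.0. The function `capture_curve_auc` and the toy data are hypothetical, and the area is computed with a hand-written trapezoidal rule.

```python
import numpy as np

def capture_curve_auc(y_true, y_score):
    """Area under the Outcome Capture curve via the trapezoidal rule."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    # Curve points, starting from the origin.
    x = np.concatenate([[0.0], np.arange(1, len(y_true) + 1) / len(y_true)])
    y = np.concatenate([[0.0], np.cumsum(y_true[order]) / y_true.sum()])
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

# Toy test set with a 20% outcome rate (made-up values).
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
perfect_scores = y_true.astype(float)  # a perfect model ranks every positive first

prevalence = y_true.mean()
perfect_auc = capture_curve_auc(y_true, perfect_scores)

print(f"perfect-predictor capture AUC: {perfect_auc:.3f}")
print(f"1 - prevalence / 2:            {1 - prevalence / 2:.3f}")
```

Because this ceiling depends on the outcome rate, capture-curve AUCs are hard to compare across populations with different prevalence.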
Unlike an ROC curve, the Outcome Capture curve can be drawn for models predicting a continuous outcome. Again, this graph is most relevant if the intended use of the model is to target those with the highest predictions for some intervention. In healthcare, a common use case is to identify patients who are likely to be high cost to target them for additional care coordination. Below we show a graph for predictions of medical cost.
This graph compares a machine learning prediction of medical cost with an existing approach that simply ranks patients based on their historical costs. If we look at the 5% of the population predicted to have the highest cost by the machine learning model, those patients represent 38% of the total overall cost. If we use prior cost as our only guide, the top 5% only represent 33% of the total cost. The machine learning model identifies 15% more cost in the same size population.
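For a continuous outcome the calculation is the same as in the binary case, except that actual cost is summed instead of counting events. The sketch below assumes a hypothetical helper `cost_capture_at` and made-up cost figures, then checks the relative improvement implied by the 38% and 33% figures quoted above.

```python
import numpy as np

def cost_capture_at(actual_cost, predicted, fraction):
    """Share of total actual cost incurred by the top `fraction` ranked by `predicted`."""
    actual_cost = np.asarray(actual_cost, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n_selected = max(1, int(round(fraction * len(actual_cost))))
    top = np.argsort(-predicted, kind="stable")[:n_selected]
    return actual_cost[top].sum() / actual_cost.sum()

# Toy data for 10 members (made-up dollar amounts, totaling 100,000 in actual cost).
actual_cost = np.array([50_000, 1_000, 30_000, 2_000, 500, 10_000, 3_000, 1_500, 1_000, 1_000])
model_pred = np.array([45_000, 1_500, 28_000, 2_500, 800, 9_000, 2_000, 1_200, 900, 1_100])
prior_cost = np.array([40_000, 800, 5_000, 3_000, 400, 20_000, 2_500, 1_000, 1_200, 900])

model_capture = cost_capture_at(actual_cost, model_pred, 0.2)  # rank by model prediction
prior_capture = cost_capture_at(actual_cost, prior_cost, 0.2)  # rank by historical cost
print(f"model: {model_capture:.0%} of cost, prior cost: {prior_capture:.0%} of cost")

# Relative improvement implied by the figures in the text: 38% versus 33%.
article_gain = 0.38 / 0.33 - 1
print(f"relative improvement: {article_gain:.0%}")
```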
The ROC curve has been around for more than 70 years and has proven very useful for a wide range of applications, but there is no one perfect accuracy graph for all situations. Depending on how a model is being used, different decisions will be important, and so different graphs will be most relevant.
We’ve found the Outcome Capture curve to be very powerful in situations where predictions are going to be used to select a high-risk portion of the population for interventions. It is similar to an ROC curve, but presents the information in a format that helps to directly answer the question of how many interventions should be done. It also has the advantage that it can be used with both binary and continuous predictions.
At ClosedLoop.ai, our healthcare predictive analytics platform includes the Outcome Capture curve as a part of our standard reporting of model accuracy, both in our user interface and our Python analysis package. For binary classification models, we also display the ROC curve, but we have repeatedly found that both data scientists and business owners find the Outcome Capture curve to be more useful in answering questions about accuracy.
In the next post I’ll cover how we use the information in this graph, along with some assumptions about intervention cost and effectiveness, to generate graphs of return on investment (ROI) that provide even more relevant information for stakeholders.