In the deep ocean of big data, it’s hard for companies to know what’s true or even relevant to their operations. The latest research from Hamsa Bastani, Wharton professor of operations, information and decisions, can help companies navigate the waters by offering a better way to use predictive analytics. Bastani spoke with Knowledge at Wharton about her paper, “Predicting with Proxies.”
An edited transcript of the conversation follows.
Knowledge at Wharton: This paper focuses on predictive analytics. How do companies use predictive analytics today?
Hamsa Bastani: A lot of companies across a variety of applications are starting to use predictive analytics to guide their decision-making. For example, in e-commerce, companies like Amazon or Expedia use customer-specific data to try to predict what sorts of products a customer might be interested in and then use that to make personalized product recommendations.
Knowledge at Wharton: This process often uses something called a proxy outcome. What’s the difference between a proxy outcome and an actual outcome? And why do firms settle for proxies?
Bastani: It’s often the case that the data we actually want, what I would call the true outcomes, are available only in a very limited quantity. What we have instead is a large amount of data on closely related outcomes, which is what we call proxy outcomes.
In the e-commerce example again, a company like Amazon typically will have very little data on customer purchases for a particular item, but they’ll have lots and lots of click data. If you think about it, clicks are a pretty good proxy for purchases because a customer will typically not click on a product unless they have some intent of purchasing it. Of course, these two outcomes are not exactly the same. What I’ve found in some of my research looking at, for instance, personalized hotel recommendation data for Expedia is that these outcomes can be different along a few dimensions.
The big one that I saw was price. It turns out that customers are more than happy to click on expensive hotel recommendations, but they tend to shy away from booking these. So, if companies just use clicks, which is what they typically do to make recommendations for their customers, they’ll end up recommending overly expensive hotels and miss out on an opportunity to get customers to purchase one.
The tension is that, because there’s so much data on proxies, it’s often more effective to train your predictive models on that data. There’s just so much more of it, and you get more accurate models that way. But you’re basically using the wrong big data instead of the right small data, and because of these sources of bias, you might end up making suboptimal decisions.
Knowledge at Wharton: What’s a firm to do when a proxy is not necessarily getting the outcome they want, but neither would the actual data?
Bastani: What we’re doing in this research is proposing a novel estimator that combines a large number of proxy outcomes, which is what firms typically have, with a small number of true outcomes, which they also usually have. It combines them efficiently using high-dimensional statistics to get the best of both worlds: it de-biases the proxy estimator by identifying the key sources of difference between the proxy and the true outcomes, while still preserving the large-sample properties you get from the proxy outcomes. We’re able to prove that this approach gives you much better predictive accuracy, and we’ve also tested it on several data sets from e-commerce and health care, where it does seem to be quite effective.
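To make the idea concrete, here is a minimal sketch of how such a two-step combination might look in code, assuming linear models and a sparse difference between the proxy and true coefficients. The function name, the Lasso correction step, and the specific choices below are illustrative assumptions for exposition, not the paper’s exact estimator.

```python
# Minimal sketch: fit on the large proxy sample, then learn a sparse
# correction from the small true sample. Illustrative only -- not the
# paper's exact estimator.
from sklearn.linear_model import LinearRegression, Lasso

def proxy_debiased_fit(X_proxy, y_proxy, X_true, y_true, alpha=0.1):
    # Step 1: train on the abundant proxy outcomes (e.g., clicks).
    proxy_model = LinearRegression().fit(X_proxy, y_proxy)

    # Step 2: on the scarce true outcomes (e.g., purchases), regress the
    # residual left over by the proxy model with an L1 penalty, so only
    # the few features where the proxy and true outcomes really differ
    # (such as price) receive a correction.
    residual = y_true - proxy_model.predict(X_true)
    correction = Lasso(alpha=alpha).fit(X_true, residual)

    # Combined model: proxy estimate plus sparse correction.
    coef = proxy_model.coef_ + correction.coef_
    intercept = proxy_model.intercept_ + correction.intercept_
    return coef, intercept
```

The intuition mirrors what Bastani describes: the large proxy sample pins down most of the model, and the correction step only has to find the handful of features where the two outcomes diverge, which is feasible even with very little true-outcome data.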
Knowledge at Wharton: You have an interesting example in the paper about how this applied in the health care setting. Could you talk about that?
Bastani: Many hospitals use patient-specific data to try to understand which patients are at high risk for a particular adverse event, and they use this to target interventions. The example I looked at was diabetes. If you target interventions to patients who are at high risk for diabetes, then you can hope to stop the progression. When you build these kinds of estimators, a hospital has a conundrum: whether to use its own patient population to build a new estimator from scratch, or to use an existing predictive model that’s been trained on a larger conglomerate of hospitals. The latter gives you a much larger sample but is effectively a proxy cohort, whereas your own patient population is a much smaller sample but is the true cohort.
What I found when using real electronic medical record data is that you often are better off using the larger sample in terms of pure accuracy, but there are important biases to account for. For example, one issue is there is a particular diagnosis called impaired fasting glucose, which is very, very predictive of diabetes within the small cohort I was looking at. But in the larger cohort, the physicians tended not to measure this diagnosis because it does involve the patient fasting, so they didn’t think that burden was worth it. So, this feature is very predictive for the small cohort, but is not predictive for the large cohort.
Differences in physician behavior, patient behavior, and the way the data are recorded in the electronic medical record can all affect the predictive model that you end up building, and it’s important to account for these biases when you’re transferring knowledge from a proxy setting to your setting of interest. Again, what our algorithm does in this setting is identify these sources of bias, even using the very little patient data you have from the small cohort, and then it lets you transfer most of the knowledge you got from the other hospitals. So, we’re able to build something that’s much more accurate.
Knowledge at Wharton: How easy would it be for a company to apply this model in a real-life setting?
Bastani: I hope it’s very easy. One of the things that we’re planning to do with this research is to open-source this estimator, so companies that have access to both their proxy and true outcomes in a useful format would be able to read in their data and output better predictions. We also are planning to include several baselines, so they can compare and pick whatever is best for their setting.
Knowledge at Wharton: Is it harder in certain situations for a company to tease out their proxy and actual outcomes?
Bastani: Yes. In some firms I have talked to, it gets pretty complicated because what we have essentially assumed is that the features used for both of these outcomes are the same. But in some settings, you may find that the proxy cohort records a different set of features than the true cohort. In that case, there may need to be some engineering work to put these features into the same space, which right now is typically done by data scientists in most companies. Or there’s a modification of our algorithm that uses our estimator on the overlapping features and leaves the other features in the same predictive modeling framework that you would normally use.
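As a rough illustration of that feature-alignment issue, the sketch below restricts the transfer step to the columns both cohorts record, assuming the data arrive as pandas DataFrames; the helper name and the split are assumptions for illustration, and `proxy_debiased_fit` refers back to the earlier illustrative sketch rather than released code.

```python
# Minimal sketch: align the two cohorts on their shared features before
# the transfer step; columns unique to the true cohort can stay in the
# firm's usual modeling pipeline. Illustrative only.
import pandas as pd

def split_shared_features(df_proxy: pd.DataFrame, df_true: pd.DataFrame):
    shared = sorted(set(df_proxy.columns) & set(df_true.columns))
    true_only = sorted(set(df_true.columns) - set(df_proxy.columns))
    return df_proxy[shared], df_true[shared], df_true[true_only]
```

The transfer estimator would then run on the shared columns, while the columns observed only in the true cohort feed whatever predictive model the firm already uses.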
Knowledge at Wharton: Firms have more access to data today than they ever did. It seems like it’s harder than ever to figure out which data are relevant — what is going to tell you what you want to know about your customers. How much does this research speak to this idea of firms needing to get their hands around exactly which data are important and which data are not?
Bastani: That’s a great point. Another issue that comes up is that often there are multiple kinds of proxies that managers think might be relevant, but they’re not really sure. One of the extensions we consider is how to incorporate all of them into the estimator: it uses out-of-sample accuracy on a held-out test set to determine which proxy is most appropriate and then combines that proxy with the true outcome data. But in general, when you have multiple sources of data and you’re not sure which ones are best to use, you typically end up looking at out-of-sample accuracy to make a judgment call.
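A minimal sketch of that selection step follows, reusing the illustrative `proxy_debiased_fit` helper from earlier; the hold-out split and mean-squared-error criterion are assumptions chosen for illustration rather than details from the paper.

```python
# Minimal sketch: pick the proxy whose combined estimator predicts the
# held-out true outcomes best. Illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split

def select_best_proxy(X_true, y_true, proxy_datasets):
    # proxy_datasets: dict mapping a proxy's name to (X_proxy, y_proxy).
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X_true, y_true, test_size=0.3, random_state=0
    )
    best_name, best_mse = None, np.inf
    for name, (X_p, y_p) in proxy_datasets.items():
        coef, intercept = proxy_debiased_fit(X_p, y_p, X_fit, y_fit)
        mse = np.mean((y_hold - (X_hold @ coef + intercept)) ** 2)
        if mse < best_mse:
            best_name, best_mse = name, mse
    return best_name
```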
Knowledge at Wharton: It also seems like companies need to have this discussion about which data are important before they go looking at all of it. I could see a situation where what they think is important is not what the data are showing should be important.
Bastani: That is definitely true. We’ve seen that in several instances, where a hospital will think that a particular patient metric is important to focus on in terms of care quality, and then it ends up not being relevant to patient mortality because of various biases that they maybe didn’t know about before. That’s a constant conversation that they should be having in the background, and hopefully they update their knowledge of what this might be based on the trials that they’re running over time.