Report cards are the norm for assessing the performance of schoolchildren, but they’ve also become a popular way to rate organizations in different industries. In the health care sector, a common complaint about such report cards is that they do not take into account the unique characteristics of the patients at a particular hospital. 

In a new paper, a group of researchers from Wharton, the University of Pennsylvania’s Perelman School of Medicine and the Center for Outcomes Research at the Children’s Hospital of Philadelphia detail a way to “match” patients at a group of hospitals to a template that includes a number of different characteristics, such as age, gender and medical history. Using this sampling of highly similar patients, they then compared the outcomes of care at 217 hospitals and found substantial differences in how the patients fared. The research was posted by the journal Health Services Research in March.

The paper, “Template Matching for Auditing Hospital Cost and Quality,” was co-authored by Wharton health care management professor Jeffrey H. Silber; Wharton statistics professor Paul R. Rosenbaum; Richard N. Ross, Justin M. Ludwig, Wei Wang, Bijan A. Niknam, Nabanita Mukherjee, Philip A. Saynisch, and Orit Even-Shoshan — all with the Center for Outcomes Research at the Children’s Hospital of Philadelphia — and Rachel R. Kelz and Lee A. Fleisher, both at Penn’s Perelman School of Medicine. Silber, Rosenbaum, Even-Shoshan and Fleisher are also affiliated with Penn’s Leonard Davis Institute of Health Economics.

In this interview with Knowledge at Wharton, co-author Silber, who also directs the Center for Outcomes Research and is a professor of pediatrics and anesthesiology and critical care at the Perelman School of Medicine, describes the technique and discusses why it represents a more accurate way of detecting poor performance by health care providers — and how the research could apply to other areas as well, such as rating school performance. 

An edited version of the transcript appears below.

Knowledge at Wharton: What was the key issue you were trying to address through this research?

Jeffrey Silber: Our research really revolves around a problem that chief medical officers [of health care organizations] have. They open up the morning paper and find that their hospitals have been ranked very low in some quality of care report. Their initial reaction is always, “Our patients are really sicker [than those at the other hospitals studied], and that’s why we look so bad in the reports.” 

This research uses template matching to improve the reports so that there is a fair comparison across all hospitals. We really tried to look at the problem in a different way using multivariate matching, which has never been done before in looking at quality, to be able to fairly compare hospitals. And then the chief medical officer wouldn’t be as upset because he would probably [accept] that there was a problem or he would look good in the reports.

Knowledge at Wharton: Can you elaborate more on the process you used? 

Silber: Template matching is a new way of looking at the way hospitals treat patients. The usual problem when we compare hospitals is that we say, “Here are all the patients that you saw.” And then, “If your patients were going to be going to another hospital or the typical hospital, how would they do?” That’s a strange way to compare hospitals because each hospital has a different set of patients. And so, it’s really not a fair way to compare hospitals because one hospital might have an easier set of patients [and] another hospital might have a much different, more difficult set of patients. 

What we decided to do was create a template of patients, meaning a set of patients. In our case, it was 300 patients being treated by general surgery and orthopedic surgery. We said, “Let’s take a relevant template and then match, at each hospital, patients who would fit the template.” We ended up with 300 patients being matched at 217 hospitals. Now, we have [a set of] very similar patients at each of the hospitals because they’ve all been matched to this cookie-cutter template, which is producing very similar patients across hospitals. So, now we can make a very fair comparison because we’re looking at the same, in a sense, 300 patients at each and every hospital, which is completely different from the way almost all report cards are made today. [Report cards usually] look at the patients at an individual hospital, then they try to estimate how they would have done at other hospitals….

We’re completely changing things around. The technique is [a form of] “direct standardization.” We’re saying you have to have a fair exam [to accurately grade the quality of care]. A fair exam is [asking,] “How did the hospitals do at treating these 300 patients?” We find the hospital’s 300 patients that look like the template, and that’s what we compare across hospitals. So, now it’s very hard for a chief medical officer to say, “Our patients were really sicker than those at another hospital.” It’s very hard to do that because the patients examined at their hospital are [very similar to the] 300 that are at each and every other hospital…. We have multivariate matched them, which means that we’ve matched them on literally hundreds of characteristics so that they’re very, very similar. Their risk of doing poorly or their chance of doing well is very, very similar across all the hospitals.

What we show in the paper is that these 300 patients at each and every hospital are incredibly similar. Their age is the same. Their rate of diabetes and heart failure and all the characteristics we’d be interested in are incredibly similar, statistically indistinguishable from those at the other hospitals. Yet what we find is that the outcomes are very, very different.
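To make the idea concrete, here is a minimal sketch of matching hospital patients to a shared template. It is not the paper's method: it swaps in a simple greedy nearest-neighbor match on three covariates, where the study used more sophisticated multivariate matching on hundreds of characteristics. The function names, the covariates (age, diabetes flag, heart-failure flag) and all patient data below are hypothetical.

```python
import statistics


def match_to_template(template, hospital_patients):
    """Greedy nearest-neighbor matching without replacement.

    For each template patient, pick the closest not-yet-used patient at the
    hospital. Returns one index into hospital_patients per template patient.
    This is a simplified stand-in, not the matching used in the paper.
    """
    if len(hospital_patients) < len(template):
        raise ValueError("hospital has too few patients to fill the template")

    # Scale each covariate by the template's mean and standard deviation
    # so that no single covariate (e.g. age in years) dominates the distance.
    cols = list(zip(*template))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]

    def scale(row):
        return [(x - m) / s for x, m, s in zip(row, means, sds)]

    scaled_hospital = [scale(p) for p in hospital_patients]
    unused = set(range(len(hospital_patients)))
    matches = []
    for t in template:
        ts = scale(t)

        def distance(j):
            return sum((a - b) ** 2 for a, b in zip(ts, scaled_hospital[j]))

        # Ties are broken by the lower index so the result is deterministic.
        best = min(unused, key=lambda j: (distance(j), j))
        unused.remove(best)
        matches.append(best)
    return matches


def outcome_rate(outcomes, matches):
    """Share of matched patients with the adverse outcome (1 = event)."""
    return sum(outcomes[j] for j in matches) / len(matches)


# Hypothetical data: (age, has_diabetes, has_heart_failure).
template = [(60, 1, 0), (75, 0, 1), (50, 0, 0)]

hospital_a = [(61, 1, 0), (30, 0, 0), (74, 0, 1), (52, 0, 0), (80, 1, 1)]
outcomes_a = [1, 0, 0, 0, 1]

hospital_b = [(45, 0, 0), (59, 1, 0), (76, 0, 1), (49, 0, 0), (90, 1, 1)]
outcomes_b = [0, 0, 0, 0, 1]

matches_a = match_to_template(template, hospital_a)
matches_b = match_to_template(template, hospital_b)
rate_a = outcome_rate(outcomes_a, matches_a)
rate_b = outcome_rate(outcomes_b, matches_b)
print("Hospital A matched outcome rate:", rate_a)
print("Hospital B matched outcome rate:", rate_b)
```

Because both hospitals are graded on patients who resemble the same template, a gap between the two rates is harder to attribute to case mix. A greedy match is easy to read but can produce worse overall matches than an optimal one, which is one reason this should be treated only as an illustration.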

Knowledge at Wharton: What are the key takeaways of this research?

Silber: I think that there are two key takeaways. The first is that we … showed that you can get very close matches across hospitals. Each hospital was stamped to this template of patients. The second is that there is great variability in the way the hospitals handled the patients and the outcomes of the patients. I think seeing that the matches were so close and, at the same time, seeing that the outcomes were so different was something that I think people should realize: that there are differences in quality across hospitals and that there are, I think, better ways to measure them than what we’ve been doing up to now.

Knowledge at Wharton: Were there any findings that surprised you? 

Silber: Probably the biggest surprise was that after we did this very close matching, such that all the hospitals had an incredibly similar sample of the 300 patients, there was great variation in outcomes. Some hospitals had high complication rates. Some had low complication rates. Some had high mortality rates. Some had low mortality rates — things that when I would see this in a standard report, I might not believe because I knew the hospitals were seeing different patients. But here, after having matched so closely, to see this great variation in outcomes was an eye opener.

Knowledge at Wharton: How could this research be applied beyond the grading of hospitals or beyond the health care sector? 

Silber: We’ve given that some thought. We’ve usually been applying this technique to the hospital sector. But I think you could easily apply this to nursing home quality, schools — we could imagine creating a template of students and stamping the students out across different schools, seeing how the student outcomes were. So, yes, you could use this technique in other areas.

I think the key here is that we’re taking from one field — which is from statistics and, in particular, multivariate matching, for which [Wharton statistics professor] Paul Rosenbaum is the world’s expert — and we’re applying it to quality assessment. I just don’t think that’s been done. And it certainly hasn’t been done with a template, which kind of levels the playing field and lets you really see if your quality is different than others.

Knowledge at Wharton: What misconceptions does this research help to dispel? 

Silber: People tend to believe report cards based on what’s called “indirect standardization,” where we’re looking at different patients at different hospitals — different types of patients, even non-overlapping populations — and saying that we can extrapolate from the models to say that one hospital’s quality is better than another. I think that’s a misconception. I don’t think that the models currently allow us to do that. 

If we had better techniques, and I think this is one of them, we would be able to at least provide confidence to the reader — both the user in terms of the patient and also the people at hospitals who are trying to improve safety and quality — that there really are differences between the care at different hospitals… I think that’s probably the main point.
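The contrast between indirect and direct standardization can be illustrated with a small worked calculation. This is a sketch using the textbook definitions of the two approaches, not a reconstruction of any particular report card's model. The two risk strata, the reference rates and both hospitals' numbers are invented for illustration.

```python
def indirect_oe_ratio(own_counts, own_rates, reference_rates):
    """Observed/expected ratio, weighted by the hospital's OWN case mix.

    observed = events among the patients the hospital actually treated
    expected = events those same patients would have at the reference rates
    """
    observed = sum(n * r for n, r in zip(own_counts, own_rates))
    expected = sum(n * r for n, r in zip(own_counts, reference_rates))
    return observed / expected


def direct_rate(template_counts, own_rates):
    """Event rate the hospital would show on a COMMON template case mix."""
    total = sum(template_counts)
    return sum(n * r for n, r in zip(template_counts, own_rates)) / total


# Hypothetical strata: [low-risk, high-risk] patients.
reference_rates = [0.02, 0.10]   # assumed "typical hospital" event rates
template_counts = [150, 150]     # assumed shared template of 300 patients

# Hospital A treats mostly low-risk patients, Hospital B mostly high-risk.
counts_a, rates_a = [900, 100], [0.01, 0.15]
counts_b, rates_b = [100, 900], [0.03, 0.09]

oe_a = indirect_oe_ratio(counts_a, rates_a, reference_rates)
oe_b = indirect_oe_ratio(counts_b, rates_b, reference_rates)
direct_a = direct_rate(template_counts, rates_a)
direct_b = direct_rate(template_counts, rates_b)

print("Indirect O/E  A:", round(oe_a, 3), " B:", round(oe_b, 3))
print("Direct rate   A:", round(direct_a, 3), " B:", round(direct_b, 3))
```

With these made-up numbers, Hospital A has the lower observed/expected ratio (about 0.86 versus about 0.91), yet on the shared template Hospital B has the lower event rate (0.06 versus 0.08). Each indirect ratio is weighted toward the patients that hospital happened to see, so the two ratios are not answers to the same question.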

Knowledge at Wharton: Are there any policy or procedural changes you think should be implemented as a result of this research? 

Silber: There are many groups that are grading hospitals. The federal government has a service — it’s called “Hospital Compare” — which is a website that lets you click on a hospital and compare it to another hospital…. There are many different organizations and web-based companies that grade hospitals. They often look to Hospital Compare, which is the main model. For example, Consumer Reports looks to Hospital Compare. 

Hospital Compare suffers from these same problems. It suffers from using indirect standardization, in the sense that it’s taking the patients each hospital saw and making an extrapolation about how they would have done at another hospital. I think that if that could be changed, that would be a great benefit to the application of looking at quality.

Knowledge at Wharton: Can hospitals use this framework themselves to perform a self-assessment? 

Silber: Absolutely. Often hospitals have access to national data sets or comparative data sets. So, hospitals could, in a sense, form their own templates and then see how their patients would do at other hospitals using the same technique. But they would have to understand the details of multivariate matching. We bring to the table a way to do this in a very sophisticated way through the work of statisticians at Wharton, in particular Paul Rosenbaum. But if they could do that matching, then they could better compare their results to other hospitals and know whether they’re really doing a good job or not.

Knowledge at Wharton: What are some other avenues for furthering this research? 

Silber: I think a second area that we actually just talked about was this idea of making your own boutique template. So, a chief medical officer could, for example, say, “Let’s construct a template on these difficult patients that we see. And let’s see how other hospitals would have handled these same difficult patients.” I think that a boutique template is the next step and that’s where we’re moving. Then there are other applications of multivariate matching to look at quality. 

Suppose you have a very small hospital, and you don’t match the template very often because you’re so small that you don’t have the patients to be able to match to this overall national template. Well, we can turn things around and we can say, “OK, let’s look at the small hospital’s patients and then match them to the whole country.” So, use the whole country as a kind of frame of reference. And then we have a very good idea looking at the outcomes of the small hospital’s patients and their matched sample how the small hospital did relative to the typical patient in the country. That’s another application with multivariate matching to help improve quality.
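The reversed direction described here, starting from the small hospital's patients and finding look-alikes in a national pool, could be sketched roughly as follows. Everything in it is an assumption made for illustration: the function name, the choice of two controls per patient, the plain squared-distance metric and the data. The covariates are assumed to be already on comparable scales (age in decades, a diabetes flag).

```python
def match_controls(hospital_patients, national_pool, k):
    """For each hospital patient, pick the k closest unused national patients.

    Returns a list of index lists into national_pool, one per hospital
    patient. Matching is greedy and without replacement; this is a
    simplified stand-in for multivariate matching.
    """
    used = set()
    matched = []
    for p in hospital_patients:
        candidates = [j for j in range(len(national_pool)) if j not in used]
        candidates.sort(
            key=lambda j: (
                sum((a - b) ** 2 for a, b in zip(p, national_pool[j])),
                j,
            )
        )
        chosen = candidates[:k]
        used.update(chosen)
        matched.append(chosen)
    return matched


# Hypothetical data: (age in decades, has_diabetes), outcome 1 = event.
small_hospital = [(6.5, 1), (7.2, 0)]
small_outcomes = [1, 0]

national = [(6.4, 1), (3.0, 0), (6.6, 1), (7.0, 0), (7.3, 0), (5.0, 1)]
national_outcomes = [0, 0, 1, 0, 0, 1]

controls = match_controls(small_hospital, national, k=2)
control_ids = [j for group in controls for j in group]

hospital_rate = sum(small_outcomes) / len(small_outcomes)
control_rate = sum(national_outcomes[j] for j in control_ids) / len(control_ids)
print("Small hospital rate:", hospital_rate)
print("Matched national rate:", control_rate)
```

The comparison is then between the small hospital's own outcomes and those of similar patients treated elsewhere in the country. With only a handful of patients, any such difference would be very noisy, so a real analysis would need appropriate statistical testing.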