
Health Services and Outcomes Research

Comparison of “Risk-Adjusted” Hospital Outcomes

David M. Shahian, MD; Sharon-Lise T. Normand, PhD

Background —A frequent challenge in outcomes research is the comparison of rates from different populations. One common example with substantial health policy implications involves the determination and comparison of hospital outcomes. The concept of “risk-adjusted” outcomes is frequently misunderstood, particularly when it is used to justify the direct comparison of performance at 2 specific institutions.

Methods and Results —Data from 14 Massachusetts hospitals were analyzed for 4393 adults undergoing isolated coronary artery bypass graft surgery in 2003. Mortality estimates were adjusted using clinical data prospectively collected by hospital personnel and submitted to a data coordinating center designated by the state. The primary outcome was hospital-specific, risk-standardized, 30-day all-cause mortality after surgery. Propensity scores were used to assess the comparability of case mix (covariate balance) for each Massachusetts hospital relative to the pool of patients undergoing coronary artery bypass grafting surgery at the remaining hospitals and for selected pairwise comparisons. Using hierarchical logistic regression, we indirectly standardized the mortality rate of each hospital using its expected rate. Predictive cross-validation was used to avoid underidentification of true outlying hospitals. Overall, there was sufficient overlap between the case mix of each hospital and that of all other Massachusetts hospitals to justify comparison of individual hospital performance with that of the remaining hospitals. As expected, some pairwise hospital comparisons indicated lack of comparability. This finding illustrates the fallacy of assuming that risk adjustment per se is sufficient to permit direct side-by-side comparison of healthcare providers. In some instances, such analyses may be facilitated by the use of propensity scores to improve covariate balance between institutions and to justify such comparisons.

Conclusions —Risk-adjusted outcomes, commonly the focus of public report cards, have a specific interpretation. Using indirect standardization, these outcomes reflect a provider’s performance for its specific case mix relative to the expected performance of an average provider for that same case mix. Unless study design or post hoc adjustments have resulted in reasonable overlap of case-mix distributions, such risk-adjusted outcomes should not be used to directly compare one institution with another. (Circulation. 2008;117:1955-1963.)

Key Words: health care quality assessment ■ outcomes research ■ risk ■ statistics

Outcomes research “seeks to understand the end results of particular health care practices and interventions.”1 This may involve investigation of a new drug or procedure compared with standard therapy through the use of either a randomized trial or an observational study. Because of the current health policy emphasis on measuring and improving provider performance,2,3 interest has also been increasing in another type of outcomes research referred to as provider profiling.4,5 This research focuses on the collection and analysis of outcomes data to evaluate the performance of a physician or a hospital.


Provider profiling has a number of features that distinguish it from other types of outcomes research. First, unlike trials of new medications or treatment regimens, randomization of patients to hospitals or physicians would often be both impractical and unethical. Thus, profiling studies are almost always observational in nature, relying on data from usual practice settings. In further contrast to drug trials that involve direct comparisons of outcomes for only a few treatments, profiling studies typically assess outcomes for many providers, usually with regard to some population reference standard. Finally, when profiling is based on outcomes measures such as mortality or morbidity, risk adjustment is necessary to account for preexisting conditions that may confound their assessment.

Despite their increasingly widespread use, considerable confusion exists among consumers, the media, payers, and providers as to the correct meaning and interpretation of risk-adjusted outcomes. For example, many incorrectly interpret such outcomes as having “leveled the playing field” to permit direct comparison of one provider with another. Direct comparability may sometimes be justified in an observational study, but this would be fortuitous and is not an inherent characteristic of the study design.

Received November 9, 2007; accepted February 13, 2008.

From the Center for Quality and Safety, Department of Surgery, and Institute for Health Policy, Massachusetts General Hospital, and Harvard Medical School (D.M.S.), and Department of Health Care Policy, Harvard Medical School, and the Department of Biostatistics, Harvard School of Public Health (S.T.N.), Boston, Mass.

Guest Editor for this article was Harlan M. Krumholz, MD, SM.

Correspondence to Sharon-Lise T. Normand, Department of Health Care Policy, Harvard Medical School, 180 Longwood Ave, Boston, MA 02115. E-mail [email protected]

© 2008 American Heart Association, Inc.

Circulation is available at http://circ.ahajournals.org DOI: 10.1161/CIRCULATIONAHA.107.747873

Circulation, April 15, 2008. Downloaded from http://circ.ahajournals.org/ by guest on February 20, 2017.

Correct interpretation of the concept of risk-adjusted outcomes is neither a trivial nor a strictly academic concern. Such outcomes are used to designate centers of excellence, to determine reimbursement levels in pay for performance programs, to rank institutions, and to classify providers as “outliers.” These determinations may have profound effects on patient access, hospital reputation, referrals, and financial survival.

The goal of this article is to systematically review the fundamental concepts from which the deceptively simple term “risk-adjusted outcome” is derived. We develop the concept of risk-adjusted outcomes in the context of causal inference theory and illustrate the derivation of indirectly standardized mortality ratios, often referred to as O/E (observed/expected) ratios. Key methodological concepts (eg, outlier determination and direct comparison of hospitals) are illustrated through the example of coronary artery bypass grafting surgery (CABG) mortality profiling, in which the difference in outcomes of a hospital compared with the reference standard is generally regarded as a reflection of quality of care.5

Methods

Background

It is useful to consider risk adjustment and standardization as specific applications of causal inference theory, a broad discipline with historical roots in philosophy, mathematical logic, and statistics.6–19 This is the foundation for understanding causal effects in health care,16,20–24 which can be thought of as the difference between the outcome for a patient when exposed to one treatment (or provider) and the outcome when exposed to another.

A fundamental precept of causality is that only one of a series of potential outcomes can be experienced at any one time.7,17,20,23,24 In CABG hospital profiling, a patient can undergo CABG at only one hospital on a given day. Therefore, some method must be used to estimate what would hypothetically have occurred to that patient had he or she undergone surgery at a different hospital. The observed result is referred to as the actual outcome, and the unobservable estimated outcome is the counterfactual.7,17,20,23,24 Estimation of this counterfactual outcome, the hypothetical result if treated under a different set of circumstances, is the primary motivator for risk model development. Several approaches have been developed to estimate these potential outcomes for individual patients and subsequently to assess the overall performance of a hospital.

Estimation of Counterfactuals for Risk Adjustment and Standardization

The simplest estimator of a counterfactual would be the average result of treating a similar condition (eg, a CABG procedure) in the overall population or at another specific institution. However, this estimator is likely to be both inaccurate and misleading. Patients are nonrandomly allocated among institutions, and use of crude mortality rates from other hospitals as the counterfactual outcomes would ignore systematic differences among patients such as acuity status. At the other end of the spectrum, the counterfactual outcomes could be determined through randomization,15,18,19,25 the most internally valid design. Both measured and unmeasured confounders would be balanced, so the mortality experience of patients undergoing CABG at one hospital could serve as the counterfactual outcome for patients treated at another hospital. However, it is implausible to think that most patients would consent to randomization for anything but truly experimental care; for this reason, almost all profiling studies are conducted with observational data. Matching and stratification are other methods sometimes used to derive counterfactuals, but they quickly become impractical when more than a few predictor variables are considered, the typical case in mortality profiling.

Most profiling studies have relied on regression modeling to derive counterfactual outcomes, and it is the method used here. Risk adjustment, the term commonly used for this approach, refers to the results of statistical regression models that relate the outcome for a specific patient to his or her observed characteristics.4,26–29 Then, because the main focus of profiling is to determine how the overall experience of a particular hospital compares to what would be “expected,” the next step is to standardize the results of an institution to the reference population.

Indirect standardization is used for almost all profiling and public report cards. With this method, the expected rate represents what the mortality rate would have been at a hospital given its actual distribution of patients but replacing its observed mortality rates with rates estimated from the entire group of providers. The indirectly standardized mortality ratio, often referred to as the ratio of observed to expected outcomes (O/E ratio), compares the outcomes for the specific distribution of patients at a hospital with their expected results had they been treated by an average provider in the reference population.

Indirect standardization is accomplished by first computing the risk probability for each patient within a given hospital, using the coefficients estimated from the regression model and the patient’s specific distribution of confounders, and then summing these probabilities. This yields the expected total number of deaths for that hospital. This counterfactual hospital mortality often is used as the denominator of the ratio of observed to expected mortality (O/E ratio), a form of causal estimand. This O/E ratio is favorable if <1 and unfavorable if >1. As a final step, the O/E ratio may be multiplied by the unadjusted population mortality rate for the procedure to obtain what is often called the risk-adjusted mortality rate but which is more correctly designated the risk-standardized mortality rate (RSMR) or standardized mortality incidence rate (SMIR).30–34
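The arithmetic of indirect standardization can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the hospital, and its patient-level risks are hypothetical, and the predicted risks are assumed to have been produced already by a fitted risk model. The population rate of 2.25% is the crude state rate reported later in this article.

```python
def indirect_standardization(predicted_risks, observed_deaths, population_rate):
    """Return (expected deaths, O/E ratio, risk-standardized mortality rate)."""
    expected = sum(predicted_risks)         # E: sum of patient-level risk probabilities
    oe_ratio = observed_deaths / expected   # <1 is favorable, >1 is unfavorable
    rsmr = oe_ratio * population_rate       # rescale by the unadjusted population rate
    return expected, oe_ratio, rsmr


# Hypothetical hospital with 200 patients in three illustrative risk groups.
risks = [0.01] * 150 + [0.05] * 40 + [0.20] * 10
expected, oe, rsmr = indirect_standardization(
    risks, observed_deaths=4, population_rate=0.0225
)
print(f"expected={expected:.2f}  O/E={oe:.2f}  RSMR={rsmr:.2%}")
```

In this made-up case the hospital has about 5.5 expected deaths and 4 observed deaths, so its O/E ratio is below 1 and its RSMR falls below the population rate. The result describes performance for this hospital's own case mix only.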

Outlier Determination and the Direct Comparison of Hospitals
Outliers

The main goal of outcomes profiling is to identify differences in hospital quality. Because the risk-standardized rates for each hospital are derived from the reference population, it is most appropriate to determine whether these rates are statistically different from the population average. If so, the hospital is regarded as a statistical outlier. Most commonly, this is achieved by determining whether the 95% interval for a hospital’s risk-standardized mortality estimate includes the overall state average mortality (or, alternatively, whether the interval around its O/E ratio includes 1). If the interval excludes the reference value, the hospital typically is classified as an outlier. An important but overlooked aspect of outlier determination is the effect on expected outcomes when true outlying programs are included in the development of the statistical model. This problem and a potential solution (cross-validated P values) are described further in the Illustration.
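The interval rule described above reduces to a simple containment check. The sketch below is illustrative: the function name is hypothetical, the intervals for hospitals D and F are the RSMR limits reported in Table 4, and hospital "X" is an invented example added only to show a case in which the interval excludes the state rate.

```python
def is_statistical_outlier(lower, upper, reference):
    """True when the 95% interval (lower, upper) excludes the reference value."""
    return not (lower <= reference <= upper)


STATE_RATE = 2.25  # crude state mortality rate, %

intervals = {
    "D": (2.07, 5.68),  # RSMR limits from Table 4
    "F": (0.89, 3.09),  # RSMR limits from Table 4
    "X": (2.60, 4.10),  # hypothetical hospital, for contrast only
}

flags = {
    hospital: is_statistical_outlier(low, high, STATE_RATE)
    for hospital, (low, high) in intervals.items()
}
print(flags)
```

The same function applies to O/E ratios by passing a reference value of 1.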

Risk Factor Distribution and Direct Comparability

In addition to comparing individual hospitals with the reference population to determine outlier status, some consumers also seek to directly compare individual hospitals with one another. A problem with direct comparisons that has been widely recognized by statisticians, and that was the motivation for the development of balancing methods such as propensity scores,14–16,18,19,35–41 is that of covariate imbalance. Absent randomization, the patient cohorts from 2 hospitals may be unbalanced with regard to the frequency of confounders. The implications of such imbalance have received little attention in the context of risk-adjusted outcomes profiling, which in turn has led to both misunderstanding and misuse.

In general, only the results for those patients with comparable risk profiles (eg, that overlap the risk distributions of the 2 providers) should be directly compared. Consider the extreme but not uncommon example of a state or region with many small community hospitals and 1 or 2 tertiary/quaternary hospitals. As a general principle, direct comparison of a community to a tertiary hospital would be appropriate only for the relatively small proportion of patients who overlap between the 2 hospitals. Although the results for the overlap group can be used to estimate expected outcomes for patients not in common between the 2 institutions, this form of extrapolation depends heavily on assumptions that are typically unverifiable. For example, the indirectly risk-standardized results at a community hospital apply to its specific type of patients, who might be relatively low risk compared with a tertiary center. It cannot be assumed that a favorable risk-standardized mortality at the community hospital, based on its lower risk case mix, could necessarily be achieved if it were confronted with the higher-risk case mix of the tertiary center, including some types of patients that it rarely, if ever, encounters.

Propensity scores are a useful method to construct treatment and control groups that may differ in number of subjects but are similar to randomized studies in their balanced distribution of all measured confounders.14–16,18,19,35–41 The propensity score is the likelihood of receiving treatment of one type compared with another (or in the case of profiling, exposure to one or another specific provider) on the basis of a patient’s set of observed characteristics. It provides a convenient scalar (1-number) summary of the information contained in all the patient’s measured covariates. The propensity score may then be used for matching, stratification, blocking, or weighting in regression modeling.

The problem of covariate imbalance has received little attention in provider profiling studies.42–45 If the propensity score provides a convenient summary estimate of individual patient risk, then each provider will have a specific distribution of propensity scores that characterizes its “case mix.” For 2 providers to be comparable, the area of overlap in their respective propensity score distributions should be identified. As shown in Figure 1A, 2 hypothetical hospitals (hospitals 1 and 2) might by chance (or as a result of randomization) have substantial overlap in their propensity score distributions. The area of shaded overlap in Figure 1A indicates that a majority of patients treated at hospital 2 have a similar propensity to have been treated at hospital 1. For almost every patient who underwent CABG at hospital 1, we can find a “similar” patient from among those having CABG at hospital 2.

Figure 1B depicts a different set of 2 hospitals with significant imbalance in their average patient risk as measured by their propensity score distributions. Only a small percentage of patients at the 2 institutions have comparable risk profiles. It is only the group of patients who overlap from which relative performance inferences should be drawn.
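One crude way to quantify the overlap shown in Figure 1 is to count how many of one hospital's patients have a propensity log-odds outside the range observed at the other hospital. The sketch below makes that assumption explicit: it uses the observed minimum and maximum as the region of common support, which is a simplification of inspecting the estimated densities, and all scores are invented for illustration.

```python
def nonoverlap_fraction(scores_a, scores_b):
    """Fraction of hospital A patients whose score lies outside hospital B's observed range."""
    low, high = min(scores_b), max(scores_b)
    outside = [s for s in scores_a if s < low or s > high]
    return len(outside) / len(scores_a)


# Hypothetical log-odds of surgery at hospital 1 (not data from this study).
hospital_1 = [-1.0, 0.5, 2.0, 4.0, 6.5, 8.0, 9.0, 12.0]
hospital_2 = [-3.0, -1.5, 0.0, 1.0, 2.5, 5.0]

print(nonoverlap_fraction(hospital_1, hospital_2))  # share of hospital 1 with no match
print(nonoverlap_fraction(hospital_2, hospital_1))  # share of hospital 2 with no match
```

A range-based check is sensitive to single extreme patients, so in practice it is best read alongside a plot of the two distributions rather than in place of one.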

Illustration
Study Population

We examined data from all adults (>18 years of age) undergoing isolated CABG at all acute-care, nonfederal hospitals in Massachusetts between January 1, 2003, and December 31, 2003. Data collection is mandated by the Massachusetts Department of Public Health.

Data Sources

We used clinical data submitted to a data coordinating center (Mass-DAC) located in the Harvard Medical School Department of Health Care Policy. Data are collected by trained hospital personnel using the Society of Thoracic Surgeons National Adult Cardiac Database instrument.46 Supplemental patient and surgeon identifying information also is collected using additional data forms developed by Mass-DAC. The data are sent electronically to Mass-DAC, where they are cleaned, audited, and verified using internal and external procedures.

End Points

The primary end point is hospital-specific, risk-standardized, all-cause, 30-day mortality rate. Mortality data are obtained 2 ways. First, hospital personnel are responsible for collecting 30-day mortality for all patients undergoing cardiac surgery. Second, patient identifying information is linked to this registry from the Massachusetts Registry of Vital Records and Statistics to verify date of death. The registry includes mortality information for Massachusetts residents and all records of deaths that occur within the Commonwealth regardless of the state of residence. Because Mass-DAC has access to Social Security numbers, the Social Security Index Web site47 also is searched to identify deaths, including those reported to the Social Security Administration by funeral homes or by relatives.

Statistical Analyses

Distributions of clinical and demographic variables are computed and stratified by hospital to identify unusual or extreme values. Because of data collection protocols and auditing procedures, no data are missing in the clinical variables or outcomes for the mortality models.

Risk Adjustment

We first estimated a propensity score model in which the dependent variable was multinomial, assuming 13 distinct values corresponding to the 13 hospitals (1 hospital is the reference group). The specific clinical variables included in the model were selected from a literature review of existing models and expert opinion from a panel of senior cardiac surgeons. A multinomial logistic regression model was estimated, and predictions for each patient in the sample were subsequently obtained. Thus, each patient had 14 estimated probabilities, each reflecting the likelihood that the patient would undergo CABG at 1 specific hospital rather than 1 of the remaining 13 hospitals. For this reason, the sum of the 14 estimated probabilities for each patient was 1.
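The reason the 14 probabilities sum to 1 follows from the form of the multinomial logistic model: the reference hospital has its linear predictor fixed at 0, and the probabilities are obtained by normalizing the exponentiated predictors. The sketch below assumes the linear predictors have already been computed from a fitted model; the function name and the 13 values are hypothetical.

```python
import math


def multinomial_probabilities(linear_predictors):
    """Softmax over K-1 linear predictors plus a reference category fixed at 0."""
    scores = [0.0] + list(linear_predictors)       # reference hospital comes first
    peak = max(scores)                             # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical linear predictors for one patient and 13 non-reference hospitals.
eta = [0.4, -1.2, 0.1, -0.3, 0.8, -2.0, 0.0, 0.5, -0.7, 1.1, -0.1, 0.3, -1.5]
probs = multinomial_probabilities(eta)
print(len(probs), round(sum(probs), 12))
```

Each element of `probs` is that patient's estimated propensity for one hospital, and its log-odds is the quantity plotted on the x axis of the overlap figures.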

To compare the performance of each hospital with that of its peers, it is necessary to assess whether the population of patients undergoing surgery at a particular hospital is comparable to that of all other Massachusetts hospitals on the basis of their observed characteristics. To accomplish this, we examined the overlap between the distribution of the propensity scores for patients undergoing surgery at each hospital and the distribution of the propensity scores for patients not undergoing surgery at that hospital. Ideally, the estimated propensity scores of the latter group would cover the entire range of estimated propensity scores at the particular hospital being studied. This finding would provide support for the assumption that the 2 groups of patients (those treated at a particular hospital versus all others) were similar in terms of observable demographic characteristics and other comorbidities.
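The coverage criterion described above (peer patients spanning the full range of scores seen at the hospital under study) can be expressed as a one-versus-rest check on the log-odds scale. This is a sketch under stated assumptions: the record layout, function names, and probabilities are hypothetical, and a comparison of minimum and maximum values stands in for the visual comparison of distributions.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def comparison_covers_hospital(records, hospital):
    """records: (actual_hospital, P(surgery at `hospital`)) pairs for every patient."""
    inside = [logit(p) for h, p in records if h == hospital]
    outside = [logit(p) for h, p in records if h != hospital]
    return min(outside) <= min(inside) and max(outside) >= max(inside)


# Hypothetical patients with their estimated probability of surgery at hospital B.
records = [
    ("B", 0.30), ("B", 0.12), ("B", 0.05),
    ("A", 0.35), ("C", 0.10), ("F", 0.02), ("G", 0.08),
]
print(comparison_covers_hospital(records, "B"))
```

When the function returns False, some patients at the hospital have no peer patient with a comparable propensity, and the comparison for those patients rests on extrapolation.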

We next estimated a regression model for the mortality outcomes. The dependent variable was binary, assuming a value of 1 if the patient died of any cause within 30 days of surgery and 0 otherwise. We included the same set of confounders used in the propensity score model. We included a random hospital-specific intercept that represented the underlying quality of the hospital and accounted for within-hospital correlation of patients. We calculated odds ratios (ORs) conditional on the hospital random effects that apply to comparisons of patients belonging to the same hospital (see Larsen and Merlo48 for a discussion of differences between conditional and unconditional ORs).

[Figure 1: panels A and B, each showing density curves for hospital 1 and hospital 2; x axis, Logit(P(Surgery at Hospital 1)); y axis, Density.]

Figure 1. Covariate balance (shaded area) between patients treated at 2 fictitious hospitals. The x axis represents the log-odds of the probability that a patient has surgery at hospital 1 vs hospital 2; the y axis represents the density of patients. Substantial overlap is present in log-odds in A, and less overlap is present in B.

The size of between-hospital variation was summarized by the median OR (MOR).49 The MOR considers 2 CABG patients with the same set of observed risk factors but selected randomly from 2 different hospitals. The MOR is the median value of the OR between the patient with a higher probability of dying and the patient with a lower probability of dying. A MOR value >1 supports the hypothesis that between-hospital variation in mortality exists after adjustment for patient characteristics. If the between-hospital variation were 0, this would imply that differences in hospital outcomes, after adjustment for patient characteristics, are due only to random sampling variability. Although between-hospital variation will always be >0 in practice, some have suggested that small values can be effectively ignored by essentially setting the between-hospital variation component to 0. We see no reason to assume that between-hospital variation is 0 given that this value can be estimated.
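For a logistic model with normally distributed hospital intercepts, the median odds ratio has a closed form in terms of the between-hospital variance: MOR = exp(sqrt(2 × variance) × z), where z is the 75th percentile of the standard normal distribution. The sketch below assumes this standard formula applies; the text does not state how the MOR was computed. The variance of 0.0939 is the between-hospital estimate reported in Table 3.

```python
import math
from statistics import NormalDist


def median_odds_ratio(between_hospital_variance):
    """MOR for a random-intercept logistic model (variance on the logit scale)."""
    z75 = NormalDist().inv_cdf(0.75)  # 75th percentile of the standard normal, about 0.6745
    return math.exp(math.sqrt(2.0 * between_hospital_variance) * z75)


print(round(median_odds_ratio(0.0939), 2))  # prints 1.34
print(median_odds_ratio(0.0))               # no between-hospital variation gives 1.0
```

The first value agrees with the statewide MOR of 1.34 reported in Table 3, and the second shows why a MOR of exactly 1 corresponds to a between-hospital variance of 0.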

We calculated the mortality risk for each patient using the observed values of his or her confounding variables. For each patient, the risk factor values were multiplied by the estimated coefficients from the regression model, and the resulting sum was transformed onto the probability scale. These probabilities were then summed over patients to obtain the expected number of deaths at each hospital.
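The calculation just described can be sketched as follows. This is a deliberately reduced illustration, not the fitted model: it keeps only 2 of the covariates, takes their coefficients as the logarithms of the odds ratios reported in Table 3 (renal failure 3.35, cardiogenic shock 6.58), and uses the reported between-hospital mean of -5.05 logits as the intercept. The three patients and all function and variable names are hypothetical.

```python
import math


def expit(x):
    """Inverse logit: maps a linear predictor onto the probability scale."""
    return 1.0 / (1.0 + math.exp(-x))


def expected_deaths(patients, intercept, coefficients):
    """Sum of patient-level predicted mortality risks."""
    total = 0.0
    for covariates in patients:
        linear_predictor = intercept + sum(
            coefficients[name] * value for name, value in covariates.items()
        )
        total += expit(linear_predictor)
    return total


INTERCEPT = -5.05  # between-hospital mean (logits), Table 3
COEFFICIENTS = {   # log odds ratios; a 2-covariate simplification of the full model
    "renal_failure": math.log(3.35),
    "cardiogenic_shock": math.log(6.58),
}

patients = [  # hypothetical case mix
    {"renal_failure": 0, "cardiogenic_shock": 0},
    {"renal_failure": 1, "cardiogenic_shock": 0},
    {"renal_failure": 1, "cardiogenic_shock": 1},
]
print(f"{expected_deaths(patients, INTERCEPT, COEFFICIENTS):.3f}")
```

Because the sum runs over a hospital's own patients, two hospitals with identical coefficients but different case mixes will have different expected numbers of deaths.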

Hospital RSMRs

We next estimated a risk-standardized mortality ratio for each hospital by computing the ratio of the “observed” number of deaths to the expected number of deaths (RSMR). However, rather than use the actual numbers of deaths at a hospital, we used an adjusted number (called a shrinkage estimate) that avoids several statistical problems associated with the observed number, including small sample sizes and clustering.28,34,50,51 We then multiplied the standardized mortality ratio by the crude state mortality rate to obtain hospital-specific RSMRs. Ninety-five percent posterior intervals for each RSMR were computed.

Cross-Validation

Because all hospitals contribute to the model used to estimate the expected number of deaths, each hospital helps to define its own expected behavior.50,51 If one hospital is truly “outlying,” with an unusually high or low mortality rate, it may “inflate” the estimated between-hospital variance component because the regression model adapts to incorporate the results of the unusual hospital. Consequently, this hospital will be less likely to be identified as an outlier. With a very large number of hospitals, the results of one institution are unlikely to distort the model substantially. However, with a smaller number of cardiac surgery hospitals, as in Massachusetts or other individual states, one aberrant hospital could substantially influence the counterfactual outcome and make the performance of that hospital less likely to be identified as an outlier.

We addressed this problem through cross-validation. In a second set of analyses, the data from each hospital were sequentially deleted from the determination of the counterfactual distribution for its particular patients. With this approach, the expected number of deaths for a hospital represents how well the rest of the hospitals in the state would fare with the patients from that specific hospital. We computed the difference between the observed numbers of deaths in each hospital and the number of deaths predicted using its case mix and the regression coefficients from a model based on all other hospitals. Posterior predictive probability values, which reflect the similarity of the mortality experience of a particular hospital to that of its peers, also were computed.50 Extreme predictive P values (P<0.01 or P>0.99) indicate a discrepancy between the observed data and what is predicted by the model developed from the remaining hospitals.
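The leave-one-hospital-out logic can be illustrated with a much simpler risk model than the one used here. In the sketch below, death rates within coarse risk strata stand in for the hierarchical logistic regression, which is a substitution made purely for brevity; the strata, hospitals, and counts are invented, and posterior predictive P values are not computed.

```python
from collections import defaultdict


def leave_one_out_expected(records):
    """records: (hospital, risk_stratum, died) tuples, with died equal to 0 or 1.

    For each hospital, stratum death rates are estimated from all OTHER
    hospitals and applied to that hospital's own case mix.
    Returns {hospital: (observed_deaths, expected_deaths)}.
    """
    results = {}
    for target in sorted({h for h, _, _ in records}):
        deaths, counts = defaultdict(int), defaultdict(int)
        for hospital, stratum, died in records:
            if hospital != target:          # the target hospital never informs its own model
                deaths[stratum] += died
                counts[stratum] += 1
        observed, expected = 0, 0.0
        for hospital, stratum, died in records:
            if hospital == target:
                if counts[stratum] == 0:    # no peer patients in this stratum: no overlap
                    raise ValueError(f"no peer data for stratum {stratum!r}")
                observed += died
                expected += deaths[stratum] / counts[stratum]
        results[target] = (observed, expected)
    return results


def cases(hospital, stratum, survivors, deaths):
    """Build hypothetical patient records for one hospital and stratum."""
    return [(hospital, stratum, 0)] * survivors + [(hospital, stratum, 1)] * deaths


records = (
    cases("A", "low", 9, 1) + cases("A", "high", 4, 1)
    + cases("B", "low", 10, 0) + cases("B", "high", 3, 2)
    + cases("C", "low", 5, 0) + cases("C", "high", 1, 4)
)
for hospital, (obs, exp) in leave_one_out_expected(records).items():
    print(hospital, obs, round(exp, 2), round(obs - exp, 2))
```

In this toy data set, hospital C has more observed deaths than its peers' experience predicts for its case mix. Had C's own high mortality been included when estimating the high-risk stratum rate, its expected count would have been pulled upward and the discrepancy would have appeared smaller, which is the distortion that cross-validation is meant to avoid.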

Table 1. Selected Patient Characteristics Stratified by Hospital: Massachusetts Adults Undergoing Isolated CABG Surgery During 2003

Values are percentages of patients at each hospital (A through N) and overall (All).

Variable                 A     B     C     D     E     F     G     H     I     J     K     L     M     N    All
Female                31.0  27.0  24.0  29.0  29.0  21.0  25.0  25.0  26.0  23.0  30.0  20.0  30.0  18.0   26.5
Renal failure          3.8  13.0   6.6   6.2   9.7   1.8   7.2   3.4   3.1   8.3   8.4  11.0   6.9   4.5    6.9
Hx of PVD             14.0  20.0  19.0  23.0  16.0  20.0  17.0  13.0  17.0  19.0  14.0  26.0  12.0  27.0   17.4
Prior CABG             3.0   2.1   4.7   4.9   1.8     0   4.2   1.4   3.1   4.5   2.2   1.8   1.1     0    3.1
EF <30%               12.0  15.0   8.7  13.0   9.1   1.8   2.9   9.3   7.3   8.1  11.0  11.0  13.0   6.8   12.6
MI ≤6 h                0.9   0.9   0.3   0.5   0.5   0.9   0.9   2.0   0.5   0.5   1.1     0   4.0     0   0.96
Emergent or salvage    3.8   7.2   3.1   4.0   2.7   0.9   3.4   2.5   2.1   1.1   1.8   0.9   4.0     0    3.0
Cardiogenic shock      3.0   2.1   1.6   0.8   1.5   0.9   1.9   1.7   2.1   0.6   1.1   1.8   2.3     0    1.6
Preop IABP            10.0  13.0  17.0   6.7   8.6  29.0  12.0   8.8  15.0  14.0   9.9  14.0   5.2   2.3   11.7

Hx of PVD indicates history of peripheral vascular disease; EF, ejection fraction; MI, myocardial infarction; and preop IABP, preoperative intraaortic balloon pump.

The authors had full access to and take full responsibility for the integrity of the data. All authors have read and agree to the manuscript as written.

Results

The crude 30-day mortality rate is 2.25%, corresponding to

99 deaths out of 4393 isolated CABG admissions. The

A Hospital B vs All Others Multinomial Logistic

number of isolated CABG admissions ranged from a low of 44 to a high of 650. Not surprisingly, substantial differences were found in patient risk factors among hospitals (Table 1). For example, the percentage of admissions in which ejection fraction was 30% ranged from 1.8% to 15.0%, renal failure ranged from 1.8% to 13.0%, preoperative intraaortic balloon

B Hospital B vs Hospital F Binary Logistic

( Surgery at B Surgery at F ) ( 0.9 ) ( 0.32 )Surgery at B Surgery

( 0.5 ) ( 0.6 ) ( 0.7 ) ( 0.8 ) ( 0.20 ) ( 0.24 ) ( 0.28 )not at B

( 0.0 ) ( 0.1 ) ( 0.2 ) ( 0.0 ) ( 0.04 ) ( 0.08 )-6.0 -5.0 -4.0 -3.0 -2.0 -1.0 0.0

Logit(P(Surgery at Hospital B))

-6 -2 0 2 4 6 8 10 14 18 22 26

Logit(P(Surgery at Hospital B))

( Downloaded from http://circ.ahajournals.org/ by guest on February 20, 2017 ) ( Density ) ( 0.3 ) ( 0.4 ) ( Density 0.16 ) ( 0.12 )Figure 2. Covariate balance for 2 comparisons using Massachusetts cardiac surgery programs. A, Substantial overlap is present in the log-odds of the probability of surgery at hospital B vs the remaining 13 cardiac surgery programs. B, The covariate balance for the direct comparison of hospital B to hospital F is much less.

Table 2. Comparison of Prevalence of Risk Factors Between Hospitals B and F

Table 3. Prevalence of Risk Factors and Conditional and Unconditional (Population-Averaged) Odds Ratios of 30-Day Mortality After Isolated CABG Surgery in Massachusetts (2003)

Area of Area of

Nonoverlap

Overlap

Prevalence,

Odds Ratios

Risk Factor Hospital B Hospital F Hospital B

Risk Factor

%

(95% Posterior Limits)

Years 65 y, mean

1.7

1.06, (1.04, 1.09)

Male

73.5

0.43 (0.28, 0.68)

Renal failure

6.9

3.35 (1.81, 5.57)

Diabetes mellitus

38.1

1.21 (0.76, 1.84)

Hypertension

79.5

1.05 (0.58, 1.84)

Peripheral vascular disease

17.4

1.34 (0.78, 2.08)

Prior CABG surgery

3.1

2.67 (0.96, 5.49)

Prior PTCA surgery

17.8

1.32 (0.75, 2.15)

Cardiogenic shock

1.6

6.58 (2.70, 13.87)

Ejection fraction (reference,

>40%)

75.0

1.00

30% or missing 12.6 1.13 (0.55, 1.99)

30% to 39% 12.4 2.05 (1.16, 3.29)

( Years 65 y of age, mean 0.33 2.64 0.90 Male, % 62 79 75 Renal failure, % 24 2 11 Diabetes mellitus, % 55 24 37 Hypertension, % 71 77 77 Peripheral vascular disease, % 14 20 21 Prior CABG surgery, % 17 0 0 Prior PTCA surgery, % 19 28 13 Cardiogenic shock, % 10 1 1 Ejection fraction, % (reference, > 40%) 41 88 76 30% or missing 52 2 10 30% to 39% 7 10 14 Myocardial infarction, % (reference, none) 36 44 46 )Myocardial infarction (reference, none)

49.7 1.00

6 h

0.96

0.72 (0.09, 2.40)

7–24 h

2.2

1.77 (0.47, 4.51)

1–7 d

23.0

1.16 (0.63, 1.98)

8–21 d

5.2

1.26 (0.47, 2.61)

21 d

18.9

1.01 (0.50, 1.78)

6 h                                         2    1    1
7–24 h                                     26    3    2
1–7 d                                      24   17   22
8–21 d                                      0    5    1
21 d                                       12   30   28
Status of CABG, % (reference, elective)    10   35   25
  Urgent                                   38   64   74
  Emergent/salvage                         52    1    1
Preoperative intraaortic balloon pump, %   38   29    9

PTCA indicates percutaneous transluminal coronary angioplasty. Area of overlap was defined by estimated log-odds of propensity score ≤5.

Status of CABG (reference, elective)       31.2   1.00
  Urgent                                   65.8   0.74 (0.43, 1.20)
  Emergent/salvage                          3.0   2.16 (0.70, 4.96)
Preoperative intraaortic balloon pump      11.7   1.71 (0.89, 2.91)
Between-hospital parameters*
  Mean, between-hospital (logits)          ...    -5.05 (-5.76, -4.35)
  Variance, between-hospital (logits)      ...    0.0939 (0.00111, 0.483)
  MOR                                      ...    1.34

PTCA indicates percutaneous transluminal coronary angioplasty. Based on 4393 surgeries with 99 deaths (2.25%).

*Between-hospital parameter estimates are reported as means and 95% limits.

pump use varied from 2.3% to 29.0%, and emergent or salvage procedures ranged from 0% to 7.2%. Visual inspection of the covariate frequencies for hospitals B and F suggests that they represent, on average, quite different populations. For example, 7.2% of the patients at hospital B were emergent or salvage, the highest-acuity group, whereas only 0.9% of patients at hospital F were in that category. This imbalance is illustrated more formally in Figure 2B, a graphic depiction of the density of estimated propensity scores from hospital B compared with those of hospital F. This analysis is restricted to those patients who underwent surgery in those 2 hospitals. The propensity scores in Figure 2B were obtained by estimating a (binary) logistic regression model in which the response was an indicator assuming a value of 1 if the patient underwent CABG at hospital B and 0 if the patient underwent surgery at hospital F. The density estimates indicate that for 13% of the patients who underwent CABG at hospital B (solid line), no “similar” patient underwent the procedure in hospital F (dashed line). This percentage was calculated by identifying the fraction of hospital B patients with estimated log-odds of propensity scores >5 because this defined the area of nonoverlap (eg, no hospital F patient had an estimated log-odds of propensity score >5). The lack of overlap implies that a direct comparison of all patients treated at hospital B with those at hospital F may not be statistically valid.
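The nonoverlap calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the propensity values are hypothetical, the function names are invented for the example, and the cutoff is taken as the largest log-odds seen among hospital F patients (an assumption that stands in for the fixed threshold used in the article).

```python
import math


def log_odds(p):
    """Convert a propensity score (probability of treatment at hospital B) to log-odds."""
    return math.log(p / (1.0 - p))


def nonoverlap_fraction(propensity_b, propensity_f):
    """Fraction of hospital B patients with no comparable hospital F patient.

    Assumption: a B patient is 'outside the overlap' when their log-odds exceed
    the largest log-odds observed among hospital F patients.
    """
    cutoff = max(log_odds(p) for p in propensity_f)
    outside = [p for p in propensity_b if log_odds(p) > cutoff]
    return len(outside) / len(propensity_b), cutoff


# Hypothetical estimated propensities of being treated at hospital B.
propensity_b = [0.30, 0.62, 0.88, 0.97, 0.995, 0.999]
propensity_f = [0.05, 0.20, 0.45, 0.70, 0.90]

fraction, cutoff = nonoverlap_fraction(propensity_b, propensity_f)
print(f"cutoff log-odds = {cutoff:.2f}, nonoverlap fraction = {fraction:.2f}")
```

In practice the propensities would come from the fitted logistic regression model rather than from a hand-written list, and the choice of cutoff deserves the same scrutiny as any other modeling decision.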

Table 2 illustrates the prevalence of the individual covariates from which these propensity score density distributions were derived. Column 1 shows the characteristics of the subset of patients at hospital B who do not overlap with hospital F (ie, for whom the log-odds of their propensity scores are >5). The prevalence of individual high-risk characteristics is quite elevated in this patient subset (eg, 24% renal failure, 17% reoperation, 10% cardiogenic shock, 52% emergent or salvage), and hospital F has no experience with patients having this overall level of acuity. The last 2 columns demonstrate the balancing properties of propensity scores in the area of overlap, in which patients are found from both hospitals with comparable log-odds of propensity score. For many of the most important covariates (eg, prior CABG, cardiogenic shock, recent myocardial infarction, urgent or emergent/salvage status), the prevalence was comparable for hospital B and F patients in the overlap region.

Table 4. Cross-Validation Results

          Counterfactual=Hospital Peer Experience                                Counterfactual=Entire State
          (Entire State Excluding Hospital)                                      Experience (2.25%)
Hospital  Between-Hospital MOR  Observed-“Predicted” Mortality, %  Predictive P  RSMR, % (95% Posterior Limits)
A         1.47                   0.35                              0.29          2.49 (1.83, 3.83)
B         1.28                  -1.35                              0.12          2.03 (1.16, 2.74)
C         1.39                   0.10                              0.36          2.39 (1.58, 3.69)
D         1.23                   2.26                              0.01          3.06 (2.07, 5.68)
E         1.33                   1.01                              0.12          2.64 (1.85, 4.49)
F         1.30                  -1.90                              0.13          2.05 (0.89, 3.09)
G         1.37                  -0.62                              0.31          2.12 (1.33, 2.88)
H         1.32                  -0.79                              0.24          2.12 (1.18, 3.03)
I         1.39                   0.07                              0.34          2.33 (1.45, 3.68)
J         1.40                  -0.31                              0.43          2.20 (1.47, 3.12)
K         1.33                   0.98                              0.14          2.58 (1.78, 4.25)
L         1.31                  -1.81                              0.20          2.10 (1.06, 3.07)
M         1.32                  -1.11                              0.29          2.11 (1.11, 3.09)
N         1.32                   0.72                              0.15          2.39 (1.29, 4.17)

MOR for entire state is 1.34. Positive values of observed-predicted indicate higher-than-predicted mortality rates, whereas negative values indicate lower-than-predicted mortality.

Although direct hospital-to-hospital covariate balance was poor, the overlap of estimated propensity score distributions for each hospital compared with the propensity score distribution for patients at most of the remaining hospitals was excellent. For example, Figure 2A displays the overlap for hospital B and all remaining hospitals based on the predictions obtained from the multinomial logistic regression model. This suggests that a comparison of the performance of hospital B relative to the overall group of other Massachusetts CABG providers is statistically valid.

The prevalence of the confounders and their relationship to 30-day mortality are presented in Table 3. Between-hospital variation measured by the MOR, after accounting for patient risk factors, is 1.34. This implies that for 2 patients with the same observed risk factors treated at 2 randomly selected hospitals, the odds of dying within 30 days of isolated CABG are, in the median comparison, 1.34 times as high for the patient treated in the hospital with higher mortality risk as for the patient treated in the hospital with lower mortality risk.
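The MOR can be recovered from the between-hospital variance on the logit scale. The sketch below assumes the usual definition of the median odds ratio for a random-intercept logistic model, MOR = exp(sqrt(2 × variance) × z), where z is the 75th percentile of the standard normal distribution; the function name is illustrative. Applied to the between-hospital variance of 0.0939 reported in Table 3, it reproduces the reported value of 1.34.

```python
import math
from statistics import NormalDist


def median_odds_ratio(between_hospital_variance):
    """Median odds ratio implied by a random-intercept variance (logit scale).

    Assumes the standard definition: exp(sqrt(2 * variance) * z_0.75).
    """
    z75 = NormalDist().inv_cdf(0.75)  # 75th percentile of N(0, 1), about 0.6745
    return math.exp(math.sqrt(2.0 * between_hospital_variance) * z75)


mor_all_hospitals = median_odds_ratio(0.0939)  # variance from Table 3
print(f"MOR = {mor_all_hospitals:.2f}")
```

Because the MOR is a monotone function of the variance, any reduction in between-hospital variance translates directly into an MOR closer to 1.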

The last column of Table 4 depicts the typical profiling results that would be obtained with the entire state experience (all 14 hospitals) as the counterfactual. The 95% posterior interval of each hospital for its RSMR includes the state crude rate of 2.25%. This would imply that no hospital had a higher- or lower-than-expected mortality rate given its case mix. In most public report cards, this finding would be regarded as sufficient evidence for the absence of statistical outliers, but as noted previously, this conclusion may be misleading. The 3 columns on the left demonstrate the results of analyses performed with cross-validation, sequentially deleting the results of each hospital from the determination of its own counterfactual. In this cross-validation analysis, the predictive P value for hospital D was highly significant (P=0.01; left side of Table 4). Supporting this finding is the fact that the between-hospital variance in risk-adjusted mortality is reduced by approximately 50% when hospital D is excluded from the model (from 0.0939 to 0.048; data not shown), and the MOR decreases from 1.34 to 1.23. Finally, a 2.26% excess mortality rate results when hospital D is compared with its peers. These findings all suggest that hospital D is in fact a statistical outlier.
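The leave-one-hospital-out idea can be illustrated with a deliberately simplified sketch. It is not the hierarchical Bayesian procedure used in the article: it ignores patient risk factors, uses crude peer rates as the "predicted" mortality, and replaces the posterior predictive P value with a one-sided binomial tail probability. The hospital labels and counts are hypothetical.

```python
import math


def binomial_upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def leave_one_out(counts):
    """Compare each hospital with a counterfactual built from its peers only.

    counts maps hospital -> (cases, deaths). Returns, per hospital, the peer
    mortality rate, observed minus predicted mortality (percentage points),
    and the probability of seeing at least the observed deaths at the peer rate.
    """
    total_cases = sum(cases for cases, _ in counts.values())
    total_deaths = sum(deaths for _, deaths in counts.values())
    results = {}
    for name, (cases, deaths) in counts.items():
        # Exclude the hospital's own experience from its counterfactual.
        peer_rate = (total_deaths - deaths) / (total_cases - cases)
        excess_pct = 100.0 * (deaths / cases - peer_rate)
        tail = binomial_upper_tail(cases, peer_rate, deaths)
        results[name] = (peer_rate, excess_pct, tail)
    return results


# Hypothetical (cases, deaths) for three hospitals; Z is the suspected outlier.
counts = {"X": (200, 4), "Y": (300, 6), "Z": (100, 8)}
results = leave_one_out(counts)
for name, (peer_rate, excess_pct, tail) in results.items():
    print(f"{name}: peer rate {peer_rate:.3f}, excess {excess_pct:+.2f}%, tail P {tail:.4f}")
```

The point of removing each hospital from its own counterfactual is visible even in this toy version: an outlying hospital no longer pulls the reference rate toward its own result, so its excess mortality is easier to detect.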

Discussion

The study of variations in the provision of healthcare services has been a central activity of outcomes research for more than 2 decades. This variability has included both utilization of services and outcomes. Initial publication of hospital mortality rates in 1986 by the Health Care Financing Administration (now known as the Centers for Medicare and Medicaid Services, or CMS) was widely criticized for failing to adjust for patient risk.52 This motivated the development of numerous statistical risk models, particularly in cardiac surgery, to account for preoperative patient characteristics. It also stimulated CMS to look more closely at its risk models. CMS has now released new mortality models for acute myocardial infarction and heart failure that address many risk-adjustment issues and statistical deficiencies identified in its earlier releases.32,33 Nevertheless, although risk adjustment corrects for the case severity at a given institution using risk estimates derived from the entire population, it does not guarantee statistically valid direct hospital-to-hospital comparisons. When analyzing outcomes data, interested stakeholders should always consider these additional questions: To what type of patients can inferences about risk-standardized hospital outcomes be applied? What reference population was used to determine the counterfactual? If direct hospital-to-hospital comparison is the goal, is there sufficient covariate balance (overlap) to justify such comparison? A widely held view is that risk adjustment levels the playing field so that hospitals can be compared directly with one another over the broad spectrum of patient risk. We argue that this assumption often is invalid and that this common misinterpretation has profound health policy implications in today’s performance-centric environment.

Are current report cards useful? Yes, they are useful when interpreted in the correct context. Most outcomes report cards use indirect standardization. In this context, the RSMR of a hospital may be interpreted as a measure of quality for the type of patient it treats. Properly constructed and interpreted, report cards facilitate comparisons of hospitals with the entire experience of a larger population of providers (eg, a state or region). Such a comparison group for each hospital typically will be rich enough to support a valid assessment of its quality of care, and it provides meaningful information to payers, regulators, and healthcare consumers.
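Indirect standardization can be illustrated with a small sketch. It uses the classical form, in which a hospital's observed deaths are divided by the deaths expected for its own case mix under the reference population's risk model, and the ratio is scaled by the reference crude rate. This is a simplification: the RSMR in the article is derived from a hierarchical model rather than from the raw observed count. The patient risks below are hypothetical; only the 2.25% state crude rate comes from the article.

```python
def indirectly_standardized_rate(observed_deaths, expected_risks, reference_rate):
    """Classical indirect standardization: (observed / expected) * reference rate.

    expected_risks holds each patient's predicted probability of death under the
    reference population's risk model (hypothetical values in this example).
    """
    expected_deaths = sum(expected_risks)
    ratio = observed_deaths / expected_deaths
    return ratio * reference_rate


# Hypothetical high-acuity hospital: 100 patients, 3 observed deaths.
expected_risks = [0.01] * 50 + [0.05] * 40 + [0.20] * 10  # sums to 4.5 expected deaths
standardized = indirectly_standardized_rate(3, expected_risks, 0.0225)
print(f"crude rate 3.00%, standardized rate {100 * standardized:.2f}%")
```

In this example the crude rate (3%) exceeds the state rate, yet the standardized rate (1.5%) is below it because the hospital treats a high-risk case mix. The result describes performance for that case mix only, which is why it does not license a direct comparison with a hospital whose patients look very different.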

Conclusions

Outcomes research typically involves nonrandomized studies to assess the results of patient experience with the healthcare system. Virtually always, some form of adjustment is required. Although risk-standardized outcomes have been an important advance in adjusting provider results for differences in case mix, such results often have been misapplied. Assessing the performance of a hospital for its case mix compared with the expected performance of a reference group of providers for a similar case mix usually is justified. However, because of substantial differences in the distribution of risk factors, it may often be inappropriate to directly compare 2 hospitals using the results available in most public report cards.

Sources of Funding

Dr Normand is contracted by the Massachusetts Department of Public Health to monitor hospital cardiac quality and also receives funding from Yale University to develop risk models for CMS.

Disclosures

None.

References

1. Agency for Healthcare Research and Quality. Outcomes Research: Fact Sheet. Available at: http://www.ahrq.gov/clinic/outfact.htm. Accessed September 5, 2007.

2. Institute of Medicine. Crossing the Quality Chasm: A New Health System for the 21st Century. Washington, DC: National Academies Press; 2001.

3. Institute of Medicine. Performance Measurement: Accelerating Improvement. Washington, DC: National Academies Press; 2006.

4. Gatsonis CA. Profiling providers of medical care. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics, Volume 6. 2nd ed. Chichester, UK: John Wiley & Sons Ltd; 2005:4252–4254.

5. Normand S-LT. Quality of care. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics, Volume 6. 2nd ed. Chichester, UK: John Wiley & Sons Ltd; 2005:4348–4352.

6. Rubin DB. Comment: Neyman (1923) and causal inference in experiments and observational studies. Stat Sci. 1990;5:472–480.

7. Holland PW. Statistics and causal inference. J Am Stat Assoc. 1986;81:945–960.

8. Holland PW, Rubin DB. Causal inference in retrospective studies. Eval Rev. 1988;12:203–231.

9. Rothman KJ, Greenland S. Causation and causal inference in epidemiology. Am J Public Health. 2005;95:S144–S150.

10. Rothman KJ, Greenland S. Modern Epidemiology. Philadelphia, Pa: Lippincott-Raven; 1998.

11. Pearl J. Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge University Press; 2000.

12. Robins JM, Greenland S. The role of model selection in causal inference from nonexperimental data. Am J Epidemiol. 1986;123:392–402.

13. Rosenbaum PR, Rubin DB. Estimating the effects caused by treatments: comment. J Am Stat Assoc. 1984;79:26–28.

14. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.

15. Rosenbaum PR. Observational Studies. New York, NY: Springer; 2002.

16. Little RJ, Rubin DB. Causal effects in clinical and epidemiological studies via potential outcomes: concepts and analytical approaches. Annu Rev Public Health. 2000;21:121–145.

17. Rubin DB. Causal inference using potential outcomes: design, modeling, decisions. J Am Stat Assoc. 2005;100:322–331.

18. Gelman A. Applied Bayesian Modeling and Causal Inference From Incomplete Perspectives. Chichester, UK: Wiley; 2004.

19. Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press; 2007.

20. Maldonado G, Greenland S. Estimating causal effects. Int J Epidemiol. 2002;31:422–429.

21. Rubin DB. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Stat Med. 2007;26:20–36.

22. Rubin DB. Direct and indirect causal effects via potential outcomes. Scand J Stat. 2004;31:161–170.

23. Rubin DB. Bayesian inference for causal effects: role of randomization. Ann Stat. 1978;6:34–58.

24. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.

25. Fleiss JL, Levin BA, Paik MC. Statistical Methods for Rates and Proportions. Hoboken, NJ: J. Wiley; 2003.

26. Shahian DM, Blackstone EH, Edwards FH, Grover FL, Grunkemeier GL, Naftel DC, Nashef SA, Nugent WC, Peterson ED. Cardiac surgery risk models: a position article. Ann Thorac Surg. 2004;78:1868–1877.

27. Shahian DM, Normand SL, Torchiana DF, Lewis SM, Pastore JO, Kuntz RE, Dreyer PI. Cardiac surgery report cards: comprehensive review and statistical critique. Ann Thorac Surg. 2001;72:2155–2168.

28. Normand S-LT, Glickman ME, Gatsonis CA. Statistical methods for profiling providers of medical care: issues and applications. J Am Stat Assoc. 1997;92:803–814.

29. McNeil BJ, Pedersen SH, Gatsonis C. Current issues in profiling quality of care. Inquiry. 1992;29:298–307.

30. Hannan EL, Wu C, Ryan TJ, Bennett E, Culliford AT, Gold JP, Hartman A, Isom OW, Jones RH, McNeil B, Rose EA, Subramanian VA. Do hospitals and surgeons with higher coronary artery bypass graft surgery volumes still have lower risk-adjusted mortality rates? Circulation. 2003;108:795–801.

31. Hannan EL, Kumar D, Racz M, Siu AL, Chassin MR. New York State’s Cardiac Surgery Reporting System: four years later. Ann Thorac Surg. 1994;58:1852–1857.

32. Krumholz HM, Wang Y, Mattera JA, Wang Y, Han LF, Ingber MJ, Roman S, Normand SL. An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with an acute myocardial infarction. Circulation. 2006;113:1683–1692.

33. Krumholz HM, Wang Y, Mattera JA, Wang Y, Han LF, Ingber MJ, Roman S, Normand SL. An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with heart failure. Circulation. 2006;113:1693–1701.

34. Shahian DM, Torchiana DF, Shemin RJ, Rawn JD, Normand SL. Massachusetts cardiac surgery report card: implications of statistical methodology. Ann Thorac Surg. 2005;80:2106–2113.

35. Rosenbaum PR, Rubin DB. Constructing a control-group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39:33–38.

36. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc. 1984;79:516–524.

37. Rubin DB. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Stat Med. 2007;26:20–36.

38. D’Agostino RB Jr. Propensity scores in cardiovascular research. Circulation. 2007;115:2340–2343.

39. Braitman LE, Rosenbaum PR. Rare outcomes, common treatments: analytic strategies using propensity scores. Ann Intern Med. 2002;137:693–695.

40. D’Agostino RB Jr. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat Med. 1998;17:2265–2281.

41. Joffe MM, Rosenbaum PR. Invited commentary: propensity scores. Am J Epidemiol. 1999;150:327–333.

42. Glance LG, Osler TM, Mukamel DB, Dick AW. Use of a matching algorithm to evaluate hospital coronary artery bypass grafting performance as an alternative to conventional risk adjustment. Med Care. 2007;45:292–299.

43. Huang IC, Frangakis C, Dominici F, Diette GB, Wu AW. Application of a propensity score approach for risk adjustment in profiling multiple physician groups on asthma care. Health Serv Res. 2005;40:253–278.

44. Dehejia RH, Wahba S. Causal effects in nonexperimental studies: reevaluating the evaluation of training programs. J Am Stat Assoc. 1999;94:1053–1062.

45. Tchernis R, Horvitz-Lennon M, Normand SL. On the use of discrete choice models for causal inference. Stat Med. 2005;24:2197–2212.

46. Society of Thoracic Surgeons. STS National Database. Available at: http://www.sts.org/sections/stsnationaldatabase/. Accessed September 5, 2007.

47. Social Security Death Index interactive search. Available at: http://ssdi.rootsweb.com/cgi-bin/ssdi.cgi. Accessed September 5, 2007.

48. Larsen K, Merlo J. Appropriate assessment of neighborhood effects on individual health: integrating random and fixed effects in multilevel logistic regression. Am J Epidemiol. 2005;161:81–88.

49. Larsen K, Petersen JH, Budtz J, Endahl L. Interpreting parameters in the logistic regression model with random effects. Biometrics. 2000;56:909–914.

50. Normand ST, Shahian DM. Statistical and clinical aspects of hospital outcomes profiling. Stat Sci. 2007;22:206–226.

51. Draper D, Gittoes M. Statistical analysis of performance indicators in UK higher education. J Royal Stat Soc Ser A (Stat Soc). 2004;167:449–474.

52. Iezzoni LI. Risk Adjustment for Measuring Health Care Outcomes. 3rd ed. Chicago, Ill: Health Administration Press; 2003.


CLINICAL PERSPECTIVE

Risk-standardized outcomes are increasingly being used by various stakeholders to assess the quality of care delivered by healthcare providers. Although adjusted outcomes represent a substantial improvement over unadjusted results, they are commonly misinterpreted and misused, which can have important consequences for the provider and the healthcare system. Risk-standardized outcomes, as most commonly constructed, characterize a provider’s performance for a specific group of patients compared with what would have been expected had that care been delivered by an average provider in the reference population (typically a state or a country). These indirectly standardized outcomes, based on providers’ actual case mix, cannot necessarily be extrapolated to predict what their performance might be with a different (eg, more complex) group of patients. Moreover, if the number of providers in the reference population is small, the inclusion of a true outlying program in the development of the risk model may decrease the sensitivity of the resulting algorithm to detect true outliers. In Massachusetts, this problem is mitigated through the use of cross-validation, obtained by sequentially removing each hospital from risk model development and then assessing its performance with a model derived from the remaining hospitals. Finally, although risk-standardized outcomes are useful for comparing individual provider performance with that of the overall reference population, this does not imply that the outcomes of 2 providers can be directly compared with one another. Such a comparison would be justified only for the group of patients whose risk profiles overlap between the 2 providers because these are the only patients that they have in common.


Comparison of “Risk-Adjusted” Hospital Outcomes

David M. Shahian and Sharon-Lise T. Normand

Circulation. 2008;117:1955-1963; originally published online April 7, 2008; doi: 10.1161/CIRCULATIONAHA.107.747873

Circulation is published by the American Heart Association, 7272 Greenville Avenue, Dallas, TX 75231 Copyright © 2008 American Heart Association, Inc. All rights reserved.

Print ISSN: 0009-7322. Online ISSN: 1524-4539

The online version of this article, along with updated information and services, is located on the World Wide Web at:

http://circ.ahajournals.org/content/117/15/1955
