Review Article

Even in a well-designed and controlled study, missing data occurs in almost all research. Missing data can reduce the

statistical power of a study and can produce biased estimates, leading to invalid conclusions. This manuscript reviews

the problems and types of missing data, along with the techniques for handling missing data. The mechanisms by

which missing data occurs are illustrated, and the methods for handling the missing data are discussed. The paper

concludes with recommendations for the handling of missing data. (Korean J Anesthesiol 2013; 64: 402-406)

Key Words: Expectation-Maximization, Imputation, Missing data, Sensitivity analysis.

The prevention and handling of the missing data

Hyun Kang

Department of Anesthesiology and Pain Medicine, Chung-Ang University College of Medicine, Seoul, Korea

Received: February 13, 2013. Accepted: February 20, 2013.

Corresponding author: Hyun Kang, M.D., Ph.D., Department of Anesthesiology and Pain Medicine, Chung-Ang University College of Medicine,

224-1, Heuksuk-dong, Dongjak-gu, Seoul 156-756, Korea. Tel: 82-2-6299-2571, Fax: 82-2-6299-2585, E-mail: [email protected]

This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial License

(http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium,

provided the original work is properly cited.

Copyright ⓒ the Korean Society of Anesthesiologists, 2013 www.ekja.org

Korean J Anesthesiol 2013 May 64(5): 402-406 http://dx.doi.org/10.4097/kjae.2013.64.5.402

Missing data (or missing values) is defined as the data value

that is not stored for a variable in the observation of interest.

The problem of missing data is relatively common in almost all

research and can have a significant effect on the conclusions

that can be drawn from the data [1]. Accordingly, some studies

have focused on handling the missing data, problems caused

by missing data, and the methods to avoid or minimize such in

medical research [2,3].

However, until recently, most researchers have drawn con-

clusions based on the assumption of a complete data set. The

general topic of missing data has attracted little attention in the

field of anesthesiology.

Missing data present various problems. First, the absence of

data reduces statistical power, which refers to the probability that

the test will reject the null hypothesis when it is false. Second, the

lost data can cause bias in the estimation of parameters. Third, it

can reduce the representativeness of the samples. Fourth, it may

complicate the analysis of the study. Each of these distortions

may threaten the validity of the trials and can lead to invalid

conclusions.

Types of Missing Data

Rubin first described and divided the types of missing data

according to the assumptions based on the reasons for the

missing data [4]. In general, there are three types of missing data

according to the mechanisms of missingness.

Missing completely at random

Missing completely at random (MCAR) is defined as when

the probability that the data are missing is not related to either

the specific value which is supposed to be obtained or the set

of observed responses. MCAR is an ideal but unreasonable

assumption for many studies performed in the field of anesthe-

siology. However, if data are missing by design, because of an

equipment failure or because the samples are lost in transit

or technically unsatisfactory, such data are regarded as being

MCAR.

The statistical advantage of data that are MCAR is that the

analysis remains unbiased. Power may be lost in the design, but

the estimated parameters are not biased by the absence of the

data.

Missing at random

Missing at random (MAR) is a more realistic assumption for

the studies performed in the anesthetic field. Data are regarded

to be MAR when the probability that the responses are missing

depends on the set of observed responses, but is not related to

the specific missing values that are expected to be obtained.

As we tend to consider randomness as not producing bias,

we may think that MAR does not present a problem. However,

MAR does not mean that the missing data can be ignored. If a

dropout variable is MAR, we may expect that the probability

of a dropout of the variable in each case is conditionally

independent of the variable, which is obtained currently and

expected to be obtained in the future, given the history of the

obtained variable prior to that case.

Missing not at random

If the characteristics of the data do not meet those of MCAR or

MAR, then they fall into the category of missing not at random

(MNAR).

The cases of MNAR data are problematic. The only way to

obtain an unbiased estimate of the parameters in such a case is

to model the missing data. The model may then be incorporated

into a more complex one for estimating the missing values.
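
The three mechanisms can be contrasted with a small simulation. The sketch below is illustrative only: the variables x and y, the missingness probabilities, and the two-stratum adjustment are assumptions made for this example and are not taken from the article. It compares the complete-case mean of y with the full-data mean under each mechanism, and then repeats the comparison after adjusting for the observed x.

```python
import random
from statistics import fmean

random.seed(0)

# Hypothetical data: x is always observed; y depends on x and may go missing.
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

# keep[i] is True when y[i] is observed.
# MCAR: every y has the same 30% chance of being missing.
mcar = [random.random() > 0.3 for _ in range(n)]
# MAR: missingness depends only on the observed x.
mar = [random.random() > (0.6 if xi > 0 else 0.1) for xi in x]
# MNAR: missingness depends on the unobserved y itself.
mnar = [random.random() > (0.6 if yi > 0 else 0.1) for yi in y]


def complete_case_mean(keep):
    """Mean of y over the cases in which y was observed."""
    return fmean([yi for yi, k in zip(y, keep) if k])


def stratified_mean(keep):
    """Mean of y adjusted for the observed x (strata: x <= 0 and x > 0)."""
    strata = {False: [], True: []}
    for xi, yi, k in zip(x, y, keep):
        if k:
            strata[xi > 0].append(yi)
    share_positive = sum(xi > 0 for xi in x) / n
    return ((1 - share_positive) * fmean(strata[False])
            + share_positive * fmean(strata[True]))


full_mean = fmean(y)
print("full data:", round(full_mean, 3))
for name, keep in (("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)):
    print(name,
          "complete-case:", round(complete_case_mean(keep), 3),
          "adjusted for x:", round(stratified_mean(keep), 3))
```

In this simulation the complete-case mean stays close to the full-data mean only under MCAR. Under MAR the adjustment for x removes most of the bias, whereas under MNAR it does not, which is why a model of the missingness itself is needed in that case.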

Techniques for Handling the Missing Data

The best possible method of handling the missing data is to

prevent the problem by planning the study well and collecting

the data carefully [5,6]. The following are suggested to minimize

the amount of missing data in the clinical research [7].

First, the study design should limit the collection of data to

those who are participating in the study. This can be achieved

by minimizing the number of follow-up visits, collecting only

the essential information at each visit, and developing the user-

friendly case-report forms.

Second, before the beginning of the clinical research, a

detailed documentation of the study should be developed in the

form of the manual of operations, which includes the methods

to screen the participants, protocol to train the investigators and

participants, methods to communicate between the investigators

or between the investigators and participants, implementation

of the treatment, and procedure to collect, enter, and edit

data.

Third, before the start of the participant enrollment, a training

should be conducted to instruct all personnel related to the study

on all aspects of the study, such as the participant enrollment,

collection and entry of data, and implementation of the treatment

or intervention [8].

Fourth, if a small pilot study is performed before the start of

the main trial, it may help to identify the unexpected problems

which are likely to occur during the study, thus reducing the

amount of missing data.

Fifth, the study management team should set a priori targets

for the unacceptable level of missing data. With these targets in

mind, the data collection at each site should be monitored and

reported in as close to real-time as possible during the course of

the study.

Sixth, study investigators should identify and aggressively,

though not coercively, engage the participants who are at the

greatest risk of being lost during follow-up.

Finally, if a patient decides to withdraw from the follow-

up, the reasons for the withdrawal should be recorded for the

subsequent analysis in the interpretation of the results.

It is not uncommon to have a considerable amount of missing

data in a study. One technique of handling the missing data is to

use the data analysis methods which are robust to the problems

caused by the missing data. An analysis method is considered

robust to the missing data when there is confidence that mild to

moderate violations of the assumptions will produce little to no

bias or distortion in the conclusions drawn on the population.

However, it is not always possible to use such techniques.

Therefore, a number of alternative ways of handling the missing

data have been developed.

Listwise or case deletion

By far the most common approach to the missing data is to

simply omit those cases with the missing data and analyze the

remaining data. This approach is known as complete case

analysis or listwise deletion.

Listwise deletion is the most frequently used method in handling

missing data, and thus has become the default option for

analysis in most statistical software packages. Some researchers

insist that it may introduce bias in the estimation of the para-

meters. However, if the assumption of MCAR is satisfied, a

listwise deletion is known to produce unbiased estimates and

conservative results. When the data do not fulfill the assumption

of MCAR, listwise deletion may cause bias in the estimates of the

parameters [9].

If there is a large enough sample, where power is not an

issue, and the assumption of MCAR is satisfied, the listwise

deletion may be a reasonable strategy. However, when there is

not a large sample, or the assumption of MCAR is not satisfied,

the listwise deletion is not the optimal strategy.
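
As a minimal sketch of listwise deletion, the example below drops every record that has at least one missing value. The records, the variable names (age, sbp, pain), and the helper listwise_delete are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 54, "sbp": 132, "pain": 3},
    {"age": 61, "sbp": None, "pain": 5},
    {"age": 47, "sbp": 118, "pain": None},
    {"age": 70, "sbp": 141, "pain": 6},
    {"age": None, "sbp": 125, "pain": 2},
    {"age": 58, "sbp": 137, "pain": 4},
]


def listwise_delete(rows):
    """Keep only the rows in which every variable is observed."""
    return [r for r in rows if all(v is not None for v in r.values())]


complete_cases = listwise_delete(records)
n_values = sum(len(r) for r in records)
n_missing = sum(v is None for r in records for v in r.values())
print(f"missing values: {n_missing} of {n_values}")
print(f"cases kept: {len(complete_cases)} of {len(records)}")
```

Here only 3 of the 18 values are missing, yet half of the cases are discarded, which illustrates how quickly listwise deletion can erode the sample size and therefore the power.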

Pairwise deletion

Pairwise deletion eliminates information only when the

particular data-point needed to test a particular assumption

is missing. If there is missing data elsewhere in the data set,

the existing values are used in the statistical testing. Since a

pairwise deletion uses all information observed, it preserves

more information than the listwise deletion, which may delete

the case with any missing data. This approach presents the

following problems: 1) the parameters of the model will stand

on different sets of data with different statistics, such as the

sample size and standard errors; and 2) it can produce an

intercorrelation matrix that is not positive definite, which is

likely to prevent further analysis [10].

Pairwise deletion is known to be less biased for the MCAR

or MAR data when the appropriate mechanisms are included as

covariates. However, if there are many missing observations, the

analysis will be deficient.
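
The first problem listed above, that each statistic rests on a different subset of cases, can be made concrete with a short sketch. The records and the helper pairwise_correlation are hypothetical; the function computes a Pearson correlation for each pair of variables using every case in which both members of the pair are observed.

```python
from itertools import combinations
from math import sqrt

# Hypothetical records; None marks a missing value.
records = [
    {"age": 54, "sbp": 132, "pain": 3},
    {"age": 61, "sbp": None, "pain": 5},
    {"age": 47, "sbp": 118, "pain": None},
    {"age": 70, "sbp": 141, "pain": 6},
    {"age": None, "sbp": 125, "pain": 2},
    {"age": 58, "sbp": 137, "pain": 4},
    {"age": 66, "sbp": None, "pain": 5},
]


def pairwise_correlation(rows, a, b):
    """Pearson r for a and b, using every row in which both are observed."""
    pairs = [(r[a], r[b]) for r in rows
             if r[a] is not None and r[b] is not None]
    n = len(pairs)
    mean_a = sum(p[0] for p in pairs) / n
    mean_b = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - mean_a) * (p[1] - mean_b) for p in pairs)
    var_a = sum((p[0] - mean_a) ** 2 for p in pairs)
    var_b = sum((p[1] - mean_b) ** 2 for p in pairs)
    return cov / sqrt(var_a * var_b), n


for a, b in combinations(["age", "sbp", "pain"], 2):
    r, n_used = pairwise_correlation(records, a, b)
    print(f"{a}-{b}: r = {r:.2f}, n = {n_used}")
```

Listwise deletion would keep only three of these seven cases, while the pairwise approach uses four or five cases per correlation. The price is that the entries of the correlation matrix no longer refer to the same sample.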

Mean substitution

In a mean substitution, the mean value of a variable is used in

place of the missing data value for that same variable. This allows

the researchers to utilize the collected data in an incomplete

dataset. The theoretical background of the mean substitution is

that the mean is a reasonable estimate for a randomly selected

observation from a normal distribution. However, with missing

values that are not strictly random, especially in the presence

of a great inequality in the number of missing values for the

different variables, the mean substitution method may lead

to inconsistent bias. Furthermore, this approach adds no new

information but only increases the sample size and leads to an

underestimate of the errors [11]. Thus, mean substitution is not

generally accepted.
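
The underestimation of the errors can be seen in a small simulation. The data below are simulated under assumed values (mean 50, standard deviation 10, 40% of the values missing completely at random); none of these numbers come from the article.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(1)

# Hypothetical measurements; about 40% are set to missing (None) at random.
values = [random.gauss(50, 10) for _ in range(200)]
observed = [v if random.random() > 0.4 else None for v in values]

present = [v for v in observed if v is not None]
mean_observed = fmean(present)

# Mean substitution: every missing value is replaced by the observed mean.
imputed = [mean_observed if v is None else v for v in observed]

sd_observed = stdev(present)
sd_imputed = stdev(imputed)
se_observed = sd_observed / sqrt(len(present))
se_imputed = sd_imputed / sqrt(len(imputed))

print("mean (observed / imputed):", round(mean_observed, 2), round(fmean(imputed), 2))
print("SD   (observed / imputed):", round(sd_observed, 2), round(sd_imputed, 2))
print("SE   (observed / imputed):", round(se_observed, 2), round(se_imputed, 2))
```

The mean is unchanged, but the standard deviation shrinks because the imputed values add no spread, and the naive standard error shrinks further because the sample size appears larger than the number of values actually measured.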

Regression imputation

Imputation is the process of replacing the missing data with

estimated values. Instead of deleting any case that has any

missing value, this approach preserves all cases by replacing the

missing data with a probable value estimated by other available

information. After all missing values have been replaced by

this approach, the data set is analyzed using the standard

techniques for complete data.

In regression imputation, the existing variables are used to

make a prediction, and then the predicted value is substituted

as if it were an actually obtained value. This approach has a number of

advantages, because the imputation retains a great deal of data

over the listwise or pairwise deletion and avoids significantly

altering the standard deviation or the shape of the distribution.

However, as in a mean substitution, while a regression impu-

tation substitutes a value that is predicted from other variables,

no novel information is added, while the sample size has been

increased and the standard error is reduced.
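
A minimal sketch of regression imputation with a single predictor follows. The simulated relationship (intercept 2.0, slope 1.5), the 30% missingness rate, and the helper residual_sd are assumptions made for illustration. The regression line is fitted on the complete cases and each missing y is replaced by its predicted value.

```python
import random
from statistics import fmean

random.seed(2)

# Hypothetical data: y depends linearly on x; about 30% of y is missing.
n = 300
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]
y_obs = [None if random.random() < 0.3 else yi for yi in y]

# Fit y = intercept + slope * x by least squares on the complete cases.
complete = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
mean_x = fmean([xi for xi, _ in complete])
mean_y = fmean([yi for _, yi in complete])
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in complete)
         / sum((xi - mean_x) ** 2 for xi, _ in complete))
intercept = mean_y - slope * mean_x

# Replace each missing y with the value predicted from x.
y_imputed = [intercept + slope * xi if yi is None else yi
             for xi, yi in zip(x, y_obs)]


def residual_sd(xs, ys):
    """Residual standard deviation around the fitted line."""
    ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(xs, ys))
    return (ss / (len(xs) - 2)) ** 0.5


sd_complete = residual_sd([xi for xi, _ in complete],
                          [yi for _, yi in complete])
sd_after = residual_sd(x, y_imputed)
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
print("residual SD, complete cases:", round(sd_complete, 3))
print("residual SD, after imputation:", round(sd_after, 3))
```

Because every imputed point lies exactly on the fitted line, the residual scatter appears smaller after imputation although no new information has been collected.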

Last observation carried forward

In the field of anesthesiology research, many studies are

performed with the longitudinal or time-series approach, in

which the subjects are repeatedly measured over a series of

time-points. One of the most widely used imputation methods

in such a case is the last observation carried forward (LOCF).

This method replaces every missing value with the last observed

value from the same subject. Whenever a value is missing, it is

replaced with the last observed value [12].

This method is advantageous as it is easy to understand

and communicate between the statisticians and clinicians or

between a sponsor and the researcher.

Although simple, this method strongly assumes that the

value of the outcome remains unchanged by the missing

data, which seems unlikely in many settings (especially in the

anesthetic trials). It produces a biased estimate of the treatment

effect and underestimates the variability of the estimated

result. Accordingly, the National Academy of Sciences has re-

commended against the uncritical use of the simple imputation,

including LOCF and the baseline observation carried forward,

stating that:

Single imputation methods like last observation carried

forward and baseline observation carried forward should not be

used as the primary approach to the treatment of missing data

unless the assumptions that underlie them are scientifically

justified [13].
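
LOCF itself is simple to express in code. In the sketch below, the function name locf, the patient labels, and the pain scores are hypothetical. Each missing value is replaced by the most recent observed value of the same subject, and a value that is missing before any observation has been made stays missing.

```python
def locf(series):
    """Carry the last observed value forward over missing entries (None)."""
    filled = []
    last = None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical repeated pain scores for three patients over five time points.
pain_scores = {
    "patient_a": [7, 5, None, None, 2],
    "patient_b": [6, None, 4, None, None],
    "patient_c": [None, 5, 4, 3, None],
}

for patient, scores in pain_scores.items():
    print(patient, scores, "->", locf(scores))
```

For patient_b the score of 4 is carried to the end of the follow-up, which amounts to assuming that the pain did not change after the last visit. This is the assumption that the text above describes as unlikely in many settings.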

Maximum likelihood

There are a number of strategies using the maximum likelihood

method to handle the missing data. In these, the assumption

that the observed data are a sample drawn from a multivariate

normal distribution is relatively easy to understand.

After the parameters are estimated using the available data, the

missing data are estimated based on the parameters which have

just been estimated.

When there are missing but relatively complete data, the

statistics explaining the relationships among the variables may

be computed using the maximum likelihood method. That is,

the missing data may be estimated by using the conditional

distribution of the other variables.

Expectation-Maximization

Expectation-Maximization (EM) is a type of the maximum

likelihood method that can be used to create a new data set, in

which all missing values are imputed with values estimated by

the maximum likelihood methods [14]. This approach begins with

initial estimates of the parameters (e.g., variances, covariances,

and means), perhaps obtained using the listwise deletion. In the

expectation step, those estimates are used to create regression

equations that predict the missing data, and the predicted values

are used to fill in the missing data. In the maximization step,

the parameters are re-estimated from the completed data. The

expectation step is then repeated with the new parameters, where

new regression equations are determined to "fill in" the missing

data. The expectation and maximization steps are repeated until

the system stabilizes, when the covariance matrix for the

subsequent iteration is virtually the same as that for the

preceding iteration.
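
The iteration can be sketched for the simplest case of two variables, where x is fully observed and y is partly missing. The sketch follows the usual formulation of the EM algorithm [14]: the expectation step computes the expected value of each missing y (and of its square) from the current parameters, and the maximization step re-estimates the parameters from the completed statistics. The simulated data, the missingness rule, and all variable names are assumptions made for illustration.

```python
import random
from statistics import fmean

random.seed(3)

# Hypothetical bivariate data: x fully observed, y missing more often
# when x is large (a MAR mechanism).
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y_full = [1.0 + 0.8 * xi + random.gauss(0, 0.6) for xi in x]
y = [None if random.random() < (0.5 if xi > 0 else 0.1) else yi
     for xi, yi in zip(x, y_full)]

# x is complete, so its mean and variance are estimated once.
mu_x = fmean(x)
var_x = fmean([(xi - mu_x) ** 2 for xi in x])

# Starting values from the complete cases (listwise deletion).
obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mu_y = fmean([yi for _, yi in obs])
var_y = fmean([(yi - mu_y) ** 2 for _, yi in obs])
cov_xy = fmean([(xi - mu_x) * (yi - mu_y) for xi, yi in obs])
cc_mean_y = mu_y  # complete-case estimate, kept for comparison

for iteration in range(200):
    # Regression of y on x implied by the current parameters.
    slope = cov_xy / var_x
    intercept = mu_y - slope * mu_x
    resid_var = var_y - slope * cov_xy

    # E-step: expected y and expected y squared for every case.
    ey, ey2 = [], []
    for xi, yi in zip(x, y):
        if yi is None:
            pred = intercept + slope * xi
            ey.append(pred)
            ey2.append(pred ** 2 + resid_var)
        else:
            ey.append(yi)
            ey2.append(yi ** 2)

    # M-step: re-estimate the parameters from the completed statistics.
    new_mu_y = fmean(ey)
    new_var_y = fmean(ey2) - new_mu_y ** 2
    new_cov_xy = fmean([xi * e for xi, e in zip(x, ey)]) - mu_x * new_mu_y

    change = max(abs(new_mu_y - mu_y), abs(new_var_y - var_y),
                 abs(new_cov_xy - cov_xy))
    mu_y, var_y, cov_xy = new_mu_y, new_var_y, new_cov_xy
    if change < 1e-10:  # parameters have stabilized
        break

print("complete-case mean of y:", round(cc_mean_y, 3))
print("EM estimate of mean of y:", round(mu_y, 3))
print("mean of y before deletion:", round(fmean(y_full), 3))
```

Adding resid_var to the expected square in the E-step is what keeps the variance estimate from shrinking; filling in the predicted values alone would understate the spread of y.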

An important characteristic of the expectation-maximization

imputation is that when the new data set with no missing values

is generated, a random disturbance term for each imputed value

is incorporated in order to reflect the uncertainty associated

with the imputation. However, the expectation-maximization

imputation has some disadvantages. This approach can take a

long time to converge, especially when there is a large fraction

of missing data, and some statisticians consider it too complex

to be acceptable. This approach can lead to biased

parameter estimates and can underestimate the standard error.

For the expectation-maximization imputation method, a

predicted value based on the variables that are available for

each case is substituted for the missing data. Because a single

imputation omits the possible differences among the multiple

imputations, a single imputation will tend to underestimate the

standard errors and thus overestimate the level of precision.

Thus, a single imputation gives the researcher more apparent

power than the data actually provide.

Multiple imputation

Multiple imputation is another useful strategy for handling

the missing data. In a multiple imputation, instead of substituting

a single value for each missing data, the missing values are

replaced with a set of plausible values which contain the natural

variability and uncertainty of the right values.

This approach begins with a prediction of the missing data

using the existing data from other variables [15]. The missing

values are then replaced with the predicted values, and a full

data set called the imputed data set is created. This process

is repeated several times to produce multiple imputed data sets

(hence the term “multiple imputation”). Each multiple imputed

data set produced is then analyzed using the standard statistical

analysis procedures for complete data, and gives multiple

analysis results. Subsequently, by combining these analysis

results, a single overall analysis result is produced.
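
The impute, analyze, and combine cycle can be sketched as follows. This is a simplified illustration and not the procedure of any particular software package: the data are simulated, the imputation model is a single linear regression, and refitting that regression on a bootstrap resample of the complete cases is used here as a stand-in for drawing the regression parameters from their posterior distribution. The results are combined with Rubin's rules, in which the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import random
from statistics import fmean, variance

random.seed(4)

# Hypothetical data: x fully observed, y missing more often when x is large.
n = 400
x = [random.gauss(0, 1) for _ in range(n)]
y_full = [1.0 + 0.8 * xi + random.gauss(0, 0.6) for xi in x]
y = [None if random.random() < (0.5 if xi > 0 else 0.1) else yi
     for xi, yi in zip(x, y_full)]
observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]


def fit_line(pairs):
    """Least-squares intercept, slope, and residual SD for y on x."""
    mx = fmean([p[0] for p in pairs])
    my = fmean([p[1] for p in pairs])
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    slope = sum((p[0] - mx) * (p[1] - my) for p in pairs) / sxx
    intercept = my - slope * mx
    ss = sum((p[1] - intercept - slope * p[0]) ** 2 for p in pairs)
    return intercept, slope, (ss / (len(pairs) - 2)) ** 0.5


def impute_once():
    """One imputed data set: bootstrap refit plus random residual noise."""
    boot = random.choices(observed, k=len(observed))
    a, b, s = fit_line(boot)
    return [a + b * xi + random.gauss(0, s) if yi is None else yi
            for xi, yi in zip(x, y)]


m = 20  # number of imputed data sets
estimates, variances = [], []
for _ in range(m):
    completed = impute_once()
    estimates.append(fmean(completed))          # analysis: mean of y
    variances.append(variance(completed) / n)   # its squared standard error

# Rubin's rules for combining the m analysis results.
pooled_estimate = fmean(estimates)
within = fmean(variances)
between = variance(estimates)
total_variance = within + (1 + 1 / m) * between
pooled_se = total_variance ** 0.5

print("pooled mean of y:", round(pooled_estimate, 3))
print("pooled standard error:", round(pooled_se, 3))
print("standard error ignoring between-imputation variance:",
      round(within ** 0.5, 3))
```

The between-imputation term is what carries the uncertainty due to the missing data into the final standard error; a single imputation has no such term.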

The benefit of the multiple imputation is that in addition to

restoring the natural variability of the missing values, it incorpo-

rates the uncertainty due to the missing data, which results in

a valid statistical inference. Restoring the natural variability of

the missing data can be achieved by replacing the missing data

with the imputed values which are predicted using the variables

correlated with the missing data. Incorporating uncertainty is

made by producing different versions of the missing data and

observing the variability between the imputed data sets.

Multiple imputation has been shown to produce valid stati-

stical inference that reflects the uncertainty associated with the

estimation of the missing data. Furthermore, multiple impu-

tation turns out to be robust to the violation of the normality

assumptions and produces appropriate results even in the pre-

sence of a small sample size or a high number of missing data.

With the development of novel statistical software, although

the statistical principles of multiple imputation may be difficult

to understand, the approach may be utilized easily.

Sensitivity analysis

Sensitivity analysis is defined as the study of how

the uncertainty in the output of a model can be allocated to the

different sources of uncertainty in its inputs.

When analyzing the missing data, additional assumptions

on the reasons for the missing data are made, and these assum-

ptions are often applicable to the primary analysis. However,

the assumptions cannot be definitively validated for the

correctness. Therefore, the National Research Council has pro-

posed that the sensitivity analysis be conducted to evaluate

the robustness of the results to the deviations from the MAR

assumption [13].
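
One common way to carry out such an analysis is a delta adjustment, sometimes called a tipping-point analysis. It is shown here as one possible approach and is not prescribed by the article. The missing outcomes in one arm are first filled in under a MAR-type assumption and are then shifted by an increasing amount delta to see how large a departure from that assumption would be needed to change the conclusion. The pain scores, the arm names, and the use of the observed arm mean as the fill-in value are hypothetical simplifications.

```python
from statistics import fmean

# Hypothetical pain scores (lower is better); None marks a dropout.
treatment = [2.1, 3.0, None, 2.5, None, 1.8, 2.9, None, 2.2, 2.6]
control = [3.4, 3.9, 3.1, None, 4.2, 3.6, 3.8, 3.3, None, 4.0]


def mean_with_shift(values, delta):
    """Fill dropouts with the observed mean plus delta, then average."""
    observed = [v for v in values if v is not None]
    fill = fmean(observed) + delta
    return fmean([fill if v is None else v for v in values])


def difference(delta):
    """Control minus treatment mean when treatment dropouts are delta worse."""
    return mean_with_shift(control, 0.0) - mean_with_shift(treatment, delta)


for delta in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0):
    print(f"delta = {delta:.1f}: control - treatment = {difference(delta):.2f}")
```

If the apparent benefit disappears only at a value of delta that is clinically implausible, the conclusion can be regarded as robust to departures from the MAR assumption; if it disappears at a small delta, the result depends heavily on that assumption.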

Recommendations

Missing data reduces the power of a trial. Some amount of

missing data is expected, and the target sample size is increased

to allow for it. However, this cannot eliminate the potential

bias. More attention should be paid to the missing data in the

design and performance of the studies and in the analysis of the

resulting data.
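
The usual arithmetic for inflating the sample size is to divide the number of participants required by the expected proportion who will provide data. The function name enrollment_target and the example numbers below are hypothetical, and the calculation addresses only the loss of power, not the bias.

```python
def enrollment_target(n_required, expected_missing_pct):
    """Smallest enrollment whose expected completers reach n_required.

    Both arguments are integers; expected_missing_pct is a percentage
    from 0 to 99.
    """
    if not 0 <= expected_missing_pct < 100:
        raise ValueError("expected_missing_pct must be between 0 and 99")
    retained_pct = 100 - expected_missing_pct
    # Ceiling division on integers avoids floating-point rounding.
    return -(-n_required * 100 // retained_pct)


print(enrollment_target(128, 20))  # 160
print(enrollment_target(100, 15))  # 118
```

With 128 participants required and 20% expected to be lost, 160 should be enrolled, because 80% of 160 is 128.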

The best solution to the missing data is to maximize the data

collection when the study protocol is designed and the data

collected. Application of the sophisticated statistical analysis

techniques should only be performed after the maximal efforts

have been employed to reduce missing data in the design and

prevention techniques.

A statistically valid analysis which has appropriate mechanisms

and assumptions for the missing data should be conducted. Single

imputation and LOCF are not optimal approaches for the final

analysis, as they can cause bias and lead to invalid conclusions.

All variables which present the potential mechanisms to explain

the missing data must be included, even when these variables

are not included in the analysis [16]. Researchers should seek

to understand the reasons for the missing data. Distinguishing

what should and should not be imputed is usually not possible

using a single code for every type of the missing value [17]. It

is difficult to know whether the multiple imputation or full

maximum likelihood estimation is best, but both are superior to

the traditional approaches. Both techniques are best used with

large samples. In general, multiple imputation is a good approach

when analyzing data sets with missing data.

References

1. Graham JW. Missing data analysis: making it work in the real world.

Annu Rev Psychol 2009; 60: 549-76.

2. Little RJ, D'Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar

JT, et al. The prevention and treatment of missing data in clinical

trials. N Engl J Med 2012; 367: 1355-60.

3. O'Neill RT, Temple R. The prevention and treatment of missing data

in clinical trials: an FDA perspective on the importance of dealing

with it. Clin Pharmacol Ther 2012; 91: 550-4.

4. Rubin DB. Inference and missing data. Biometrika 1976; 63: 581-92.

5. DeSarbo S, Green PE, Carroll JD. An alternating least-squares

procedure for estimating missing preference data in product-

concept testing. Decision Sciences 1986; 17: 163-85.

6. Wisniewski SR, Leon AC, Otto MW, Trivedi MH. Prevention of

missing data in clinical research studies. Biol Psychiatry 2006; 59:

997-1000.

7. Scharfstein DO, Hogan J, Herman A. On the prevention and analysis

of missing data in randomized clinical trials: the state of the art. J

Bone Joint Surg Am 2012; 94 Suppl 1: 80-4.

8. Wilcox S, Shumaker SA, Bowen DJ, Naughton MJ, Rosal MC, Ludlam

SE, et al. Promoting adherence and retention to clinical trials in

special populations: a women's health initiative workshop. Control

Clin Trials 2001; 22: 279-89.

9. Donner A. The relative effectiveness of procedures commonly used

in multiple regression analysis for dealing with missing values. Am

Stat 1982; 36: 378-81.

10. Kim JO, Curry J. The treatment of missing data in multivariate

analysis. Sociol Methods Res 1977; 6: 215-41.

11. Malhotra N. Analyzing marketing research data with incomplete

information on the dependent variable. J Mark Res 1987; 24: 74-84.

12. Hamer RM, Simpson PM. Last observation carried forward versus

mixed models in the analysis of psychiatric clinical trials. Am J

Psychiatry 2009; 166: 639-41.

13. Panel on Missing Data in Clinical Trials. The prevention and treat-

ment of missing data in clinical trials. 2nd ed. Washington DC,

National Academies Press. 2010, pp 107-14.

14. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from

incomplete data via the EM algorithm. J R Stat Soc Series B 1977; 39: 1-38.

15. Sinharay S, Stern HS, Russell D. The use of multiple imputation for

the analysis of missing data. Psychol Methods 2001; 6: 317-29.

16. Rubin DB. Multiple imputation after 18+ years (with discussion). J

Am Stat Assoc 1996; 91: 473-89.

17. Acock AC. Working with missing values. J Marriage Fam 2005; 67:

1012-28.