Data Screening Basics Stats
Warner, R. M. (2021). Applied statistics II: Multivariable and Multivariate Techniques. Los Angeles, CA: Sage Publications. ISBN: 978-1-5443-9872-3
CHAPTER 2 ADVANCED DATA SCREENING: OUTLIERS AND MISSING VALUES
2.1 INTRODUCTION
Extensive data screening should be conducted prior to all analyses. Univariate and bivariate data screening are still necessary (as described in Volume I [Warner, 2020]). This chapter provides further discussion of outliers and procedures for handling missing data. It is important to formulate decision rules for data screening and handling prior to data collection and to document the process thoroughly. During data screening, a researcher does several things:
Correct errors.
“Get to know” the data (for example, identify distribution shapes).
Assess whether assumptions required for intended analyses are satisfied.
Correct violations of assumptions, if possible.
Identify and remedy problems such as outliers, skewness, and missing values.
The following section suggests ways to keep track of the data-screening process for large numbers of variables.
2.2 VARIABLE NAMES AND FILE MANAGEMENT
2.2.1 Case Identification Numbers
If there are no case identification numbers, create them. Often the original case numbers used to identify individuals during data collection in social sciences are removed to ensure confidentiality. Case numbers that correspond to row numbers in the SPSS file can be created using this command: COMPUTE idnumber = $casenum (where $casenum denotes row number in the SPSS file). The variable idnumber can be used to label individual cases in graphs and identify which cases have outliers or missing values.
2.2.2 Codes for Missing Values
Missing values are usually identified by leaving cells in the SPSS data worksheet blank. In recode or compute statements, a blank cell corresponds to the value $sysmis. In some kinds of research, it is useful to document different reasons for missing values (Acock, 2005). For example, a survey response can be missing because a participant refuses to answer or cannot remember the information; a physiological measure may be missed because of equipment malfunction.
Different numerical codes can be used for each type of missing value. Be sure to use number codes for missing values that cannot occur as valid score values. For example, number of tickets for traffic violations could be coded 888 for “refused to answer” and 999 for “could not remember.” Archival data files sometimes use multiple codes for missing values. Missing values are specified and labeled in the SPSS Data Editor Variable View tab.
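The ideas in Sections 2.2.1 and 2.2.2 can be illustrated outside SPSS. The following is a minimal Python sketch, not SPSS syntax: the function names, the `tickets` list, and the use of `None` for a missing score are all assumptions made for illustration. Only the codes 888 and 999 come from the text.

```python
# Codes for missing values taken from the example in the text.
MISSING_CODES = {888: "refused to answer", 999: "could not remember"}


def assign_ids(scores):
    """Pair each score with a 1-based case number (analogous to $casenum)."""
    return [(idnumber, score) for idnumber, score in enumerate(scores, start=1)]


def recode_missing(cases, codes=MISSING_CODES):
    """Replace missing-value codes with None and tally the reason for each."""
    cleaned = []
    reasons = {label: 0 for label in codes.values()}
    for idnumber, score in cases:
        if score in codes:
            reasons[codes[score]] += 1
            cleaned.append((idnumber, None))
        else:
            cleaned.append((idnumber, score))
    return cleaned, reasons


def percent_missing(cleaned):
    """Percentage of cases whose score is missing (None)."""
    n_missing = sum(1 for _, score in cleaned if score is None)
    return 100.0 * n_missing / len(cleaned)


# Hypothetical data: number of traffic tickets reported by 10 respondents.
tickets = [0, 2, 888, 1, 0, 999, 3, 0, 888, 1]
cases = assign_ids(tickets)
cleaned, reasons = recode_missing(cases)

print(reasons)                   # how often each reason for missingness occurred
print(percent_missing(cleaned))  # share of cases with a missing score
print([idnumber for idnumber, score in cleaned if score is None])  # which cases
```

Keeping the reason counts separate from the recoded scores preserves the documentation of why values are missing, while the case numbers make it possible to trace each missing score back to a row in the file.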
2.2.3 Keeping Track of Files
It is common for data analysts to go through a multiple-step process in data screening; this is particularly likely when longitudinal data are collected. A flowchart may be needed to keep track of scores that are modified and cases that are lost due to attrition. The CONSORT (Consolidated Standards of Reporting Trials) protocol describes a way to do this (Boutron, 2017). Figure 2.1 shows a template for a CONSORT flow diagram.
Figure 2.1 Flowchart: CONSORT Protocol to Track Participant Attrition and Data Handling. Source: http://www.consort-statement.org/consort-statement/flow-diagram.
It is important to retain the original data file and save modified data files at every step during this process. If you change your mind about some decisions, or discover errors, you may need to go back to earlier versions of files. Keep a detailed log that documents what was done to data at each step. Use of file names that include the date and time of file creation and/or words that remind you what was done at each step can be helpful when you need to locate the most recent version or backtrack to earlier versions. Naming a file “final” is almost never a good idea. (Files are time-stamped by computers, but these time stamps are not always adequate information.)
2.2.4 Use Different Variable Names to Keep Track of Modifications
If a variable will be transformed or recoded before use in later analysis, it is helpful to use an initial variable name indicating that this change has not yet been done. For example, an initial score for reaction time could be named raw_rt. The log-transformed version of the variable could be called log_rt. As another example, some self-report measures include reverse-worded questions. For example, most items in a depression scale might be worded such that higher degree of agreement indicates more depression (e.g., “I feel sad most of the time,” rated on a scale from 1 = strongly disagree to 5 = strongly agree).
Some items might be reverse worded (such that a high score indicates less depression; e.g., “Most days I am happy”). Before scores can be summed to create a total depression score, scoring for reverse-worded questions must be changed to make scores consistent (e.g., a score of 5 always indicates higher depression). The initial name for a reverse-worded question could be rev_depression1. The “rev” prefix would indicate that this item was worded in a direction opposite from other items. After recoding to change the direction of scoring, the new variable name could be depression1 (without the “rev” prefix). Then the total scale score could be computed by summing depression1, depression2, and so on. Avoid using the same names for variables before and after transformations or recodes, because this can lead to confusion.
2.2.5 Save SPSS Syntax
The Paste button in SPSS dialog boxes can be used to save SPSS commands generated by your menu selections into a syntax file. Save all SPSS syntax used to recode, transform, compute new variables, or make other modifications during the data preparation process. The syntax file documents what was done, and if errors are discovered, syntax can be edited to make corrections and all analyses can be done again. This can save considerable time. Data screening is needed so that when the analyses of primary interest are conducted, the best possible information is available. If problems such as outliers and missing values are not corrected during data screening, the results of final analyses are likely to be biased.
2.3 SOURCES OF BIAS
Bias can be defined as over- or underestimation of statistics such as values of M, t ratios, and p values. Bias means that the sample statistic over- or underestimates the corresponding population parameters (e.g., M is systematically larger or smaller than μ). Bias can occur when assumptions for analyses are violated and when outliers or missing data are present. Most of the statistics in this book (except for logistic regression) are special cases of the general linear model (GLM). Most GLM analyses were developed on the basis of the following assumptions. Some assumptions are explicit (i.e., assumed in derivations of statistics). There are also implicit assumptions and rules for the use of significance tests in practice (e.g., don’t run hundreds of tests and report only those with p < .05; selected p values will greatly underestimate the risk for Type I decision error). Problems such as outliers and missing data often arise in real-world data. The actual practice of statistics is much messier than the ideal world imagined by mathematical statisticians. Here is a list of concerns that should be addressed in data screening. Some of these things are relatively easy to identify and correct, while others are more difficult.
Scores within samples must be independent of one another. Whether this assumption is satisfied depends primarily on how data were collected (Volume I [Warner, 2020], Chapter 2). Scores in samples are not independent if participants can influence one another’s behavior through processes such as persuasion, cooperation, imitation, or competition. See Kenny and Judd (1986, 1996) for discussion. When this assumption is violated, estimates of SD or SS_within are often too small; that makes estimates of t or F too large and results in inflated risk for Type I error. Violations of this assumption are a serious problem.
All relationships among variables are linear. This is an extremely important assumption that we can check in samples by visual examination of scatterplots and by tests of nonlinearity. Nonlinear terms (such as X², in addition to X as a predictor of Y) can be added to linear regression models, but sometimes nonlinearity points to the need for other courses of action, such as nonlinear data transformation or analyses outside the GLM family.
Missing values can lead to bias in the composition of samples and corresponding bias in estimation of statistics. Often, cases with missing values differ in some way from the cases with complete data. Suppose men are more likely not to answer a question about depression than women, or that students with low grades are more likely to skip questions about academic performance. If these cases are dropped, the sample becomes biased (the sample will underrepresent men and/or low-performing students).
Later sections in this chapter discuss methods for evaluation of amount and pattern of missing values and replacement of missing values with estimated or imputed scores. Whether problems with missing values can be remedied depends on the reasons for missingness, as discussed in those sections.
Residuals or prediction errors are independent of one another, are normally distributed, and have mean of 0 and equal variance for all values of predictor variables. For regression analysis and related techniques such as time-series analysis, these assumptions can be evaluated using plots and descriptive statistics for residuals. Data analysts should beware the temptation to drop cases just because they have large residuals (Tabachnick & Fidell, 2018). This can amount to trimming the data to fit the model. Users of regression are more likely to focus on residuals than users of analysis of variance (ANOVA).
Some sample distribution shapes make M a poor description of central tendency. For example, a bimodal distribution of ratings on a 1-to-7 scale, with a mode at the lowest and highest scores (as we would see for highly polarized ratings), is not well described by a sample mean (see Volume I [Warner, 2020], Chapter 5). We need to do something else with these data. If sample size is large enough, we may be able to treat each X score (e.g., X = 1, X = 2, …, X = 7) as a separate group. With large samples (on the order of thousands) it may be better to treat some quantitative variables as categorical.
Some distribution types require different kinds of analysis. For example, when a Y dependent variable is a count of behaviors such as occasions of drug use, the histogram for the distribution of Y may have a mode at 0. Analyses outside the GLM family, such as zero-inflated negative binomial regression, may be needed for this kind of dependent variable (see Appendix 2A). The remedy for this kind of problem is to choose an appropriate analysis.
Skewness of sample distribution shape. Skewness can be evaluated by visual examination of histograms. SPSS provides a skewness index and its standard error; statistical significance of skewness can be assessed by examining z = skewness/SE_skewness, using the standard normal distribution to evaluate z. However, visual examination is often adequate and may provide insight into reasons for skewness that the skewness index by itself cannot provide. Skewness is not always a major problem. Sometimes sample skewness can be eliminated or reduced by removal or modification of outliers. If skewness is severe and not due to just a few outliers, transformations such as log may be useful ways to reduce skewness (discussed in a later section).
Derivations of many statistical significance tests assume that scores in samples are randomly selected from normally distributed populations. This raises two issues. On one hand, some data analysts worry about the normality of their population distributions. I worry more about the use of convenience samples that were not selected from any well-defined population. The use of convenience samples can limit generalizability of results. On the other hand, Monte Carlo simulations that evaluate violations of this normally distributed population assumption for artificially generated populations of data and simple analyses such as the independent-samples t test often find that violations of this assumption do not seriously bias p values, provided that samples are not too small (Sawilowsky & Blair, 1992). There are significance tests, such as Levene F, to test differences between sample variances for t and F tests. However, tests that are adjusted to correct for violations of this assumption, such as the “equal variances not assumed” or Welch’s t, are generally thought to be overly conservative. The issue here is that we often don’t know anything about population distribution shape. For some simple analyses, such as independent-samples t and between-S ANOVA, violations of assumption of normal distribution in the population may not cause serious problems.
For more advanced statistical methods, violations of normality assumptions may be much more serious. These problems can be avoided through the use of robust estimation methods that do not require normality assumptions (Field & Wilcox, 2017; Maronna, Martin, Yohai, & Salibián-Barrera, 2019).
Violation of assumption that all variables are measured without error (that all measures are perfectly reliable and perfectly valid). This is almost never true in real data. Advanced techniques such as structural equation modeling include measurement models that take measurement error into account (to some extent).
Model must be properly specified. A properly specified model in regression includes all the predictors of Y that should be included, includes terms such as interactions if these are needed, and does not include “garbage” variables that should not be statistically controlled. We can never be sure that we have a correctly specified model. Kenny (1979) noted that when we add or drop variables from a regression, we can have “bouncing betas” (regression slope estimates can change dramatically). The value of each beta coefficient depends on context (i.e., which other variables are included in the model). Significance tests for b coefficients vary depending on the set of variables that are controlled when assessing each predictor. Another way to say this is that we cannot obtain unbiased estimates of effects unless we control for the “right” set of variables. Decisions about which variables to control are limited by the variables that are available in the data set. Unfortunately, it is common practice for data analysts to add and/or drop control variables until they find that the predictor variable of interest becomes statistically significant.
Some of these assumptions (such as normally distributed scores in the population from which the sample was selected) cannot be checked. Some potential problems can be evaluated through screening of sample data.
2.4 SCREENING SAMPLE DATA
From a practical and applied perspective, what are the most important things to check for during preliminary data screening? First, remember that rules for identification and handling of problems with data, such as outliers, skewness, and missing values, should be established before you collect data. If you experiment with different rules for outlier detection and handling, run numerous analyses, and report selected results, the risk for committing Type I decision error increases, often substantially. Doing whatever it takes to obtain statistically significant values of p is called p-hacking (Wicherts et al., 2016), and this can lead to misleading results (Simmons, Nelson, & Simonsohn, 2011). Committing to decisions about data handling prior to data collection can reduce the temptation to engage in p-hacking.
2.4.1 Data Screening Needed in All Situations
Individual scores should always be evaluated to make sure that all score values are plausible and accurate and that the ranges of scores in the sample (for important variables) correspond at least approximately to the ranges of scores in the hypothetical population of interest. If a study includes persons with depression scores that range only from 0 to 10, we cannot generalize or extrapolate findings to persons with depression scores above 10. A frequency table provides information about range of scores.
Missing values. Begin by evaluating how much data are missing; frequency tables tell us how many missing values there are for each variable. If there are very few missing values (e.g., less than 5% of observations), missing data may not be a great concern, and it may be acceptable to let SPSS use default methods such as listwise deletion for handling missing data. If there are larger amounts of missing data, this raises concerns about whether data are missing systematically. A later section in this chapter discusses the missing data problem further.
Evaluate distribution skewness. When a distribution is asymmetrical, it often has a longer and thinner tail at one end than the other. It is possible that an appearance of skewness arises because of a few outliers. If this is the case, I recommend that you handle this as an outlier problem. A later section discusses possible ways to handle skewness if it is not due to just a few outliers. Some variables (such as income) predictably have very strong positive skewness. Sometimes nonlinear data transformations are used to reduce skewness.
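The ratio of the skewness index to its standard error can be reproduced by hand. The sketch below is a Python illustration under stated assumptions: it uses the bias-adjusted skewness formula and the standard-error formula commonly given for that statistic, which are assumed here to correspond to the values SPSS reports. The function name and the sample scores are hypothetical.

```python
import math


def skewness_and_se(scores):
    """Return (skewness index, standard error of skewness) for a sample.

    Assumes the bias-adjusted skewness formula
        G1 = n / ((n - 1)(n - 2)) * sum(((x - M) / SD) ** 3)
    and SE = sqrt(6n(n - 1) / ((n - 2)(n + 1)(n + 3))).
    """
    n = len(scores)
    if n < 3:
        raise ValueError("at least 3 scores are needed")
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    g1 = n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in scores)
    se = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return g1, se


# Hypothetical positively skewed scores (e.g., income in thousands).
incomes = [18, 20, 21, 22, 24, 25, 27, 30, 34, 41, 58, 95]
skew, se = skewness_and_se(incomes)
z = skew / se  # compare with the standard normal distribution
print(round(skew, 3), round(se, 3), round(z, 3))
```

A histogram of the same scores remains worth examining, because the ratio alone does not show whether the skewness comes from one or two extreme cases or from the overall shape of the distribution.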
2.4.2 Data Screening for Comparison of Group Means
Make sure all groups have adequate n’s. If we have at least n = 30 cases per group, and use two-tailed tests, violations of the population normality assumption and of assumptions about equal population variances do not seriously bias p value estimates (Sawilowsky & Blair, 1992). Some authorities suggest that even smaller values of n may be adequate. I believe that below some point (perhaps n of 20 per group), there is just not sufficient information to describe groups or to evaluate whether the group is similar to populations of interest. However, this is not an ironclad rule. In some kinds of research (such as neuroscience), it is reasonable for researchers to assume little variation among cases with respect to important characteristics such as brain structure and function, and recruiting and paying for cases can be expensive because of time-consuming procedures. For example, in behavioral neuroscience animal research, each case may require extensive training, then surgery, then extensive testing or evaluation or costly laboratory analysis of specimen materials. Procedures such as magnetic resonance imaging are very costly. Sometimes smaller n’s are all we can get.
Check for outliers within groups. Outliers within groups affect estimates of both M and SD, and these in turn will affect estimates of t and p. The effect of outliers may be either to inflate or deflate the t ratio. Boxplots are a common way to identify outliers within groups.
Examine distribution shapes in groups to evaluate whether M is a reasonable description of central tendency. Some distribution shapes (such as a bimodal distribution with modes at the extreme high and low ends of the distribution and distributions with large modes at 0) can make M a poor way to describe central tendency (Volume I [Warner, 2020], Chapter 5). If these distribution shapes are seen in sample data, the data analyst should consider whether comparison of means is a good way to evaluate outcomes.
2.4.3 Data Screening for Correlation and Regression
Check that relations between all variables are linear if you plan to use linear correlation and linear regression methods. In addition, predictor variables should not be too highly correlated with one another. Visual examination of a scatterplot may be sufficient; regression can also be used to evaluate nonlinearity (discussed in Section 2.8).
Check for outliers. Bivariate and multivariate outliers can inflate or deflate correlations among variables. Bivariate outliers can be detected by visual examination of an X, Y scatterplot. For more than two variables, you need to look for multivariate outliers (described later in this chapter).
Evaluate whether X and Y have similar distribution shapes. It may be more important that X and Y have similar distribution shapes than that their sample distribution shapes are normal. When distribution shapes differ, the obtainable values of r will have a limited range (not the full range from –1 to +1). This in turn will influence estimates for analyses that use r as a building block. Visual examination of histograms may be sufficient. See Appendix 10D in Volume I (Warner, 2020).
Evaluate plots of residuals from regression to verify that they are (a) normally distributed and (b) not related to values of Y or Y´ (these are assumptions for regression). If you have only one predictor, screening raw scores on variables may lead to the same conclusions as screening residuals. Tabachnick and Fidell (2018) pointed out that when a researcher runs the final analysis of primary interest and then examines residuals, it can be tempting to remove or modify cases specifically because they cause poor fit in the final analysis. In other words, the data analyst may be tempted to trim the data (post hoc) to fit the model.
Sample distributions that differ drastically from normal may alert you to the need for different kinds of analyses outside the GLM family (an example is provided in Appendix 2A).
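One informal way to screen a bivariate relation for curvature, in addition to looking at a scatterplot, is to fit a straight line and ask whether the residuals are related to the squared (centered) predictor. The Python sketch below illustrates that idea; it is not the procedure the chapter describes in a later section, and the function names and data are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def linear_residuals(x, y):
    """Residuals Y - Y' from the least-squares line Y' = a + bX."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def curvature_check(x, y):
    """Correlation of linear-fit residuals with squared centered X.

    A value far from 0 suggests that a straight line misses a curved trend.
    """
    mx = mean(x)
    return pearson_r([(xi - mx) ** 2 for xi in x], linear_residuals(x, y))


# Hypothetical data: one curved relation, one roughly linear relation.
x = list(range(1, 11))
y_curved = [xi ** 2 for xi in x]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.3, -0.1, -0.3, 0.2, -0.1]
y_linear = [1 + 2 * xi + e for xi, e in zip(x, noise)]

print(round(curvature_check(x, y_curved), 3))
print(round(curvature_check(x, y_linear), 3))
```

Least-squares residuals are uncorrelated with X and with Y´ by construction, so a plain correlation between residuals and predicted scores cannot reveal curvature; that is why plots of residuals, or a check against a nonlinear term as sketched here, are needed.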
2.5 POSSIBLE REMEDY FOR SKEWNESS: NONLINEAR DATA TRANSFORMATIONS
Nonlinear transformations of X (such as 1/X, X², X^c for any value of c, base 10 or natural log of X, arcsine of X, and others) can change the shape of distributions (Tabachnick & Fidell, 2018). Although log transformations can potentially reduce positive or negative skewness in an otherwise normal distribution, they are not always appropriate or effective. In many situations, if distribution shape can be made reasonably normal by modifying or removing outliers, it may be preferable to do that. Log transformations make sense when at least one of the following conditions is met:
The underlying distribution is exponential.
It is conventional to use log transformations with this variable; readers and reviewers are familiar with it.
Scores on the variable differ across orders of magnitude. Scores differ across orders of magnitude when the highest value is vastly larger than the smallest value. Consider the following example: The weight of an elephant can be tens of thousands of times greater than the weight of a mouse. Typical values for body weight for different species, given in kilograms, appear on the X axis in Figure 2.2.
Because of outliers (weight and metabolic rates for elephants), scores for body weight (and metabolism) of smaller species are crowded together in the lower left-hand corner of the graph, making it difficult to distinguish differences among most species. When the base 10 log is taken for both variables, as shown in Figure 2.3, scores for species are spread out more evenly on the X and Y axes. Differences among them are now represented in log units (orders of magnitude). In addition, the relation between log of weight and log of metabolic rate becomes linear (of course, this will not happen for all log-transformed variables).
Figure 2.2 Scatterplot of Metabolic Rate by Body Weight (Raw Scores). Source: Reprinted with permission from Dr. Tatsuo Motokawa.
Figure 2.3 Scatterplot of Log Metabolic Rate by Log Body Weight. Source: Reprinted with permission from Dr. Tatsuo Motokawa.
Other transformations commonly used in some areas of psychology involve power functions, that is, replacing X with X², or X^c (where the exponent c is not necessarily an integer value). Power transformations are used in psychophysical studies (e.g., to examine how perceived heaviness of objects is related to physical mass). When individual scores for cases are proportions, percentages, or correlations, other nonlinear transformations may be needed. Data transformations such as arcsine (for proportions) or Fisher r to Z are used to correct problems with the shapes of sampling distributions that arise when the range of possible score values has fixed end points (–1 to +1 for correlation, 0 to 1.00 for proportion). If you use nonlinear transformations to reduce skewness, examine a histogram for the transformed scores to see whether the transformation had the desired effect. In my experience, distributions of log-transformed scores often do not look any better than the raw scores.
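The effect of a base 10 log on scores that span orders of magnitude, and the Fisher r to Z transformation, can both be sketched in a few lines of Python. The body weights below are rough, made-up values chosen only to span several orders of magnitude; they are not the data plotted in Figures 2.2 and 2.3, and the names are hypothetical.

```python
import math

# Hypothetical body weights in kilograms (illustrative values only).
weights_kg = {
    "mouse": 0.02,
    "rat": 0.3,
    "cat": 4.0,
    "human": 70.0,
    "horse": 500.0,
    "elephant": 5000.0,
}

# Base 10 log: each unit on the new scale is one order of magnitude.
log_weights = {species: math.log10(kg) for species, kg in weights_kg.items()}

raw_ratio = max(weights_kg.values()) / min(weights_kg.values())
log_range = max(log_weights.values()) - min(log_weights.values())


def fisher_z(r):
    """Fisher r-to-Z transformation: Z = 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)


for species, value in log_weights.items():
    print(f"{species:10s} raw = {weights_kg[species]:8.2f}  log10 = {value:6.2f}")
print(round(raw_ratio), round(log_range, 2))
print(round(fisher_z(0.5), 4))
```

On the raw scale the largest weight is hundreds of thousands of times the smallest, so most cases crowd near zero; on the log scale the same cases are spread over a range of a few units. A histogram of the transformed scores is still needed to judge whether the transformation helped.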
When X does not have a very wide range, the correlation of X with X², or X with log X, is often very close to 1. In these situations, the transformation does not have much effect on distribution shape.
2.6 IDENTIFICATION OF OUTLIERS
2.6.1 Univariate Outliers
Outliers can be a problem because many widely used statistics, such as the sample mean M, are not robust against the effect of outliers. In turn, other statistics that use M in computations (such as SD and r and t) can also be influenced by outliers. Outliers can bias estimates of parameters, effect sizes, standard errors, confidence intervals, and test statistics such as t and F ratios and their corresponding p values (Field, 2018). It is often possible to anticipate which variables are likely to have outliers. If scores are ratings on 1-to-5 or 1-to-7 scales, extreme outliers cannot occur. However, many variables (such as income) have no fixed upper limit; in these situations, outliers are common. When you know ahead of time that some of your variables are likely to generate outliers, it’s important to make decisions ahead of time. What rules will you use to identify scores as outliers, and what methods will you use to handle outliers? Outliers are sometimes obtained because of equipment malfunction or other forms of measurement error. If groups will be compared, outlier evaluation should be done separately within each group (e.g., a separate boxplot within each group). To review briefly: In boxplots, scores that lie outside the “whiskers” can be considered potential outliers (an open circle represents an outlier; an asterisk represents an extreme outlier). Boxplots are particularly appropriate for non-normally distributed data. Scores can be identified as outliers if they have z values greater than 3.29 in absolute value for the distribution within each group (Tabachnick & Fidell, 2018). These are arbitrary rules; they are suggested here because they make sense in a wide range of situations. Aguinas, Gottfredson, and Joo (2013) provided numerous other possible suggestions for outlier identification.
2.6.2 Bivariate and Multivariate Outliers
Bivariate outliers affect estimates of correlations and regression slopes. In bivariate scatterplots it is easy to see whether an individual data point is far away from the cloud that contains most other data points. This distance can be quantified by computing a Mahalanobis distance.
Mahalanobis distance can be generalized to situations with larger numbers of variables. A score with a large Mahalanobis distance corresponds to a point that is outside the cloud that contains most of the other data points, as shown in the three-dimensional plot for three variables in Figure 2.4. The most extreme multivariate outlier is shown as a filled circle near the top.
Figure 2.4 Multivariate Outlier for Combination of Three Variables. Source: Data selected and extensively modified from Warner, Frye, Morrell, and Carey (2017). Note: Fat is the number of fat servings per day, and sugar is the number of sugar calories per day.
Figure 2.5 Linear Regression Dialog Boxes
Mahalanobis distance can be obtained as a diagnostic when running analyses such as multiple regression and discriminant analysis. Tabachnick and Fidell (2018) suggested a method to obtain Mahalanobis distance for a set of variables without “previewing” the final regression analysis of interest. Their suggested method avoids the temptation to remove outliers that reduce goodness of fit for the final model. They suggested using the case identification number as the dependent variable in a linear regression and using the entire set of variables to be examined for multivariate outliers as predictors. (This works because multivariate outliers among predictors are unaffected by subject identification number; Tabachnick & Fidell, 2018). Data in the file outlierfvi.sav are used to demonstrate how to obtain and interpret Mahalanobis distance for a set of hypothetical data. The initial menu selections are <Analyze> → <Regression> → <Linear>. This opens the Linear Regression dialog box on the left-hand side of Figure 2.5. Idnumber is entered as the dependent variable. To examine whether there are multivariate outliers in a set of three variables, all three variables are entered as predictor variables in the Linear Regression dialog box. Click the Save button. This opens the Linear Regression: Save dialog box that appears on the right-hand side of Figure 2.5. Check the box for “Mahalanobis.” Click Continue, then OK. After the regression has been run, SPSS Data View (Figure 2.6) has a new variable named MAH_1. (The tag “_1” at the end of the variable name indicates that this is from the first regression analysis that was run.) This is the Mahalanobis distance score for each individual participant; it tells you the degree to which that person’s combination of scores on fat, sugar, and body mass index (BMI) was a multivariate outlier, relative to the cloud that these scores occupied in three-dimensional space (shown previously in Figure 2.4).
The file was sorted in descending order by values of MAH_1; the part of the file that appears in Figure 2.6 shows a subset of persons whose scores could be identified as multivariate outliers, because they had large values of Mahalanobis distance (many other cases not shown in Figure 2.6 had smaller values of Mahalanobis distance). Mahalanobis distance has a χ² distribution with df equal to the number of predictor variables (Tabachnick & Fidell, 2018). The largest value was MAH_1 = 77.76 (for idnumber = 421). The critical value of chi squared with 3 df, using α = .001, is 16.27. Using that value of χ² as a criterion, MAH_1 would be judged statistically significant for all cases listed in Figure 2.6. If the decision to use Mahalanobis distance as a criterion for the identification of outliers was made prior to data screening, scores for all three variables for the cases with significant values of Mahalanobis distance could be converted to missing values. If this results in fewer than 5% missing values, this small amount of missing data may not bias results. If more than 5% of cases have missing values, some form of imputation (described elsewhere in the chapter) could be used to replace the missing values with reasonable estimates.
Figure 2.6 SPSS Data View With Saved Mahalanobis Distance
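The computation behind a saved Mahalanobis value can be sketched directly, without running a regression. The Python code below is an illustration under stated assumptions: it computes the squared Mahalanobis distance of each case from the centroid using the sample covariance matrix, which is assumed here to be the quantity SPSS saves as MAH_1. The data are made up (they are not the outlierfvi.sav data), and all function names are hypothetical.

```python
import math


def column_means(rows):
    """Mean of each variable (column)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    means = column_means(rows)
    p = len(means)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - m for x, m in zip(row, means)]
        for j in range(p):
            for k in range(p):
                cov[j][k] += dev[j] * dev[k] / (n - 1)
    return cov


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    p = len(b)
    m = [list(a[i]) + [b[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(p):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][p] / m[i][i] for i in range(p)]


def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of every case from the centroid."""
    means = column_means(rows)
    cov = covariance_matrix(rows)
    distances = []
    for row in rows:
        dev = [x - m for x, m in zip(row, means)]
        weighted = solve(cov, dev)  # cov^-1 @ dev without forming the inverse
        distances.append(sum(d * w for d, w in zip(dev, weighted)))
    return distances


def chi2_sf_df3(x):
    """Upper-tail p value for chi-square with exactly 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(
        2.0 * x / math.pi
    ) * math.exp(-x / 2.0)


# Made-up (fat servings, sugar calories, BMI) scores for 30 cases, plus one
# case with an implausible 16 fat servings per day.
rows = [
    (3.0 + 0.5 * (i % 5), 200.0 + 20.0 * (i % 7), 20.0 + 1.5 * (i % 4))
    for i in range(30)
]
rows.append((16.0, 300.0, 21.0))

CRITICAL = 16.27  # chi-square critical value from the text (3 df, alpha = .001)
d2 = mahalanobis_sq(rows)
flagged = [idnumber for idnumber, d in enumerate(d2, start=1) if d > CRITICAL]
print(flagged)
print(round(chi2_sf_df3(CRITICAL), 4))
```

In a sample of n cases, a single squared distance computed this way cannot exceed (n − 1)²/n, so in very small samples no case can reach the critical value of 16.27 no matter how extreme it is.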
Examination of scores for sugar and fat consumption and BMI for the case on the first row in Figure 2.6 indicates that this person had a BMI within normal range (the normal range for BMI is generally defined as 18.5 to 24.9 kg/m²), even though this person reported consuming 16 servings of fat per day. (The value of 16 servings of fat per day was a univariate outlier.) Although this might be physically possible, this seems unlikely. In actual data screening, 16 servings of fat would have been tagged as a univariate outlier and modified at an earlier stage in data screening. Several additional cases had statistically significant values for Mahalanobis distance. When there are numerous multivariate outliers, Tabachnick and Fidell (2018) suggested additional examination of this group of cases to see what might distinguish them from nonoutlier cases.
2.7 HANDLING OUTLIERS
2.7.1 Use Different Analyses: Nonparametric or Robust Methods
The most widely used parametric statistics (those covered in Volume I [Warner, 2020], and the present volume) that are part of the GLM are generally not robust against the effect of outliers. One way to handle outliers is to use different analyses. Many nonparametric statistics convert scores to ranks as part of computation; this gets rid of outliers. However, it would be incorrect to assume that use of nonparametric statistics makes everything simple. Statistics such as the Wilcoxon rank sum test do not require scores to be normally distributed, but they assume that the distribution shape is the same across groups, and in practice, data often violate that assumption. Robust statistical techniques, often implemented using R (Field & Wilcox, 2017; Maronna et al., 2019), do not require the assumptions made for GLM. Robust methods are beyond the scope of this volume. They will likely become more widely used in the future.
2.7.2 Handling Univariate Outliers

Suppose that you have identified scores in your data file as univariate outliers (because they were tagged in a boxplot, because they had z > 3.29 in absolute value, or on the basis of other rules). Rules for identification and handling of outliers should be decided before data collection, if possible. Here are the four most obvious choices for outlier handling; there are many other ways (Aguinis et al., 2013).

1. Do nothing. Run the analysis with the outliers included.
2. Discard all outliers. Removal of extreme values is often called truncation or trimming.
3. Replace all outliers with the next largest score value that is not considered an outlier. The information in boxplots can be used to identify outliers and find the next largest score value that is not an outlier. This is called Winsorizing.
4. Run the analysis with the outliers included, and also with the outliers excluded, and report both analyses. (Do not just report the version of the analysis that you liked better.)
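The Winsorizing option can be sketched as follows. This is a minimal illustration, not an SPSS procedure: the scores are hypothetical, the function name winsorize_boxplot is made up, and outliers are defined here by the common boxplot rule (more than 1.5 × the interquartile range beyond the quartiles). Quartile conventions differ across programs, so the fences computed by Python's statistics.quantiles may not match those in an SPSS boxplot exactly.

```python
import statistics

def winsorize_boxplot(scores):
    """Replace boxplot outliers with the nearest score that is not an outlier."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Scores inside the fences are the non-outliers
    inside = [s for s in scores if low_fence <= s <= high_fence]
    lowest_ok, highest_ok = min(inside), max(inside)
    # Pull each outlier back to the closest non-outlier value
    return [min(max(s, lowest_ok), highest_ok) for s in scores]

servings_of_fat = [2, 3, 3, 4, 4, 5, 5, 6, 16]   # hypothetical scores
winsorized = winsorize_boxplot(servings_of_fat)
n_changed = sum(a != b for a, b in zip(servings_of_fat, winsorized))
print(winsorized, n_changed)
```

Counting the modified scores (n_changed) supports the documentation requirement: how many outliers were identified, by what rule, and what was done with them.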
No matter which of these guidelines you choose, you must document how many outliers were identified, using what rule, and what was done with these outliers. Try to avoid using different rules for different variables or cases. If you have a different story about each data point you remove, it will sound like p-hacking, and in fact, it will probably be p-hacking. (That said, there may be precedent or specific reasons for outlier handling that apply to some variables and not others.) Do not experiment with different choices for outlier elimination and modification and then report the version of analysis you like best. That is p-hacking; the reported p value will greatly underestimate the true risk for Type I decision error.

2.7.3 Handling Bivariate and Multivariate Outliers

Consider bivariate outliers first. If you have scores for height in inches (X) and body weight in pounds (Y), and one case has X = 73 and Y = 110, the univariate scores are not extreme. The combination, however, would be very unusual. Winsorizing might not get rid of the problem, but you could do other things (exclude the case, or run the analysis both with and without this case), as long as you can justify your choice on the basis of plans you made prior to data collection. For multivariate outliers, it may be possible to identify which one or two variables make the case an outlier. In the previous example of a multivariate outlier, the extremely high value of fat (in row 1 of the data file that appears in Figure 2.6) seemed inconsistent with the normal BMI score. A decision might be made to replace the high fat score with a lower valid score value that is not an outlier. However, detailed evaluation of multivariate outliers to assess whether one or two variables are responsible may be too time consuming to be practical. Some multivariate outliers may disappear when univariate outliers have been modified. However, multivariate outliers can arise even when none of the individual variables is a univariate outlier.
2.8 TESTING LINEARITY ASSUMPTION

If an association between variables is not linear, it can be described as nonlinear, curvilinear, or perhaps a polynomial trend. Visual examination of bivariate scatterplots may be sufficient to evaluate possible nonlinearity. It is possible to test whether departure from linearity is statistically significant using regression analysis to predict Y from X² and perhaps even X³ (in addition to X). If adding X² to a regression equation that includes only X as a predictor leads to a significant increase in R², then the association can be called significantly nonlinear. The actual increase in R² would tell you whether nonlinearity predicts a trivial or large part of the variance in Y. For discussion of regression with two predictors, see Chapter 4. If Y is a function of:

Only X, then the X, Y function is a straight line; this represents a linear trend.
X and X², then the X, Y function has one curve; this is a quadratic trend. It may resemble a U or inverted U shape.
X, X², and X³, then the function has two curves; this is a cubic trend.
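The comparison described above (does adding a squared term to a model that already contains the linear term raise R²?) can be sketched in plain Python. This is an illustration under stated assumptions: the scores are made up and are not the data in linearitytest.sav, the helper names solve and r_squared are hypothetical, and the predictor is centered before squaring so that the squared term is not highly correlated with the linear term. The sketch reports only the change in R²; the significance test for that change would come from the regression output.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(y, predictors):
    """R squared for an ordinary least squares fit with an intercept."""
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


q = list(range(1, 14))                      # hypothetical predictor, N = 13
y = [40 - (v - 7) ** 2 for v in q]          # hypothetical inverted-U outcome

mean_q = sum(q) / len(q)
q_centered = [v - mean_q for v in q]
q_squared = [v * v for v in q_centered]     # (Q - MQ) * (Q - MQ)

r2_linear = r_squared(y, [q_centered])
r2_quadratic = r_squared(y, [q_centered, q_squared])
r2_increase = r2_quadratic - r2_linear
print(round(r2_linear, 3), round(r2_quadratic, 3), round(r2_increase, 3))
```

Because these made-up scores follow a perfectly symmetric inverted U, the linear term alone explains essentially none of the variance and the squared term explains nearly all of it; real data would show a less extreme change.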
Note that the number of curves in the X, Y function equals the highest power of X minus 1. The bivariate regression model for a simple linear relationship is Y′ = b0 + b × X. This can be expanded to include a quadratic term: Y′ = b0 + b1X + b2X². If the b2 coefficient associated with the X² predictor variable is statistically significant, this indicates a significant departure from linearity. SPSS transform and compute commands are used to compute a new variable, Xsquared, that corresponds to X². (Similarly, we could compute X³ = X × X × X; however, trends that are higher order than quadratic are not common in psychological data.)

The hypothetical data that correspond to the graphs in Figure 2.7 are in a file named linearitytest.sav (with N = 13 cases). Visual examination of the scatterplots in Figure 2.7 suggests a linear association of Y with X (left) and a quadratic association of Y with Q (right). Let’s first ask whether Y has a significantly nonlinear association with X for the scatterplot on the left of Figure 2.7. To do this, first compute the squared version of X. This can be done as follows using an SPSS compute statement. (If you are not familiar with compute statements, see Volume I [Warner, 2020], or an SPSS guide, or perform a Google search for this topic.)

COMPUTE Xsquared = X * X.

(A better way to compute X² is (X – MX) × (X – MX), where MX is the mean of X.¹) Then run SPSS linear regression using X and the new variable that corresponds to the squared value of X (named Xsquared) as predictors.

Figure 2.7 Linear Versus Quadratic Trend Data

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT Y
  /METHOD=ENTER X Xsquared.

Partial results for this regression appear in Figure 2.8. The X² term represents quadratic trend. If the b coefficient for this variable is statistically significant, the assumption of linearity is violated. In this situation, for X², b = –.001, β = –.025, t(10) = –.048, p = .963, two tailed (the df error term for the t test appears in a part of regression output that is not included here). The assumption of linearity is not significantly violated for X as a predictor of Y.

The same procedure can be used to ask whether there is a violation of the linearity assumption when Q is used to predict Y. First, compute a new variable named Qsquared; then run the regression using Q and Qsquared as predictors. For Q², b = –.173, β = –3.234, t(10) = –4.20, and p = .002, two tailed (values from Figure 2.9). (While β coefficients usually range between –1 and +1, they can be far outside that range when squared terms or products between variables are used as predictors.) The linearity assumption was significantly violated for Q as a predictor of Y.

What can be done when the linearity assumption is violated? Sometimes a data transformation such as log will make the relation between a pair of variables more nearly linear; possibly the log of Q would have a linear association with the log of Y. However, this works in only a few situations. Another option is to incorporate the identified nonlinearity into later analyses, for example, include X² as a predictor in later regression analyses, so that the nonlinearity detected during data screening is taken into account.

Figure 2.8 Regression Coefficients for Quadratic Regression (Prediction of Y Scores From Scores on X and X²)

Figure 2.9 Regression Coefficients for Prediction of Y From Q and Q²

2.9 EVALUATION OF OTHER ASSUMPTIONS SPECIFIC TO ANALYSES

Many analyses require additional evaluation of assumptions in addition to these preliminary assessments.
In the past, you have seen that tests of homogeneity of variance were applied for independent-samples t tests and between-S ANOVA. Often, as pointed out by Field (2018), assumptions for different analyses are quite similar. For example, homogeneity of variance assumptions can be evaluated for the independent-samples t test, ANOVA, and regression. For advanced analyses such as multivariate analysis of variance, additional assumptions need to be evaluated. Additional screening requirements for new analyses are discussed when these analyses are introduced.

2.10 DESCRIBING AMOUNT OF MISSING DATA

2.10.1 Why Missing Values Create Problems

There are several reasons why missing data are problematic. Obviously, if your sample is small, missing values make the amount of information even smaller. There is a more subtle problem. Often, missing responses don’t occur randomly. For example, people who are overweight may be more likely to skip questions about body weight. SPSS listwise deletion, the default method of handling missing data, just throws out the persons who did not answer this question. If you focus just on the subset of people who did answer a question, you may be looking at a different kind of sample (probably biased) than the original set of people recruited for the study.

Blank cells are often used to represent missing responses in SPSS data files. (Some archival data files use specific numerical values such as 99 or 77 to represent missing responses.) SPSS does not treat these blanks as 0 when computing statistics such as means; it omits the cases with missing scores from computation. For many procedures there are two SPSS methods for handling missing values. Consider this situation: A researcher asks for correlations among all variables in this list: X1, X2, …, Xk. If listwise deletion is chosen, then only the cases with valid scores for all of the X variables on the list are used when these correlations are calculated. If pairwise deletion is chosen, then each correlation (e.g., r12, r13, r23) is computed using all the persons who have valid scores for that pair of variables. When listwise deletion is used, all correlations are based on the same N of cases. When pairwise deletion is used, if there are missing values, the N’s for different correlations will vary, and some of the N’s may be larger than the N reported using listwise deletion.

If the amount of missing data is less than 5%, use of listwise deletion may not cause serious problems (Graham, 2009). When the amount of missing data is larger, listwise deletion can yield a biased sample. For example, if students with low grades are dropped from a sample used in the analysis because they refused to answer some questions about grades, the remaining sample will mostly include students with higher grades. The sample will be biased and will not represent responses from students with lower grades.

I’ll add another caution here. If you pay no attention to missing values, and you do a series of analyses with different variables, the total N will vary. For example, in your table of descriptive statistics, you may have 100 cases when you report M and SD for many variables. In a subsequent regression analysis, you may have only 85 cases. In an ANOVA, you might have only 50 cases. Readers are likely to wonder why N keeps changing. In addition, results can’t be compared across these analyses because they are not based on data for the same set of cases. It is better to deal with the problem of missing values at the beginning and then work with the same set of cases in all subsequent analyses.

Missing value analysis involves two steps. First, we need to evaluate the amount and pattern of missing data. Then, missing values may be replaced with plausible scores prior to other analyses. To illustrate procedures used with missing data, I used a subset of data obtained in a study by Warner and Vroman (2011). A subset of 240 cases and six variables with complete data was selected and saved in a file named nonmissingwb.sav. To create a corresponding file with specific patterns of missingness, I changed selected scores in this file to system missing and saved these data in the file named missingwb.sav.

Figure 2.10 Number of Missing Values for Each of Six Variables (in Data File missingwb.sav)

2.10.2 Assessing Amount of Missingness Using SPSS Base

Initial assessments of amount of missing data do not require the SPSS Missing Values add-on module. Amount of missing data can be summarized three ways: for each variable, for the entire data set, and for each case or participant. To make an initial assessment, the SPSS frequencies procedure was used; results appear in Figure 2.10.

For each variable: Four variables had some missing values; two variables did not have missing data (in other words, 4/6 = 66.7% of variables had at least one missing value). What number of cases (or percentage of values) were missing on each variable? This is also obtained from the frequencies procedure output in Figure 2.10. For example, out of 240 cases, depression had missing values on 22 cases (22/240 = 9.2%).

For the entire data set: Out of all possible values in the data set, what percentage were missing? The number of possible scores = number of variables × number of cases = 6 × 240 = 1,440. The number of missing values is obtained by summing the values in the “Missing” row in Figure 2.10: 22 + 90 + 14 + 20 + 0 + 0 = 146. Thus 146 of 1,440 scores are missing, for an overall missing data percentage of approximately 10%.

For each participant or case: Additional information is needed to evaluate the number of missing values for each case. To obtain this, create a dummy variable to represent missingness of scores on each variable (as suggested by Tabachnick & Fidell, 2018). The variable missingdepression corresponds to this yes/no question: Does the participant have a missing score on depression? Responses are coded 0 = no, 1 = yes.
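The three summaries just described can be reproduced with simple arithmetic. The sketch below is an illustration in Python rather than SPSS; the counts come from the “Missing” row of Figure 2.10 as quoted above, while the two case records at the end are hypothetical and serve only to show the 0/1 dummy coding summed per case.

```python
n_cases = 240
missing_per_variable = [22, 90, 14, 20, 0, 0]   # "Missing" row of Figure 2.10

# For each variable: percentage of cases with a missing score
pct_per_variable = [round(100 * m / n_cases, 1) for m in missing_per_variable]
n_vars_with_missing = sum(m > 0 for m in missing_per_variable)

# For the entire data set: missing scores out of all possible scores
total_missing = sum(missing_per_variable)
possible_scores = len(missing_per_variable) * n_cases
pct_overall = round(100 * total_missing / possible_scores, 2)

# For each case: dummy code 1 = missing, 0 = present (None marks a blank cell)
hypothetical_cases = [
    {"depression": 12, "neuroticism": None},
    {"depression": None, "neuroticism": None},
]
total_missing_per_case = [
    sum(1 if value is None else 0 for value in case.values())
    for case in hypothetical_cases
]

print(pct_per_variable, n_vars_with_missing)
print(total_missing, possible_scores, pct_overall)
print(total_missing_per_case)
```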
Dummy variables for missingness were created using the <Transform> → <Recode into Different Variables> procedure, as shown in Figure 2.11. In the dialog box on top in Figure 2.11, specify the name of the existing (numerical) variable, in this example, depression. Create a name for the output variable in the right-hand side box (in this example, the output variable is named missingdepression). Click Change to move this new output variable name into the window under “Numeric Variable -> Output Variable.” Then click Old and New Values. This leads to the second dialog box in Figure 2.11. To define the first value of the dummy variable (a code of 1 if there is a missing value for depression), click the radio button to select the system missing value for depression as the old value; then enter the code for the new or output variable (1) into the “New Value” box on the right. Each participant who has a system missing value for depression is assigned a score of 1 on the new variable, missingdepression. Click Add to move this specification into the “Old --> New” box. To define the second value, select the radio button on the left for “All other values,” and input 0 for “New Value” on the right; click Add. A participant with any other value, other than system missing, on depression is given a score of 0 on the new variable named missingdepression. Click Continue to return to the main dialog box, then click OK.

Figure 2.11 Recode into Different Variables Dialog Box

The SPSS syntax that corresponds to these menu selections is:

RECODE Depression (SYSMIS=1) (ELSE=0) INTO missingdepression.
EXECUTE.

The same operations can be used to create missingness variables for other variables (NegativeAffect, SatisfactionwLife, and Neuroticism). To find out how many variables had missing values for each participant, sum these new variables:

COMPUTE Totalmissing = missingdepression + missingsatisfaction + missingnegaffect + missingneuroticism.

Then obtain a frequencies table for the new variable Totalmissing (see Figure 2.12). Only one person was missing values on all three variables. Most cases or participants were missing values on no variables (n = 116) or only one variable (n = 103).

2.10.3 Decisions Based on Amount of Missing Data

Amount of Missing Data in Entire Data Set

Graham (2009) stated that it may be reasonable to ignore the problem of missing values if the overall amount of missing data is below 5%. When there are very few missing data, the use of listwise deletion may be acceptable. In listwise deletion, cases that are missing values for any of the variables in the analysis are completely excluded. For example, if you run correlations among X1, X2, X3, and X4 using listwise deletion, a case is excluded if it is missing a value on any one of these variables. Pairwise deletion means that a case is omitted only for correlations that require a score that the case is missing; for example, if a case is missing a score on X1, then that case is excluded for computation of r12, r13, and r14, but retained for r23, r34, and r24.
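The difference between the two deletion rules can be made concrete for the single case in this example. The sketch below is an illustration in Python, not SPSS output; the scores and the function names pairs_using_case and kept_listwise are hypothetical.

```python
from itertools import combinations

def pairs_using_case(case):
    """Pairs of variables for which this case has both scores (pairwise deletion)."""
    return [(a, b) for a, b in combinations(case, 2)
            if case[a] is not None and case[b] is not None]

def kept_listwise(case):
    """Under listwise deletion a case is kept only if no score is missing."""
    return all(value is not None for value in case.values())

case = {"X1": None, "X2": 4, "X3": 7, "X4": 2}   # hypothetical scores; X1 is missing
print(pairs_using_case(case))   # correlations this case still contributes to
print(kept_listwise(case))      # whether listwise deletion keeps the case
```

A case missing only X1 contributes to the three correlations that do not involve X1 under pairwise deletion, but is dropped from all six under listwise deletion.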
Listwise and pairwise deletion are regarded as unacceptable for large amounts of missing data. Even with less than 5% missing, Graham still recommended using missing values imputation (discussed in upcoming sections) instead of listwise deletion.

Figure 2.12 Numbers of Participants or Cases Missing 0, 1, 2, and 3 Scores Across All Variables

Amount of Missing Data for Each Variable

Tabachnick and Fidell (2018) suggested that if a variable is not crucial to the analysis, that variable might be entirely dropped if it has a high proportion of missing values. Suppose that prior to data analysis, the analyst decided to discard variables with more than 33% missing values. Satisfaction with life was missing 38% of its values; it might be dropped using this preestablished rule. If a variable has numerous missing values, this may have been information that was not obtainable for many cases. (If the missing values were planned missing, the variable would not be dropped. For example, if only smokers are asked additional questions about amount of smoking, these variables would not be dropped simply because nonsmokers did not answer the questions.) It is not acceptable to drop variables after final analyses; dropping variables that influence outcomes such as p values at a late stage in the analysis can be a form of p-hacking. Any decision to drop a variable must be well justified.

Amount of Missing Data for Each Case

Analysts might also consider dropping cases with high percentages of missing values (as suggested by Tabachnick & Fidell, 2018). Completely dropping cases is equivalent to listwise deletion, and experts on missing values agree that listwise deletion is generally poor practice. However, it’s worth considering the possibility that some participants may have provided really poor data. Some possible examples of extremely low quality survey data include the following: no answers for many questions, ridiculous or impossible responses (height 10 ft or 3 m), a series of identical ratings given for a long list of questions that assess different things (e.g., a string of scores such as 5, 5, 5, 5, 5, 5, 5 …), and inconsistent responses across questions (e.g., person responds “I have never smoked” to one question and then responds “I smoke an average of 10 cigarettes per day” to another question). These problems can arise because of poorly worded questions, or they may be due to lack of participant attention and effort or deliberate refusal to cooperate.
If a decision is made to omit entire cases on the basis of data quality, be careful how this decision is presented, and make it clear that case deletions were thoughtful decisions, not (mindless, automatic) listwise deletion. Ideally, specific criteria for case deletion would be specified prior to data collection. However, participants can come up with types of poor data that are difficult to anticipate. In research other than surveys, analogous problems may arise. The dummy variables used here to evaluate participant- or case-level missing data can also be used to evaluate patterns in missingness, as discussed in Section 2.13.

2.10.4 Assessment of Amount of Missingness Using SPSS Missing Values Add-On

The SPSS Missing Values add-on module can be used to obtain similar information about amount of missing data in a different format (without the requirement to set up dummy variables for missingness). The SPSS Missing Values add-on module provides two different procedures for analysis and imputation of missing values. Unfortunately, the menu options for these (at least up until SPSS Version 26) are confusing. (You can locate SPSS manuals by searching for “SPSS Missing Values manual” and locating the manual for the version number you are using.) When you purchase a license for the Missing Values add-on, two new choices appear in the pull-down menu under <Analyze>.

The first choice can be obtained by selecting these menu options: <Analyze> → <Missing Value Analysis>. I have not used this procedure in this chapter, and I do not recommend it. The procedure that corresponds to these menu selections has an important limitation; it does not provide multiple imputation (only single imputation). Multiple imputation is strongly preferred by experts.

For all subsequent missing value analysis, I used these menu selections: <Analyze> → <Multiple Imputation>, as shown in Figure 2.13. The pull-down menu that appears when you click <Multiple Imputation> offers two choices: <Analyze Patterns> and <Impute Missing Data Values>. The procedures demonstrated in this chapter are run using these two procedures. First, descriptive information about the amount and pattern of missing data is obtained using the menu selections <Analyze> → <Multiple Imputation> → <Analyze Patterns>. Then the menu selections <Analyze> → <Multiple Imputation> → <Impute Missing Data Values> are used to generate multiple imputation of missing score values.

Figure 2.13 Drop-Down Menu Selections to Open SPSS Missing Values Add-On Module

To obtain information about the amount of missing data, make these menu selections: <Analyze> → <Multiple Imputation> → <Analyze Patterns>, as shown in Figure 2.13. (The <Multiple Imputation> command appears in the <Analyze> menu only if you or your organization has purchased a separate license; it is not available in SPSS Base.) In the Analyze Patterns dialog box (Figure 2.14), checkboxes can be used to select the kinds of information requested. I suggest that you include all variables in the “Analyze Across Variables” pane, not only the ones that you know have missing values. (“Analyze patterns” is a bit of a misnomer here; the information provided by this procedure is mainly for the amount of missing values rather than patterns of missingness.)

Only one part of the output is shown here (Figure 2.15). Figure 2.15 tells us that four of the six variables (66.67%) had at least one missing value. One hundred sixteen of the 240 cases or participants (48.33%) had complete data; the remaining 124 (51.67%) had at least one missing value. Of the 1,440 values in the entire data set, 146 or 10.14% were missing. These graphics present information already obtained from SPSS Base. The Missing Values add-on module also generates graphics to show the co-occurrence of pairs or sets of missing variables (e.g., how many cases were missing scores on both depression and sex?). However, more useful ways to assess patterns of missingness are discussed in Sections 2.12 and 2.13.

Figure 2.14 Dialog Box for Analyze Patterns Procedure

Figure 2.15 Selected Output From Missing Values Analyze Patterns Procedure

2.11 HOW MISSING DATA ARISE

Data can be missing for many reasons. Four common reasons are described; however, this does not exhaust the possibilities.
Refusal to participate: A researcher may initially contact 1,000 people to ask for survey participation. If only 333 agree to participate, no data are available for two thirds of the intended sample. Refusal to participate is unlikely to be random and can introduce substantial bias. There is nothing that can be done to replace this kind of missing data (the researcher could ask another 2,000 people to participate and obtain 666 more people). People who volunteer, or consent, to participate in research differ systematically from those who refuse (Rosenthal & Rosnow, 1975). It is essential to report numbers of persons who refused to participate. It would also be useful to know why they refused. Refusal to participate leads to bias that cannot be corrected through later procedures such as imputation of missing values; imputation cannot replace this kind of lost data. The likelihood that the sample is not representative of the entire population that was contacted should be addressed in the discussion section when considering potential limitations of generalizability of results.

Attrition in longitudinal studies creates another kind of missingness. Imagine a longitudinal study in which participants are assigned (perhaps randomly) to different treatment conditions. Assessments may be made before treatment and at one or more times after the treatment or intervention. There is usually attrition. Participants may drop out of the treatment program, move and leave no contact information, die, or become unwilling or unable to continue. Some participants may miss one follow-up assessment and return for a later assessment. Samples after treatment or intervention can be smaller than the pretreatment sample, and they may also differ from the pretreatment sample in systematic ways.

Missing data may be planned: A survey might contain a funnel question, such as “Have you ever smoked?” People who say “yes” are directed to additional questions about smoking. People who say “no” would skip the additional smoking questions. Missing values would almost certainly not be imputed for these skipped questions. To shorten the time demands of a long survey, participants may be given only random subsets of the questions (and thus not have data for other questions, but in a planned and random manner). Development of better methods for handling planned missing data has encouraged the development of planned missing studies (Graham, 2009).

Missing values may have been used to replace outliers in previous data screening: One possible way to handle outliers (particularly when they are unbelievable or implausible) is to convert them to system missing values.

In an ideal situation, missing values would occur randomly, in ways that would not introduce bias in later data analysis. In actual data, missing values often occur in nonrandom patterns.

2.12 PATTERNS IN MISSING DATA

2.12.1 Type A and Type B Missingness

Patterns of missingness are usually described as one of these three types: missing completely at random, missing at random, and missing not at random (Rubin, 1976). To explain how these kinds of missingness differ, here is a distinction not found elsewhere in the missing values literature: I will refer to Type A and Type B missingness.

Consider Type A missingness. Suppose we have a Y variable (such as depression) that has missing values, and we also have data for other variables X1, X2, X3, and so on (such as sex, neuroticism, and social desirability response bias). It is possible that missingness on Y is related to scores on one or more of the X variables; for example, men and people high in social desirability may be more likely to refuse to answer the depression questions than women and persons with low social desirability response bias. I will call this Type A missingness. The next few sections show that this kind of missingness can easily be detected and that state-of-the-art methods of replacement for missing values, such as multiple imputation (MI), can correct for bias due to this type of missingness.

Now consider Type B missingness. It is conceivable that the likelihood of missing scores on Y (depression) depends on people’s levels of depression. That is, people who would have had high scores on depression may be likely not to answer questions about depression. I will call this Type B missingness. Type B missingness is more difficult to identify than Type A missingness. (Sometimes it is impossible to identify Type B missingness.) Also, potential bias due to Type B missingness is more problematic and may not be correctable.

2.12.2 MCAR, MAR, and MNAR Missingness

The three patterns of missingness that appear widely in research on missing values were described by Rubin (1976). These are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Each of these patterns can be defined by the presence or absence of Type A and Type B missingness.

First consider MCAR missingness, as described by Schafer and Graham (2002): Assume that “variables X (X1, … Xp) are known for all participants but Y is missing for some. If participants are independently sampled from the population … MCAR means that the probability that Y is missing for a participant does not depend on his or her own values of X or Y.” Using the terms I suggest, MCAR does not have either Type A or Type B missingness.

The name MAR (missing at random) is somewhat confusing, because this pattern is not completely random. Schafer and Graham (2002) stated, “MAR means that the probability that Y is missing may depend on X but not Y … under MAR, there could be a relationship between missingness and Y induced by their mutual relationships to X, but there must be no residual relationship between them once X is taken into account.” Using my terms, MAR may show Type A missingness (however, MAR must not show Type B missingness after corrections have been made for any Type A missingness).

The third and most troubling possible pattern is MNAR. Schafer and Graham (2002) stated that “MNAR means that the probability of missingness depends on Y…. Under MNAR, some residual dependence between missingness and Y remains after accounting for X.” Using terms I suggest, MNAR has Type B missingness (and it may or may not also have Type A missingness).
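The mapping between the Type A/Type B distinction and the three standard labels can be written as a small decision rule. This is only a restatement of the definitions above in Python; the function name missingness_pattern is made up, and in real data the two inputs are rarely known with certainty (Type B in particular usually cannot be verified).

```python
def missingness_pattern(type_a, type_b):
    """Label a missingness pattern from the presence of Type A and Type B missingness."""
    if type_b:
        # Missingness on Y still depends on Y itself after accounting for X
        return "MNAR"
    if type_a:
        # Missingness on Y depends on other variables X, but not on Y itself
        return "MAR"
    # Missingness on Y depends on neither X nor Y
    return "MCAR"

for type_a in (False, True):
    for type_b in (False, True):
        print(type_a, type_b, missingness_pattern(type_a, type_b))
```

Type B is checked first because MNAR applies whenever Type B missingness is present, whether or not Type A missingness is also present.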
MAR and MCAR patterns of missingness are called ignorable. This does not mean that we don’t have to do anything about missing data if the pattern of missingness is judged to be MAR or MCAR. “Ignorable” means that, after state-of-the-art methods for replacement of missing values are used, results of analyses (such as p values) should not be biased. MNAR (and Type B missingness, its distinguishing feature) are nonignorable forms of missingness. Even when state-of-the-art methods are used to impute scores for missing values in MNAR missing data, potential bias remains a problem that cannot be ignored. Discussion in a journal article must acknowledge the limitations imposed by this bias. For example, if we know that persons who are very depressed are likely to have missing data on the depression question, it follows that the people for whom we do have data represent a sample that is biased toward lower depression. Schlomer, Bauman, and Card (2010) urged researchers to consider the possible existence of MNAR and reasons why this might occur.

The degree to which missing values are problematic depends more on the pattern of missingness than the amount of missingness (Tabachnick & Fidell, 2018). MNAR is most problematic. Researchers should report information about pattern, as well as amount, of missing data. It is possible to find patterns in data that indicate problems with Type A missingness. However, it is impossible to prove that Type A and/or Type B missingness is absent.

2.12.3 Detection of Type A Missingness

Methods for detection of Type A missingness are discussed in the context of an empirical example in upcoming Section 2.13, including pairwise examination of variables and Little’s test of MCAR. In this empirical example, Type A missingness occurs because missingness of depression scores is related to sex, neuroticism, socially desirable response bias, and other variables. The SPSS Missing Values add-on module provides all the necessary tests for Type A missingness. I will demonstrate that many of these tests can also be obtained using SPSS Base (the output from analysis using SPSS Base may be easier to understand). State-of-the-art methods for replacement of missing values are thought to correct most of the bias due to this type of missingness (Graham, 2009).

2.12.4 Detection of Type B Missingness

Unfortunately, evaluation of Type B missingness is difficult. It usually requires information that researchers don’t have. Consider this example. If a question about school grade point average (GPA) is included in a survey, it is possible that students are more likely not to answer this if they have low GPAs. To evaluate whether Type B missingness is occurring, we need to know what the GPA scores would have been for the people who did not answer the question. Often there is no way to obtain this kind of information. In some situations, outside information can be helpful. Here are three examples of additional information that would help evaluate whether Type B missingness is occurring.
1. The researcher could follow up with the students who did not answer the GPA question and try again to obtain their answers. (Of course, if that information is obtained, it can be used to replace the missing value.)
2. The researcher could look for an independent source of data to find out what GPA answers would have been for people who did not answer the question. For example, universities have archival computer records of GPA data for all students. (Usually researchers cannot access this information.) If the researcher could obtain GPAs for all students, he or she could evaluate whether students who did not answer the question about GPA had lower GPA values than people who did answer the question. In this situation also, the values from archival data could be used to replace missing values in the self-report data.
3. An indirect way to assess Type B missingness would be to look at the distribution and range of GPA values in the sample of students and compare that with the distribution and range of GPA values for the entire university. Assume that the sample was drawn randomly from all students at the university. If the sample distribution for GPA contains a much lower proportion of GPAs below 2.0 than the university distribution, this would suggest that low-GPA students may have been less likely to report their GPAs than high-GPA students. This would indicate the presence of Type B missingness but would not provide a solution for it.
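The indirect check in the third example can be made concrete with a one-sample z test for a proportion. This is only one possible way to quantify the comparison, not a procedure taken from the chapter. The function name `proportion_z` and all of the numbers (200 respondents, 6 reported GPAs below 2.0, a university-wide rate of 10%) are hypothetical.

```python
from math import sqrt


def proportion_z(sample_hits, sample_n, population_proportion):
    """z statistic comparing a sample proportion with a known population proportion."""
    sample_proportion = sample_hits / sample_n
    standard_error = sqrt(
        population_proportion * (1 - population_proportion) / sample_n
    )
    return (sample_proportion - population_proportion) / standard_error


# Hypothetical values: 6 of 200 respondents reported a GPA below 2.0,
# while 10% of all students at the university have a GPA below 2.0.
z = proportion_z(sample_hits=6, sample_n=200, population_proportion=0.10)
print(f"z = {z:.2f}")
```

A large negative z would be consistent with low-GPA students being underrepresented among those who answered. As the text notes, this can suggest Type B missingness but does not correct it.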
In the data set used as an empirical example, I know that neuroticism had Type B missingness (because, when I created my missing data file, I systematically turned higher scores on neuroticism into missing values). When I created Type B missingness for neuroticism, my new missing data file underrepresented people high in neuroticism, compared with the complete data set. Even after replacement of values using methods such as MI, generalization of findings to persons high on neuroticism would be problematic in this example.

Researchers often cannot identify, or correct for, Type B missingness. When Type B missingness is present (and probably it often is), researchers need to understand the bias this creates. Two types of bias may occur: Parameters may be over- or underestimated, and the sample may not be representative of, or similar to, the original population of interest. (For example, the sample may underrepresent certain types of persons, such as those highest on depression.) A researcher should address these problems and limitations in discussion of the study.

2.13 EMPIRICAL EXAMPLE: DETECTING TYPE A MISSINGNESS

To assess Type A missingness, we need to know whether missing versus nonmissing status for each variable is related to scores on other variables. This information can be obtained using the SPSS Missing Values add-on module. However, when first learning about missing values, doing a similar analysis in SPSS Base may make the underlying ideas clearer.

Figure 2.16 One-Way ANOVA Dialog Box: Assess Associations of Other Variables With Missingness on Negative Affect
Earlier, in Section 2.10, a dummy “missingness” variable was created for each variable in the data set that had one or more missing values. These dummy variables can now be used to test Type A missingness. To see whether missingness on one variable (such as negative affect) is related to scores on other quantitative variables (such as response bias, depression, or neuroticism), means for those other quantitative variables are tested to see if they differ across the missing and nonmissing groups. It is convenient to use the SPSS one-way ANOVA procedure for comparison of means. To open the one-way ANOVA procedure, make the following menu selections: <Analyze> → <Compare Means> → <One-Way ANOVA>. The One-Way ANOVA dialog box in Figure 2.16 shows which variables were included. The Options button was used to select descriptive statistics (recall that means and other descriptive statistics are not provided unless requested explicitly).

Selected results appear in Figure 2.17. The groups (groups of persons missing or not missing negative affect scores) did not differ in mean satisfaction with life, F(1, 148) = .106, p = .745. Missingness on negative affect was related to scores on the other three variables; in other words, there is evidence of Type A missingness. The table of group means (not shown here) indicated that people in the missing negative affect group scored lower on neuroticism, higher in social desirability response bias, and lower on depression. Similar comparisons of means are needed for each of the other missingness dummy variables (e.g., ANOVAs to compare groups of missing vs. not missing status for Depression, SatisfactionwLife, etc.).

To evaluate whether missingness is related to a categorical variable such as sex, or to missingness on other variables, set up a contingency table using the SPSS crosstabs procedure.
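The logic of the mean comparison described above can be sketched outside SPSS. The following is a minimal illustration, not the chapter's data: the scores are hypothetical, `one_way_f` is an illustrative helper name, and `None` stands in for a system-missing value. It builds the missingness dummy variable and computes the one-way ANOVA F ratio for another variable across the missing and not-missing groups.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical scores; None marks a missing negative affect score.
negative_affect = [12, None, 15, None, 9, 14, None, 11]
neuroticism = [20, 14, 22, 13, 19, 24, 15, 21]

# Dummy variable: 1 = missing on negative affect, 0 = not missing.
missing_na = [1 if score is None else 0 for score in negative_affect]

# Split neuroticism scores by missingness status on negative affect.
groups = [
    [n for n, m in zip(neuroticism, missing_na) if m == 0],
    [n for n, m in zip(neuroticism, missing_na) if m == 1],
]
f_value, df1, df2 = one_way_f(groups)
print(f"F({df1}, {df2}) = {f_value:.3f}")
```

With only two groups, this F ratio is the square of the independent-samples t statistic, which is why the same comparison can be reported as either a t test or a one-way ANOVA.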
The crosstabs results in Figure 2.18 indicate that sex was associated with missingness on depression; 22 of 112 men (almost 20%) were missing scores on depression; none of the women were missing depression scores. This was statistically significant, χ²(1) = 27.68, p < .001 (output not shown).

The SPSS Missing Values add-on module provides similar comparisons of group means and crosstabs (not shown here). An additional test available from the Missing Values add-on module is Little’s test of MCAR (Little, 1988). Little’s test essentially summarizes information from the individual tests for Type A missingness just described. To obtain Little’s test, open the Missing Values add-on module by selecting <Analyze> → <Missing Value Analysis> (not either of the two additional menu choices that appear to the right after selecting <Multiple Imputation>; refer back to Figure 2.13). The Missing Value Analysis dialog box appears as shown in Figure 2.19.

Figure 2.17 ANOVA Source Table: Comparison of Groups Missing Versus Not Missing Negative Affect Scores

Figure 2.18 Contingency Table for Missingness on Depression by Sex
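The chi-square value reported for the sex by missingness table can be checked by hand. The sketch below computes a Pearson chi-square without continuity correction for a 2 × 2 table. The count of 22 missing among 112 men comes from the text. The count of 128 women is an assumption, obtained by subtracting 112 from the 240 valid cases listed for Sex in the figure descriptions at the end of this chapter. The function name `chi_square_2x2` is illustrative.

```python
def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows: men, women. Columns: missing, not missing on depression.
# 22 of 112 men missing is from the text; 128 women is an assumption
# (240 total cases minus 112 men), and no women were missing.
table = [[22, 90], [0, 128]]
chi2 = chi_square_2x2(table)
print(f"chi-square(1) = {chi2:.2f}")
```

Under these assumed counts, the result agrees with the reported value of 27.68.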
Figure 2.19 Missing Value Analysis Dialog Box to Request Little’s MCAR Test

In the Missing Value Analysis dialog box, move all quantitative variables to the “Quantitative Variables” pane, and move any categorical variables into the separate “Categorical Variables” pane. Check the box for “EM” in the “Estimation” list, then click OK. (If you also want the t tests and crosstabs that were discussed earlier in SPSS Base, click the Descriptives button and use checkboxes in the Descriptives dialog box to request these; they are not included here.)

Little’s MCAR test appears as a footnote to the “EM Means” table in Figure 2.20. This was statistically significant, χ²(33) = 136.081, p < .001. The null hypothesis is essentially that there is no Type A missingness for the entire set of variables. This null hypothesis is rejected (consistent with earlier ANOVA and crosstabs results showing that missingness was related to scores on other variables). This is additional evidence that Type A missingness is present. There is no similar empirical test for Type B missingness.

Figure 2.20 “EM Means” Table From SPSS Missing Values Analysis With Little’s MCAR Test

2.14 POSSIBLE REMEDIES FOR MISSING DATA

There are essentially three ways to handle missing values. The first is to ignore them, that is, throw out cases with missing data using default methods such as SPSS listwise or pairwise deletion. (Somewhat different terms are used elsewhere; “complete case analysis” is synonymous with listwise deletion; “available data analysis” is equivalent to pairwise deletion; Pigott, 2001.) Listwise deletion is almost universally regarded as bad practice. However, Graham (2009) said that listwise deletion may yield acceptable results if the overall amount of missing data is less than 5%; he stated that “it would be unreasonable for a critic to argue that it was a bad idea” if an analyst chose to use listwise deletion in this situation. However, he recommended the use of missing data replacement methods such as MI even when there is less than 5% missing data.

One obvious problem with listwise deletion is reduction of statistical power because of a smaller sample size. A less obvious but more serious problem with listwise deletion is that discarding cases with missing scores can systematically change the composition of the sample. Recall that when I created a missing values pattern in the data set used as an example, I systematically deleted the cases with the highest scores on neuroticism (this created Type B missingness for neuroticism). If listwise deletion were used, subsequent analyses would not include any information about people who had the highest scores for neuroticism. That creates bias in two senses. First, if we want to generalize results from a sample to some larger hypothetical population, the sample now underrepresents some kinds of people in the population. Second, there is bias in estimation of statistics such as regression slopes, effect sizes, and p values (this is known from Monte Carlo studies that compared different methods for handling missing values in the presence of different types of pattern for missingness).
A second way to handle missing values is to replace them with simple estimates based on information in the data set. Missing scores on a variable could be replaced with the mean of that variable (for the entire data set or separately for each group). Missing values could be replaced with predicted scores from a regression analysis that uses other variables in the data set as predictors. These methods are not recommended (Acock, 2005), because they do not effectively reduce bias.

There are several state-of-the-art methods for replacement of missing values that involve more complex methods. Graham (2009) “fully endorses” multiple imputation. Monte Carlo work shows that MI is effective in reducing bias in many missing-values situations (but note that it cannot correct for bias due to Type B missingness). Graham and Schlomer et al. (2010) described other state-of-the-art procedures and the capabilities of several programs, including SAS, SPSS, Mplus, and others. They also described freely downloadable software for missing values. The empirical example presented in the following section uses MI.

Graham (2009) stated that MI performs well in samples as small as 50 (even with up to 18 predictors) and with as much as 50% missing data in the dependent variable. He explained that, contrary to some beliefs, it is acceptable to impute replacements for missing values on dependent variables. He suggested that a larger number of imputations than the SPSS default of 5 may be needed with larger amounts of missing data, possibly as many as 40 imputations.

2.15 EMPIRICAL EXAMPLE: MULTIPLE IMPUTATION TO REPLACE MISSING VALUES

To run MI using the SPSS Missing Values add-on module, start from the top-level menu. Choose <Analyze> → <Multiple Imputation>, then from the pop-up menu on the right, select <Impute Missing Data Values>. The resulting dialog box appears in Figure 2.21.
All the variables of interest (both the variables with missing values and all other variables that will be used in later analyses) are included. Note that you can access a list of procedures that can be applied to imputed data in SPSS Help, as noted in this dialog box. The number of imputations is set to 5 by default (note that a larger number of imputations, on the order of 40, is preferable for data sets with large percentages of missing data; Graham, 2009). A name for the newly created data set must be provided (in this example, Imputed Data).

MI does something comparable to replacement by regression. Each imputation estimates a different set of plausible values to replace each missing value for a variable such as depression; these plausible values are based on predictions from all other variables. The resulting data file (a subset appears in Figure 2.22) now contains six versions of the data: the original data and the five imputed versions. The first column indicates imputation number (0 for the original data).

Figure 2.21 Dialog Box for Impute Missing Data Values
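Once an analysis has been run separately on each version of the data, the estimates from the imputed versions are combined. The sketch below shows one standard way to do this, Rubin's rules, under stated assumptions: the coefficients and standard errors are hypothetical, the key `Imputation_` mirrors the imputation-number column (0 for the original data), and pooling is assumed to use only the imputed versions (1 through 5). It is an illustration of the idea, not a description of SPSS internals.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical regression slopes and standard errors, one row per version
# of the data. Imputation_ = 0 is the original (unimputed) data.
results = [
    {"Imputation_": 0, "b": 0.45, "se": 0.12},
    {"Imputation_": 1, "b": 0.40, "se": 0.10},
    {"Imputation_": 2, "b": 0.42, "se": 0.10},
    {"Imputation_": 3, "b": 0.38, "se": 0.10},
    {"Imputation_": 4, "b": 0.41, "se": 0.10},
    {"Imputation_": 5, "b": 0.39, "se": 0.10},
]


def pool_estimates(estimates, std_errors):
    """Pool estimates across imputations using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                      # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Assumption: only the imputed versions (Imputation_ > 0) enter the pool.
imputed = [row for row in results if row["Imputation_"] > 0]
pooled_b, pooled_se = pool_estimates(
    [row["b"] for row in imputed], [row["se"] for row in imputed]
)
print(f"pooled b = {pooled_b:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is larger than the average within-imputation standard error whenever estimates differ across imputations; that extra term reflects uncertainty about the imputed values.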
Figure 2.22 Selected Rows From Imputed Data Set

Figure 2.23 Split File Command Used to Pool Results for Imputed Data File

The final analysis of interest (for example, prediction of depression from the other five variables) is now run on all versions of the data (Imputations 0 through 5), and results are pooled (averaged) across data sets. Prior to the regression, select <Data> and <Split File> (not <Split into Files>). In the Split File dialog box, move the variable Imputation Number into the pane under “Groups Based on” and select the radio button for “Compare groups.” You should see a line that says Current Status: Compare:Imputation_ in the lower left corner. The SPSS syntax is: SPLIT FILE LAYERED BY Imputation_.

Now run the analysis of interest. In this example, it was a multiple regression to predict scores on Depression from SatisfactionwLife, NegativeAffect, Neuroticism, Sex, and Socialdesirability. Selected results for this regression analysis appear in Figure 2.24. Figure 2.24 shows the regression coefficients separately for the original data, for each of the imputed data sets (1 through 5), and for the pooled results. We hope to see consistent results across all solutions, and that is usually what is obtained. For these data, results varied little across the five imputations. Reporting would focus on pooled coefficient estimates and the overall statistical significance of the regressions (in the ANOVA tables, not provided here).

Figure 2.24 Prediction of Depression From SatisfactionwLife, NegativeAffect, Neuroticism, Sex, and Socialdesirability Using Linear Regression: Original and Imputed Missing Values

2.16 DATA SCREENING CHECKLIST

Decisions about eligibility criteria, minimum group size, methods to handle outliers, plans for handling missing data, and so forth should be made prior to data collection. For longitudinal studies that compare treatment groups, Consolidated Standards of Reporting Trials (CONSORT) guidelines may be helpful (Boutron, 2017). Document what was done (with justification) at every step of the data-screening process. The following checklist for data screening and handling covers many research situations. Some variation in the order of steps is possible. However, I believe that it makes sense to consider distribution shape prior to making decisions about handling outliers and to deal with outliers before imputing missing values. These suggestions are not engraved in stone. There are reasonable alternatives for most of the choices I have recommended.
1. Proofread the data set against original sources of data (if available). Replace incorrect scores with accurate data. Replace impossible score values with system missing.
2. Remove cases that do not meet eligibility criteria.
3. If group means will be compared, each group should have a minimum of 25 to 30 cases (Boneau, 1960). If some groups have smaller n’s, additional members for these groups might be obtained prior to other data analyses. Alternatively, groups with small n’s can be dropped, or combined with other groups (if that makes sense).
4. Assess distribution shapes by examining histograms. If groups will be compared, distribution shape should be assessed separately within each group. Some distribution shapes, such as Poisson, require different analyses than those covered in this book (see Appendix 2A).
5. Possibly apply data transformations (such as log or arcsine), but only if this makes sense. If distribution skewness is due to a few outliers, it may be preferable to deal with those outliers individually instead of transforming the entire set of scores.
6. Screen for univariate, bivariate, and multivariate outliers. Decide how to handle these (for example, convert extreme scores to less extreme values, or replace them with missing values).
7. Test linearity assumptions for associations between quantitative variables. If nonlinearity is detected, revisit the possibility of data transformations, or include terms such as X² in later analyses.
8. Assess amount and pattern of missing values. If there is greater than 50% missingness on a case or a variable, consider the possibility that these cases or variables provide such poor-quality data that they cannot be used. If cases or variables are dropped, this should be documented and explained.
9. Use multiple imputation to replace missing values (or use another state-of-the-art missing value replacement method, as discussed in Graham, 2009).
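Step 8 of the checklist can be illustrated with a short calculation. The sketch below computes the percentage of missing scores per variable and flags any variable above the 50% threshold mentioned in the checklist. The missing counts and the total of 240 cases are taken from the valid/missing summary in the figure descriptions at the end of this chapter; the names `TOTAL_CASES`, `percent_missing`, and `flagged` are illustrative.

```python
TOTAL_CASES = 240  # from the valid/missing summary (Sex: 240 valid, 0 missing)

# Number of missing scores per variable, from the same summary.
missing_counts = {
    "Depression": 22,
    "SatisfactionwLife": 90,
    "NegativeAffect": 14,
    "Neuroticism": 20,
    "Sex": 0,
    "Socialdesirability": 0,
}

# Percentage of cases missing a score on each variable.
percent_missing = {
    name: 100 * count / TOTAL_CASES for name, count in missing_counts.items()
}

# Variables exceeding the 50% missingness threshold from step 8.
flagged = [name for name, pct in percent_missing.items() if pct > 50]

for name, pct in percent_missing.items():
    print(f"{name}: {pct:.1f}% missing")
print("Flagged for possible removal:", flagged)
```

The same per-variable percentages are among the quantities a report on missing data would normally include. A per-case version of the check would require the raw data rather than summary counts.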
2.17 REPORTING GUIDELINES

At a minimum, the following questions should be answered. Some may require only a sentence or two; others may require more information. For additional suggestions about reporting, see Johnson and Young (2011), Recommendations 9 and 10, and Manly and Wells (2015).

In the “Introduction”: What types of analyses were done and why were these chosen?

In the “Methods” section: Details about initial sample selection, measurements, group comparisons (if any), and other aspects of procedure.

In the “Results” section: Data screening and handling procedures should be described at the beginning of the “Results” section. This should address each of the following questions:

1. What were the final numbers of cases for final analysis, after any respondents were dropped because they declined to participate or did not meet eligibility criteria (or presented other problems)? For longitudinal studies, a CONSORT flowchart may be helpful (see Section 2.1).
2. If any variables were dropped from planned analyses because of poor measurement quality or if groups were omitted or combined because of small n’s, explain.
3. Explain rule(s) for outlier detection and the way outliers were handled, and note how many changes were made during outlier evaluation. Explain any data transformations.
4. Report the amount of missing values, such as the percentage of scores missing in the entire data set, the percentage missing for each variable, or the percentage of participants missing one or more scores.
5. Describe possible reasons for missing values.
6. Explain pattern in missing values. Type A missingness is present if Little’s MCAR test is significant; details about the nature of missingness are found in the t tests and crosstabs that show how missingness dummy variables are related to other variables. It may not be possible to detect Type B missingness unless additional information is available beyond the data set; this possibility should be discussed. (Type A missingness is ignorable; Type B is problematic.)
7. Provide specific information about the imputation method used to replace missing values, including software, version, and commands; number of imputations; and any notable differences among results for different imputations and original data.
In the “Discussion” section: Be sure to explain the ways in which data problems, such as sample selection and missing values, may have (a) created bias in parameter estimates and (b) limited the generalizability of results.

2.18 SUMMARY

Before collecting data, researchers should decide on rules and procedures for data screening, outliers, and missing values, and then adhere to those rules. This information is required for preregistration of study plans. Open Science advocates preregistration as a way to improve completeness and transparency of reporting and calls for making data available for examination by other researchers through publicly available data archives. Some journals offer special badges for papers that report preregistered studies. For further discussion, see Asendorpf et al. (2013) and Cumming and Calin-Jageman (2016).

In addition, professional researchers often seek research funding from federal grant agencies (e.g., the National Science Foundation, the National Institutes of Health). These agencies now require detailed plans for data handling in the proposals, for example, decisions about sample size on the basis of statistical power analysis, plans for identification and handling of outliers, and plans for management of missing data. A few professional journals (for example, Psychological Science and some medical journals) provide the opportunity to preregister detailed plans for studies including this information. Journal editors are beginning to require greater detail and transparency in reporting data screening than in the past. The requirement for detailed reporting of data handling is likely to increase.

For many decisions about outliers and missing value replacement, there is no one best option. This chapter suggests several options for handling outliers, but there are many others (Aguinis et al., 2013). This chapter describes the use of MI for replacement of missing values, but additional methods are available or may become available in the future. The growing literature about missing values includes strong arguments for the use of MI and other state-of-the-art methods as ways to reduce bias. However, even state-of-the-art methods for replacement of missing values do not get rid of problems due to Type B missingness.

It is important to remember that many other common research practices may be even greater sources of bias. Use of convenience samples rather than random or representative samples limits the generalizability of findings. Practices such as p-hacking and hypothesizing after results are known to greatly inflate the risk for Type I error. Quality control during data collection is essential. Nothing that is done during data screening can make up for problems due to poor-quality data.

Numerous missing values situations are beyond the scope of this chapter, for example, imputation of missing values for categorical variables (Allison, 2002), attrition in longitudinal studies (Kristman, Manno, & Côté, 2005; Muthén, Asparouhov, Hunter, & Leuchter, 2011; Twisk & de Vente, 2002), missing data in multilevel or structural equation models, and missing values at the item level in research that uses multiple-item questionnaires to assess constructs such as depression (Parent, 2013).

Subsequent chapters assume that all appropriate data screening for generally required assumptions has been carried out. Additional data-screening procedures required for specific analyses will be introduced as needed.

APPENDIX 2A: BRIEF NOTE ABOUT ZERO-INFLATED BINOMIAL OR POISSON REGRESSION

The following empirical example provides an illustration. Figure 2.25 is adapted from Atkins and Gallop (2007). The count variable in their study (on the X axis) is the number of steps each person has taken toward divorce, ranging from 0 to 10. The distribution is clearly non-normal; it has a mode at zero and positive skewness (a very small proportion of persons in the sample had taken 8 or more steps).

Figure 2.25 Four Models for Distribution Shape of Frequency Count Variable
Source: Adapted from Atkins and Gallop (2007).
Note: Variable on the X axis is the number of steps or actions taken toward separation or divorce, ranging from 0 to 10.

Atkins and Gallop (2007) evaluated the fit of four mathematical distribution models to the empirical frequency distribution in Figure 2.25: Poisson, zero-inflated Poisson (ZIP), negative binomial, and zero-inflated negative binomial (ZINB). Quantitative criteria were used to evaluate model fit. They concluded that the ZIP model was the best fit for their data (results were very similar for the ZINB model). The regression analysis to predict number of steps toward divorce from other variables would be called zero-inflated Poisson regression; this is very different from linear regression.

It is possible to ask two questions about analyses in these models applied to behavior count variables. Consider illegal drug use as an example (e.g., Wagner, Riggs, & Mikulich-Gilbertson, 2015). First, we want to predict whether individuals use drugs or not. For those who do use drugs, a zero frequency of drug use in the past month is possible, but higher frequencies of use behaviors can occur. The set of variables that predicts frequency of drug use in this group may differ from the variables that predict use versus nonuse of drugs. This information would be missed if a data analyst applied ordinary linear regression.

The SPSS generalized linear models procedure can handle behavior count dependent variables. (Note that this is different from the GLM procedure used in Volume I [Warner, 2020].) For an online SPSS tutorial, see UCLA Institute for Digital Research & Education (2019). Atkins and Gallop (2007) provided extensive online supplemental material for their study. Note that count data should not be log transformed in an attempt to make them more nearly normally distributed (O’Hara & Kotze, 2010).

COMPREHENSION QUESTIONS
1. What can you look for in a histogram for scores on a quantitative variable?
2. What can you look for in a three-dimensional scatterplot?
3. What quantitative rule can be used to decide whether a univariate score is an outlier?
4. Are there situations in which you can justify deleting a case or participant completely? If so, what are they?
5. Under what conditions might you convert a score to system missing?
6. What is the point of running an analysis once with outliers included and once with outliers deleted?
7. What is a way to identify multivariate outliers using Mahalanobis distance?
8. Describe two distribution shapes (other than normal) that you might see in actual data (hint: any other distribution graph you have seen in this chapter, along with any strange things you might have seen in other data).
9. When can log transformations be used, and what potential benefits do these have? When should log transformations not be used?
10. If you have a dependent variable that represents a count of some behavior, would you expect data to be normally distributed? Why or why not? What types of distribution better describe this type of data? Can you use linear regression? What type of analysis would be preferable?
11. Which do authorities believe generally pose more serious problems in analysis: outliers or non-normal distribution shapes?
12. What problems arise when listwise deletion is used to handle missing values?

NOTE

1. Chapter 7, on moderation, explains that when forming products between predictor variables, the correlation between X² and X can be reduced by using centered scores on X to compute the squared term. A variable is centered by subtracting out its mean. In other words, we can calculate X² = (X - M_X) × (X - M_X), where M_X is the mean of X. The significance of the quadratic trend is the same whether X is centered or not; however, judgments about whether there could also be a significant linear trend can change depending on whether X was centered before computing X².

DIGITAL RESOURCES

Find free study tools to support your learning, including eFlashcards, data sets, and web resources, on the accompanying website at edge.sagepub.com/warner3e.

Descriptions of Images and Figures

A flow diagram of the progress through the phases of enrolment, intervention allocation, follow-up, and data analysis is displayed. The details of the phases are as follows:

Flow during enrollment phase is as follows:
Assessed for eligibility (n= )
Excluded (n= )
Not meeting inclusion criteria (n= )
Declined to participate (n= )
Other reasons (n= )
Randomized (n= )

During the allocation phase, there are two steps as mentioned below:
1. Allocated to intervention (n= )
Received allocated intervention (n= )
Did not receive allocated intervention (give reasons) (n= )
2. Allocated to intervention (n= )
Received allocated intervention (n= )
Did not receive allocated intervention (give reasons) (n= )

During the Follow-up phase, there are two steps as mentioned below:
1. Lost to follow-up (give reasons) (n= )
Discontinued intervention (give reasons) (n= )
2. Lost to follow-up (give reasons) (n= )
Discontinued intervention (give reasons) (n= )

During the Analysis phase, there are two steps as mentioned below:
1. Analysed (n= )
Excluded from analysis (give reasons) (n= )
2. Analysed (n= )
Excluded from analysis (give reasons) (n= )

The horizontal axis represents weight in kilograms and ranges from negative 2,000 to 10,000 in increments of 2,000. The vertical axis represents the metabolic rate and ranges from negative 200 to 1,200 in intervals of 200. The data corresponding to the metabolic rate and weight is summarized below:
Mouse: Negative 1137.77; Negative 79.5. Rat: Negative 41.01; Negative 68.1. Dog: 657.10; Negative 70.5. Cat: Negative 407.39; Negative 24.6. Monkey: 40.81; 8.3. Goat: 90.02; 46.6. Sheep: 90.02; 46.6. Man: 89.61; 71.1. Chimp guinea pig: 6.28; 84.7. Pig: 205.68; 87.6. Cow: 238; 142.1. Boar: 487.16; 150.4. Horse: 647.38; 505.3. Bull: 976.57; 696.5. Elephant: 8049.85; 1114.5.

The horizontal axis represents weight in kilograms and ranges from 0.01 to 10,000; a log scale is also given. The log scale ranges from negative 2 to 4, in increments of 1. The vertical axis represents the standard metabolic rate in watts and ranges from 0.1 to 1,000. The data corresponding to the metabolic rate and weight is summarized below:
Mouse: Negative 1.72; 107.19. Rat: Negative 0.39; 312.82. Guinea Pig: Negative 0.33; 337.40. Cat: Negative 0.47; 444.64. Monkey: 0.73; 506.11. Dog: 1.18; 571.15. Goat: 1.62; 615.14. Chimpanzee: 1.56; 632.66. Sheep: 1.69; 657.26. Man: 1.84; 687.12. Pig: 2.16; 717.04. Boar: 2.42; 799.56. Cow: 2.68; 797.88. Horse: 2.70; 836.48. Bull: 2.83; 875.11. Elephant: 3.56; 989.35.

The x-axis represents sugar and ranges from 0 to 800, in increments of 400. The y-axis represents body mass index and ranges from 10 to 40, in increments of 10. The z-axis represents fat and ranges from 0 to 20, in increments of 5.
The “ZX” plane shows a correlation between fat and sugar intake. These two variables are directly proportional.
The “YZ” plane shows a correlation between fat and body mass index. These two variables are directly proportional.
Most of the data points are seen cluttered around the intersection points of these three planes. The outlier is shown near the top around the “YZ” plane.

The first dialog box is titled “Linear Regression” and is divided into two panes. In the first pane variables are listed in alphabetical order and with nominal and measurement icons. The list reads as follows: “Weight”, “SWLS”, “BMI”, and so on. The second pane has a data field labelled “Dependent”; the text reads “Id number”. A horizontal pane below these data fields is labelled “Block 1 of 1” and is empty, with two tabs, “Previous” and “Next”. It has a data field labelled “Independent(s)”, and the text reads “Sugar”, “fat”, and “BMI”. The option of “BMI” is selected. Below this is a data field labelled Method with “Enter” written in a dropdown cell. There are three data fields with empty cells below this pane, listed below.
Selection Variable
Case Labels
WLS Weight
There are tabs to the right side of this pane that read as follows: Statistics; Plots; Save; Options; Style; and Bootstrap. At the bottom of the dialog box are the following tabs: “Reset”, “Paste”, “Cancel”, and “OK”. The tab of “OK” is selected. The second dialog box is titled “Linear Regression: Save” and has details of the predicted values, distances, prediction intervals, residuals, influence statistics, and coefficient statistics. The pane that is labeled distances has the following checkboxes: Mahalanobis, Cook’s, and leverage values. The option of Mahalanobis is selected.

The linear scatter plot on the left has a horizontal axis that represents the variable X and ranges from 0 to 20, in increments of 5. The vertical axis represents the variable Y and ranges from 0 to 20, in increments of 5. Data points plotted in the graph show a linear correlation, and the best fit line is seen going from bottom right to top left of the graph. The equation Y equals negative 2.11 plus 1.13 multiplied by X is written on top of the best fit line. The quadratic scatter plot on the right has a horizontal axis that represents the variable Q and ranges from 0 to 20, in increments of 5. The vertical axis represents the variable Y and ranges from 0 to 20, in increments of 5. Data points are plotted in the graph and do not show a linear correlation, but seem scattered throughout the graph. The curve in the form of an “n” shows the trend in this graph. The equation Y equals negative 4.35 minus 3.35.0.17 multiplied by X2.

The table has six columns that detail unstandardized B, coefficients standard error, standard coefficients beta, t, and sig for various models listed below. The dependent variable is Y in this scenario. The details of the table are as below: When the model is constant
Unstandardized B: −2.200; coefficients standard error: 2.276; standardized coefficients beta: NA; t: −.967; Sig: .357.
For X:
Unstandardized B: 1.214; coefficients standard error: .522; standardized coefficients beta: .980; t: 2.325; Sig: .042.
For X squared:
Unstandardized B: −.001; coefficients standard error: .025; standardized coefficients beta: −.020; t: −.048; Sig: .963.

The table has six columns that detail the unstandardized B, coefficients standard error, standardized coefficients beta, t, and sig for the terms listed below. The dependent variable is Y. The details of the table are as follows:
For the constant:
Unstandardized B: −4.350; coefficients standard error: 3.970; standardized coefficients beta: NA; t: −1.096; Sig: .299.
For Q:
Unstandardized B: 3.552; coefficients standard error: .873; standardized coefficients beta: 3.134; t: 4.070; Sig: .002.
For Q squared:
Unstandardized B: −.173; coefficients standard error: .041; standardized coefficients beta: −3.234; t: −4.199; Sig: .002.
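
The two coefficient tables come from regressions of Y on a predictor and its square. In the tables, the squared term is nonsignificant for X and significant and negative for Q, which matches the inverted-U trend. The following is a minimal sketch of that kind of fit in Python with NumPy, not the SPSS procedure used in the text. The function name `fit_quadratic` and the simulated data are assumptions made for illustration only; they are not the textbook data.

```python
import numpy as np

def fit_quadratic(x, y):
    """Ordinary least squares fit of y = b0 + b1*x + b2*x**2.

    Returns the coefficients, their standard errors, and the t ratios.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x, x**2])   # design matrix
    b, *_ = np.linalg.lstsq(X, y, rcond=None)         # least squares solution
    resid = y - X @ b
    df_resid = len(y) - X.shape[1]                    # N - number of coefficients
    mse = resid @ resid / df_resid                    # residual variance
    se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
    return b, se, b / se

# Hypothetical scores simulated from an inverted-U trend (illustration only).
rng = np.random.default_rng(0)
q = np.arange(1, 21, dtype=float)
y = -4.0 + 3.5 * q - 0.17 * q**2 + rng.normal(0, 1.0, size=q.size)

b, se, t = fit_quadratic(q, y)
for name, bi, sei, ti in zip(["constant", "Q", "Q squared"], b, se, t):
    print(f"{name:>10}: B = {bi:7.3f}, SE = {sei:6.3f}, t = {ti:7.3f}")
```

A large negative t ratio for the squared term suggests curvature of the kind shown in the right-hand scatter plot; a t ratio near zero suggests that a straight line is adequate.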

The details of the tables are as follows:
For Valid:
Depression: 218; Satisfaction with life: 150; Negative affect: 226; Neuroticism: 220; Sex: 240; Social desirability: 240.
For Missing:
Depression: 22; Satisfaction with life: 90; Negative affect: 14; Neuroticism: 20; Sex: 0; Social desirability: 0.

In the first screenshot, the dialog box is divided into three vertical panes. The first pane on the left side displays a list of variables with the nominal icon in alphabetical order. The list reads as follows: “Id Number”, “Negative Effect”, “Sex”, and so on. The pane in the center is titled “Numeric Variable → Output Variable”.
Below it is a box with the following text: “Depression → Missing depression”. Between the left and center panes, a “return key” button can be seen. A tab for “Old and New Values…” can be seen at the bottom of the central pane. The last pane on the right side is titled “Output Variable” and has two fields, Name and Label. The field for “Name” reads AGE 2, and the field for “Label” reads “AGE 50 plus (or not).” A tab for “Change” can be seen at the bottom of this “Output Variable” pane. Another field with an “If” tab can be seen at the bottom of this dialog box, with the following text written next to it in parentheses: “optional case selection condition”. At the bottom of the dialog box are the following tabs: “Reset”, “Paste”, “Cancel”, and “OK”. The tabs “Paste” and “OK” are disabled. The third pane is labeled “Output Variable” and has two fields, Name and Label. The Label field is empty, and the Name field has the text “missing depression”. It also has a “Change” button at the bottom. The second screenshot has a dialog box that is divided into two panes. The first pane, titled “Old Value”, has a list of seven fields in the following order:
Value
System-missing
System- or user-missing
Range
Range, Lowest through value
Range, Value through Highest
All other values
The field “All other values” is selected. Under the field “Value” there is an empty cell. Similarly, for the field “Range”, two empty cells for the start and end of the range can be seen. The second pane, titled “New Value”, is a horizontal pane on the right side of the dialog box. The fields in the pane are as follows:
Value
System-missing
Copy old value(s)
The field “Value” is selected and has an empty cell next to it.
The pane below this is labeled “Old → New” and has the following text in the cell below: SYSMIS → 1; ELSE → 0. There are two checkboxes under this pane, in the following order:
Output variables are strings. Convert numeric strings to numbers. This check box is disabled.
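
The recode shown in these dialogs (SYSMIS → 1, ELSE → 0) creates a dummy variable that flags cases with a missing depression score. A minimal sketch of the same idea in Python with pandas follows, assuming that missing scores are stored as NaN. The column names and the five scores are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; NaN plays the role of the SPSS system-missing value.
df = pd.DataFrame({"depression": [12.0, np.nan, 30.0, np.nan, 21.0]})

# SYSMIS -> 1, ELSE -> 0: flag the cases that have no depression score.
df["missingdepression"] = df["depression"].isna().astype(int)

print(df)
```

The new 0/1 variable can then be used as a grouping variable to ask whether cases with and without a depression score differ on other variables.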

The table is labeled “total missing” and details the scores for four variables. The details are as follows:
For missing cases 0:
Frequency: 116; Percent: 48.3; Valid percent: 48.3; Cumulative percent: 48.3.
For missing cases 1:
Frequency: 103; Percent: 42.9; Valid percent: 42.9; Cumulative percent: 91.3.
For missing cases 2:
Frequency: 20; Percent: 8.3; Valid percent: 8.3; Cumulative percent: 99.6.
For missing cases 3:
Frequency: 1; Percent: .4; Valid percent: .4; Cumulative percent: 100.0.
For Total:
Frequency: 240; Percent: 100.0; Valid percent: 100.0; Cumulative percent: NA.

The dialog box is labeled “Analyze Patterns” and is divided into two vertical panes. In the first pane, variables are listed with nominal and measurement icons. The list reads as follows: “Id number”, “Missing satisfaction”, “Total missing”, and so on. The second pane is labeled “Analyze Across Variables”. It lists variables such as depression, satisfaction, sex, and so on. The variable “Sex” is selected. An empty data field labeled “Analysis Weight” is also given in the pane. The third pane is labeled “Output” and has the following check boxes:
Summary of missing values. Patterns of missing values. Variables with highest frequency of missing values.
All three checkboxes are selected. Two fields for the maximum number of variables displayed and the minimum percentage missing are also given in the pane.
Maximum number of variables displayed: 25. Minimum percentage missing for variable to be displayed: 10.
There are tabs for “Reset”, “Paste”, “Cancel”, and “OK”. The “OK” tab is selected in this screenshot.

The pie chart labeled “Variables” details the following:
For complete data: Variables: 2; 33.33%. For incomplete data: Variables: 4; 66.67%.
The pie chart labeled cases details the following:
For complete data: Cases: 116; 48.33%. For incomplete data: Cases: 124; 51.67%.
The pie chart labeled values details the following:
For complete data: Values: 1,294; 89.86%. For incomplete data: Values: 146; 10.14%.
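
The three pie charts summarize missingness at the level of variables, cases, and individual values. As an arithmetic check on the values chart, the missing counts reported for the six variables (22, 90, 14, 20, 0, and 0) sum to 146 of the 240 × 6 = 1,440 values, or about 10.14%. The sketch below computes the same three summaries, plus the number of missing values per case, for a small data frame. The function name `missing_summary` and the four-case data set are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return (complete, incomplete) counts for variables, cases, and values."""
    is_missing = df.isna()
    n_cases, n_vars = df.shape
    incomplete_vars = int(is_missing.any(axis=0).sum())   # columns with any NaN
    incomplete_cases = int(is_missing.any(axis=1).sum())  # rows with any NaN
    missing_values = int(is_missing.to_numpy().sum())     # individual NaN cells
    total_values = n_cases * n_vars
    return {
        "variables": (n_vars - incomplete_vars, incomplete_vars),
        "cases": (n_cases - incomplete_cases, incomplete_cases),
        "values": (total_values - missing_values, missing_values),
    }

# Hypothetical four-case data set (illustration only).
df = pd.DataFrame({
    "depression": [1.0, np.nan, 3.0, 4.0],
    "satisfaction": [np.nan, np.nan, 2.0, 5.0],
    "sex": [1, 2, 1, 2],
})

summary = missing_summary(df)
print(summary)

# Number of missing values per case, then a frequency table of those counts.
total_missing = df.isna().sum(axis=1)
print(total_missing.value_counts().sort_index())
```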

The dialog box is divided into two vertical panes. The first pane lists the variables “Id number”, “Negative effect”, “Sex”, and so on. The second pane is labeled “Dependent Variables” and lists the variables “Depression”, “Social desirability”, and so on. A data field for the factor, missingnegeffect, is given at the bottom of this pane. The first row has the following values against each of them in the following order: 1, MAEDUC, and PAEDUC. There are three tabs on the right side of the second pane: Contrasts, Post Hoc, and Options. At the bottom of the dialog box are the following tabs, all enabled: “Reset”, “Paste”, “Cancel”, and “OK”.

The output table displays how missingness on one variable is related to scores on other quantitative variables.

The table gives the cross-tabulation results for the variables sex and missing depression.
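
Both of these outputs ask whether missingness on one variable is related to other variables. For a categorical variable such as sex, a cross-tabulation with a chi-square test can be used; for a quantitative variable, group means can be compared. The sketch below uses pandas and SciPy on simulated data. The variable names, sample size, and missingness rate are assumptions, and the Welch t test is a substitute chosen here for simplicity, not the SPSS procedure shown in the dialog.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated data (illustration only): depression is set to missing
# completely at random for roughly 15% of cases.
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "sex": rng.integers(1, 3, size=n),            # coded 1 or 2
    "neuroticism": rng.normal(30, 5, size=n),
    "depression": rng.normal(22, 6, size=n),
})
df.loc[rng.random(n) < 0.15, "depression"] = np.nan
df["missingdepression"] = df["depression"].isna().astype(int)

# Categorical variable: sex by missingness, with a chi-square test.
table = pd.crosstab(df["sex"], df["missingdepression"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print(table)
print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p_chi2:.3f}")

# Quantitative variable: compare neuroticism for cases with and
# without a depression score (Welch t test, unequal variances).
present = df.loc[df["missingdepression"] == 0, "neuroticism"]
absent = df.loc[df["missingdepression"] == 1, "neuroticism"]
t_stat, p_t = stats.ttest_ind(present, absent, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_t:.3f}")
```

Because the simulated missingness does not depend on sex or neuroticism, these tests should usually be nonsignificant; systematic differences in real data would argue against treating values as missing completely at random.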

The dialog box is divided into two vertical panes. The first pane lists the variables “Id number”, “Negative effect”, “Sex”, and so on. It has a button named “Use All Variables” at the bottom. The second pane has two windows labeled “Quantitative Variables” and “Categorical Variables”. The quantitative variables window lists the variables “Depression”, “Social desirability”, and so on, and the categorical variables window lists the variable Sex. A data field below defines the missing value as 25. Another empty data field labeled “Case Labels” is also given at the bottom.
There are tabs on the right side of the second pane: Patterns and Descriptives. Below these tabs is a window labeled “Estimation” with checkboxes for Listwise, Pairwise, EM, and Regression. The checkbox for EM is selected. Below this window are more tabs for Variables, EM, and Regression. The tab for Regression is selected in the dialog box. At the bottom of the dialog box are the following tabs, all enabled: “Reset”, “Paste”, “Cancel”, and “OK”.

The table is titled “EM Means”, with a superscript a. It lists the EM-estimated means for the following variables:
Depression: 22.35
Satisfaction w life: 17.40
Negative affect: 24.39
Neuroticism: 30.50
Social desirability: 9.85
The following text is written at the bottom of the table: Little’s MCAR test: Chi-square = 136.081, DF = 33, Sig. = .000.

The dialog box is labeled “Impute Missing Data Values” and has tabs for Variables, Method, Constraints, and Output. The Variables tab is selected and is divided into two vertical panes. In the first pane, variables are listed with nominal and measurement icons. The list reads as follows: “Id number”, “Missing satisfaction”, “Total missing”, and so on. The second pane is labeled “Variables in Model”. It lists variables such as depression, satisfaction, social desirability, and so on. The variable “Social desirability” is selected. An empty data field labeled “Analysis Weight” is also given in the pane. A data field below defines the missing value as 25. A horizontal pane at the bottom is labeled “Location of Imputed Data” and has the following radio buttons:
Create a new dataset, with a data field for the dataset name: Imputed data. Write to a new data file, with a disabled “Browse” tab at the bottom.
Text below the dialog box reads: After generating a data set containing the imputed values, you can use ordinary SPSS Statistics analysis procedures marked by the icon to analyze your data. See Help for a complete list of supported analysis procedures.

The dialog box is labeled “Split File” and is divided into two panes. The pane on the left lists variables with nominal and measurement icons. The list reads as follows: “Id number”, “Missing satisfaction”, “Total missing”, and so on. The variable “ID number” is selected. Text below reads: Current status: compare imputation. The second pane has radio buttons for the following:
Analyze all cases, do not create groups. Compare groups. Organize output by groups.
The radio button “Compare groups” is selected. A data field labeled “Groups Based on” has a variable that reads: imputation number (imputation). Two more radio buttons after the data field read:
Sort the file by grouping variables. File is already sorted.
The radio button for “Sort the file by grouping variables” is selected. At the bottom of the dialog box are the following tabs: “Reset”, “Paste”, “Cancel”, and “OK”. The “OK” tab is selected.

The table predicts depression from the variables Satisfaction wLife, Negative Affect, Neuroticism, Sex, and Socialdesirability.
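
When the same regression is run separately within each imputed data set, the results are usually combined into a single pooled estimate. A common way to do this is Rubin's rules: the pooled coefficient is the mean of the per-imputation coefficients, and its variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The sketch below is an illustration under that assumption; the function name `pool_rubin` and the five slopes and standard errors are hypothetical, not values from the table described here.

```python
import numpy as np

def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed data sets using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    m = q.size
    pooled = q.mean()                        # pooled point estimate
    within = np.mean(se**2)                  # average within-imputation variance
    between = q.var(ddof=1)                  # between-imputation variance
    total = within + (1 + 1 / m) * between   # total variance
    return pooled, np.sqrt(total)

# Hypothetical slope and standard error from m = 5 imputed data sets.
slopes = [0.50, 0.54, 0.47, 0.52, 0.49]
ses = [0.10, 0.11, 0.10, 0.09, 0.10]

pooled_b, pooled_se = pool_rubin(slopes, ses)
print(f"pooled B = {pooled_b:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term reflects the extra uncertainty due to the missing values.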

The horizontal axis represents the number of steps or actions taken toward separation or divorce. It ranges from 0 to 10, in increments of 2. The vertical axis represents the proportion of the sample and ranges from .00 to .20, in increments of .05. There are 12 bins in the histogram, and the frequency distributions for four mathematical distribution models are shown in the graph: Poisson, zero-inflated Poisson, negative binomial, and zero-inflated negative binomial.
The details of the frequency distributions are as follows:
Poisson model:
Number of steps: 1; Proportion of sample: 0.17
Number of steps: 2; Proportion of sample: 0.20
Number of steps: 3; Proportion of sample: 0.18
Number of steps: 4; Proportion of sample: 0.14
Number of steps: 5; Proportion of sample: 0.10
Number of steps: 6; Proportion of sample: 0.06
Number of steps: 7; Proportion of sample: 0.04
Number of steps: 8; Proportion of sample: 0.03
Number of steps: 9; Proportion of sample: 0.02
Number of steps: 10; Proportion of sample: 0.01
Zero-inflated Poisson model:
Number of steps: 1; Proportion of sample: 0.24
Number of steps: 2; Proportion of sample: 0.07
Number of steps: 3; Proportion of sample: 0.12
Number of steps: 4; Proportion of sample: 0.15
Number of steps: 5; Proportion of sample: 0.14
Number of steps: 6; Proportion of sample: 0.08
Number of steps: 7; Proportion of sample: 0.05
Number of steps: 8; Proportion of sample: 0.04
Number of steps: 9; Proportion of sample: 0.03
Number of steps: 10; Proportion of sample: 0.02
Negative binomial model:
Number of steps: 1; Proportion of sample: 0.24
Number of steps: 2; Proportion of sample: 0.08
Number of steps: 3; Proportion of sample: 0.12
Number of steps: 4; Proportion of sample: 0.14
Number of steps: 5; Proportion of sample: 0.13
Number of steps: 6; Proportion of sample: 0.08
Number of steps: 7; Proportion of sample: 0.05
Number of steps: 8; Proportion of sample: 0.02
Number of steps: 10; Proportion of sample: 0.02
Zero-inflated negative binomial model:
Number of steps: 1; Proportion of sample: 0.16
Number of steps: 2; Proportion of sample: 0.19
Number of steps: 3; Proportion of sample: 0.17
Number of steps: 4; Proportion of sample: 0.14
Number of steps: 5; Proportion of sample: 0.10
Number of steps: 6; Proportion of sample: 0.06
Number of steps: 7; Proportion of sample: 0.04
Number of steps: 8; Proportion of sample: 0.03
Number of steps: 9; Proportion of sample: 0.02
Number of steps: 10; Proportion of sample: 0.02
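
The four models named in this figure differ in how they assign probability to counts, especially to zero. A zero-inflated model mixes a point mass at zero (with probability π) with an ordinary count distribution, so that P(Y = 0) = π + (1 − π)·f(0) and P(Y = k) = (1 − π)·f(k) for k > 0. The sketch below computes the four probability functions with SciPy. The parameter values (`lam`, `pi_zero`, `size`, `prob`) and the helper name `zero_inflated_pmf` are hypothetical choices for illustration; they are not estimates from the divorce data.

```python
import numpy as np
from scipy import stats

def zero_inflated_pmf(k, base_pmf, pi_zero):
    """Mix a point mass at zero (probability pi_zero) with a count distribution."""
    k = np.asarray(k)
    p = (1 - pi_zero) * base_pmf(k)
    return np.where(k == 0, pi_zero + p, p)

k = np.arange(0, 11)       # counts 0 through 10
lam = 3.0                  # hypothetical Poisson mean
pi_zero = 0.2              # hypothetical proportion of structural zeros
size, prob = 2.0, 0.4      # hypothetical negative binomial parameters (mean = 3)

poisson = stats.poisson.pmf(k, lam)
zip_pmf = zero_inflated_pmf(k, lambda x: stats.poisson.pmf(x, lam), pi_zero)
negbin = stats.nbinom.pmf(k, size, prob)
zinb_pmf = zero_inflated_pmf(k, lambda x: stats.nbinom.pmf(x, size, prob), pi_zero)

for name, pmf in [("Poisson", poisson), ("ZIP", zip_pmf),
                  ("NegBin", negbin), ("ZINB", zinb_pmf)]:
    print(f"{name:>8}: " + " ".join(f"{p:.2f}" for p in pmf))
```

Comparing such predicted proportions with the observed histogram is one informal way to judge which model describes the distribution of counts best.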