REGRESSION
Regression is a statistical tool that allows you to predict the value of one continuous variable from one or more other variables. When you perform a regression analysis, you create a regression equation that predicts the values of your DV using the values of your IVs. Each IV is associated with a coefficient in the equation that summarizes the relationship between that IV and the DV. Once we estimate a set of coefficients in a regression equation, we can use hypothesis tests and confidence intervals to make inferences about the corresponding parameters in the population. You can also use the regression equation to predict the value of the DV given a specified set of values for your IVs.
Simple Linear Regression
Simple linear regression is used to predict the value of a single continuous DV (which we will call Y) from a single continuous IV (which we will call X). Regression assumes that the relationship between the IV and the DV can be represented by the equation Yi = β0 + β1Xi + εi, where Yi is the value of the DV for case i, Xi is the value of the IV for case i, β0 and β1 are constants, and εi is the error in prediction for case i. When you perform a regression, what you are basically doing is determining estimates of β0 and β1 that let you best predict values of Y from values of X.
You may remember from geometry that the above equation is equivalent to a straight line. This is no accident, since the purpose of simple linear regression is to define the line that represents the relationship between our two variables. β0 is the intercept of the line, indicating the expected value of Y when X = 0. β1 is the slope of the line, indicating how much we expect Y will change when we increase X by a single unit.
The regression equation above is written in terms of population parameters. That indicates that our goal is to determine the relationship between the two variables in the population as a whole.
We typically do this by taking a sample and then performing calculations to obtain the estimated regression equation Yi = b0 + b1Xi. Once you estimate the values of b0 and b1, you can substitute in those values and use the regression equation to predict the expected values of the DV for specific values of the IV. Predicting the values of Y from the values of X is referred to as regressing Y on X. When analyzing data from a study you will typically want to regress the values of the DV on the values of the IV. This makes sense since you want to use the IV to explain variability in the DV.
We typically calculate b0 and b1 using least squares estimation. This chooses estimates that minimize the sum of squared errors between the values of the estimated regression line and the actual observed values.
In addition to using the estimated regression equation for prediction, you can also perform hypothesis tests regarding the individual regression parameters. The slope of the regression equation (β1) represents the change in Y with a one-unit change in X. If X predicts Y, then as X increases, Y should change in some systematic way. You can therefore test for a linear relationship between X and Y by determining whether the slope parameter is significantly different from zero.
When performing linear regression, we typically make the following assumptions about the error terms εi.
1. The errors have a normal distribution.
2. The same amount of error in the model is found at each level of X.
3. The errors in the model are all independent.
To perform a simple linear regression in SPSS
• Choose Analyze → Regression → Linear.
• Move the DV to the Dependent box.
• Move the IV to the Independent(s) box.
• Click the Continue button.
• Click the OK button.
The output from this analysis will contain the following sections.
• Variables Entered/Removed. This section is only used in model building and contains no useful information in simple linear regression.
• Model Summary. The value listed below R is the correlation between your variables. The value listed below R Square is the proportion of variance in your DV that can be accounted for by your IV. The value in the Adjusted R Square column is a measure of model fit, adjusting for the number of IVs in the model. The value listed below Std. Error of the Estimate is the standard deviation of the residuals.
• ANOVA. Here you will see an ANOVA table, which provides an F test of the relationship between your IV and your DV. If the F test is significant, it indicates that there is a relationship.
• Coefficients. This section contains a table where each row corresponds to a single coefficient in your model. The row labeled Constant refers to the intercept, while the row containing the name of your IV refers to the slope. Inside the table, the column labeled B contains the estimates of the parameters and the column labeled Std. Error contains the standard error of those parameters. The column labeled Beta contains the standardized regression coefficient, which is the parameter estimate that you would get if you standardized both the IV and the DV by subtracting off their mean and dividing by their standard deviations. Standardized regression coefficients are sometimes used in multiple regression (discussed below) to compare the relative importance of different IVs when predicting the DV. In simple linear regression, the standardized regression coefficient will always be equal to the correlation between the IV and the DV. The column labeled t contains the value of the t-statistic testing whether the value of each parameter is equal to zero. The p-value of this test is found in the column labeled Sig. If the value for the IV is significant, then there is a relationship between the IV and the DV. Note that the square of the t statistic is equal to the F statistic in the ANOVA table and that the p-values of the two tests are equal. This is because both of these are testing whether there is a significant linear relationship between your variables.
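The quantities reported in this output can be reproduced by hand. The following is a minimal Python sketch of least squares estimation for one IV, assuming a small made-up data set; the function name simple_regression and the data values are hypothetical and are not part of SPSS. It illustrates that the standardized coefficient equals the correlation and that the square of the t statistic for the slope equals the F statistic.

```python
import math

def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x plus the summary values SPSS reports."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    b1 = sxy / sxx                 # slope
    b0 = mean_y - b1 * mean_x      # intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = syy - sse                # sum of squares explained by the IV
    mse = sse / (n - 2)            # residual variance
    r = sxy / math.sqrt(sxx * syy)
    return {
        "b0": b0,
        "b1": b1,
        "r": r,
        "r_square": ssr / syy,
        "std_error_of_estimate": math.sqrt(mse),
        "beta": b1 * math.sqrt(sxx / syy),   # standardized slope
        "t": b1 / math.sqrt(mse / sxx),      # t statistic for the slope
        "F": ssr / mse,                      # F statistic from the ANOVA table
    }

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
fit = simple_regression(x, y)
print(fit)
```

The p-values are left out of this sketch because they require the t and F distributions, which are not in the Python standard library.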
Multiple Regression
Sometimes you may want to explain variability in a continuous DV using several different continuous IVs. Multiple regression allows us to build an equation predicting the value of the DV from the values of two or more IVs. The parameters of this equation can be used to relate the variability in our DV to the variability in specific IVs.
Sometimes people use the term multivariate regression to refer to multiple regression, but most statisticians do not use "multiple" and "multivariate" as synonyms. Instead, they use the term "multiple" to describe analyses that examine the effect of two or more IVs on a single DV, while they reserve the term "multivariate" to describe analyses that examine the effect of any number of IVs on two or more DVs.
The general form of the multiple regression model is Yi = β0 + β1Xi1 + β2Xi2 + … + βkXik + εi. The elements in this equation are the same as those found in simple linear regression, except that we now have k different slope parameters, which are multiplied by the values of the k IVs to get our predicted value. We can again use least squares estimation to determine the estimates of these parameters that best fit our observed data. Once we obtain these estimates we can either use our equation for prediction, or we can test whether our parameters are significantly different from zero to determine whether each of our IVs makes a significant contribution to our model.
Care must be taken when making inferences based on the coefficients obtained in multiple regression. The way that you interpret a multiple regression coefficient is somewhat different from the way that you interpret coefficients obtained using simple linear regression. Specifically, the value of a multiple regression coefficient represents the ability of the part of the corresponding IV that is unrelated to the other IVs to predict the part of the DV that is unrelated to the other IVs.
It therefore represents the unique ability of the IV to account for variability in the DV. One implication of the way coefficients are determined is that your parameter estimates become very difficult to interpret if there are large correlations among your IVs. The effect of these relationships on multiple regression coefficients is called multicollinearity. This changes the values of your coefficients and greatly increases their variance. It can cause you to find that none of your coefficients are significantly different from zero, even when the overall model does a good job predicting the value of the DV.
The typical effect of multicollinearity is to reduce the size of your parameter estimates. Since the value of the coefficient is based on the unique ability for an IV to account for variability in a DV, if there is a portion of variability that is accounted for by multiple IVs, all of their coefficients will be reduced.
Under certain circumstances multicollinearity can also create a suppression effect. If you have one IV that has a high correlation with another IV but a low correlation with the DV, you can find that the multiple regression coefficient for the second IV from a model including both variables can be larger (or even opposite in direction!) compared to the coefficient from a model that doesn't include the first IV. This happens when the part of the second IV that is independent of the first IV has a different relationship with the DV than does the part that is related to the first IV. It is called a suppression effect because the relationship that appears in multiple regression is suppressed when you just look at the variable by itself.
To perform a multiple regression in SPSS
• Choose Analyze → Regression → Linear.
• Move the DV to the Dependent box.
• Move all of the IVs to the Independent(s) box.
• Click the Continue button.
• Click the OK button.
The SPSS output from a multiple regression analysis contains the following sections.
• Variables Entered/Removed. This section is only used in model building and contains no useful information in standard multiple regression.
• Model Summary. The value listed below R is the multiple correlation between your IVs and your DV. The value listed below R Square is the proportion of variance in your DV that can be accounted for by your IVs. The value in the Adjusted R Square column is a measure of model fit, adjusting for the number of IVs in the model. The value listed below Std. Error of the Estimate is the standard deviation of the residuals.
• ANOVA. This section provides an F test for your statistical model. If this F is significant, it indicates that the model as a whole (that is, all IVs combined) predicts significantly more variability in the DV compared to a null model that only has an intercept parameter. Notice that this test is affected by the number of IVs in the model being tested.
• Coefficients. This section contains a table where each row corresponds to a single coefficient in your model. The row labeled Constant refers to the intercept, while the coefficients for each of your IVs appear in the row beginning with the name of the IV. Inside the table, the column labeled B contains the estimates of the parameters and the column labeled Std. Error contains the standard error of those estimates. The column labeled Beta contains the standardized regression coefficient. The column labeled t contains the value of the t-statistic testing whether the value of each parameter is equal to zero. The p-value of this test is found in the column labeled Sig. A significant t-test indicates that the IV is able to account for a significant amount of variability in the DV, independent of the other IVs in your regression model.
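The idea that a multiple regression coefficient reflects the unique ability of an IV can be checked numerically. The sketch below is an illustration in plain Python, not SPSS output; the helper names ols and residuals and the data are hypothetical. It fits a two-IV model and then shows that the coefficient for the first IV equals the slope you get when you regress the part of the DV that is unrelated to the second IV on the part of the first IV that is unrelated to the second IV.

```python
def ols(columns, y):
    """Least-squares coefficients [b0, b1, ...] for y on the given IV columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def residuals(columns, y):
    """Part of y that is not predicted by the given IV columns."""
    b = ols(columns, y)
    return [yi - (b[0] + sum(bj * col[i] for bj, col in zip(b[1:], columns)))
            for i, yi in enumerate(y)]

# Hypothetical data: x1 and x2 are correlated IVs, y is the DV
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
y = [3.1, 3.9, 7.2, 6.8, 11.1, 10.2, 14.9, 14.1]

b = ols([x1, x2], y)                  # full model: b[1] is the coefficient for x1
x1_unique = residuals([x2], x1)       # part of x1 unrelated to x2
y_unique = residuals([x2], y)         # part of y unrelated to x2
unique_slope = ols([x1_unique], y_unique)[1]
print(b[1], unique_slope)
```

The two printed values agree up to rounding error, which is why a coefficient shrinks when much of an IV's predictive ability is shared with the other IVs.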
Multiple regression with interactions
In addition to determining the independent effect of each IV on the DV, multiple regression can also be used to detect interactions between your IVs. An interaction measures the extent to which the relationship between an IV and a DV depends on the level of other IVs in the model. For example, if you have an interaction between two IVs (called a two-way interaction) then you expect that the relationship between the first IV and the DV will be different across different levels of the second IV. Interactions are symmetric, so if you have an interaction such that the effect of IV1 on the DV depends on the level of IV2, then it is also true that the effect of IV2 on the DV depends on the level of IV1. It therefore does not matter whether you say that you have an interaction between IV1 and IV2 or an interaction between IV2 and IV1.
You can also have interactions between more than two IVs. For example, you can have a three-way interaction between IV1, IV2, and IV3. This would mean that the two-way interaction between IV1 and IV2 depends on the level of IV3. Just like two-way interactions, three-way interactions are also independent of the order of the variables. So the above three-way interaction would also mean that the two-way interaction between IV1 and IV3 is dependent on the level of IV2, and that the two-way interaction between IV2 and IV3 depends on the level of IV1.
It is possible to have both main effects and interactions at the same time. For example, you can have a general trend that the value of the DV increases when the value of a particular IV increases along with an interaction such that the relationship is stronger when the value of a second IV is high than when the value of that second IV is low. You can also have lower-order interactions in the presence of a higher-order interaction. Again, the lower-order interaction would represent a general trend that is modified by the higher-order interaction.
You can use linear regression to determine if there is an interaction between a pair of IVs by adding an interaction term to your statistical model. To detect the interaction effect of two IVs (X1 and X2) on a DV (Y) you would use linear regression to estimate the equation Yi = b0 + b1Xi1 + b2Xi2 + b3Xi1Xi2. You construct the variable for the interaction term Xi1Xi2 by literally multiplying the value of X1 by the value of X2 for each case in your data set. If the test of b3 is significant, then the two predictors have an interactive effect on the outcome variable.
In addition to the interaction term itself, your model must contain all of the main effects of the variables involved in the interaction as well as all of the lower-order interaction terms that can be created using those main effects. For example, if you want to test for a three-way interaction you must include the three main effects as well as all of the possible two-way interactions that can be made from those three variables. If you do not include the lower-order terms then the test on the highest-order interaction will produce incorrect results.
It is important to center the variables that are involved in an interaction before including them in your model. That is, for each independent variable, the analyst should subtract the mean of the independent variable from each participant's score on that variable. The interaction term should then be constructed from the centered variables by multiplying them together. The model itself should then be tested using the centered main effects and the constructed interaction term. Centering your independent variables will not change their relationship to the dependent variable, but it will reduce the collinearity between the main effects and the interaction term.
If the variables are not centered then none of the coefficients on terms involving IVs involved in the interaction will be interpretable except for the highest-order interaction. When the variables are centered, however, then the coefficients on the IVs can be interpreted as representing the main effect of the IV on the DV, averaging over the other variables in the interaction. The coefficients on lower-order interaction terms can similarly be interpreted as testing the average strength of that lower-order interaction, averaging over the variables that are excluded from the lower-order interaction but included in the highest-order interaction term.
You can perform a multiple regression including interaction terms in SPSS just like you would a standard multiple regression if you create your interaction terms ahead of time. However, creating these variables can be tedious when analyzing models that contain a large number of interaction terms. Luckily, if you choose to analyze your data using the General Linear Model procedure, SPSS will create these interaction terms for you (although you still need to center all of your original IVs beforehand). To analyze a regression model this way in SPSS
• Center the IVs involved in the interaction.
• Choose Analyze → General Linear Model → Univariate.
• Move your DV to the box labeled Dependent Variable.
• Move all of the main effect terms for your IVs to the box labeled Covariate(s).
• Click the Options button.
• Check the box next to Parameter estimates. By default this procedure will only provide you with tests of your IVs and not the actual parameter estimates.
• Click the Continue button.
• By default SPSS will not include interactions between continuous variables in its statistical models. However, if you build a custom model you can include whatever terms you like. You should therefore next build a model that includes all of the main effects of your IVs as well as any desired interactions. To do this
  o Click the Model button.
  o Click the radio button next to Custom.
  o Select all of your IVs, set the drop-down menu to Main effects, and click the arrow button.
  o For each interaction term, select the variables involved in the interaction, set the drop-down menu to Interaction, and click the arrow button.
  o If you want all of the possible two-way interactions between a collection of IVs you can just select the IVs, set the drop-down menu to All 2-way, and click the arrow button. This procedure can also be used to get all possible three-way, four-way, or five-way interactions between a collection of IVs by setting the drop-down menu to the appropriate interaction type.
• Click the Continue button.
• Click the OK button.
The output from this analysis will contain the same sections found in standard multiple regression. When referring to an interaction, SPSS will display the names of the variables involved in the interaction separated by asterisks (*). So the interaction between the variables RACE and GENDER would be displayed as RACE * GENDER.
So what does it mean if you obtain a significant interaction in regression? Remember that in simple linear regression, the slope coefficient (b1) indicates the expected change in Y with a one-unit change in X. In multiple regression, the slope coefficient for X1 indicates the expected change in Y with a one-unit change in X1, holding all other X values constant. Importantly, this change in Y with a one-unit change in X1 is the same no matter what value the other X variables in the model take on. However, if there is a significant interaction, the interpretation of coefficients is slightly different. In this case, the expected change in Y with a one-unit change in X1 depends on the level of the other predictor variables in the model. In the two-IV interaction model, for example, that change is b1 + b3X2, so it is different at each value of X2.
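The effect of centering on an interaction model can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of what SPSS does internally: the helpers ols, center, and simple_slope and all data values are hypothetical. It fits the two-IV interaction model with raw and with centered IVs, and shows that the interaction coefficient is unchanged while the coefficient on the centered X1 equals the slope of X1 evaluated at the mean of X2.

```python
def ols(columns, y):
    """Least-squares coefficients [b0, b1, ...] for y on the given IV columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def center(values):
    """Subtract the mean from every score."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def simple_slope(b1, b3, x2_value):
    """Expected change in Y per unit of X1 at a given value of X2."""
    return b1 + b3 * x2_value

# Hypothetical data generated from a known interaction model plus small errors
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
noise = [0.2, -0.1, 0.05, -0.15, 0.1, -0.05, 0.12, -0.17]
y = [2 + 0.5 * u + 1.0 * v + 0.3 * u * v + e for u, v, e in zip(x1, x2, noise)]

# Model with raw IVs: the interaction term is the product of X1 and X2
raw = ols([x1, x2, [u * v for u, v in zip(x1, x2)]], y)

# Model with centered IVs: center first, then multiply
c1, c2 = center(x1), center(x2)
cen = ols([c1, c2, [u * v for u, v in zip(c1, c2)]], y)

mean_x2 = sum(x2) / len(x2)
print("interaction:", raw[3], cen[3])
print("X1 slope at mean of X2:", simple_slope(raw[1], raw[3], mean_x2), cen[1])
```

Calling simple_slope with a low and a high value of X2 gives the separate X1 slopes that a significant interaction implies.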
Polynomial regression
Polynomial regression models are used when the true relationship between a continuous predictor variable and a continuous dependent variable is a polynomial function, or when the curvilinear relationship is complex or unknown but can be approximated by a polynomial function. A polynomial regression model with one predictor variable is expressed in the following way: Yi = β0 + β1Xi + β11Xi² + εi
The predictor variable (X) should be centered (discussed in the section Multiple regression with interactions), or else the X and X² terms will be highly correlated and lead to severe multicollinearity. Additionally, you lose the ability to interpret the lower-order coefficients in a straightforward manner.
In the above model, the coefficient β1 is typically called the "linear effect" coefficient and β11 is called the "quadratic effect" coefficient. If the estimate of the coefficient β11 is significantly different from zero then you have a significant quadratic effect in your data. If the highest-order term in a polynomial model is not significant, conventionally statisticians will remove that term from the model and rerun the regression.
The best way to choose the highest order polynomial is through a historical or theoretical analysis. There are certain types of relationships that are well known to be fitted by quadratic or cubic models. You might also determine that a specific type of relationship should exist because of the mechanisms responsible for the relationship between the IV and the DV. If you are building your model in an exploratory fashion, however, you can estimate how high of an order function you should use by the shape of the relationship between the DV and that IV. If your data appears to reverse p times (has p curves in the graph), you should use a function whose highest order parameter is raised to the power of p + 1.
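The reason for centering before squaring can be seen with a short Python sketch. The data and the helper name correlation are hypothetical. It compares the correlation between an IV and its square before and after centering.

```python
import math

def correlation(u, v):
    """Pearson correlation between two lists of scores."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical IV taking the values 1 through 10
x = list(range(1, 11))
mean_x = sum(x) / len(x)
x_centered = [v - mean_x for v in x]

raw_r = correlation(x, [v ** 2 for v in x])
centered_r = correlation(x_centered, [v ** 2 for v in x_centered])
print(raw_r, centered_r)
```

With these values the raw IV is almost perfectly correlated with its square, while the centered IV is uncorrelated with its square. The correlation drops all the way to zero here only because the made-up values are symmetric around their mean; with real data centering usually reduces the correlation without eliminating it.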
In multiple regression you can see whether you should add an additional term for an IV by examining a graph of the residuals against the IV. Again, if the relationship between the residuals and the IV appears to reverse p times, you should add terms whose highest order parameter is raised to the power of p + 1.
It is quite permissible to have more than one predictor variable represented in quadratic form in the same model. For instance: Yi = β0 + β1Xi1 + β2Xi2 + β11Xi1² + β22Xi2² + εi is a model with two predictor variables, both with quadratic terms.
To perform a polynomial regression in SPSS
• Determine the highest order term that you will use for each IV.
• Center any IVs for which you will examine higher-order terms.
• For each IV, create new variables that are equal to your IV raised to the powers of 2 through the power of your highest order term. Be sure to use the centered version of your IV.
• Conduct a standard multiple regression including all of the terms for each IV.
Simultaneously testing categorical and continuous IVs
Both ANOVA and regression are actually based on the same set of statistical ideas, the general linear model. SPSS implements these functions in different menu selections, but the basic way that the independent variables are tested is fundamentally the same. It is therefore perfectly reasonable to combine both continuous and categorical predictor variables in the same model, even though people are usually taught to think of ANOVA and regression as separate types of analyses. To perform an analysis in SPSS using the General Linear Model
• Choose Analyze → General Linear Model → Univariate.
• Move your DV to the box labeled Dependent Variable.
• Move any categorical IVs to the box labeled Fixed Factor(s).
• Move any continuous IVs to the box labeled Covariate(s).
• By default SPSS will include all possible interactions between your categorical IVs, but will only include the main effects of your continuous IVs. If this is not the model you want then you will need to define it by hand by taking the following steps.
  o Click the Model button.
  o Click the radio button next to Custom.
  o Add all of your main effects to the model by clicking all of the IVs in the box labeled Factors and covariates, setting the pull-down menu to Main effects, and clicking the arrow button.
  o Add each of the interaction terms to your model. You can do this one at a time by selecting the variables included in the interaction in the box labeled Factors and covariates, setting the pull-down menu to Interaction, and clicking the arrow button for each of your interactions.
  o You can also use the setting on the pull-down menu to tell SPSS to add all possible 2-way, 3-way, 4-way, or 5-way interactions that can be made between the selected variables to your model.
  o Click the Continue button.
• Click the OK button.
The SPSS output from running an analysis using the General Linear Model contains the following sections.
• Between-Subjects Factors. This table just lists out the different levels of any categorical variables included in your model.
• Tests of Between-Subjects Effects. This table provides an F test of each main effect or interaction that you included in your model. It indicates whether or not the effect can independently account for a significant amount of variability in your DV. This provides the same results as testing the change in model R² that you get from the test of the set of terms representing the effect.
Post-hoc comparisons in mixed models. You can ask SPSS to provide post-hoc contrasts comparing the different levels within any of your categorical predictor variables by clicking the Contrasts button in the variable selection window. If you want to compare the means of cells resulting from combinations of your categorical predictors, you will need to recode them all into a single variable as described in the section Post-hoc comparisons for when you have two or more factors.
The easiest way to examine the main effect of a continuous independent variable is to graph its relationship to the dependent variable using simple linear regression. You can obtain this using the following procedure:
• Choose Analyze → Regression → Curve Estimation.
• Move your dependent variable into the Dependent(s) box.
• Move your independent variable into the Independent box.
• Make sure that Plot Models is checked.
• Under the heading Models, make sure that only Linear is checked.
This will produce a graph of your data along with the least-squares regression line. If you want to look at the interaction between a categorical and a continuous independent variable, you can use the Select Cases function (described above) to limit this graph to cases that have a particular value on the categorical variable. Using this method several times, you can obtain graphs of the relationship between the continuous variable and the dependent variable separately for each level of the categorical independent variable.
Another option you might consider would be to recode the continuous variables as categorical, separating cases into groups based on their values on the continuous variables. You can then run a standard ANOVA and compare the means of the dependent variable for those high or low on the continuous variable. Even if you decide to do this, you should still base all of your conclusions on the analysis that actually treated the variable as continuous. Numerous simulations have shown that there is greater power and less error in analyses that treat truly continuous variables as continuous compared to those that analyze them in a categorical fashion.
MEDIATION
When researchers find a relationship between an independent variable (A) and a dependent variable (C), they may seek to uncover variables that mediate this relationship. That is, they may believe that the effect of variable A on variable C exists because variable A leads to a change in a mediating variable (M), which in turn affects the dependent variable (C). When a variable fully mediates a relationship, the effect of variable A on variable C disappears when controlling for the mediating variable. A variable partially mediates a relationship when the effect of variable A on variable C is significantly reduced when controlling for the mediator. A common way of expressing these patterns is the following:
[Diagram: Independent Variable (A) → Mediating Variable (M) → Dependent Variable (C)]
You need to conduct three different regression analyses to determine if you have a mediated relationship using the traditional method.
Regression 1. Predict the dependent variable (C) from the independent variable (A). The effect of the independent variable in this model must be significant. If there is no direct effect of A on C, then there is no relationship to mediate.
Regression 2. Predict the mediating variable (M) from the independent variable (A). The effect of the independent variable in this model must be significant. If the independent variable does not reliably affect the mediator, the mediator cannot be responsible for the relationship observed between A and C.
Regression 3. Simultaneously predict the value of the dependent variable (C) from both the independent variable (A) and the mediating variable (M) using multiple regression. The effect of the independent variable should be nonsignificant (or at least significantly reduced, compared to Regression 1), whereas the effect of the mediating variable must be significant. The reduction in the relationship between A and C indicates that the mediator is accounting for a significant portion of this relationship.
However, if the relationship between M and C is not significant, then you cannot clearly determine whether M mediates the relationship between A and C, or if A mediates the relationship between M and C.
One can directly test for a reduction in the effect of A → C when controlling for the mediator by performing a Sobel Test. This involves testing the significance of the path between A and C through M in Regression 3. While you cannot do a Sobel Test in SPSS, the website http://www.unc.edu/~preacher/sobel/sobel.htm will perform this for you online. If you wish to show mediation in a journal article, you will almost always be required to show the results of the Sobel Test.
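If you would rather compute the test yourself, the commonly cited form of the Sobel statistic can be sketched in Python as shown below. This assumes the first-order version of the test, in which a is the coefficient for A from Regression 2, b is the coefficient for M from Regression 3, and se_a and se_b are their standard errors; the function names and the input values are hypothetical, and this sketch is not claimed to match the online calculator exactly.

```python
import math

def sobel_z(a, se_a, b, se_b):
    """First-order Sobel z statistic for the indirect path A -> M -> C."""
    return (a * b) / math.sqrt(b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2)

def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical coefficients and standard errors taken from Regressions 2 and 3
a, se_a = 0.5, 0.1    # A predicting M (Regression 2)
b, se_b = 0.4, 0.1    # M predicting C, controlling for A (Regression 3)

z = sobel_z(a, se_a, b, se_b)
p = two_sided_p(z)
print(round(z, 3), round(p, 4))
```

The coefficients and standard errors come from the B and Std. Error columns of the Coefficients tables for the two regressions.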
CHI-SQUARE TEST OF INDEPENDENCE
A chi-square is a nonparametric test used to determine if there is a relationship between two categorical variables. Let's take a simple example. Suppose a researcher brought male and female participants into the lab and asked them which color they prefer, blue or green. The researcher believes that color preference may be related to gender. Notice that both gender (male, female) and color preference (blue, green) are categorical variables. If there is a relationship between gender and color preference, we would expect that the proportion of men who prefer blue would be different than the proportion of women who prefer blue. In general, you have a relationship between two categorical variables when the distribution of people across the categories of the first variable changes across the different categories of the second variable.
To determine if a relationship exists between gender and color preference, the chi-square test computes the distributions across the combination of your two factors that you would expect if there were no relationship between them. It then compares this to the actual distribution found in your data. In the example above, we have a 2 (gender: male, female) X 2 (color preference: green, blue) design. For each cell in the combination of the two factors, we would compute "observed" and "expected" counts. The observed counts are simply the actual number of observations found in each of the cells. The expected proportion in each cell can be determined by multiplying the marginal proportions found in a table. For example, let us say that 52% of all the participants preferred blue and 48% preferred green, whereas 40% of all of the participants were men and 60% were women. The expected proportions are presented in the table below.
Expected proportion table
                      Males    Females   Marginal proportion
Blue                  20.8%    31.2%     52%
Green                 19.2%    28.8%     48%
Marginal proportion   40%      60%
As you can see, you get the expected proportion for a particular cell by multiplying the two marginal proportions together. You would then determine the expected count for each cell by multiplying the expected proportion by the total number of participants in your study. The chi-square statistic is a function of the difference between the expected and observed counts across all your cells. Luckily you do not actually need to calculate any of this by hand, since SPSS will compute the expected counts for each cell and perform the chi-square test.
To perform a chi-square test of independence in SPSS
• Choose Analyze → Descriptive Statistics → Crosstabs.
• Put one of the variables in the Row(s) box.
• Put the other variable in the Column(s) box.
• Click the Statistics button.
• Check the box next to Chi-square.
• Click the Continue button.
• Click the OK button.
The output of this analysis will contain the following sections.
• Case Processing Summary. Provides information about missing values in your two variables.
• Crosstabulation. Provides you with the observed counts within each combination of your two variables.
• Chi-Square Tests. The first row of this table will give you the chi-square value, its degrees of freedom, and the p-value associated with the test.
Note that the p-values produced by a chi-square test are inappropriate if the expected count is less than 5 in 20% of the cells or more. If you are in this situation, you should either redefine your coding scheme (combining the categories with low cell counts with other categories) or exclude categories with low cell counts from your analysis.
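The calculation that SPSS performs can be sketched in a few lines of Python. This is only an illustration: the observed counts below are hypothetical (100 participants, chosen so that the marginals match the 52%/48% and 40%/60% figures used in the example), and the function names are invented for this sketch rather than taken from SPSS.

```python
# Hypothetical observed counts for a 2 x 2 design (rows: blue, green;
# columns: males, females). N = 100, so the marginals match the example:
# 52% blue, 48% green, 40% men, 60% women.
observed = [
    [25, 27],  # blue
    [15, 33],  # green
]


def expected_counts(table):
    """Expected count per cell = row total * column total / N."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Pearson chi-square: sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def share_of_small_cells(table, cutoff=5):
    """Proportion of cells whose expected count falls below the cutoff."""
    cells = [e for row in expected_counts(table) for e in row]
    return sum(e < cutoff for e in cells) / len(cells)


expected = expected_counts(observed)
statistic = chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected counts:", expected)
print("chi-square:", round(statistic, 3), "df:", df)
print("share of cells with expected count < 5:", share_of_small_cells(observed))
```

With 100 participants the expected counts equal the expected percentages in the table. The last function mirrors the rule of thumb above: if it returns .20 or more, the p-value should not be trusted.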
LOGISTIC REGRESSION
The chi-square test allows us to determine if a pair of categorical variables are related. But what if you want to test a model using two or more independent variables? Most of the inferential procedures we have discussed so far require that the dependent variable be a continuous variable. The most common inferential statistics, such as t-tests, regression, and ANOVA, require that the residuals have a normal distribution and that the variance is equal across conditions. Both of these assumptions are likely to be seriously violated if the dependent variable is categorical. The answer is to use logistic regression, which does not make these assumptions and so can be used to determine the ability of a set of continuous or categorical independent variables to predict the value of a categorical dependent variable. However, standard logistic regression assumes that all of your observations are independent, so it cannot be directly used to test within-subject factors.
Logistic regression generates equations that tell you exactly how changes in your independent variables affect the probability that the observation is in a level of your dependent variable. These equations are based on predicting the odds that a particular observation is in one of two groups. Let us say that you have two groups: a reference group and a comparison group. The odds that an observation is in the reference group are equal to the probability that the observation is in the reference group divided by the probability that it is in the comparison group. So, if there is a 75% chance that the observation is in the reference group, the odds of it being in the reference group would be .75/.25 = 3. We therefore talk about odds in the same way that people do when betting at a racetrack. In logistic regression, we build an equation that predicts the logarithm of the odds from the values of the independent variables (which is why it's called log-istic regression).
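The conversion between probabilities, odds, and log odds can be sketched in Python as follows. The function names are invented for this illustration, and the only value taken from the text is the 75% example.

```python
import math


def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)


def log_odds(p):
    """Natural logarithm of the odds (often called the logit)."""
    return math.log(odds(p))


def probability_from_log_odds(z):
    """Invert the log odds: p = 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))


p_reference = 0.75
print(odds(p_reference))        # 3.0, as in the example
print(log_odds(p_reference))    # ln(3), roughly 1.10
print(probability_from_log_odds(log_odds(p_reference)))
```

A probability of .5 corresponds to odds of 1 and log odds of 0, so positive log odds favor the reference group and negative log odds favor the comparison group.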
For each independent variable in our model, we want to calculate a coefficient B that tells us what the change in the log odds would be if we increased the value of the variable by 1. These coefficients therefore parallel those found in a standard regression model. However, they are somewhat difficult to interpret because they relate the independent variables to the log odds. To make interpretation easier, people often transform the coefficients into odds ratios by raising the mathematical constant e to the power of the coefficient (e^B). The odds ratio directly tells you how the odds change when you change the value of the independent variable. Specifically, the odds of being in the reference group are multiplied by the odds ratio when the independent variable increases by 1. For example, a coefficient of B = .693 corresponds to an odds ratio of about 2, meaning that the odds double with each one-unit increase in the independent variable.
One obvious limitation of this procedure is that we can only compare two groups at a time. If we want to examine a dependent variable with three or more levels, we must actually create several different logistic regression equations. If your dependent variable has k levels, you will need a total of k - 1 logistic regression equations. What people typically do is designate a specific level of the dependent variable as the reference group, and then generate a set of equations that each compare one other level of the dependent variable to that group. You must then examine the behavior of your independent variables in each of your equations to determine what their influence is on your dependent variable.
To test the overall success of your model, you can determine the probability that you can predict the category of the dependent variable from the values of your independent variables. The
higher this probability is, the stronger the relationship is between the independent variables and your dependent variable. You can determine this probability iteratively using maximum likelihood estimation. If you multiply the logarithm of this probability by -2, you will obtain a statistic that has an approximate chi-square distribution, with degrees of freedom equal to the number of parameters in your model. This is referred to as -2LL (minus 2 log likelihood) and is commonly used to assess the fit of the model. Large values of -2LL indicate that the observed model has poor fit. This statistic can also be used to provide a statistical test of the relationship between each independent variable and your dependent variable. The importance of each term in the model can be assessed by examining the increase in -2LL when the term is dropped. This difference also has a chi-square distribution, and can be used as a statistical test of whether there is an independent relationship between each term and the dependent variable.
To perform a logistic regression in SPSS
• Choose Analyze → Regression → Multinomial Logistic.
• Move the categorical DV to the Dependent box.
• Move your categorical IVs to the Factor(s) box.
• Move your continuous independent variables to the Covariate(s) box.
• By default, SPSS does not include any interaction terms in your model. You will need to click the Model button and manually build your model if you want to include any interactions.
• When you are finished, click the OK button to tell SPSS to perform the analysis.
If your dependent variable only has two groups, you have the option of selecting Analyze → Regression → Binary Logistic. Though this performs the same basic analysis, this procedure is primarily designed to perform model building. It organizes the output in a less straightforward way and does not provide you with the likelihood ratio test for each of your predictors. You are therefore better off using this selection only if you are specifically interested in the model-building procedures that it offers.
NOTE: A binary logistic analysis in SPSS will actually produce coefficients that are opposite in sign when compared to the results of a multinomial logistic regression performed on exactly the same data. This is because the binary procedure chooses to predict the probability of choosing the category with the largest indicator variable, while the multinomial procedure chooses to predict the probability of choosing the category with the smallest indicator variable.
The Multinomial Logistic procedure will produce output with the following sections.
• Case Processing Summary. Describes the levels of the dependent variable and any categorical independent variables.
• Model Fitting Information. Tells you the -2LL of both a null model containing only the intercept and the full model being tested. The difference between these two values follows a chi-square distribution, and a significant value indicates that the full model accounts for significantly more of the variability in your DV than the intercept-only model does.
• Pseudo R-Square. Provides a number of statistics that researchers have developed to represent the ability of a logistic regression model to account for variability in the dependent variable. Logistic regression does not have a true R-square statistic because
the amount of variance is partly determined by the distribution of the dependent variable. The more evenly the observations are distributed among the levels of the dependent variable, the greater the variance in the observations. This means that the R-square values for models that have different distributions are not directly comparable. However, these statistics can be useful for comparing the fit of different models predicting the same response variable. The most commonly reported pseudo R-square estimate is Nagelkerke's R-square, which is provided by SPSS in this section.
• Likelihood Ratio Tests. Provides the likelihood ratio tests for the IVs. The first column of the table contains the -2LL (a measurement of model error having a chi-square distribution) of a model that does not include the factor listed in the row. The value in the first row (labeled Intercept) is actually the -2LL for the full model. The second column is the difference between the -2LL for the full model and the -2LL for the model that excludes the factor listed in the row. This is a measure of the amount of variability that is accounted for by the factor. This difference parallels the Type III SS in a regression model, and follows a chi-square distribution with degrees of freedom equal to the number of parameters it takes to code the factor. The final column provides the p-value for the test of the null hypothesis that the amount of error in the model that excludes the factor is the same as the amount of error in the full model. A significant statistic indicates that the factor does account for a significant amount of the variability in the dependent variable that is not captured by other variables in the model.
• Parameter Estimates. Provides the specific coefficients of the logistic regression equations. You will have a number of equations equal to the number of levels in your dependent variable minus 1. Each equation predicts the log odds of your observations being in the highest numbered level of your dependent variable compared to another level (which is listed in the leftmost column of the chart). Within each equation, you will see estimates of the logistic regression coefficient (B) for each variable in the model. These coefficients tell you the increase in the log odds when the variable increases by 1 (assuming everything else is held constant). The next column contains the standard errors of those coefficients. The Wald statistic provides another test of the significance of the individual coefficients, and is based on the relationship between the coefficient and its standard error. However, there is a flaw in this statistic such that large coefficients may have inappropriately large standard errors, so researchers typically prefer to use the likelihood ratio test to determine the importance of individual factors in the model. SPSS provides the odds ratio for the parameter under the column Exp(B). The last two columns in the table provide the lower and upper bounds for a 95% confidence interval around the odds ratio.
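The quantities in the Parameter Estimates table are all simple functions of the coefficient and its standard error. The Python sketch below shows one way to reproduce them, assuming the usual Wald-type calculations (Wald = (B/SE)^2 with 1 degree of freedom, and a confidence interval of e^(B ± 1.96 SE)). The coefficient and standard error are hypothetical values, and the function name is invented for this illustration.

```python
import math


def summarize_coefficient(b, se, z_crit=1.96):
    """Odds ratio, 95% CI for the odds ratio, Wald statistic, and its p-value."""
    wald = (b / se) ** 2                       # Wald chi-square with 1 df
    p_value = math.erfc(math.sqrt(wald / 2))   # upper tail of chi-square(1)
    return {
        "odds_ratio": math.exp(b),               # the Exp(B) column
        "ci_lower": math.exp(b - z_crit * se),   # lower bound for Exp(B)
        "ci_upper": math.exp(b + z_crit * se),   # upper bound for Exp(B)
        "wald": wald,
        "p_value": p_value,
    }


# Hypothetical values standing in for one row of the Parameter Estimates table.
result = summarize_coefficient(b=0.8, se=0.3)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Because the interval is built on the log odds scale and then exponentiated, it is not symmetric around the odds ratio. An interval that excludes 1 corresponds to a coefficient that differs significantly from 0.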
RELIABILITY
Ideally, the measurements that we take with a scale would always replicate perfectly. However, in the real world there are a number of external random factors that can affect the way that respondents provide answers to a scale. A particular measurement taken with the scale is therefore composed of two factors: the theoretical "true score" of the scale and the variation caused by random factors. Reliability is a measure of how much of the variability in the observed scores actually represents variability in the underlying true score. Reliability ranges from 0 to 1. In psychology it is preferred to have scales with reliability greater than .7.
The reliability of a scale is heavily dependent on the number of items composing the scale. Even using items with poor internal consistency, you can get a reliable scale if your scale is long enough. For example, 10 items that have an average inter-item correlation of only .2 will produce a scale with a reliability of .714. However, the benefit of adding additional items decreases as the scale grows larger, and mostly disappears after 20 items. One consequence of this is that adding extra items to a scale will generally increase the scale's reliability, even if the new items are not particularly good. An item will have to significantly lower the average inter-item correlation for it to have a negative impact on reliability.
Reliability has specific implications for the utility of your scale. The most that responses to your scale can correlate with any other variable is equal to the square root of the scale's reliability. The variability in your measure will prevent anything higher. Therefore, the higher the reliability of your scale, the easier it is to obtain significant findings. This is probably what you should think about when you want to determine if your scale has a high enough reliability. It should also be noted that low reliability does not call into question results obtained using a scale.
Low reliability only hurts your chances of finding significant results. It cannot cause you to obtain false significance. If anything, finding significant results with an unreliable scale indicates that you have discovered a particularly strong effect, since it was able to overcome the hindrances of your unreliable scale. In this way, using a scale with low reliability is analogous to conducting an experiment with a small number of participants.
Calculating reliability from parallel measurements
One way to calculate reliability is to correlate the scores on parallel measurements of the scale. Two measurements are defined as parallel if they are distinct (are based on different data) but equivalent (such that you expect responses to the two measurements to have the same true score). The two measurements must be performed on the same (or matched) respondents so that the correlation can be performed. There are a number of different ways to measure reliability using parallel measurements. Below are several examples.
Test-Retest method. In this method, you have respondents complete the scale at two different points in time. The reliability of the scale can then be estimated by the correlation between the two scores. The accuracy of this method rests on the assumption that the participants are fundamentally the same (i.e., possess the same true score on your scale) during your two test periods. One common problem is that completing the scale the first time can change the way that respondents complete the scale the second time. If they remember any of their specific responses
from the first period, for example, it could artificially inflate the reliability estimate. When using this method, you should present evidence that this is not an issue.
Alternate Forms method. This method, also referred to as parallel forms, is basically the same as the Test-Retest method, but with the use of different versions of the scale during each session. The use of different versions reduces the likelihood that the first administration of the scale influences responses to the second. The reliability of the scale can then be estimated by the correlation between the two scores. When using alternate forms, you should show that the administration of the first scale did not affect responses to the second and that the two versions of your scale are essentially the same. The use of this method is generally preferred to the Test-Retest method.
Split-Halves method. One difficulty with both the Test-Retest and the Alternate Forms methods is that the scale responses must be collected at two different points in time. This requires more work and introduces the possibility that some natural event might change the actual true score between the two administrations of the scale. In the Split-Halves method you only have respondents fill out your scale one time. You then divide your scale items into two sections (such as the even-numbered items and the odd-numbered items) and calculate a score for each half. You then determine the correlation between these two scores. Unlike the other methods, this correlation does not estimate your scale's reliability. Instead, you get your estimate using the formula:
$$\hat{\rho} = \frac{2r}{1 + r}$$
where $\hat{\rho}$ is the reliability estimate and r is the correlation that you obtain. Note that if you split your scale in different ways, you will obtain different reliability estimates. Assuming that there are no confounding variables, all split-half estimates should be centered on the true reliability. In general it is best not to use a first half/second half split of the questionnaire since respondents may become tired as they work through the scale. This would mean that you would expect greater variability in the score from the second half than in the score from the first half. In this case, your two measurements are not actually parallel, making your reliability estimate invalid. A more acceptable method would be to divide your scale into sections of odd-numbered and even-numbered items.
Calculating reliability from internal consistency
The other way to calculate reliability is to use a measure of internal consistency. The most popular of these reliability estimates is Cronbach's alpha. Cronbach's alpha can be obtained using the equation:
$$\alpha = \frac{N r}{1 + (N - 1) r},$$
where α is Cronbach's alpha, N is the number of items in the scale, and r is the mean inter-item correlation. From the equation we can see that α increases both with increasing r and with increasing N. Calculating Cronbach's alpha is the most commonly used procedure to estimate reliability. It is highly accurate and has the advantage of only requiring a single administration of the scale. The only real disadvantage is that it is difficult to calculate by hand, as it requires you to calculate the correlation between every single pair of items in your scale. This is rarely an issue, however, since SPSS will calculate it for you automatically.
To obtain the α of a set of items in SPSS:
• Choose Analyze → Scale → Reliability Analysis.
• Move all of the items in the scale to the Items box.
• Click the Statistics button.
• Check the box next to Scale if item deleted.
• Click the Continue button.
• Click the OK button.
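The reliability formulas given in this section are simple enough to check by hand. The Python sketch below implements the alpha formula based on the mean inter-item correlation, the split-half correction, and the square-root bound on correlations mentioned earlier. The function names are invented for this illustration, and the alpha SPSS reports from raw item data may differ slightly from the value this correlation-based formula gives.

```python
import math


def standardized_alpha(n_items, mean_r):
    """Alpha from the number of items and the mean inter-item correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


def spearman_brown(r_half):
    """Split-half reliability estimate: 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)


def max_correlation(reliability):
    """Largest correlation the scale can have with any other variable."""
    return math.sqrt(reliability)


alpha_10 = standardized_alpha(10, 0.2)
print(round(alpha_10, 3))   # 0.714, the example given earlier
print(round(max_correlation(alpha_10), 3))

# Diminishing returns from adding items with the same mean correlation.
for n in (5, 10, 20, 40):
    print(n, round(standardized_alpha(n, 0.2), 3))
```

Note that the split-half correction is the same formula with N = 2: each half is treated as one "item," and the correlation between the halves plays the role of the mean inter-item correlation.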
Note: Before performing this analysis, make sure all items are coded in the same direction. That is, for every item, larger values should consistently indicate either more of the construct or less of the construct.
The output from this analysis will include a single section titled Reliability. The reliability of your scale will actually appear at the bottom of the output next to the word Alpha. The top of this section contains information about the consistency of each item with the scale as a whole. You use this to determine whether there are any "bad items" in your scale (i.e., ones that are not representing the construct you are trying to measure). The column labeled Corrected Item-Total Correlation tells you the correlation between each item and the average of the other items in your scale. The column labeled Alpha if Item Deleted tells you what the reliability of your scale would be if you deleted the given item. You will generally want to remove any items where the reliability of the scale would increase if they were deleted, and keep any items where the reliability of the scale would drop if they were deleted. If any of your items have a negative item-total correlation, it may mean that you forgot to reverse code the item.
Inter-rater reliability
A final type of reliability that is commonly assessed in psychological research is called "inter-rater reliability." Inter-rater reliability is used when judges are asked to code some stimuli, and the analyst wants to know how much those judges agree. If the judges are making continuous ratings, the analyst can simply calculate a correlation between the judges' responses. More commonly, judges are asked to make categorical decisions about stimuli. In this case, reliability is assessed via Cohen's kappa. To obtain Cohen's kappa in SPSS, you first must set up your data file in the appropriate manner. The codes from each judge should be represented as separate variables in the data set.
For example, suppose a researcher asked participants to list their thoughts about a persuasive message. Each judge was given a spreadsheet with one thought per row. The two judges were then asked to code each thought as: 1 = neutral response to the message, 2 = positive response to the message, 3 = negative response to the message, or 4 = irrelevant thought. Once both judges
have rendered their codes, the analyst should create an SPSS data file with two columns, one for each judge's codes.
To obtain Cohen's kappa in SPSS
• Choose Analyze → Descriptive Statistics → Crosstabs.
• Place Judge A's responses in the Row(s) box.
• Place Judge B's responses in the Column(s) box.
• Click the Statistics button.
• Check the box next to Kappa.
• Click the Continue button.
• Click the OK button.
The output from this analysis will contain the following sections.
• Case Processing Summary. Reports the number of observations on which you have ratings from both of your judges.
• Crosstabulation. This table lists all the reported values from each judge and the number of times each combination of codes was rendered. For example, assuming that each judge used all the codes in the thought-listing example (i.e., code values 1 to 4), the output would contain a cross-tabulation table like this:
Judge A * Judge B Crosstabulation
Count
                        Judge B
                 1.00   2.00   3.00   4.00   Total
Judge A   1.00      5      1      0      0       6
          2.00      0      5      1      0       6
          3.00      0      1      7      0       8
          4.00      0      0      0      7       7
Total               5      7      8      7      27
The counts on the diagonal represent agreements. That is, these counts represent the number of times both Judges A and B coded a thought with a 1, 2, 3, or 4. The more agreements, the better the inter-rater reliability. Values not on the diagonal represent disagreements. In this example, we can see that there was one occasion when Judge A coded a thought in category 1 but Judge B coded that same thought in category 2.
• Symmetric Measures. The value of kappa can be found in this section at the intersection of the Kappa row and the Value column. This section also reports a p-value for the Kappa, but this is not typically used in reliability analysis.
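To see where the kappa value comes from, the following Python sketch computes it directly from the counts in the example crosstabulation. The cell positions are inferred from the row and column totals of the table, with blank cells treated as zeros. The function name is invented for this illustration, and the formula is the standard definition of Cohen's kappa rather than anything specific to SPSS.

```python
# Counts from the Judge A * Judge B crosstabulation (rows: Judge A codes 1-4,
# columns: Judge B codes 1-4). Blank cells in the table are entered as zeros.
crosstab = [
    [5, 1, 0, 0],
    [0, 5, 1, 0],
    [0, 1, 7, 0],
    [0, 0, 0, 7],
]


def cohens_kappa(table):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: proportion of cases on the diagonal.
    observed = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement: what the marginal totals alone would produce.
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - chance) / (1 - chance)


kappa = cohens_kappa(crosstab)
print(round(kappa, 3))
```

Here the judges agree on 24 of the 27 thoughts. Kappa is lower than that raw agreement rate because it subtracts the agreement that would be expected by chance given how often each judge used each code.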
Note that a kappa cannot be computed on a non-symmetric table. For instance, if Judge A had used codes 1 to 4, but Judge B never used code 1 at all, the table would not be symmetric. This is because there would be 4 rows for Judge A but only 3 columns for Judge B. Should you have this situation, you should first determine which values are not used by both judges. You then change each instance of these codes to some other value that is not the value chosen by the opposite judge. Since the original code was a mismatch, you can preserve the original amount of agreement by simply changing the value to a different mismatch. This way you can remove the
unbalanced code from your scheme while retaining the information from every observation. You can then use the kappa obtained from this revised data set as an accurate measure of the reliability of the original codes.
FACTOR ANALYSIS
Factor analysis is a collection of methods used to examine how underlying constructs influence the responses on a number of measured variables. There are basically two types of factor analysis: exploratory and confirmatory. Exploratory factor analysis (EFA) attempts to discover the nature of the constructs influencing a set of responses. Confirmatory factor analysis (CFA) tests whether a specified set of constructs is influencing responses in a predicted way. SPSS only has the capability to perform EFA. CFAs require a program with the ability to perform structural equation modeling, such as LISREL or AMOS.
The primary objectives of an EFA are to determine the number of factors influencing a set of measures and the strength of the relationship between each factor and each observed measure. To perform an EFA, you first identify a set of variables that you want to analyze. SPSS will then examine the correlation matrix between those variables to identify those that tend to vary together. Each of these groups will be associated with a factor (although it is possible that a single variable could be part of several groups and several factors). You will also receive a set of factor loadings, which tell you how strongly each variable is related to each factor. They also allow you to calculate factor scores for each participant by multiplying the response on each variable by the corresponding factor loading. Once you identify the construct underlying a factor, you can use the factor scores to tell you how much of that construct is possessed by each participant.
Some common uses of EFA are to:
• Identify the nature of the constructs underlying responses in a specific content area.
• Determine what sets of items "hang together" in a questionnaire.
• Demonstrate the dimensionality of a measurement scale. Researchers often wish to develop scales that respond to a single characteristic.
• Determine what features are most important when classifying a group of items.
• Generate "factor scores" representing values of the underlying constructs for use in other analyses.
• Create a set of uncorrelated factor scores from a set of highly collinear predictor variables.
• Use a small set of factor scores to represent the variability contained in a larger set of variables. This is often referred to as data reduction.
It is important to note that EFA does not produce any statistical tests. It therefore cannot ever provide concrete evidence that a particular structure exists in your data; it can only direct you to what patterns there may be. If you want to actually test whether a particular structure exists in your data you should use CFA, which does allow you to test whether your proposed structure is able to account for a significant amount of variability in your items.
EFA is strongly related to another procedure called principal components analysis (PCA). The two have basically the same purpose: to identify a set of underlying constructs that can account for the variability in a set of variables. However, PCA is based on a different statistical model, and produces slightly different results when compared to EFA. EFA tends to produce better results when you want to identify a set of latent factors that underlie the responses on a set of
measures, whereas PCA works better when you want to perform data reduction. Although SPSS says that it performs "factor analysis," statistically it actually performs PCA. The differences are slight enough that you will generally not need to be concerned about them: you can use the results from a PCA for all of the same things that you would the results of an EFA. However, if you want to identify latent constructs, you should be aware that you might be able to get slightly better results if you used a statistical package that can actually perform EFA, such as SAS, AMOS, or LISREL.
Factor analyses require a substantial number of subjects to generate reliable results. As a general rule, the minimum sample size should be the larger of 100 or 5 times the number of items in your factor analysis. Though you can still conduct a factor analysis with fewer subjects, the results will not be very stable.
To perform an EFA in SPSS
• Choose Analyze → Data Reduction → Factor.
• Move the variables you want to include in your factor analysis to the Variables box.
• If you want to restrict the factor analysis to those cases that have a particular value on a variable, you can put that variable in the Selection Variable box and then click Value to tell SPSS which value you want the included cases to have.
• Click the Extraction button to indicate how many factors you want to extract from your items. The maximum number of factors you can extract is equal to the number of items in your analysis, although you will typically want to examine a much smaller number. There are several different ways to choose how many factors to examine. First, you may want to look for a specific number of factors for theoretical reasons. Second, you can choose to keep factors that have eigenvalues over 1. A factor with an eigenvalue of 1 is able to account for the amount of variability present in a single item, so factors that account for less variability than this will likely not be very meaningful. A final method is to create a Scree Plot, where you graph the amount of variability that each of the factors is able to account for in descending order. You then use all the factors that occur prior to the last major drop in the amount of variance accounted for. If you wish to use this method, you should run the factor analysis twice - once to generate the Scree plot, and a second time where you specify exactly how many factors you want to examine.
• Click the Rotation button to select a rotation method. Though you do not need to rotate your solution, using a rotation typically provides you with more interpretable factors by locating solutions with more extreme factor loadings. There are two broad classes of rotations: orthogonal and oblique. If you choose an orthogonal rotation, then your resulting factors will all be uncorrelated with each other. If you choose an oblique rotation, you allow your factors to be correlated. Which you should choose depends on your purpose for performing the factor analysis, as well as your beliefs about the constructs that underlie responses to your items. If you think that the underlying constructs are independent, or if you are specifically trying to get a set of uncorrelated factor scores, then you should clearly choose an orthogonal rotation. If you think that the underlying constructs may be correlated, then you should choose an oblique rotation. Varimax is the most popular orthogonal rotation, whereas Direct Oblimin is the most popular oblique rotation. If you decide to perform a rotation on your solution, you usually ignore the parts of the output that deal with the initial (unrotated) solution since
the rotated solution will generally provide more interpretable results. If you want to use direct oblimin rotation, you will also need to specify the parameter delta. This parameter influences the extent that your final factors will be correlated. Negative values lead to lower correlations whereas positive values lead to higher correlations. You should not choose a value over .8 or else the high correlations will make it very difficult to differentiate the factors.
• If you want SPSS to save the factor scores as variables in your data set, then you can click the Scores button and check the box next to Save as variables.
• Click the OK button when you are ready for SPSS to perform the analysis.
The output from a factor analysis will vary depending on the type of rotation you chose. Both orthogonal and oblique rotations will contain the following sections.
• Communalities. The communality of a given item is the proportion of its variance that can be accounted for by your factors. In the first column you'll see that the communality for the initial extraction is always 1. This is because the full set of factors is specifically designed to account for the variability in the full set of items. The second column provides the communalities of the final set of factors that you decided to extract.
• Total Variance Explained. Provides you with the eigenvalues and the amount of variance explained by each factor in both the initial and the rotated solutions. If you requested a Scree plot, this information will be presented in a graph following the table.
• Component Matrix. Presents the factor loadings for the initial solution. Factor loadings can be interpreted as standardized regression coefficients, regressing each measure on the factors. Factor loadings less than .3 are considered weak, loadings between .3 and .6 are considered moderate, and loadings greater than .6 are considered large.
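The relationship between loadings, communalities, and eigenvalues can be sketched in Python. The loadings below are made up purely for illustration, and the function names are invented for this sketch. The calculations assume an unrotated principal components solution with uncorrelated components, where an item's communality is the sum of its squared loadings and a component's eigenvalue is the sum of the squared loadings in its column.

```python
# Hypothetical unrotated loadings for six items on two components
# (rows: items, columns: components). These numbers are invented.
loadings = [
    [0.78, 0.21],
    [0.72, 0.30],
    [0.65, -0.12],
    [0.25, 0.70],
    [0.18, 0.66],
    [0.40, 0.45],
]


def communalities(matrix):
    """Communality of an item = sum of its squared loadings across components."""
    return [sum(value ** 2 for value in row) for row in matrix]


def eigenvalues(matrix):
    """Variance accounted for by a component = sum of squared loadings in its column."""
    return [sum(value ** 2 for value in column) for column in zip(*matrix)]


def loading_strength(value):
    """Label a loading using the cutoffs given above (.3 and .6)."""
    size = abs(value)
    if size < 0.3:
        return "weak"
    if size <= 0.6:
        return "moderate"
    return "large"


for item, (row, h2) in enumerate(zip(loadings, communalities(loadings)), start=1):
    labels = [loading_strength(value) for value in row]
    print(f"item {item}: communality = {h2:.3f}, loadings = {labels}")

for number, eigenvalue in enumerate(eigenvalues(loadings), start=1):
    keep = eigenvalue > 1  # the "eigenvalues over 1" rule
    print(f"component {number}: eigenvalue = {eigenvalue:.3f}, keep = {keep}")
```

The sign of a loading is ignored when judging its strength, since a large negative loading indicates a relationship just as strong as a large positive one.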
Factor analyses using an orthogonal rotation will include the following sections.
• Rotated Component Matrix. Provides the factor loadings for the orthogonal rotation. The rotated factor loadings can be interpreted in the same way as the unrotated factor loadings.
• Component Transformation Matrix. Provides the correlations between the factors in the original and in the rotated solutions.
Factor analyses using an oblique rotation will include the following sections.
• Pattern Matrix. Provides the factor loadings for the oblique rotation. The rotated factor loadings can be interpreted in the same way as the unrotated factor loadings.
• Structure Matrix. Holds the correlations between the factors and each of the items. This will not look the same as the pattern matrix because the factors themselves can be correlated. This means that an item can have a factor loading of zero for one factor but still be correlated with that factor, simply because it loads on other factors that are correlated with the first factor.
• Component Correlation Matrix. Provides you with the correlations among your rotated factors.
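The link between the pattern and structure matrices can be made concrete with a small worked example. The symbols here are introduced only for illustration: P is the pattern matrix, Φ is the component correlation matrix, and S is the structure matrix. The numbers are made up. Suppose two factors correlate at .5, and an item has pattern loadings of 0 on the first factor and .8 on the second.

```latex
S = P\Phi, \qquad
\begin{pmatrix} 0 & .8 \end{pmatrix}
\begin{pmatrix} 1 & .5 \\ .5 & 1 \end{pmatrix}
=
\begin{pmatrix} 0(1) + .8(.5) & 0(.5) + .8(1) \end{pmatrix}
=
\begin{pmatrix} .4 & .8 \end{pmatrix}
```

Under these assumed values the item has no direct loading on the first factor, yet it correlates .4 with that factor through its loading on the correlated second factor.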
After you obtain the factor loadings, you will want to come up with a theoretical interpretation of each of your factors. You define a factor by considering the possible constructs that could be responsible for the observed pattern of positive and negative loadings. You should examine the
items that have the largest loadings and consider what they have in common. To ease interpretation, you have the option of multiplying all of the loadings for a given factor by -1. This essentially reverses the scale of the factor, allowing you, for example, to turn an "unfriendliness" factor into a "friendliness" factor.
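If you also saved factor scores, the same reversal can be applied to the saved variable so that the scores agree with the reinterpreted loadings. The following is a minimal sketch. It assumes that the score was saved under the name FAC1_1, which is the name SPSS typically gives the first saved factor score, so check the Data Editor for the actual name. The name friendliness is a hypothetical choice for the new variable.

```spss
* Reverse the scale of a saved factor score (FAC1_1 is an assumed name).
compute friendliness = -1 * FAC1_1.
execute.
```

In an oblique solution, reversing a factor also reverses the sign of its correlations with the other factors.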
VECTORS AND LOOPS

Vectors and loops are two tools drawn from computer programming that can be very useful when manipulating data. Their primary use is to perform a large number of similar computations using a relatively small program. Some of the more complicated types of data manipulation can only reasonably be done using vectors and loops.

A vector is a set of variables that are linked together because they represent similar things. The purpose of the vector is to provide a single name that can be used to access any member of the entire set of variables. A loop is used to tell the computer to perform a set of procedures a specified number of times. Often we need to perform the same transformation on a large number of variables. By using a loop, we only need to define the transformation once, and can then tell the computer to do the same thing to all the variables.

If you have computer-programming experience then you have likely come across these ideas before. However, what SPSS calls a "vector" is typically referred to as an "array" in most programming languages. If you are familiar with arrays and loops from a computer-programming course, you are a step ahead. Vectors and loops are used in data manipulation in more or less the same way that arrays and loops are used in standard computer programming.

Vectors

Vectors can only be defined and used in syntax. Before you can use a vector you first need to define it. You must specify the name of the vector and list what variables are associated with it. Variables referenced by a vector are called "elements" of that vector. You declare a vector using the following syntax.

vector Vname = varX1 to varX2.

If the variables in the vector have not already been declared, you can do so as part of the vector statement. For more information on this, see page 904 of the SPSS Base Syntax Reference Guide. The following are all acceptable vector declarations.

vector V = v1 to v8.
vector Myvector = entry01 to entry64.
vector Grade = grade1 to grade12.
vector Income = in1992 to in2000.

The vector is given the name Vname and is used to reference a set of variables defined by the variable list. The elements in the vector must be declared using the syntax first variable to last variable. You cannot list them out individually. This means that the variables to be included in a vector must all be grouped together in your data set.

Vectors can be used in transformation statements just like variables. However, the vector itself isn't able to hold values. Instead, the vector acts as a mediator between your statement and the variables it references. The variables included in a vector are placed in a specific order, determined by the declaration statement. So if you give SPSS a vector and an order number (referred to as the index), it knows what specific element you want to access. You do not need to
know what the exact name of the variable is; you just need to know its location in the vector. References to items within a vector are typically made using the format

vname(index)

where vname is the name of the vector, and index is the numerical position of the desired element. Using this format, you can use a vector to reference a variable in any place that you would normally insert a variable name. For example, all of the following would be valid SPSS statements, assuming that we had defined the four vectors above.

compute V(4) = 6.
if (Myvector(30)='house') correct = correct + 1.
compute sum1 = Grade(1) + Grade(2) + Grade(3).
compute change = Income(9) - Income(1).

Note that the index used by a vector only takes into account the position of elements in the vector, not the names of the variables. To reference the variable in1993 from the Income vector above, you would use the phrase Income(2), not Income(1993).

Using vectors this way doesn't provide us with much of an advantage, since we are not really saving ourselves any effort by referring to a particular variable as Myvector(1) instead of entry01. The advantage comes from the fact that the index of the vector can itself be a variable. In this case, the element that the vector references depends on the value of the index variable. So the exact variable that is changed by the statement

compute Grade(t) = Grade(t) + 1.

depends on the value of t when this statement is executed. If t has the value of 1, then the variable grade1 will be incremented by 1. If t has a value of 8, then the variable grade8 will be incremented by 1. This means that the same statement can be used to perform many different things, depending on what value you assign to t. This allows you to use vectors to write "generic" sections of code, where you control exactly what the code does by assigning different values to the index variables.

Loops

Vectors are most useful when they are combined with loops. A loop is a statement that lets you tell the computer to perform a set of commands a specified number of times. In SPSS you can tell the computer to perform a loop by using the following code:

loop loop_variable = lower_limit to upper_limit.
--commands to be repeated appear here--
end loop.

When SPSS encounters a loop statement, it first sets the value of the loop variable equal to the lower limit. It then performs all of the commands inside the loop until it reaches the end loop statement. At that point the computer adds 1 to the loop variable, and then compares it to the upper limit. If the new value of the loop variable is less than or equal to the upper limit, it goes back to the beginning of the loop and goes through all of the commands
again. If the new value is greater than the upper limit, the computer then moves to the statement after the end loop statement. Basically, this means that the computer performs the statements inside the loop a total number of times equal to (upper limit - lower limit + 1).

The following is an example of an SPSS program that uses a loop to calculate a sum:

compute x = 0.
loop #t = 4 to 8.
+ compute x = x + #t.
end loop.

The first line simply initializes the variable x to the value of zero. The second line defines the conditions of the loop. The loop variable is named #t, and starts with a value of 4. The loop cycles until the value of #t is greater than 8. This causes the program to perform a total of 5 cycles. During each cycle the current value of #t is added to x. At the end of this set of statements, the variable x would have the value of 4 + 5 + 6 + 7 + 8 = 30.

In this example, the loop variable is denoted as a "scratch variable" because its first character is a number sign (#). When something is denoted as a scratch variable in SPSS it is not saved in the final data set. Typically we are not interested in storing the values of our loop variables, so it is common practice to denote them as scratch variables. For more information on scratch variables see page 32 of the SPSS Base Syntax Reference Guide.

You will also notice the plus sign (+) placed before the compute statement in line 3. SPSS needs you to start all new commands in the first column of each line. Here we wish to indent the command to indicate that it is part of the loop. We therefore put the plus symbol in the first column, which tells SPSS that the actual command starts later on the line.

Just in case you were wondering, the first statement setting x = 0 is actually necessary for the sum to be calculated. SPSS starts new numeric variables with missing values. Adding anything to a missing value produces a missing value, so we must explicitly start the variable x at zero to be able to obtain the sum.

The Power of Combining Vectors and Loops

Though you can work with vectors and loops alone, they were truly designed to be used together. A combination of vectors and loops can save you incredible amounts of time when performing certain types of repetitive transformations.

Consider the characteristics of vectors and loops. A vector lets you reference a set of related variables using a single name and an index. The index can be a variable or a mathematical expression involving one or more variables. A loop repeatedly performs a set of commands, incrementing a loop variable after each cycle. What would happen if a statement inside of a loop referenced a vector using the loop variable as the index? During each cycle, the loop variable increases by 1. So during each cycle, the vector would refer to a different variable. If you correctly design the upper and lower limits of your loop, you could use a loop to perform a transformation on every element of a vector.

For an example, let's say that you conducted a reaction-time study where research participants observed strings of letters on the screen and judged whether they composed a real word or not. In your study, you had a total of 200 trials in several experimental conditions. You want to analyze
your data with an ANOVA to see if the reaction time varies by condition, but you find that the data has a right skew (which is common). To use ANOVA, you will need to transform the data so that its distribution is closer to normal, and a common way to do this is to take the logarithm of the response time on each trial. In terms of your data set, what you need is a set of 200 new variables whose values are equal to the logarithms of the 200 response time variables.

Without using vectors or loops, you would need to write 200 individual transformation statements to create each log variable from the corresponding response time variable. Using vectors and loops, however, we can do the same work with the following simple program. The program assumes that the original response time variables are rt001 to rt200, and that the log variables lrt001 to lrt200 have already been declared, since this form of the vector statement can only reference existing variables.

vector Rtvector = rt001 to rt200.
vector Lvector = lrt001 to lrt200.
loop #item = 1 to 200.
+ compute Lvector(#item) = ln(Rtvector(#item)).
end loop.

The first two statements set up a pair of vectors, one to represent the original response time variables and one to represent the transformed variables. The third statement creates a loop with 200 cycles. Each cycle of the loop corresponds to a trial in the experiment. The fourth line actually performs the desired transformation. During each cycle it takes one variable from Lvector and sets it equal to the natural log (ln) of the corresponding variable in Rtvector. The fifth line simply ends the loop. By the time this program completes, the 200 log variables will hold the values that you desire.

In addition to greatly reducing the number of programming lines, there are other advantages to performing transformations using vectors and loops. If you need to make a change to the transformation you only need to change a single statement. If you write separate transformations for each variable, you must change every single statement any time you want to change the specifics of the transformation. It is also much easier to read programs that use loops than programs with large numbers of transformation statements. The loops naturally group together transformations that are all of the same type, whereas with a list you must examine each individual transformation to find out what it does.
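As a variation on the reaction-time program, the following sketch covers the case where the log variables do not yet exist in the data set. It uses the short form of the vector statement, which creates the new variables itself. The vector name lrt and the loop variable #i are illustrative choices. SPSS generates the new variable names from the vector name, so they would be lrt1 to lrt200 rather than the zero-padded lrt001 to lrt200. The sketch also assumes that rt001 to rt200 are adjacent in the data set.

```spss
* Sketch: create 200 new log-transformed variables in one pass.
* Assumes rt001 to rt200 already exist and are adjacent in the file.
vector Rt = rt001 to rt200.
* Short form: creates new numeric variables lrt1 to lrt200.
vector lrt(200).
loop #i = 1 to 200.
+ compute lrt(#i) = ln(Rt(#i)).
end loop.
execute.
```

Vector definitions are temporary and last only until the next procedure or execute command, so the vector statements and the loop that uses them should be run together as one block. Response times of zero or below have no logarithm and would produce missing values in the new variables.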