8210 WK 8 DISCUSSION
Discussion: Correlation and Bivariate Regression
Whether in a scholarly or practitioner setting, good research and data analysis should have the benefit of peer feedback. For this Discussion, you will perform an article critique on correlation and bivariate regression. Be sure to remember that the goal is to obtain constructive feedback to improve the research and its interpretation, so please view this as an opportunity to learn from one another.
To prepare for this Discussion:
· Review the Learning Resources and the media programs related to correlation and regression.
· Search for and select a quantitative article specific to your discipline and related to correlation or regression. Help with this task may be found in the Course guide and assignment help linked in this week’s Learning Resources. You can also use the Research Design Alignment Table, located in this week’s Learning Resources, as a guide.
By Day 3
Write a 3- to 5-paragraph critique of the article. In your critique, include responses to the following:
1. What is the research design used by the authors?
2. Why did the authors use correlation or bivariate regression?
3. Do you think it’s the most appropriate choice? Why or why not?
4. Did the authors display the data?
5. Do the results stand alone? Why or why not?
6. Did the authors report effect size? If yes, is this meaningful?
Be sure to support your Main Post and Response Post with reference to the week’s Learning Resources and other scholarly evidence in APA Style.
Week Eight: Regression
Posted on: Friday, July 15, 2022 7:38:09 AM EDT
Good morning:
For those challenged by statistics and mathematical calculations, the mere title of this week's focus can be daunting. Before attempting Pearson's r or bivariate regression, as with all mathematical activities, it helps to understand the concept first. What is a bivariate regression, and why might it be important to understand as your doctoral studies unfold? Here, I have two resources:
Regression vs. Correlation: In statistics, determining the relationship between two random variables can be important or required to guide decision making. Rather than inferring or guessing, regression analyses can inform decisions scientifically. Proper analysis of data informs predictions, making proposed outcomes more probable. Regression analyses and correlations are used to form weather forecasts (to build the 'models' we often hear about when a hurricane appears on the horizon). Regression studies are used within financial services to 'speculate.' In other words, while challenging, the exercises you will undertake will help you understand how regressions influence and inform real-world scenarios.
What is Regression? Regressions illustrate potential relationships between two variables. Some data collected may focus upon dependent variables. The exact relationship can sometimes only be made evident using regression methods. Regression predicts the behavior or response of one variable given another. For example, a company wishes to price a new product. A regression can predict how consumers will respond to various prices by engaging a random sample of would-be shoppers.
This can easily be represented by a scatter plot [see link below]. Graphically, regression is equivalent to finding the best-fitting curve for the given data set. The function of the curve is the regression function. Using the mathematical model, the demand for a commodity can be predicted for a given price. You will see that 'scatter plots' can reveal patterns even without finalizing a regression. Within social sciences, 94.9% of human behavior is predictable. In regressions, with proper variables selected, a 'scatter plot' will reveal/predict the most common response to variables.
If you still struggle to visualize a regression, then take a look at this: https://www.statisticssolutions.com/bivariate-correlation/
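The pricing example can be sketched in a few lines of R. This is only an illustration under stated assumptions: the price and demand values are simulated (hypothetical, not from any study), and R is used here rather than the SPSS workflow in the course text. It shows Pearson's r, the fitted regression line, and a prediction for a new price.

```r
# Hypothetical, simulated data: ten price points and the units demanded at each.
set.seed(42)
price  <- seq(5, 50, by = 5)
demand <- 200 - 3 * price + rnorm(length(price), sd = 10)

# Correlation: strength and direction of the linear association (Pearson's r).
r <- cor(price, demand)

# Bivariate regression: the best-fitting straight line, demand = a + b * price.
fit <- lm(demand ~ price)
coef(fit)

# In bivariate regression, R-squared is the square of Pearson's r.
c(r_squared = summary(fit)$r.squared, r_times_r = r^2)

# Predicted demand at a price that was not in the sample.
predict(fit, newdata = data.frame(price = 32))

# plot(price, demand); abline(fit)  # uncomment to draw the scatter plot and fitted line
```

Correlation summarizes how tightly the points cluster around a line, while regression gives the line itself, which is what allows a prediction at a new price.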
NOTE: Remember that you are (1) making your understanding of demanding concepts clear and (2) concluding with what a naive, passive audience would understand from looking at your analysis.
By Day 5 USE FOR RESPONSES
Respond to one of your colleagues’ posts and:
1. Make recommendations for the design choice.
2. Explain whether you think that this is the appropriate test to use for the research question. Why or why not?
3. As a lay reader, were you able to understand the results and their implications? Why or why not?
Frankfort-Nachmias, C., Leon-Guerrero, A., & Davis, G. (2020). Social statistics for a diverse society (9th ed.). Thousand Oaks, CA: Sage Publications.
· Chapter 12, “Regression and Correlation” (pp. 401-457)
Wagner, III, W. E. (2020). Using IBM® SPSS® statistics for research methods and social science statistics (7th ed.). Thousand Oaks, CA: Sage Publications.
· Chapter 8, “Correlation and Regression Analysis”
Walden University, LLC. (Producer). (2016b). Correlation and bivariate regression [Video file]. Baltimore, MD: Author.
Note: The approximate length of this media piece is 9 minutes.
EXAMPLE
Why linear mixed-effects models are probably not the solution to your missing data problems
July 09, 2020
Linear mixed-effects models are often used for their ability to handle missing data using maximum likelihood estimation. In this post I will present a simple example of when the LMM fails, and illustrate two MNAR sensitivity analyses: the pattern-mixture method and the joint model (shared parameter model). This post is based on a small example from my PhD thesis.
MCAR, MAR, and MNAR missing data
D. B. Rubin (1976) presented three types of missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR). LMMs provide unbiased estimates under MAR missingness. If we have the complete outcome variable $Y$ (which is made up of the observed data $Y_{obs}$ and the missing values $Y_{miss}$) and a missing data indicator $R$ (D. B. Rubin 1976; R. J. Little and Rubin 2014; Schafer and Graham 2002), then we can write the MCAR and MAR mechanisms as,
$$
\begin{aligned}
\text{MCAR}: \quad P(R \mid Y) &= P(R) \\
\text{MAR}: \quad P(R \mid Y) &= P(R \mid Y_{obs}).
\end{aligned}
$$
If the missingness depends on Y_miss, the missing values in Y, then the mechanism is MNAR. MCAR and MAR are called ignorable because the precise model describing the missing data process is not needed. In theory, valid inference under MNAR missingness requires specifying a joint distribution for both the data and the missingness mechanisms (R. J. A. Little 1995). There is no way to test whether the missing data are MAR or MNAR (Molenberghs et al. 2008; Rhoads 2012), and it is therefore recommended to perform sensitivity analyses using different MNAR mechanisms (Schafer and Graham 2002; R. J. A. Little 1995; Hedeker and Gibbons 1997).
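The three mechanisms can be made concrete with a small simulation. The following is a minimal sketch in base R, not taken from the original post: a baseline score `y1` is always observed, a follow-up score `y2` may be missing, and all coefficients are arbitrary values assumed for illustration. It compares the regression slope of `y2` on `y1` among the observed cases under each mechanism.

```r
set.seed(2020)
n  <- 10000
y1 <- rnorm(n)                        # baseline, always observed
y2 <- 0.6 * y1 + rnorm(n, sd = 0.8)   # follow-up, possibly missing (true slope = 0.6)

# Missingness indicators (1 = y2 is missing); coefficients are assumed values.
r_mcar <- rbinom(n, 1, 0.3)                    # MCAR: depends on nothing
r_mar  <- rbinom(n, 1, plogis(-1 + 3 * y1))    # MAR: depends on the observed y1
r_mnar <- rbinom(n, 1, plogis(-1 + 3 * y2))    # MNAR: depends on the unobserved y2

# Slope of y2 on y1 using only the cases where y2 is observed.
observed_slope <- function(missing) {
  keep <- missing == 0
  unname(coef(lm(y2[keep] ~ y1[keep]))[2])
}

slopes <- c(complete = observed_slope(rep(0, n)),
            MCAR     = observed_slope(r_mcar),
            MAR      = observed_slope(r_mar),
            MNAR     = observed_slope(r_mnar))
round(slopes, 2)
```

In this setup the MCAR and MAR slopes should stay close to the true value, because the model conditions on the variable that drives the missingness, while the MNAR slope is pulled toward zero.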
LMMs and missing data
LMMs are frequently used by researchers to try to deal with missing data problems. However, researchers frequently misunderstand the MAR assumption and often fail to build a model that would make the assumption more plausible. Sometimes you even see researchers using tests, e.g., Little’s MCAR test, to prove that the missing data mechanism is either MCAR or MAR and hence ignorable, which is clearly a misunderstanding and builds on faulty logic.
A common problem is that researchers do not include covariates that potentially predict dropout. Thus, it is assumed that missingness only depends on the previously observed values of the outcome. This is quite a strong assumption. A related misunderstanding is that the LMM’s missing data assumption is more liberal because it allows participants’ slopes to vary. It is sometimes assumed that if a random slope is included in the model, it can also be used to satisfy the MAR assumption. Clearly, it would be very practical if the inclusion of random slopes allowed missingness to depend on patients’ latent change over time, because it is probably true that some participants’ dropout is related to their symptoms’ rate of change over time. Unfortunately, the random effects are latent variables and not observed variables; hence, such a missingness mechanism would also be MNAR (R. J. A. Little 1995). The figure below illustrates the MAR, outcome-based MNAR, and random coefficient-based MNAR mechanisms.
Figure 1. Three different drop out mechanisms in longitudinal data from one patient. a) Illustrates a MAR mechanism where the patient's likelihood of dropping out is related to an observed large value. b) Shows an outcome-related MNAR mechanism, where dropout is related to a large unobserved value. c) Shows a random-slope MNAR mechanism where the likelihood of dropping out is related to the patient's unobserved slope.
Let’s generate some data
To illustrate these concepts, let’s generate data from a two-level LMM with random intercepts and slopes, and include an MNAR missing data mechanism where the likelihood of dropping out depends on the patient-specific random slopes. Moreover, let’s assume that the missingness differs between the treatment and control group. This isn’t that unlikely in unblinded studies (e.g., wait-list controls).
Figure 2. A differential MNAR dropout process where the probability of dropping out from a trial depends on the patient-specific slopes which interact with the treatment allocation. The probability of dropout is assumed to be constant over time.
Figure 3. A sample of patients drawn from the MNAR (random slope) data-generating process. Circles represent complete observations; the bold line represents the slope before dropping out. P(dropout) gives the probability of dropout, which is assumed to be constant at all time points.
The equations for the dropout can be written as,
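$$
\begin{aligned}
\text{logit}\bigl(\Pr(R_{ij} = 1 \mid TX_{ij} = 1)\bigr) &= -\sigma_{u_1} + \text{logit}(0.15) + U_{1j} \\
\text{logit}\bigl(\Pr(R_{ij} = 1 \mid TX_{ij} = 0)\bigr) &= -\sigma_{u_1} + \text{logit}(0.15) - U_{1j}.
\end{aligned}
$$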
The R code is quite simple.
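As a minimal sketch of the two dropout equations (not the author's implementation of `add_MNAR_missing()`, which is not shown here), the probability of dropout can be computed directly from a patient's random slope. The function name `dropout_prob`, the value `sd_slope = 1`, and the simulated slopes are all assumptions made for illustration.

```r
# logit(p) = -sd_slope + logit(0.15) + U1j   in the treatment group (treatment == 1)
# logit(p) = -sd_slope + logit(0.15) - U1j   in the control group   (treatment == 0)
dropout_prob <- function(u1, treatment, sd_slope) {
  sign <- ifelse(treatment == 1, 1, -1)
  plogis(-sd_slope + qlogis(0.15) + sign * u1)
}

set.seed(1)
sd_slope <- 1                    # assumed slope SD, for illustration only
u1 <- rnorm(5, sd = sd_slope)    # hypothetical patient-specific random slopes

data.frame(u1          = round(u1, 2),
           p_treatment = round(dropout_prob(u1, 1, sd_slope), 3),
           p_control   = round(dropout_prob(u1, 0, sd_slope), 3))
```

The sign flip is what makes the dropout differential: a larger random slope raises the dropout probability in the treatment group and lowers it in the control group.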
Now let’s draw a large sample from this model (1000 participants per group), and fit a typical longitudinal LMM using both the complete outcome variable and the incomplete (MNAR) outcome variable.
library(lme4)
library(powerlmm)

# Study design: 11 time points, 1000 participants per group
p <- study_parameters(n1 = 11,
                      n2 = 1000,
                      icc_pre_subject = 0.6,
                      fixed_slope = -0.48,
                      var_ratio = 0.02,
                      cor_subject = -0.5,
                      effect_size = cohend(-0.2))

set.seed(1111)
d <- simulate_data(p)
d <- add_MNAR_missing(d)

# MNAR
fit <- lmer(y ~ time * treatment + (time | subject), data = d)

# Complete Y
fit_c <- lmer(y_c ~ time * treatment + (time | subject), data = d)
Here are the results.
# complete
summary(fit_c)

# MNAR
summary(fit)
We can see that the slope difference is -0.25 for the complete data and much larger for the LMM with missing data (-1.14).
A Pattern-mixture model
A simple extension of the classical LMM is a pattern-mixture model. This is a simple model where we allow the slope to differ within subgroups of different dropout patterns. The simplest pattern is to group the participants into two subgroups, dropouts (1) and completers (0), and include this dummy variable in the model.
fit_PM <- lmer(
  y ~ time * treatment * dropout + (time | subject),
  data = d)
summary(fit_PM)
As you can see in the output, we now have a bunch of new coefficients. In order to get the marginal treatment effect we need to average over the dropout patterns. There are several ways to do this; we could just calculate a weighted average manually. For example, the outcome at posttest in the control group is
# weight by the overall proportion of dropouts
p <- mean(d$dropout == 0)
b <- fixef(fit_PM)

# Outcome in control group at posttest
(b[1] + b[2]*10) * p[1] +
  (b[1] + b[4] + (b[2] + b[6]) * 10) * (1 - p[1])
## (Intercept)
##    -3.70131
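Written out, the weighted average computed above is the following, where $\pi$ is the proportion of completers (`p` in the code) and the subscripts name the fixed effects of `fit_PM` (the notation is mine, introduced only to map the code to a formula):

```latex
\begin{aligned}
\hat{Y}_{\text{control}}(t = 10)
  &= \pi \,\bigl(\beta_{0} + 10\,\beta_{\text{time}}\bigr) \\
  &\quad + (1 - \pi)\,\bigl[\beta_{0} + \beta_{\text{dropout}}
     + 10\,\bigl(\beta_{\text{time}} + \beta_{\text{time:dropout}}\bigr)\bigr]
\end{aligned}
```

Here $\beta_{0}$, $\beta_{\text{time}}$, $\beta_{\text{dropout}}$, and $\beta_{\text{time:dropout}}$ correspond to `b[1]`, `b[2]`, `b[4]`, and `b[6]`. Applying the same weighting to the treatment group and subtracting, the terms shared by both groups cancel, which leaves the treatment-minus-control difference at posttest:

```latex
\Delta(t = 10) = \beta_{\text{trt}} + 10\,\beta_{\text{time:trt}}
  + (1 - \pi)\,\bigl(\beta_{\text{trt:dropout}} + 10\,\beta_{\text{time:trt:dropout}}\bigr)
```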
To estimate the treatment effect we’d need to repeat this for the treatment group and take the difference. However, we’d also need to calculate the standard errors (e.g., using the delta method). An easier option is to just specify the linear contrast we are interested in.
L <- c(0, 0, 1, 0, 10, 0, (1 - p), (1 - p) * 10)
lmerTest::contest1D(fit_PM, L = L)
##    Estimate Std. Error       df   t value     Pr(>|t|)
## 1 -4.646391  0.8937291 2687.174 -5.198881 2.155559e-07
This tells us that the difference between the groups at posttest is estimated to be -4.65. This is considerably smaller than the estimate from the classical LMM, but still larger than for the complete data. We could accomplish the same thing using the emmeans package.
emmeans::emmeans(fit_PM,
                 pairwise ~ treatment | time,
                 at = list(time = 10),
                 CIs = FALSE,
                 lmer.df = "asymptotic", # wald
                 weights = "proportional",
                 data = d)
## $emmeans
## time = 10:
##  treatment emmean    SE  df asymp.LCL asymp.UCL
##  0          -3.70 0.635 Inf     -4.95     -2.46
##  1          -8.35 0.629 Inf     -9.58     -7.11
##
## Results are averaged over the levels of: dropout
## Degrees-of-freedom method: asymptotic
## Confidence level used: 0.95
##
## $contrasts
## time = 10:
##  contrast estimate    SE  df z.ratio p.value
##  0 - 1        4.65 0.894 Inf   5.199  <.0001
##
## Results are averaged over the levels of: dropout
## Degrees-of-freedom method: asymptotic
Fitting a joint model
The pattern-mixture model was an improvement, but it didn’t completely recover the treatment effect under the random slope MNAR model. We can actually fit a model that allows dropout to be related to the participants’ random slopes. To accomplish this we combine a survival model for the dropout process and an LMM for the longitudinal outcome.
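In symbols, a sketch of this kind of shared-parameter model, written under my own assumptions about notation (it is not copied from the post or from the JM documentation), links the two submodels through the subject-specific slope. With $m_j(t)$ the subject-specific mean trajectory from the LMM and $h_j(t)$ the hazard of dropout for subject $j$:

```latex
\begin{aligned}
m_j(t)  &= \beta_0 + \beta_1 TX_j + \beta_2 t + \beta_3 TX_j\, t + U_{0j} + U_{1j}\, t \\
m_j'(t) &= \beta_2 + \beta_3 TX_j + U_{1j} \\
h_j(t)  &= h_0(t)\, \exp\bigl\{ \gamma\, TX_j + (\alpha_0 + \alpha_1 TX_j)\, m_j'(t) \bigr\}
\end{aligned}
```

The derivative $m_j'(t)$ contains the random slope $U_{1j}$, so dropout is allowed to depend on a latent quantity, and the association $(\alpha_0 + \alpha_1 TX_j)$ may differ between treatment arms, mirroring the differential dropout mechanism used to generate the data.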
library(JM)     # also attaches nlme (lme) and survival (coxph, Surv)
library(dplyr)  # for %>%, filter(), arrange(), group_by(), summarise()

# JM
d_c <- d
# keep only the observed outcome values
d_m <- d %>%
  filter(!is.na(y)) %>%
  arrange(subject)

# LMM
fit_lme <- lme(
  y ~ treatment * time, data = d_m,
  random = ~ time | subject
)

# dropouts: one row per subject with the event time and dropout indicator
d_miss <- d_m %>%
  group_by(subject, treatment) %>%
  summarise(time = max(time),
            time = ifelse(time < 10, time + 1, time),
            dropout = ifelse(time < 10, 1, 0)) %>%
  arrange(subject)

# the Cox model
fit_surv <- coxph(
  Surv(time, dropout) ~ 1 + treatment,
  data = d_miss,
  x = TRUE
)

# slope derivatives
dForm <- list(
  fixed = ~treatment,
  random = ~1,
  indFixed = c(3, 4),
  indRandom = c(2)
)

# Fit the joint model
fit_JM <- jointModel(
  fit_lme,
  fit_surv,
  timeVar = "time",
  parameterization = "slope",
  derivForm = dForm,
  interFact = list(slope = ~treatment,
                   data = d_miss))
summary(fit_JM)
We can see from the output that the estimate of the treatment effect is really close to the estimate from the complete data (-0.23 vs -0.25). There’s only one small problem with the joint model and that is that we almost never know what the correct model is…
A small simulation
Now let’s run a small simulation to show the consequences of this random-slope dependent MNAR scenario. We’ll do a study with 11 time points, 150 participants per group, a variance ratio of 0.02, and pretest ICC = 0.6, with a correlation between intercept and slopes of -0.5. There will be a “small” effect in favor of the treatment of d=−0.2. The following models will be compared:
· LMM (MAR): a classical LMM assuming that the dropout was MAR.
· GEE: a generalized estimating equation model.
· LMM (PM): an LMM using a pattern-mixture approach. Two patterns were used; either “dropout” or “completer”, and the results were averaged over the two patterns.
· JM: A joint model that correctly allowed the dropout to be related to the random slopes.
· LMM with complete data: an LMM fit to the complete data without any missingness.
I will not post all code here; the complete code for this post can be found on GitHub.
Results
The table and figure below show how much the treatment effects differ. We can see that LMMs are badly biased under this missing data scenario; the treatment effect is much larger than it should be (Cohen’s d: -0.7 vs. -0.2). The pattern-mixture approach improves the situation, and the joint model recovers the true effect. Since the sample size is large, the bias under the MAR assumption leads to the LMM’s CIs having extremely bad coverage. Moreover, under the assumption of no treatment effect the MAR LMM’s type I errors are very high (83%), whereas the pattern-mixture and joint model are closer to the nominal levels.
| Model | M(Est.) | Rel. bias | d | Power | CI coverage | Type I error |
|---|---|---|---|---|---|---|
| MAR | -11.84 | 274.38 | -0.74 | 1.00 | 0.02 | 0.83 |
| PM | -5.39 | 70.47 | -0.34 | 0.64 | 0.84 | 0.10 |
| GEE | -11.19 | 253.98 | -0.70 | 1.00 | 0.06 | 0.71 |
| JM | -3.18 | 0.59 | -0.20 | 0.28 | 0.93 | 0.07 |
| Complete | -3.21 | 1.44 | -0.20 | 0.38 | 0.95 | 0.05 |
Note: MAR = missing at random; LMM = linear mixed-effects model; GEE = generalized estimating equation; JM = joint model; PM = pattern mixture; Est. = mean of the estimated effects; Rel. bias = relative bias of Est.; d = mean of the Cohen’s d estimates.
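The post does not state how relative bias was computed or what the true effect was. As a hedged consistency check, assume that Rel. bias is the percentage deviation of the mean estimate from the true effect, 100 × (Est. − θ) / θ. Under that assumption, each row of the table implies a value of θ, and the rows should roughly agree. The data frame below simply re-enters the table values; the column names are mine.

```r
# Values copied from the results table above.
sim <- data.frame(
  model    = c("MAR", "PM", "GEE", "JM", "Complete"),
  est      = c(-11.84, -5.39, -11.19, -3.18, -3.21),
  rel_bias = c(274.38, 70.47, 253.98, 0.59, 1.44)
)

# Assumed definition: rel_bias = 100 * (est - theta) / theta,
# which rearranges to theta = est / (1 + rel_bias / 100).
sim$implied_theta <- sim$est / (1 + sim$rel_bias / 100)
sim
```

All five rows imply a true posttest difference of about -3.16, which supports reading the column as a percentage, so the MAR model's estimate is almost four times the true effect.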
Figure 4. Mean of the estimated treatment effect from the MNAR missing data simulations for the different models. The dashed lines represent the control group's estimated average slope and the solid lines the treatment group's average slope.
Summary
This example is purposely quite extreme. However, even if the MNAR mechanism were weaker, the LMM would yield biased estimates of the treatment effect. The assumption that dropout might be related to patients’ unobserved slopes is not unreasonable. However, fitting a joint model is often not feasible as we do not know the true missingness mechanism. I included it just to illustrate what is required to avoid bias under a plausible MNAR mechanism. In reality, the patients’ likelihood of dropping out is likely an inseparable mix of various degrees of MCAR, MAR, and MNAR mechanisms. The only sure way of avoiding bias would be to try to acquire data from all participants, and when that fails, perform sensitivity analyses using reasonable assumptions of the missingness mechanisms.
References
Hedeker, Donald, and Robert D. Gibbons. 1997. “Application of Random-Effects Pattern-Mixture Models for Missing Data in Longitudinal Studies.” Psychological Methods 2 (1): 64–78. doi: 10.1037/1082-989X.2.1.64.
Little, Roderick J. A. 1995. “Modeling the Drop-Out Mechanism in Repeated-Measures Studies.” Journal of the American Statistical Association 90 (431): 1112–21. doi: 10.1080/01621459.1995.10476615.
Little, Roderick J. A., and Donald B. Rubin. 2014. Statistical Analysis with Missing Data. Vol. 333. John Wiley & Sons.
Molenberghs, Geert, Caroline Beunckens, Cristina Sotto, and Michael G. Kenward. 2008. “Every Missingness Not at Random Model Has a Missingness at Random Counterpart with Equal Fit.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (2): 371–88. doi: 10.1111/j.1467-9868.2007.00640.x.
Rhoads, Christopher H. 2012. “Problems with Tests of the Missingness Mechanism in Quantitative Policy Studies.” Statistics, Politics, and Policy 3 (1). doi: 10.1515/2151-7509.1012.
Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92. doi: 10.1093/biomet/63.3.581.
Schafer, Joseph L., and John W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7 (2): 147–77. doi: 10.1037//1082-989X.7.2.147.