help 4 pages
PARTIAL EM PROCEDURE FOR BIG-DATA LINEAR MIXED
EFFECTS MODEL, AND GENERALIZED PPE FOR
HIGH-DIMENSIONAL DATA IN JULIA
by
JANG-IK CHO
Submitted in partial fulfillment of the requirements
For the degree of Doctor of Philosophy
Dissertation Advisor: Dr. Jiayang Sun
Department of Population and Quantitative Health Sciences
CASE WESTERN RESERVE UNIVERSITY
March 2018
ProQuest Number:
All rights reserved
INFORMATION TO ALL USERS The quality of this reproduction is dependent on the quality of the copy submitted.
In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed,
a note will indicate the deletion.
Published by ProQuest LLC (
ProQuest
). Copyright of the Dissertation is held by the Author.
All Rights Reserved. This work is protected against unauthorized copying under Title 17, United States Code
Microform Edition © ProQuest LLC.
ProQuest LLC 789 East Eisenhower Parkway
P.O. Box 1346 Ann Arbor, MI 48106 - 1346
28099746
28099746
2020
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the dissertation of
JANG-IK CHO
candidate for the Doctor of Philosophy degree *
Committee Chair: Dr. Jeffrey Albert Professor Department of Population and Quantitative Health Sciences
Committee: Dr. Jiayang Sun Dissertation Advisor Professor Department of Population and Quantitative Health Sciences
Committee: Dr. Mark Schluchter Professor Department of Population and Quantitative Health Sciences
Committee: Dr. David Aron Professor Department of Clinical and Molecular Endocrinology
Committee (optional member): Dr. Yifan Xu Applied Scientist II Amazon
March 2018
*We also certify that written approval has been obtained for any proprietary material contained therein.
Table of Contents
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . vi Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . vii Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
I Big-data Linear Mixed Effects Model (bLMM) 1
1 Introduction and Challenges 2
2 Data Curation 10
3 Measurement Variability and Heterogeneity of the Popula- tion 21 3.1 Measurement Variability . . . . . . . . . . . . . . . . . . . . . 21 3.2 Heterogeneity of the Population . . . . . . . . . . . . . . . . . 23
4 Big-data Linear Mixed Effects Model (bLMM) 40 4.1 Introduction to bLMM . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Linear Mixed Effects Model (LMM) . . . . . . . . . . . . . . . 43 4.3 Expectation-Maximization (EM) Algorithm . . . . . . . . . . 48 4.4 Partial EM (PEM) Procedure for bLMM . . . . . . . . . . . . 59
5 Performance of bLMM PEM 66
6 Evaluation of SCAN-ECHO training on diabetes treatment using bLMM PEM 72
7 Discussion and Conclusion 75
iii
II Projection Pursuit Ellipse (PPE) 78
8 Original PPE 79 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 8.2 PPE Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.2.1 PPE Index 1 . . . . . . . . . . . . . . . . . . . . . . . 82 8.2.2 PPE Index 2 . . . . . . . . . . . . . . . . . . . . . . . 82 8.2.3 PPE Index 3 . . . . . . . . . . . . . . . . . . . . . . . 83
8.3 PPE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9 High-dimensional PPE (hPPE) 86 9.1 hPPE algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 86 9.2 hPPE power analysis . . . . . . . . . . . . . . . . . . . . . . . 87
10 Discussion and Conclusion 89
A Appendix 91 A.1 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . 91
A.1.1 MLE for β . . . . . . . . . . . . . . . . . . . . . . . . . 92 A.1.2 MLE for (σ2k)k=1,...,q and (ρkl)k 6=l=1,...q . . . . . . . . . . 92 A.1.3 MLE for σ20 . . . . . . . . . . . . . . . . . . . . . . . . 97 A.1.4 MLE for φ . . . . . . . . . . . . . . . . . . . . . . . . . 98
A.2 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . 100 A.3 Proof of Proposition 2 . . . . . . . . . . . . . . . . . . . . . . 100
Bibliography 101 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
iv
List of Tables
1.1 Variable definition and demographics of the study population . 4
2.1 Arguments for R-function fusionr . . . . . . . . . . . . . . . . 13
3.1 Number of years for number of A1c counts per year . . . . . . 22 3.2 Descriptive statistics of the summary measures of A1c . . . . . 23 3.3 Estimated coefficients of the fixed effects in the LMM for mean
of A1c by clusters . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.4 Estimated coefficients of the fixed effects in the LMM for mean
of A1c from all patients . . . . . . . . . . . . . . . . . . . . . 28 3.5 Estimated coefficients of the fixed effects in the LMM for SD
of A1c by clusters . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.6 Estimated coefficients of the fixed effects in the LMM for SD
of A1c from all patients . . . . . . . . . . . . . . . . . . . . . 33 3.7 Estimated coefficients of the fixed effects in the LMM for CV
of A1c by clusters . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.8 Estimated coefficients of the fixed effects in the LMM for CV
of A1c from all patients . . . . . . . . . . . . . . . . . . . . . 38
5.1 Mean-square error of the estimates . . . . . . . . . . . . . . . 70 5.2 Bias of the estimates . . . . . . . . . . . . . . . . . . . . . . . 70 5.3 % Bias of the estimates . . . . . . . . . . . . . . . . . . . . . . 71 5.4 Variance of the estimates . . . . . . . . . . . . . . . . . . . . . 71 5.5 Efficiency (sec) . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1 Estimated coefficients of the fixed effects in the LMM for orig- inal A1c by clusters . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Estimated coefficients of the fixed effects in the LMM for orig- inal A1c from all patients . . . . . . . . . . . . . . . . . . . . 74
9.1 Empirical power in percents comparison for PPE . . . . . . . 88 9.2 hPPE empirical power in percents comparison for hPPE . . . 88
v
List of Figures
1.1 Profiles of A1c from all patients . . . . . . . . . . . . . . . . . 5 1.2 Profiles of A1c from 80 random patients . . . . . . . . . . . . 6 1.3 Trends of SD of A1c from patients in subgroups . . . . . . . . 8
2.1 Data curation scheme . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 Clustering decision criterions for the mean of A1c . . . . . . . 24 3.2 Cluster analysis result for the mean of A1c . . . . . . . . . . . 25 3.3 Clustering decision criterions for the SD of A1c . . . . . . . . 30 3.4 Cluster analysis result for the SD of A1c . . . . . . . . . . . . 31 3.5 Clustering decision criterions for the CV of A1c . . . . . . . . 34 3.6 Cluster analysis result for the CV of A1c . . . . . . . . . . . . 35
4.1 Overall flow of bLMM PEM algorithm . . . . . . . . . . . . . 42 4.2 Overall flow of PEM Procedure . . . . . . . . . . . . . . . . . 60
vi
ACKNOWLEDGEMENTS
I would like to express my profound gratitude to the people who made this
thesis possible.
First and foremost, I wish to thank my advisor, Professor Jiayang Sun, the
director of the Center for Statistical Research, Computing and Collaborations
(SR2c) at Case Western Reserve University. This work would not have been
possible without her academic and financial support. For the last 4.5 years
since I started my Ph.D. program, she has guided me through the entire
journey, having her door always opened whenever I needed her advice. Her
insights in statistics have been the driving force to overcome any challenges
I encountered. During the most challenging times when writing this thesis,
she gave me the moral support, encouragement, and freedom I needed to
move on, in addition to scientific guidance.
I also would like to sincerely thank my committee members, Professor Jef-
frey Albert, Professor David Aron, Professor Mark Schluchter, and Dr. Yifan
Xu. Professor Albert, my former advisor, and thesis committee chair, pro-
vided me with academic and financial support, which has been a considerable
contribution to completing this thesis. He has always been supportive and
understanding of any issues I had, academically or personally. He leads in-
sightful discussions and offered me valuable advice in improving this thesis.
Professor Aron, the director of the Interprofessional Improvement Research,
Education, and Clinical Center (IIRECC) at Louis Stokes Cleveland VA Med-
ical Center, provided me with financial support and clinical insight, which
made this thesis to gain significantly higher scientific value. He always had
an answer to any clinical question I had and gave me the freedom to focus
on my thesis during the critical times. Professor Schluchter provided excel-
lent statistical advice as I was developing this thesis. His valuable statistical
insights brought this thesis into the next level. Dr. Xu, an applied scien-
vii
tist at Amazon, exposed me to the most modern statistical approaches in
making this thesis applicable to the big-data infrastructures. His mental and
emotional support as a friend also has been a great help.
Staff from the Department of Population and Quantitative Health Sciences,
Nick Koziura for all his administrative support and Alberto Santana for his
technical support deserve a big thank you. I also would like to thank my
classmates and friends in the Department of Population and Quantitative
Health Sciences. Special thanks to Arielle Bloostein for her support in im-
proving the linguistics of this thesis.
Last, but not the least, I would like to thank my parents for their uncondi-
tional love and support. Sa-rang-hae-yo!
viii
Partial EM Procedure for Big-data Linear Mixed Effects Model, and Generalized PPE for High-dimensional Data in Julia
Abstract
by
JANG-IK CHO
Methodologically, this dissertation contributes to two areas in Statistics:
Linear mixed effects models for big data and Test of equal covariance for
high-dimensional data. Scientifically, this dissertation helps to comprehen-
sively evaluate the effect of the Specialty Care Access Network-Extension
for Community Healthcare Outcomes (SCAN-ECHO) training on primary
care providers at outpatient clinics in treating diabetes for the VA patient
population.
In the first part of this dissertation, we introduce three challenges and of-
fer solutions to each, in examining the effect of SCAN-ECHO training on
VA diabetic patients. The first challenge was data curation for longitudi-
nal variables. As a solution, we developed an R-function called “fusionr”
customized to our data structure for effective data curation. The second
challenge was measurement variability and heterogeneity of the population.
Different types of summary measures were used to reduce the variability of
the outcome. Longitudinal cluster analysis was conducted to identify similar
subgroups among the heterogeneous population. The third challenge was
fitting linear mixed effects model for big data that could not be imported
to R because the data exceeded the memory capacity. As a solution, we
ix
proposed a new modern approach to Big-data Linear Mixed Effects Model
(bLMM) using a Partial EM (PEM) algorithm and data partitioning. Our
PEM procedure was developed to analyze the effect of SCAN-ECHO train-
ing on diabetes treatment but this analytic approach is of interest by itself
(statistical contribution 1) because the PEM is a general procedure for
fitting LMM for big data. We evaluated the performance of bLMM PEM by
comparing PEM to the following three methods for fitting LMM: Broyden-
Fletcher-Goldfarb-Shanno (BFGS) algorithm using the entire data, full EM
using the entire data, and meta analysis using data partitions. Finally, for
implementation, we applied our PEM procedure to evaluate the effect of
SCAN-ECHO training for diabetes treatment.
In the second part of this dissertation, improvement in the optimization
algorithm for Projection Pursuit Ellipse (PPE) to test for equal variance in
high-dimensional data is introduced (statistical contribution 2). Many
standard multivariate techniques were developed based on the assumption
that the covariance matrices from different groups are equal. A well-known
test for the equality of covariance is the Bartletts test. However, the Bartletts
test is only a function of the volumes of covariance matrices, which does not
account for the shapes and orientations of the matrices. In this work we
developed a Projection Pursuit Ellipses procedure for high-dimensional data
(hPPE) and compared its performance to the Bartletts test and a modern
benchmark for high dimensional p data.
x
Part I
Big-data Linear Mixed Effects Model (bLMM)
1
Chapter 1
Introduction and Challenges
Electronic health records (EHR) and electronic medical records (EMR) have
grown massively in the last few decades. These big data have provided un-
precedented opportunities for medical research, while challenging statistics
beyond the capabilities of traditional statistical methods due to the large size
of data. For example, the need for developing our Big-data Linear Mixed Ef-
fects Model (bLMM) came from a project at the Veteran Affairs (VA) medical
center. The Cleveland VA hospital was funded with a multi-million dollar
project to carry out a Specialty Care Access Network-Extension for Commu-
nity Healthcare Outcomes (SCAN-ECHO) training to improve the quality of
treatment for Veterans since 2011 [8]. SCAN-ECHO training mimics Project
ECHO (Extension for Community Healthcare Outcomes), which started at
the University of New Mexico Health Sciences Center. SCAN-ECHO train-
ing is designed to train primary care providers (PCP) in outpatient clinics to
improve the quality of treatment for chronic disease or complex health con-
ditions such as diabetes, heart failure, chronic pain and vascular conditions
so that the quality of care at outpatient clinics is as good as that of the care
provided by specialists at the main hospital. We were involved in evaluat-
ing the effect of the SCAN-ECHO training on PCPs for diabetes treatment
during the period of 2011-2016. The study population consisted of patients
2
3
who were already diagnosed with (pre)diabetes in the greater Cleveland area
and the outcome of interest was the hemoglobin A1c level (i.e., A1c) of the
patients. Our goal was to evaluate whether SCAN-ECHO training helped
the PCPs to provide enhanced care to the diabetic patients so that the A1c
measures of patients were well controlled over time relative to the patients
treated by the PCPs who did not participate in the training.
According to the National Institute of Diabetes and Digestive and Kidney
Diseases, a division under NIH, A1c can be measured by the A1c test, which
provides information on the average blood glucose control for the past 3
months and the test provides evidence on how well the diabetes treatment
plan is working. The scale of A1c is reported in percent and an A1c level
below 5.7 percent is normal, between 5.7 and 6.4 is prediabetic and above
6.4 is diabetic [33]. Each patient had multiple A1c measurements over time,
but not necessarily with the same frequency. The covariates considered were
a patient’s age at baseline, gender, race, body mass index (BMI), systolic
blood pressure (SBP), diastolic blood pressure (DBP), medication, consults,
and diagnosis of cardiovascular disease (CVD) recorded as (yes or no). For
each patient, BMI, SBP, and DBP could be measured repeatedly over time,
different types of medications could be prescribed over time and consult ses-
sions could be held multiple times also over time. However, the repeatedly
measured variables are not necessarily measured on the same dates. The
original dataset had 38,173 patients and each of the variables measured re-
peatedly had 1 to 52 repetitions. The definitions and types of the variables
and the demographics of the study population are summarized in Table 1.1.
4
Continuous Variable Definition n Mean Std A1c A1c level 268,466 7.2 1.5 baseage Age at first A1c measure 38,173 64.2 12.3 BMI Body Mass Index 533,965 32.3 6.4 SBP Systolic Blood Pressure 533,965 130.2 15.8 DBP Diastolic Blood Pressure 533,965 74.98 9.1
Categorical Variable Definition Category Count % gender Gender Male 18,450 48.3
Female 614 1.6 (NA) 19,109 50.1
race Race Black 1,856 4.9 White 15,651 41.0 Other 1,447 3.8 (NA) 19,219 50.3
med On medication 1 (yes) 1,867 4.9 0 (no) 36,306 95.1
cons On consult 1 (yes) 1,002 2.6 0 (no) 37,171 97.4
cvdiag Diagnosed for CVD 1 (yes) 13,132 34.4 0 (no) 25,041 65.6
Table 1.1: Variable definition and demographics of the study population
Here ‘A1c’ is a continuous variable ranging from 3.1 to 19.4 with the mean of
7.2 and the standard deviation of 1.5. ‘med’ is a binary variable indicating
that a patient was prescribed with medication since a certain date and ‘cons’
is also a binary variable indicating that a patient had a consult session related
to his/her diabetes treatment since a certain date. Therefore, the ‘med’
and ‘cons’ variables will have a value of 1 from the date a patient was first
prescribed medication and the date they had a consult session. ‘cvdiag’ is
a binary variable and all of the dates since the patient was diagnosed with
CVD will have a value of 1.
Each variable is provided in a separate data table. Thus data management
including data reshaping, cleaning, integrating and organizing was essential
to facilitate any data analysis of this reach collection of data from the VA
5
system. With 38,173 patients and some longitudinal variables measured at
different dates, simple merging the data tables exceeded the memory capacity.
Therefore, finding a data curation strategy was our first challenge to continue
with the data analysis.
Figure 1.1: Profiles of A1c from all patients
Next, to understand the data and get an idea of which statistical approach
to use, we conducted exploratory data analysis (EDA) using a plot with the
trends of A1c from all of the patients as in Figure 1.1. For all patients, their
first date of A1c measurement was defined as time 0 and the number of days
to the next measures was calculated and used as the time variable. In Figure
1.1 the number of days is in the x-axis, the A1c level is in the y-axis, and
6
each line represents a trend of A1c for a patient.
Figure 1.2: Profiles of A1c from 80 random patients
Because the lines are overlapping, it was not easy to visually distinguish
the exact trend of each patient. We then provided the profile plot of 80
randomly selected patients from the population as given in Figure 1.2. We
were able to see that patients had a wide range of starting A1c values and
the trends of A1c level over time varied by patients. Some are increasing,
some decreasing and some are flat. This suggested that the population is
heterogeneous. Therefore, including all of the patients into one model could
obscure the trend because an increasing trend can compensate for a decreas-
ing trend when taken averages of the profiles. Also, measurement variability
was another concern. Variability in repeated measures within a patient could
be due to different clinicians, different testing methods, an input error, or
the patient’s other health conditions. It was not possible to trace back and
confirm the reason for the measurement variability for every single patient.
Because we did not have the information to determine whether the variability
is indeed real or an error, finding an approach not only to minimize but incor-
7
porate the measurement variability in fitting a model was important. Here
we face our second challenge measurement variability and heterogeneity of
the population.
An approach we took to reduce the measurement variability of A1c was to
use summary statistics such as mean, standard deviation, and coefficient of
variation of the A1c levels. Usually, it is advised to use a 3-months interval
when taking a summary measure of A1c, but because 95.7% of patients
had fewer than 2 measurements per year for some years in the duration
of the study, we’ve decided to use a one-year interval in our analysis. Past
contributions in the studies to evaluate diabetes using A1c focused mainly on
the mean of A1c [50] [49]. However, a mean can not measure the variation in
data. Therefore, to capture the variation, we also derived standard deviation
(SD) and coefficient of variation (CV) of the A1c level by years and used
them as outcomes. Based on each of these measures, cluster analysis was
conducted to identify the groups of patients with similar A1c trends over
time from the heterogeneous population. The results of the cluster analyses
suggested that the patients can be divided into 5, 3, and 3 clusters, for the
mean of A1c, the SD of A1c, and the CV of A1c, respectively. Details about
the results from the measures of the outcome and cluster analysis can be
found in Chapter 3.
Using the results from the cluster analysis, we continued to evaluate the effect
of SCAN-ECHO training by running EDA on the trends of A1c levels over
time for the patients in each of the subgroups. As an example, we show the
plots of trends of the SD of A1c over time by subgroups in Figures 1.3a, 1.3b,
and 1.3c.
Figures 1.3a, 1.3b, and 1.3c visually suggested that even within homogeneous
subgroups, the starting SD of A1c values and the trend of SD of A1c over
time across the patients vary. Therefore, based on this visual EDA, we fitted
two different LMM, one by subgroups and one with all patients using the
8
(a) Cluster A (b) Cluster B (c) Cluster C
Figure 1.3: Trends of SD of A1c from patients in subgroups
cluster as a covariate. Figures 1.3a, 1.3b, and 1.3c, also showed possible
linear, quadratic or even cubic time effects. Thus, we considered linear,
quadratic and cubic trends of the time along with the interactions between
linear/quadratic/cubic time effect and the intervention. The final model
described in Chapter 3 Section 3.2 is the best model selected based on the
model fit and the significance of the linear/quadratic/cubic time trends. The
identical strategy was used for fitting LMM on the mean of A1c and the CV
of the A1c.
As an alternative approach in examining the data, we also tried to fit LMM
on the raw A1c levels. This was to cross check with the results derived
from the summary measures to the ones from an LMM using the raw A1c
levels assuming that the measurement variability can be explained by random
effects. We attempted to conduct a longitudinal cluster analysis using the
raw A1c and time in days. However, because the time to A1c measurement
varied so much across the patients when the data were transformed to a wide
format for longitudinal cluster analysis, there were a lot of missing values.
The two main solutions to missing values when conducting cluster analysis
are to impute the missing values or to take summary measures over a range of
time. Since we already conducted the cluster analysis using the mean of A1c
per year, we decided to use the same subgroups suggested from the cluster
9
analysis result using the mean of A1c per year.
Based on the homogeneous subgroups, we tried to fit an LMM on the raw
A1c levels by the subgroups and also for all of the patients with the clusters
as a covariate. In order to fit a standard LMM, a single dataset with all
of the variables was required. However, even within a subgroup, deriving a
single dataset with all of the variables was not possible because it exceeded
the memory capacity. To continue with the analysis, finding an approach
for fitting an LMM for big data was our third challenge. As a solution, we
developed a new modern approach for Big-data Linear Mixed Effects Model
(bLMM) using a Partial Expectation and Maximization (PEM) procedure.
Our PEM procedure is of its interest itself as a statistical analytic procedure
which can be applied not only to our SCAN-ECHO data but also to other
data to fit bLMM.
In summary, this thesis provides our solutions to the following three main
challenges we faced when analyzing the data to find the effect of SCAN-
ECHO training on diabetes treatment in VA community:
1. Data curation for repeated measures
2. Measurement variability and heterogeneity in a study population
3. Fitting a ‘big-data’ linear mixed effects model
In Chapter 2, we will describe our data curation algorithm; in Chapter 3,
we will summarize the results from our cluster analysis and LMM analysis of
sub and whole groups; in Chapter 4, we will introduce our PEM procedure
for fitting bLMM; in Chapter 5, we will show the performance of our PEM
procedure for bLMM; and in Chapter 6, we will summarize the analysis result
using PEM procedure on the original A1c measurements.
Chapter 2
Data Curation
For big data, new ways of data curation are needed to add value to the
overflowing messy data [31]. Our study also required an algorithm of big
data curation. In our study, all of the variables were provided in separate
data tables with the repeated measures in a wide format table. Repeated
measures were not necessarily recorded on same dates, so data cleaning and
curation was an important task. Patient name was used as a key variable in
merging the data tables.
Using the least amount of memory was one of our biggest challenges for
data curation because R stores the imported data into the memory. So to
minimize the use of memory, our strategy was to merge and clean data tables
with repeated measures one at a time and then merge data tables with non-
repeated measures. When merging a data table with a repeated measure, the
date of the measurement is important for data cleaning. So, when merging
two data tables with repeated measures when dates do not match, the merged
table before data cleaning can be very large. Therefore, it is important to use
the least amount of data when merging data tables with repeated measures.
However, when merging non-repeated measures, the non repeated measure
will only add one additional column to the data matching to the key variable
(which was patient in our data). So merging non-repeated variable does not
10
11
require as much memory as to merging variables with repeated measures.
After two data tables were merged by patient, different strategies for repeated
measures and non-repeated measures were applied. Data curation was done
using R version 3.4.3 and our data curation was written as an R function
fusionr and followed the process depicted in the flowchart in Figure 2.1. In
this thesis, we will be using “variable” and “covariate” interchangeably.
Figure 2.1: Data curation scheme
The key variable used for data curation is A1c because it is the outcome
12
of interest. Therefore throughout the data curation process, all of the data
comparisons were made based on the dates A1c was measured. In Step 1,
the data curation function selected the data table with A1c and another data
table with one of the variables which were measured repeatedly. Because data
tables with repeated measures were provided in a wide format, in Step 2.1 we
first converted each data table from wide to long format and merged them
by patient name in Step 2.2. The merged dataset had two dates, one from
A1c and one from the other variable. In Steps 2.3 - 2.5, three steps of data
cleaning were conducted. In Step 2.3, the dates between A1c and the other
variable were compared. For a continuous covariate, if its date was within
3 months of the date A1c was measured, the value was kept otherwise it
was discarded. For a categorical variable, if the dates of the A1c measure
took place after the date of this categorical variable, the value of covariate
was copied for the later dates of A1c measures. In Step 2.4, if a continuous
variable was measured within 3 months of one another, we replace the values
and the dates within 3 months by their mean. This averaging step reduced
the measurement error and the data size without loosing much of the clinical
information. In Step 2.5, in order to keep only one date variable, the function
only kept the date from the A1c measure and deleted the dates from other
variables. Then the function selected another table with another variable
with repeated measures and iterated through Steps 2.1 - 2.5 until all of the
data tables with repeated covariates were included. In Step 3, data tables
with non-repeated measures were merged by patient name. After all of the
data tables were included, the function exported a single curated dataset for
statistical modeling and analysis. The overall scheme of data curation was
designed to minimize the memory usage needed for data storage. Our data
curation R function fusionr builds on R packages dplyr and tidyr and
other relevant functions in R default packages.
The basic syntax of fusionr function is
13
fusionr(outcome=list(y_table,y_name,y_date,y_repcol_start,y_repcol_end),
conticov=list(conticov1=list(conticov1_table,conticov1_name,
conticov1_date,conticov1_repcol_start,conticov1_repcol_end),
conticov2,...),
catcov = list(catcov1=list(catcov1_table,catcov1_name,
catcov1_date,catcov1_repcol_start,catcov1_repcol_end),
catcov2,...),key, length)
Argument Description Type
outcome y table: name of the data table that contains the list outcome variable, y name: name of the outcome variable, y date: dates of the outcome measurements, y repcol start: column number repeated measure starts, y repcol end: column number repeated measure ends
conticov conticov1 table: name of the data table that contains the list a continuous variable, conticov1 name: name of the continuous variable, conticov1 date: dates of the continuous measurements, conticov1 repcol start: column number repeated measure starts, conticov1 repcol end: column number repeated measure ends
catcov catcov1 table: name of the data table that contains the list a categorical variable, catcov1 name: name of the categorical variable, catcov1 date: dates of the categorical measurements, catcov1 repcol start: column number repeated measure starts, catcov1 repcol end: column number repeated measure ends
key name of the variable to merge the data tables character length time difference used for cleaning continuous variable in numeric
days
Table 2.1: Arguments for R-function fusionr
14
The outcome’s input is a list object. Its first list, y table, is the name of
the data frame that contains the outcome variable, the second list, y name,
is the name of the outcome variable, the third list, y date, is the date
the outcome is measured, the fourth list, y repcol start, is the column
number when the repeated outcome measure (or date) starts and the fifth
list, y repcol end, is the column number which the repeated outcome mea-
sure (or date) ends. conticov and catcov arguments have the same struc-
ture. conticov is a list object with its primary element being another list
which contains the information of the data frames for each of the continu-
ous variables. A continuous variable’s information is included in a list ob-
ject, conticov1. Inside the conticov1, the first list, conticov1 table, is
the name of the data frame that contains the continuous variable, the sec-
ond list, conticov1 name, is the name of the continuous variable, the third
list, conticov1 date, is the date the continuous variable is measured, the
fourth list, conticov1 repcol start, is the column number which the re-
peated measure or date starts and the fifth list, conticov1 repcol end, is
the column number which the repeated measure or date ends. For a second
continuous variable, the same argument is applied to conticov2. When all
of the list objects are defined for all of the continuous variable, conticov
will have conticov1,conticov2,... as its elements in each list. The identi-
cal argument definition is applied for catcov. key is a character which is
the name of the variable that will be used to merge all of the data tables.
length is numeric which is the number of days that will be used to clean
the continuous variables. If the date difference between a continuous variable
and the outcome variable is larger than length, the value of the continuous
variable will be defined as a missing value. Also, for each subject, if its con-
secutive repeated measures of a continuous covariate are within the length,
the function fusionr will calculate it’s mean value and replace the original
value with the mean value. Finally, since the data curation is based on the
15
outcome variable, the dates from the outcome variable will be kept and the
dates from the covariates will be deleted after the data cleaning is complete.
An example of fusionr is provided below. Because of the HIPPA privacy
issue, the example is presented using a simulated data which mimics the orig-
inal data structure. First, we will look at the data structure of an outcome
variable, a multivariate continuous covariate with repeated measurements
and a categorical covariate before their curation.
1. Data structure of the outcome variable: A1c
Patient.Name Last.4.SSN Date.1 Result.1 Date.2 Result.2
1 aaa 1357 3/9/15 6.9 7/9/15 7.3
2 bbb 2579 <NA> NA 7/27/15 11.2
3 ccc 9024 1/9/15 8.8 7/13/15 9.3
4 ddd 8501 <NA> NA <NA> NA
5 eee 4848 <NA> NA <NA> NA
6 rrr 2594 2/23/15 7.1 <NA> NA
7 ggg 9092 <NA> NA <NA> NA
Date.3 Result.3 Date.4 Result.4 Date.5 Result.5
1 <NA> NA <NA> NA 1/3/16 9.2
2 <NA> NA 11/4/15 10.8 <NA> NA
3 <NA> NA <NA> NA <NA> NA
4 9/23/15 12.2 <NA> NA 1/13/16 10.8
5 <NA> NA 12/18/15 9.3 <NA> NA
6 <NA> NA 12/12/15 7.7 <NA> NA
7 9/14/15 8.1 <NA> NA <NA> NA
Date.6 Result.6 Date.7 Result.7 Date.8 Result.8
1 <NA> NA 9/5/16 10.2 <NA> NA
2 <NA> NA <NA> NA <NA> NA
16
3 4/14/16 7.3 9/2/16 8.1 <NA> NA
4 <NA> NA <NA> NA <NA> NA
5 2/15/16 8.8 <NA> NA 12/23/16 9.1
6 <NA> NA 9/14/16 8.1 <NA> NA
7 <NA> NA <NA> NA 11/29/16 7.8
There are 7 unique patients and the data table is in the wide format.
2. Data structure of the continuous covariate: BMI, SBP and DBP
Patient.Name Last.4.SSN BMI.1 SBP.1 DBP.1 Date.1 BMI.2
1 bbb 2579 28.2 128 92 7/27/15 NA
2 ccc 9024 NA NA NA <NA> 32.3
3 ddd 8501 24.3 118 84 9/23/15 NA
4 eee 4848 NA NA NA <NA> NA
5 ggg 9092 NA NA NA <NA> 29.8
SBP.2 DBP.2 Date.2 BMI.3 SBP.3 DBP.3 Date.3 BMI.4
1 NA NA <NA> 27.3 132 93 11/4/15 NA
2 126 92 1/9/15 34.8 129 96 7/13/15 NA
3 NA NA <NA> NA NA NA <NA> NA
4 NA NA <NA> NA NA NA <NA> 25.6
5 128 82 9/14/15 NA NA NA <NA> 28.8
SBP.4 DBP.4 Date.4 BMI.5 SBP.5 DBP.5 Date.5 BMI.6
1 NA NA <NA> NA NA NA <NA> 29.7
2 NA NA <NA> 32.6 133 92 1/13/16 NA
3 NA NA <NA> NA NA NA <NA> NA
4 124 79 12/18/15 NA NA NA <NA> NA
5 126 79 12/13/15 NA NA NA <NA> NA
SBP.6 DBP.6 Date.6 BMI.7 SBP.7 DBP.7 Date.7
1 127 78 3/4/16 NA NA NA <NA>
17
2 NA NA <NA> NA NA NA <NA>
3 NA NA <NA> NA NA NA <NA>
4 NA NA <NA> 26.4 128 82 12/23/16
5 NA NA <NA> NA NA NA <NA>
There are 5 unique patients and the data table is in the wide format.
3. Data structure of the categorical covariate: medication
Patient.Name Last.4.SSN medication.1 Date.1
1 aaa 1357 Bromocriptine 5MG 3/13/12
2 aaa 1357 Canagliflozin 5MG 5/20/10
3 bbb 2579 Metformin 5MG 10/21/11
4 ccc 9024 Bromocriptine 5MG 3/13/12
5 ddd 8501 Metformin 5MG 12/16/13
6 eee 4848 Metformin 5MG 3/12/12
7 rrr 2594 Metformin 5MG 7/20/12
8 ggg 9092 Bromocriptine 5MG 11/3/09
medication.2 Date.2 medication.3 Date.3
1 Canagliflozin 5MG 3/13/12 <NA> <NA>
2 <NA> <NA> <NA> <NA>
3 Canagliflozin 5MG 1/16/13 Metformin 10MG 3/20/13
4 <NA> <NA> Metformin 5MG 7/7/14
5 <NA> <NA> <NA> <NA>
6 Metformin 5MG 5/1/12 <NA> <NA>
7 <NA> <NA> <NA> <NA>
8 Metformin 5MG 2/5/10 Metformin 10MG 2/7/13
The categorical variable medication was recorded multiple times at various
time points.
18
The fusionr function merges and cleans data tables of the outcome vari-
able with the continuous covariates first, and then merges into categorical
variables one by one as given in Figure 2.1.The current version of fusionr
dichotomizes categorical variables. If there is a measure before a date of the
outcome, the categorical variable is defined as 1 otherwise 0. Here is an R
example script.
a1cdata = read.csv("outcome.csv")
healthdata = read.csv("health.csv")
meddata = read.csv("medication.csv")
outcome = list(a1cdata,"Result","Date",3,18)
conticov1 = list(healthdata,"BMI","SBP","DBP","Date",3,30)
conticov = list(conticov1)
catcov1 = list(meddata,"Medication","Date",3,8)
catcov = list(catcov1)
key = "Patient.Name"
length = 90
cleandata = fusionr(outcome, conticov, catcov, key, length)
> cleandata
Patient.Name Last.4.SSN Date Result BMI DBP SBP catcov1
ccc 9092 2015-09-14 8.1 NA NA NA 1
ccc 9092 2016-11-29 7.8 29.8 82 128 1
ccc 9092 2016-11-29 7.8 28.8 79 126 1
aaa 1357 2015-07-09 7.3 NA NA NA 1
aaa 1357 2016-01-03 9.2 NA NA NA 1
aaa 1357 2016-09-05 10.2 NA NA NA 1
aaa 1357 2015-03-09 6.9 NA NA NA 1
19
eee 4848 2015-12-18 9.3 NA NA NA 1
eee 4848 2015-12-18 9.3 26.4 82 128 1
eee 4848 2016-02-15 8.8 NA NA NA 1
eee 4848 2016-02-15 8.8 26.4 82 128 1
eee 4848 2016-12-23 9.1 25.6 79 124 1
eee 4848 2016-12-23 9.1 NA NA NA 1
ccc 9024 2015-07-13 9.3 32.3 92 126 1
ccc 9024 2015-07-13 9.3 NA NA NA 1
ccc 9024 2015-07-13 9.3 32.6 92 133 1
ccc 9024 2016-04-14 7.3 32.3 92 126 1
ccc 9024 2016-04-14 7.3 34.8 96 129 1
ccc 9024 2016-04-14 7.3 32.6 92 133 1
ccc 9024 2016-09-02 8.1 32.3 92 126 1
ccc 9024 2016-09-02 8.1 34.8 96 129 1
ccc 9024 2016-09-02 8.1 32.6 92 133 1
ccc 9024 2015-01-09 8.8 NA NA NA 1
ccc 9024 2015-01-09 8.8 34.8 96 129 1
ccc 9024 2015-01-09 8.8 32.6 92 133 1
ddd 8501 2015-09-23 12.2 NA NA NA 1
ddd 8501 2016-01-13 10.8 24.3 84 118 1
bbb 2579 2015-07-27 11.2 27.3 93 132 1
bbb 2579 2015-07-27 11.2 29.7 78 127 1
bbb 2579 2015-07-27 11.2 NA NA NA 1
bbb 2579 2015-11-04 10.8 NA NA NA 1
bbb 2579 2015-11-04 10.8 29.7 78 127 1
bbb 2579 2015-11-04 10.8 28.2 92 128 1
rrr 2594 2015-12-12 7.7 NA NA NA 1
rrr 2594 2016-09-14 8.1 NA NA NA 1
rrr 2594 2015-02-23 7.1 NA NA NA 1
20
fusionr is applicable to conducting data curation for data with an identical
structure to ours. A user only needs to specify the path where data tables are
stored, the names of the data tables, response variable, continuous variables,
categorical variables and covariates with repeated measures to get a single
dataset ready for statistical analysis. This minimal specification requirement
is custom friendly to researchers who may be unfamiliar with R. However,
the current version of fusionr is limited to a specific data structure identical
to that of our case. Generalization of our data curation method for repeated-
measure big data can be one of our potential future development.
Chapter 3
Measurement Variability and Heterogeneity of the Population
3.1 Measurement Variability
Measurement variability is another challenge to our longitudinal data anal-
ysis. Variation among repeated measures in a short-time interval can result
from a random error or is important information to be explained and consid-
ered in the model. However, finding the nature and cause of variability can
be difficult in reality especially for data recorded by many different people in
the past such as in EHR and EMR. Therefore, when the variability is diffi-
cult to understand or reflects the natural variation of a measurement system,
incorporating or reducing it is essential to improve the quality of the data
for statistical analysis. Quantile regression [10, 52] is a recent method to
incorporate some type of measurement variability within a statistical model.
Another recent approach use the variable itself [30] to reduce the measure-
ment variability. For our date, reducing the measurement variability directly
using the variable was more natural.
The repeated measures in our data are A1c, BMI, SBP, DBP, medication,
21
22
and consults. Medication and consults are categorical and have minimal mea-
surement variability because it is recorded whenever there was a medication
prescribed and a consult session held. However, for the continuous variables,
A1c, BMI, SBP, DBP, their measurement variability can be an issue. For
BMI, SBP, and DBP, we calculated the mean of each variable measured
within the 3 months to reduce their measurement variability. Based on the
guideline provided by NIH [33], it is advised that A1c measures within 3
months are expected to be similar. So to reduce the measurement variability
of A1c, we calculated the summary measures of A1c by a quarter of a year.
However, since many patients had less than 3 observations per year, if the
summary measure of A1c was taken every 3 months, we would observe many
missing values of the summary measures. Table 3.1 shows the frequencies of
number of A1c measures per year for a total of m = 145, 200 ≤ n× 6 years,
where n = 38, 173 is the number of subjects. m is less than n × 6 because
some subjects had been observed for less than 6 years. The first column indi-
cates that there were 69,461 years with only 1 A1c value, the second column
indicates that there were 54,168 years with only 2 A1c values and so on until
15 which was the maximum number of A1c count per year. In summary,
out of 145,200 years in the study, 123,629 (which is 85.1%) years had 1 or 2
counts.
# of A1c count / yr 1 2 3 4 5 6 7 # of years 69,461 54,168 8,488 9,027 334 1,679 52
# of A1c count / yr 8 9 10 12 14 15 # of years 1,627 132 2 228 1 1
Table 3.1: Number of years for number of A1c counts per year
Thus, for A1c, we decided to derive summary measures by year for each
patient. The measures we used were mean, standard deviation (SD) and
coefficient of variation (CV), where CV = SDmean. The summary statistics of
23
the mean, SD and CV of the A1c are provided in Table 3.2.
n mean sd Mean of A1c/yr 6 7.09 1.43
SD of A1c/yr 6 0.3 0.5 CV of A1/yrc 6 0.04 0.06
Table 3.2: Descriptive statistics of the summary measures of A1c
3.2 Heterogeneity of the Population
In addition to the measurement variability problem, heterogeneity of the
population was another issue. Identifying homogeneous subgroups of the
patients in terms of on their trends of A1c over time is important in finding
the pure effect of SCAN-ECHO training on A1c (see Figure 1.1). Because
we did not have any information on the distribution or on the trends of
the A1c, we decided to use longitudinal cluster analysis (LCA) to our data.
LCA is a data-driven unsupervised learning method. Our LCA was done
on the three measures: mean, SD, and CV of the A1c per year by patient
using the R-package kml [15, 16]. kml is K-means clustering generalized
for longitudinal data. We used serial correlation as the distance function in
our cluster analysis and chose the number of clusters to be 2 to 6. The best
number of clusters was chosen based on the five decision criteria provided by
the kml package. The most widely used criteria among the five criteria are
three different variations of Calinski & Harabatz criteria C(k) [5]. However,
C(k) does not always provide the correct number of clusters [43] and the
three C(k) variations may not provide the same number of optimal clusters.
Therefore, Ray & Turi [40] and Davies & Bouldin [7] criteria are used to
cross-check the candidate optimal numbers of clusters. The outcome values
of the 5 decision criteria are shown in Figure 3.1 where the X-axis represents
the number of clusters and the Y-axis represents the decision criteria values.
24
For all 5 decision criteria, a high value is favorable.
Figure 3.1: Clustering decision criterions for the mean of A1c
Thus, for the means of A1c, the optimal number of clusters is 3 by Calinski
& Harabatz criterion, Ray & Turi criterion, and Davies & Bouldin criterion
and is 5 by two other variations of Calinski & Harabatz criteria. When using
the majority rule, the optimal number of clusters to be used is 3. However,
when subgrouping the data into 3 clusters, the profile of each cluster was
parallel over time with low, medium and high mean of A1c whereas, with 5
clusters, we were able to see not only the different magnitude of the mean
of A1c by clusters but also different profiles which were parallel, increasing
and decreasing by clusters. Therefore, considering the statistical criteria and
the profiles of the clusters, we decided that 5 clusters are the most sensible
number of clusters for our data. Figure 3.2 shows the profiles of the 5 clusters
where the X-axis represents the time in years and the Y-axis represents the
mean of A1c values. Further, Cluster A consists of patients with a flat mean
25
Figure 3.2: Cluster analysis result for the mean of A1c
26
A1c around 5.5 over time, Cluster B consists of patients with a flat mean A1c
around 7 over time, Cluster C consists of patients with an increasing A1c over
time, starting around 7.5, Cluster D consists of patients with a decreasing
A1c over time, starting around 9, and Cluster E consists of patients with a
flat mean of A1c around 10 over time. Thus, Cluster A has patients with
the healthiest A1c levels in the range of normal health status, Cluster B has
patients with stable A1c over time but in the rage of diabetes, Cluster C
has patients with diabetic condition getting worse over time, Cluster D has
patients with diabetic conditions getting better over time, and Cluster E has
patients with stable A1c but in worst diabetic condition. For each of the 5
subgroups, we fitted an LMM with a random intercept α1i and random slope
of time α2i, as in (3.1) at time = 1,2,3,4,5,6 and a stepwise model selection
was conducted using the AIC values.
mean(A1c)i = scan + time + baseage + gender + race + BMI + SBP
+ DBP + med + cons + cvdiag + α1i + α2itimei (3.1)
where i indicates the ith patient.
The estimated coefficients of the fixed effects in the LMM by subgroups after
the model selection are given in Table 3.3.
Cluster A Cluster B Cluster C Estimate p-value Estimate p-value Estimate p-value
(Intercept) 4.75300 < 2e-16 5.62500 < 2e-16 6.91000 < 2e-16 scan1 -0.02438 0.00428 -0.00376 0.61718 0.00626 0.53800 time -0.00094 0.70591 0.04711 < 2e-16 0.09047 < 2e-16
baseage 0.00424 1.31E-11 0.00374 1.06E-15 -0.00002 0.97600 GenderM 0.13050 4.23E-05 0.09358 0.00094 0.00598 0.86200
raceOTHER -0.01131 0.59304 raceWHITE -0.04238 0.00559
SBP DBP -0.00005 0.86150 0.00027 0.32593 0.00188 2.90E-06
27
BMI 0.02140 < 2e-16 0.01571 < 2e-16 0.00760 < 2e-16 med1 0.07683 4.03E-16 0.07517 2.12E-13 cons1 0.05294 0.00543 0.09538 9.54E-09 0.18250 3.76E-15
cvdiag1 -0.03412 0.02235
Cluster D Cluster E Estimate p-value Estimate p-value
(Intercept) 8.40700 < 2e-16 11.99000 < 2e-16 scan1 -0.00060 0.97304 0.16330 0.00904 time 0.08571 < 2e-16 0.05646 0.01355
baseage -0.00302 0.00258 -0.00818 0.02475 GenderM 0.04579 0.39854 -0.47720 0.00790
raceOTHER -0.17340 0.27856 raceWHITE -0.20480 0.02127
SBP -0.00153 0.00070 -0.00444 0.00293 DBP 0.00487 5.03E-09 0.00805 0.00368 BMI -0.02289 9.54E-06
med1 0.05442 0.00192 cons1 0.34780 <2e-16 0.50130 4.64E-07
cvdiag1
Table 3.3: Estimated coefficients of the fixed effects in the LMM for mean of A1c by clusters
The estimates for variance of random intercept, variance of random slope,
correlation between random intercept and random slope for Clusters A, B, C,
D, and E were (0.13708,0.01086,-0.61), (0.21999,0.02906,-0.82), (0.40391,0.05631,-
0.90), (0.73480,0.11150,-0.90), and (1.52790,0.21560,-0.83) respectively. For
all five clusters, the random intercept and random slope had a negative cor-
relation. SCAN-ECHO training had a significant non-zero effect in Clusters
A (p-value = 0.00428) and in Cluster E (p-value = 0.00904). In Cluster A,
SCAN-ECHO training decreased the mean of A1c level by 0.02438 and in
Cluster E, SCAN-ECHO training increased the mean of A1c level by 0.1633.
The statistical significance of these two subgroups explains that for diabetic
patients who are in reasonable good health, the SCAN-ECHO training helped
them to slightly decrease their A1c level to maintain their A1c levels at nor-
28
mal. However, for patients with abnormally high A1c level, SCAN-ECHO
training does not help to control the A1c level of a patient compared to the
patients seeing non-trained PCPs. For these patients, we might recommend
visiting the specialists in the main hospital. For Clusters B,C, and D, SCAN-
ECHO training did not show a statistically significant effect on decreasing
the A1c level. As an alternative modeling approach, we fitted an LMM (3.2)
for all patients with clusters as a covariate. In this model, we also considered
the interaction between SCAN-ECHO training and the clusters.
mean(A1c)i = scan + time + cluster + baseage + gender + race + BMI
+ SBP + DBP + med + cons + cvdiag + scan : cluster
+ α1i + α2itimei (3.2)
where i indicates the ith patient.
The estimated coefficients of the fixed effects in the LMM for all of the
patients after the model selection are given in Table 3.4.
Covariate Estimate p-value Covariate Estimate p-value (Intercept) 5.29800 < 2e-16 GenderM 0.03287 0.08016
scan1 -0.02491 0.10696 SBP -0.00055 0.00016 time 0.05892 < 2e-16 DBP 0.00195 2.10E-13
clusterB 0.76210 < 2e-16 BMI 0.00751 < 2e-16 clusterC 1.70500 < 2e-16 med1 0.08041 < 2e-16 clusterD 2.81200 < 2e-16 cons1 0.20800 < 2e-16 clusterE 4.48600 < 2e-16 scan1:clusterB 0.01367 0.45944 baseage 0.00073 0.03208 scan1:clusterC 0.03694 0.04788
raceOTHER -0.01634 0.27063 scan1:clusterD 0.02691 0.18873 raceWHITE -0.03269 0.00183 scan1:clusterE 0.18950 2.24E-08
Table 3.4: Estimated coefficients of the fixed effects in the LMM for mean of A1c from all patients
The variance of random intercept, the variance of random slope, the corre-
29
lation between random intercept and the random slope estimated from the
LMM 0.4025, 0.0573, -0.87. The results show that on average, SCAN-ECHO
training does not have a statistically significant effect on the mean of A1c
over time, while it is a helpful as a maintenance for relatively healthy group of
patients. In addition, significant variables in the model were clusters, time,
race, SBP, DBP, BMI, medication and consults meaning that the clusters
were very well identified, the mean of A1c gradually increases by 0.05892
every year, and white patients have a -0.03269 smaller mean of A1c relative
to the black patients. Medication and consults increased the mean of A1c
which is a remaining work to be done, to find out whether the cause of the
increase is directly from medication or consults or whether the increases is
coming from the characteristics of the patient population who were on med-
ication or had consults.
The decision criteria are shown in Figure 3.3 and the result of cluster analysis
on the SD of A1c is shown Figure 3.4. In Figure 3.3, the X-axis represents the
number of clusters and the Y-axis represents the decision criteria values. In
Figure 3.4, the X-axis represents the time in years and the Y-axis represents
the SD of A1c values.
30
Figure 3.3: Clustering decision criterions for the SD of A1c
For the SDs of A1c, the optimal number of clusters is 3 based on 4 out of 5
criteria as in Figure 3.3. In this analysis, both the majority rule among the
decision criteria and the information provided from the profiles of the clusters
both agreed to use 3 clusters to subgroup the patient population. Figure 3.4
shows the profiles of the 3 clusters where Cluster A consists of patients with
a flat SD of A1c around 0.2 over time, Cluster B consists of patients with
increasing SD of A1c at the beginning and gradually decreases over time and
Cluster C consists of patients with high SD of A1c at the beginning with
a steep decrease in the first year but stabilizes over time. Thus, Cluster A
has patients with almost no variance over time, Cluster B has patients with
the highest variance of A1c over time and Cluster C has patients who had
a stabilized variance of A1c over time. For each of the 3 subgroups, as in
(3.3), we fitted an LMM with a random intercept α1i and random slope of
time α2i. In Figure 3.4, we were able to identify quadratic profiles of SD of
A1c over time by the patient. Therefore, in our LMM, we included a fixed
31
Figure 3.4: Cluster analysis result for the SD of A1c
32
quadratic time effect at time = 1,2,3,4,5,6 and a stepwise model selection
was conducted using the AIC values. For these models, measures with SD =
0 were excluded because they were treated as missing values.
SD(A1c)i = scan + time + time 2 + baseage + gender + race + SBP + DBP
+ BMI + med + cons + cvdiag + α1i + α2itimei (3.3)
where i indicates the ith patient.
The estimated coefficients of the fixed effects in the LMM by subgroups after
the model selection are given in Table 3.5.
Cluster A Cluster B Cluster C Estimate p-value Estimate p-value Estimate p-value
(Intercept) 0.34370 < 2e-16 1.51800 < 2e-16 2.40400 < 2e-16 scan1 -0.01312 < 1e-05 -0.02514 0.11216 -0.02581 0.35909 time 0.01050 < 1e-15 0.23710 < 2e-16 -0.66680 < 2e-16
time2 -0.03798 < 2e-16 0.10510 < 2e-16 baseage -0.00202 < 2e-16 -0.00755 < 1e-11 -0.00865 < 1e-05
GenderM 0.03838 0.00413 0.15220 0.01743 0.01252 0.90251 raceOTHER -0.00693 0.50031 -0.09996 0.03352 -0.19770 0.01017 raceWHITE -0.01402 0.06422 -0.11380 0.00008 -0.16470 0.00055
SBP -0.00099 0.00239 DBP 0.00021 0.06153 0.00082 0.43693 BMI 0.00082 0.00999 -0.00864 < 1e-08 -0.00742 0.00139
med1 0.04744 < 2e-16 0.10550 < 1e-09 0.03414 0.23413 cons1 0.03583 < 1e-07 0.12290 0.00022 0.16510 0.00563
cvdiag1 0.01097 0.02367
Table 3.5: Estimated coefficients of the fixed effects in the LMM for SD of A1c by clusters
The estimates for variance of random intercept, variance of random slope,
correlation between random intercept and random slope for Clusters A, B,
and C were (0.07791,0.01002,-0.78), (0.74262,0.09083,-0.85), and (0.43121,0.04818,-
33
0.82) respectively. For all three clusters, the random intercept and random
slope had negative correlation. SCAN-ECHO training had significantly de-
creased SD of A1c in Cluster A (p-value = < 1e-05) by 0.01312 whereas
for Clusters B and C, SCAN-ECHO training was not statistically significant.
Cluster A consists of patients with low SD of A1c value throughout the study
period. So, given our data, SCAN-ECHO training helps to reduce the SD of
A1c for patients who have relatively stable A1c values over time compared to
the patients who have more variability in their A1c measurements over time.
As an alternative modeling approach, we fitted an LMM for all patients with
clusters as a covariate as in (3.4) and conducted stepwise model selection.
SD(A1c)i = scan + time + time 2 + cluster + baseage + gender + race
+ SBP + DBP + BMI + med + cons + cvdiag + scan : cluster
+ α1it + α2ittimeit (3.4)
where i indicates the ith patient.
The estimated coefficients of the fixed effects in the LMM for all of the
patients after the model selection are given in Table 3.6.
Covariate Estimate t-value Covariate Estimate t-value (Intercept) 0.62310 < 2e-16 raceOTHER -0.04972 0.00029
scan1 -0.01309 0.02483 raceWHITE -0.05762 < 1e-08 time 0.00133 0.72839 BMI -0.00235 < 1e-06
time2 -0.00041 0.53513 med1 0.06468 < 2e-16 clusterB 0.64340 < 2e-16 cons1 0.08558 < 1e-13 clusterC 0.54700 < 2e-16 scan1:clusterB 0.00823 0.46704 baseage -0.00389 < 2e-16 scan1:clusterC -0.11040 < 1e-07
GenderM 0.06893 0.00016
Table 3.6: Estimated coefficients of the fixed effects in the LMM for SD of A1c from all patients
34
The variance of random intercept, variance of random slope, correlation be-
tween random intercept and random slope estimated from the LMM 0.24572,
0.03162, -0.85. The result showed that overall, SCAN-ECHO training has
a statistically significant effect on the SD of A1c over time reducing the
variation by 0.01309 (p-value = 0.02483). In addition, in Cluster C, SCAN-
ECHO training reduced the variability significantly more by 0.1104 (p-value
< 1e-07) which was our new finding in fitting LMM with the entire patients
compared to the result from our LMMs by clusters.
The decision criteria are shown in Figure 3.5 and the result of cluster analysis
on the CV of A1c is shown Figure 3.6. In Figure 3.5, the X-axis represents the
number of clusters and the Y-axis represents the decision criteria values. In
Figure 3.6, the X-axis represents the time in years and the Y-axis represents
the SD of A1c values.
Figure 3.5: Clustering decision criterions for the CV of A1c
For the CVs of A1c, the optimal number of clusters is 3 based on 4 out of 5
35
Figure 3.6: Cluster analysis result for the CV of A1c
36
criteria as in Figure 3.5. In this analysis, both the majority rule among the
decision criteria and information provided from the profiles of the clusters
both agreed to use 3 clusters to subgroup the patient population. Figure 3.6
shows the profiles of the 3 clusters where Cluster A consists of patients with
a flat CV of A1c around 0.01 over time, Cluster B consists of patients with
increasing CV of A1c at the beginning and gradually decreases over time
and Cluster C consists of patients with high CV of A1c at the beginning
with a steep decrease in the first year but stabilizes over time. Thus, Cluster
A has patients with almost no variance over time, Cluster B has patients
with the highest variance of A1c over time and Cluster C has patients who
stabilized the variance of A1c over time. For each of the 3 subgroups, as in
(3.5), we fitted a LMM with a random intercept α1i and random slope of
time α2i. In Figure 3.6, we were able to identify quadratic profiles of CV of
A1c over time by the patient. Therefore, in our LMM, we included a fixed
quadratic time effect at time = 1,2,3,4,5,6 and a stepwise model selection
was conducted using the AIC values. For these models, measures with CV
= 0 were excluded because they were treated as missing values.
CV (A1c)i = scan + time + time 2 + baseage + gender + race + SBP + DBP
+ BMI + med + cons + cvdiag + α1i + α2itimei (3.5)
where i indicates the ith patient.
The estimated coefficients of the fixed effects in the LMM by subgroups after
the model selection are given in Table 3.7.
Cluster A Cluster B Cluster C Estimate p-value Estimate p-value Estimate p-value
(Intercept) 0.04438 < 2e-16 0.15770 < 2e-16 0.24720 < 2e-16 scan1 -0.00175 < 1e-05 -0.00293 0.05770 -0.00249 0.29068 time 0.00209 < 1e-11 0.02703 < 2e-16 -0.06196 < 2e-16
37
time2 -0.00009 0.05775 -0.00447 < 2e-16 0.00986 < 2e-16 baseage -0.00019 < 1e-12 -0.00072 < 1e-11 -0.00101 < 1e-09
GenderM 0.00361 0.02372 0.01701 0.00519 -0.01468 0.02645 raceOTHER -0.00236 0.05662 -0.01067 0.01751 -0.01265 0.00414 raceWHITE -0.00284 0.00173 -0.00971 0.00059
SBP 0.00002 0.00490 -0.00011 0.00106 DBP BMI -0.00082 < 1e-08 -0.00055 0.00921
med1 0.00448 < 2e-16 0.01031 < 1e-09 0.00542 0.02760 cons1 0.00342 0.00006 0.01261 0.00014 0.01660 0.00057
cvdiag1 0.00176 0.00253 -0.00486 0.04581
Table 3.7: Estimated coefficients of the fixed effects in the LMM for CV of A1c by clusters
The estimates for variance of random intercept, variance of random slope,
correlation between random intercept and random slope for Clusters A, B,
and C were (0.0011,0.0002,-0.80), (0.0072,0.0008,-0.84), and (0.0041,0.0005,-
0.78) respectively. For all three clusters, the random intercept and random
slope had a negative correlation. SCAN-ECHO training had significantly
decreased CV of A1c in Cluster A (p-value = < 1e-05) by 0.00175 whereas
for Clusters B and C, SCAN-ECHO training was not statistically significant.
Cluster A consists of patients with low SD of A1c value throughout the study
period. So, given our data, SCAN-ECHO training helps to reduce the SD of
A1c for patients who have relatively stable A1c values over time compared
to the patients who have more variability in their A1c measurements over
time. Also, DBP was not a significant factor in all three clusters in the LMM
for the CV of A1c whereas, the fixed quadratic time trend was statistically
significant in all three clusters. As an alternative modeling approach, we
fitted na LMM for all patients with clusters as a covariate as in (3.6) and
conducted stepwise model selection.
38
CV (A1c)i = scan + time + time 2 + cluster + baseage + gender + race
+ BMI + SBP + DBP + med + cons + cvdiag + scan : cluster
+ α1it + α2ittimeit (3.6)
where i indicates the ith patient.
The estimated coefficients of the fixed effects in the LMM for all of the
patients after the model selection are given in Table 3.8.
Covariate Estimate t-value Covariate Estimate t-value (Intercept) 0.07440 < 2e-16 raceOTHER -0.00598 0.00009
scan1 -0.00167 0.01516 raceWHITE -0.00605 < 1e-07 time 0.00056 0.19304 BMI -0.00027 <1e-07
time2 -0.00015 0.04835 med1 0.00657 < 2e-16 clusterB 0.06581 < 2e-16 cons1 0.00912 < 1e-12 clusterC 0.05018 < 2e-16 scan1:clusterB 0.00117 0.34248 baseage -0.00042 < 2e-16 scan1:clusterC -0.00884 < 1e-05
GenderM 0.00721 0.00040
Table 3.8: Estimated coefficients of the fixed effects in the LMM for CV of A1c from all patients
The variance of random intercept, variance of random slope, correlation be-
tween random intercept and random slope estimated from the LMM 0.0030,
0.0004, -0.85. The result showed that overall, SCAN-ECHO training has a
statistically significant effect on the SD of A1c over time reducing the varia-
tion by 0.00167 (p-value = 0.01516). In addition, in Cluster C, SCAN-ECHO
training reduced the variability significantly more by 0.0088 (p-value < 1e-05)
which was our new finding in fitting LMM with the entire patients compared
to the result from our LMMs by clusters.
Based on the 3 summary measures, the LMM analysis suggests that SCAN-
ECHO training statistically help reduce the mean, SD and CV of the A1c for
39
relatively healthy patients. However, for patients in poor health conditions,
SCAN-ECHO training does not help to improve their diabetic status.
To examine the data from an alternative point, we decided to also fit an
LMM using the raw data assuming that the mixed effects will incorporate
the measurement variability. In fitting the overall LMM, the identical sub-
groups from the cluster analysis using the mean of A1c were considered. So,
we tried to fit an LMM for each of the subgroups and for all patients with
the subgroups as a covariate. However, the raw data, even within a sub-
group, were too big to be imported to R. As a solution to this issue, we
developed a method to fit bLMM using a partial expectation-maximization
(PEM) approach. The details will be explained in the next Chapter.
Chapter 4
Big-data Linear Mixed Effects Model (bLMM)
4.1 Introduction to bLMM
Big-data linear mixed effects model (bLMM) is a modern extension of LMM
to fit a LMM to big data. As introduced in Chapter 1, the need for bLMM
arose in our analysis of electronic health records of VA patients to find the
effect of SCAN-ECHO training on PCPs in treating diabetes. Figure 1.1
presents the profiles of a diabetes outcome measure, A1c levels over time for
the entire patient population. From the plot, we see that both the starting
A1c values and the trends of A1c over time have large variability by patient.
Thus, we chose to fit a LMM to accommodate the variation by patients and
find the effect of SCAN-ECHO training on A1c levels. However, we were not
able to fit a LMM using a traditional model fitting approach because the full
data were too big to be imported into R. We then tried to conduct longitu-
dinal K-means cluster analysis to find subgroups using A1c levels recorded
on the absolute length of time in days from the starting days. It again was
not possible because the data exceeded the memory capacity. Our proposed
procedure was developed to fit a linear mixed effects model for big-data when
40
41
the data are too big or the analysis could not be done because of memory
limits.
Recent strategies for big-data LMM include a method of moment estimation
(Gao and Owen [13]), random effects meta regression (REMR) (Gebregziab-
her et. al [14]),(Jackson et al. [21]), semiparametric 2-step estimation (Pen-
nell and Dunson [36]), and sparse matrix technique (Johnson and Thompson
[24]), (SAS: Proc HPMIXED [17]). Among these methods, random effects
meta regression and semiparametric 2-step estimation are relevant to our
goal of finding a solution to a data set that exceeds the current memory
capacity. Random effects meta regression independently fits LMM to the
partitions of the entire data and merge them using random effects regression
model. However, when using partitions, it is important to develop an effec-
tive fitting algorithm that incorporates interactive informations of different
partitions in addition to using the parameter estimates from each partition to
summarize the final estimates. Semiparametric 2-step estimation is a data di-
mension reduction strategy rather than fitting a LMM using the entire data.
Therefore, to fulfill our needs, we developed a method to fit big-data linear
mixed effects model using a Partial Expectation-Maximization (PEM)
algorithm based on random partitions of the data. Our PEM procedure was
developed primarily to solve the big data issue and fit a longitudinal LMM for
our SCAN-ECHO project, but the method can be extended to other types of
data such as live-feeding data and to high-performance cluster environments
when fitting LMM.
Next, we introduce our procedure at a glance. Recall that our data were pro-
vided in separate tables, one for each of the variable, and data tables were
linked by patient names. When including all of the patients for data curation,
the data was too big to be imported into R. Therefore, as shown in Figure
4.1, we first divided patients into K groups. Next we selected a group and
conducted its data curation to get one complete dataset to fit a LMM. We
42
saved the parameter estimates from fitting LMM and additional summary
information (specified in details later) about the first group to pass over to
the analysis when adding a second group. We then fed the parameters esti-
mates and summary information to the second group to get the accumulated
parameter estimates and summary information. This procedure continued
until all of the patients were included in the analysis which was the final
parameter estimate of the LMM.
Figure 4.1: Overall flow of bLMM PEM algorithm
bLMM is a modern extension of LMM to fit a LMM for big data and our
procedure is a deviation from the general Expectation and Maximization
(EM) algorithm. Thus, knowing LMM and EM algorithm is very important
to understand our PEM procedure. Therefore, in Section 4.2 we will review
LMM, focusing on the derivation of the parameter estimates and introducing
available statistical packages. In Section 4.3, we will review the general E-
43
M algorithm. In Section 4.4, we will describe in detail our proposed PEM
procedure. In Chapter 5, we will show the performance of our PEM procedure
via a simulation study. In Chapter 6, we will show the results from evaluating
SCAN-ECHO training using our PEM procedure.
4.2 Linear Mixed Effects Model (LMM)
A linear regression model is generally used to find a linear relationship be-
tween an outcome and factors of interest. A linear regression model can be
expressed as
Yn×1 = Xn×pβp×1 + �n×1 (4.1)
where Y is a vector of an outcome variable measured at n times, X is a
n × p matrix of p covariates measured at n times, β is a p − dimensional
vector of unknown regression coefficients, and � is a n−dimensional vector
of unknown random errors. In this model, β indicates the effect size of
the variables of interest on the outcome. These β values are fixed for the
entire study population, meaning that the effect size of a variable is same
for all of the subjects involved in the study. However, the outcome may be
correlated within a group or a subject, and the effect of a variable can be
different depending on a specific group or a subject. For example, in medical
research, when patients have repeated tests (or measures), it is possible that
those measures will be correlated in a given patient, especially if the measures
are taken within a short time interval. In this case, the mean and the effect
of some covariates will likely differ by patient. The understanding of the
differences between patients and/or correlation of measures for one patient
can be represented using random effects in a model. A linear model with
both fixed and random effects is called linear mixed effects model (LMM)
and can be generally expressed as
Yn×1 = Xn×pβp×1 + Zn×qαq×1 + �n×1 (4.2)
44
where Y is a vector of an outcome variable measured at n times, X is a
n × p matrix of p covariates measured at n times, β is a p − dimensional
vector of unknown regression coefficients, Z is a n× q known matrix, α is a
q − dimensional vector of the random effects, and � is a n − dimensional
vector of unknown random errors. If � ∼ N(0, Σ�) then the model in (4.2) is
a Gaussian LMM.
The random effects model was first introduced in 1918 by R.A. Fisher [11] in
a study to find correlations among some traits between familial relatives. The
estimation methods for the random effects model were formally introduced
in the 1950s. A best linear unbiased estimator (BLUE) of fixed effects and
a best linear unbiased predictions (BLUP) of random effects introduced by
C.R. Henderson [20] in 1959 were the foundation for the development of LMM
estimation (Robinson [41]). Estimation procedures for the parameters in a
LMM, differ between Gaussian and non-Gaussian models. Coefficients of a
Gaussian LMM are generally estimated using a maximum likelihood estima-
tor (MLE) or by using a restricted maximum likelihood estimator (REML).
Coefficients of non-Gaussian LMM can generally be estimated using a Quasi-
likelihood method, partially observed information, iterative weighted least
squares, or a jackknife method (Jiang [23]). Mixed effects models are useful
to model personalized outcomes. Xu et al. [51] used a mixed effects expo-
nential model to capture personalized differences of a chronic wound size,
Cenik et al. [6] used mixed effects model to improve identification of the
individual protein level differeces, and Koonce et al. [27] used mixed effects
multivariate linear regression model to incorporated personalized knowledge
of each study subject. There are many different mixed effects models used in
recent research but since our bLMM is based on a Gaussian LMM, the next
sections will focus on Gaussian LMM.
45
In general, a Gaussian LMM can be expressed as
Yn×1 = Xn×pβp×1 + Zn×qαq×1 + �n×1 (4.3)
where Y is a vector of an outcome variable measured at n times, X is a
n × p matrix of p covariates measured at n times, β is a p − dimensional
vector of unknown regression coefficients, Z is a n× q known matrix, α is a
q − dimensional vector of the random effects, and � is a n − dimensional
vector of unknown random errors. Typically LMM assumes that the α and
the � are uncorrelated and have a mean of zero and finite variance such that
G = var(α) and R = var(�).
The most widely used methods to estimate parameters in LMM are the
maximum likelihood estimator (MLE) and the restricted maximum likelihood
estimator (REML). Even though the MLE (Pfanzagl [37]) and random effects
model (Fisher [11]) were both introduced by R.A. Fisher in the 1920s, the
maximum likelihood method was not used for mixed model estimation until
Hartley and Rao [19] published a paper in 1967. Following Jiang [23], under
the Gaussian LMM, let θ be the all variance components involved in V where
V = var(Y ) = ZGZ′ + R. Then the likelihood function L(β,θ|y) can be
expressed using the distribution of Y in model (4.3) as
L(β,θ|y) = 1
(2π)n/2|V |1/2 exp { −
1
2 (y −Xβ)′V −1(y −Xβ)
} (4.4)
The log-likelihood function can be expressed as
l(β,θ|y) = C − 1
2 log(|V |) −
1
2 (y −Xβ)V −1(y −Xβ) (4.5)
with a constant C. By differentiating the log-likelihood function with respect
to the parameters, we get the following,
∂l
∂β = X′V −1y −X′V −1Xβ (4.6)
46
∂l
∂θr =
1
2
{ (y−Xβ)′V −1
∂V
∂θr V −1(y−Xβ)−tr
( V −1
∂V
∂θr
)} ,r = 1, ...,Q (4.7)
where θr is the rth element of θ and θ is of dimension Q. Then the standard
procedure to find the MLE is to solve for β and θ from ∂l ∂β
= 0 and ∂l ∂θ
= 0,
respectively. However, in practice when there are many random effects in
a LMM, the MLE which maximizes the likelihood function is inadequate.
When estimating the variance component of a LMM with many random
effects and with the number of fixed effects that increases proportional to
the sample size, MLE becomes biased and inconsistent (Faraway [9], Jiang
[22]). For example, with many random effects in a LMM, the MLE of the
variance component can result in a negative value. Because the variance
cannot be negative, the MLE of variance will be 0.
The underlying idea of REML was first introduced by M.S. Bartlett [2] in
1937 and the first approach using REML to estimate variance component
was introduced by Patterson and Thompson [35] in 1971. For a LMM, when
estimating the variance components, the fixed effects can be considered nui-
sance parameters. REML works under this assumption and finds a linear
combination A that will make X into a nuisance variable. W.l.o.g, from
model (1.3) we will assume that the rank(X) = p and let A be an n×(n−p)
matrix such that
rank(A) = n−p,E(A′X) = 0 (4.8)
Then we define T = A′y where T ∼ N(0,A′V A) because Y ∼ N(0,V ).
Thus the restricted likelihood function LR(β,θ|T) can be expressed from the
distribution of T as
LR(β,θ|T) = 1
(2π)(n−p)/2|A′V A|1/2 exp { −
1
2 T ′(A′V A)−1T
} (4.9)
where the subscript R stands for “restricted” or “residual”. Then the re-
47
stricted log-likelihood function can be expressed as
lR(β,θ|T) = C1 − 1
2 log(|A′V A|) −
1
2 T ′(A′V A)−1T (4.10)
with a constant C1. By differentiating the restricted log-likelihood function
with respect to θ and setting it to 0 we will get a REML estimator. According
to Jiang [23] (see pg13-14), even though REML uses a linear transformation
matrix A, the estimator does not depend on A.
Even though MLE and REML estimators can be approximated by finding
the roots of derivatives of the relevant likelihood, finding the maximum of
the MLE or REML is challenging from a computation standpoint. In prin-
ciple, finding the maximum of MLE or REML and checking the convergence
can be done using numerical procedures such as the Newton-Raphson (NR)
algorithm (Lindstrom and Bates [29]) or by the EM algorithm (Laird and
Ware [28]). According to Jiang [23], the NR algorithm is known to have
faster convergence but is more sensitive to the initial starting value than the
EM algorithm. Therefore, depending on the data and study design, choosing
the right maximization algorithm will reduce the time of computation and
increase the accuracy of the MLE and REML Details on the computation is
well explained in Jiang [23] (see pg 29-34).
Many major statistical software provide functions, macros or packages for
standard Gaussian LMM analysis. SAS provides PROC MIXED as a stan-
dard procedure to fit a Guassian LMM. By default, PROC MIXED delivers
empirical (or estimated) BLUE [42] following the method from Kachar and
Harvielle [26]. PROC MIXED uses ridge-stabilized Newton-Raphson algo-
rithm to optimize either a full (ML) or residual (REML) likelihood function
[42]. R has many packages on CRAN for LMM such as lme4, nlme, lmm,
ASReml, MCMCglmm, and glmmADMB. Using MLE and REML, the
standard Gaussian LMM can be analyzed using packages lme4 and nlme.
48
4.3 Expectation-Maximization (EM) Algorithm
In this section, we describe a full EM algorithm of fitting a longitudinal
LMM. Recall that a longitudinal LMM with its random effects grouped by
each of m patients can be defined as,
Yi = Xiβ + Ziαi + �i (4.11)
for i = 1, ...,m, where the ith subject has Ji repeated measures so that there
is a total m∑ i=1
Ji = N observations. Yi is a Ji×1 vector, representing outcome
values of the ith subject; Xi is a Ji × (p + 1) matrix with a column vector
of 1 representing the intercept and rest columns representing p covariates
measured at Ji times; β is a (p + 1) × 1 vector representing unknown fixed
effects of the covariates; Zi is a Ji × q known matrix associated with the
q-dim random effects αi, with αi iid∼ Nq(0,Gi), Gi being a q × q covariance
matrix of the random effects. The random error �i is Ji-dimensional with
�i iid∼ NJi (0,Ri), Ri being a Ji × Ji covariance matrix of an AR1 process,
indicating a serial correlation in the Ji observations. Further, αi and �i are
independent of each other, denoted as αi |= �i for i = 1, ...,m.
We will assume that the covariance matrix Gi of the random effects αi
and the parameters in the covariance matrix Ri of the random error �i are
identical across all subjects for i = 1, ...,m. So that the q × q covariance
matrix is
Gi = G ∆ =
σ21 ρ12σ1σ2 ρ13σ1σ3 · · · ρ1qσ1σq ρ21σ2σ1 σ
2 2 ρ23σ2σ3 · · · ρ2qσ2σq
... ... σ23 · · ·
... ...
... ...
. . . ...
ρq1σqσ1 ρq2σqσ2 ρq3σqσ3 · · · σ2q
(4.12)
where, ρjk = ρkj for j,k = 1, ...,q.
49
Ri = σ20
1 −φ2 Σ(i) �
and Σ(i) �
=
1 φ φ2 · · · φJi−1 φ 1 φ2 · · · φJi−2 ...
... . . .
... ...
... . . .
... φJi−1 · · · φ2 φi 1
is a Ji ×Ji matrix
(4.13)
Grouping m subjects together, we can re-write our growth curve model in
(4.11) as
Y N×1 = XN×(p+1)β(p+1)×1 + ZN×qαq×1 + �N×1 (4.14)
where Y , X, Z are observable, β and α are unknown fixed and random
effects, and
YN×1 =
Y1 Y2 ...
Ym
, XN×(p+1) =
1 X1 1 X2 ... 1 Xm
, β(p+1)×1 =
β0 β1 ... βp
,
ZN×(qm) =
Z1 0 · · · 0 0 Z2
. . . 0 ...
. . . . . .
... 0 · · · 0 Zm
, α(qm)×1 =
α1 α2 ... αm
∼ N(0,G),
�N×1 =
�1 �2 ... �m
∼ NN (0,R)
Yi =
Yi1... YiJi
, Xi =
X
(i) 11 · · · X
(i) 1p
... · · · ...
X (i) Ji1 · · · X(i)Jip
, Zi =
Z
(i) 11 · · · Z
(i) 1q
... · · · ...
Z (i) Ji1 · · · Z(i)Jiq
, αi =
50
αi1... αiq
, �i =
�i1... �iJi
and �i ∼ AR1(φ), R =
R1 0 · · · 0 0 R2
. . . 0 ...
. . . . . .
... 0 · · · 0 Rm
N×N
,
G =
G1 0 · · · 0 0 G2
. . . 0 ...
. . . . . .
... 0 · · · 0 Gm
=
G 0 · · · 0 0 G
. . . 0 ...
. . . . . .
... 0 · · · 0 G
mq×mq
Hence,
Y |α ∼ N(Xβ + Zα,R) and α ∼ N(0,G)
and the complete pdf of (Y ,α) from our model (4.14) is
f(Y ,α) =f(Y |α)f(α)
= m∏ i=1
( 1
(2π)Ji/2 ∣∣∣∣ σ201 −φ2 Σ(i)�
∣∣∣∣1/2 exp
{ −
1
2 ri ′ (
σ20 1 −φ2
Σ(i) �
)−1 ri
})
× m∏ i=1
Ji∏ j=1
( 1
2π|Gi|1/2 exp
{ −
1
2 αi ′Gi
−1αi
}) (4.15)
where
ri =
ri1... riJi
= Yi −Xiβ−Ziαi (4.16)
So if α were also observable, the log-likelihood function l(β, (σ2k), (ρkl),σ 2 0,φ),
given (Y ,α), where k 6= l = 1, ..,q, is
51
l(β,(σ2k)k=1,...,q, (ρkl)k,l,σ 2 0,φ)
=
( − m
2 ln(σ20 ) +
m
2 ln(1 −φ2) −
1
2
m∑ i=1
ln|Σ(i) � |−
1 −φ2
2σ20
m∑ i=1
r′ i (Σ(i)
� )−1ri
) +
( − N
2 ln|G|−
1
2
m∑ i=1
Ji∑ j=1
αi ′G−1αi
) + C
(4.17)
for some constant C.
Then the maximum likelihood estimator (MLE) can be derived by finding
the roots of the partial derivatives of the log-likelihood function with respect
to the parameters (β, (σ2k), (ρkl),σ 2 0,φ) as follows:
∂l
∂β =
∂
∂β
( −
1 −φ2
2σ20
m∑ i=1
ri ′(Σ(i)
� )−1ri
) (4.18)
∂l
∂σ2k =
∂
∂σ2k
( − N
2 ln|G|−
1
2
m∑ i=1
Ji∑ j=1
αi ′G−1αi
) (4.19)
∂l
∂ρkl =
∂
∂ρkl
( − N
2 ln|G|−
1
2
m∑ i=1
Ji∑ j=1
αi ′G−1αi
) (4.20)
∂l
∂σ20 =
∂
∂σ20
( − m
2 ln(σ20 ) −
1 −φ2
2σ20
m∑ i=1
ri ′(Σ(i)
� )−1ri
) (4.21)
∂l
∂φ =
∂
∂φ
( m
2 ln(1 −φ2) −
1
2
m∑ i=1
ln|Σ(i) � |−
1 −φ2
2σ20
m∑ i=1
ri ′(Σ(i)
� )−1ri
) (4.22)
52
Proposition 1 Under the LMM model (4.14), given α, the MLE of β, (σ2k),
(ρkl), σ 2 0,φ are
β̂ =
( m∑ i=1
Xi ′(Σ̂(i)
� )−1Xi
)−1( m∑ i=1
Xi ′(Σ̂(i)
� )−1(Yi −Ziαi)
) (4.23)
σ̂2k = 1
N
m∑ i=1
Jiα 2 ik (4.24)
ρ̂kl =
1
N
m∑ i=1
Jiαikαil√√√√ 1 N
m∑ i=1
Jiα 2 ik
√√√√ 1 N
m∑ i=1
Jiα 2 il
(4.25)
(σ̂20, φ̂) are the real solutions in (0,∞) × (0, 1) of the following equations:
σ̂20 = (1 −φ 2) ×
1
m
m∑ i=1
ri ′(Σ(i)
� )−1ri (4.26)
σ20 (N −m)φ 3 + 2(A−mσ20 )φ
2 + (2B −σ20 (N −m))φ + 2(mσ 2 0 −A) = 0
(4.27)
where A = m∑ i=1
Ji−1∑ k=1
rikri(k+1), B = m∑ i=1
( r2i1 +r
2 iJi
+2 Ji−1∑ j=2
r2ij
) , ri and rij defined
as in (4.16) and Σ̂(i) �
= Σ(i) �
(φ̂). �
The proof of Proposition 1 is in Appendix A.1. The solutions of (σ̂20, φ̂)
from (4.26) - (4.27) can be obtained in principle, but not tractable.
Next, we propose computationally efficient estimates of σ20 and φ using the
residuals of the LMM. Since �i follows and AR1(φ) model by its definition
in (4.14), we know that for the ith subject,
�i2 = φ�i1 + ηi1 �i3 = φ�i2 + ηi2
... �iJi = φ�i(Ji−1) + ηi(Ji−1)
(4.28)
53
for i = 1, ...m.
Denote � (t) i = (�i2, �i3, ...,�iJi ) and �
(t−1) i = (�i1, �i2, ...,�iJi−1 ) to be vectors
of errors with 1-lag time difference. Define �(t) = (� (t) 1 ,�
(t) 2 , ...,�
(t) m
) and
�(t−1) = (� (t−1) 1 ,�
(t−1) 2 , ...,�
(t−1) m
). Then, �(t) and �(t−1) satisfy the following
linear relationship:
�(t) = φ01 + φ� (t−1) + ηt (4.29)
where ηt iid∼ N(0,σ20 ), φ0 = 0 and 1 =
1...
1
.
Because �(t) and �(t−1) are not observable, we can use the residuals from the
LMM to get the OLS estimate of the φ.
Let r(t) and r(t−1) be the residuals corresponding to �(t) and �(t−1) so that
r(t) = (r (t) 1 ,r
(t) 2 , ...,r
(t) m
) = (r12, ...,r1J1,r22, ...,r2J1, ......,rm2, ...,rmJm ) and
r(t−1) = (r (t−1) 1 ,r
(t−1) 2 , ...,r
(t−1) m
) = (r11, ...,r1(J1−1),r21, ...,r2(J2−1), ......, rm1, ...,rm(Jm−1)).
For the ith subject, denote a (J1 − 1) × 2 matrix X (t−1) i as,
X (t−1) i =
1 ri1 1 ri2 ...
... 1 ri(Ji−1)
= (1,r(t−1)i )
Then the OLS estimate of φ from model (4.29) is
φ̂ = m∑ i=1
(X (t−1)′ i X
(t−1) i )
−1X (t−1)′ i r
(t) i (4.30)
Again, the AR(1) structure in (4.29) implies that var(�) = σ20
1 −φ2 (see [44]
Section 3.3). Thus, there are two good estimates of σ20 using either the
54
residuals from our LMM model (4.14) or using the residuals from the linear
model of the errors (4.29):
σ̂20,1 = (1 − φ̂ 2)v̂ar(�)
σ̂20,2 = v̂ar(η)
where “∧” indicates an estimate of its underlying element, e.g., v̂ar(η) is an
estimate of var(η). For T = m∑ i=1
(Ji − 1), let ri be a Ji −dimensional vector
of the residuals from the ith subject from the LMM model (4.14) and let res
be a T − dimensional vector of the residuals from the linear model of the
errors (4.29). Then, our two estimators of σ20 are in the following (4.31) and
(4.32) respectively.
σ̂20,1 = (1 − φ̂ 2) ×
1
N −m
m∑ i=1
(Ji − 1)v̂ar(ri) (4.31)
σ̂20,2 =
T∑ j=1
res2j
T − 2 (4.32)
where r̂i = Yi − Xiβ̂ − Ziα̂i, v̂ar(ri) = 1Ji − 1 Ji−1∑ j=1
(rij − r̄i)2 with r̄i =
1 Ji − 1
Ji−1∑ j=1
rij and resj = r (t) jk − φ̂0 − φ̂1r
(t−1) jk where j = 1, ...,m and k =
1, ..., (Jj−1). For simplicity of the calculation, we will use (4.32) to estimate
σ20 in the rest of this thesis.
Computation of αi, αiα ′ i
The MLE of the parameters β, (σ2k), (ρkl) in Proposition 1 and alternative
estimates of σ20 and φ in (4.30) and (4.32) depend on unobservable αi and
55
αiαi ′ where αi =
αi1... αiq
. Thus, we calculate their estimates
• α̂ik
• α̂2ik
• (α̂ikα̂il)
for k 6= l = 1, ...,q and i = 1, ...,m using the elements in E(αi|Y ) and
E(αiα ′ i |Y ). To compute the conditional expectations E(αi|Y ) and E(αiα′i|Y ),
we start from the joint distribution of Y and α.
( α Y
) ∼ N
(( 0 Xβ
) ,
( var(α) cov(α,Y )
cov(Y ,α) var(Y )
))
Then, by [3] and [25], the conditional distribution of α given Y is Normal
with mean and variance as:
E(α|Y ) = E(α) + cov(α,Y )var(Y )−1(Y −Xβ) (4.33)
var(α|Y ) = var(α) − cov(α,Y )var(Y )−1cov(Y ,α)) (4.34)
Unfortunately, computing (4.33) and (4.34) requires calculating an inverse
matrix with a size of dim(Y ) ×dim(Y ) which can be very large with a big-
data. Fortunately, we have the following Lemma under the independence
scenario.
Lemma 1 If (Y1,α1), ..., (Ym,αm) are independent to each other, then
E(αi|Y ) = E(αi|Yi) (4.35)
E(αiα ′ i |Y ) = E(αiα′i|Yi) (4.36)
�
56
Based on Lemma 1 and careful calculation, we have the following conditional
expectations.
Proposition 2 Under the LMM in (4.14), we have
E(αi|Yi) = GiZi′(ZiGiZi′ + Ri)−1(Yi −Xiβ) ∆ ≡ F1
E(αiαi ′|Yi) = Gi −GiZi′(ZiGiZi′ + Ri)−1ZiGi + E(αi|Yi)E(αi|Yi)′
∆ ≡ F2
�
Here F1 is a q−dim vector and F2 is a q×q matrix. See Appendix A.2 and
Appendix A.3 for the proofs of Lemma 1 and Proposition 2 respectively.
Note that Gi depends on (σ 2 k) and (ρkl) for k 6= l = 1, ...,q and Ri depends
on σ20 and φ and is Ji −dim, as given in (4.12) and (4.13). So,
Gi = Gi((σ 2 k), (ρkl)) (4.37)
Ri = Ri(σ 2 0,φ) (4.38)
F1 = F1(Zi,Gi,Ri,Yi,Xi,β) (4.39)
F2 = F2(Zi,Gi,Ri,Yi,Xi,β) (4.40)
which can be computed based on data Xi, Yi and previous estimates of β,
(σ2k), (ρkl), σ 2 0 , φ for k 6= l = 1, ...,q. This leads to the following full EM
algorithm.
Using the formulas for parameter estimates of (β,σ2k,ρkl,σ 2 0,φ) in (4.23),
(4.24), (4.25), (4.30), (4.32), and estimates for αi, our full EM algorithm is
in (4.41) below.
57
Schematic Structure of Full EM algorithm (4.41)
1. Initial Step : Set initial values for (β,σ2k,ρkl,σ 2 0,φ) which can be ob-
tained from fitting a small sample to the LMM
2. E-Step : Compute the estimates α̂ (j) ik , α̂
2(j) ik , and (α̂ikα̂il)
(j) using for- mulas in Proposition 2 and (4.39) & (4.40) at step (j − 1)
3. M-step : Compute the MLE (β̂(j), σ̂ 2(j) k , ρ̂
(j) kl , σ̂
2(j) 0 , φ̂
(j)) in Proposition
1 using α̂ (j) ik , α̂
2(j) ik , and (α̂ikα̂il)
(j) from the E-step above
4. Repeat 2 and 3 until all of the iterative parameter estimates are con-
verged, e.g. |β̂(j)1 − β̂
(j−1) 1 |
|β̂(j−1)1 | ≤ δ where β̂(j)1 is the estimated β1 in current
iteration and β̂ (j−1) 1 is the estimated β1 from previous iteration.
Detailed Full EM algorithm
We first calculate F1 and F2 using a known Ji × p matrix Zi, an observed
Ji × 1 vector Yi, observed Ji × (p + 1) matrix Xi from the jth block of
data and given initial estimated values of the parameters (β̂(j−1), (σ̂ 2(j−1) k ),
(ρ̂ (j−1) kl ), σ̂
2(j−1) 0 , φ̂
(j−1)) from the (j − 1)th block of data as in Proposition
2.
E-Step : Compute
F1 (j) = F1(β̂
(j−1) , (σ̂
2(j−1) k ), (ρ̂
(j−1) kl ), σ̂
2(j−1) 0 , φ̂
(j−1),Yi (j),Xi
(j),Zi (j))
(4.42)
F2 (j) = F2(β̂
(j−1) , (σ̂
2(j−1) k ), (ρ̂
(j−1) kl ), σ̂
2(j−1) 0 , φ̂
(j−1),Yi (j),Xi
(j),Zi (j))
(4.43)
Then (α̂ (j) ik , α̂
2(j) ik , (α̂ikα̂il)
(j)) are the elements of F1 (j) and F2
(j)
• α̂(j)ik = F1(k) (j) (4.44)
• α̂2(j)ik = F2(k,k) (j) (4.45)
• (α̂ikα̂il)(j) = F2(k,l)(j) (4.46)
58
for k 6= l = 1, ...,q, i = 1, ...,m.
M-Step : We plug-in (α̂ (j) ik , α̂
2(j) ik , (α̂ikα̂il)
(j)) from the above E-Step into
the MLE of β, (σ2k), (ρkl) and estimates of σ 2 0 , φ in (4.30), (4.32) to obtain,
β̂(j) =
( m∑ i=1
Xi ′(Σ̂(i)
� )−1Xi
)−1( m∑ i=1
Xi ′(Σ̂(i)
� )−1(Yi −Ziα̂i)
) (4.47)
σ̂ 2(j) k =
1
N
m∑ i=1
Jiα̂ 2(j) ik (4.48)
ρ̂ (j) kl =
1
N
m∑ i=1
Ji(α̂ikα̂il) (j)
√√√√ 1 N
m∑ i=1
Jiα̂ 2(j) ik
√√√√ 1 N
m∑ i=1
Jiα̂ 2(j) il
=
m∑ i=1
Ji(α̂ikα̂il) (j)
√√√√ m∑ i=1
Jiα̂ 2(j) ik
√√√√ m∑ i=1
Jiα̂ 2(j) il
(4.49)
Using β̂ and α̂i we calculate residuals r̂i for the ith subject from the model
4.14 as
r̂i = Yi −Xiβ̂−Ziα̂i (4.50)
Divide r̂i as we have done to derive 4.30 and define X̂ (t−1) i such that,
• r̂(t) = (r̂(t)1 , r̂ (t) 2 , ..., r̂
(t) m
) = (r̂12, ..., r̂1J1, r̂22, ..., r̂2J2, ......, r̂m2, ..., r̂mJm )
• r̂(t−1) = (r̂(t−1)1 , r̂ (t−1) 2 , ..., r̂
(t−1) m
)
= (r̂11, ..., r̂1(J1−1), r̂21, ..., r̂2(J2−1), ......, r̂m1, ..., r̂m(Jm−1))
• X̂(t−1)i = (1, r̂ (t−1) i )
Then φ and σ20 can be estimated by
59
φ̂(j) = m∑ i=1
(X̂ (t−1)′ i X̂
(t−1) i )
−1X̂ (t−1)′ i r̂i
(t) (4.51)
σ̂ 2(j) 0 =
T∑ i=1
r̂es 2 i
N − 2 (4.52)
where r̂esi for i = 1, ...,T is the residual from r̂ (t) = φ0 + φr̂
(t−1) + ηt.
Iteration and Convergence: After we get the parameter estimates from
the M-step, we calculate the relative error between the updated estimates
and previous estimates, e.g. as in (4.53) using β1.
relerror = |β̂(j)1 − β̂
(j−1) 1 |
|β̂(j−1)1 | (4.53)
For convergence of the parameter estimates, we iterate the E-Step and M-
step until the relative error for all of the parameter estimates gets smaller
than a given δ or a certain number of maximum iteration has reached.
4.4 Partial EM (PEM) Procedure for bLMM
Our PEM procedure was developed to overcome the difficulty in fitting a
statistical model to an entire dataset using an existing procedure (e.g. MLE,
Full EM algorithm(4.41)) when the data size exceeds the memory capacity.
The procedure partitions the (big) data into blocks and fits a LMM using
a block and then updates the parameter estimates as new blocks are added
into the computation. The overall flow of PEM is introduced in Figure 4.2.
First, we partition the full data into C blocks. In this step we need to make
sure that the size of each partition can be imported within the memory
capacity. Choose one block of data and fit a LMM using R package nlme
[38] to derive initial parameter estimates (β(1), σ 2(1) k , ρ
(1) kl , σ
2(1) 0 , φ
(1)). Along
60
Figure 4.2: Overall flow of PEM Procedure
with the parameter estimates, we also calculate S (1) 1 , ...,S
(1) 7 (given below)
that will be used to derive updated estimates. Using (β(1), σ 2(1) k , ρ
(1) kl , σ
2(1) 0 ,
φ(1)) and S (1) 1 , ...,S
(1) 7 , we derive (β
(2), σ 2(2) k , ρ
(2) kl , σ
2(2) 0 , φ
(2)) and S (2) 1 , ...,S
(2) 7
about the data in Block 2 by adding data in Block 2. We then iterate PEM
procedure until the parameter estimates converge. Then the converged (β(2),
σ 2(2) k , ρ
(2) kl , σ
2(2) 0 , φ
(2)) and S (2) 1 , ...,S
(2) 7 from Block 2 will be used to derive the
next set of parameter estimates by adding the data from Block 3. Repeat the
same steps until all of the data is included in the calculation. The updated
estimates (β(C), σ 2(C) k , ρ
(C) kl , σ
2(C) 0 , φ
(C)) from adding the last partition C, are
the final estimates for statistical inference. The PEM algorithm for bLMM
is described in (4.54).
61
PEM algorithm (4.54)
1. Partition the full data into C blocks
2. Select a block, B1 and fit LMM using R package nlme and get initial
estimates (β̂ (1)
, σ̂ 2(1) k , ρ̂
(1) kl , σ̂
2(1) 0 , φ̂
(1))
3. Use (β̂ (1)
, σ̂ 2(1) k , ρ̂
(1) kl , σ̂
2(1) 0 , φ̂
(1)) and the data from B1 to calculate nec-
essary sum components S (1) 1 , ...,S
(1) 7 that contain summary information
on the data in B1
4. Choose next block, B2
5. E-Step : Compute α̂ik, α̂ 2 ik, and (α̂ikα̂il) from E(α|Y ) and E(αα
′|Y ) using (β̂
(1) , σ̂
2(1) k , ρ̂
(1) kl , σ̂
2(1) 0 , φ̂
(1)) and data in B2
6. M-step : Compute (β̂ (2) j , σ̂
2(2) (k,j)
, ρ̂ (2) (kl,j)
, σ̂ 2(2) (0,j)
, φ̂ (2) j ) using α̂ik, α̂
2 ik, (α̂ikα̂il)
from E-step, and the sum components from B1
7. Repeat 5 and 6 until all of the iterative parameter estimates are con-
verged, e.g. |β̂(2)
(1,j) − β̂(2)
(1,(j−1))| |β̂(2)
(1,(j−1))| ≤ δ where β̂(2)
(1,j) is the estimated β1 in
current iteration and β̂ (2) (1,(j−1)) is the estimated β1 from previous itera-
tion
8. Go to step 4 to select another block and repeat steps 5 to 7 until all blocks are included in the calculation
In the following part of this Section, we will describe the details of E-Step
and M-Step of PEM. Assume that the full data is partitioned into C blocks.
Let Bi represent the data in the ith block and mi represent the number of
subjects in the ith block for i = 1, ...,C. Using the data in B1, we fit a LMM
using R package nlme and get the initial estimates (β̂ (1)
, (σ̂ 2(1) k ), (ρ̂
(1) kl ), σ̂
2(1) 0 ,
φ̂(1)), the estimated random effects α̂(1), and the residuals (r̂(1)). Then we
calculate the following sum components (4.55)-(4.61) which will be used in
the M-step to get the updated parameter estimates when we add the next
block of data (B2) into the analysis.
62
S (1) 1 =
∑ i∈B1
Xi ′(Σ̂(i)
� )−1Xi (4.55)
S (1) 2 =
∑ i∈B1
Xi ′(Σ̂(i)
� )−1(Yi −Ziα̂(1)) (4.56)
S (1) 3k =
∑ i∈B1
Jiα̂ 2 ik (4.57)
S (1) 4kl =
∑ i∈B1
Jiα̂ikα̂il (4.58)
S (1) 5 =
∑ i∈B1
X (t−1)′ i X
(t−1) i (4.59)
S (1) 6 =
∑ i∈B1
X (t−1)′ i r
(1)(t) i (4.60)
S (1) 7 =
∑ i∈B1
r̂es (1)2 i (4.61)
where k 6= l = 1, ...,q. For the ith subject, Xi and Yi are observed covariates
and response in the B1 block respectively, Zi is a known matrix associated
with α̂(1), Σ(i) �
is defined by estimated (σ̂ 2(1) 0 , φ̂
(1)), α̂2ik and (α̂ikα̂il) can be
calculated using α̂(1), X (t−1) i and r
(1)(t) i are derived from (r̂
(1)) and r̂es (1) i
is the residual from fitting (4.29) with (r̂(1)).
E-Step
E-Step in PEM is almost identical to the E-Step in the full EM algo-
rithm. Given the parameter estimates (β̂ (1)
, (σ̂ 2(1) k ), (ρ̂
(1) kl ), σ̂
2(1) 0 , φ̂
(1)) from
B1, Yi and Xi from B2, we calculate (α̂ik,α̂ 2 ik,(α̂ikα̂il)) from EB2 (αi|Yi) and
EB2 (αiαi ′|Yi) for k 6= l = 1, ...,q and i = 1, ...,m using (4.62) and (4.63) as
in (4.42) and (4.43).
F (B2) 1 = F1(β̂
(1) , (σ̂
2(1) k ), (ρ̂
(1) kl ), σ̂
2(1) 0 , φ̂
(1),Yi (2),Xi
(2),Zi (2)) (4.62)
F (B2) 2 = F2(β̂
(1) , (σ̂
2(1) k ), (ρ̂
(1) kl ), σ̂
2(1) 0 , φ̂
(1),Yi (2),Xi
(2),Zi (2)) (4.63)
63
From F (B2) 1 and F
(B2) 2 (α̂ik,α̂
2 ik,α̂ikα̂il) can be calculated as,
α̂ik = F (B2) 1 (k)
α̂2ik = F (B2) 2 (k,k)
α̂ikα̂il = F (B2) 2 (k,l)
Similarly, for block Bc, c = 1, ...,C, given the parameter estimates (β̂ (c−1)
,
(σ̂ 2(c−1) k ), (ρ̂
(c−1) kl ), σ̂
2(c−1) 0 , φ̂
(c−1)) from B1(c−1), Yi and Xi from B(c), we can
calculate,
F (Bc) 1 = F1(β̂
(c−1) , (σ̂
2(c−1) k ), (ρ̂
(c−1) kl ), σ̂
2(c−1) 0 , φ̂
(c−1),Yi (c),Xi
(c),Zi (c))
(4.64)
F (Bc) 2 = F2(β̂
(c−1) , (σ̂
2(c−1) k ), (ρ̂
(c−1) kl ), σ̂
2(c−1) 0 , φ̂
(c−1),Yi (c),Xi
(c),Zi (c))
(4.65)
α̂ik = F (Bc) 1 (k) (4.66)
α̂2ik = F (Bc) 2 (k,k) (4.67)
α̂ikα̂il = F (Bc) 2 (k,l) (4.68)
Then, (α̂ik, α̂ 2 ik, (α̂ikα̂il)) from Bc block will be used in the M-Step below to
get updated estimates of (β̂ (c)
, (σ̂ 2(c) k ), (ρ̂
(c) kl ), σ̂
2(c) 0 , φ̂
(c)).
M-Step
In the M-Step, PEM procedure accumulates the information from the pre-
vious blocks as a new block of the data is added to calculating the parameter
estimates for fitting LMM. After obtaining the parameter estimates (β̂ (c−1)
,
σ̂ 2(c−1) k , ρ̂
(c−1) kl , σ̂
2(c−1) 0 , φ̂
(c−1)) and (S (c−1) 1 , S
(c−1) 2 , S
(c−1) 3k , S
(c−1) 4kl , S
(c−1) 5 , S
(c−1) 6 ,
S (c−1) 7 ) from B(c−1), we can update the parameter estimates and the sum com-
ponents by adding data from the next block Bc as follows.
64
β̂(c) =
( S
(c−1) 1 +
∑ i∈Bc
Xi ′(Σ(i)
� )−1Xi
)−1( S
(c−1) 2 +
∑ i∈Bc
Xi ′(Σ(i)
� )−1(Yi −Ziα̂i)
) (4.69)
σ̂ 2(c) k =
1
Tc
( S
(c−1) 3k +
∑ i∈Bc
Jiα̂ 2 ik
) (4.70)
ρ̂ (c) kl =
S (c−1) 4kl +
∑ i∈Bc
Jiα̂ikα̂il√ S
(c−1) 3k +
∑ i∈Bc
Jiα̂ 2 ik
√ S
(c−1) 3l +
∑ i∈Bc
Jiα̂ 2 il
(4.71)
φ̂(c) =
( S
(c−1) 5 +
∑ i∈Bc
X (t−1)′ i X
(t−1) i
)−1( S
(c−1) 6 +
∑ i∈Bc
X (t−1)′ i r
(t) i
) (4.72)
σ̂ 2(c) 0 =
1
Tc − 2
( S
(c−1) 7 +
∑ i∈Bc
r̂es 2 i
) (4.73)
for Tc = c∑ i=1
mi∑ j=1
(Jj − 1).
Iteration and Output Step
We iterate the E-Step and M-Step until the parameter estimates are con-
verged i.e. relative error = |β̂(c)
(1,j) − β̂(c)
(1,(j−1))| |β̂(c)
(1,(j−1))| ≤ δ. At block Bc, when the
parameters are converged, the converged parameter estimates (β̂ (c)
, σ̂ 2(c) k ,
ρ̂ (c) kl , σ̂
2(c) 0 , φ̂
(c)) and the following sum components are saved to be used for
updating parameter estimates when block B(c+1) is added.
65
S (c) 1 =
c∑ j=1
∑ i∈Bj
Xi ′(Σ̂(i)
� )−1Xi (4.74)
S (c) 2 =
c∑ j=1
∑ i∈Bj
Xi ′(Σ̂(i)
� )−1(Yi −Ziα̂(c)) (4.75)
S (c) 3k =
c∑ j=1
∑ i∈Bj
Jiα̂ 2 ik (4.76)
S (c) 4kl =
c∑ j=1
∑ i∈Bj
Jiα̂ikα̂il (4.77)
S (c) 5 =
c∑ j=1
∑ i∈Bj
X (t−1)′ i X
(t−1) i (4.78)
S (c) 6 =
c∑ j=1
∑ i∈Bj
X (t−1)′ i r
(t) i (4.79)
S (c) 7 =
c∑ j=1
∑ i∈Bj
r̂es 2 i (4.80)
PEM procedure will update the parameter estimates until all C blocks are
included in the analysis. Then the parameter estimates (β̂ (C)
, σ̂ 2(C) k , ρ̂
(C) kl ,
σ̂ 2(C) 0 , φ̂
(C)) after adding all C blocks are used for statistical inference.
Chapter 5
Performance of bLMM PEM
A simulation study was conducted to compare the performance of our bLMM
PEM with four benchmark procedures. The first two were based on entire
data, then the entire data can be inputed into R. The next two use blocks
of the entire data for analyses.
1. Using entire data to fit
• Linear mixed effects model using Broyden−Fletcher−Goldfarb−Shanno
(BFGS) algorithm (R package nlme), or
• Linear mixed effects model using the full EM algorithm
2. Using a block/blocks of data to fit
• Linear mixed effects model using meta analysis of sub LMMs
• Linear mixed effects model using one block of the data
Procedure 1, LMM using BFGS algorithm, was implemented in function lme
in the R package nlme. This package used BFGS algorithm [32] to find MLE
of the parameters of interest.
Procedure 2, LMM using the full EM algorithm, employed an expectation
and maximization approach to find the parameter estimates that maximize
66
67
the likelihood fuction using the entire dataset. The EM algorithm required
initial values of the parameters. Even though EM algorithm was known to
be robust to the initial values, the computation time depends on the initial
values. The reasonable initial values we chose were data driven by deriving
the estimates from a random partition of the data. So, first we chose a
subset of the data to derive the parameter estimates to be used as initial
value. Using these initial values, we iterated the EM algorithm until the
estimates of the parameters met the convergence criteria having the relative
errors smaller than 10−3 or the algorithm reached 200 iterations.
Our Procedure, LMM using PEM algorithm, follows the PEM algorithm in
(4.54). We randomly partitioned the data into C blocks. For the first block,
the function lme in R package nlme was used to fit a LMM. Using the
resulting parameter estimates, summation information and next blocks of
data, we applied our PEM to update the parameter estimates, as introduced
in Chapter 4 Section 4.4. Our PEM procedure iterated until the estimated
parameters met the convergence criteria of 10−3, or the algorithm reached
200 iterations, using the same convergence criteria as those in the full EM
algorithm.
Procedure 3, LMM using meta analysis, was done on the same data parti-
tion used for our bLMM PEM procedure. We treated the partitions to be
independent data. We parallelized the LMM models among the partitions to
get the parameter estimates using the function lme in R package nlme. We
then merged the estimates using fixed effects meta analysis [46]. For the ith
block, let γ̂i be a parameter estimate, vi be the standard error from fitting
the LMM model to the ith block and r be the true value of γ̂i. Then a fixed
effects model γ̂i = r + ei is used to combine the estimates from all C blocks.
The overall parameter estimate γ̂ by merging γ̂1, ...γ̂C can be calculated by
68
γ̂ =
C∑ i=1
wiγ̂i
C∑ i=1
wi
(5.1)
where wi = v −1 i and var(γ̂) = 1/
C∑ i=1
wi.
Procedure 4, using LMM on a block of the data, computed the estimates
based on the first block partitioned for our bLMM PEM procedure. So, the
estimates were derived from applying the function lme in R package nlme
on the first block of the data.
Modified PEM
Our preliminary simulation study showed that the parameter estimates using
PEM procedure had reasonable variance and bias except for the φ parameter.
Finding the reason for this bias is a remaining task to be studied but for the
purpose of using PEM to analyze our SCAN-ECHO data, we were able to
improve the performance of PEM by using a hybrid procedure with the pa-
rameter estimates of φ from the BFGS algorithm. Therefore, the simulation
study and the application in Chapters 5 and 6 uses the hybrid algorithm.
Our hybrid algorithm works exactly same as our PEM procedure introduced
in Section 4.4 but for φ we use the parameter estimate derived from BFGS
algorithm from each block. For example, when updating the parameter esti-
mates for block 1, φ will be estimated from BFGS algorithm, which will be
identical to the estimates from the meta analysis, and stay the same as the
PEM procedure iterates within the block for convergence.
Model Setup
The model used for the simulation study was a longitudinal random effects
69
model with three fixed effects, one random intercept and one random slope.
The errors had AR(1) serial correlation.
Yi = β0 + Xi1β1 + Xi2β2 + Xi3β3 + (α1i + αi2Xi3) + �i (5.2)
where,
• E(αi) = (
0 0
) and var(αi) =
( σ21 ρσ1σ2
ρσ2σ1 σ 2 2
) for αi = (α1i,α2i)
• E(�i) = 0 and var(�i) = σ20
1 −φ2 Σ
(i) � where Σ
(i) � =
1 φ · · · φJi−1 φ 1 · · · φJi−2 ...
... . . .
... φJi−1 · · · φ 1
A dataset was simulated for 5,000 subjects with 5 repeated outcome measures
using β0 = 6, β1 = 1.1, β2 = 1.3, β3 = −2.1, σ21 = 2.4, σ22 = 1.7, ρ =
0.3, σ20 = 0.3, φ = 0.6. The covariates represented age at baseline (X1), a
binary intervention (X2), and time (X3). X1 was generated from N(50, 5),
X2 was generated from binomial distribution with p(intervention) = 0.6 and
p(control) = 0.4, and X3 = time = (1,2,3,4,5) which is same for all subjects.
The parameters used to generate these variables are similar to those from
our original data from the VA patients. For the three procedures bLMM
PEM, meta analysis and partial data analysis, the subjects were divided
into 5 partitions with 1,000 patients in each partition. MSE, bias, %bias (=∣∣true bias
∣∣) and variance of the estimates from the simulation for the 5 methods are provided in Tables 5.1, 5.2, 5.3, and 5.4 respectively. The average time
used for estimation is provided in Table 5.5.
‘true’ indicates the true values of the parameters used to generate simulated
data, ‘nlme’ represents the method using BFGS algorithm, ‘EM’ stands for
the full EM algorithm, ‘PEM’ stands for the Partial EM, ‘Meta’ stands for the
meta analysis with a fixed effects model, and ‘partial’ stands for using partial
70
data and fitting the model using BFGS algorithm. The result is based on 450
simulations and the mean-squared error was calculated by adding the esti-
mated variance and square of the bias of the parameters estimated from the
450 simulations. Among the 5 methods, we expect BFGS algorithm (‘nlme’)
and full EM (‘EM’) to perform better than the other methods because it uses
the entire data at once for estimation.
param true nlme EM PEM Meta Partial β1 1.1 0.00001 0.00004 0.00007 0.00001 0.00012 β2 1.3 0.00019 0.00159 0.00168 0.00019 0.00191 β3 -2.1 0.00006 0.00006 0.00006 0.00006 0.00066 σ21 2.4 0.00323 0.03308 0.05324 0.00449 0.04098 σ22 1.7 0.00066 0.00088 0.00084 0.00065 0.00575 ρ 0.3 0.00015 0.00102 0.0018 0.00018 0.00172 σ20 0.3 0.14883 0.00604 0.00094 0.15728 0.16353 φ 0.6 0.00043 0.00044 0.00044 0.00044 0.00447
Table 5.1: Mean-square error of the estimates
param true nlme EM PEM Meta Partial β1 1.1 -0.0001 -0.0002 -0.00018 -0.0001 -0.00032 β2 1.3 0.000004 -0.00041 -0.00067 0.00001 -0.00115 β3 -2.1 0.00023 0.00023 0.00026 0.00022 0.00063 σ21 2.4 -0.00263 0.17418 -0.17382 -0.02612 -0.02952 σ22 1.7 0.00044 0.0147 -0.00995 -0.00017 -0.00035 ρ 0.3 -0.00005 -0.02986 0.03191 0.00332 0.00247 σ20 0.3 0.38529 -0.07756 -0.02523 0.39596 0.39796 φ 0.6 -0.00012 0.00241 0.00241 0.00241 0.00484
Table 5.2: Bias of the estimates
param true nlme EM PEM Meta Partial β1 1.1 0.00009 0.00018 0.00016 0.00009 0.00029 β2 1.3 0.00000 0.00032 0.00051 0.00000 0.00088 β3 -2.1 0.00011 0.00011 0.00013 0.0001 0.0003 σ21 2.4 0.0011 0.07257 0.07242 0.01088 0.0123 σ22 1.7 0.00026 0.00865 0.00585 0.0001 0.0002 ρ 0.3 0.00015 0.09953 0.10637 0.01107 0.00822
71
σ20 0.3 1.28428 0.25854 0.08412 1.31987 1.32652 φ 0.6 0.00021 0.00402 0.00402 0.00402 0.00807
Table 5.3: % Bias of the estimates
param true nlme EM PEM Meta Partial β1 1.1 0.00001 0.00004 0.00007 0.00001 0.00012 β2 1.3 0.00019 0.00159 0.00168 0.00019 0.00191 β3 -2.1 0.00006 0.00006 0.00006 0.00006 0.00066 σ21 2.4 0.00322 0.00274 0.02302 0.0038 0.04011 σ22 1.7 0.00066 0.00066 0.00074 0.00065 0.00575 ρ 0.3 0.00015 0.00013 0.00078 0.00017 0.00171 σ20 0.3 0.00038 0.00002 0.00031 0.0005 0.00517 φ 0.6 0.00043 0.00044 0.00044 0.00044 0.00445
Table 5.4: Variance of the estimates
nlme EM PEM Meta Partial time 38.77 150.45 54.64 27.80 2.84
Table 5.5: Efficiency (sec)
The MSE of the parameter estimates using BFGS algorithm were smallest
overall except for σ20 . The MSE is expected to be smaller for BFGS algorithm
and EM algorithm because they use the entire data at once to derive the
parameter estimates. However, for EM algorithm, the MSE was not as good
as the BFGS algorithm. Among our PEM procedure, Meta analysis, and
using partial data which uses block(s) of data, the performance of PEM
procedure or Meta analysis differed by parameters. Especially for σ20 , the
MSE, bias and variance was smallest among all five algorithms. This maybe
because PEM procedure has “superefficiency”. The computation time of
PEM was a little longer than meta analysis with only few seconds.
Chapter 6
Evaluation of SCAN-ECHO training on diabetes treatment using bLMM PEM
To examine the effect of SCAN-ECHO training on diabetes treatment using
the trend of original A1c values by days, we fitted LMMs using our bLMM
PEM method on each of homogeneous subgroups and for the entire patients
with clusters as a covariate. Because we were not able to conduct cluster
analysis when using the original A1c values, we subgrouped the patients
based on the result from our cluster analysis used the mean of A1c by year
in Section 3.2. The cluster analysis using the mean of A1c suggested that the
patients can be divided into 5 subgroups as in Figure 3.2. So, a LMM was
fitted for each subgroup as in (6.1)and then for the entire patient population
with the cluster variable as a covariate as in (6.2).
A1ci = scan + log(time) + baseage + gender + race + BMI
+ SBP + DBP + med + cons + cvdiag + α1i + α2ilog(timei) (6.1)
72
73
A1ci = scan + log(time) + cluster + baseage + gender + race + BMI
+ SBP + DBP + med + cons + cvdiag + α1i + α2ilog(timei) (6.2)
where i is the index for the ith patient. The distribution of time in days
was heavily right skewed. Therefore, in our LMMs (6.1, 6.2), time was log
transformed to reduce the skewness.
The parameter estimates of the fixed effects from fitting LMMs (6.1) and
(6.2) are provided in Table 6.1 and Table 6.2 respectively.
Cluster A Cluster B Cluster C Cluster D Cluster E Estimate Estimate Estimate Estimate Estimate
(Intercept) 4.68529 5.62723 6.64122 8.17743 11.16059 scan1 -0.01255 0.00314 0.01521 0.04584 0.22487
log(time) -0.00447 0.01238 0.03701 0.03667 0.00089 baseage 0.00485 0.00334 0.00023 -0.00267 -0.00864
GenderM 0.10577 0.08439 0.06080 0.04939 -0.17455 raceOTHER -0.00457 0.00345 -0.03526 -0.03058 -0.04296 raceWHITE 0.00028 -0.03478 -0.02947 -0.04619 -0.13193
SBP -0.00067 0.00047 0.00028 -0.00153 -0.00496 DBP 0.00095 0.00107 0.00366 0.00675 0.01499 BMI 0.02112 0.01083 0.00617 -0.00161 -0.02344
med1 0.10046 0.16606 0.09695 0.05824 0.08578 cons1 0.06127 0.03829 0.07746 0.10958 0.03940
cvdiag1 -0.00882 -0.01206 0.00150 0.02552 0.07696
Table 6.1: Estimated coefficients of the fixed effects in the LMM for original A1c by clusters
The effect of SCAN-ECHO training increased from Cluster A to Cluster E.
SCAN-ECHO training decreased the A1c levels of the patients in Cluster
A by 0.01255 but starting from Cluster B, SCAN-ECHO increased the A1c
levels of the patients by 0.00314, 0.01521, 0.04584, 0.22487 in Clusters B,
C, D, E respectively. This shows that SCAN-ECHO helps to reduce the
A1c levels for patients with relatively healthy diabetic status where as, for
74
patients with relatively poor diabetic status, SCAN-ECHO training does not
help to reduce A1c levels.
Covariate Estimate Covariate Estimate Covariate Estimate (Intercept) 5.15236 ClusterE 2.73716 DBP 0.00353
scan1 0.02065 baseage 4.44322 BMI 0.00524 time 0.02070 GenderM 0.05369 med 0.13118
ClusterB 0.00077 raceOTHER -0.01516 cons 0.06148 ClusterC 0.73788 raceWHITE -0.03075 cvdiag -0.00039 ClusterD 1.65668 SBP -0.00040
Table 6.2: Estimated coefficients of the fixed effects in the LMM for original A1c from all patients
The estimated fixed effect of SCAN-ECHO training from our model (6.2)
showed that the treatments from PCPs who had SCAN-ECHO training in-
creased A1c level in average by 0.02065 relative to the treatment from PCPs
who did not have SCAN-ECHO training.
Chapter 7
Discussion and Conclusion
In the first part of this thesis, we introduced the challenges and their solu-
tions in evaluating the effect of SCAN-ECHO training on treating diabetes.
Section 3.2 evaluates the effects of SCAN-ECHO training based on three
summary measures of A1c level which were annual mean, SD and CV of A1c
levels. SCAN-ECHO training appeared to slightly decrease the mean of A1c
for the subgroup of patients whose mean of A1c is around 5.5, and with a flat
trend over 6 years. In contrast, SCAN-ECHO training appeared to increase
the mean of A1c for the subgroup of patients who had flat trend of mean A1c
levels around 10.2 over 6 years. For variability measures by SD and CV of
A1c, SCAN-ECHO statistically significantly decreased the SD and CV of A1c
over time for subgroups with a relatively smaller variability. Hence, based on
the analysis of these measures, SCAN-ECHO training showed statistically
significant effect in decreasing (or maintaining) the yearly mean, SD, and
CV of A1c over time for the subgroup patients who were relatively healthy,
whereas, for patients with high mean A1c or elevated SD or CV, SCAN-
ECHO showed a negative effect by increasing the A1c slightly. As shown in
Chapter 6, the results from LMMs using original A1c levels were similar to
the results for annual mean of A1c. Overall, SCAN-ECHO training appear
to be beneficial for patients with better diabetic health conditions, whereas
75
76
for patients with worse diabetic health conditions, it might be helpful to see
a specialist for diabetes treatment rather than the PCPs in the outpatient
clinics.
Our proposed PEM procedure for fitting bLMM provided a significant con-
tribution to our analysis. It allowed us to fit a LMM using the entire data
when the data exceeded the memory capacity. Data partition made it pos-
sible to fit a LMM to a block of the big data within the memory limitation
and the PEM procedure made it possible to update the parameter estimates
and information as additional blocks of data are added. The parameter es-
timates passed over from a previous block make the PEM procedure very
efficient because they are not only good initial values, but also placed in the
updating formula of our PEM procedure when the next block is added into
the calculation. PEM procedure is a general method that can be adapted
to various types of big-data when fitting a longitudinal linear mixed effects
model.
At the present time, however, it is unclear how to parallelize the sequential
updating algorithm in the PEM procedure. Also, as each block is added into
the calculation, the PEM procedure iterates within the current block of data
until the convergence of the estimates. Therefore, PEM can be computation-
ally intensive. However, as shown in Chapter 5, the calculation time by PEM
is not too much more than that by the BFGS algorithm using the entire data,
and meta analysis using BFGS algorithm for each block of data. Therefore,
when BFGS algorithm using the entire data cannot be used for a big data,
considering the performance of PEM being better than the meta analysis
in terms of the MSE, the time difference between PEM and meta-analysis
procedures can be a beneficial trade-off.
Current PEM procedure is developed for a longitudinal LMM when the ran-
dom error has an AR(1) correlation structure. AR(1) structure is widely
used to capture the correlation between the repeated measures within a sub-
77
ject, but it ignores the actual length of the time between each consecutive
measures. This could be a limitation when correlation varies greatly as the
length between the measures changes in a study. Generalization of the cor-
relation structure of the random error is one of our future research agenda
for bLMM.
Also, current PEM does not provide hypothesis testing results for the fixed
effects. However, hypothesis testing for the fixed effect can be informative.
For the fixed effects, we already have considered different approaches in de-
riving the standard error of the fixed effects such as using our updating PEM
procedure, or bootstrap sampling which will be implemented in future re-
search. The challenge in finding a way to derive the standard error of the
variance components of the LMM still remains a challenge.
In short, our proposed PEM procedure is a significant contribution to fitting a
longitudinal LMM for big-data. It sets a good foundation for future research
when more general correlation structures of the random errors and more
complex models including interactions between the fixed and random effects
need to be considered.
Part II
Projection Pursuit Ellipse (PPE)
78
Chapter 8
Original PPE
In biomedical research, when investigating the effect of an intervention, the
simplest analysis is mean comparison between a control group and the in-
tervention group, or among many groups. Widely-used statistical tests for
mean comparison in low-dimensions are t-test and ANOVA, and Hotelling’s
T2 and MANOVA in high-dimensions. The key assumptions for all of these
tests are that the data are distributed normally and have a common variance-
covariance. In this second part of the dissertation, we introduce Projec-
tion Pursuit Ellipse (PPE) for testing the equality of covariances for high-
dimensional data and present our proposed enhanced high-dimensional PPE
(hPPE) algorithm, evaluate the performance of PPE procedures in a com-
parative study with the Bartlett’s test and a modern benchmark proposed
by T.Cai et al [39].
8.1 Introduction
Since equal covariance is one of the basic assumptions in a large number
of multivariate techniques, a test of equal covariance matrices is often used
in preliminary data exploration. For high-dimensional data, classic tests to
evaluate if several covariance matrices are equal include the likelihood ratio
79
80
test [1] and a modified likelihood ratio test, such as, Bartlett’s test [2, 45].
Under the normality assumption, consider a G group comparisons where
the random samples in the gth population Xg,1, ...,Xg,ng iid∼ Np(µg, Σg) for
g = 1, ...,G. Then to test
H0 : Σ1 = Σ2 = ... = ΣG
Ha : Σi 6= Σj for at least one pair of (i,j) where i 6= j = 1, ...,G
the Bartlett’s test statistic is
B =
G∏ g=1
|Sg| 1 2
(ng−1)
|S0| 1 2
(n−G) (8.1)
which has a χ2G−1 distribution under H0
where, G∑ g=1
ng = n, Sg =
ng∑ i=1
(Xi−X̄g)(Xi−X̄g)′
ng−1 , S0 =
G∑ g=1
ng∑ i=1
(Xi−X̄g)(Xi−X̄g)′
n−G , g =
1, ...,G.
Clearly, the Bartlett’s test statistic B is a function of generalized variance
that is only based on the volumes (i.e. determinants) of the sample co-
variance matrix, and thus may not be powerful enough to detect the true
difference among population covariances that have the same volume but dif-
ferent orientations and shapes. To consider all possible information from
a multivariate dataset in high-dimension, developing an alternative way to
find the maximum discrepancies of the covariance matrices is necessary to
capture the true covariance differences better than the Bartlett’s test would.
The idea of Projection pursuit Ellipses (PPE) was introduced by Wei and
Sun (manuscript). PPE was motivated by the concept of Projection Pursuit
[12, 34, 47, 48]. PPE explores the covariance structures of high-dimensional
81
data by projecting high-dimensional data into low-dimensional spaces and
comparing the projected ellipses of the covariances. Three PPE indices are
proposed where the maximum of each index shows the maximum discrepancy
of the projected ellipses. A PPE algorithm that maximizes these indices us-
ing constrained optimization had been developed. The three PPE indices
and the PPE algorithm will be reviewed in Sections 8.2 and 8.3 respectively.
In Chapter 9, our proposed hPPE algorithm will be introduced and a power
analysis using simulated data will be conducted.
8.2 PPE Index
Consider the case that G = 2 groups. W1,1, ...,W1,n1 ∈<2 and W2,1, ...,W2,n2 ∈
<2 be the linear projections of X1,1, ...,X1,n1 and X2,1, ...,X2,n2 by α i.e.,
W1,i = α TX1,i and W2,i = α
TX2,i for α Tα = I2. Here α is a p × 2 matrix
α = (α1,α2) representing a particular 2−dimensional space spanned by two
p−dimensional vectors α1 and α2.
Then W (j) (α)
= (Wj,1, ...,Wj,nj ) has mean α TµX,j and covariance α
T ΣX,jα,
denoted as W (j) (α) ∼ (αTµX,αT ΣXα) for j = 1, 2. The three PPE indices
proposed were
• PPE Index 1: PPE index based on 2 −dimensional Bartlett’s test
• PPE Index 2: PPE index based on a generalized distance
• PPE Index 3: PPE index based on a singular value decomposition
(SVD)
PPE index 1 applies Bartlett’s test to the projected data, PPE index 2 applies
a generalized F-test for testing equality of variance. PPE index 3 uses SVD to
incorporate the orientation and shape of the ellipse of the projected data. The
rationale for the three indices is as follows. An F-test for testing the equality
82
of 2 1 −dimensional variances is the most powerful test in 1 −dimension.
Both the Bartlett’s test and F-test are simple generalizations of the univariate
F-test to higher dimensions. Although, these simple direct generalization
may not carry its power to high dimension, it may have a better power in
2 −dimensional projection pursuit. Hence, we proposed PPE indices 1 and
2. PPE index 3 uses SVD to bring back the information that may have lost
in PPE indices 1 and 2.
8.2.1 PPE Index 1
PPE Index 1, I1 is based on 2-d Bartlett’s test and is defined as
I1 = −(n1 − 1)ln|S1|− (n2 − 1)ln|S2| + (n− 2)ln|S0| (8.2)
where S1 and S2 are 2×2 covariance matrix of W (1) = W (1) (α)
and W (2) =
W (2) (α)
and S0 is the pooled covariance matrix of S1 and S2. Under the null
distribution, it is known that I1 ∼ χ21.
8.2.2 PPE Index 2
PPE index 2, I2 is a generalization of the univariate F-test for testing the
equality of 2 covariances Σ1 and Σ2. In the univariate case p = 1, under
the null hypethesis, s22/s 2 1 ∼ F(n2 − 1,n1 − 1) where s21 and s22 are sample
variances of the first and second group samples.
Similarly, I2(α) is defined as
I2(α) = d̃(S −1 1 S2 − I2) (8.3)
where S1 and S2 are 2×2 covariance matrix of W (1) = W (1) (α)
and W (2) = W (2) (α)
and d̃ = ||D|| = ||(αT Σ1α)−1(αT Σ2α)−I2|| = √ d211 + d
2 12 + d
2 21 + d
2 22, where
α = (α1,α2) and D ∆ = (αT Σ1α)
−1(αT Σ2α) − I2 ∆ =
( d11 d12 d21 d22
) . I2(α) does
83
not follow a known distribution, so its critical value for conducting the test of
equal covariance was simulated under the null distribution. Specifically, 1000
I2(α) were simulated and its simulated value at 95% quantile was selected
to be the critical value at the significance level of 0.05.
8.2.3 PPE Index 3
PPE index 3 (I3) used SVD to incorporate the shape and orientation of the
ellipse by a projected data draws. By SVD on S1 and S2, where S1 and S2
are 2×2 covariance matrix of W (1) = W (1) (α)
and W (2) = W (2) (α)
it can be shown
that,
S1 =
( cos θ1 −sin θ1 sin θ1 cos θ1
)( a21 0 0 b21
)( cos θ1 −sin θ1 sin θ1 cos θ1
) (8.4)
,
S2 =
( cos θ2 −sin θ2 sin θ2 cos θ2
)( a22 0 0 b22
)( cos θ2 −sin θ2 sin θ2 cos θ2
) (8.5)
.
Then I3(α) is defined as
I3(α) =
[(( a2 a1
)2 −1 )2
+
(( b2 b1
)2 −1 )2
+2
( a2/b2 a1/b1
)2 sin2(θ1−θ2)
]1 2
(8.6)
where,
(( a2 a1
)2 − 1 )2
and
(( b2 b1
)2 − 1 )2
has the shape information and
sin2(θ1 − θ2) has the orientation information on the ellipse of the projected
data. I3(α) does not follow a known distribution, so its critical value for
conducting the test of equal covariance was simulated under the null distri-
bution. Specifically, 1000 I3(α) were simulated and the simulated value at
95% quantile was selected to be the critical value at the significance level of
0.05.
84
8.3 PPE Algorithm
A PPE algorithm is an optimization algorithm that maximizes a PPE in-
dex over all possible projections of the data, i.e. over α s.t. αTα = I2.
Thus, this is a constrained optimization problem. A PPE algorithm must
be developed in consideration of efficiency and reliability of the algorithm.
As a testing procedure, a PPE algorithm needs to find the global maximum
of I(α) : max α:αT α=I2
I(α). As an exploratory or visualization tool, a PPE al-
gorithm needs to find the substantial local maxima instead of finding one
single global maximum in order to have additional insight into the multiple
discrepancies of the high-dimensional variance matrices. Constrained op-
timization itself is a difficult problem, especially when both the objective
function and the constraint functions are non-linear as in the PPE algo-
rithm. The original PPE algorithm employed an optimization tailored to a
non-linear objective function subject to non-linear equality constraints [18].
The key idea of the method is to incorporate the constraints into a modified,
unconstrained objective function, which can then be optimized by uncon-
strained optimization techniques such as a quasi-Newton method. The al-
gorithm, therefore, can be called a constrained quasi-Newton method. This
optimization method was efficient and reliable in finding a local maximum.
Since PPE index 2 and PPE index 3 have many local maxima under the
null distribution, a PPE algorithm with an enhanced global search on the
quasi-Newton method was developed. Specifically, let X and Y be two data
matrices with dimensions p×n1 and p×n2 respectively. With two p×1 pro-
jection vectors α1 and α2, W (1) (α)
and W (2) = W (2) (α)
be the linear projections of
X1 = (X1,1, ...,X1,n1 ) and X2 = (X2,1, ...,X2,n2 ) with respect to α = (α1,α2),
i.e. W (j) (α)
= (Wj,1, ...,Wj,nj ) for j = 1, 2. The constraints required by the
PPE algorithm are
85
• α1α1 = 1
• α2α2 = 1
• α1α2 = 0
Under the three constraints, the following the next 3 steps form the PPE
algorithm.
1. Preliminary Search: Search along p(p− 2)
2 pairs of coordinate axes and
find α(1) = (α (1) 1 ,α
(1) 2 ) that maximizes the PPE index.
2. Course Search: Starting at α(1), perform a large stepping search along
the pairs of orthogonal directions and find α(2) that maximizes the PPE
index.
3. Local Search: Starting at α(2), perform a local search using a derivative
free constrained optimization by linear approximations (COBYLA) al-
gorithm [39].
To simulate critical values of our PPE index (or test), we generate 2-data
samples. We keep the maximum from step 3 as an estimate of max α:αT α=I2
I(α)
under the H0. We repeatedly generate 2-data samples under H0 and obtain
the max α:αT α=I2
I(α) 1000 times and then compute the empirical upper γ-quantile
of max α:αT α=I2
I(α) as the critical value at the level γ, e.g. γ = 0.05.
Chapter 9
High-dimensional PPE (hPPE)
9.1 hPPE algorithm
A PPE algorithm for testing the equality of two covariance matrices provides
informative information in finding the projection that describes the most
dependencies of the two data. Therefore, we included another step of the
optimization algorithm to the original PPE algorithm to improve the chance
of finding the true global maximum of the index to improve the power of PPE
in testing for equal covariance matrices in high-dimension. High-dimensional
PPE (hPPE) algorithm
1. Step 1: Select T number of random projection satisfying the constraints
and conduct local search using COBYLA as in the step 3 of the original
PPE algorithm using each of random projections as its initial value.
2. Step 2: Conduct the original PPE algorithm.
Compare the index values obtained from steps 1 and 2 an decide the projec-
tion that maximizes the PPE index among these T +1 values. PPE algorithm
and hPPE algorithm are reprogrammed and implemented using Julia version
0.6.2.
86
87
9.2 hPPE power analysis
To evaluate the performance of our hPPE, we conducted a power analysis
using simulated data to compare our method with the Bartlett’s test and to
a high-dimension method proposed by T.Cai et al [4]. Bartlett’s test was
provided in Section 8.1. T.Cai’s method is introduced below.
Let X1,1, ...,X1,n1 iid∼ F(Σ1) and X2,1, ...,X2,n2
iid∼ F(Σ2) where, Σ1 = (σij1)p×p,
Σ2 = (σij2)p×p and n = n1 + n2. Then the null hypothesis H0 : Σ1 = Σ2 is
identical to H0 : max1≤i≤j≤p|σij1 −σij2| = 0. Thus, T.Cai et al proposed the
test statistic,
Mn =: max 1≤i≤j≤p
Mij = max 1≤i≤j≤p
(σ̂ij1 − σ̂ij2)2
(θ̂ij1/n1 + θ̂ij2/n2) (9.1)
where, θ̂ij1/n1 and θ̂ij2/n2 are estimated variances of σ̂ij1 and σ̂ij2. Then test
Φα is defined as
Φα = I(Mn ≥ qα + 4logp− loglogp)
where qα is the 1 − α quantile of the Type 1 extreme value distribution
p,n →∞ with qα = −log(8π) − 2loglog(1 −α)−1.
The power analysis was done to test for H0 : Σ1 = Σ2 where Σ1 and Σ2 are
covariance matrices of simulated p − dimensional data X1 and X2. X1 =
(X1,1, ...,X1,n) iid∼ N(0, Σ1) and X2 = (X2,1, ...,X2,n)
iid∼ N(2, Σ2) where Σ1 =
Ip, Σ2 = R ′IpR and R is a known p×p matrix that reflects the ratio of the
data at each dimension.
We compared 5 different methods using PPE and hPPE.
• PPE Index 1 (Using 2-d Bartlett’s Test)
88
• PPE Index 2 (Using Generalized Distance)
• PPE Index 3 (Using Singular Value Decomposition)
• Bartlett’s Test
• T.Cai et al Method
For the significance level of 0.05, n = 100 and m number of simulations,
the result using PPE is provided in Table 9.1 and the result using hPPE is
provided in Table 9.2.
Method p = 20, m = 1000 p = 50, m = 500 PPE Index1 90.4 89.2 PPE Index2 86.4 72.8 PPE Index3 99.9 96.4
Bartlett’s Test 89.6 86.4 T.Cai et al Test 99.8 95.4
Table 9.1: Empirical power in percents comparison for PPE
Method p = 20, m = 100 PPE Index1 86 PPE Index2 88 PPE Index3 100
Bartlett’s Test 74 T.Cai et al Test 99
Table 9.2: hPPE empirical power in percents comparison for hPPE
Table 9.1 shows that when using PPE algorithm, PPE Index 3, which in-
corporates the shape and orientation of the data using SVD, has the highest
power for data with 20 dimensions and 50 dimensions. However, T.Cai et
al’s method almost has as good power as PPE index 3. Table 9.2 shows that
for a data with 20 dimensions and 100 simulation, using hPPE, PPE index
3 captures all of the difference and T.Cai et al’s method did not capture 1.
Chapter 10
Discussion and Conclusion
Power analysis shows that PPE index 3 using SVD to incorporate data
shape and orientation performs the best in capturing the difference of the
covariance matrices between two high-dimensional samples for both PPE and
hPPE. This implies that PPE index using SVD works the best in conducting
the test of equal covariance between two groups for high dimensional data.
For data with 20 dimensions and 50 dimensions, PPE with PPE index 3 will
capture 99.9% and 96.4% of the difference in covariance matrices respectively.
Even though the number of simulation was 100, hPPE was able to capture
all of the different covariances. However, hPPE is designed to be applied
to much higher dimension and the number of simulations was 100 which
could not be enough to derive the relative and absolute performance of the
method. Therefore, further studies for higher dimension and higher number
of simulations could be our first future research goal using better hardware
environment.
Both PPE and hPPE are computationally intensive. Using a high perfor-
mance computing environment, under 36 hours of limitation, PPE was able
to conduct power analysis with 1000 simulations for 20 dimensions and 500
dimensions for 50 dimensions and hPPE was able conduct power analysis
with 100 simulations for 20 dimensions. Power analysis for higher dimen-
89
90
sions and more simulation iterations will require better hardware or a better
PPE algorithm. Both PPE and hPPE use Constrained Optimization by Lin-
ear Approximation (COBYLA) which is a local derivative free optimization
method for local optimization. However, the local optimization with the con-
straint of PPE and hPPE algorithm resulted in very intensive computation
based on the duration of the computation. This could be extended to general
local optimization problem which could be our second future research goal
to improve PPE and hPPE.
Appendix A
Appendix
In this Appendix, we will provide the proof of Proposition 1 in Section A.1,
proof of Lemma 1 in Section A.2, and proof of Proposition 2 in Section
A.3.
A.1 Proof of Proposition 1
Recall that when αi are observable, the log-likelihood function of our model
in (4.14) was
l(β,(σ2k)k=1,...,q, (ρkl)k 6=l=1,...,q,σ 2 0,φ)
=
( − m
2 ln(σ20 ) +
m
2 ln(1 −φ2) −
1
2
m∑ i=1
ln|Σ(i) � |−
1 −φ2
2σ20
m∑ i=1
r′ i (Σ(i)
� )−1ri
) +
( − N
2 ln|G|−
1
2
m∑ i=1
Ji∑ j=1
αi ′G−1αi
) + C
(A.1)
for some constant C and the covariance matrix of the random effects G is
identical for all subjects.
91
92
A.1.1 MLE for β
Take the partial derivative of the log-likelihood with respect to β
∂l
∂β =
∂
∂β
( −
1 −φ2
2σ20
m∑ i=1
r′i(Σ (i) � ) −1ri
)
= ∂
∂β
( −
1 −φ2
2σ20
m∑ i=1
(Yi −Xiβ −Ziαi)′(Σ(i)� ) −1(Yi −Xiβ −Ziαi)
)
= 1 −φ2
2σ20
m∑ i=1
X′i(Σ (i) � ) −1(Yi −Xiβ −Ziαi)
Setting ∂l ∂β
= 0 leads to
( m∑ i=1
X′i(Σ (i) � ) −1Xi
) β =
m∑ i=1
Xi(Σ (i) � ) −1(Yi −Ziαi)
Thus,
β̂MLE =
( m∑ i=1
X′i(Σ (i) � ) −1Xi
)−1( m∑ i=1
Xi(Σ (i) � ) −1(Yi −Ziαi)
) (A.2)
A.1.2 MLE for (σ2k)k=1,...,q and (ρkl)k 6=l=1,...q
To get MLE for (σ2k)k=1,...,q, take the partial derivative of the log-likelihood
with respect to σ2k.
∂l
∂σ2k =
∂
∂σ2k
( − N
2 ln|G|−
1
2
m∑ i=1
Ji∑ j=1
α′iG −1αi
)
= − N
2 tr
( G−1
∂G
∂σ2k
) −
1
2
m∑ i=1
Ji∑ j=1
α′i ∂G−1
∂σ2k αi
= − N
2 tr
( G−1
∂G
∂σ2k
) −
1
2
m∑ i=1
Ji∑ j=1
α′iG −1 ∂G
∂σ2k G−1αi
93
Setting ∂l ∂σ2k
= 0 leads to the following equation.
m∑ i=1
Ji∑ j=1
α′iG −1 ∂G
∂σ2k G−1αi = Ntr
( G−1
∂G
∂σ2k
) (A.3)
MLE of σ2k can be obtained by solving the equation (A.3) for σ 2 k.
For a model with q random effects, G is a q × q matrix and the inverse of
G needs to be calculated. Therefore, without loss of generality (WLOG), we
will consider a simple case with two random effects to derive the MLE of
(σ2k)k=1,2 and (ρkl)k 6=l=1,2. In the simple case, the random effects αi and the
covariance matrix of the random effects G can be defined as,
αi =
( αi1 αi2
) ,G =
( σ21 ρ12σ1σ2
ρ21σ2σ1 σ 2 2
)
where, ρ12 = ρ21 = ρ. So, to solve the equation (A.3) for σ 2 1 , we first need to
get G−1 and ∂G ∂σ21
as below.
G−1 = 1
σ21σ 2 2 (1 −ρ
2)
( σ22 −ρσ1σ2
−ρσ2σ1 σ21
)
∂G
∂σ21 =
∂σ21 ∂σ21
∂ρσ1σ2 ∂σ21
∂ρσ2σ1 ∂σ21
∂σ22 ∂σ21
=
( 1
ρσ2 2σ1ρσ2
2σ1 0
)
where
∂ρσ1σ2
∂σ21 = ρσ2
∂(σ21 ) 1 2
∂σ21 = ρσ2 2
(σ21 ) −1
2 = ρσ2 2σ1
94
Then by multiplying G−1 with ∂G ∂σ21
, we get,
G−1 ∂G
∂σ21 =
1
σ21σ 2 2 (1 −ρ
2)
( σ22 −ρσ1σ2
−ρσ2σ1 σ21
)( 1
ρσ2 2σ1ρσ2
2σ1 0
)
= 1
σ21σ 2 2 (1 −ρ
2)
σ22 − ρ22 σ22 ρσ322σ1 −ρ 2 σ1σ2 −
ρ2
2 σ22
By multiplying G−1 again we get,
G−1 ∂G
∂σ21 G−1 =
1
σ41σ 4 2 (1 −ρ
2)2
σ42 −ρ2σ42 ρ32 σ1σ32 − ρ2σ1σ32 ρ3
2 σ1σ
3 2 −
ρ 2 σ1σ
3 2 0
Using these calculations, we can now solve the equation (A.3) as below.
(A.3)LHS = m∑ i=1
Ji∑ j=1
α′iG −1 ∂G
∂σ21 G−1αi =
1
σ41σ 4 2 (1 −ρ
2)2
m∑ i=1
Ji(σ2α 2 i1 −ρσ1αi1αi2)
(A.3)RHS = Ntr
( G−1
∂G
∂σ21
) = N
σ21
Setting (A.3)LHS = (A.3)RHS, when σ21 6= 0, σ22 6= 0, and ρ2 6= 1, we get
Nσ21σ2(1 −ρ 2) = σ2
m∑ i=1
Jiα 2 i1 −ρσ1
m∑ i=1
Jiαi1αi2 (A.4)
Similarly, from ∂l ∂σ22
we get,
Nσ1σ 2 2 (1 −ρ
2) = σ1
m∑ i=1
Jiα 2 i2 −ρσ2
m∑ i=1
Jiαi1αi2 (A.5)
To get MLE for ρ we need to take the partial derivative of the log-likelihood
with respect to ρ and solve equation (A.6) for ρ,
m∑ i=1
Ji∑ j=1
α′iG −1 ∂G
∂ρ G−1αi = Ntr
( G−1
∂G
∂ρ
) (A.6)
95
To solve (A.6), we need to calculate ∂G ∂ρ
in addition to what we already have.
∂G
∂ρ =
∂σ
2 1
∂ρ ∂ρσ1σ2 ∂ρ
∂ρσ2σ1 ∂ρ
∂σ22 ∂ρ
= ( 0 σ1σ2
σ2σ1 0
)
Then by multiplying G−1 with ∂G ∂ρ
, we get,
G−1 ∂G
∂ρ =
1
σ21σ 2 2 (1 −ρ
2)
( σ22 −ρσ1σ2
−ρσ2σ1 σ21
)( 0 σ1σ2
σ2σ1 0
)
= 1
σ21σ 2 2 (1 −ρ
2)
( −ρσ21σ22 σ1σ32 σ31σ2 −ρσ21σ22
)
G−1 ∂G
∂ρ G−1 =
1
σ21σ 2 2 (1 −ρ
2)2
( −2ρσ22 (1 + ρ2)σ1σ2
(1 + ρ2)σ1σ2 −2ρσ21
)
Using these calculation, we can solve the equation (A.6) as below.
(A.6)LHS = m∑ i=1
Ji∑ j=1
α′iG −1 ∂G
∂ρ G−1αi =
1
σ21σ 2 2 (1 −ρ
2)2
m∑ i=1
Ji(−2ρ(σ22α 2 i1 + σ
2 1α
2 i2) + 2(1 + ρ
2)σ1σ2αi1αi2)
(A.6)RHS = Ntr
( G−1
∂G
∂ρ
) = −
2Nρ
1 −ρ2
Setting (A.6)LHS = (A.6)RHS, when ρ2 6= 1, we get
− 2Nρσ21σ 2 2 (1 −ρ
2) = −2ρ ( σ22
m∑ i=1
Jiα 2 i1 + σ
2 1
m∑ i=1
Jiα 2 i2
) + 2(1 + ρ2)σ1σ2
m∑ i=1
Jiαi1αi2
(A.7)
Denote A, B, and C as below,
A ∆ ≡
m∑ i=1
Jiα 2 i1, B
∆ ≡
m∑ i=1
Jiα 2 i2, C
∆ ≡
m∑ i=1
Jiαi1αi2
96
Then we can rewrite (A.4), (A.5), and (A.7) as,
Nσ21σ2(1 −ρ 2) = Aσ2 −Cρσ1 (A.4-1)
Nσ1σ 2 2 (1 −ρ
2) = Bσ1 −Cρσ2 (A.5-1)
− 2Nρσ21σ 2 2 (1 −ρ
2) = −2ρσ22A− 2ρσ 2 1B + 2(1 + ρ
2)σ1σ2C (A.7-1)
By solving (A.4-1), (A.5-1), and (A.7-1), we get,
σ̂21MLE = 1
N
m∑ i=1
Jiα 2 i1 (A.8)
σ̂22MLE = 1
N
m∑ i=1
Jiα 2 i2 (A.9)
ρ̂MLE =
1
N
m∑ i=1
Jiαi1αi2√√√√( 1 N
m∑ i=1
Jiα 2 i1
)√√√√( 1 N
m∑ i=1
Jiα 2 i2
) (A.10)
This can be easily extended to q number of random effects when the random
effect αi for subject i and the covariance matrix of the random effects G is
defined as,
αi =
αi1 αi2 ... ... αqi
,G =
σ21 ρ12σ1σ2 ρ13σ1σ3 · · · ρ1qσ1σq ρ21σ2σ1 σ
2 2 ρ23σ2σ3 · · · ρ2qσ2σq
... ... σ23 · · ·
... ...
... ...
. . . ...
ρq1σqσ1 ρq2σqσ2 ρq3σqσ3 · · · σ2q
Then the MLE of σk where k = 1, ...,q and ρkl where k 6= l = 1, ...,q can be
97
defined as,
σ̂2k = 1
N
m∑ i=1
Jiα 2 ik
ρ̂kl =
1
N
m∑ i=1
Jiαikαil√√√√( 1 N
m∑ i=1
Jiα 2 ik
)√√√√( 1 N
m∑ i=1
Jiα 2 il
)
A.1.3 MLE for σ20
First, we take partial derivative of the log-likelihood function with respect to
σ20 and get
∂l
∂σ20 =
∂
∂σ20
( − m
2 ln(σ20 ) −
1 −φ2
2σ20
m∑ i=1
ri ′(Σ(i)� )
−1ri
)
= − m
2σ20 +
1 −φ2
2σ40
m∑ i=1
ri ′(Σ(i)� )
−1ri
where ri = Yi −Xiβ̂−Ziα̂i. Setting ∂l ∂σ20
= 0 leads to
m
2σ20 =
1 −φ2
2σ40
m∑ i=1
ri ′(Σ(i)� )
−1ri
Then solving this equation for σ20 when σ 2 0 6= 0 will result in,
σ̂20 = 1 −φ2
m
m∑ i=1
ri ′(Σ(i)� )
−1ri (A.11)
To get the MLE of σ20 from (A.11) requires calculation of (Σ (i) � ) −1 which can
be obtained using Cholesky decomposition. In case for 4 repeated measures
98
(Ji = 4),
Σ(i)� =
1 φ φ2 φ3
φ 1 φ φ2
φ2 φ 1 φ φ3 φ2 φ 1
(Σ(i)� )−1 = 11 −φ2
1 −φ 0 0 −φ 1 + φ2 −φ 0 0 −φ 1 + φ2 −φ 0 0 −φ 1
and with 5 repeated measures (Ji = 5),
Σ(i)� =
1 φ φ2 φ3 φ4
φ 1 φ φ2 φ3
φ2 φ 1 φ φ2
φ3 φ2 φ 1 φ
(Σ(i)� )−1 = 11 −φ2
1 −φ 0 0 0 −φ 1 + φ2 −φ 0 0 0 −φ 1 + φ2 −φ 0 0 0 −φ 1 + φ2 −φ 0 0 0 −φ 1
Because (A.11) has two unknown variable sigma20 and φ, we need to derive
the equation for φ to get the MLE of σ20 .
A.1.4 MLE for φ
First we take the partial derivative of the log-likelihood with respect to φ
and get
∂l
∂φ =
∂
∂φ
( m
2 ln(1 −φ2) −
1
2
m∑ i=1
ln|Σ(i) � |−
1 −φ2
2σ20
m∑ i=1
ri ′(Σ(i)
� )−1ri
)
= − 2φ
1 −φ2 −
1
2
m∑ i=1
tr
( (Σ(i)
� )−1
∂Σ(i) �
∂φ
) +
1
2σ20
m∑ i=1
ri ′(Σ(i)
� )−1
∂Σ(i) �
∂φ (Σ(i)
� )−1ri
Setting ∂l ∂φ
= 0 leads to
4φσ20 1 −φ2
+ σ20
m∑ i=1
tr
( (Σ(i)
� )−1
∂Σ(i) �
∂φ
) =
m∑ i=1
ri ′(Σ(i)
� )−1
∂Σ(i) �
∂φ (Σ(i)
� )−1ri
(A.12)
and solve (A.12) for φ. We will solve the equation for Ji = 4 and Ji = 5 and
find the general solution. After a careful calculation of tr
( (Σ(i)
� )−1 ∂Σ
(i) �
∂φ
)
99
and ri ′(Σ(i)
� )−1 ∂Σ
(i) �
∂φ (Σ(i)
� )−1ri for Ji = 4 and Ji = 5, we got,
(A.12)LHS (Ji=4)
= 4φσ20
1 −φ2 +
m∑ i=1
−σ20φ 1 −φ2
× (1 + 2 + 2 + 1)
(A.12)LHS (Ji=5)
= 4φσ20
1 −φ2 +
m∑ i=1
−σ20φ 1 −φ2
× (1 + 2 + 2 + 2 + 1)
Therefore, in general,
(A.12)LHS = φσ20
1 −φ2
( 4 −
m∑ i=1
(2 + 2 × (Ji − 2)) )
= −2φσ20 1 −φ2
(N − (m− 2))
For Ji = 4, let ri = (ri1,ri2,ri3,ri4) and for Ji = 5, let ri = (ri1,ri2,ri3,ri4,ri5)
then
(A.12)RHS (Ji=4)
= 1
(1 −φ2)2 m∑ i=1
((−2φr2i1 − 4φr 2 i2 − 4φr
2 i3 − 2φr
2 i4)
+ 2(1 + φ2)(ri1ri2 + ri2ri3 + ri3ri4))
(A.12)RHS (Ji=5)
= 1
(1 −φ2)2 m∑ i=1
((−2φr2i1 − 4φr 2 i2 − 4φr
2 i3 − 4φr
2 i4 − 2φr
2 i5)
+ 2(1 + φ2)(ri1ri2 + ri2ri3 + ri3ri4 + ri4ri5))
Therefore, in general,
(A.12)RHS = 1
(1 −φ2)2 m∑ i=1
[ − 2φ(r2i1 + r
2 iJi
+ 2
Ji−1∑ j=2
r2ij) + 2(1 + φ 2)
Ji−1∑ k=1
rikri(k+1)
]
Setting (A.12)LHS = (A.12)RHS, we get,
σ20 (N − (m− 2))φ 3 −Cφ2 + (D −σ20 (N − (m− 2)))φ−C = 0 (A.13)
when φ2 6= 1 and denoting C and D as,
C ∆ ≡
m∑ i=1
Ji−1∑ k=1
rikri(k+1) D ∆ ≡
m∑ i=1
( r2i1 + r
2 iJi
+ 2
Ji−1∑ j=2
r2ij
)
By solving (A.11) and (A.12) we can obtain the MLE of σ20 and φ.
100
A.2 Proof of Lemma 1
Let p(·) denote the probability density of its argument(s). Denote Y =
(Y1, ...,Ym) and Y(−i) = (Y1, ...,Y(i−1),Y(i+1)Ym). If (αi,Yi) for i = 1, ...m are
independent to each other, then,
p(αi|Y ) = p(αi|Yi,Y(−i)) = p(αi,Yi,Y(−i))
p(Yi,Y(−i))
= p(αi,Yi)p(Y(−i))
p(Yi)p(Y(−i)) = p(αi|Yi)
(A.14)
A.3 Proof of Proposition 2
When the joint distribution of Y and α follow Normal as,( α Y
) ∼ N
(( 0 Xβ
) ,
( var(α) cov(α,Y )
cov(Y ,α) var(Y )
)) (A.15)
Then the conditional distribution of α given Y follows Normal distribution
as,
α|Y ∼ N(E(α) + cov(α,Y )var(Y )−1(Y −Xβ),
var(α) − cov(α,Y )var(Y )−1cov(Y ,α)) (A.16)
and,
E(α) = 0,
var(Y ) = var(Xβ + Zα + �) = var(Zα) + var(�) = Zvar(α)Z′ + var(�)
= ZGZ′ + R (A.17)
where R = σ20
1 −φ2 Σ� and Σ� =
Σ� (1) 0 0 · · · 0
0 Σ� (2) 0 · · · 0
0 0 Σ� (3) · · · 0
... ...
. . . . . .
...
0 · · · · · · 0 Σ�(m)
cov(α,Y ) = cov(α,Xβ + Zα + �) = cov(α,Zα) = var(α)Z′ = GZ′
(A.18)
101
Since Gi = G for i = 1, ...,m, using (A.17) and (A.18), we can get
E(αi|Yi) = E(αi) + cov(αi,Yi)var(Yi)−1(Yi −Xiβ)
= 0 + GiZi ′(ZiGiZi
′ + Ri) −1(Yi −Xiβ)
= GZi ′(ZiGZi
′ + Ri) −1(Yi −Xiβ) (A.19)
var(αi|Yi) = var(αi) − cov(αi,Yi)var(Yi)−1cov(Yi,αi)
= Gi −GiZi′(ZiGiZi′ + Ri)−1ZiGi
= G−GZi′(ZiGZi′ + Ri)−1ZiG (A.20)
E(αi|Yi)E(αi|Yi)′ = GZi′(ZiGZi′ + Ri)−1(Yi −Xiβ)
× (Yi −Xiβ)′(ZiGZi′ + Ri)−1ZiG (A.21)
Thus,
E(αiαi ′|Yi) = var(αi|Yi) + E(αi|Yi)E(αi|Yi)′
= (G−GZi′(ZiGZi′ + Ri)−1ZiG)
+ GZi ′(ZiGZi
′ + Ri) −1(Yi −Xiβ)(Yi −Xiβ)′(ZiGZi′ + Ri)−1ZiG
(A.22)
where, G = G(σ2k,ρkl) and Ri = Ri(σ 2 0,φ).
102
Bibliography
[1] Anderson, T. An Introduction to Multivariate Statistical Analysis,
3rd ed. Wiley Series in Probability and Statistics, 2003.
[2] Bartlett, M. Properties of sufficiency and statistical tests. Proceed-
ings of the Royal Society A: Mathematical, Physical and Engineering
Sciences 160(901) (1937), 268–282.
[3] Bishop, C. Pattern Recognition and Machine Learning. Springer, 2006.
[4] Cai, T., Liu, W., and Xia, Y. Two-sample covariance matrix testing
and support recovery in high-dimensional and sparse settings. Journal
of American Statistical Association 108(501) (2013), 265–277.
[5] Calinski, T., and Harabasz, J. A dendrite method for cluster
analysis. Communications in Statistics 3 (1974), 1–27.
[6] Cenik, C., E.S. Cenik, a. G. B., Grubert, F., Candille, S.,
Spacek, D., Alsallakh, B., Tilgner, H., Araya, C., Tang, H.,
Ricci, E., and Snyder, M. Integrative analysis of rna, translation,
and protein levels reveals distinct regulatory variation across humans.
Genome Research 25 (2015), 1610–1621.
[7] Davies, D., and Bouldin, D. A cluster separation measure. IEEE
Transactions on Pattern Analysis and Machine Intelligence 1 (1979),
224–227.
[8] Department of Veteran Affairs. Specialty care access network-
extension for community healthcare outcomes (SCAN-ECHO) fact
sheet.
[9] Faraway, J. J. Extending the Linear Model with R. Chapman and
Hall/CRC, 2006.
103
[10] Firpoa, S., A.F.Galvaob, and Song, S. Measurement errors in
quantile regression models. Journal of Econometrics 198(1) (2017), 146–
164.
[11] Fisher, R. The correlation between relatives on the supposition of
mendelian inheritance. Transactions of the Royal Society of Edinburgh
52(2) (1918), 399–433.
[12] Friedman, J., and Tukey, J. A projection pursuit algorithm for
exploratory data analysis. IEEE trans. comput. C-23 (1974), 881–889.
[13] Gao, K., and Owen, A. Efficient moment calculations for variance
components in large unbalanced crossed random effects models. Elec-
tronic Journal of Statistics 11(1) (2017), 1235–1296.
[14] Gebregziabher, M., Egede, L., Gilbert, G., Hunt, K., Ni-
etert, P., and Mauldin, P. Fitting parametric random effects mod-
els in very large data sets with application to vha national data. BMC
Medical Research Methodology 12 (2012), 163–177.
[15] Genolini, C., and Falissard, B. Kml: A package to cluster lon-
gitudinal data. Computer Methods and Programs in Biomedicine 104
(2011), e112–e121.
[16] Genolini, C., and Falissard, B. Package ‘kml’, 2016.
[17] Guide, S. . U. The hpmixed procedure, 2013.
[18] Haarhoff, P., and Buys, J. D. A new method for the optimization
of a nonlinear function subject to nonlinear constraints. The Computer
Journal 13(2) (1970), 178184.
[19] Hartley, H., and Rao, J. Maximum-likelihood estimation for the
mixed analysis of variance model. Biometrika 54(1-2) (1967), 93–108.
104
[20] Henderson, C., Kempthorne, O., Searle, S., and von
Krosigk, C. The estimation of environmental and genetic trends from
records subject to culling. Biometrics 15(2) (1959), 192–218.
[21] Jackson, C., N, B., and Richardson, S. Hierarchical related re-
gression for combining aggregate and individual data in studies of socio-
economic disease risk factors. J R Stat Soc, Series A 171 (2008), 159178.
[22] Jiang, J. Reml estimation: asymptotic behavior and related topics.
The Annals of Statistics 24(1) (1996), 255–286.
[23] Jiang, J. Linear and Generalized Linear Mixed Models and Their Ap-
plications. Springer, 2007.
[24] Johnson, D., and Thompson, R. Restricted maximum likelihood
estimation of variance components for univariate animal models using
sparse matrix techniques and average information. Journal of Dairy
Science 78 (1995), 449456.
[25] Johnson, R., and Wichern, D. Applied Multivariate Statistical
Analysis 6th Edition. Pearson.
[26] Kackar, R., and Harville, D. Approximations for standard errors
of estimators of fixed and random effects in mixed linear models. Journal
of American Statistical Association 79(388) (1984), 853–862.
[27] Koonce, T., Giuse, N., Kusnoor, S., Hurley, S., and Ye, F.
A personalized approach to deliver health care information to diabetic
patients in community care clinics. Journal of the Medical Library As-
sociation 103(3) (2015), 123130.
[28] Laird, N., and Ware, J. Random-effects models for longitudinal
data. Biometrics 38(4) (1982), 963–974.
105
[29] Lindstrom, M., and Bates, D. Newton-raphson and em algorithms
for linear mixed-effects models for repeated measures data. Journal of
the American Statistical Association 83(404) (1988), 1014–1022.
[30] Lucio-Eceiza, E., and González-Rouco, J. Quality control of
surface wind observations in northeastern north america. part ii: Mea-
surement errors. Journal of Atmospheric and Oceanic Technology 35(1)
(2018), 163182.
[31] Miller, R. Big data curation. 20th International Conference on Man-
agement of Data (COMAD) 2014, Hyderabad, India, 2014.
[32] Mordecai, A. Nonlinear Programming: Analysis and Methods. Dover
Publishing, 2003.
[33] of Diabetes, N. I., Digestive, and Diseases, K. The a1c test &
diabetes, 2014.
[34] Park, M., and Sun, J. Tests in projection pursuit regression. Journal
of Statistical Planning and Inference 75(1) (1998), 65–90.
[35] Patterson, H., and Thompson, R. Recovery of inter-block infor-
mation when block sizes are unequal. Biometrika 58(3) (1971), 545–554.
[36] Pennell, M., and Dunson, D. Fitting semiparametric random ef-
fects models to large data sets. Biostatistics 8(4) (2007), 821834.
[37] Pfanzagl, J. Parametric Statistical Theory. Walter de Gruyter and
Co., 1991.
[38] Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., Heis-
terkamp, S., and Willigen, B. V. Package nlme, 2018.
[39] Powell, M. Direct search algorithms for optimization calculations.
Acta Numerica 7 (1998), 287–336.
106
[40] Ray, S., and Turi, R. Determination of number of clusters in k-means
clustering and application in colour image segmentation. in: Proceedings
of the 4th International Conference on Advances in Pattern Recognition
and Digital Techniques (ICAPRDT99) (1999), 137–143.
[41] Robinson, G. That blup is a good thing: The estimation of random
effects. Statistical Science 6(1) (1991), 15–32.
[42] SAS. SAS/STAT(R) 9.2 User’s Guide, Second Edition PROC MIXED,
2010.
[43] Shim, Y., Chung, J., and Choi, I. A comparison study of clus-
ter validity indices using a nonhierarchical clustering algorithm. in:
Proceedings of CIMCA-IAWTIC05-Volume 01, IEEE Computer Society
(2005), 199–204.
[44] Shumway, R. Applied Statistical Time Series Analysis. PRENTICE
HALL SERIES IN STATISTICS, 1988.
[45] Snedecor, G., and Cochran, W. Statistical Methods, 8th ed. Iowa
State University Press, 1989.
[46] Stanleym, T., and Jarrell, S. Meta-regression analysis: A quanti-
tative method of literature surveys. Journal of Economic Surveys 19(3)
(1989), 299–308.
[47] Sun, J. Significance levels in exploratory projection pursuit.
Biometrika, 78(4) (1991), 759–769.
[48] Sun, J. Projection Pursuit: Encyclopedia of Statistical Sciences. John
Wiley and Sons, Inc, 2006.
[49] Watts, S., L.Roush, Julius, M., and Sood, A. Improved glycemic
control in veterans with poorly controlled diabetes mellitus using a spe-
107
cialty care access network-extension for community healthcare outcomes
model at primary care clinics. Journal of Telemedicine and Telecare
22(4) (2015), 221–224.
[50] Watts, S., Strauss, G., Pascuzzi, K., ODay, M., Young, K.,
Aron, D., and Kirsh, S. Shared medical appointments for patients
with diabetes: Glycemic reduction in high-risk patients. Journal of the
American Association of Nurse Practitioners 17(8) (2015), 450456.
[51] Xu, Y., Sun, J., Carter, R., and Bogie, K. Personalized pre-
diction of chronic wound healing: An exponential mixed effects model
using stereophotogrammetric measurement. Journal of Tissue Viability
23(2) (2014), 48–59.
[52] Y.Wu, Ma, Y., and Yin, G. Smoothed and corrected score ap-
proach to censored quantile regression with measurement errors. JASA
110(512) (2016), 1670–1683.