help 4 pages

bcs

5.pdf

Home >Computer Science homework help >help 4 pages

PARTIAL EM PROCEDURE FOR BIG-DATA LINEAR MIXED

EFFECTS MODEL, AND GENERALIZED PPE FOR

HIGH-DIMENSIONAL DATA IN JULIA

JANG-IK CHO

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Dissertation Advisor: Dr. Jiayang Sun

Department of Population and Quantitative Health Sciences

CASE WESTERN RESERVE UNIVERSITY

March 2018

ProQuest Number:

INFORMATION TO ALL USERS The quality of this reproduction is dependent on the quality of the copy submitted.

In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed,

a note will indicate the deletion.

Published by ProQuest LLC (

ProQuest

). Copyright of the Dissertation is held by the Author.

Microform Edition © ProQuest LLC.

ProQuest LLC 789 East Eisenhower Parkway

P.O. Box 1346 Ann Arbor, MI 48106 - 1346

28099746

2020

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of

JANG-IK CHO

candidate for the Doctor of Philosophy degree *

Committee Chair: Dr. Jeffrey Albert Professor Department of Population and Quantitative Health Sciences

Committee: Dr. Jiayang Sun Dissertation Advisor Professor Department of Population and Quantitative Health Sciences

Committee: Dr. Mark Schluchter Professor Department of Population and Quantitative Health Sciences

Committee: Dr. David Aron Professor Department of Clinical and Molecular Endocrinology

Committee (optional member): Dr. Yifan Xu Applied Scientist II Amazon

March 2018

*We also certify that written approval has been obtained for any proprietary material contained therein.

Table of Contents

Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . vi Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . vii Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

I Big-data Linear Mixed Effects Model (bLMM) 1

1 Introduction and Challenges 2

2 Data Curation 10

3 Measurement Variability and Heterogeneity of the Popula- tion 21 3.1 Measurement Variability . . . . . . . . . . . . . . . . . . . . . 21 3.2 Heterogeneity of the Population . . . . . . . . . . . . . . . . . 23

4 Big-data Linear Mixed Effects Model (bLMM) 40 4.1 Introduction to bLMM . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Linear Mixed Effects Model (LMM) . . . . . . . . . . . . . . . 43 4.3 Expectation-Maximization (EM) Algorithm . . . . . . . . . . 48 4.4 Partial EM (PEM) Procedure for bLMM . . . . . . . . . . . . 59

5 Performance of bLMM PEM 66

6 Evaluation of SCAN-ECHO training on diabetes treatment using bLMM PEM 72

7 Discussion and Conclusion 75

iii

II Projection Pursuit Ellipse (PPE) 78

8 Original PPE 79 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 8.2 PPE Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

8.2.1 PPE Index 1 . . . . . . . . . . . . . . . . . . . . . . . 82 8.2.2 PPE Index 2 . . . . . . . . . . . . . . . . . . . . . . . 82 8.2.3 PPE Index 3 . . . . . . . . . . . . . . . . . . . . . . . 83

8.3 PPE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 84

9 High-dimensional PPE (hPPE) 86 9.1 hPPE algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 86 9.2 hPPE power analysis . . . . . . . . . . . . . . . . . . . . . . . 87

10 Discussion and Conclusion 89

A Appendix 91 A.1 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . 91

A.1.1 MLE for β . . . . . . . . . . . . . . . . . . . . . . . . . 92 A.1.2 MLE for (σ2k)k=1,...,q and (ρkl)k 6=l=1,...q . . . . . . . . . . 92 A.1.3 MLE for σ20 . . . . . . . . . . . . . . . . . . . . . . . . 97 A.1.4 MLE for φ . . . . . . . . . . . . . . . . . . . . . . . . . 98

A.2 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . 100 A.3 Proof of Proposition 2 . . . . . . . . . . . . . . . . . . . . . . 100

Bibliography 101 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

List of Tables

1.1 Variable definition and demographics of the study population . 4

2.1 Arguments for R-function fusionr . . . . . . . . . . . . . . . . 13

3.1 Number of years for number of A1c counts per year . . . . . . 22 3.2 Descriptive statistics of the summary measures of A1c . . . . . 23 3.3 Estimated coefficients of the fixed effects in the LMM for mean

of A1c by clusters . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.4 Estimated coefficients of the fixed effects in the LMM for mean

of A1c from all patients . . . . . . . . . . . . . . . . . . . . . 28 3.5 Estimated coefficients of the fixed effects in the LMM for SD

of A1c by clusters . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.6 Estimated coefficients of the fixed effects in the LMM for SD

of A1c from all patients . . . . . . . . . . . . . . . . . . . . . 33 3.7 Estimated coefficients of the fixed effects in the LMM for CV

of A1c by clusters . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.8 Estimated coefficients of the fixed effects in the LMM for CV

of A1c from all patients . . . . . . . . . . . . . . . . . . . . . 38

5.1 Mean-square error of the estimates . . . . . . . . . . . . . . . 70 5.2 Bias of the estimates . . . . . . . . . . . . . . . . . . . . . . . 70 5.3 % Bias of the estimates . . . . . . . . . . . . . . . . . . . . . . 71 5.4 Variance of the estimates . . . . . . . . . . . . . . . . . . . . . 71 5.5 Efficiency (sec) . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.1 Estimated coefficients of the fixed effects in the LMM for orig- inal A1c by clusters . . . . . . . . . . . . . . . . . . . . . . . . 73

6.2 Estimated coefficients of the fixed effects in the LMM for orig- inal A1c from all patients . . . . . . . . . . . . . . . . . . . . 74

9.1 Empirical power in percents comparison for PPE . . . . . . . 88 9.2 hPPE empirical power in percents comparison for hPPE . . . 88

List of Figures

1.1 Profiles of A1c from all patients . . . . . . . . . . . . . . . . . 5 1.2 Profiles of A1c from 80 random patients . . . . . . . . . . . . 6 1.3 Trends of SD of A1c from patients in subgroups . . . . . . . . 8

2.1 Data curation scheme . . . . . . . . . . . . . . . . . . . . . . . 11

3.1 Clustering decision criterions for the mean of A1c . . . . . . . 24 3.2 Cluster analysis result for the mean of A1c . . . . . . . . . . . 25 3.3 Clustering decision criterions for the SD of A1c . . . . . . . . 30 3.4 Cluster analysis result for the SD of A1c . . . . . . . . . . . . 31 3.5 Clustering decision criterions for the CV of A1c . . . . . . . . 34 3.6 Cluster analysis result for the CV of A1c . . . . . . . . . . . . 35

4.1 Overall flow of bLMM PEM algorithm . . . . . . . . . . . . . 42 4.2 Overall flow of PEM Procedure . . . . . . . . . . . . . . . . . 60

ACKNOWLEDGEMENTS

I would like to express my profound gratitude to the people who made this

thesis possible.

First and foremost, I wish to thank my advisor, Professor Jiayang Sun, the

director of the Center for Statistical Research, Computing and Collaborations

(SR2c) at Case Western Reserve University. This work would not have been

possible without her academic and financial support. For the last 4.5 years

since I started my Ph.D. program, she has guided me through the entire

journey, having her door always opened whenever I needed her advice. Her

insights in statistics have been the driving force to overcome any challenges

I encountered. During the most challenging times when writing this thesis,

she gave me the moral support, encouragement, and freedom I needed to

move on, in addition to scientific guidance.

I also would like to sincerely thank my committee members, Professor Jef-

frey Albert, Professor David Aron, Professor Mark Schluchter, and Dr. Yifan

Xu. Professor Albert, my former advisor, and thesis committee chair, pro-

vided me with academic and financial support, which has been a considerable

contribution to completing this thesis. He has always been supportive and

understanding of any issues I had, academically or personally. He leads in-

sightful discussions and offered me valuable advice in improving this thesis.

Professor Aron, the director of the Interprofessional Improvement Research,

Education, and Clinical Center (IIRECC) at Louis Stokes Cleveland VA Med-

ical Center, provided me with financial support and clinical insight, which

made this thesis to gain significantly higher scientific value. He always had

an answer to any clinical question I had and gave me the freedom to focus

on my thesis during the critical times. Professor Schluchter provided excel-

lent statistical advice as I was developing this thesis. His valuable statistical

insights brought this thesis into the next level. Dr. Xu, an applied scien-

vii

tist at Amazon, exposed me to the most modern statistical approaches in

making this thesis applicable to the big-data infrastructures. His mental and

emotional support as a friend also has been a great help.

Staff from the Department of Population and Quantitative Health Sciences,

Nick Koziura for all his administrative support and Alberto Santana for his

technical support deserve a big thank you. I also would like to thank my

classmates and friends in the Department of Population and Quantitative

Health Sciences. Special thanks to Arielle Bloostein for her support in im-

proving the linguistics of this thesis.

Last, but not the least, I would like to thank my parents for their uncondi-

tional love and support. Sa-rang-hae-yo!

viii

Partial EM Procedure for Big-data Linear Mixed Effects Model, and Generalized PPE for High-dimensional Data in Julia

Abstract

JANG-IK CHO

Methodologically, this dissertation contributes to two areas in Statistics:

Linear mixed effects models for big data and Test of equal covariance for

high-dimensional data. Scientifically, this dissertation helps to comprehen-

sively evaluate the effect of the Specialty Care Access Network-Extension

for Community Healthcare Outcomes (SCAN-ECHO) training on primary

care providers at outpatient clinics in treating diabetes for the VA patient

population.

In the first part of this dissertation, we introduce three challenges and of-

fer solutions to each, in examining the effect of SCAN-ECHO training on

VA diabetic patients. The first challenge was data curation for longitudi-

nal variables. As a solution, we developed an R-function called “fusionr”

customized to our data structure for effective data curation. The second

challenge was measurement variability and heterogeneity of the population.

Different types of summary measures were used to reduce the variability of

the outcome. Longitudinal cluster analysis was conducted to identify similar

subgroups among the heterogeneous population. The third challenge was

fitting linear mixed effects model for big data that could not be imported

to R because the data exceeded the memory capacity. As a solution, we

proposed a new modern approach to Big-data Linear Mixed Effects Model

(bLMM) using a Partial EM (PEM) algorithm and data partitioning. Our

PEM procedure was developed to analyze the effect of SCAN-ECHO train-

ing on diabetes treatment but this analytic approach is of interest by itself

(statistical contribution 1) because the PEM is a general procedure for

fitting LMM for big data. We evaluated the performance of bLMM PEM by

comparing PEM to the following three methods for fitting LMM: Broyden-

Fletcher-Goldfarb-Shanno (BFGS) algorithm using the entire data, full EM

using the entire data, and meta analysis using data partitions. Finally, for

implementation, we applied our PEM procedure to evaluate the effect of

SCAN-ECHO training for diabetes treatment.

In the second part of this dissertation, improvement in the optimization

algorithm for Projection Pursuit Ellipse (PPE) to test for equal variance in

high-dimensional data is introduced (statistical contribution 2). Many

standard multivariate techniques were developed based on the assumption

that the covariance matrices from different groups are equal. A well-known

test for the equality of covariance is the Bartletts test. However, the Bartletts

test is only a function of the volumes of covariance matrices, which does not

account for the shapes and orientations of the matrices. In this work we

developed a Projection Pursuit Ellipses procedure for high-dimensional data

(hPPE) and compared its performance to the Bartletts test and a modern

benchmark for high dimensional p data.

Part I

Big-data Linear Mixed Effects Model (bLMM)

Chapter 1

Introduction and Challenges

Electronic health records (EHR) and electronic medical records (EMR) have

grown massively in the last few decades. These big data have provided un-

precedented opportunities for medical research, while challenging statistics

beyond the capabilities of traditional statistical methods due to the large size

of data. For example, the need for developing our Big-data Linear Mixed Ef-

fects Model (bLMM) came from a project at the Veteran Affairs (VA) medical

center. The Cleveland VA hospital was funded with a multi-million dollar

project to carry out a Specialty Care Access Network-Extension for Commu-

nity Healthcare Outcomes (SCAN-ECHO) training to improve the quality of

treatment for Veterans since 2011 [8]. SCAN-ECHO training mimics Project

ECHO (Extension for Community Healthcare Outcomes), which started at

the University of New Mexico Health Sciences Center. SCAN-ECHO train-

ing is designed to train primary care providers (PCP) in outpatient clinics to

improve the quality of treatment for chronic disease or complex health con-

ditions such as diabetes, heart failure, chronic pain and vascular conditions

so that the quality of care at outpatient clinics is as good as that of the care

provided by specialists at the main hospital. We were involved in evaluat-

ing the effect of the SCAN-ECHO training on PCPs for diabetes treatment

during the period of 2011-2016. The study population consisted of patients

who were already diagnosed with (pre)diabetes in the greater Cleveland area

and the outcome of interest was the hemoglobin A1c level (i.e., A1c) of the

patients. Our goal was to evaluate whether SCAN-ECHO training helped

the PCPs to provide enhanced care to the diabetic patients so that the A1c

measures of patients were well controlled over time relative to the patients

treated by the PCPs who did not participate in the training.

According to the National Institute of Diabetes and Digestive and Kidney

Diseases, a division under NIH, A1c can be measured by the A1c test, which

provides information on the average blood glucose control for the past 3

months and the test provides evidence on how well the diabetes treatment

plan is working. The scale of A1c is reported in percent and an A1c level

below 5.7 percent is normal, between 5.7 and 6.4 is prediabetic and above

6.4 is diabetic [33]. Each patient had multiple A1c measurements over time,

but not necessarily with the same frequency. The covariates considered were

a patient’s age at baseline, gender, race, body mass index (BMI), systolic

blood pressure (SBP), diastolic blood pressure (DBP), medication, consults,

and diagnosis of cardiovascular disease (CVD) recorded as (yes or no). For

each patient, BMI, SBP, and DBP could be measured repeatedly over time,

different types of medications could be prescribed over time and consult ses-

sions could be held multiple times also over time. However, the repeatedly

measured variables are not necessarily measured on the same dates. The

original dataset had 38,173 patients and each of the variables measured re-

peatedly had 1 to 52 repetitions. The definitions and types of the variables

and the demographics of the study population are summarized in Table 1.1.

Continuous Variable Definition n Mean Std A1c A1c level 268,466 7.2 1.5 baseage Age at first A1c measure 38,173 64.2 12.3 BMI Body Mass Index 533,965 32.3 6.4 SBP Systolic Blood Pressure 533,965 130.2 15.8 DBP Diastolic Blood Pressure 533,965 74.98 9.1

Categorical Variable Definition Category Count % gender Gender Male 18,450 48.3

Female 614 1.6 (NA) 19,109 50.1

race Race Black 1,856 4.9 White 15,651 41.0 Other 1,447 3.8 (NA) 19,219 50.3

med On medication 1 (yes) 1,867 4.9 0 (no) 36,306 95.1

cons On consult 1 (yes) 1,002 2.6 0 (no) 37,171 97.4

cvdiag Diagnosed for CVD 1 (yes) 13,132 34.4 0 (no) 25,041 65.6

Table 1.1: Variable definition and demographics of the study population

Here ‘A1c’ is a continuous variable ranging from 3.1 to 19.4 with the mean of

7.2 and the standard deviation of 1.5. ‘med’ is a binary variable indicating

that a patient was prescribed with medication since a certain date and ‘cons’

is also a binary variable indicating that a patient had a consult session related

to his/her diabetes treatment since a certain date. Therefore, the ‘med’

and ‘cons’ variables will have a value of 1 from the date a patient was first

prescribed medication and the date they had a consult session. ‘cvdiag’ is

a binary variable and all of the dates since the patient was diagnosed with

CVD will have a value of 1.

Each variable is provided in a separate data table. Thus data management

including data reshaping, cleaning, integrating and organizing was essential

to facilitate any data analysis of this reach collection of data from the VA

system. With 38,173 patients and some longitudinal variables measured at

different dates, simple merging the data tables exceeded the memory capacity.

Therefore, finding a data curation strategy was our first challenge to continue

with the data analysis.

Figure 1.1: Profiles of A1c from all patients

Next, to understand the data and get an idea of which statistical approach

to use, we conducted exploratory data analysis (EDA) using a plot with the

trends of A1c from all of the patients as in Figure 1.1. For all patients, their

first date of A1c measurement was defined as time 0 and the number of days

to the next measures was calculated and used as the time variable. In Figure

1.1 the number of days is in the x-axis, the A1c level is in the y-axis, and

each line represents a trend of A1c for a patient.

Figure 1.2: Profiles of A1c from 80 random patients

Because the lines are overlapping, it was not easy to visually distinguish

the exact trend of each patient. We then provided the profile plot of 80

randomly selected patients from the population as given in Figure 1.2. We

were able to see that patients had a wide range of starting A1c values and

the trends of A1c level over time varied by patients. Some are increasing,

some decreasing and some are flat. This suggested that the population is

heterogeneous. Therefore, including all of the patients into one model could

obscure the trend because an increasing trend can compensate for a decreas-

ing trend when taken averages of the profiles. Also, measurement variability

was another concern. Variability in repeated measures within a patient could

be due to different clinicians, different testing methods, an input error, or

the patient’s other health conditions. It was not possible to trace back and

confirm the reason for the measurement variability for every single patient.

Because we did not have the information to determine whether the variability

is indeed real or an error, finding an approach not only to minimize but incor-

porate the measurement variability in fitting a model was important. Here

we face our second challenge measurement variability and heterogeneity of

the population.

An approach we took to reduce the measurement variability of A1c was to

use summary statistics such as mean, standard deviation, and coefficient of

variation of the A1c levels. Usually, it is advised to use a 3-months interval

when taking a summary measure of A1c, but because 95.7% of patients

had fewer than 2 measurements per year for some years in the duration

of the study, we’ve decided to use a one-year interval in our analysis. Past

contributions in the studies to evaluate diabetes using A1c focused mainly on

the mean of A1c [50] [49]. However, a mean can not measure the variation in

data. Therefore, to capture the variation, we also derived standard deviation

(SD) and coefficient of variation (CV) of the A1c level by years and used

them as outcomes. Based on each of these measures, cluster analysis was

conducted to identify the groups of patients with similar A1c trends over

time from the heterogeneous population. The results of the cluster analyses

suggested that the patients can be divided into 5, 3, and 3 clusters, for the

mean of A1c, the SD of A1c, and the CV of A1c, respectively. Details about

the results from the measures of the outcome and cluster analysis can be

found in Chapter 3.

Using the results from the cluster analysis, we continued to evaluate the effect

of SCAN-ECHO training by running EDA on the trends of A1c levels over

time for the patients in each of the subgroups. As an example, we show the

plots of trends of the SD of A1c over time by subgroups in Figures 1.3a, 1.3b,

and 1.3c.

Figures 1.3a, 1.3b, and 1.3c visually suggested that even within homogeneous

subgroups, the starting SD of A1c values and the trend of SD of A1c over

time across the patients vary. Therefore, based on this visual EDA, we fitted

two different LMM, one by subgroups and one with all patients using the

(a) Cluster A (b) Cluster B (c) Cluster C

Figure 1.3: Trends of SD of A1c from patients in subgroups

cluster as a covariate. Figures 1.3a, 1.3b, and 1.3c, also showed possible

linear, quadratic or even cubic time effects. Thus, we considered linear,

quadratic and cubic trends of the time along with the interactions between

linear/quadratic/cubic time effect and the intervention. The final model

described in Chapter 3 Section 3.2 is the best model selected based on the

model fit and the significance of the linear/quadratic/cubic time trends. The

identical strategy was used for fitting LMM on the mean of A1c and the CV

of the A1c.

As an alternative approach in examining the data, we also tried to fit LMM

on the raw A1c levels. This was to cross check with the results derived

from the summary measures to the ones from an LMM using the raw A1c

levels assuming that the measurement variability can be explained by random

effects. We attempted to conduct a longitudinal cluster analysis using the

raw A1c and time in days. However, because the time to A1c measurement

varied so much across the patients when the data were transformed to a wide

format for longitudinal cluster analysis, there were a lot of missing values.

The two main solutions to missing values when conducting cluster analysis

are to impute the missing values or to take summary measures over a range of

time. Since we already conducted the cluster analysis using the mean of A1c

per year, we decided to use the same subgroups suggested from the cluster

analysis result using the mean of A1c per year.

Based on the homogeneous subgroups, we tried to fit an LMM on the raw

A1c levels by the subgroups and also for all of the patients with the clusters

as a covariate. In order to fit a standard LMM, a single dataset with all

of the variables was required. However, even within a subgroup, deriving a

single dataset with all of the variables was not possible because it exceeded

the memory capacity. To continue with the analysis, finding an approach

for fitting an LMM for big data was our third challenge. As a solution, we

developed a new modern approach for Big-data Linear Mixed Effects Model

(bLMM) using a Partial Expectation and Maximization (PEM) procedure.

Our PEM procedure is of its interest itself as a statistical analytic procedure

which can be applied not only to our SCAN-ECHO data but also to other

data to fit bLMM.

In summary, this thesis provides our solutions to the following three main

challenges we faced when analyzing the data to find the effect of SCAN-

ECHO training on diabetes treatment in VA community:

1. Data curation for repeated measures

2. Measurement variability and heterogeneity in a study population

3. Fitting a ‘big-data’ linear mixed effects model

In Chapter 2, we will describe our data curation algorithm; in Chapter 3,

we will summarize the results from our cluster analysis and LMM analysis of

sub and whole groups; in Chapter 4, we will introduce our PEM procedure

for fitting bLMM; in Chapter 5, we will show the performance of our PEM

procedure for bLMM; and in Chapter 6, we will summarize the analysis result

using PEM procedure on the original A1c measurements.

Chapter 2

Data Curation

For big data, new ways of data curation are needed to add value to the

overflowing messy data [31]. Our study also required an algorithm of big

data curation. In our study, all of the variables were provided in separate

data tables with the repeated measures in a wide format table. Repeated

measures were not necessarily recorded on same dates, so data cleaning and

curation was an important task. Patient name was used as a key variable in

merging the data tables.

Using the least amount of memory was one of our biggest challenges for

data curation because R stores the imported data into the memory. So to

minimize the use of memory, our strategy was to merge and clean data tables

with repeated measures one at a time and then merge data tables with non-

repeated measures. When merging a data table with a repeated measure, the

date of the measurement is important for data cleaning. So, when merging

two data tables with repeated measures when dates do not match, the merged

table before data cleaning can be very large. Therefore, it is important to use

the least amount of data when merging data tables with repeated measures.

However, when merging non-repeated measures, the non repeated measure

will only add one additional column to the data matching to the key variable

(which was patient in our data). So merging non-repeated variable does not

require as much memory as to merging variables with repeated measures.

After two data tables were merged by patient, different strategies for repeated

measures and non-repeated measures were applied. Data curation was done

using R version 3.4.3 and our data curation was written as an R function

fusionr and followed the process depicted in the flowchart in Figure 2.1. In

this thesis, we will be using “variable” and “covariate” interchangeably.

Figure 2.1: Data curation scheme

The key variable used for data curation is A1c because it is the outcome

of interest. Therefore throughout the data curation process, all of the data

comparisons were made based on the dates A1c was measured. In Step 1,

the data curation function selected the data table with A1c and another data

table with one of the variables which were measured repeatedly. Because data

tables with repeated measures were provided in a wide format, in Step 2.1 we

first converted each data table from wide to long format and merged them

by patient name in Step 2.2. The merged dataset had two dates, one from

A1c and one from the other variable. In Steps 2.3 - 2.5, three steps of data

cleaning were conducted. In Step 2.3, the dates between A1c and the other

variable were compared. For a continuous covariate, if its date was within

3 months of the date A1c was measured, the value was kept otherwise it

was discarded. For a categorical variable, if the dates of the A1c measure

took place after the date of this categorical variable, the value of covariate

was copied for the later dates of A1c measures. In Step 2.4, if a continuous

variable was measured within 3 months of one another, we replace the values

and the dates within 3 months by their mean. This averaging step reduced

the measurement error and the data size without loosing much of the clinical

information. In Step 2.5, in order to keep only one date variable, the function

only kept the date from the A1c measure and deleted the dates from other

variables. Then the function selected another table with another variable

with repeated measures and iterated through Steps 2.1 - 2.5 until all of the

data tables with repeated covariates were included. In Step 3, data tables

with non-repeated measures were merged by patient name. After all of the

data tables were included, the function exported a single curated dataset for

statistical modeling and analysis. The overall scheme of data curation was

designed to minimize the memory usage needed for data storage. Our data

curation R function fusionr builds on R packages dplyr and tidyr and

other relevant functions in R default packages.

The basic syntax of fusionr function is

fusionr(outcome=list(y_table,y_name,y_date,y_repcol_start,y_repcol_end),

conticov=list(conticov1=list(conticov1_table,conticov1_name,

conticov1_date,conticov1_repcol_start,conticov1_repcol_end),

conticov2,...),

catcov = list(catcov1=list(catcov1_table,catcov1_name,

catcov1_date,catcov1_repcol_start,catcov1_repcol_end),

catcov2,...),key, length)

Argument Description Type

outcome y table: name of the data table that contains the list outcome variable, y name: name of the outcome variable, y date: dates of the outcome measurements, y repcol start: column number repeated measure starts, y repcol end: column number repeated measure ends

conticov conticov1 table: name of the data table that contains the list a continuous variable, conticov1 name: name of the continuous variable, conticov1 date: dates of the continuous measurements, conticov1 repcol start: column number repeated measure starts, conticov1 repcol end: column number repeated measure ends

catcov catcov1 table: name of the data table that contains the list a categorical variable, catcov1 name: name of the categorical variable, catcov1 date: dates of the categorical measurements, catcov1 repcol start: column number repeated measure starts, catcov1 repcol end: column number repeated measure ends

key name of the variable to merge the data tables character length time difference used for cleaning continuous variable in numeric

days

Table 2.1: Arguments for R-function fusionr

The outcome’s input is a list object. Its first list, y table, is the name of

the data frame that contains the outcome variable, the second list, y name,

is the name of the outcome variable, the third list, y date, is the date

the outcome is measured, the fourth list, y repcol start, is the column

number when the repeated outcome measure (or date) starts and the fifth

list, y repcol end, is the column number which the repeated outcome mea-

sure (or date) ends. conticov and catcov arguments have the same struc-

ture. conticov is a list object with its primary element being another list

which contains the information of the data frames for each of the continu-

ous variables. A continuous variable’s information is included in a list ob-

ject, conticov1. Inside the conticov1, the first list, conticov1 table, is

the name of the data frame that contains the continuous variable, the sec-

ond list, conticov1 name, is the name of the continuous variable, the third

list, conticov1 date, is the date the continuous variable is measured, the

fourth list, conticov1 repcol start, is the column number which the re-

peated measure or date starts and the fifth list, conticov1 repcol end, is

the column number which the repeated measure or date ends. For a second

continuous variable, the same argument is applied to conticov2. When all

of the list objects are defined for all of the continuous variable, conticov

will have conticov1,conticov2,... as its elements in each list. The identi-

cal argument definition is applied for catcov. key is a character which is

the name of the variable that will be used to merge all of the data tables.

length is numeric which is the number of days that will be used to clean

the continuous variables. If the date difference between a continuous variable

and the outcome variable is larger than length, the value of the continuous

variable will be defined as a missing value. Also, for each subject, if its con-

secutive repeated measures of a continuous covariate are within the length,

the function fusionr will calculate it’s mean value and replace the original

value with the mean value. Finally, since the data curation is based on the

outcome variable, the dates from the outcome variable will be kept and the

dates from the covariates will be deleted after the data cleaning is complete.

An example of fusionr is provided below. Because of the HIPPA privacy

issue, the example is presented using a simulated data which mimics the orig-

inal data structure. First, we will look at the data structure of an outcome

variable, a multivariate continuous covariate with repeated measurements

and a categorical covariate before their curation.

1. Data structure of the outcome variable: A1c

Patient.Name Last.4.SSN Date.1 Result.1 Date.2 Result.2

1 aaa 1357 3/9/15 6.9 7/9/15 7.3

2 bbb 2579 <NA> NA 7/27/15 11.2

3 ccc 9024 1/9/15 8.8 7/13/15 9.3

4 ddd 8501 <NA> NA <NA> NA

5 eee 4848 <NA> NA <NA> NA

6 rrr 2594 2/23/15 7.1 <NA> NA

7 ggg 9092 <NA> NA <NA> NA

Date.3 Result.3 Date.4 Result.4 Date.5 Result.5

1 <NA> NA <NA> NA 1/3/16 9.2

2 <NA> NA 11/4/15 10.8 <NA> NA

3 <NA> NA <NA> NA <NA> NA

4 9/23/15 12.2 <NA> NA 1/13/16 10.8

5 <NA> NA 12/18/15 9.3 <NA> NA

6 <NA> NA 12/12/15 7.7 <NA> NA

7 9/14/15 8.1 <NA> NA <NA> NA

Date.6 Result.6 Date.7 Result.7 Date.8 Result.8

1 <NA> NA 9/5/16 10.2 <NA> NA

2 <NA> NA <NA> NA <NA> NA

3 4/14/16 7.3 9/2/16 8.1 <NA> NA

4 <NA> NA <NA> NA <NA> NA

5 2/15/16 8.8 <NA> NA 12/23/16 9.1

6 <NA> NA 9/14/16 8.1 <NA> NA

7 <NA> NA <NA> NA 11/29/16 7.8

There are 7 unique patients and the data table is in the wide format.

2. Data structure of the continuous covariate: BMI, SBP and DBP

Patient.Name Last.4.SSN BMI.1 SBP.1 DBP.1 Date.1 BMI.2

1 bbb 2579 28.2 128 92 7/27/15 NA

2 ccc 9024 NA NA NA <NA> 32.3

3 ddd 8501 24.3 118 84 9/23/15 NA

4 eee 4848 NA NA NA <NA> NA

5 ggg 9092 NA NA NA <NA> 29.8

SBP.2 DBP.2 Date.2 BMI.3 SBP.3 DBP.3 Date.3 BMI.4

1 NA NA <NA> 27.3 132 93 11/4/15 NA

2 126 92 1/9/15 34.8 129 96 7/13/15 NA

3 NA NA <NA> NA NA NA <NA> NA

4 NA NA <NA> NA NA NA <NA> 25.6

5 128 82 9/14/15 NA NA NA <NA> 28.8

SBP.4 DBP.4 Date.4 BMI.5 SBP.5 DBP.5 Date.5 BMI.6

1 NA NA <NA> NA NA NA <NA> 29.7

2 NA NA <NA> 32.6 133 92 1/13/16 NA

3 NA NA <NA> NA NA NA <NA> NA

4 124 79 12/18/15 NA NA NA <NA> NA

5 126 79 12/13/15 NA NA NA <NA> NA

SBP.6 DBP.6 Date.6 BMI.7 SBP.7 DBP.7 Date.7

1 127 78 3/4/16 NA NA NA <NA>

2 NA NA <NA> NA NA NA <NA>

3 NA NA <NA> NA NA NA <NA>

4 NA NA <NA> 26.4 128 82 12/23/16

5 NA NA <NA> NA NA NA <NA>

There are 5 unique patients and the data table is in the wide format.

3. Data structure of the categorical covariate: medication

Patient.Name Last.4.SSN medication.1 Date.1

1 aaa 1357 Bromocriptine 5MG 3/13/12

2 aaa 1357 Canagliflozin 5MG 5/20/10

3 bbb 2579 Metformin 5MG 10/21/11

4 ccc 9024 Bromocriptine 5MG 3/13/12

5 ddd 8501 Metformin 5MG 12/16/13

6 eee 4848 Metformin 5MG 3/12/12

7 rrr 2594 Metformin 5MG 7/20/12

8 ggg 9092 Bromocriptine 5MG 11/3/09

medication.2 Date.2 medication.3 Date.3

1 Canagliflozin 5MG 3/13/12 <NA> <NA>

2 <NA> <NA> <NA> <NA>

3 Canagliflozin 5MG 1/16/13 Metformin 10MG 3/20/13

4 <NA> <NA> Metformin 5MG 7/7/14

5 <NA> <NA> <NA> <NA>

6 Metformin 5MG 5/1/12 <NA> <NA>

7 <NA> <NA> <NA> <NA>

8 Metformin 5MG 2/5/10 Metformin 10MG 2/7/13

The categorical variable medication was recorded multiple times at various

time points.

The fusionr function merges and cleans data tables of the outcome vari-

able with the continuous covariates first, and then merges into categorical

variables one by one as given in Figure 2.1.The current version of fusionr

dichotomizes categorical variables. If there is a measure before a date of the

outcome, the categorical variable is defined as 1 otherwise 0. Here is an R

example script.

a1cdata = read.csv("outcome.csv")

healthdata = read.csv("health.csv")

meddata = read.csv("medication.csv")

outcome = list(a1cdata,"Result","Date",3,18)

conticov1 = list(healthdata,"BMI","SBP","DBP","Date",3,30)

conticov = list(conticov1)

catcov1 = list(meddata,"Medication","Date",3,8)

catcov = list(catcov1)

key = "Patient.Name"

length = 90

cleandata = fusionr(outcome, conticov, catcov, key, length)

> cleandata

Patient.Name Last.4.SSN Date Result BMI DBP SBP catcov1

ccc 9092 2015-09-14 8.1 NA NA NA 1

ccc 9092 2016-11-29 7.8 29.8 82 128 1

ccc 9092 2016-11-29 7.8 28.8 79 126 1

aaa 1357 2015-07-09 7.3 NA NA NA 1

aaa 1357 2016-01-03 9.2 NA NA NA 1

aaa 1357 2016-09-05 10.2 NA NA NA 1

aaa 1357 2015-03-09 6.9 NA NA NA 1

eee 4848 2015-12-18 9.3 NA NA NA 1

eee 4848 2015-12-18 9.3 26.4 82 128 1

eee 4848 2016-02-15 8.8 NA NA NA 1

eee 4848 2016-02-15 8.8 26.4 82 128 1

eee 4848 2016-12-23 9.1 25.6 79 124 1

eee 4848 2016-12-23 9.1 NA NA NA 1

ccc 9024 2015-07-13 9.3 32.3 92 126 1

ccc 9024 2015-07-13 9.3 NA NA NA 1

ccc 9024 2015-07-13 9.3 32.6 92 133 1

ccc 9024 2016-04-14 7.3 32.3 92 126 1

ccc 9024 2016-04-14 7.3 34.8 96 129 1

ccc 9024 2016-04-14 7.3 32.6 92 133 1

ccc 9024 2016-09-02 8.1 32.3 92 126 1

ccc 9024 2016-09-02 8.1 34.8 96 129 1

ccc 9024 2016-09-02 8.1 32.6 92 133 1

ccc 9024 2015-01-09 8.8 NA NA NA 1

ccc 9024 2015-01-09 8.8 34.8 96 129 1

ccc 9024 2015-01-09 8.8 32.6 92 133 1

ddd 8501 2015-09-23 12.2 NA NA NA 1

ddd 8501 2016-01-13 10.8 24.3 84 118 1

bbb 2579 2015-07-27 11.2 27.3 93 132 1

bbb 2579 2015-07-27 11.2 29.7 78 127 1

bbb 2579 2015-07-27 11.2 NA NA NA 1

bbb 2579 2015-11-04 10.8 NA NA NA 1

bbb 2579 2015-11-04 10.8 29.7 78 127 1

bbb 2579 2015-11-04 10.8 28.2 92 128 1

rrr 2594 2015-12-12 7.7 NA NA NA 1

rrr 2594 2016-09-14 8.1 NA NA NA 1

rrr 2594 2015-02-23 7.1 NA NA NA 1

fusionr is applicable to conducting data curation for data with an identical

structure to ours. A user only needs to specify the path where data tables are

stored, the names of the data tables, response variable, continuous variables,

categorical variables and covariates with repeated measures to get a single

dataset ready for statistical analysis. This minimal specification requirement

is custom friendly to researchers who may be unfamiliar with R. However,

the current version of fusionr is limited to a specific data structure identical

to that of our case. Generalization of our data curation method for repeated-

measure big data can be one of our potential future development.

Chapter 3

Measurement Variability and Heterogeneity of the Population

3.1 Measurement Variability

Measurement variability is another challenge to our longitudinal data anal-

ysis. Variation among repeated measures in a short-time interval can result

from a random error or is important information to be explained and consid-

ered in the model. However, finding the nature and cause of variability can

be difficult in reality especially for data recorded by many different people in

the past such as in EHR and EMR. Therefore, when the variability is diffi-

cult to understand or reflects the natural variation of a measurement system,

incorporating or reducing it is essential to improve the quality of the data

for statistical analysis. Quantile regression [10, 52] is a recent method to

incorporate some type of measurement variability within a statistical model.

Another recent approach use the variable itself [30] to reduce the measure-

ment variability. For our date, reducing the measurement variability directly

using the variable was more natural.

The repeated measures in our data are A1c, BMI, SBP, DBP, medication,

and consults. Medication and consults are categorical and have minimal mea-

surement variability because it is recorded whenever there was a medication

prescribed and a consult session held. However, for the continuous variables,

A1c, BMI, SBP, DBP, their measurement variability can be an issue. For

BMI, SBP, and DBP, we calculated the mean of each variable measured

within the 3 months to reduce their measurement variability. Based on the

guideline provided by NIH [33], it is advised that A1c measures within 3

months are expected to be similar. So to reduce the measurement variability

of A1c, we calculated the summary measures of A1c by a quarter of a year.

However, since many patients had less than 3 observations per year, if the

summary measure of A1c was taken every 3 months, we would observe many

missing values of the summary measures. Table 3.1 shows the frequencies of

number of A1c measures per year for a total of m = 145, 200 ≤ n× 6 years,

where n = 38, 173 is the number of subjects. m is less than n × 6 because

some subjects had been observed for less than 6 years. The first column indi-

cates that there were 69,461 years with only 1 A1c value, the second column

indicates that there were 54,168 years with only 2 A1c values and so on until

15 which was the maximum number of A1c count per year. In summary,

out of 145,200 years in the study, 123,629 (which is 85.1%) years had 1 or 2

counts.

# of A1c count / yr 1 2 3 4 5 6 7 # of years 69,461 54,168 8,488 9,027 334 1,679 52

# of A1c count / yr 8 9 10 12 14 15 # of years 1,627 132 2 228 1 1

Table 3.1: Number of years for number of A1c counts per year

Thus, for A1c, we decided to derive summary measures by year for each

patient. The measures we used were mean, standard deviation (SD) and

coefficient of variation (CV), where CV = SDmean. The summary statistics of

the mean, SD and CV of the A1c are provided in Table 3.2.

n mean sd Mean of A1c/yr 6 7.09 1.43

SD of A1c/yr 6 0.3 0.5 CV of A1/yrc 6 0.04 0.06

Table 3.2: Descriptive statistics of the summary measures of A1c

3.2 Heterogeneity of the Population

In addition to the measurement variability problem, heterogeneity of the

population was another issue. Identifying homogeneous subgroups of the

patients in terms of on their trends of A1c over time is important in finding

the pure effect of SCAN-ECHO training on A1c (see Figure 1.1). Because

we did not have any information on the distribution or on the trends of

the A1c, we decided to use longitudinal cluster analysis (LCA) to our data.

LCA is a data-driven unsupervised learning method. Our LCA was done

on the three measures: mean, SD, and CV of the A1c per year by patient

using the R-package kml [15, 16]. kml is K-means clustering generalized

for longitudinal data. We used serial correlation as the distance function in

our cluster analysis and chose the number of clusters to be 2 to 6. The best

number of clusters was chosen based on the five decision criteria provided by

the kml package. The most widely used criteria among the five criteria are

three different variations of Calinski & Harabatz criteria C(k) [5]. However,

C(k) does not always provide the correct number of clusters [43] and the

three C(k) variations may not provide the same number of optimal clusters.

Therefore, Ray & Turi [40] and Davies & Bouldin [7] criteria are used to

cross-check the candidate optimal numbers of clusters. The outcome values

of the 5 decision criteria are shown in Figure 3.1 where the X-axis represents

the number of clusters and the Y-axis represents the decision criteria values.

For all 5 decision criteria, a high value is favorable.

Figure 3.1: Clustering decision criterions for the mean of A1c

Thus, for the means of A1c, the optimal number of clusters is 3 by Calinski

& Harabatz criterion, Ray & Turi criterion, and Davies & Bouldin criterion

and is 5 by two other variations of Calinski & Harabatz criteria. When using

the majority rule, the optimal number of clusters to be used is 3. However,

when subgrouping the data into 3 clusters, the profile of each cluster was

parallel over time with low, medium and high mean of A1c whereas, with 5

clusters, we were able to see not only the different magnitude of the mean

of A1c by clusters but also different profiles which were parallel, increasing

and decreasing by clusters. Therefore, considering the statistical criteria and

the profiles of the clusters, we decided that 5 clusters are the most sensible

number of clusters for our data. Figure 3.2 shows the profiles of the 5 clusters

where the X-axis represents the time in years and the Y-axis represents the

mean of A1c values. Further, Cluster A consists of patients with a flat mean

Figure 3.2: Cluster analysis result for the mean of A1c

A1c around 5.5 over time, Cluster B consists of patients with a flat mean A1c

around 7 over time, Cluster C consists of patients with an increasing A1c over

time, starting around 7.5, Cluster D consists of patients with a decreasing

A1c over time, starting around 9, and Cluster E consists of patients with a

flat mean of A1c around 10 over time. Thus, Cluster A has patients with

the healthiest A1c levels in the range of normal health status, Cluster B has

patients with stable A1c over time but in the rage of diabetes, Cluster C

has patients with diabetic condition getting worse over time, Cluster D has

patients with diabetic conditions getting better over time, and Cluster E has

patients with stable A1c but in worst diabetic condition. For each of the 5

subgroups, we fitted an LMM with a random intercept α1i and random slope

of time α2i, as in (3.1) at time = 1,2,3,4,5,6 and a stepwise model selection

was conducted using the AIC values.

mean(A1c)i = scan + time + baseage + gender + race + BMI + SBP

+ DBP + med + cons + cvdiag + α1i + α2itimei (3.1)

where i indicates the ith patient.

The estimated coefficients of the fixed effects in the LMM by subgroups after

the model selection are given in Table 3.3.

Cluster A Cluster B Cluster C Estimate p-value Estimate p-value Estimate p-value

(Intercept) 4.75300 < 2e-16 5.62500 < 2e-16 6.91000 < 2e-16 scan1 -0.02438 0.00428 -0.00376 0.61718 0.00626 0.53800 time -0.00094 0.70591 0.04711 < 2e-16 0.09047 < 2e-16

baseage 0.00424 1.31E-11 0.00374 1.06E-15 -0.00002 0.97600 GenderM 0.13050 4.23E-05 0.09358 0.00094 0.00598 0.86200

raceOTHER -0.01131 0.59304 raceWHITE -0.04238 0.00559

SBP DBP -0.00005 0.86150 0.00027 0.32593 0.00188 2.90E-06

BMI 0.02140 < 2e-16 0.01571 < 2e-16 0.00760 < 2e-16 med1 0.07683 4.03E-16 0.07517 2.12E-13 cons1 0.05294 0.00543 0.09538 9.54E-09 0.18250 3.76E-15

cvdiag1 -0.03412 0.02235

Cluster D Cluster E Estimate p-value Estimate p-value

(Intercept) 8.40700 < 2e-16 11.99000 < 2e-16 scan1 -0.00060 0.97304 0.16330 0.00904 time 0.08571 < 2e-16 0.05646 0.01355

baseage -0.00302 0.00258 -0.00818 0.02475 GenderM 0.04579 0.39854 -0.47720 0.00790

raceOTHER -0.17340 0.27856 raceWHITE -0.20480 0.02127

SBP -0.00153 0.00070 -0.00444 0.00293 DBP 0.00487 5.03E-09 0.00805 0.00368 BMI -0.02289 9.54E-06

med1 0.05442 0.00192 cons1 0.34780 <2e-16 0.50130 4.64E-07

cvdiag1

Table 3.3: Estimated coefficients of the fixed effects in the LMM for mean of A1c by clusters

The estimates for variance of random intercept, variance of random slope,

correlation between random intercept and random slope for Clusters A, B, C,

D, and E were (0.13708,0.01086,-0.61), (0.21999,0.02906,-0.82), (0.40391,0.05631,-

0.90), (0.73480,0.11150,-0.90), and (1.52790,0.21560,-0.83) respectively. For

all five clusters, the random intercept and random slope had a negative cor-

relation. SCAN-ECHO training had a significant non-zero effect in Clusters

A (p-value = 0.00428) and in Cluster E (p-value = 0.00904). In Cluster A,

SCAN-ECHO training decreased the mean of A1c level by 0.02438 and in

Cluster E, SCAN-ECHO training increased the mean of A1c level by 0.1633.

The statistical significance of these two subgroups explains that for diabetic

patients who are in reasonable good health, the SCAN-ECHO training helped

them to slightly decrease their A1c level to maintain their A1c levels at nor-

mal. However, for patients with abnormally high A1c level, SCAN-ECHO

training does not help to control the A1c level of a patient compared to the

patients seeing non-trained PCPs. For these patients, we might recommend

visiting the specialists in the main hospital. For Clusters B,C, and D, SCAN-

ECHO training did not show a statistically significant effect on decreasing

the A1c level. As an alternative modeling approach, we fitted an LMM (3.2)

for all patients with clusters as a covariate. In this model, we also considered

the interaction between SCAN-ECHO training and the clusters.

mean(A1c)i = scan + time + cluster + baseage + gender + race + BMI

+ SBP + DBP + med + cons + cvdiag + scan : cluster

+ α1i + α2itimei (3.2)

where i indicates the ith patient.

The estimated coefficients of the fixed effects in the LMM for all of the

patients after the model selection are given in Table 3.4.

Covariate Estimate p-value Covariate Estimate p-value (Intercept) 5.29800 < 2e-16 GenderM 0.03287 0.08016

scan1 -0.02491 0.10696 SBP -0.00055 0.00016 time 0.05892 < 2e-16 DBP 0.00195 2.10E-13

clusterB 0.76210 < 2e-16 BMI 0.00751 < 2e-16 clusterC 1.70500 < 2e-16 med1 0.08041 < 2e-16 clusterD 2.81200 < 2e-16 cons1 0.20800 < 2e-16 clusterE 4.48600 < 2e-16 scan1:clusterB 0.01367 0.45944 baseage 0.00073 0.03208 scan1:clusterC 0.03694 0.04788

raceOTHER -0.01634 0.27063 scan1:clusterD 0.02691 0.18873 raceWHITE -0.03269 0.00183 scan1:clusterE 0.18950 2.24E-08

Table 3.4: Estimated coefficients of the fixed effects in the LMM for mean of A1c from all patients

The variance of random intercept, the variance of random slope, the corre-

lation between random intercept and the random slope estimated from the

LMM 0.4025, 0.0573, -0.87. The results show that on average, SCAN-ECHO

training does not have a statistically significant effect on the mean of A1c

over time, while it is a helpful as a maintenance for relatively healthy group of

patients. In addition, significant variables in the model were clusters, time,

race, SBP, DBP, BMI, medication and consults meaning that the clusters

were very well identified, the mean of A1c gradually increases by 0.05892

every year, and white patients have a -0.03269 smaller mean of A1c relative

to the black patients. Medication and consults increased the mean of A1c

which is a remaining work to be done, to find out whether the cause of the

increase is directly from medication or consults or whether the increases is

coming from the characteristics of the patient population who were on med-

ication or had consults.

The decision criteria are shown in Figure 3.3 and the result of cluster analysis

on the SD of A1c is shown Figure 3.4. In Figure 3.3, the X-axis represents the

number of clusters and the Y-axis represents the decision criteria values. In

Figure 3.4, the X-axis represents the time in years and the Y-axis represents

the SD of A1c values.

Figure 3.3: Clustering decision criterions for the SD of A1c

For the SDs of A1c, the optimal number of clusters is 3 based on 4 out of 5

criteria as in Figure 3.3. In this analysis, both the majority rule among the

decision criteria and the information provided from the profiles of the clusters

both agreed to use 3 clusters to subgroup the patient population. Figure 3.4

shows the profiles of the 3 clusters where Cluster A consists of patients with

a flat SD of A1c around 0.2 over time, Cluster B consists of patients with

increasing SD of A1c at the beginning and gradually decreases over time and

Cluster C consists of patients with high SD of A1c at the beginning with

a steep decrease in the first year but stabilizes over time. Thus, Cluster A

has patients with almost no variance over time, Cluster B has patients with

the highest variance of A1c over time and Cluster C has patients who had

a stabilized variance of A1c over time. For each of the 3 subgroups, as in

(3.3), we fitted an LMM with a random intercept α1i and random slope of

time α2i. In Figure 3.4, we were able to identify quadratic profiles of SD of

A1c over time by the patient. Therefore, in our LMM, we included a fixed

Figure 3.4: Cluster analysis result for the SD of A1c

quadratic time effect at time = 1,2,3,4,5,6 and a stepwise model selection

was conducted using the AIC values. For these models, measures with SD =

0 were excluded because they were treated as missing values.

SD(A1c)i = scan + time + time 2 + baseage + gender + race + SBP + DBP

+ BMI + med + cons + cvdiag + α1i + α2itimei (3.3)

where i indicates the ith patient.

The estimated coefficients of the fixed effects in the LMM by subgroups after

the model selection are given in Table 3.5.

Cluster A Cluster B Cluster C Estimate p-value Estimate p-value Estimate p-value

(Intercept) 0.34370 < 2e-16 1.51800 < 2e-16 2.40400 < 2e-16 scan1 -0.01312 < 1e-05 -0.02514 0.11216 -0.02581 0.35909 time 0.01050 < 1e-15 0.23710 < 2e-16 -0.66680 < 2e-16

time2 -0.03798 < 2e-16 0.10510 < 2e-16 baseage -0.00202 < 2e-16 -0.00755 < 1e-11 -0.00865 < 1e-05

GenderM 0.03838 0.00413 0.15220 0.01743 0.01252 0.90251 raceOTHER -0.00693 0.50031 -0.09996 0.03352 -0.19770 0.01017 raceWHITE -0.01402 0.06422 -0.11380 0.00008 -0.16470 0.00055

SBP -0.00099 0.00239 DBP 0.00021 0.06153 0.00082 0.43693 BMI 0.00082 0.00999 -0.00864 < 1e-08 -0.00742 0.00139

med1 0.04744 < 2e-16 0.10550 < 1e-09 0.03414 0.23413 cons1 0.03583 < 1e-07 0.12290 0.00022 0.16510 0.00563

cvdiag1 0.01097 0.02367

Table 3.5: Estimated coefficients of the fixed effects in the LMM for SD of A1c by clusters

The estimates for variance of random intercept, variance of random slope,

correlation between random intercept and random slope for Clusters A, B,

and C were (0.07791,0.01002,-0.78), (0.74262,0.09083,-0.85), and (0.43121,0.04818,-

0.82) respectively. For all three clusters, the random intercept and random

slope had negative correlation. SCAN-ECHO training had significantly de-

creased SD of A1c in Cluster A (p-value = < 1e-05) by 0.01312 whereas

for Clusters B and C, SCAN-ECHO training was not statistically significant.

Cluster A consists of patients with low SD of A1c value throughout the study

period. So, given our data, SCAN-ECHO training helps to reduce the SD of

A1c for patients who have relatively stable A1c values over time compared to

the patients who have more variability in their A1c measurements over time.

As an alternative modeling approach, we fitted an LMM for all patients with

clusters as a covariate as in (3.4) and conducted stepwise model selection.

SD(A1c)i = scan + time + time 2 + cluster + baseage + gender + race

+ SBP + DBP + BMI + med + cons + cvdiag + scan : cluster

+ α1it + α2ittimeit (3.4)

where i indicates the ith patient.

The estimated coefficients of the fixed effects in the LMM for all of the

patients after the model selection are given in Table 3.6.

Covariate Estimate t-value Covariate Estimate t-value (Intercept) 0.62310 < 2e-16 raceOTHER -0.04972 0.00029

scan1 -0.01309 0.02483 raceWHITE -0.05762 < 1e-08 time 0.00133 0.72839 BMI -0.00235 < 1e-06

time2 -0.00041 0.53513 med1 0.06468 < 2e-16 clusterB 0.64340 < 2e-16 cons1 0.08558 < 1e-13 clusterC 0.54700 < 2e-16 scan1:clusterB 0.00823 0.46704 baseage -0.00389 < 2e-16 scan1:clusterC -0.11040 < 1e-07

GenderM 0.06893 0.00016

Table 3.6: Estimated coefficients of the fixed effects in the LMM for SD of A1c from all patients

The variance of random intercept, variance of random slope, correlation be-

tween random intercept and random slope estimated from the LMM 0.24572,

0.03162, -0.85. The result showed that overall, SCAN-ECHO training has

a statistically significant effect on the SD of A1c over time reducing the

variation by 0.01309 (p-value = 0.02483). In addition, in Cluster C, SCAN-

ECHO training reduced the variability significantly more by 0.1104 (p-value

< 1e-07) which was our new finding in fitting LMM with the entire patients

compared to the result from our LMMs by clusters.

The decision criteria are shown in Figure 3.5 and the result of cluster analysis

on the CV of A1c is shown Figure 3.6. In Figure 3.5, the X-axis represents the

number of clusters and the Y-axis represents the decision criteria values. In

Figure 3.6, the X-axis represents the time in years and the Y-axis represents

the SD of A1c values.

Figure 3.5: Clustering decision criterions for the CV of A1c

For the CVs of A1c, the optimal number of clusters is 3 based on 4 out of 5

Figure 3.6: Cluster analysis result for the CV of A1c

criteria as in Figure 3.5. In this analysis, both the majority rule among the

decision criteria and information provided from the profiles of the clusters

both agreed to use 3 clusters to subgroup the patient population. Figure 3.6

shows the profiles of the 3 clusters where Cluster A consists of patients with

a flat CV of A1c around 0.01 over time, Cluster B consists of patients with

increasing CV of A1c at the beginning and gradually decreases over time

and Cluster C consists of patients with high CV of A1c at the beginning

with a steep decrease in the first year but stabilizes over time. Thus, Cluster

A has patients with almost no variance over time, Cluster B has patients

with the highest variance of A1c over time and Cluster C has patients who

stabilized the variance of A1c over time. For each of the 3 subgroups, as in

(3.5), we fitted a LMM with a random intercept α1i and random slope of

time α2i. In Figure 3.6, we were able to identify quadratic profiles of CV of

A1c over time by the patient. Therefore, in our LMM, we included a fixed

quadratic time effect at time = 1,2,3,4,5,6 and a stepwise model selection

was conducted using the AIC values. For these models, measures with CV

= 0 were excluded because they were treated as missing values.

CV (A1c)i = scan + time + time 2 + baseage + gender + race + SBP + DBP

+ BMI + med + cons + cvdiag + α1i + α2itimei (3.5)

where i indicates the ith patient.

The estimated coefficients of the fixed effects in the LMM by subgroups after

the model selection are given in Table 3.7.

Cluster A Cluster B Cluster C Estimate p-value Estimate p-value Estimate p-value

(Intercept) 0.04438 < 2e-16 0.15770 < 2e-16 0.24720 < 2e-16 scan1 -0.00175 < 1e-05 -0.00293 0.05770 -0.00249 0.29068 time 0.00209 < 1e-11 0.02703 < 2e-16 -0.06196 < 2e-16

time2 -0.00009 0.05775 -0.00447 < 2e-16 0.00986 < 2e-16 baseage -0.00019 < 1e-12 -0.00072 < 1e-11 -0.00101 < 1e-09

GenderM 0.00361 0.02372 0.01701 0.00519 -0.01468 0.02645 raceOTHER -0.00236 0.05662 -0.01067 0.01751 -0.01265 0.00414 raceWHITE -0.00284 0.00173 -0.00971 0.00059

SBP 0.00002 0.00490 -0.00011 0.00106 DBP BMI -0.00082 < 1e-08 -0.00055 0.00921

med1 0.00448 < 2e-16 0.01031 < 1e-09 0.00542 0.02760 cons1 0.00342 0.00006 0.01261 0.00014 0.01660 0.00057

cvdiag1 0.00176 0.00253 -0.00486 0.04581

Table 3.7: Estimated coefficients of the fixed effects in the LMM for CV of A1c by clusters

The estimates for variance of random intercept, variance of random slope,

correlation between random intercept and random slope for Clusters A, B,

and C were (0.0011,0.0002,-0.80), (0.0072,0.0008,-0.84), and (0.0041,0.0005,-

0.78) respectively. For all three clusters, the random intercept and random

slope had a negative correlation. SCAN-ECHO training had significantly

decreased CV of A1c in Cluster A (p-value = < 1e-05) by 0.00175 whereas

for Clusters B and C, SCAN-ECHO training was not statistically significant.

Cluster A consists of patients with low SD of A1c value throughout the study

period. So, given our data, SCAN-ECHO training helps to reduce the SD of

A1c for patients who have relatively stable A1c values over time compared

to the patients who have more variability in their A1c measurements over

time. Also, DBP was not a significant factor in all three clusters in the LMM

for the CV of A1c whereas, the fixed quadratic time trend was statistically

significant in all three clusters. As an alternative modeling approach, we

fitted na LMM for all patients with clusters as a covariate as in (3.6) and

conducted stepwise model selection.

CV (A1c)i = scan + time + time 2 + cluster + baseage + gender + race

+ BMI + SBP + DBP + med + cons + cvdiag + scan : cluster

+ α1it + α2ittimeit (3.6)

where i indicates the ith patient.

The estimated coefficients of the fixed effects in the LMM for all of the

patients after the model selection are given in Table 3.8.

Covariate Estimate t-value Covariate Estimate t-value (Intercept) 0.07440 < 2e-16 raceOTHER -0.00598 0.00009

scan1 -0.00167 0.01516 raceWHITE -0.00605 < 1e-07 time 0.00056 0.19304 BMI -0.00027 <1e-07

time2 -0.00015 0.04835 med1 0.00657 < 2e-16 clusterB 0.06581 < 2e-16 cons1 0.00912 < 1e-12 clusterC 0.05018 < 2e-16 scan1:clusterB 0.00117 0.34248 baseage -0.00042 < 2e-16 scan1:clusterC -0.00884 < 1e-05

GenderM 0.00721 0.00040

Table 3.8: Estimated coefficients of the fixed effects in the LMM for CV of A1c from all patients

The variance of random intercept, variance of random slope, correlation be-

tween random intercept and random slope estimated from the LMM 0.0030,

0.0004, -0.85. The result showed that overall, SCAN-ECHO training has a

statistically significant effect on the SD of A1c over time reducing the varia-

tion by 0.00167 (p-value = 0.01516). In addition, in Cluster C, SCAN-ECHO

training reduced the variability significantly more by 0.0088 (p-value < 1e-05)

which was our new finding in fitting LMM with the entire patients compared

to the result from our LMMs by clusters.

Based on the 3 summary measures, the LMM analysis suggests that SCAN-

ECHO training statistically help reduce the mean, SD and CV of the A1c for

relatively healthy patients. However, for patients in poor health conditions,

SCAN-ECHO training does not help to improve their diabetic status.

To examine the data from an alternative point, we decided to also fit an

LMM using the raw data assuming that the mixed effects will incorporate

the measurement variability. In fitting the overall LMM, the identical sub-

groups from the cluster analysis using the mean of A1c were considered. So,

we tried to fit an LMM for each of the subgroups and for all patients with

the subgroups as a covariate. However, the raw data, even within a sub-

group, were too big to be imported to R. As a solution to this issue, we

developed a method to fit bLMM using a partial expectation-maximization

(PEM) approach. The details will be explained in the next Chapter.

Chapter 4

Big-data Linear Mixed Effects Model (bLMM)

4.1 Introduction to bLMM

Big-data linear mixed effects model (bLMM) is a modern extension of LMM

to fit a LMM to big data. As introduced in Chapter 1, the need for bLMM

arose in our analysis of electronic health records of VA patients to find the

effect of SCAN-ECHO training on PCPs in treating diabetes. Figure 1.1

presents the profiles of a diabetes outcome measure, A1c levels over time for

the entire patient population. From the plot, we see that both the starting

A1c values and the trends of A1c over time have large variability by patient.

Thus, we chose to fit a LMM to accommodate the variation by patients and

find the effect of SCAN-ECHO training on A1c levels. However, we were not

able to fit a LMM using a traditional model fitting approach because the full

data were too big to be imported into R. We then tried to conduct longitu-

dinal K-means cluster analysis to find subgroups using A1c levels recorded

on the absolute length of time in days from the starting days. It again was

not possible because the data exceeded the memory capacity. Our proposed

procedure was developed to fit a linear mixed effects model for big-data when

the data are too big or the analysis could not be done because of memory

limits.

Recent strategies for big-data LMM include a method of moment estimation

(Gao and Owen [13]), random effects meta regression (REMR) (Gebregziab-

her et. al [14]),(Jackson et al. [21]), semiparametric 2-step estimation (Pen-

nell and Dunson [36]), and sparse matrix technique (Johnson and Thompson

[24]), (SAS: Proc HPMIXED [17]). Among these methods, random effects

meta regression and semiparametric 2-step estimation are relevant to our

goal of finding a solution to a data set that exceeds the current memory

capacity. Random effects meta regression independently fits LMM to the

partitions of the entire data and merge them using random effects regression

model. However, when using partitions, it is important to develop an effec-

tive fitting algorithm that incorporates interactive informations of different

partitions in addition to using the parameter estimates from each partition to

summarize the final estimates. Semiparametric 2-step estimation is a data di-

mension reduction strategy rather than fitting a LMM using the entire data.

Therefore, to fulfill our needs, we developed a method to fit big-data linear

mixed effects model using a Partial Expectation-Maximization (PEM)

algorithm based on random partitions of the data. Our PEM procedure was

developed primarily to solve the big data issue and fit a longitudinal LMM for

our SCAN-ECHO project, but the method can be extended to other types of

data such as live-feeding data and to high-performance cluster environments

when fitting LMM.

Next, we introduce our procedure at a glance. Recall that our data were pro-

vided in separate tables, one for each of the variable, and data tables were

linked by patient names. When including all of the patients for data curation,

the data was too big to be imported into R. Therefore, as shown in Figure

4.1, we first divided patients into K groups. Next we selected a group and

conducted its data curation to get one complete dataset to fit a LMM. We

saved the parameter estimates from fitting LMM and additional summary

information (specified in details later) about the first group to pass over to

the analysis when adding a second group. We then fed the parameters esti-

mates and summary information to the second group to get the accumulated

parameter estimates and summary information. This procedure continued

until all of the patients were included in the analysis which was the final

parameter estimate of the LMM.

Figure 4.1: Overall flow of bLMM PEM algorithm

bLMM is a modern extension of LMM to fit a LMM for big data and our

procedure is a deviation from the general Expectation and Maximization

(EM) algorithm. Thus, knowing LMM and EM algorithm is very important

to understand our PEM procedure. Therefore, in Section 4.2 we will review

LMM, focusing on the derivation of the parameter estimates and introducing

available statistical packages. In Section 4.3, we will review the general E-

M algorithm. In Section 4.4, we will describe in detail our proposed PEM

procedure. In Chapter 5, we will show the performance of our PEM procedure

via a simulation study. In Chapter 6, we will show the results from evaluating

SCAN-ECHO training using our PEM procedure.

4.2 Linear Mixed Effects Model (LMM)

A linear regression model is generally used to find a linear relationship be-

tween an outcome and factors of interest. A linear regression model can be

expressed as

Yn×1 = Xn×pβp×1 + �n×1 (4.1)

where Y is a vector of an outcome variable measured at n times, X is a

n × p matrix of p covariates measured at n times, β is a p − dimensional

vector of unknown regression coefficients, and � is a n−dimensional vector

of unknown random errors. In this model, β indicates the effect size of

the variables of interest on the outcome. These β values are fixed for the

entire study population, meaning that the effect size of a variable is same

for all of the subjects involved in the study. However, the outcome may be

correlated within a group or a subject, and the effect of a variable can be

different depending on a specific group or a subject. For example, in medical

research, when patients have repeated tests (or measures), it is possible that

those measures will be correlated in a given patient, especially if the measures

are taken within a short time interval. In this case, the mean and the effect

of some covariates will likely differ by patient. The understanding of the

differences between patients and/or correlation of measures for one patient

can be represented using random effects in a model. A linear model with

both fixed and random effects is called linear mixed effects model (LMM)

and can be generally expressed as

Yn×1 = Xn×pβp×1 + Zn×qαq×1 + �n×1 (4.2)

where Y is a vector of an outcome variable measured at n times, X is a

n × p matrix of p covariates measured at n times, β is a p − dimensional

vector of unknown regression coefficients, Z is a n× q known matrix, α is a

q − dimensional vector of the random effects, and � is a n − dimensional

vector of unknown random errors. If � ∼ N(0, Σ�) then the model in (4.2) is

a Gaussian LMM.

The random effects model was first introduced in 1918 by R.A. Fisher [11] in

a study to find correlations among some traits between familial relatives. The

estimation methods for the random effects model were formally introduced

in the 1950s. A best linear unbiased estimator (BLUE) of fixed effects and

a best linear unbiased predictions (BLUP) of random effects introduced by

C.R. Henderson [20] in 1959 were the foundation for the development of LMM

estimation (Robinson [41]). Estimation procedures for the parameters in a

LMM, differ between Gaussian and non-Gaussian models. Coefficients of a

Gaussian LMM are generally estimated using a maximum likelihood estima-

tor (MLE) or by using a restricted maximum likelihood estimator (REML).

Coefficients of non-Gaussian LMM can generally be estimated using a Quasi-

likelihood method, partially observed information, iterative weighted least

squares, or a jackknife method (Jiang [23]). Mixed effects models are useful

to model personalized outcomes. Xu et al. [51] used a mixed effects expo-

nential model to capture personalized differences of a chronic wound size,

Cenik et al. [6] used mixed effects model to improve identification of the

individual protein level differeces, and Koonce et al. [27] used mixed effects

multivariate linear regression model to incorporated personalized knowledge

of each study subject. There are many different mixed effects models used in

recent research but since our bLMM is based on a Gaussian LMM, the next

sections will focus on Gaussian LMM.

In general, a Gaussian LMM can be expressed as

Yn×1 = Xn×pβp×1 + Zn×qαq×1 + �n×1 (4.3)

where Y is a vector of an outcome variable measured at n times, X is a

n × p matrix of p covariates measured at n times, β is a p − dimensional

vector of unknown regression coefficients, Z is a n× q known matrix, α is a

q − dimensional vector of the random effects, and � is a n − dimensional

vector of unknown random errors. Typically LMM assumes that the α and

the � are uncorrelated and have a mean of zero and finite variance such that

G = var(α) and R = var(�).

The most widely used methods to estimate parameters in LMM are the

maximum likelihood estimator (MLE) and the restricted maximum likelihood

estimator (REML). Even though the MLE (Pfanzagl [37]) and random effects

model (Fisher [11]) were both introduced by R.A. Fisher in the 1920s, the

maximum likelihood method was not used for mixed model estimation until

Hartley and Rao [19] published a paper in 1967. Following Jiang [23], under

the Gaussian LMM, let θ be the all variance components involved in V where

V = var(Y ) = ZGZ′ + R. Then the likelihood function L(β,θ|y) can be

expressed using the distribution of Y in model (4.3) as

L(β,θ|y) = 1

(2π)n/2|V |1/2 exp { −

2 (y −Xβ)′V −1(y −Xβ)

} (4.4)

The log-likelihood function can be expressed as

l(β,θ|y) = C − 1

2 log(|V |) −

2 (y −Xβ)V −1(y −Xβ) (4.5)

with a constant C. By differentiating the log-likelihood function with respect

to the parameters, we get the following,

∂l

∂β = X′V −1y −X′V −1Xβ (4.6)

∂l

∂θr =

{ (y−Xβ)′V −1

∂V

∂θr V −1(y−Xβ)−tr

( V −1

∂V

∂θr

)} ,r = 1, ...,Q (4.7)

where θr is the rth element of θ and θ is of dimension Q. Then the standard

procedure to find the MLE is to solve for β and θ from ∂l ∂β

= 0 and ∂l ∂θ

= 0,

respectively. However, in practice when there are many random effects in

a LMM, the MLE which maximizes the likelihood function is inadequate.

When estimating the variance component of a LMM with many random

effects and with the number of fixed effects that increases proportional to

the sample size, MLE becomes biased and inconsistent (Faraway [9], Jiang

[22]). For example, with many random effects in a LMM, the MLE of the

variance component can result in a negative value. Because the variance

cannot be negative, the MLE of variance will be 0.

The underlying idea of REML was first introduced by M.S. Bartlett [2] in

1937 and the first approach using REML to estimate variance component

was introduced by Patterson and Thompson [35] in 1971. For a LMM, when

estimating the variance components, the fixed effects can be considered nui-

sance parameters. REML works under this assumption and finds a linear

combination A that will make X into a nuisance variable. W.l.o.g, from

model (1.3) we will assume that the rank(X) = p and let A be an n×(n−p)

matrix such that

rank(A) = n−p,E(A′X) = 0 (4.8)

Then we define T = A′y where T ∼ N(0,A′V A) because Y ∼ N(0,V ).

Thus the restricted likelihood function LR(β,θ|T) can be expressed from the

distribution of T as

LR(β,θ|T) = 1

(2π)(n−p)/2|A′V A|1/2 exp { −

2 T ′(A′V A)−1T

} (4.9)

where the subscript R stands for “restricted” or “residual”. Then the re-

stricted log-likelihood function can be expressed as

lR(β,θ|T) = C1 − 1

2 log(|A′V A|) −

2 T ′(A′V A)−1T (4.10)

with a constant C1. By differentiating the restricted log-likelihood function

with respect to θ and setting it to 0 we will get a REML estimator. According

to Jiang [23] (see pg13-14), even though REML uses a linear transformation

matrix A, the estimator does not depend on A.

Even though MLE and REML estimators can be approximated by finding

the roots of derivatives of the relevant likelihood, finding the maximum of

the MLE or REML is challenging from a computation standpoint. In prin-

ciple, finding the maximum of MLE or REML and checking the convergence

can be done using numerical procedures such as the Newton-Raphson (NR)

algorithm (Lindstrom and Bates [29]) or by the EM algorithm (Laird and

Ware [28]). According to Jiang [23], the NR algorithm is known to have

faster convergence but is more sensitive to the initial starting value than the

EM algorithm. Therefore, depending on the data and study design, choosing

the right maximization algorithm will reduce the time of computation and

increase the accuracy of the MLE and REML Details on the computation is

well explained in Jiang [23] (see pg 29-34).

Many major statistical software provide functions, macros or packages for

standard Gaussian LMM analysis. SAS provides PROC MIXED as a stan-

dard procedure to fit a Guassian LMM. By default, PROC MIXED delivers

empirical (or estimated) BLUE [42] following the method from Kachar and

Harvielle [26]. PROC MIXED uses ridge-stabilized Newton-Raphson algo-

rithm to optimize either a full (ML) or residual (REML) likelihood function

[42]. R has many packages on CRAN for LMM such as lme4, nlme, lmm,

ASReml, MCMCglmm, and glmmADMB. Using MLE and REML, the

standard Gaussian LMM can be analyzed using packages lme4 and nlme.

4.3 Expectation-Maximization (EM) Algorithm

In this section, we describe a full EM algorithm of fitting a longitudinal

LMM. Recall that a longitudinal LMM with its random effects grouped by

each of m patients can be defined as,

Yi = Xiβ + Ziαi + �i (4.11)

for i = 1, ...,m, where the ith subject has Ji repeated measures so that there

is a total m∑ i=1

Ji = N observations. Yi is a Ji×1 vector, representing outcome

values of the ith subject; Xi is a Ji × (p + 1) matrix with a column vector

of 1 representing the intercept and rest columns representing p covariates

measured at Ji times; β is a (p + 1) × 1 vector representing unknown fixed

effects of the covariates; Zi is a Ji × q known matrix associated with the

q-dim random effects αi, with αi iid∼ Nq(0,Gi), Gi being a q × q covariance

matrix of the random effects. The random error �i is Ji-dimensional with

�i iid∼ NJi (0,Ri), Ri being a Ji × Ji covariance matrix of an AR1 process,

indicating a serial correlation in the Ji observations. Further, αi and �i are

independent of each other, denoted as αi |= �i for i = 1, ...,m.

We will assume that the covariance matrix Gi of the random effects αi

and the parameters in the covariance matrix Ri of the random error �i are

identical across all subjects for i = 1, ...,m. So that the q × q covariance

matrix is

Gi = G ∆ =

 

σ21 ρ12σ1σ2 ρ13σ1σ3 · · · ρ1qσ1σq ρ21σ2σ1 σ

2 2 ρ23σ2σ3 · · · ρ2qσ2σq

... ... σ23 · · ·

... ...

. . . ...

ρq1σqσ1 ρq2σqσ2 ρq3σqσ3 · · · σ2q

  (4.12)

where, ρjk = ρkj for j,k = 1, ...,q.

Ri = σ20

1 −φ2 Σ(i) �

and Σ(i) �

 

1 φ φ2 · · · φJi−1 φ 1 φ2 · · · φJi−2 ...

... . . .

... ...

... . . .

... φJi−1 · · · φ2 φi 1

  is a Ji ×Ji matrix

(4.13)

Grouping m subjects together, we can re-write our growth curve model in

(4.11) as

Y N×1 = XN×(p+1)β(p+1)×1 + ZN×qαq×1 + �N×1 (4.14)

where Y , X, Z are observable, β and α are unknown fixed and random

effects, and

YN×1 =

 

Y1 Y2 ...

 , XN×(p+1) =

 

1 X1 1 X2 ... 1 Xm

 , β(p+1)×1 =

  β0 β1 ... βp

 ,

ZN×(qm) =

 

Z1 0 · · · 0 0 Z2

. . . 0 ...

. . . . . .

... 0 · · · 0 Zm

 , α(qm)×1 =

  α1 α2 ... αm

  ∼ N(0,G),

�N×1 =

  �1 �2 ... �m

  ∼ NN (0,R)

Yi =

 Yi1... YiJi

 , Xi =

 X

(i) 11 · · · X

(i) 1p

... · · · ...

X (i) Ji1 · · · X(i)Jip

 , Zi =

 Z

(i) 11 · · · Z

(i) 1q

... · · · ...

Z (i) Ji1 · · · Z(i)Jiq

 , αi =

 αi1... αiq

 , �i =

 �i1... �iJi

  and �i ∼ AR1(φ), R =

 

R1 0 · · · 0 0 R2

. . . 0 ...

. . . . . .

... 0 · · · 0 Rm

  N×N

G =

 

G1 0 · · · 0 0 G2

. . . 0 ...

. . . . . .

... 0 · · · 0 Gm

  =

 

G 0 · · · 0 0 G

. . . 0 ...

. . . . . .

... 0 · · · 0 G

  mq×mq

Hence,

Y |α ∼ N(Xβ + Zα,R) and α ∼ N(0,G)

and the complete pdf of (Y ,α) from our model (4.14) is

f(Y ,α) =f(Y |α)f(α)

= m∏ i=1

( 1

(2π)Ji/2 ∣∣∣∣ σ201 −φ2 Σ(i)�

∣∣∣∣1/2 exp

{ −

2 ri ′ (

σ20 1 −φ2

Σ(i) �

)−1 ri

})

× m∏ i=1

Ji∏ j=1

( 1

2π|Gi|1/2 exp

{ −

2 αi ′Gi

−1αi

}) (4.15)

where

ri =

 ri1... riJi

  = Yi −Xiβ−Ziαi (4.16)

So if α were also observable, the log-likelihood function l(β, (σ2k), (ρkl),σ 2 0,φ),

given (Y ,α), where k 6= l = 1, ..,q, is

l(β,(σ2k)k=1,...,q, (ρkl)k,l,σ 2 0,φ)

( − m

2 ln(σ20 ) +

2 ln(1 −φ2) −

m∑ i=1

ln|Σ(i) � |−

1 −φ2

2σ20

m∑ i=1

r′ i (Σ(i)

� )−1ri

) +

( − N

2 ln|G|−

m∑ i=1

Ji∑ j=1

αi ′G−1αi

) + C

(4.17)

for some constant C.

Then the maximum likelihood estimator (MLE) can be derived by finding

the roots of the partial derivatives of the log-likelihood function with respect

to the parameters (β, (σ2k), (ρkl),σ 2 0,φ) as follows:

∂l

∂β =

∂

∂β

( −

1 −φ2

2σ20

m∑ i=1

ri ′(Σ(i)

� )−1ri

) (4.18)

∂l

∂σ2k =

∂

∂σ2k

( − N

2 ln|G|−

m∑ i=1

Ji∑ j=1

αi ′G−1αi

) (4.19)

∂l

∂ρkl =

∂

∂ρkl

( − N

2 ln|G|−

m∑ i=1

Ji∑ j=1

αi ′G−1αi

) (4.20)

∂l

∂σ20 =

∂

∂σ20

( − m

2 ln(σ20 ) −

1 −φ2

2σ20

m∑ i=1

ri ′(Σ(i)

� )−1ri

) (4.21)

∂l

∂φ =

∂

∂φ

( m

2 ln(1 −φ2) −

m∑ i=1

ln|Σ(i) � |−

1 −φ2

2σ20

m∑ i=1

ri ′(Σ(i)

� )−1ri

) (4.22)

Proposition 1 Under the LMM model (4.14), given α, the MLE of β, (σ2k),

(ρkl), σ 2 0,φ are

β̂ =

( m∑ i=1

Xi ′(Σ̂(i)

� )−1Xi

)−1( m∑ i=1

Xi ′(Σ̂(i)

� )−1(Yi −Ziαi)

) (4.23)

σ̂2k = 1

m∑ i=1

Jiα 2 ik (4.24)

ρ̂kl =

m∑ i=1

Jiαikαil√√√√ 1 N

m∑ i=1

Jiα 2 ik

√√√√ 1 N

m∑ i=1

Jiα 2 il

(4.25)

(σ̂20, φ̂) are the real solutions in (0,∞) × (0, 1) of the following equations:

σ̂20 = (1 −φ 2) ×

m∑ i=1

ri ′(Σ(i)

� )−1ri (4.26)

σ20 (N −m)φ 3 + 2(A−mσ20 )φ

2 + (2B −σ20 (N −m))φ + 2(mσ 2 0 −A) = 0

(4.27)

where A = m∑ i=1

Ji−1∑ k=1

rikri(k+1), B = m∑ i=1

( r2i1 +r

2 iJi

+2 Ji−1∑ j=2

r2ij

) , ri and rij defined

as in (4.16) and Σ̂(i) �

= Σ(i) �

(φ̂). �

The proof of Proposition 1 is in Appendix A.1. The solutions of (σ̂20, φ̂)

from (4.26) - (4.27) can be obtained in principle, but not tractable.

Next, we propose computationally efficient estimates of σ20 and φ using the

residuals of the LMM. Since �i follows and AR1(φ) model by its definition

in (4.14), we know that for the ith subject,

�i2 = φ�i1 + ηi1 �i3 = φ�i2 + ηi2

... �iJi = φ�i(Ji−1) + ηi(Ji−1)

(4.28)

for i = 1, ...m.

Denote � (t) i = (�i2, �i3, ...,�iJi ) and �

(t−1) i = (�i1, �i2, ...,�iJi−1 ) to be vectors

of errors with 1-lag time difference. Define �(t) = (� (t) 1 ,�

(t) 2 , ...,�

(t) m

) and

�(t−1) = (� (t−1) 1 ,�

(t−1) 2 , ...,�

(t−1) m

). Then, �(t) and �(t−1) satisfy the following

linear relationship:

�(t) = φ01 + φ� (t−1) + ηt (4.29)

where ηt iid∼ N(0,σ20 ), φ0 = 0 and 1 =

 1...

 .

Because �(t) and �(t−1) are not observable, we can use the residuals from the

LMM to get the OLS estimate of the φ.

Let r(t) and r(t−1) be the residuals corresponding to �(t) and �(t−1) so that

r(t) = (r (t) 1 ,r

(t) 2 , ...,r

(t) m

) = (r12, ...,r1J1,r22, ...,r2J1, ......,rm2, ...,rmJm ) and

r(t−1) = (r (t−1) 1 ,r

(t−1) 2 , ...,r

(t−1) m

) = (r11, ...,r1(J1−1),r21, ...,r2(J2−1), ......, rm1, ...,rm(Jm−1)).

For the ith subject, denote a (J1 − 1) × 2 matrix X (t−1) i as,

X (t−1) i =

 

1 ri1 1 ri2 ...

... 1 ri(Ji−1)

  = (1,r(t−1)i )

Then the OLS estimate of φ from model (4.29) is

φ̂ = m∑ i=1

(X (t−1)′ i X

(t−1) i )

−1X (t−1)′ i r

(t) i (4.30)

Again, the AR(1) structure in (4.29) implies that var(�) = σ20

1 −φ2 (see [44]

Section 3.3). Thus, there are two good estimates of σ20 using either the

residuals from our LMM model (4.14) or using the residuals from the linear

model of the errors (4.29):

σ̂20,1 = (1 − φ̂ 2)v̂ar(�)

σ̂20,2 = v̂ar(η)

where “∧” indicates an estimate of its underlying element, e.g., v̂ar(η) is an

estimate of var(η). For T = m∑ i=1

(Ji − 1), let ri be a Ji −dimensional vector

of the residuals from the ith subject from the LMM model (4.14) and let res

be a T − dimensional vector of the residuals from the linear model of the

errors (4.29). Then, our two estimators of σ20 are in the following (4.31) and

(4.32) respectively.

σ̂20,1 = (1 − φ̂ 2) ×

N −m

m∑ i=1

(Ji − 1)v̂ar(ri) (4.31)

σ̂20,2 =

T∑ j=1

res2j

T − 2 (4.32)

where r̂i = Yi − Xiβ̂ − Ziα̂i, v̂ar(ri) = 1Ji − 1 Ji−1∑ j=1

(rij − r̄i)2 with r̄i =

1 Ji − 1

Ji−1∑ j=1

rij and resj = r (t) jk − φ̂0 − φ̂1r

(t−1) jk where j = 1, ...,m and k =

1, ..., (Jj−1). For simplicity of the calculation, we will use (4.32) to estimate

σ20 in the rest of this thesis.

Computation of αi, αiα ′ i

The MLE of the parameters β, (σ2k), (ρkl) in Proposition 1 and alternative

estimates of σ20 and φ in (4.30) and (4.32) depend on unobservable αi and

αiαi ′ where αi =

 αi1... αiq

 . Thus, we calculate their estimates

• α̂ik

• α̂2ik

• (α̂ikα̂il)

for k 6= l = 1, ...,q and i = 1, ...,m using the elements in E(αi|Y ) and

E(αiα ′ i |Y ). To compute the conditional expectations E(αi|Y ) and E(αiα′i|Y ),

we start from the joint distribution of Y and α.

( α Y

) ∼ N

(( 0 Xβ

) ,

( var(α) cov(α,Y )

cov(Y ,α) var(Y )

))

Then, by [3] and [25], the conditional distribution of α given Y is Normal

with mean and variance as:

E(α|Y ) = E(α) + cov(α,Y )var(Y )−1(Y −Xβ) (4.33)

var(α|Y ) = var(α) − cov(α,Y )var(Y )−1cov(Y ,α)) (4.34)

Unfortunately, computing (4.33) and (4.34) requires calculating an inverse

matrix with a size of dim(Y ) ×dim(Y ) which can be very large with a big-

data. Fortunately, we have the following Lemma under the independence

scenario.

Lemma 1 If (Y1,α1), ..., (Ym,αm) are independent to each other, then

E(αi|Y ) = E(αi|Yi) (4.35)

E(αiα ′ i |Y ) = E(αiα′i|Yi) (4.36)

�

Based on Lemma 1 and careful calculation, we have the following conditional

expectations.

Proposition 2 Under the LMM in (4.14), we have

E(αi|Yi) = GiZi′(ZiGiZi′ + Ri)−1(Yi −Xiβ) ∆ ≡ F1

E(αiαi ′|Yi) = Gi −GiZi′(ZiGiZi′ + Ri)−1ZiGi + E(αi|Yi)E(αi|Yi)′

∆ ≡ F2

�

Here F1 is a q−dim vector and F2 is a q×q matrix. See Appendix A.2 and

Appendix A.3 for the proofs of Lemma 1 and Proposition 2 respectively.

Note that Gi depends on (σ 2 k) and (ρkl) for k 6= l = 1, ...,q and Ri depends

on σ20 and φ and is Ji −dim, as given in (4.12) and (4.13). So,

Gi = Gi((σ 2 k), (ρkl)) (4.37)

Ri = Ri(σ 2 0,φ) (4.38)

F1 = F1(Zi,Gi,Ri,Yi,Xi,β) (4.39)

F2 = F2(Zi,Gi,Ri,Yi,Xi,β) (4.40)

which can be computed based on data Xi, Yi and previous estimates of β,

(σ2k), (ρkl), σ 2 0 , φ for k 6= l = 1, ...,q. This leads to the following full EM

algorithm.

Using the formulas for parameter estimates of (β,σ2k,ρkl,σ 2 0,φ) in (4.23),

(4.24), (4.25), (4.30), (4.32), and estimates for αi, our full EM algorithm is

in (4.41) below.

Schematic Structure of Full EM algorithm (4.41)

1. Initial Step : Set initial values for (β,σ2k,ρkl,σ 2 0,φ) which can be ob-

tained from fitting a small sample to the LMM

2. E-Step : Compute the estimates α̂ (j) ik , α̂

2(j) ik , and (α̂ikα̂il)

(j) using for- mulas in Proposition 2 and (4.39) & (4.40) at step (j − 1)

3. M-step : Compute the MLE (β̂(j), σ̂ 2(j) k , ρ̂

(j) kl , σ̂

2(j) 0 , φ̂

(j)) in Proposition

1 using α̂ (j) ik , α̂

2(j) ik , and (α̂ikα̂il)

(j) from the E-step above

4. Repeat 2 and 3 until all of the iterative parameter estimates are con-

verged, e.g. |β̂(j)1 − β̂

(j−1) 1 |

|β̂(j−1)1 | ≤ δ where β̂(j)1 is the estimated β1 in current

iteration and β̂ (j−1) 1 is the estimated β1 from previous iteration.

Detailed Full EM algorithm

We first calculate F1 and F2 using a known Ji × p matrix Zi, an observed

Ji × 1 vector Yi, observed Ji × (p + 1) matrix Xi from the jth block of

data and given initial estimated values of the parameters (β̂(j−1), (σ̂ 2(j−1) k ),

(ρ̂ (j−1) kl ), σ̂

2(j−1) 0 , φ̂

(j−1)) from the (j − 1)th block of data as in Proposition

E-Step : Compute

F1 (j) = F1(β̂

(j−1) , (σ̂

2(j−1) k ), (ρ̂

(j−1) kl ), σ̂

2(j−1) 0 , φ̂

(j−1),Yi (j),Xi

(j),Zi (j))

(4.42)

F2 (j) = F2(β̂

(j−1) , (σ̂

2(j−1) k ), (ρ̂

(j−1) kl ), σ̂

2(j−1) 0 , φ̂

(j−1),Yi (j),Xi

(j),Zi (j))

(4.43)

Then (α̂ (j) ik , α̂

2(j) ik , (α̂ikα̂il)

(j)) are the elements of F1 (j) and F2

(j)

• α̂(j)ik = F1(k) (j) (4.44)

• α̂2(j)ik = F2(k,k) (j) (4.45)

• (α̂ikα̂il)(j) = F2(k,l)(j) (4.46)

for k 6= l = 1, ...,q, i = 1, ...,m.

M-Step : We plug-in (α̂ (j) ik , α̂

2(j) ik , (α̂ikα̂il)

(j)) from the above E-Step into

the MLE of β, (σ2k), (ρkl) and estimates of σ 2 0 , φ in (4.30), (4.32) to obtain,

β̂(j) =

( m∑ i=1

Xi ′(Σ̂(i)

� )−1Xi

)−1( m∑ i=1

Xi ′(Σ̂(i)

� )−1(Yi −Ziα̂i)

) (4.47)

σ̂ 2(j) k =

m∑ i=1

Jiα̂ 2(j) ik (4.48)

ρ̂ (j) kl =

m∑ i=1

Ji(α̂ikα̂il) (j)

√√√√ 1 N

m∑ i=1

Jiα̂ 2(j) ik

√√√√ 1 N

m∑ i=1

Jiα̂ 2(j) il

m∑ i=1

Ji(α̂ikα̂il) (j)

√√√√ m∑ i=1

Jiα̂ 2(j) ik

√√√√ m∑ i=1

Jiα̂ 2(j) il

(4.49)

Using β̂ and α̂i we calculate residuals r̂i for the ith subject from the model

4.14 as

r̂i = Yi −Xiβ̂−Ziα̂i (4.50)

Divide r̂i as we have done to derive 4.30 and define X̂ (t−1) i such that,

• r̂(t) = (r̂(t)1 , r̂ (t) 2 , ..., r̂

(t) m

) = (r̂12, ..., r̂1J1, r̂22, ..., r̂2J2, ......, r̂m2, ..., r̂mJm )

• r̂(t−1) = (r̂(t−1)1 , r̂ (t−1) 2 , ..., r̂

(t−1) m

)

= (r̂11, ..., r̂1(J1−1), r̂21, ..., r̂2(J2−1), ......, r̂m1, ..., r̂m(Jm−1))

• X̂(t−1)i = (1, r̂ (t−1) i )

Then φ and σ20 can be estimated by

φ̂(j) = m∑ i=1

(X̂ (t−1)′ i X̂

(t−1) i )

−1X̂ (t−1)′ i r̂i

(t) (4.51)

σ̂ 2(j) 0 =

T∑ i=1

r̂es 2 i

N − 2 (4.52)

where r̂esi for i = 1, ...,T is the residual from r̂ (t) = φ0 + φr̂

(t−1) + ηt.

Iteration and Convergence: After we get the parameter estimates from

the M-step, we calculate the relative error between the updated estimates

and previous estimates, e.g. as in (4.53) using β1.

relerror = |β̂(j)1 − β̂

(j−1) 1 |

|β̂(j−1)1 | (4.53)

For convergence of the parameter estimates, we iterate the E-Step and M-

step until the relative error for all of the parameter estimates gets smaller

than a given δ or a certain number of maximum iteration has reached.

4.4 Partial EM (PEM) Procedure for bLMM

Our PEM procedure was developed to overcome the difficulty in fitting a

statistical model to an entire dataset using an existing procedure (e.g. MLE,

Full EM algorithm(4.41)) when the data size exceeds the memory capacity.

The procedure partitions the (big) data into blocks and fits a LMM using

a block and then updates the parameter estimates as new blocks are added

into the computation. The overall flow of PEM is introduced in Figure 4.2.

First, we partition the full data into C blocks. In this step we need to make

sure that the size of each partition can be imported within the memory

capacity. Choose one block of data and fit a LMM using R package nlme

[38] to derive initial parameter estimates (β(1), σ 2(1) k , ρ

(1) kl , σ

2(1) 0 , φ

(1)). Along

Figure 4.2: Overall flow of PEM Procedure

with the parameter estimates, we also calculate S (1) 1 , ...,S

(1) 7 (given below)

that will be used to derive updated estimates. Using (β(1), σ 2(1) k , ρ

(1) kl , σ

2(1) 0 ,

φ(1)) and S (1) 1 , ...,S

(1) 7 , we derive (β

(2), σ 2(2) k , ρ

(2) kl , σ

2(2) 0 , φ

(2)) and S (2) 1 , ...,S

(2) 7

about the data in Block 2 by adding data in Block 2. We then iterate PEM

procedure until the parameter estimates converge. Then the converged (β(2),

σ 2(2) k , ρ

(2) kl , σ

2(2) 0 , φ

(2)) and S (2) 1 , ...,S

(2) 7 from Block 2 will be used to derive the

next set of parameter estimates by adding the data from Block 3. Repeat the

same steps until all of the data is included in the calculation. The updated

estimates (β(C), σ 2(C) k , ρ

2(C) 0 , φ

(C)) from adding the last partition C, are

the final estimates for statistical inference. The PEM algorithm for bLMM

is described in (4.54).

PEM algorithm (4.54)

1. Partition the full data into C blocks

2. Select a block, B1 and fit LMM using R package nlme and get initial

estimates (β̂ (1)

, σ̂ 2(1) k , ρ̂

(1) kl , σ̂

2(1) 0 , φ̂

(1))

3. Use (β̂ (1)

, σ̂ 2(1) k , ρ̂

(1) kl , σ̂

2(1) 0 , φ̂

(1)) and the data from B1 to calculate nec-

essary sum components S (1) 1 , ...,S

(1) 7 that contain summary information

on the data in B1

4. Choose next block, B2

5. E-Step : Compute α̂ik, α̂ 2 ik, and (α̂ikα̂il) from E(α|Y ) and E(αα

′|Y ) using (β̂

(1) , σ̂

2(1) k , ρ̂

(1) kl , σ̂

2(1) 0 , φ̂

(1)) and data in B2

6. M-step : Compute (β̂ (2) j , σ̂

2(2) (k,j)

, ρ̂ (2) (kl,j)

, σ̂ 2(2) (0,j)

, φ̂ (2) j ) using α̂ik, α̂

2 ik, (α̂ikα̂il)

from E-step, and the sum components from B1

7. Repeat 5 and 6 until all of the iterative parameter estimates are con-

verged, e.g. |β̂(2)

(1,j) − β̂(2)

(1,(j−1))| |β̂(2)

(1,(j−1))| ≤ δ where β̂(2)

(1,j) is the estimated β1 in

current iteration and β̂ (2) (1,(j−1)) is the estimated β1 from previous itera-

tion

8. Go to step 4 to select another block and repeat steps 5 to 7 until all blocks are included in the calculation

In the following part of this Section, we will describe the details of E-Step

and M-Step of PEM. Assume that the full data is partitioned into C blocks.

Let Bi represent the data in the ith block and mi represent the number of

subjects in the ith block for i = 1, ...,C. Using the data in B1, we fit a LMM

using R package nlme and get the initial estimates (β̂ (1)

, (σ̂ 2(1) k ), (ρ̂

(1) kl ), σ̂

2(1) 0 ,

φ̂(1)), the estimated random effects α̂(1), and the residuals (r̂(1)). Then we

calculate the following sum components (4.55)-(4.61) which will be used in

the M-step to get the updated parameter estimates when we add the next

block of data (B2) into the analysis.

S (1) 1 =

∑ i∈B1

Xi ′(Σ̂(i)

� )−1Xi (4.55)

S (1) 2 =

∑ i∈B1

Xi ′(Σ̂(i)

� )−1(Yi −Ziα̂(1)) (4.56)

S (1) 3k =

∑ i∈B1

Jiα̂ 2 ik (4.57)

S (1) 4kl =

∑ i∈B1

Jiα̂ikα̂il (4.58)

S (1) 5 =

∑ i∈B1

X (t−1)′ i X

(t−1) i (4.59)

S (1) 6 =

∑ i∈B1

X (t−1)′ i r

(1)(t) i (4.60)

S (1) 7 =

∑ i∈B1

r̂es (1)2 i (4.61)

where k 6= l = 1, ...,q. For the ith subject, Xi and Yi are observed covariates

and response in the B1 block respectively, Zi is a known matrix associated

with α̂(1), Σ(i) �

is defined by estimated (σ̂ 2(1) 0 , φ̂

(1)), α̂2ik and (α̂ikα̂il) can be

calculated using α̂(1), X (t−1) i and r

(1)(t) i are derived from (r̂

(1)) and r̂es (1) i

is the residual from fitting (4.29) with (r̂(1)).

E-Step

E-Step in PEM is almost identical to the E-Step in the full EM algo-

rithm. Given the parameter estimates (β̂ (1)

, (σ̂ 2(1) k ), (ρ̂

(1) kl ), σ̂

2(1) 0 , φ̂

(1)) from

B1, Yi and Xi from B2, we calculate (α̂ik,α̂ 2 ik,(α̂ikα̂il)) from EB2 (αi|Yi) and

EB2 (αiαi ′|Yi) for k 6= l = 1, ...,q and i = 1, ...,m using (4.62) and (4.63) as

in (4.42) and (4.43).

F (B2) 1 = F1(β̂

(1) , (σ̂

2(1) k ), (ρ̂

(1) kl ), σ̂

2(1) 0 , φ̂

(1),Yi (2),Xi

(2),Zi (2)) (4.62)

F (B2) 2 = F2(β̂

(1) , (σ̂

2(1) k ), (ρ̂

(1) kl ), σ̂

2(1) 0 , φ̂

(1),Yi (2),Xi

(2),Zi (2)) (4.63)

From F (B2) 1 and F

(B2) 2 (α̂ik,α̂

2 ik,α̂ikα̂il) can be calculated as,

α̂ik = F (B2) 1 (k)

α̂2ik = F (B2) 2 (k,k)

α̂ikα̂il = F (B2) 2 (k,l)

Similarly, for block Bc, c = 1, ...,C, given the parameter estimates (β̂ (c−1)

(σ̂ 2(c−1) k ), (ρ̂

(c−1) kl ), σ̂

2(c−1) 0 , φ̂

(c−1)) from B1(c−1), Yi and Xi from B(c), we can

calculate,

F (Bc) 1 = F1(β̂

(c−1) , (σ̂

2(c−1) k ), (ρ̂

(c−1) kl ), σ̂

2(c−1) 0 , φ̂

(c−1),Yi (c),Xi

(c),Zi (c))

(4.64)

F (Bc) 2 = F2(β̂

(c−1) , (σ̂

2(c−1) k ), (ρ̂

(c−1) kl ), σ̂

2(c−1) 0 , φ̂

(c−1),Yi (c),Xi

(c),Zi (c))

(4.65)

α̂ik = F (Bc) 1 (k) (4.66)

α̂2ik = F (Bc) 2 (k,k) (4.67)

α̂ikα̂il = F (Bc) 2 (k,l) (4.68)

Then, (α̂ik, α̂ 2 ik, (α̂ikα̂il)) from Bc block will be used in the M-Step below to

get updated estimates of (β̂ (c)

, (σ̂ 2(c) k ), (ρ̂

2(c) 0 , φ̂

(c)).

M-Step

In the M-Step, PEM procedure accumulates the information from the pre-

vious blocks as a new block of the data is added to calculating the parameter

estimates for fitting LMM. After obtaining the parameter estimates (β̂ (c−1)

σ̂ 2(c−1) k , ρ̂

(c−1) kl , σ̂

2(c−1) 0 , φ̂

(c−1)) and (S (c−1) 1 , S

(c−1) 2 , S

(c−1) 3k , S

(c−1) 4kl , S

(c−1) 5 , S

(c−1) 6 ,

S (c−1) 7 ) from B(c−1), we can update the parameter estimates and the sum com-

ponents by adding data from the next block Bc as follows.

β̂(c) =

( S

(c−1) 1 +

∑ i∈Bc

Xi ′(Σ(i)

� )−1Xi

)−1( S

(c−1) 2 +

∑ i∈Bc

Xi ′(Σ(i)

� )−1(Yi −Ziα̂i)

) (4.69)

σ̂ 2(c) k =

( S

(c−1) 3k +

∑ i∈Bc

Jiα̂ 2 ik

) (4.70)

ρ̂ (c) kl =

S (c−1) 4kl +

∑ i∈Bc

Jiα̂ikα̂il√ S

(c−1) 3k +

∑ i∈Bc

Jiα̂ 2 ik

√ S

(c−1) 3l +

∑ i∈Bc

Jiα̂ 2 il

(4.71)

φ̂(c) =

( S

(c−1) 5 +

∑ i∈Bc

X (t−1)′ i X

(t−1) i

)−1( S

(c−1) 6 +

∑ i∈Bc

X (t−1)′ i r

(t) i

) (4.72)

σ̂ 2(c) 0 =

Tc − 2

( S

(c−1) 7 +

∑ i∈Bc

r̂es 2 i

) (4.73)

for Tc = c∑ i=1

mi∑ j=1

(Jj − 1).

Iteration and Output Step

We iterate the E-Step and M-Step until the parameter estimates are con-

verged i.e. relative error = |β̂(c)

(1,j) − β̂(c)

(1,(j−1))| |β̂(c)

(1,(j−1))| ≤ δ. At block Bc, when the

parameters are converged, the converged parameter estimates (β̂ (c)

, σ̂ 2(c) k ,

ρ̂ (c) kl , σ̂

2(c) 0 , φ̂

(c)) and the following sum components are saved to be used for

updating parameter estimates when block B(c+1) is added.

S (c) 1 =

c∑ j=1

∑ i∈Bj

Xi ′(Σ̂(i)

� )−1Xi (4.74)

S (c) 2 =

c∑ j=1

∑ i∈Bj

Xi ′(Σ̂(i)

� )−1(Yi −Ziα̂(c)) (4.75)

S (c) 3k =

c∑ j=1

∑ i∈Bj

Jiα̂ 2 ik (4.76)

S (c) 4kl =

c∑ j=1

∑ i∈Bj

Jiα̂ikα̂il (4.77)

S (c) 5 =

c∑ j=1

∑ i∈Bj

X (t−1)′ i X

(t−1) i (4.78)

S (c) 6 =

c∑ j=1

∑ i∈Bj

X (t−1)′ i r

(t) i (4.79)

S (c) 7 =

c∑ j=1

∑ i∈Bj

r̂es 2 i (4.80)

PEM procedure will update the parameter estimates until all C blocks are

included in the analysis. Then the parameter estimates (β̂ (C)

, σ̂ 2(C) k , ρ̂

σ̂ 2(C) 0 , φ̂

(C)) after adding all C blocks are used for statistical inference.

Chapter 5

Performance of bLMM PEM

A simulation study was conducted to compare the performance of our bLMM

PEM with four benchmark procedures. The first two were based on entire

data, then the entire data can be inputed into R. The next two use blocks

of the entire data for analyses.

1. Using entire data to fit

• Linear mixed effects model using Broyden−Fletcher−Goldfarb−Shanno

(BFGS) algorithm (R package nlme), or

• Linear mixed effects model using the full EM algorithm

2. Using a block/blocks of data to fit

• Linear mixed effects model using meta analysis of sub LMMs

• Linear mixed effects model using one block of the data

Procedure 1, LMM using BFGS algorithm, was implemented in function lme

in the R package nlme. This package used BFGS algorithm [32] to find MLE

of the parameters of interest.

Procedure 2, LMM using the full EM algorithm, employed an expectation

and maximization approach to find the parameter estimates that maximize

the likelihood fuction using the entire dataset. The EM algorithm required

initial values of the parameters. Even though EM algorithm was known to

be robust to the initial values, the computation time depends on the initial

values. The reasonable initial values we chose were data driven by deriving

the estimates from a random partition of the data. So, first we chose a

subset of the data to derive the parameter estimates to be used as initial

value. Using these initial values, we iterated the EM algorithm until the

estimates of the parameters met the convergence criteria having the relative

errors smaller than 10−3 or the algorithm reached 200 iterations.

Our Procedure, LMM using PEM algorithm, follows the PEM algorithm in

(4.54). We randomly partitioned the data into C blocks. For the first block,

the function lme in R package nlme was used to fit a LMM. Using the

resulting parameter estimates, summation information and next blocks of

data, we applied our PEM to update the parameter estimates, as introduced

in Chapter 4 Section 4.4. Our PEM procedure iterated until the estimated

parameters met the convergence criteria of 10−3, or the algorithm reached

200 iterations, using the same convergence criteria as those in the full EM

algorithm.

Procedure 3, LMM using meta analysis, was done on the same data parti-

tion used for our bLMM PEM procedure. We treated the partitions to be

independent data. We parallelized the LMM models among the partitions to

get the parameter estimates using the function lme in R package nlme. We

then merged the estimates using fixed effects meta analysis [46]. For the ith

block, let γ̂i be a parameter estimate, vi be the standard error from fitting

the LMM model to the ith block and r be the true value of γ̂i. Then a fixed

effects model γ̂i = r + ei is used to combine the estimates from all C blocks.

The overall parameter estimate γ̂ by merging γ̂1, ...γ̂C can be calculated by

γ̂ =

C∑ i=1

wiγ̂i

C∑ i=1

(5.1)

where wi = v −1 i and var(γ̂) = 1/

C∑ i=1

wi.

Procedure 4, using LMM on a block of the data, computed the estimates

based on the first block partitioned for our bLMM PEM procedure. So, the

estimates were derived from applying the function lme in R package nlme

on the first block of the data.

Modified PEM

Our preliminary simulation study showed that the parameter estimates using

PEM procedure had reasonable variance and bias except for the φ parameter.

Finding the reason for this bias is a remaining task to be studied but for the

purpose of using PEM to analyze our SCAN-ECHO data, we were able to

improve the performance of PEM by using a hybrid procedure with the pa-

rameter estimates of φ from the BFGS algorithm. Therefore, the simulation

study and the application in Chapters 5 and 6 uses the hybrid algorithm.

Our hybrid algorithm works exactly same as our PEM procedure introduced

in Section 4.4 but for φ we use the parameter estimate derived from BFGS

algorithm from each block. For example, when updating the parameter esti-

mates for block 1, φ will be estimated from BFGS algorithm, which will be

identical to the estimates from the meta analysis, and stay the same as the

PEM procedure iterates within the block for convergence.

Model Setup

The model used for the simulation study was a longitudinal random effects

model with three fixed effects, one random intercept and one random slope.

The errors had AR(1) serial correlation.

Yi = β0 + Xi1β1 + Xi2β2 + Xi3β3 + (α1i + αi2Xi3) + �i (5.2)

where,

• E(αi) = (

0 0

) and var(αi) =

( σ21 ρσ1σ2

ρσ2σ1 σ 2 2

) for αi = (α1i,α2i)

• E(�i) = 0 and var(�i) = σ20

1 −φ2 Σ

(i) � where Σ

(i) � =

 

1 φ · · · φJi−1 φ 1 · · · φJi−2 ...

... . . .

... φJi−1 · · · φ 1

 

A dataset was simulated for 5,000 subjects with 5 repeated outcome measures

using β0 = 6, β1 = 1.1, β2 = 1.3, β3 = −2.1, σ21 = 2.4, σ22 = 1.7, ρ =

0.3, σ20 = 0.3, φ = 0.6. The covariates represented age at baseline (X1), a

binary intervention (X2), and time (X3). X1 was generated from N(50, 5),

X2 was generated from binomial distribution with p(intervention) = 0.6 and

p(control) = 0.4, and X3 = time = (1,2,3,4,5) which is same for all subjects.

The parameters used to generate these variables are similar to those from

our original data from the VA patients. For the three procedures bLMM

PEM, meta analysis and partial data analysis, the subjects were divided

into 5 partitions with 1,000 patients in each partition. MSE, bias, %bias (=∣∣true bias

∣∣) and variance of the estimates from the simulation for the 5 methods are provided in Tables 5.1, 5.2, 5.3, and 5.4 respectively. The average time

used for estimation is provided in Table 5.5.

‘true’ indicates the true values of the parameters used to generate simulated

data, ‘nlme’ represents the method using BFGS algorithm, ‘EM’ stands for

the full EM algorithm, ‘PEM’ stands for the Partial EM, ‘Meta’ stands for the

meta analysis with a fixed effects model, and ‘partial’ stands for using partial

data and fitting the model using BFGS algorithm. The result is based on 450

simulations and the mean-squared error was calculated by adding the esti-

mated variance and square of the bias of the parameters estimated from the

450 simulations. Among the 5 methods, we expect BFGS algorithm (‘nlme’)

and full EM (‘EM’) to perform better than the other methods because it uses

the entire data at once for estimation.

param true nlme EM PEM Meta Partial β1 1.1 0.00001 0.00004 0.00007 0.00001 0.00012 β2 1.3 0.00019 0.00159 0.00168 0.00019 0.00191 β3 -2.1 0.00006 0.00006 0.00006 0.00006 0.00066 σ21 2.4 0.00323 0.03308 0.05324 0.00449 0.04098 σ22 1.7 0.00066 0.00088 0.00084 0.00065 0.00575 ρ 0.3 0.00015 0.00102 0.0018 0.00018 0.00172 σ20 0.3 0.14883 0.00604 0.00094 0.15728 0.16353 φ 0.6 0.00043 0.00044 0.00044 0.00044 0.00447

Table 5.1: Mean-square error of the estimates

param true nlme EM PEM Meta Partial β1 1.1 -0.0001 -0.0002 -0.00018 -0.0001 -0.00032 β2 1.3 0.000004 -0.00041 -0.00067 0.00001 -0.00115 β3 -2.1 0.00023 0.00023 0.00026 0.00022 0.00063 σ21 2.4 -0.00263 0.17418 -0.17382 -0.02612 -0.02952 σ22 1.7 0.00044 0.0147 -0.00995 -0.00017 -0.00035 ρ 0.3 -0.00005 -0.02986 0.03191 0.00332 0.00247 σ20 0.3 0.38529 -0.07756 -0.02523 0.39596 0.39796 φ 0.6 -0.00012 0.00241 0.00241 0.00241 0.00484

Table 5.2: Bias of the estimates

param true nlme EM PEM Meta Partial β1 1.1 0.00009 0.00018 0.00016 0.00009 0.00029 β2 1.3 0.00000 0.00032 0.00051 0.00000 0.00088 β3 -2.1 0.00011 0.00011 0.00013 0.0001 0.0003 σ21 2.4 0.0011 0.07257 0.07242 0.01088 0.0123 σ22 1.7 0.00026 0.00865 0.00585 0.0001 0.0002 ρ 0.3 0.00015 0.09953 0.10637 0.01107 0.00822

σ20 0.3 1.28428 0.25854 0.08412 1.31987 1.32652 φ 0.6 0.00021 0.00402 0.00402 0.00402 0.00807

Table 5.3: % Bias of the estimates

param true nlme EM PEM Meta Partial β1 1.1 0.00001 0.00004 0.00007 0.00001 0.00012 β2 1.3 0.00019 0.00159 0.00168 0.00019 0.00191 β3 -2.1 0.00006 0.00006 0.00006 0.00006 0.00066 σ21 2.4 0.00322 0.00274 0.02302 0.0038 0.04011 σ22 1.7 0.00066 0.00066 0.00074 0.00065 0.00575 ρ 0.3 0.00015 0.00013 0.00078 0.00017 0.00171 σ20 0.3 0.00038 0.00002 0.00031 0.0005 0.00517 φ 0.6 0.00043 0.00044 0.00044 0.00044 0.00445

Table 5.4: Variance of the estimates

nlme EM PEM Meta Partial time 38.77 150.45 54.64 27.80 2.84

Table 5.5: Efficiency (sec)

The MSE of the parameter estimates using BFGS algorithm were smallest

overall except for σ20 . The MSE is expected to be smaller for BFGS algorithm

and EM algorithm because they use the entire data at once to derive the

parameter estimates. However, for EM algorithm, the MSE was not as good

as the BFGS algorithm. Among our PEM procedure, Meta analysis, and

using partial data which uses block(s) of data, the performance of PEM

procedure or Meta analysis differed by parameters. Especially for σ20 , the

MSE, bias and variance was smallest among all five algorithms. This maybe

because PEM procedure has “superefficiency”. The computation time of

PEM was a little longer than meta analysis with only few seconds.

Chapter 6

Evaluation of SCAN-ECHO training on diabetes treatment using bLMM PEM

To examine the effect of SCAN-ECHO training on diabetes treatment using

the trend of original A1c values by days, we fitted LMMs using our bLMM

PEM method on each of homogeneous subgroups and for the entire patients

with clusters as a covariate. Because we were not able to conduct cluster

analysis when using the original A1c values, we subgrouped the patients

based on the result from our cluster analysis used the mean of A1c by year

in Section 3.2. The cluster analysis using the mean of A1c suggested that the

patients can be divided into 5 subgroups as in Figure 3.2. So, a LMM was

fitted for each subgroup as in (6.1)and then for the entire patient population

with the cluster variable as a covariate as in (6.2).

A1ci = scan + log(time) + baseage + gender + race + BMI

+ SBP + DBP + med + cons + cvdiag + α1i + α2ilog(timei) (6.1)

A1ci = scan + log(time) + cluster + baseage + gender + race + BMI

+ SBP + DBP + med + cons + cvdiag + α1i + α2ilog(timei) (6.2)

where i is the index for the ith patient. The distribution of time in days

was heavily right skewed. Therefore, in our LMMs (6.1, 6.2), time was log

transformed to reduce the skewness.

The parameter estimates of the fixed effects from fitting LMMs (6.1) and

(6.2) are provided in Table 6.1 and Table 6.2 respectively.

Cluster A Cluster B Cluster C Cluster D Cluster E Estimate Estimate Estimate Estimate Estimate

(Intercept) 4.68529 5.62723 6.64122 8.17743 11.16059 scan1 -0.01255 0.00314 0.01521 0.04584 0.22487

log(time) -0.00447 0.01238 0.03701 0.03667 0.00089 baseage 0.00485 0.00334 0.00023 -0.00267 -0.00864

GenderM 0.10577 0.08439 0.06080 0.04939 -0.17455 raceOTHER -0.00457 0.00345 -0.03526 -0.03058 -0.04296 raceWHITE 0.00028 -0.03478 -0.02947 -0.04619 -0.13193

SBP -0.00067 0.00047 0.00028 -0.00153 -0.00496 DBP 0.00095 0.00107 0.00366 0.00675 0.01499 BMI 0.02112 0.01083 0.00617 -0.00161 -0.02344

med1 0.10046 0.16606 0.09695 0.05824 0.08578 cons1 0.06127 0.03829 0.07746 0.10958 0.03940

cvdiag1 -0.00882 -0.01206 0.00150 0.02552 0.07696

Table 6.1: Estimated coefficients of the fixed effects in the LMM for original A1c by clusters

The effect of SCAN-ECHO training increased from Cluster A to Cluster E.

SCAN-ECHO training decreased the A1c levels of the patients in Cluster

A by 0.01255 but starting from Cluster B, SCAN-ECHO increased the A1c

levels of the patients by 0.00314, 0.01521, 0.04584, 0.22487 in Clusters B,

C, D, E respectively. This shows that SCAN-ECHO helps to reduce the

A1c levels for patients with relatively healthy diabetic status where as, for

patients with relatively poor diabetic status, SCAN-ECHO training does not

help to reduce A1c levels.

Covariate Estimate Covariate Estimate Covariate Estimate (Intercept) 5.15236 ClusterE 2.73716 DBP 0.00353

scan1 0.02065 baseage 4.44322 BMI 0.00524 time 0.02070 GenderM 0.05369 med 0.13118

ClusterB 0.00077 raceOTHER -0.01516 cons 0.06148 ClusterC 0.73788 raceWHITE -0.03075 cvdiag -0.00039 ClusterD 1.65668 SBP -0.00040

Table 6.2: Estimated coefficients of the fixed effects in the LMM for original A1c from all patients

The estimated fixed effect of SCAN-ECHO training from our model (6.2)

showed that the treatments from PCPs who had SCAN-ECHO training in-

creased A1c level in average by 0.02065 relative to the treatment from PCPs

who did not have SCAN-ECHO training.

Chapter 7

Discussion and Conclusion

In the first part of this thesis, we introduced the challenges and their solu-

tions in evaluating the effect of SCAN-ECHO training on treating diabetes.

Section 3.2 evaluates the effects of SCAN-ECHO training based on three

summary measures of A1c level which were annual mean, SD and CV of A1c

levels. SCAN-ECHO training appeared to slightly decrease the mean of A1c

for the subgroup of patients whose mean of A1c is around 5.5, and with a flat

trend over 6 years. In contrast, SCAN-ECHO training appeared to increase

the mean of A1c for the subgroup of patients who had flat trend of mean A1c

levels around 10.2 over 6 years. For variability measures by SD and CV of

A1c, SCAN-ECHO statistically significantly decreased the SD and CV of A1c

over time for subgroups with a relatively smaller variability. Hence, based on

the analysis of these measures, SCAN-ECHO training showed statistically

significant effect in decreasing (or maintaining) the yearly mean, SD, and

CV of A1c over time for the subgroup patients who were relatively healthy,

whereas, for patients with high mean A1c or elevated SD or CV, SCAN-

ECHO showed a negative effect by increasing the A1c slightly. As shown in

Chapter 6, the results from LMMs using original A1c levels were similar to

the results for annual mean of A1c. Overall, SCAN-ECHO training appear

to be beneficial for patients with better diabetic health conditions, whereas

for patients with worse diabetic health conditions, it might be helpful to see

a specialist for diabetes treatment rather than the PCPs in the outpatient

clinics.

Our proposed PEM procedure for fitting bLMM provided a significant con-

tribution to our analysis. It allowed us to fit a LMM using the entire data

when the data exceeded the memory capacity. Data partition made it pos-

sible to fit a LMM to a block of the big data within the memory limitation

and the PEM procedure made it possible to update the parameter estimates

and information as additional blocks of data are added. The parameter es-

timates passed over from a previous block make the PEM procedure very

efficient because they are not only good initial values, but also placed in the

updating formula of our PEM procedure when the next block is added into

the calculation. PEM procedure is a general method that can be adapted

to various types of big-data when fitting a longitudinal linear mixed effects

model.

At the present time, however, it is unclear how to parallelize the sequential

updating algorithm in the PEM procedure. Also, as each block is added into

the calculation, the PEM procedure iterates within the current block of data

until the convergence of the estimates. Therefore, PEM can be computation-

ally intensive. However, as shown in Chapter 5, the calculation time by PEM

is not too much more than that by the BFGS algorithm using the entire data,

and meta analysis using BFGS algorithm for each block of data. Therefore,

when BFGS algorithm using the entire data cannot be used for a big data,

considering the performance of PEM being better than the meta analysis

in terms of the MSE, the time difference between PEM and meta-analysis

procedures can be a beneficial trade-off.

Current PEM procedure is developed for a longitudinal LMM when the ran-

dom error has an AR(1) correlation structure. AR(1) structure is widely

used to capture the correlation between the repeated measures within a sub-

ject, but it ignores the actual length of the time between each consecutive

measures. This could be a limitation when correlation varies greatly as the

length between the measures changes in a study. Generalization of the cor-

relation structure of the random error is one of our future research agenda

for bLMM.

Also, current PEM does not provide hypothesis testing results for the fixed

effects. However, hypothesis testing for the fixed effect can be informative.

For the fixed effects, we already have considered different approaches in de-

riving the standard error of the fixed effects such as using our updating PEM

procedure, or bootstrap sampling which will be implemented in future re-

search. The challenge in finding a way to derive the standard error of the

variance components of the LMM still remains a challenge.

In short, our proposed PEM procedure is a significant contribution to fitting a

longitudinal LMM for big-data. It sets a good foundation for future research

when more general correlation structures of the random errors and more

complex models including interactions between the fixed and random effects

need to be considered.

Part II

Projection Pursuit Ellipse (PPE)

Chapter 8

Original PPE

In biomedical research, when investigating the effect of an intervention, the

simplest analysis is mean comparison between a control group and the in-

tervention group, or among many groups. Widely-used statistical tests for

mean comparison in low-dimensions are t-test and ANOVA, and Hotelling’s

T2 and MANOVA in high-dimensions. The key assumptions for all of these

tests are that the data are distributed normally and have a common variance-

covariance. In this second part of the dissertation, we introduce Projec-

tion Pursuit Ellipse (PPE) for testing the equality of covariances for high-

dimensional data and present our proposed enhanced high-dimensional PPE

(hPPE) algorithm, evaluate the performance of PPE procedures in a com-

parative study with the Bartlett’s test and a modern benchmark proposed

by T.Cai et al [39].

8.1 Introduction

Since equal covariance is one of the basic assumptions in a large number

of multivariate techniques, a test of equal covariance matrices is often used

in preliminary data exploration. For high-dimensional data, classic tests to

evaluate if several covariance matrices are equal include the likelihood ratio

test [1] and a modified likelihood ratio test, such as, Bartlett’s test [2, 45].

Under the normality assumption, consider a G group comparisons where

the random samples in the gth population Xg,1, ...,Xg,ng iid∼ Np(µg, Σg) for

g = 1, ...,G. Then to test

H0 : Σ1 = Σ2 = ... = ΣG

Ha : Σi 6= Σj for at least one pair of (i,j) where i 6= j = 1, ...,G

the Bartlett’s test statistic is

B =

G∏ g=1

|Sg| 1 2

(ng−1)

|S0| 1 2

(n−G) (8.1)

which has a χ2G−1 distribution under H0

where, G∑ g=1

ng = n, Sg =

ng∑ i=1

(Xi−X̄g)(Xi−X̄g)′

ng−1 , S0 =

G∑ g=1

ng∑ i=1

(Xi−X̄g)(Xi−X̄g)′

n−G , g =

1, ...,G.

Clearly, the Bartlett’s test statistic B is a function of generalized variance

that is only based on the volumes (i.e. determinants) of the sample co-

variance matrix, and thus may not be powerful enough to detect the true

difference among population covariances that have the same volume but dif-

ferent orientations and shapes. To consider all possible information from

a multivariate dataset in high-dimension, developing an alternative way to

find the maximum discrepancies of the covariance matrices is necessary to

capture the true covariance differences better than the Bartlett’s test would.

The idea of Projection pursuit Ellipses (PPE) was introduced by Wei and

Sun (manuscript). PPE was motivated by the concept of Projection Pursuit

[12, 34, 47, 48]. PPE explores the covariance structures of high-dimensional

data by projecting high-dimensional data into low-dimensional spaces and

comparing the projected ellipses of the covariances. Three PPE indices are

proposed where the maximum of each index shows the maximum discrepancy

of the projected ellipses. A PPE algorithm that maximizes these indices us-

ing constrained optimization had been developed. The three PPE indices

and the PPE algorithm will be reviewed in Sections 8.2 and 8.3 respectively.

In Chapter 9, our proposed hPPE algorithm will be introduced and a power

analysis using simulated data will be conducted.

8.2 PPE Index

Consider the case that G = 2 groups. W1,1, ...,W1,n1 ∈<2 and W2,1, ...,W2,n2 ∈

<2 be the linear projections of X1,1, ...,X1,n1 and X2,1, ...,X2,n2 by α i.e.,

W1,i = α TX1,i and W2,i = α

TX2,i for α Tα = I2. Here α is a p × 2 matrix

α = (α1,α2) representing a particular 2−dimensional space spanned by two

p−dimensional vectors α1 and α2.

Then W (j) (α)

= (Wj,1, ...,Wj,nj ) has mean α TµX,j and covariance α

T ΣX,jα,

denoted as W (j) (α) ∼ (αTµX,αT ΣXα) for j = 1, 2. The three PPE indices

proposed were

• PPE Index 1: PPE index based on 2 −dimensional Bartlett’s test

• PPE Index 2: PPE index based on a generalized distance

• PPE Index 3: PPE index based on a singular value decomposition

(SVD)

PPE index 1 applies Bartlett’s test to the projected data, PPE index 2 applies

a generalized F-test for testing equality of variance. PPE index 3 uses SVD to

incorporate the orientation and shape of the ellipse of the projected data. The

rationale for the three indices is as follows. An F-test for testing the equality

of 2 1 −dimensional variances is the most powerful test in 1 −dimension.

Both the Bartlett’s test and F-test are simple generalizations of the univariate

F-test to higher dimensions. Although, these simple direct generalization

may not carry its power to high dimension, it may have a better power in

2 −dimensional projection pursuit. Hence, we proposed PPE indices 1 and

2. PPE index 3 uses SVD to bring back the information that may have lost

in PPE indices 1 and 2.

8.2.1 PPE Index 1

PPE Index 1, I1 is based on 2-d Bartlett’s test and is defined as

I1 = −(n1 − 1)ln|S1|− (n2 − 1)ln|S2| + (n− 2)ln|S0| (8.2)

where S1 and S2 are 2×2 covariance matrix of W (1) = W (1) (α)

and W (2) =

W (2) (α)

and S0 is the pooled covariance matrix of S1 and S2. Under the null

distribution, it is known that I1 ∼ χ21.

8.2.2 PPE Index 2

PPE index 2, I2 is a generalization of the univariate F-test for testing the

equality of 2 covariances Σ1 and Σ2. In the univariate case p = 1, under

the null hypethesis, s22/s 2 1 ∼ F(n2 − 1,n1 − 1) where s21 and s22 are sample

variances of the first and second group samples.

Similarly, I2(α) is defined as

I2(α) = d̃(S −1 1 S2 − I2) (8.3)

where S1 and S2 are 2×2 covariance matrix of W (1) = W (1) (α)

and W (2) = W (2) (α)

and d̃ = ||D|| = ||(αT Σ1α)−1(αT Σ2α)−I2|| = √ d211 + d

2 12 + d

2 21 + d

2 22, where

α = (α1,α2) and D ∆ = (αT Σ1α)

−1(αT Σ2α) − I2 ∆ =

( d11 d12 d21 d22

) . I2(α) does

not follow a known distribution, so its critical value for conducting the test of

equal covariance was simulated under the null distribution. Specifically, 1000

I2(α) were simulated and its simulated value at 95% quantile was selected

to be the critical value at the significance level of 0.05.

8.2.3 PPE Index 3

PPE index 3 (I3) used SVD to incorporate the shape and orientation of the

ellipse by a projected data draws. By SVD on S1 and S2, where S1 and S2

are 2×2 covariance matrix of W (1) = W (1) (α)

and W (2) = W (2) (α)

it can be shown

that,

S1 =

( cos θ1 −sin θ1 sin θ1 cos θ1

)( a21 0 0 b21

)( cos θ1 −sin θ1 sin θ1 cos θ1

) (8.4)

S2 =

( cos θ2 −sin θ2 sin θ2 cos θ2

)( a22 0 0 b22

)( cos θ2 −sin θ2 sin θ2 cos θ2

) (8.5)

Then I3(α) is defined as

I3(α) =

[(( a2 a1

)2 −1 )2

(( b2 b1

)2 −1 )2

( a2/b2 a1/b1

)2 sin2(θ1−θ2)

]1 2

(8.6)

where,

(( a2 a1

)2 − 1 )2

and

(( b2 b1

)2 − 1 )2

has the shape information and

sin2(θ1 − θ2) has the orientation information on the ellipse of the projected

data. I3(α) does not follow a known distribution, so its critical value for

conducting the test of equal covariance was simulated under the null distri-

bution. Specifically, 1000 I3(α) were simulated and the simulated value at

95% quantile was selected to be the critical value at the significance level of

0.05.

8.3 PPE Algorithm

A PPE algorithm is an optimization algorithm that maximizes a PPE in-

dex over all possible projections of the data, i.e. over α s.t. αTα = I2.

Thus, this is a constrained optimization problem. A PPE algorithm must

be developed in consideration of efficiency and reliability of the algorithm.

As a testing procedure, a PPE algorithm needs to find the global maximum

of I(α) : max α:αT α=I2

I(α). As an exploratory or visualization tool, a PPE al-

gorithm needs to find the substantial local maxima instead of finding one

single global maximum in order to have additional insight into the multiple

discrepancies of the high-dimensional variance matrices. Constrained op-

timization itself is a difficult problem, especially when both the objective

function and the constraint functions are non-linear as in the PPE algo-

rithm. The original PPE algorithm employed an optimization tailored to a

non-linear objective function subject to non-linear equality constraints [18].

The key idea of the method is to incorporate the constraints into a modified,

unconstrained objective function, which can then be optimized by uncon-

strained optimization techniques such as a quasi-Newton method. The al-

gorithm, therefore, can be called a constrained quasi-Newton method. This

optimization method was efficient and reliable in finding a local maximum.

Since PPE index 2 and PPE index 3 have many local maxima under the

null distribution, a PPE algorithm with an enhanced global search on the

quasi-Newton method was developed. Specifically, let X and Y be two data

matrices with dimensions p×n1 and p×n2 respectively. With two p×1 pro-

jection vectors α1 and α2, W (1) (α)

and W (2) = W (2) (α)

be the linear projections of

X1 = (X1,1, ...,X1,n1 ) and X2 = (X2,1, ...,X2,n2 ) with respect to α = (α1,α2),

i.e. W (j) (α)

= (Wj,1, ...,Wj,nj ) for j = 1, 2. The constraints required by the

PPE algorithm are

• α1α1 = 1

• α2α2 = 1

• α1α2 = 0

Under the three constraints, the following the next 3 steps form the PPE

algorithm.

1. Preliminary Search: Search along p(p− 2)

2 pairs of coordinate axes and

find α(1) = (α (1) 1 ,α

(1) 2 ) that maximizes the PPE index.

2. Course Search: Starting at α(1), perform a large stepping search along

the pairs of orthogonal directions and find α(2) that maximizes the PPE

index.

3. Local Search: Starting at α(2), perform a local search using a derivative

free constrained optimization by linear approximations (COBYLA) al-

gorithm [39].

To simulate critical values of our PPE index (or test), we generate 2-data

samples. We keep the maximum from step 3 as an estimate of max α:αT α=I2

I(α)

under the H0. We repeatedly generate 2-data samples under H0 and obtain

the max α:αT α=I2

I(α) 1000 times and then compute the empirical upper γ-quantile

of max α:αT α=I2

I(α) as the critical value at the level γ, e.g. γ = 0.05.

Chapter 9

High-dimensional PPE (hPPE)

9.1 hPPE algorithm

A PPE algorithm for testing the equality of two covariance matrices provides

informative information in finding the projection that describes the most

dependencies of the two data. Therefore, we included another step of the

optimization algorithm to the original PPE algorithm to improve the chance

of finding the true global maximum of the index to improve the power of PPE

in testing for equal covariance matrices in high-dimension. High-dimensional

PPE (hPPE) algorithm

1. Step 1: Select T number of random projection satisfying the constraints

and conduct local search using COBYLA as in the step 3 of the original

PPE algorithm using each of random projections as its initial value.

2. Step 2: Conduct the original PPE algorithm.

Compare the index values obtained from steps 1 and 2 an decide the projec-

tion that maximizes the PPE index among these T +1 values. PPE algorithm

and hPPE algorithm are reprogrammed and implemented using Julia version

0.6.2.

9.2 hPPE power analysis

To evaluate the performance of our hPPE, we conducted a power analysis

using simulated data to compare our method with the Bartlett’s test and to

a high-dimension method proposed by T.Cai et al [4]. Bartlett’s test was

provided in Section 8.1. T.Cai’s method is introduced below.

Let X1,1, ...,X1,n1 iid∼ F(Σ1) and X2,1, ...,X2,n2

iid∼ F(Σ2) where, Σ1 = (σij1)p×p,

Σ2 = (σij2)p×p and n = n1 + n2. Then the null hypothesis H0 : Σ1 = Σ2 is

identical to H0 : max1≤i≤j≤p|σij1 −σij2| = 0. Thus, T.Cai et al proposed the

test statistic,

Mn =: max 1≤i≤j≤p

Mij = max 1≤i≤j≤p

(σ̂ij1 − σ̂ij2)2

(θ̂ij1/n1 + θ̂ij2/n2) (9.1)

where, θ̂ij1/n1 and θ̂ij2/n2 are estimated variances of σ̂ij1 and σ̂ij2. Then test

Φα is defined as

Φα = I(Mn ≥ qα + 4logp− loglogp)

where qα is the 1 − α quantile of the Type 1 extreme value distribution

p,n →∞ with qα = −log(8π) − 2loglog(1 −α)−1.

The power analysis was done to test for H0 : Σ1 = Σ2 where Σ1 and Σ2 are

covariance matrices of simulated p − dimensional data X1 and X2. X1 =

(X1,1, ...,X1,n) iid∼ N(0, Σ1) and X2 = (X2,1, ...,X2,n)

iid∼ N(2, Σ2) where Σ1 =

Ip, Σ2 = R ′IpR and R is a known p×p matrix that reflects the ratio of the

data at each dimension.

We compared 5 different methods using PPE and hPPE.

• PPE Index 1 (Using 2-d Bartlett’s Test)

• PPE Index 2 (Using Generalized Distance)

• PPE Index 3 (Using Singular Value Decomposition)

• Bartlett’s Test

• T.Cai et al Method

For the significance level of 0.05, n = 100 and m number of simulations,

the result using PPE is provided in Table 9.1 and the result using hPPE is

provided in Table 9.2.

Method p = 20, m = 1000 p = 50, m = 500 PPE Index1 90.4 89.2 PPE Index2 86.4 72.8 PPE Index3 99.9 96.4

Bartlett’s Test 89.6 86.4 T.Cai et al Test 99.8 95.4

Table 9.1: Empirical power in percents comparison for PPE

Method p = 20, m = 100 PPE Index1 86 PPE Index2 88 PPE Index3 100

Bartlett’s Test 74 T.Cai et al Test 99

Table 9.2: hPPE empirical power in percents comparison for hPPE

Table 9.1 shows that when using PPE algorithm, PPE Index 3, which in-

corporates the shape and orientation of the data using SVD, has the highest

power for data with 20 dimensions and 50 dimensions. However, T.Cai et

al’s method almost has as good power as PPE index 3. Table 9.2 shows that

for a data with 20 dimensions and 100 simulation, using hPPE, PPE index

3 captures all of the difference and T.Cai et al’s method did not capture 1.

Chapter 10

Discussion and Conclusion

Power analysis shows that PPE index 3 using SVD to incorporate data

shape and orientation performs the best in capturing the difference of the

covariance matrices between two high-dimensional samples for both PPE and

hPPE. This implies that PPE index using SVD works the best in conducting

the test of equal covariance between two groups for high dimensional data.

For data with 20 dimensions and 50 dimensions, PPE with PPE index 3 will

capture 99.9% and 96.4% of the difference in covariance matrices respectively.

Even though the number of simulation was 100, hPPE was able to capture

all of the different covariances. However, hPPE is designed to be applied

to much higher dimension and the number of simulations was 100 which

could not be enough to derive the relative and absolute performance of the

method. Therefore, further studies for higher dimension and higher number

of simulations could be our first future research goal using better hardware

environment.

Both PPE and hPPE are computationally intensive. Using a high perfor-

mance computing environment, under 36 hours of limitation, PPE was able

to conduct power analysis with 1000 simulations for 20 dimensions and 500

dimensions for 50 dimensions and hPPE was able conduct power analysis

with 100 simulations for 20 dimensions. Power analysis for higher dimen-

sions and more simulation iterations will require better hardware or a better

PPE algorithm. Both PPE and hPPE use Constrained Optimization by Lin-

ear Approximation (COBYLA) which is a local derivative free optimization

method for local optimization. However, the local optimization with the con-

straint of PPE and hPPE algorithm resulted in very intensive computation

based on the duration of the computation. This could be extended to general

local optimization problem which could be our second future research goal

to improve PPE and hPPE.

Appendix A

Appendix

In this Appendix, we will provide the proof of Proposition 1 in Section A.1,

proof of Lemma 1 in Section A.2, and proof of Proposition 2 in Section

A.3.

A.1 Proof of Proposition 1

Recall that when αi are observable, the log-likelihood function of our model

in (4.14) was

l(β,(σ2k)k=1,...,q, (ρkl)k 6=l=1,...,q,σ 2 0,φ)

( − m

2 ln(σ20 ) +

2 ln(1 −φ2) −

m∑ i=1

ln|Σ(i) � |−

1 −φ2

2σ20

m∑ i=1

r′ i (Σ(i)

� )−1ri

) +

( − N

2 ln|G|−

m∑ i=1

Ji∑ j=1

αi ′G−1αi

) + C

(A.1)

for some constant C and the covariance matrix of the random effects G is

identical for all subjects.

A.1.1 MLE for β

Take the partial derivative of the log-likelihood with respect to β

∂l

∂β =

∂

∂β

( −

1 −φ2

2σ20

m∑ i=1

r′i(Σ (i) � ) −1ri

)

= ∂

∂β

( −

1 −φ2

2σ20

m∑ i=1

(Yi −Xiβ −Ziαi)′(Σ(i)� ) −1(Yi −Xiβ −Ziαi)

)

= 1 −φ2

2σ20

m∑ i=1

X′i(Σ (i) � ) −1(Yi −Xiβ −Ziαi)

Setting ∂l ∂β

= 0 leads to

( m∑ i=1

X′i(Σ (i) � ) −1Xi

) β =

m∑ i=1

Xi(Σ (i) � ) −1(Yi −Ziαi)

Thus,

β̂MLE =

( m∑ i=1

X′i(Σ (i) � ) −1Xi

)−1( m∑ i=1

Xi(Σ (i) � ) −1(Yi −Ziαi)

) (A.2)

A.1.2 MLE for (σ2k)k=1,...,q and (ρkl)k 6=l=1,...q

To get MLE for (σ2k)k=1,...,q, take the partial derivative of the log-likelihood

with respect to σ2k.

∂l

∂σ2k =

∂

∂σ2k

( − N

2 ln|G|−

m∑ i=1

Ji∑ j=1

α′iG −1αi

)

= − N

2 tr

( G−1

∂G

∂σ2k

) −

m∑ i=1

Ji∑ j=1

α′i ∂G−1

∂σ2k αi

= − N

2 tr

( G−1

∂G

∂σ2k

) −

m∑ i=1

Ji∑ j=1

α′iG −1 ∂G

∂σ2k G−1αi

Setting ∂l ∂σ2k

= 0 leads to the following equation.

m∑ i=1

Ji∑ j=1

α′iG −1 ∂G

∂σ2k G−1αi = Ntr

( G−1

∂G

∂σ2k

) (A.3)

MLE of σ2k can be obtained by solving the equation (A.3) for σ 2 k.

For a model with q random effects, G is a q × q matrix and the inverse of

G needs to be calculated. Therefore, without loss of generality (WLOG), we

will consider a simple case with two random effects to derive the MLE of

(σ2k)k=1,2 and (ρkl)k 6=l=1,2. In the simple case, the random effects αi and the

covariance matrix of the random effects G can be defined as,

αi =

( αi1 αi2

) ,G =

( σ21 ρ12σ1σ2

ρ21σ2σ1 σ 2 2

)

where, ρ12 = ρ21 = ρ. So, to solve the equation (A.3) for σ 2 1 , we first need to

get G−1 and ∂G ∂σ21

as below.

G−1 = 1

σ21σ 2 2 (1 −ρ

( σ22 −ρσ1σ2

−ρσ2σ1 σ21

)

∂G

∂σ21 =

 

∂σ21 ∂σ21

∂ρσ1σ2 ∂σ21

∂ρσ2σ1 ∂σ21

∂σ22 ∂σ21

  =

( 1

ρσ2 2σ1ρσ2

2σ1 0

)

where

∂ρσ1σ2

∂σ21 = ρσ2

∂(σ21 ) 1 2

∂σ21 = ρσ2 2

(σ21 ) −1

2 = ρσ2 2σ1

Then by multiplying G−1 with ∂G ∂σ21

, we get,

G−1 ∂G

∂σ21 =

σ21σ 2 2 (1 −ρ

( σ22 −ρσ1σ2

−ρσ2σ1 σ21

)( 1

ρσ2 2σ1ρσ2

2σ1 0

)

= 1

σ21σ 2 2 (1 −ρ

 σ22 − ρ22 σ22 ρσ322σ1 −ρ 2 σ1σ2 −

ρ2

2 σ22

 

By multiplying G−1 again we get,

G−1 ∂G

∂σ21 G−1 =

σ41σ 4 2 (1 −ρ

2)2

  σ42 −ρ2σ42 ρ32 σ1σ32 − ρ2σ1σ32 ρ3

2 σ1σ

3 2 −

ρ 2 σ1σ

3 2 0

 

Using these calculations, we can now solve the equation (A.3) as below.

(A.3)LHS = m∑ i=1

Ji∑ j=1

α′iG −1 ∂G

∂σ21 G−1αi =

σ41σ 4 2 (1 −ρ

2)2

m∑ i=1

Ji(σ2α 2 i1 −ρσ1αi1αi2)

(A.3)RHS = Ntr

( G−1

∂G

∂σ21

) = N

σ21

Setting (A.3)LHS = (A.3)RHS, when σ21 6= 0, σ22 6= 0, and ρ2 6= 1, we get

Nσ21σ2(1 −ρ 2) = σ2

m∑ i=1

Jiα 2 i1 −ρσ1

m∑ i=1

Jiαi1αi2 (A.4)

Similarly, from ∂l ∂σ22

we get,

Nσ1σ 2 2 (1 −ρ

2) = σ1

m∑ i=1

Jiα 2 i2 −ρσ2

m∑ i=1

Jiαi1αi2 (A.5)

To get MLE for ρ we need to take the partial derivative of the log-likelihood

with respect to ρ and solve equation (A.6) for ρ,

m∑ i=1

Ji∑ j=1

α′iG −1 ∂G

∂ρ G−1αi = Ntr

( G−1

∂G

∂ρ

) (A.6)

To solve (A.6), we need to calculate ∂G ∂ρ

in addition to what we already have.

∂G

∂ρ =

  ∂σ

2 1

∂ρ ∂ρσ1σ2 ∂ρ

∂ρσ2σ1 ∂ρ

∂σ22 ∂ρ

  = ( 0 σ1σ2

σ2σ1 0

)

Then by multiplying G−1 with ∂G ∂ρ

, we get,

G−1 ∂G

∂ρ =

σ21σ 2 2 (1 −ρ

( σ22 −ρσ1σ2

−ρσ2σ1 σ21

)( 0 σ1σ2

σ2σ1 0

)

= 1

σ21σ 2 2 (1 −ρ

( −ρσ21σ22 σ1σ32 σ31σ2 −ρσ21σ22

)

G−1 ∂G

∂ρ G−1 =

σ21σ 2 2 (1 −ρ

2)2

( −2ρσ22 (1 + ρ2)σ1σ2

(1 + ρ2)σ1σ2 −2ρσ21

)

Using these calculation, we can solve the equation (A.6) as below.

(A.6)LHS = m∑ i=1

Ji∑ j=1

α′iG −1 ∂G

∂ρ G−1αi =

σ21σ 2 2 (1 −ρ

2)2

m∑ i=1

Ji(−2ρ(σ22α 2 i1 + σ

2 1α

2 i2) + 2(1 + ρ

2)σ1σ2αi1αi2)

(A.6)RHS = Ntr

( G−1

∂G

∂ρ

) = −

2Nρ

1 −ρ2

Setting (A.6)LHS = (A.6)RHS, when ρ2 6= 1, we get

− 2Nρσ21σ 2 2 (1 −ρ

2) = −2ρ ( σ22

m∑ i=1

Jiα 2 i1 + σ

2 1

m∑ i=1

Jiα 2 i2

) + 2(1 + ρ2)σ1σ2

m∑ i=1

Jiαi1αi2

(A.7)

Denote A, B, and C as below,

A ∆ ≡

m∑ i=1

Jiα 2 i1, B

∆ ≡

m∑ i=1

Jiα 2 i2, C

∆ ≡

m∑ i=1

Jiαi1αi2

Then we can rewrite (A.4), (A.5), and (A.7) as,

Nσ21σ2(1 −ρ 2) = Aσ2 −Cρσ1 (A.4-1)

Nσ1σ 2 2 (1 −ρ

2) = Bσ1 −Cρσ2 (A.5-1)

− 2Nρσ21σ 2 2 (1 −ρ

2) = −2ρσ22A− 2ρσ 2 1B + 2(1 + ρ

2)σ1σ2C (A.7-1)

By solving (A.4-1), (A.5-1), and (A.7-1), we get,

σ̂21MLE = 1

m∑ i=1

Jiα 2 i1 (A.8)

σ̂22MLE = 1

m∑ i=1

Jiα 2 i2 (A.9)

ρ̂MLE =

m∑ i=1

Jiαi1αi2√√√√( 1 N

m∑ i=1

Jiα 2 i1

)√√√√( 1 N

m∑ i=1

Jiα 2 i2

) (A.10)

This can be easily extended to q number of random effects when the random

effect αi for subject i and the covariance matrix of the random effects G is

defined as,

αi =

  αi1 αi2 ... ... αqi

  ,G =

 

σ21 ρ12σ1σ2 ρ13σ1σ3 · · · ρ1qσ1σq ρ21σ2σ1 σ

2 2 ρ23σ2σ3 · · · ρ2qσ2σq

... ... σ23 · · ·

... ...

. . . ...

ρq1σqσ1 ρq2σqσ2 ρq3σqσ3 · · · σ2q

 

Then the MLE of σk where k = 1, ...,q and ρkl where k 6= l = 1, ...,q can be

defined as,

σ̂2k = 1

m∑ i=1

Jiα 2 ik

ρ̂kl =

m∑ i=1

Jiαikαil√√√√( 1 N

m∑ i=1

Jiα 2 ik

)√√√√( 1 N

m∑ i=1

Jiα 2 il

)

A.1.3 MLE for σ20

First, we take partial derivative of the log-likelihood function with respect to

σ20 and get

∂l

∂σ20 =

∂

∂σ20

( − m

2 ln(σ20 ) −

1 −φ2

2σ20

m∑ i=1

ri ′(Σ(i)� )

−1ri

)

= − m

2σ20 +

1 −φ2

2σ40

m∑ i=1

ri ′(Σ(i)� )

−1ri

where ri = Yi −Xiβ̂−Ziα̂i. Setting ∂l ∂σ20

= 0 leads to

2σ20 =

1 −φ2

2σ40

m∑ i=1

ri ′(Σ(i)� )

−1ri

Then solving this equation for σ20 when σ 2 0 6= 0 will result in,

σ̂20 = 1 −φ2

m∑ i=1

ri ′(Σ(i)� )

−1ri (A.11)

To get the MLE of σ20 from (A.11) requires calculation of (Σ (i) � ) −1 which can

be obtained using Cholesky decomposition. In case for 4 repeated measures

(Ji = 4),

Σ(i)� =

 

1 φ φ2 φ3

φ 1 φ φ2

φ2 φ 1 φ φ3 φ2 φ 1

  (Σ(i)� )−1 = 11 −φ2

 

1 −φ 0 0 −φ 1 + φ2 −φ 0 0 −φ 1 + φ2 −φ 0 0 −φ 1

 

and with 5 repeated measures (Ji = 5),

Σ(i)� =

 

1 φ φ2 φ3 φ4

φ 1 φ φ2 φ3

φ2 φ 1 φ φ2

φ3 φ2 φ 1 φ

  (Σ(i)� )−1 = 11 −φ2

 

1 −φ 0 0 0 −φ 1 + φ2 −φ 0 0 0 −φ 1 + φ2 −φ 0 0 0 −φ 1 + φ2 −φ 0 0 0 −φ 1

 

Because (A.11) has two unknown variable sigma20 and φ, we need to derive

the equation for φ to get the MLE of σ20 .

A.1.4 MLE for φ

First we take the partial derivative of the log-likelihood with respect to φ

and get

∂l

∂φ =

∂

∂φ

( m

2 ln(1 −φ2) −

m∑ i=1

ln|Σ(i) � |−

1 −φ2

2σ20

m∑ i=1

ri ′(Σ(i)

� )−1ri

)

= − 2φ

1 −φ2 −

m∑ i=1

( (Σ(i)

� )−1

∂Σ(i) �

∂φ

) +

2σ20

m∑ i=1

ri ′(Σ(i)

� )−1

∂Σ(i) �

∂φ (Σ(i)

� )−1ri

Setting ∂l ∂φ

= 0 leads to

4φσ20 1 −φ2

+ σ20

m∑ i=1

( (Σ(i)

� )−1

∂Σ(i) �

∂φ

) =

m∑ i=1

ri ′(Σ(i)

� )−1

∂Σ(i) �

∂φ (Σ(i)

� )−1ri

(A.12)

and solve (A.12) for φ. We will solve the equation for Ji = 4 and Ji = 5 and

find the general solution. After a careful calculation of tr

( (Σ(i)

� )−1 ∂Σ

(i) �

∂φ

)

and ri ′(Σ(i)

� )−1 ∂Σ

(i) �

∂φ (Σ(i)

� )−1ri for Ji = 4 and Ji = 5, we got,

(A.12)LHS (Ji=4)

= 4φσ20

1 −φ2 +

m∑ i=1

−σ20φ 1 −φ2

× (1 + 2 + 2 + 1)

(A.12)LHS (Ji=5)

= 4φσ20

1 −φ2 +

m∑ i=1

−σ20φ 1 −φ2

× (1 + 2 + 2 + 2 + 1)

Therefore, in general,

(A.12)LHS = φσ20

1 −φ2

( 4 −

m∑ i=1

(2 + 2 × (Ji − 2)) )

= −2φσ20 1 −φ2

(N − (m− 2))

For Ji = 4, let ri = (ri1,ri2,ri3,ri4) and for Ji = 5, let ri = (ri1,ri2,ri3,ri4,ri5)

then

(A.12)RHS (Ji=4)

= 1

(1 −φ2)2 m∑ i=1

((−2φr2i1 − 4φr 2 i2 − 4φr

2 i3 − 2φr

2 i4)

+ 2(1 + φ2)(ri1ri2 + ri2ri3 + ri3ri4))

(A.12)RHS (Ji=5)

= 1

(1 −φ2)2 m∑ i=1

((−2φr2i1 − 4φr 2 i2 − 4φr

2 i3 − 4φr

2 i4 − 2φr

2 i5)

+ 2(1 + φ2)(ri1ri2 + ri2ri3 + ri3ri4 + ri4ri5))

Therefore, in general,

(A.12)RHS = 1

(1 −φ2)2 m∑ i=1

[ − 2φ(r2i1 + r

2 iJi

+ 2

Ji−1∑ j=2

r2ij) + 2(1 + φ 2)

Ji−1∑ k=1

rikri(k+1)

]

Setting (A.12)LHS = (A.12)RHS, we get,

σ20 (N − (m− 2))φ 3 −Cφ2 + (D −σ20 (N − (m− 2)))φ−C = 0 (A.13)

when φ2 6= 1 and denoting C and D as,

C ∆ ≡

m∑ i=1

Ji−1∑ k=1

rikri(k+1) D ∆ ≡

m∑ i=1

( r2i1 + r

2 iJi

+ 2

Ji−1∑ j=2

r2ij

)

By solving (A.11) and (A.12) we can obtain the MLE of σ20 and φ.

100

A.2 Proof of Lemma 1

Let p(·) denote the probability density of its argument(s). Denote Y =

(Y1, ...,Ym) and Y(−i) = (Y1, ...,Y(i−1),Y(i+1)Ym). If (αi,Yi) for i = 1, ...m are

independent to each other, then,

p(αi|Y ) = p(αi|Yi,Y(−i)) = p(αi,Yi,Y(−i))

p(Yi,Y(−i))

= p(αi,Yi)p(Y(−i))

p(Yi)p(Y(−i)) = p(αi|Yi)

(A.14)

A.3 Proof of Proposition 2

When the joint distribution of Y and α follow Normal as,( α Y

) ∼ N

(( 0 Xβ

) ,

( var(α) cov(α,Y )

cov(Y ,α) var(Y )

)) (A.15)

Then the conditional distribution of α given Y follows Normal distribution

as,

α|Y ∼ N(E(α) + cov(α,Y )var(Y )−1(Y −Xβ),

var(α) − cov(α,Y )var(Y )−1cov(Y ,α)) (A.16)

and,

E(α) = 0,

var(Y ) = var(Xβ + Zα + �) = var(Zα) + var(�) = Zvar(α)Z′ + var(�)

= ZGZ′ + R (A.17)

where R = σ20

1 −φ2 Σ� and Σ� =

 

Σ� (1) 0 0 · · · 0

0 Σ� (2) 0 · · · 0

0 0 Σ� (3) · · · 0

... ...

. . . . . .

...

0 · · · · · · 0 Σ�(m)

 

cov(α,Y ) = cov(α,Xβ + Zα + �) = cov(α,Zα) = var(α)Z′ = GZ′

(A.18)

101

Since Gi = G for i = 1, ...,m, using (A.17) and (A.18), we can get

E(αi|Yi) = E(αi) + cov(αi,Yi)var(Yi)−1(Yi −Xiβ)

= 0 + GiZi ′(ZiGiZi

′ + Ri) −1(Yi −Xiβ)

= GZi ′(ZiGZi

′ + Ri) −1(Yi −Xiβ) (A.19)

var(αi|Yi) = var(αi) − cov(αi,Yi)var(Yi)−1cov(Yi,αi)

= Gi −GiZi′(ZiGiZi′ + Ri)−1ZiGi

= G−GZi′(ZiGZi′ + Ri)−1ZiG (A.20)

E(αi|Yi)E(αi|Yi)′ = GZi′(ZiGZi′ + Ri)−1(Yi −Xiβ)

× (Yi −Xiβ)′(ZiGZi′ + Ri)−1ZiG (A.21)

Thus,

E(αiαi ′|Yi) = var(αi|Yi) + E(αi|Yi)E(αi|Yi)′

= (G−GZi′(ZiGZi′ + Ri)−1ZiG)

+ GZi ′(ZiGZi

′ + Ri) −1(Yi −Xiβ)(Yi −Xiβ)′(ZiGZi′ + Ri)−1ZiG

(A.22)

where, G = G(σ2k,ρkl) and Ri = Ri(σ 2 0,φ).

102

Bibliography

[1] Anderson, T. An Introduction to Multivariate Statistical Analysis,

3rd ed. Wiley Series in Probability and Statistics, 2003.

[2] Bartlett, M. Properties of sufficiency and statistical tests. Proceed-

ings of the Royal Society A: Mathematical, Physical and Engineering

Sciences 160(901) (1937), 268–282.

[3] Bishop, C. Pattern Recognition and Machine Learning. Springer, 2006.

[4] Cai, T., Liu, W., and Xia, Y. Two-sample covariance matrix testing

and support recovery in high-dimensional and sparse settings. Journal

of American Statistical Association 108(501) (2013), 265–277.

[5] Calinski, T., and Harabasz, J. A dendrite method for cluster

analysis. Communications in Statistics 3 (1974), 1–27.

[6] Cenik, C., E.S. Cenik, a. G. B., Grubert, F., Candille, S.,

Spacek, D., Alsallakh, B., Tilgner, H., Araya, C., Tang, H.,

Ricci, E., and Snyder, M. Integrative analysis of rna, translation,

and protein levels reveals distinct regulatory variation across humans.

Genome Research 25 (2015), 1610–1621.

[7] Davies, D., and Bouldin, D. A cluster separation measure. IEEE

Transactions on Pattern Analysis and Machine Intelligence 1 (1979),

224–227.

[8] Department of Veteran Affairs. Specialty care access network-

extension for community healthcare outcomes (SCAN-ECHO) fact

sheet.

[9] Faraway, J. J. Extending the Linear Model with R. Chapman and

Hall/CRC, 2006.

103

[10] Firpoa, S., A.F.Galvaob, and Song, S. Measurement errors in

quantile regression models. Journal of Econometrics 198(1) (2017), 146–

164.

[11] Fisher, R. The correlation between relatives on the supposition of

mendelian inheritance. Transactions of the Royal Society of Edinburgh

52(2) (1918), 399–433.

[12] Friedman, J., and Tukey, J. A projection pursuit algorithm for

exploratory data analysis. IEEE trans. comput. C-23 (1974), 881–889.

[13] Gao, K., and Owen, A. Efficient moment calculations for variance

components in large unbalanced crossed random effects models. Elec-

tronic Journal of Statistics 11(1) (2017), 1235–1296.

[14] Gebregziabher, M., Egede, L., Gilbert, G., Hunt, K., Ni-

etert, P., and Mauldin, P. Fitting parametric random effects mod-

els in very large data sets with application to vha national data. BMC

Medical Research Methodology 12 (2012), 163–177.

[15] Genolini, C., and Falissard, B. Kml: A package to cluster lon-

gitudinal data. Computer Methods and Programs in Biomedicine 104

(2011), e112–e121.

[16] Genolini, C., and Falissard, B. Package ‘kml’, 2016.

[17] Guide, S. . U. The hpmixed procedure, 2013.

[18] Haarhoff, P., and Buys, J. D. A new method for the optimization

of a nonlinear function subject to nonlinear constraints. The Computer

Journal 13(2) (1970), 178184.

[19] Hartley, H., and Rao, J. Maximum-likelihood estimation for the

mixed analysis of variance model. Biometrika 54(1-2) (1967), 93–108.

104

[20] Henderson, C., Kempthorne, O., Searle, S., and von

Krosigk, C. The estimation of environmental and genetic trends from

records subject to culling. Biometrics 15(2) (1959), 192–218.

[21] Jackson, C., N, B., and Richardson, S. Hierarchical related re-

gression for combining aggregate and individual data in studies of socio-

economic disease risk factors. J R Stat Soc, Series A 171 (2008), 159178.

[22] Jiang, J. Reml estimation: asymptotic behavior and related topics.

The Annals of Statistics 24(1) (1996), 255–286.

[23] Jiang, J. Linear and Generalized Linear Mixed Models and Their Ap-

plications. Springer, 2007.

[24] Johnson, D., and Thompson, R. Restricted maximum likelihood

estimation of variance components for univariate animal models using

sparse matrix techniques and average information. Journal of Dairy

Science 78 (1995), 449456.

[25] Johnson, R., and Wichern, D. Applied Multivariate Statistical

Analysis 6th Edition. Pearson.

[26] Kackar, R., and Harville, D. Approximations for standard errors

of estimators of fixed and random effects in mixed linear models. Journal

of American Statistical Association 79(388) (1984), 853–862.

[27] Koonce, T., Giuse, N., Kusnoor, S., Hurley, S., and Ye, F.

A personalized approach to deliver health care information to diabetic

patients in community care clinics. Journal of the Medical Library As-

sociation 103(3) (2015), 123130.

[28] Laird, N., and Ware, J. Random-effects models for longitudinal

data. Biometrics 38(4) (1982), 963–974.

105

[29] Lindstrom, M., and Bates, D. Newton-raphson and em algorithms

for linear mixed-effects models for repeated measures data. Journal of

the American Statistical Association 83(404) (1988), 1014–1022.

[30] Lucio-Eceiza, E., and González-Rouco, J. Quality control of

surface wind observations in northeastern north america. part ii: Mea-

surement errors. Journal of Atmospheric and Oceanic Technology 35(1)

(2018), 163182.

[31] Miller, R. Big data curation. 20th International Conference on Man-

agement of Data (COMAD) 2014, Hyderabad, India, 2014.

[32] Mordecai, A. Nonlinear Programming: Analysis and Methods. Dover

Publishing, 2003.

[33] of Diabetes, N. I., Digestive, and Diseases, K. The a1c test &

diabetes, 2014.

[34] Park, M., and Sun, J. Tests in projection pursuit regression. Journal

of Statistical Planning and Inference 75(1) (1998), 65–90.

[35] Patterson, H., and Thompson, R. Recovery of inter-block infor-

mation when block sizes are unequal. Biometrika 58(3) (1971), 545–554.

[36] Pennell, M., and Dunson, D. Fitting semiparametric random ef-

fects models to large data sets. Biostatistics 8(4) (2007), 821834.

[37] Pfanzagl, J. Parametric Statistical Theory. Walter de Gruyter and

Co., 1991.

[38] Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., Heis-

terkamp, S., and Willigen, B. V. Package nlme, 2018.

[39] Powell, M. Direct search algorithms for optimization calculations.

Acta Numerica 7 (1998), 287–336.

106

[40] Ray, S., and Turi, R. Determination of number of clusters in k-means

clustering and application in colour image segmentation. in: Proceedings

of the 4th International Conference on Advances in Pattern Recognition

and Digital Techniques (ICAPRDT99) (1999), 137–143.

[41] Robinson, G. That blup is a good thing: The estimation of random

effects. Statistical Science 6(1) (1991), 15–32.

[42] SAS. SAS/STAT(R) 9.2 User’s Guide, Second Edition PROC MIXED,

2010.

[43] Shim, Y., Chung, J., and Choi, I. A comparison study of clus-

ter validity indices using a nonhierarchical clustering algorithm. in:

Proceedings of CIMCA-IAWTIC05-Volume 01, IEEE Computer Society

(2005), 199–204.

[44] Shumway, R. Applied Statistical Time Series Analysis. PRENTICE

HALL SERIES IN STATISTICS, 1988.

[45] Snedecor, G., and Cochran, W. Statistical Methods, 8th ed. Iowa

State University Press, 1989.

[46] Stanleym, T., and Jarrell, S. Meta-regression analysis: A quanti-

tative method of literature surveys. Journal of Economic Surveys 19(3)

(1989), 299–308.

[47] Sun, J. Significance levels in exploratory projection pursuit.

Biometrika, 78(4) (1991), 759–769.

[48] Sun, J. Projection Pursuit: Encyclopedia of Statistical Sciences. John

Wiley and Sons, Inc, 2006.

[49] Watts, S., L.Roush, Julius, M., and Sood, A. Improved glycemic

control in veterans with poorly controlled diabetes mellitus using a spe-

107

cialty care access network-extension for community healthcare outcomes

model at primary care clinics. Journal of Telemedicine and Telecare

22(4) (2015), 221–224.

[50] Watts, S., Strauss, G., Pascuzzi, K., ODay, M., Young, K.,

Aron, D., and Kirsh, S. Shared medical appointments for patients

with diabetes: Glycemic reduction in high-risk patients. Journal of the

American Association of Nurse Practitioners 17(8) (2015), 450456.

[51] Xu, Y., Sun, J., Carter, R., and Bogie, K. Personalized pre-

diction of chronic wound healing: An exponential mixed effects model

using stereophotogrammetric measurement. Journal of Tissue Viability

23(2) (2014), 48–59.

[52] Y.Wu, Ma, Y., and Yin, G. Smoothed and corrected score ap-

proach to censored quantile regression with measurement errors. JASA

110(512) (2016), 1670–1683.