Principal Components and Factor Analysis using python code in 40 hours

Hussain2018

Assignment #7 Jake Rifkin

Introduction:  

In this report, exploratory factor analysis is performed on stock portfolio return data. The goal is to explore and understand the common factors shared by the stocks in the portfolio by examining their correlation structure. Rotations are used to increase the interpretability of the factor solutions. The data set is pruned to four key sectors (banking, oil field services, oil refining, and industrial chemicals), so we would ideally expect exploratory factor analysis to identify four common factors. Per the instructions of the assignment, daily returns are computed as log-returns, log(P_t / P_{t-1}), which brings their distribution closer to normal. This matters because factor analysis, maximum likelihood estimation in particular, assumes that the observed variables follow a multivariate normal distribution.
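The log-return calculation can be sketched in Python as follows. This is a minimal illustration rather than the code used for the report (the analysis itself was run in SAS); the function name and the price series are hypothetical.

```python
import math

def log_returns(prices):
    """Return log(P_t / P_{t-1}) for each pair of consecutive prices.

    The first observation has no previous price, so the result is one
    element shorter than the input.
    """
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

# Hypothetical closing prices, for illustration only
prices = [100.0, 102.0, 101.0, 105.0]
returns = log_returns(prices)
```

A convenient property of log-returns is that they add up over time: the sum of the daily log-returns equals the log of the last price divided by the first.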

Results:  

Principal Factor Analysis without Rotation

The first exploratory factor analysis is fit using the principal factor method, with prior communalities set to the squared multiple correlations (SMC) and no a priori specification of the number of factors. SAS chooses the number of factors for us from the eigenvalues, using the cumulative proportion of variance they explain to find the point of diminishing returns. Visually, this corresponds to the elbow in the scree plot.
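To make the SMC prior concrete: the squared multiple correlation of variable i with all other variables can be computed from the inverse of the correlation matrix R as 1 - 1/(R^-1)_ii. The Python sketch below illustrates that formula under stated assumptions; the helper names are hypothetical, the matrix is made up, and the report's own priors came from SAS.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I]
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right half now holds the inverse
    return [row[n:] for row in aug]

def smc_priors(corr):
    """Squared multiple correlation of each variable with all the others."""
    inv = invert(corr)
    return [1.0 - 1.0 / inv[i][i] for i in range(len(corr))]

# Hypothetical two-variable correlation matrix, for illustration only
corr = [[1.0, 0.6],
        [0.6, 1.0]]
priors = smc_priors(corr)
```

With only two variables the SMC reduces to the squared correlation, so both priors here are 0.6 squared.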

 

The first factor analysis retains 2 factors, fewer than expected for a data set that contains four sectors. This method of factor selection is not immune to Heywood cases, as indicated by the negative eigenvalues, which lead to some mathematical violations. In particular, the first two factors explain a cumulative proportion of 1.0101 of the variance. It is odd to consider that more than 100% of the variance could be explained, but because the negative eigenvalues shrink the total in the denominator, the leading eigenvalues can sum to more than that total. Given that Heywood cases are present, we should not proceed with this model. The factor loadings produced by this first analysis are presented below. A simple structure is not present, and interpretation of these factors is difficult. The loadings on the first factor are all positive and vary little across stocks. The loadings on the second factor vary more, include many negative values, and none exceeds 0.4 in absolute value.
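The arithmetic behind a cumulative proportion above 1 can be sketched in Python. The eigenvalues below are hypothetical (they are not the values SAS reported), and the retention rule, keeping factors until the cumulative proportion reaches a threshold of 1.0, is an assumption made for illustration.

```python
def cumulative_proportions(eigenvalues):
    """Cumulative share of the eigenvalue total, largest eigenvalue first."""
    total = sum(eigenvalues)
    running = 0.0
    shares = []
    for value in sorted(eigenvalues, reverse=True):
        running += value
        shares.append(running / total)
    return shares

def factors_retained(eigenvalues, threshold=1.0):
    """Smallest number of factors whose cumulative share reaches the threshold."""
    for count, share in enumerate(cumulative_proportions(eigenvalues), start=1):
        if share >= threshold:
            return count
    return len(eigenvalues)

# Hypothetical eigenvalues of a reduced correlation matrix; the negative
# values make the total smaller than the sum of the positive ones.
eigenvalues = [3.0, 1.0, 0.1, -0.05, -0.1]
shares = cumulative_proportions(eigenvalues)
n_factors = factors_retained(eigenvalues)  # 2 for these values
```

Here the first two eigenvalues sum to 4.0 while the total is only 3.95, so the cumulative proportion passes 1 after two factors and only falls back to exactly 1 once the negative eigenvalues are added in.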

 

While exploratory factor analysis was expected to identify the sectors as factors, plotting the two sets of factor loadings against each other shows that stocks from the same sector cluster together in the coordinate plane. As there is little variation in the first factor, all returns fall relatively close together on the horizontal axis. The vertical axis is what creates the separation between the clusters. The bottom three returns (BHI, HAL, and SLB) are all oil field services; the next three (XOM, CVX, and HES) all belong to the oil refining sector. All oil-related stocks have negative loadings on the second factor. The top right quadrant of the graph contains the remaining two sectors, banking and industrial chemicals, which are also clustered closely together.

 

Principal Factor Analysis with Orthogonal Rotation

The second principal factor analysis model adds an orthogonal varimax rotation. As we have not specified the number of factors a priori, SAS performs the same eigenvalue calculation as in the first model and thus retains the same number of factors. Specifying a rotation did not help SAS generate additional factors, as the rotation step occurs after the model is fit. The same issue with Heywood cases is still present in this version of the model. Before the rotation, SAS produces the same factor loadings as the previous model. After the varimax rotation, the variance explained is split much more evenly between the two factors than in the first model. A simple structure has still not been achieved, as there is a lack of zeros in our factor loading matrix. The variables that changed the most are the three oil field service stocks (SLB, HAL, and BHI). The rotational change is most visible in the plot of the two sets of factor loadings. In the plot, the sector clusters remain intact, but the rotation has moved all of the variables into the top right quadrant.
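For two factors, an orthogonal rotation is just a rotation of the loading plot by a single angle, which is why the clusters keep their shape while moving to a new quadrant. The Python sketch below searches a grid of angles for the one that maximizes the raw varimax criterion. It is an illustration under stated assumptions: the loadings are hypothetical, the grid search stands in for the iterative algorithm a statistics package would use, and Kaiser normalization is omitted.

```python
import math

def rotate(loadings, theta):
    """Rotate two-factor loadings by the angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(a * c + b * s, -a * s + b * c) for a, b in loadings]

def varimax_criterion(loadings):
    """Sum over both factors of the variance of the squared loadings."""
    p = len(loadings)
    total = 0.0
    for j in range(2):
        squared = [row[j] ** 2 for row in loadings]
        mean = sum(squared) / p
        total += sum((v - mean) ** 2 for v in squared) / p
    return total

def varimax_2d(loadings, steps=3600):
    """Grid-search the angle in [0, pi/2) that maximizes the criterion."""
    angles = (k * (math.pi / 2) / steps for k in range(steps))
    best = max(angles, key=lambda t: varimax_criterion(rotate(loadings, t)))
    return rotate(loadings, best), best

# Hypothetical unrotated loadings: flat first factor, bipolar second factor
loadings = [(0.70, 0.30), (0.60, 0.35), (0.65, -0.30), (0.60, -0.35)]
rotated, theta = varimax_2d(loadings)
```

The rotation changes how each variable's variance is divided between the two factors but leaves every communality (the row sum of squared loadings) unchanged, so the fit of the model is the same before and after rotation.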

 

As for the interpretability of the factor loadings, the rotation adds some clarity about what the factors represent. The first factor has larger loadings for the banking stocks (BAC, JPM, and WFC), while the second factor has high positive loadings for the oil field services stocks and moderately high loadings for the oil refining stocks. The interpretation may be a bit clearer, but 2 factors are still fewer than the four sectors we’d expect, and we have not yet achieved a simple structure. That, coupled with the presence of Heywood cases in the estimation of the factors, means we should continue our search for an acceptable factor analysis.
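One rough way to quantify "lack of zeros" is to measure the share of loadings that are close to zero. The Python sketch below does this for a hypothetical loading matrix; the 0.1 cutoff is an arbitrary assumption, not a value taken from the assignment or from SAS.

```python
def near_zero_share(loadings, cutoff=0.1):
    """Fraction of loadings whose absolute value is below the cutoff."""
    flat = [abs(value) for row in loadings for value in row]
    return sum(value < cutoff for value in flat) / len(flat)

# Hypothetical rotated loadings (rows are stocks, columns are factors)
loadings = [(0.72, 0.05),
            (0.68, 0.31),
            (0.08, 0.75),
            (0.35, 0.60)]
share = near_zero_share(loadings)  # 0.25 for these values
```

A loading matrix with simple structure would have roughly one large loading per row and near-zero values elsewhere, so for two factors the share would approach one half.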

 

Maximum Likelihood Estimation Factor Analysis with Varimax Rotation

The next model uses maximum likelihood estimation (MLE) to perform factor analysis with a varimax rotation. As with the previous two models, the number of factors is not provided to the model a priori, and the MLE process also concludes that 2 factors should be retained. Unlike the previous two models, MLE provides a formal statistical test that can aid in deciding how many factors to retain.

 

The test uses a chi-square distribution and has the null hypothesis that there are no common factors, with the alternative hypothesis that there is at least one common factor. If the test is significant at our predetermined alpha level, we reject the null hypothesis, conclude that at least one common factor exists, and iterate by increasing the number of common factors in the null hypothesis until we reach a result that is not significant. Using MLE as the estimation method has other major benefits, including the ability to formally compare factor models using criteria such as AIC or BIC, since the likelihood function allows these direct statistical comparisons (provided all assumptions are met). The rotated factor loadings produced by MLE are slightly different from, but very similar to, those produced by our second model, so the level of interpretability is also similar. A simple structure is still not present, as there are no zero or near-zero values in the loadings. It is reassuring to see that different techniques produce similar results. While a formal bootstrap or cross-validation method should be employed to ensure that the factor loadings are not specific to this particular sample, at the very least we’re seeing consistent results from different processes.
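The bookkeeping behind these comparisons can be sketched in Python. For p observed variables and m orthogonal factors, the model has pm + p - m(m-1)/2 free parameters, and the chi-square test of "m factors are sufficient" has ((p-m)^2 - (p+m))/2 degrees of freedom. The AIC and BIC functions below use the generic textbook definitions, which is an assumption: the exact form SAS reports may differ, and no likelihood values from the report are used.

```python
import math

def free_parameters(p, m):
    """Free parameters of an m-factor model for p variables: loadings plus
    uniquenesses, minus the rotational constraints."""
    return p * m + p - m * (m - 1) // 2

def lr_test_df(p, m):
    """Degrees of freedom for the likelihood ratio test of m factors."""
    return ((p - m) ** 2 - (p + m)) // 2

def aic(log_likelihood, k):
    """Generic Akaike information criterion for k free parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Generic Bayesian information criterion for k parameters, n observations."""
    return k * math.log(n) - 2 * log_likelihood

# The report analyzes 12 return series
p = 12
df_two_factors = lr_test_df(p, 2)   # 43
df_five_factors = lr_test_df(p, 5)  # 16
```

Each added factor uses up degrees of freedom, so both criteria penalize the larger model; BIC penalizes it more heavily once the sample has more than about eight observations, because log(n) then exceeds 2.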

 

Maximum Likelihood Estimation Factor Analysis with Rotation and Max Prior

The final factor analysis model in this report is another maximum likelihood estimation factor analysis with the orthogonal varimax rotation, but with a different prior communality estimate. This approach uses the largest absolute correlation of a variable with any other variable as the communality estimate for that variable. Using these priors, SAS suggests that five factors should be included in the model, based on the same iterative statistical testing framework as the previous model. The large difference in the number of factors retained under different prior communality estimates suggests that the prior communality has a large effect on the factor analysis.
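The max prior described above is simple to compute by hand. The Python sketch below takes a correlation matrix and returns, for each variable, its largest absolute correlation with any other variable; the function name and the matrix are hypothetical and serve only as an illustration.

```python
def max_abs_corr_priors(corr):
    """For each variable, the largest absolute off-diagonal correlation."""
    n = len(corr)
    return [max(abs(corr[i][j]) for j in range(n) if j != i)
            for i in range(n)]

# Hypothetical correlation matrix for three variables
corr = [[1.0, 0.8, -0.3],
        [0.8, 1.0, 0.5],
        [-0.3, 0.5, 1.0]]
priors = max_abs_corr_priors(corr)  # [0.8, 0.8, 0.5]
```

Because this prior depends on a single pairwise correlation rather than on the regression of each variable on all the others, it can differ noticeably from the SMC estimate for the same data.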

Looking at the eigenvalues produced by this model, there are fewer Heywood cases than in the previous models, but Heywood cases are still present. Because of this, the model is potentially misspecified and should not be used. It is difficult to judge whether the five-factor model fits the data better than the two-factor model. In terms of interpretability, the five-factor model’s first four factors align with our expectation of one factor per sector. Note that we still have not reached a simple structure, as indicated by the lack of zeros in the loadings matrix. The fifth factor is especially puzzling and difficult to interpret. The absolute values of its loadings are all below 0.25, and its three largest loadings are spread across our sectors.

 

Conclusions:  

In this report, we fit four different factor analysis models: two using principal factor analysis and two using maximum likelihood estimation. None of our models perfectly aligned with our intuitive understanding of the data being split into four sectors, though all models hinted at the four sectors, either through clusters in the loading plots or through the individual factors. All of our models had Heywood cases present, meaning that these models should not be used. Perhaps these Heywood cases indicate that our data do not align with the strict assumptions of the factor analysis models. Rotations helped increase the interpretability of our models, but we were unable to find a simple structure in any of them. Perhaps if we had used an oblique rotation and relaxed the orthogonality requirement enforced by varimax rotations, we could have found a simpler structure, but the factors would then be correlated with one another. Additionally, the transformation of the returns to the log scale may have helped our data fit the model, though we never verified this assumption in this report. Overall, factor analysis is an interesting way to try to understand how the data are correlated, though it mostly confirmed knowledge that we already had and that was readily accessible by simply researching the companies. It seems that a great deal more time would be required to fiddle with the knobs of this model to achieve a simple structure that is interpretable and informative for our use case. Given the Heywood cases, I would not suggest using any of the models developed in this report.

Code:  

libname mydata "/scs/wtm926/" access=readonly;

/* 1: Read the portfolio data, drop unused tickers, compute log-returns */
data temp;
  set mydata.stock_portfolio_data;
  drop AA HON MMM DPS KO PEP MPC GS;
run;

proc sort data=temp; by date; run;

data temp;
  set temp;
  * Compute the log-returns;
  return_BAC = log(BAC/lag1(BAC));
  return_BHI = log(BHI/lag1(BHI));
  return_CVX = log(CVX/lag1(CVX));
  return_DD  = log(DD/lag1(DD));
  return_DOW = log(DOW/lag1(DOW));
  return_HAL = log(HAL/lag1(HAL));
  return_HES = log(HES/lag1(HES));
  return_HUN = log(HUN/lag1(HUN));
  return_JPM = log(JPM/lag1(JPM));
  return_SLB = log(SLB/lag1(SLB));
  return_WFC = log(WFC/lag1(WFC));
  return_XOM = log(XOM/lag1(XOM));
  response_VV = log(VV/lag1(VV));
run;

/* Keep only the return_ variables (response_VV is excluded) */
data return_data;
  set temp (keep= return_:);
run;

/* 2: Principal factor analysis, SMC priors, no rotation */
ods graphics on;
proc factor data=return_data
  method=principal
  priors=smc
  rotate=none
  plots=(all);
run;
ods graphics off;

/* 3: Principal factor analysis, SMC priors, varimax rotation */
ods graphics on;
proc factor data=return_data
  method=principal
  priors=smc
  rotate=varimax
  plots=(all);
run;
ods graphics off;

/* 4: Maximum likelihood, SMC priors, varimax rotation */
ods graphics on;
proc factor data=return_data
  method=ML
  priors=smc
  rotate=varimax
  plots=(loadings);
run;
ods graphics off;

/* 5: Maximum likelihood, max priors, varimax rotation */
ods graphics on;
proc factor data=return_data
  method=ML
  priors=max
  rotate=varimax
  plots=(loadings);
run;
ods graphics off;