Principal Components and Factor Analysis using python code in 40 hours
Assignment #6 Jake Rifkin
Introduction:
In this report, principal component analysis (PCA) is explored for its properties in dimensional reduction as well as its applicability to linear modeling. The data set used is a stock portfolio dataset and the goal of PCA and linear regression is to help explain the variation in the market index with 20 daily individual stock returns. PCA allows us to reduce and group the stocks into principle components to better understand how the stocks are related to each other and what the covariance structure looks like while linear regression allows us to evaluate the impact of an individual stock on the market index close.
Results:
Correlations between stocks and the market index
The daily log return for each stock is calculated as well as the log return of the market index variable. The correlation between the log return of each stock and the log return of the market index is summarized in the table below. All of the log returns have a positive correlation with the log market return ranging from .44 to .76. In other words, all stock returns have at least a moderately strong correlation with the return of the market index. At this point, noting that so many stocks are positively correlated to the market index, we should expect that some degree of multicollinearity is present between the predictor stocks. As all log return are positively correlated, each of these stocks is giving a relatively similar piece of information to our model if we were to employ a linearly model to predict the daily market return. A full interpretation of the covariance matrix would be a worthwhile exercise to unpack the underlying correlation structure especially since we’re looking to fit a linear model to explain the variation in the log of the market index. Multicollinearity is a major issue for linear models.
return_AA return_BAC return_BHI return_CVX return_DD return_DOW return_DPS Response_vv corr
.63241 .65019 .5775 .7209 .68952 .62645 .4435
return_GS return_HAL return_HES return_HON return_HUN return_JPM return_KO Response_vv corr .71216 .5975 .6108 .76838 .58194 .65785 .5998
return_MMM return_MPC return_PEP return_SLB return_WFC return_XOM Response_vv corr
.76085 .47312 .50753 .69285 .73357 .72111
It is worth noting that many of these stocks belong to the same sector. For example, the BAC, GS, JPM, and WFC stocks all belong to the banking sector. One way to potentially achieve some dimensional reduction and gain a better understanding if we’re getting similar information from similar stocks is to group the correlation by their industry. The following chart presents the stock correlations to the market index grouped by sector.
Looking at the banking sector, all four banking stocks have high correlation to the market index and low variation within the group. The oil refining group has a much greater variability. While our base sizes for the various clusters are fairly small, we can start to gain an understanding of how sectors relate to the market index returns. A way to summarize by sector would be to take the mean correlation coefficient by sector, but it should be noted that this method is not recommended for interpreting the effect of a sector on the market index returns. For one, it is inappropriate to simply average the correlation coefficients to produce a mean coefficient. If a mean coefficient is desired for Pearson correlation coefficients, a Fisher z transformation where the correlation coefficient is transformed to a z-‐value and then averaged to produce a single correlation coefficient is a preferable interpretation1. Additionally, averaging the correlation coefficients ignores the joint effect of the various individual stocks on the market index. In other words, as a Pearson correlation coefficient is only between a response variable and an explanatory variable, there is an implicit assumption of et ceteris paribus of the other variables. This is an over simplified view of potentially joint effects and gives rise to the reason why the covariance matrix and PCA is used to analyze the underlying covariance structure of a set of explanatory variables on a single response variable.
1 Corey, David M., William P. Dunlap, and Michael J. Burke. "Averaging Correlations: Expected Values
Principal Component Analysis
In this section, PCA is used to further examine the underlying covariance structure of the stock return variables (excluding the market index variable) with the goal of dimensional reduction and pre-‐ supposing potential multicollinearity. The following plot represents the proportion of variance explained by the principal component. Approximately 48% of the variance is explained by the first principle component while the second principle component drops to explaining less than 8% of the variance. As there is no exact way to choose the number of principal components to use, there are a couple of general wisdoms to employ. One states that components should be included until at least 80% of the variance is explained. In our chart, this would suggest including 8 principal components. This method ensures that a high proportion of variance is explained by the included components but may not yield the simplest interpretation. Another wisdom to consider is that of parsimony. The concept of creating the most parsimonious explanation of the phenomenon means explaining as much as possible with as few components as possible. The concept of parsimony prevents including too many components which makes the interpretation of the effects difficult. Ideally, the first few principal components contain the vast majority of the variance (70-‐90%). A way to potentially achieve parsimony is to exclude any components whose Eigenvalue is less than the mean Eigenvalue of all the other components. Applying this threshold to our data, we would only include the first 3 components and only explain 62.61% of the variance.
For this report, I will include the first 8 principle components in the regression portion as they explain 80% of the variance. In this section, the first two principal components are examined more closely by plotting the first two principal components in a scatter plot. Additionally, a table of the first two principal components is included.
Looking at the first principal component, which contains 48% of the variance; note that all of the stock tickers are positive and relatively closely contained. One interpretation of this principal component could be a weighted measure of the overall market performance. Some stocks contribute slightly less than others, but for the most part, all stocks are contributing fairly equally across the board. The second principal component has 3 distinct outliers clearly visible in the scatter plot while the rest of the data clusters around the bottom right corner. Interestingly, the 3 stocks towards the high end of principal component 2 are all soft drink manufacturers. This suggests that the soft drink sector provides disproportionately to explaining the variance of the second principal component.
Multiple Regression Modeling
In this section, a training and test set is created to aid in comparing two multiple regression models on metrics like mean absolute error and goodness of fit. The first model fit uses all of the log stock return variables to predict the market index variable. The second model fit uses the first eight principal
components. The models are compared across their goodness of fit as well as their ability to predict in and out of sample data with MSE and MAE.
The first model includes all 20 stocks as predictors attempting to explain the variance of the training set market return. The overall adjusted r square is .8919 indicating that the model explains a lot of the training set variation. Not all of the parameter estimates are statistically significant and if this model was to be used for inferential purposes, it should be noted that those variables should be trimmed and the model refit on a subset of the variables. In addition to examining the coefficients, the variance inflation factor (VIF) of each coefficient is checked. While no VIFs are greater than 5 (an indicator of potential multicollinearity), the blind log transformation we employed on the returns at the beginning of the analysis is most likely suppressing the values of the VIF. Considering that the data in the regression is all on the log scale, the VIFs of 2 or 3 may be indicators of multicollinearity.
Looking at the residual plots, the log return model fits the data fairly well and it appears that no major assumptions of linear modeling are violated. The residual plot shows a good random spread. The QQ plot indicates incredibly minimal departures form normality and cook’s D indicates only a couple of outliers. The histogram of the residuals appears to follow the assumptions of normality and homogeneity. Overall, this model seems to fit well though concerns about multicollinearity of the predictor variables exist and the model should prune the non-‐significant coefficients and refit.
The second model is fit on the first eight principal components to explain the log return of the market index. This model has an adjusted r squared value of .8886 which indicates a very high proportion of the variance of the log return of the market index is explained by these 8 variables. Already, we see a reduction in model complexity with this model in terms of the number of terms included. Looking at the coefficients, the first 4 and the eighth principle components have statistically significant non zero effects on the log of the market index. The insignificant variables should be pruned from the model and the model should be refit. So far, this model has taken a slight hit in adjusted r square compared to the first model but the real benefit of the second model is apparent when looking at the VIFs of the coefficients. All VIFs are within .03 of 1. While the previous model has probably multicollinearity, this model is relatively free of multicollinearity as indicated by the low VIF values. This indicates that a more stable
model that better and more consistently explains the variation in the log of the market index. Additionally, this model will most likely extrapolate better than the previous model.
Looking at the residual of the second model, the model has just as good of a goodness of fit as the first model. All assumptions of linear regression appear to be maintained. The residual scatter plot is random and devoid of a clear pattern while the QQ plot shows minimal deviations from normality. The Cook’s D plot shows only a few influential points and the histogram of the residual appears to fit the assumption of normality perfectly. Overall, this model fits as well as the previous model and avoids some issues with multicollinearity that are present in the log return version of the model. The insignificant variables should be pruned and the model should be refit. This model has the added benefit of having potentially easier interpretability than the previous model as long as the interpretation of the PCA components is logical.
Finally, the two models are compared on their ability to predict using MSE and MAE. The table below summarizes the findings on training and testing data. Overall, both models perform similarly. They both have higher error rates in the out of sample data than the in sample data. The first model fit on log returns outperforms the principal component fit model at the hundred-‐thousandths place in the MAE on both the training and test data. The first model outperforms the second model by more in the training set than the out of sample test set.
Data/Model Model 1 MSE Model 1 MAE Model 2 MSE Model 2 MAE Test .000009306 .002144904 .000009677 .002179249 Train .000005994 .001902032 .000006410 .001975239
Conclusions:
In this report, stock return data was analyzed with the goal of fitting and comparing two models that explain the variation in the log return of the market index. Principal component analysis was conducted to aide in dimension reduction as well as for use as regression inputs. Overall, PCA was only modestly beneficial over the log transformation of the stock returns because it took several principal components to reach an appropriate threshold of variance explained. Principal components beyond the second only accounted for less than 7% of the variance of the explanatory variables. Additionally, including many principal components directly challenges the goal of having a parsimonious interpretation.
One major assumption this report makes is that the log transformations of the daily returns were necessary and good for the modeling. First, the log transformation at the beginning made it difficult to identify multicollinearity using the VIF metric and secondly PCA is not scale invariant and thus performing PCA on the non-‐log returns could yield different results. In terms of modeling help, PCA helped our linear model reduce the multicollinearity issue present in the raw data though both models seemed to perform similarly on testing and training data. In terms of which model is preferable, in reality both models perform very similarly and given the nature of the data, if our goal is accurate prediction, a time series approach may be a more appropriate. In choosing between the two models fit in this report, the second model has less multicollinearity with very similar performance. This model would be the preferred model to employ.
Code:
libname mydata "/scs/wtm926/" access=readonly; data temp; set mydata.stock_portfolio_data; run; proc print data=temp(obs=10); run; quit; proc sort data=temp; by date; run; quit; data temp; set temp; * Compute the log-‐returns -‐ log of the ratio of today's price to yesterday's price; * Note that the data needs to be sorted in the correct direction in order for us to compute the correct return; return_AA = log(AA/lag1(AA)); return_BAC = log(BAC/lag1(BAC)); return_BHI = log(BHI/lag1(BHI)); return_CVX = log(CVX/lag1(CVX)); return_DD = log(DD/lag1(DD)); return_DOW = log(DOW/lag1(DOW));
return_DPS = log(DPS/lag1(DPS)); return_GS = log(GS/lag1(GS)); return_HAL = log(HAL/lag1(HAL)); return_HES = log(HES/lag1(HES)); return_HON = log(HON/lag1(HON)); return_HUN = log(HUN/lag1(HUN)); return_JPM = log(JPM/lag1(JPM)); return_KO = log(KO/lag1(KO)); return_MMM = log(MMM/lag1(MMM)); return_MPC = log(MPC/lag1(MPC)); return_PEP = log(PEP/lag1(PEP)); return_SLB = log(SLB/lag1(SLB)); return_WFC = log(WFC/lag1(WFC)); return_XOM = log(XOM/lag1(XOM)); *return_VV = log(VV/lag1(VV)); response_VV = log(VV/lag1(VV)); run; proc print data=temp(obs=10); run; quit; * We can use ODS TRACE to print out all of the data sets available to ODS for a particular SAS procedure.; * We can also look these data sets up in the SAS User's Guide in the chapter for the selected procedure.; *ods trace on; ods output PearsonCorr=portfolio_correlations; proc corr data=temp; *var return: with response_VV; var return_:; with response_VV; run; quit; *ods trace off; proc print data=portfolio_correlations; run; quit; data wide_correlations; set portfolio_correlations (keep=return_:); run; * Note that wide_correlations is a 'wide' data set and we need a 'long' data set; * SAS has two 'standard' data formats -‐ wide and long; * We can use PROC TRANSPOSE to convert data from one format to the other; proc transpose data=wide_correlations out=long_correlations; run; quit;
data long_correlations; set long_correlations; tkr = substr(_NAME_,8,3); drop _NAME_; rename COL1=correlation; run; proc print data=long_correlations; run; quit; * Merge on sector id and make a colored bar plot; data sector; input tkr $ 1-‐3 sector $ 4-‐35; datalines; AA Industrial -‐ Metals BAC Banking BHI Oil Field Services CVX Oil Refining DD Industrial -‐ Chemical DOW Industrial -‐ Chemical DPS Soft Drinks GS Banking HAL Oil Field Services HES Oil Refining HON Manufacturing HUN Industrial -‐ Chemical JPM Banking KO Soft Drinks MMM Manufacturing MPC Oil Refining PEP Soft Drinks SLB Oil Field Services WFC Banking XOM Oil Refining VV Market Index ; run; proc print data=sector; run; quit; proc sort data=sector; by tkr; run; proc sort data=long_correlations; by tkr; run; data long_correlations; merge long_correlations (in=a) sector (in=b); by tkr;
if (a=1) and (b=1); run; proc print data=long_correlations; run; quit; * Make Grouped Bar Plot; * p. 48 Statistical Graphics Procedures By Example; ods graphics on; title 'Correlations with the Market Index'; proc sgplot data=long_correlations; format correlation 3.2; vbar tkr / response=correlation group=sector groupdisplay=cluster datalabel; run; quit; ods graphics off; * Still not the correct graphic. We want the tickers grouped and color coded by sector; * We want ticker labels directly under the x-‐axis and sector labels under the ticker labels denoting each group. Looks like we have an open SAS graphics project.; ods graphics on; title 'Correlations with the Market Index'; proc sgplot data=long_correlations; format correlation 3.2; vbar sector / response=correlation group=tkr groupdisplay=cluster datalabel; run; quit; ods graphics off; * SAS can produce bar plots by sector of the mean correlation; proc means data=long_correlations nway noprint; class sector; var correlation; output out=mean_correlation mean(correlation)=mean_correlation; run; quit; proc print data=mean_correlation; run; ods graphics on; title 'Mean Correlations with the Market Index by Sector'; proc sgplot data=mean_correlation; format mean_correlation 3.2; vbar sector / response=mean_correlation datalabel; run; quit; ods graphics off;
* Note that we have been using PROC SGPLOT to display a data summary, and hence we have not been able to make the display that we want. In reality PROC SGPLOT is designed to take an input data set, perform some routine data summaries, and display that output. Unfortunately, routine data summaries are typically frequency counts for discrete data or averages for contiuous data. Here is an example of the default use of PROC SGPLOT.; ods graphics on; title 'Mean Correlations with the Market Index by Sector -‐ SGPLOT COMPUTES MEANS'; proc sgplot data=long_correlations; format correlation 3.2; vbar sector / response=correlation stat=mean datalabel; run; quit; ods graphics off; * Reset title statement to blank; title ''; ************************************************************************************; * Begin Modeling; ************************************************************************************; * Note that we do not want the response variable in the data used to compute the principal components; data return_data; set temp (keep= return_:); * What happens when I put this keep statement in the set statement?; * Look it up in The Little SAS Book; run; proc print data=return_data(obs=10); run; ************************************************************************************; * Compute Principal Components; ************************************************************************************; ods graphics on; proc princomp data=return_data out=pca_output outstat=eigenvectors plots=scree(unpackpanel); run; quit; ods graphics off; * Notice that PROC PRINCOMP produces a lot of output; * How many principal components should we keep?; * Do the principal components have any interpretability?; * Can we display that interpretability using graphics?; proc print data=pca_output(obs=10); run; proc print data=eigenvectors(where=(_TYPE_='SCORE')); run; * Display the two plots and the Eigenvalue table from the output;
* Plot the first two eigenvectors; data pca2; set eigenvectors(where=(_NAME_ in ('Prin1','Prin2'))); drop _TYPE_ ; run; proc print data=pca2; run; proc transpose data=pca2 out=long_pca; run; quit; proc print data=long_pca; run; data long_pca; set long_pca; format tkr $3.; tkr = substr(_NAME_,8,3); drop _NAME_; run; proc print data=long_pca; run; * Plot the first two principal components; ods graphics on; proc sgplot data=long_pca; scatter x=Prin1 y=Prin2 / datalabel=tkr; run; quit; ods graphics off; ************************************************************************************; * Create a training data set and a testing data set from the PCA output ; * Note that we will use a SAS shortcut to keep both of these 'datasets' in one ; * data set that we will call cv_data (cross-‐validation data). ; ************************************************************************************; data cv_data; merge pca_output temp(keep=response_VV); * No BY statement needed here. We are going to append a column in its current order; * generate a uniform(0,1) random variable with seed set to 123; u = uniform(123); if (u < 0.70) then train = 1; else train = 0; if (train=1) then train_response=response_VV; else train_response=.;
run; proc print data=cv_data(obs=10); run; * You can double check this merge by printing out the original data; proc print data=temp(keep=response_VV obs=10); run; quit; ************************************************************************************* *******; * Fit a regression model using the raw predictor variables; ************************************************************************************* *******; * Fit a regression model using all of the raw predictor variables and VV as the response variable; ods graphics on; proc reg data=cv_data; model train_response = return_: / vif ; output out=model1_output predicted=Yhat ; run; quit; ods graphics off; * Examine the Goodness-‐Of-‐Fit for this model. How well does it fit? Are there any problems?; ************************************************************************************* *******; * Fit a regression model using the rotated predictor variables (Principal Component Scores) ; ************************************************************************************* *******; * Now fit a regression model using your selected number of principal components and VV as the response variable; * Examine the Goodness-‐Of-‐Fit for this model. How well does it fit? Are there any problems?; ods graphics on; proc reg data=cv_data; model train_response = prin1-‐prin8 / vif ; output out=model2_output predicted=Yhat ; run; quit; ods graphics off; ************************************************************************************* *******; * Compare fit and predictive accuracy of the two fitted models ; ************************************************************************************* *******; * Use the Mean Square Error (MSE) and the Mean Absolute Error (MAE) metrics for your comparison.;
proc print data=model1_output(obs=10); run; * Model 1; data model1_output; set model1_output; square_error = (response_VV -‐ Yhat)**2; absolute_error = abs(response_VV -‐ Yhat); run; proc means data=model1_output nway noprint; class train; var square_error absolute_error; output out=model1_error mean(square_error)=MSE_1 mean(absolute_error)=MAE_1; run; quit; proc print data=model1_error; title "Model 1 (Log Return) MSE";run; * Model 2; data model2_output; set model2_output; square_error = (response_VV -‐ Yhat)**2; absolute_error = abs(response_VV -‐ Yhat); run; proc means data=model2_output nway noprint; class train; var square_error absolute_error; output out=model2_error mean(square_error)=MSE_2 mean(absolute_error)=MAE_2; run; quit; proc print data=model2_error; title "Model 2 (PCA) MSE"; run; * Merge them together in one table; data error_table; merge model1_error(drop= _TYPE_ _FREQ_) model2_error(drop= _TYPE_ _FREQ_); by train; run; proc print data=error_table; run;