Principal Components and Factor Analysis using python code in 40 hours
Assignment #7 Jake Rifkin
Introduction:
In this report, exploratory factor analysis is performed on stock portfolio return data. The goal is to understand the common factors shared by the stocks in the portfolio by examining their correlation structure. Rotations are used to make the factor solutions easier to interpret. The data set is pruned to four key sectors (banking, oil field services, oil refining, and industrial chemicals), so we would ideally expect exploratory factor analysis to identify four common factors. Per the assignment instructions, daily returns are computed as log returns, which brings their distribution closer to normal; multivariate normality of the observed variables is an assumption of factor analysis, particularly of maximum likelihood estimation.
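As a minimal sketch of the log-return calculation, the snippet below computes log(P_t / P_{t-1}) for a short price series in Python. The prices are hypothetical and only illustrate the arithmetic; the report's own calculation was done in SAS (see the Code section).

```python
import math

def log_returns(prices):
    """Return log(P_t / P_{t-1}) for each consecutive pair of prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

# Hypothetical closing prices, for illustration only (not from the data set).
example_prices = [100.0, 102.0, 101.0, 103.5]
example_returns = log_returns(example_prices)

# Log returns are additive: they sum to the log of the overall price ratio.
total_log_return = sum(example_returns)
```

Note that the first observation has no previous price, so a series of n prices yields n - 1 returns.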
Results:
Principal Factor Analysis without Rotation
The first exploratory factor analysis is fit using the principal factor method, with prior communalities set to squared multiple correlations (SMC) and no a priori number of factors. SAS selects the number of factors for us from the eigenvalues of the reduced correlation matrix, using the cumulative proportion of common variance explained. Visually, the point of diminishing returns corresponds to the elbow in the scree plot.
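The steps above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not a reproduction of SAS output: the data are simulated from two hypothetical factors, the function names are made up for this example, and the retention rule (keep factors until the cumulative proportion of common variance reaches 1) is assumed to mirror the default behavior described above.

```python
import numpy as np

def smc_priors(R):
    """Squared multiple correlation of each variable with all the others."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))

def reduced_eigenvalues(R, priors):
    """Eigenvalues of R with its diagonal replaced by the priors, largest first."""
    reduced = R.copy()
    np.fill_diagonal(reduced, priors)
    return np.sort(np.linalg.eigvalsh(reduced))[::-1]

def factors_by_proportion(eigvals, threshold=1.0):
    """Smallest k whose cumulative share of common variance reaches threshold."""
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.argmax(cumulative >= threshold - 1e-12)) + 1
    return k, cumulative

# Simulated data (hypothetical): six variables driven by two common factors.
rng = np.random.default_rng(0)
factors = rng.standard_normal((500, 2))
true_loadings = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                          [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
X = factors @ true_loadings.T + 0.6 * rng.standard_normal((500, 6))
R = np.corrcoef(X, rowvar=False)

priors = smc_priors(R)
eigvals = reduced_eigenvalues(R, priors)
k, cumulative = factors_by_proportion(eigvals)
print("eigenvalues:", np.round(eigvals, 4))
print("cumulative proportion:", np.round(cumulative, 4))
print("factors retained:", k)
```

Because the reduced correlation matrix is not guaranteed to be positive semidefinite, some trailing eigenvalues can be negative, and the cumulative proportion can then exceed 1 before it falls back to exactly 1 at the last eigenvalue.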
The first factor analysis retains 2 factors, fewer than expected for a data set that contains four sectors. This method of factor selection is not immune to Heywood cases, as indicated by the negative eigenvalues, which lead to mathematical violations. As a result, the first two factors explain a cumulative proportion of 1.0101 of the variance; it is odd to consider that more than 100% of the variance could be explained. Given that Heywood cases are present, we should not proceed with this model. The factor loadings produced by this first analysis are presented below. A simple structure is not present, and the factors are difficult to interpret. The loadings on the first factor vary little and are all positive. The loadings on the second factor vary more and include many negative values, but none exceeds 0.4 in absolute value.
Although exploratory factor analysis was expected to identify the sectors as factors, plotting the two sets of factor loadings against each other shows that stocks from the same sector cluster together in the coordinate plane. Because there is little variance in the first factor, all returns fall close together on the horizontal axis; the vertical axis creates the separation between the clusters. The bottom three returns (BHI, HAL, and SLB) are all oil field services, and the next three (XOM, CVX, and HES) all belong to the oil refining sector. All oil-related stocks have negative loadings on the second factor. The top right quadrant of the graph contains the remaining two sectors, banking and industrial chemicals, which are also clustered closely together.
Principal Factor Analysis with Orthogonal Rotation
The second principal factor analysis model adds an orthogonal varimax rotation. Because we have not specified the number of factors a priori, SAS performs the same eigenvalue calculation as in the first model and arrives at the same number of factors. Specifying a rotation does not help SAS generate additional factors, because the rotation step occurs after the model is fit. The same issue with Heywood cases is still present in this version of the model. Before the rotation, SAS produces the same factor loadings as the previous model. After the varimax rotation, the variance explained is split much more evenly between the two factors than in the first model. A simple structure has still not been achieved, as the factor loading matrix lacks zeros. The components that changed the most are the three oil field service stocks (SLB, HAL, and BHI). The rotational change is most visible in the plot of the components against the two factor loadings. In the plot, the stocks within each sector still cluster closely together, but the rotation has moved all of the components into the top right quadrant.
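To illustrate why rotation changes the loadings but not the fit, here is a minimal sketch of a raw varimax rotation in Python with NumPy, using the common SVD-based iteration. It is an assumption-laden illustration: the loadings are hypothetical, and the sketch omits the Kaiser row normalization that statistical packages often apply, so it will not reproduce SAS output exactly.

```python
import numpy as np

def varimax(loadings, max_iter=500, tol=1e-8):
    """Raw varimax: find an orthogonal rotation that spreads squared loadings."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        gradient = loadings.T @ (
            rotated**3 - rotated @ np.diag(np.sum(rotated**2, axis=0)) / p
        )
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt  # closest orthogonal matrix to the gradient
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return loadings @ rotation, rotation

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return np.var(loadings**2, axis=0).sum()

# Hypothetical loadings: a simple structure deliberately mixed by 30 degrees.
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
theta = np.pi / 6
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 3))
```

The rotation matrix is orthogonal, so each variable's communality (its row sum of squared loadings) is identical before and after rotation; only the split of explained variance across factors changes.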
As for the interpretability of the factor loadings, the rotation adds some clarity about what the factors represent. The first factor has larger loadings for the banking stocks (BAC, JPM, and WFC), while the second factor has high positive loadings for the oil field services stocks and moderately high loadings for the oil refining stocks. The interpretation may be a bit clearer, but 2 factors are still fewer than the four sectors we would expect, and we have not yet achieved a simple structure. That, coupled with the presence of Heywood cases in the estimation of the factors, means we should continue our search for an acceptable factor analysis.
Maximum Likelihood Estimation Factor Analysis with Varimax rotation
The next model uses maximum likelihood estimation (MLE) to perform factor analysis with a varimax rotation. As with the previous two models, the number of factors is not provided a priori, and the MLE process also concludes that 2 factors should be considered. Unlike the previous two models, MLE has a formal statistical test that can aid in deciding how many factors to consider.
The test uses a chi-square distribution. Its null hypothesis is that there are no common factors, and the alternative hypothesis is that there is at least one common factor. If the test is significant at our predetermined alpha level, we reject the null hypothesis, conclude that at least one common factor exists, and repeat the test with an increasing number of common factors in the null hypothesis until we reach a result that is not significant. MLE offers other major benefits as a factor estimation tool, including the ability to formally compare factor models using criteria such as AIC or BIC, since the likelihood function allows these direct statistical comparisons (provided all assumptions are met). The rotated factor loadings produced by MLE are very similar to, though slightly different from, those produced by our second model. Because both models reach similar loadings after rotation, their interpretability is also similar. A simple structure is still not present, as no zero or near-zero values appear in the loadings. It is reassuring to see that different techniques produce similar results. While a formal bootstrap or cross-validation method should be employed to ensure that the factor loadings are not specific to this particular sample, at the very least we are seeing consistent results from different processes.
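As a rough sketch of the first step in that testing sequence, the Python snippet below computes a Bartlett-corrected chi-square statistic for the null hypothesis of no common factors, along with the degrees of freedom for a test that k factors are sufficient. The correlation matrix and sample size are hypothetical, the function names are made up for this example, and the exact small-sample correction used by SAS is an assumption here, so its printed statistic may differ slightly.

```python
import numpy as np

def no_common_factor_test(R, n_obs):
    """Bartlett-corrected chi-square statistic for H0: no common factors."""
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)
    statistic = -(n_obs - 1 - (2 * p + 5) / 6.0) * logdet
    df = p * (p - 1) // 2
    return statistic, df

def df_k_factors(p, k):
    """Degrees of freedom for H0: k common factors are sufficient."""
    return ((p - k) ** 2 - (p + k)) // 2

# Hypothetical 2x2 correlation matrix and sample size, for illustration only.
R_example = np.array([[1.0, 0.5],
                      [0.5, 1.0]])
statistic, df = no_common_factor_test(R_example, n_obs=100)
print("chi-square statistic:", round(statistic, 3), "df:", df)
```

The p-value would come from the upper tail of a chi-square distribution with `df` degrees of freedom (for example, via a chi-square survival function in a statistics library). For the 12 return series in this report, `df_k_factors(12, 2)` gives the degrees of freedom for the two-factor test.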
Maximum Likelihood Estimation Factor Analysis with Varimax Rotation and Max Priors
The final factor analysis model in this report is another maximum likelihood estimation factor analysis that uses the orthogonal varimax rotation but uses a different prior communality. This approach uses the largest absolute correlation for a variable with any other variable as the communality estimate for the variable. Using this prior structure, SAS suggests five factors should be included in this model using the same iterative statistical testing framework as the previous model. The large difference in factors included in the model based on differing prior communality estimates suggests that the prior communality has a large effect on factor analysis.
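The prior described above is simple to compute by hand. The following Python sketch takes, for each variable, the largest absolute correlation with any other variable; the 3x3 correlation matrix is hypothetical and the function name is made up for this example.

```python
import numpy as np

def max_abs_corr_priors(R):
    """Largest absolute off-diagonal correlation in each row of R."""
    off_diagonal = np.abs(R - np.diag(np.diag(R)))
    return off_diagonal.max(axis=1)

# Hypothetical correlation matrix, for illustration only.
R_small = np.array([[1.0,  0.6,  0.2],
                    [0.6,  1.0, -0.3],
                    [0.2, -0.3,  1.0]])
max_priors = max_abs_corr_priors(R_small)
print(max_priors)
```

Unlike the SMC prior, this estimate looks at only one pairwise relationship per variable, which is one reason the two priors can start the estimation from quite different communalities.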
Looking at the eigenvalues produced by this model, there are fewer Heywood cases than in the previous models, but Heywood cases are still present. Because of this, the model is potentially misspecified and should not be used. It is difficult to judge whether the five-factor model fits the data better than the two-factor model. In terms of interpretability, the first four factors of the five-factor model align with our expectation of one factor per sector. Note that we still have not reached a simple structure, as indicated by the lack of zeros in the loadings matrix. The fifth factor is especially puzzling and difficult to interpret: the absolute values of its loadings are all below 0.25, and its three largest loadings are spread across our sectors.
Conclusions:
In this report, we fit four different factor analysis models: two using principal factor analysis and two using maximum likelihood estimation. None of our models aligned perfectly with our intuitive understanding of the data as split into four sectors, though all models hinted at the four sectors, either through clusters in the factor loading plots or through the individual factors. All of our models had Heywood cases present, meaning that these models should not be used. Perhaps these Heywood cases indicate that our data do not align with the strict assumptions of the factor analysis models. Rotations helped increase the interpretability of our models, but we were unable to find a simple structure in any of them. Perhaps if we had used an oblique rotation and relaxed the orthogonality requirement enforced by varimax rotations, we could have found a simpler structure, but the factors would then no longer be uncorrelated. Additionally, the transformation of the returns to the log scale may have helped our data fit the model, though we never verified this assumption in this report. Overall, factor analysis is an interesting way to understand how the data are correlated, though here it mainly confirmed knowledge that we already had and that was readily accessible by simply researching the companies. It seems that a great deal more time would be required to fiddle with the knobs of this model to achieve a simple structure that is interpretable and informative for our use case. Given the Heywood cases, I would not suggest using any of the models developed in this report.
Code:
libname mydata "/scs/wtm926/" access=readonly;

/* 1 */
data temp;
  set mydata.stock_portfolio_data;
  drop AA HON MMM DPS KO PEP MPC GS;
run;

proc sort data=temp;
  by date;
run;

data temp;
  set temp;
  * Compute the log-returns;
  return_BAC = log(BAC/lag1(BAC));
  return_BHI = log(BHI/lag1(BHI));
  return_CVX = log(CVX/lag1(CVX));
  return_DD  = log(DD/lag1(DD));
  return_DOW = log(DOW/lag1(DOW));
  return_HAL = log(HAL/lag1(HAL));
  return_HES = log(HES/lag1(HES));
  return_HUN = log(HUN/lag1(HUN));
  return_JPM = log(JPM/lag1(JPM));
  return_SLB = log(SLB/lag1(SLB));
  return_WFC = log(WFC/lag1(WFC));
  return_XOM = log(XOM/lag1(XOM));
  response_VV = log(VV/lag1(VV));
run;

data return_data;
  set temp (keep= return_:);
run;

/* 2 */
ods graphics on;
proc factor data=return_data method=principal priors=smc rotate=none plots=(all);
run;
ods graphics off;

/* 3 */
ods graphics on;
proc factor data=return_data method=principal priors=smc rotate=varimax plots=(all);
run;
ods graphics off;

/* 4 */
ods graphics on;
proc factor data=return_data method=ML priors=smc rotate=varimax plots=(loadings);
run;
ods graphics off;

/* 5 */
ods graphics on;
proc factor data=return_data method=ML priors=max rotate=varimax plots=(loadings);
run;
ods graphics off;