Principal component analysis

23 May 2017

Reading time ~2 minutes

I’ve been involved quite a few statistical projects in past. Where either the dimension of explanatory variables was a problem or multicollinearity was an issue. I want to share my experience by going through a simple example how PCA (Principal component analysis) can help to achieve both those goals. It is practically not possible to cover both diagnostic analysis, regression and visualization in one blog.

I will be laying emphases on dimension reduction and multicollinearity.

PCA Brief

PCA helps to reduce the no of variables, by not losing much of information. Large complex data can be analyzed smaller and understandable data set.

PCA is also used when there is high amount of correlation between explanatory variables. It removes the multicollinearity. There has been a lot of research done in this field. PCA is also one of the leading unsupervised learning techniques. I am going to explain this further in what conditions where I will use PCA. I will discuss this using R and how it can be used effectively.

Running through an example

The data set I’m using is in my analysis is corn data set. Iowa state Link to data is one of the leading corn producer. To evaluate the overall quality of soil, 23 explanatory variables was used. Varibles like slope of land, water infiltration to name a few. Scree and variance plots will be created.

Snapshot of the correlation matrix

alt image

If you see above there seems to be high multicollinearity between the explanatory variables. It is very high, and because of that estimates will be inflated. To fixed this we can either use Ridge Regression or, do PCA because of high VIF, but multicollinearity was still a problem.

Code

prn<-read.csv('corn.csv')

## Checking multicollinearity

ggpairs(data=prn, # data.frame with variables
        columns=c(2,3,4,5,6,7), # columns to plot, default to all.
        title="Data")+ggsave("corr.jpg")        
}

Code

Splitting explanatory variables and response variables

prn_expl <- (prn[, 2:24])

prn_response <- prn[,1]

Explanatory variables

alt image

Running PCA

prn.pca <- prcomp(na.omit(prn_expl),scale=TRUE)

summary(prn.pca)

**Table Below **

alt image

Scree Plot

alt image

Analysis

From above we can very well see, importance of the PCs. Last row in the table describes cumulative proportion of explained variance. We can see that first six PC’s account for almost 80% of the variance of the data. We can also see that from scree plot, there is a big bend at 6. Maximum variance can be explained.

Code

#Using
a1 <- prn.pca$rotation[,1:6]
a1

Principal componet analysis table

alt image

Final note

As we have seen above in the PCA analysis. PCA not only removes multicollinearity but also reduces the dimension. There is always a trade off, when you use, PCA as it only be used in building a model for prediction/ It cannot be used for finding significant predictors, and it can be sometime difficult to interpret.

I will be doing separate blog, on regression analysis and on visualization using the same dataset running PCA regression.

References

https://stat.ethz.ch

http://www.sthda.com