Pain-Free Data Science Tutorial: Principal Components Analysis

PCA is a widely used but poorly understood technique for reducing the number of features in your analysis. At this year's PyBay, I gave a talk on the intuition behind PCA. What drew me to the topic was the challenge of explaining something that looks mathematically intimidating in a more accessible way. 

You can look through the full lesson yourself, but what I enjoyed about designing this talk was deliberately staying away from equations and code. It's easy to fall back on yet another Python tutorial or yet another linear algebra explanation. Let's be honest: most newcomers to data science apply tools like PCA without fully understanding what they do and why. Most will tell you the following: 

1) You reduce dimensions the same way you do feature selection, but without 'losing information' (whatever that means!)

2) You type in a line of code and voila! Dimensions instantly reduced, and you plow ahead with your classification or clustering: 

from sklearn.decomposition import PCA

pca = PCA(n_components=2)                       # ask for two components and hope for the best
X_train_pca = pca.fit_transform(X_train)        # PCA is unsupervised, so no labels are needed


A deeper understanding helps you become a better data scientist. It's easy and tempting to plug and play code, but a good data scientist knows when not to use a tool:

First, a data scientist should understand the appropriate use of dimensionality reduction. In many cases it's approached backwards - a user decides up front, "I want to project my data down to n dimensions." Instead, check your scree plot for the cumulative percent of variance explained as each dimension is added, and make sure the components you keep capture an adequate amount of variance. 
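Here's a minimal sketch of that check, assuming scikit-learn and using the built-in iris data and a 95% variance threshold purely as stand-ins:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)       # PCA is scale-sensitive, so standardize first
pca = PCA().fit(X)                                          # fit all components; don't fix n up front
cumulative = np.cumsum(pca.explained_variance_ratio_)      # this is the curve behind a scree plot
n_components = int(np.searchsorted(cumulative, 0.95) + 1)  # smallest n that reaches ~95% of the variance
print(cumulative, n_components)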

Second, PCA doesn't change your data. It's just a shift in perspective - a rotation onto new axes - used to combat the curse of dimensionality. In my talk, I use the example of the duck/rabbit optical illusion to illustrate how a different perspective can make a projection look completely different. 
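A small sketch of that idea, assuming scikit-learn and a synthetic point cloud: keeping every component is a pure change of basis, so transforming and then inverse-transforming recovers the original points:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # any 3-D point cloud

pca = PCA(n_components=3)                       # keep every component: a pure rotation, nothing dropped
Z = pca.fit_transform(X)                        # same points, expressed on new axes
X_back = pca.inverse_transform(Z)               # rotate back to the original axes

print(np.allclose(X, X_back))                   # True: the data itself hasn't changed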

PCA = Ducks and Bunnies


We comprehend dimensionality reduction every day - on our phones, televisions, laptops, and in movie theaters. An image is nothing more than a two-dimensional representation of a three-dimensional scene. Think about the picture below - the scene is clearly three dimensional, yet we understand what it imparts, because the third dimension, though useful, is not necessary for our understanding of what the image portrays. In PCA terms, we're able to collapse the data by removing one dimension without sacrificing information: 

[Picture: a two-dimensional photograph of a three-dimensional scene]
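The same idea as a sketch with synthetic data: a 3-D cloud that is nearly flat can be collapsed onto two principal components while retaining almost all of its variance:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
flat = rng.normal(size=(500, 2))                # the real structure lives in two directions
depth = 0.05 * rng.normal(size=(500, 1))        # the third dimension is barely used
X = np.hstack([flat, depth])

pca = PCA(n_components=2).fit(X)                # collapse away the "depth" axis
print(pca.explained_variance_ratio_.sum())      # ~0.999: almost no information sacrificed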

Finally, it's important to note that you do lose interpretability for a non-data science audience. A simple way to explain it is that you separate the signal from the noise. However, it's much easier to explain a beta coefficient of a linear model than to explain the eigenvector weights (loadings) that define each component. It's quite important to keep that in mind when choosing to perform dimensionality reduction for your model. For people working in sensitive data environments (banking or healthcare), you may have to stick to basic feature selection.
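To see why the explanation gets harder, here's a sketch (again using scikit-learn and the iris features purely as stand-ins) that prints each component as a weighted blend of every original feature - there is no single per-feature coefficient to point to:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X)

for i, weights in enumerate(pca.components_):   # each row holds one component's loadings
    mix = ", ".join(f"{w:+.2f} * {name}" for w, name in zip(weights, iris.feature_names))
    print(f"PC{i + 1} = {mix}")                 # a blend of all four original features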