The sheer size of data in the modern age is not only a challenge for computer hardware but also a major bottleneck for the performance of many machine learning algorithms. Principal Component Analysis (PCA) is a statistical method used to reduce the number of variables in a dataset. Its main goal is to identify patterns in data by detecting correlation between variables: reducing the dimensionality only makes sense if strong correlations between variables exist. PCA does so by combining highly correlated variables into a smaller number of new variables. Naturally, this comes at the expense of accuracy. However, if you have 50 variables and realize that 40 of them are highly correlated, you will gladly trade a little accuracy for simplicity.
The entire subject of statistics is based around the idea that you have a big set of data, and you want to analyse that set in terms of the relationships between the individual points in it. I am going to look at a few of the measures you can compute on a set of data, and what they tell you about the data itself.
- Standard Deviation: In statistics, the standard deviation (SD, also represented by the Greek letter sigma, σ) is a measure used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. How do we calculate it? The English definition of the SD is: “The average distance from the mean of the data set to a point”. The way to calculate it is to compute the squares of the distance from each data point to the mean of the set, add them all up, divide by n − 1 (for a sample of n points), and take the positive square root. As a formula:
  $$ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} $$
- Variance: In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean, and it informally measures how far a set of (random) numbers is spread out from its mean. The variance has a central role in statistics. It is used in descriptive statistics, statistical inference, hypothesis testing, goodness of fit, and Monte Carlo sampling, amongst many others. It is the square of the standard deviation.
- Covariance: Standard deviation and variance only operate on one dimension, so you can only calculate the standard deviation for each dimension of the dataset independently of the other dimensions. However, it is useful to have a similar measure to find out how much the dimensions vary from the mean with respect to each other. Covariance is such a measure. Covariance is always measured between two dimensions. If you calculate the covariance between one dimension and itself, you get the variance. The formula for covariance is very similar to the formula for variance. The formula for variance can be rewritten like this:
  $$ \mathrm{var}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n - 1} $$
where I have simply expanded the square term to show both parts. So given that knowledge, here is the formula for covariance:
  $$ \mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} $$
How does this work? Let’s use some example data. Imagine we have gone into the world and collected some 2-dimensional data: say, we have asked a group of students how many hours in total they spent studying and the mark they received. So we have two dimensions: the first is the hours studied, and the second is the mark received. So what does the covariance between them tell us? The exact value is not as important as its sign (i.e. positive or negative). If the value is positive, then both dimensions tend to increase together, meaning that, in general, as the number of hours of study increased, so did the final mark.
If the value is negative, then as one dimension increases, the other decreases. If we had ended up with a negative covariance here, that would have said the opposite: as the number of hours of study increased, the final mark decreased. In the last case, if the covariance is zero, the two dimensions are uncorrelated, that is, they have no linear relationship. Note that zero covariance does not by itself guarantee that the two dimensions are independent of each other.
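The three measures above can be sketched in a few lines of NumPy. This is only an illustration: the `hours` and `marks` arrays are hypothetical values invented for the example, not data from this article, and the sample (n − 1) form of each formula is assumed.

```python
import numpy as np

# Hypothetical data (invented for illustration): hours studied and
# marks received for six students.
hours = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
marks = np.array([50.0, 55.0, 65.0, 70.0, 80.0, 88.0])

n = len(hours)

# Sample standard deviation: square root of the summed squared
# distances from the mean, divided by n - 1.
sd_hours = np.sqrt(np.sum((hours - hours.mean()) ** 2) / (n - 1))

# Variance is the square of the standard deviation.
var_hours = sd_hours ** 2

# Sample covariance between the two dimensions.
cov_hm = np.sum((hours - hours.mean()) * (marks - marks.mean())) / (n - 1)

print("SD of hours:", sd_hours)
print("Variance of hours:", var_hours)
print("Covariance of hours and marks:", cov_hm)
```

For this made-up data the covariance comes out positive, which matches the reading above: more hours of study go with higher marks.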
Principal Component Analysis
The assumptions of PCA:
- Linearity – Assumes the data set to be linear combinations of the variables.
- The importance of mean and covariance – There is no guarantee that the directions of maximum variance will contain good features for discrimination
- That large variances have important dynamics – Assumes that components with larger variance correspond to interesting dynamics and that those with lower variance correspond to noise. In simpler terms, suppose we want to classify Male and Female using the height dimension. The data in the height dimension should then be dispersed; a dimension with negligible variance is of no use. That is, if all the observations have the same height, we will not be able to use this dimension to classify Male/Female.
Steps for PCA:
- Step 1: Data Preparation
In my simple example, I am going to use my own made-up data set. It’s only got 2 dimensions, and the reason why I have chosen this is so that I can provide plots of the data to show what the PCA analysis is doing.
- Step 2: Subtract the mean
For PCA to work properly, you have to subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. So, all the x values have x̄ (the mean of the x values of all the data points) subtracted, and all the y values have ȳ subtracted from them. This produces a data set whose mean is zero.
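A minimal sketch of the mean subtraction, assuming NumPy. The `data` array is a hypothetical 2-dimensional data set invented for illustration; it is not the data set used in this article.

```python
import numpy as np

# Hypothetical 2-D data set (invented for illustration): one row per
# data point, column 0 holds the x values, column 1 the y values.
data = np.array([
    [1.0, 1.2],
    [2.0, 1.9],
    [3.0, 3.2],
    [4.0, 3.8],
    [5.0, 5.1],
    [6.0, 5.8],
])

# Mean of each dimension (column): [x_bar, y_bar].
mean = data.mean(axis=0)

# Subtract x_bar from every x value and y_bar from every y value.
adjusted = data - mean

print("Per-dimension means:", mean)
print("Means after adjustment:", adjusted.mean(axis=0))  # approximately zero
```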
- Step 3: Calculate the covariance matrix
Since the data is 2-dimensional, the covariance matrix will be 2 × 2. Here I will just give you the result.
So, since the off-diagonal elements in this covariance matrix are positive, we should expect that the x and y variables increase together.
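As a sketch of how such a matrix can be computed, the snippet below builds the 2 × 2 covariance matrix from mean-adjusted data with NumPy. The data set is hypothetical (invented for illustration), so the numbers it prints are not the result referred to above.

```python
import numpy as np

# Hypothetical 2-D data set (invented for illustration), one row per point.
data = np.array([
    [1.0, 1.2],
    [2.0, 1.9],
    [3.0, 3.2],
    [4.0, 3.8],
    [5.0, 5.1],
    [6.0, 5.8],
])

n = data.shape[0]
adjusted = data - data.mean(axis=0)

# Entry (i, j) is the sample covariance between dimension i and
# dimension j; the diagonal holds the variances.
cov_matrix = adjusted.T @ adjusted / (n - 1)

print(cov_matrix)
```

The same matrix can be obtained with `np.cov(data, rowvar=False)`, which also divides by n − 1 by default.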
- Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix
Since the covariance matrix is square, we can calculate the eigenvectors and eigenvalues for this matrix. These are rather important, as they tell us useful information about our data.
Here are the eigenvectors and eigenvalues:
It is important to notice that these eigenvectors are both unit eigenvectors, i.e. their lengths are both 1. This is very important for PCA, but luckily, most maths packages, when asked for eigenvectors, will give you unit eigenvectors.
So what do they mean? If you look at the plot of the data in Figure 1.2 then you can see how the data has quite a strong pattern. As expected from the covariance matrix, the two variables do indeed increase together. On top of the data I have plotted both the eigenvectors as well. They appear as diagonal dotted lines on the plot. As stated in the eigenvector section, they are perpendicular to each other. But, more importantly, they provide us with information about the patterns in the data. See how one of the eigenvectors goes through the middle of the points, like drawing a line of best fit? That eigenvector is showing us how the two dimensions are related along that line. The second eigenvector gives us the other, less important, pattern in the data: that all the points follow the main line, but are off to the side of the main line by some amount. So, by this process of taking the eigenvectors of the covariance matrix, we have been able to extract lines that characterize the data. The rest of the steps involve transforming the data so that it is expressed in terms of those lines.
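A sketch of this step with NumPy, using a hypothetical data set invented for illustration. It assumes `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix, and then checks the two properties discussed above: unit length and perpendicularity.

```python
import numpy as np

# Hypothetical 2-D data set (invented for illustration), one row per point.
data = np.array([
    [1.0, 1.2],
    [2.0, 1.9],
    [3.0, 3.2],
    [4.0, 3.8],
    [5.0, 5.1],
    [6.0, 5.8],
])

cov_matrix = np.cov(data, rowvar=False)

# eigh returns eigenvalues in ascending order and the matching
# eigenvectors as the COLUMNS of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Length of each eigenvector (column); both should be 1.
lengths = np.linalg.norm(eigenvectors, axis=0)

# Dot product of the two eigenvectors; zero means perpendicular.
dot = eigenvectors[:, 0] @ eigenvectors[:, 1]

print("Eigenvalues:", eigenvalues)
print("Eigenvectors (columns):")
print(eigenvectors)
print("Lengths:", lengths)
print("Dot product:", dot)
```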
- Step 5: Choosing components and forming a feature vector
Here is where the notion of data compression and reduced dimensionality comes into it. If you look at the eigenvectors and eigenvalues from the previous section, you will notice that the eigenvalues are quite different in size. In fact, it turns out that the eigenvector with the highest eigenvalue is the principal component of the data set. In our example, the eigenvector with the larger eigenvalue was the one that pointed down the middle of the data. It is the most significant relationship between the data dimensions. In general, once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance. Now, if you like, you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don’t lose much. If you leave out some components, the final data set will have fewer dimensions than the original. To be precise, if you originally have n dimensions in your data, and so you calculate n eigenvectors and eigenvalues, and then you choose only the first p eigenvectors, then the final data set has only p dimensions.
What you need to do now is form a feature vector, which is just a fancy name for a matrix of vectors. This is constructed by taking the eigenvectors that you want to keep from the list of eigenvectors and forming a matrix with these eigenvectors in the columns.
Given our example set of data and the fact that we have 2 eigenvectors, we have two choices. We can either form a feature vector with both of the eigenvectors, or we can choose to leave out the smaller, less significant component and only have a single column.
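The ordering and selection described in this step might look like the following sketch. The data set is hypothetical, and the names `p`, `explained` and `feature_vector` are illustrative choices, not part of the original example.

```python
import numpy as np

# Hypothetical 2-D data set (invented for illustration), one row per point.
data = np.array([
    [1.0, 1.2],
    [2.0, 1.9],
    [3.0, 3.2],
    [4.0, 3.8],
    [5.0, 5.1],
    [6.0, 5.8],
])

cov_matrix = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Order the components by eigenvalue, highest to lowest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each component.
explained = eigenvalues / eigenvalues.sum()

# Keep only the first p eigenvectors as the columns of the feature vector.
p = 1
feature_vector = eigenvectors[:, :p]

print("Share of variance per component:", explained)
print("Feature vector (eigenvectors in columns):")
print(feature_vector)
```

Looking at `explained` is one way to judge how much information is given up when the smaller components are dropped.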
- Step 6: Deriving new variables
This is the final step in PCA, and is also the easiest. Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of the feature vector and multiply it on the left of the mean-adjusted data set, transposed:
$$ \text{FinalData} = \text{RowFeatureVector} \times \text{RowAdjustData} $$
where RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top, and RowAdjustData is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension. FinalData is the final data set, with data items in columns and dimensions along rows.
What will this give us? It will give us the original data solely in terms of the vectors we chose. Our original data set had two axes, x and y, so our data was in terms of them. It is possible to express data in terms of any two axes that you like. If these axes are perpendicular, then the expression is the most efficient. This was why it was important that eigenvectors are always perpendicular to each other. We have changed our data from being in terms of the axes x and y, and now they are in terms of our 2 eigenvectors. In the case of when the new data set has reduced dimensionality, ie. we have left some of the eigenvectors out, the new data is only in terms of the vectors that we decided to keep. In the case of keeping both eigenvectors for the transformation, we get the data and the plot found in Figure 1.3. This plot is basically the original data, rotated so that the eigenvectors are the axes. This is understandable since we have lost no information in this decomposition.
So what have we done here? Basically, we have transformed our data so that it is expressed in terms of the patterns between them, where the patterns are the lines that most closely describe the relationships between the data. This is helpful because we have now classified our data points as a combination of the contributions from each of those lines. Initially, we had the simple x and y axes. This is fine, but the x and y values of each data point don’t really tell us exactly how that point relates to the rest of the data. Now, the values of the data points tell us exactly where the data point sits relative to the trend lines (i.e. above or below). In the case of the transformation using both eigenvectors, we have simply altered the data so that it is in terms of those eigenvectors instead of the usual axes. But the single-eigenvector decomposition has removed the contribution due to the smaller eigenvector and left us with data that is only in terms of the other.
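The transformation in this step can be sketched as a single matrix product. The example below uses a hypothetical data set (invented for illustration) and computes the transformed data twice: once keeping both eigenvectors and once keeping only the most significant one. The variable names are illustrative.

```python
import numpy as np

# Hypothetical 2-D data set (invented for illustration), one row per point.
data = np.array([
    [1.0, 1.2],
    [2.0, 1.9],
    [3.0, 3.2],
    [4.0, 3.8],
    [5.0, 5.1],
    [6.0, 5.8],
])

adjusted = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data, rowvar=False))

# Sort so the most significant eigenvector comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Mean-adjusted data transposed: data items in columns, dimensions in rows.
row_adjust_data = adjusted.T

# Keep both eigenvectors: eigenvectors in rows, most significant on top.
row_feature_vector_both = eigenvectors.T
final_both = row_feature_vector_both @ row_adjust_data

# Keep only the most significant eigenvector.
row_feature_vector_one = eigenvectors[:, :1].T
final_one = row_feature_vector_one @ row_adjust_data

print("Both components:", final_both.shape)  # (dimensions kept, data items)
print("One component:", final_one.shape)
```

With both eigenvectors kept, the covariance matrix of `final_both` is diagonal: the new axes are uncorrelated and the variance along each one equals the corresponding eigenvalue.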
Getting Original Data Back
Wanting to get the original data back is obviously of great concern if you are using the PCA transform for data compression.
So, how do we get the original data back? Before we do that, remember that only if we took all the eigenvectors in our transformation will we get exactly the original data back. If we have reduced the number of eigenvectors in the final transformation, then the retrieved data has lost some information. Recall that the final transformation is this:
$$ \text{FinalData} = \text{RowFeatureVector} \times \text{RowAdjustData} $$
which can be turned around so that, to get the mean-adjusted data back,
$$ \text{RowAdjustData} = \text{RowFeatureVector}^{-1} \times \text{FinalData} $$
where $\text{RowFeatureVector}^{-1}$ is the inverse of RowFeatureVector. However, when we take all the eigenvectors in our feature vector, it turns out that the inverse of our feature vector is actually equal to the transpose of our feature vector. This is only true because the elements of the matrix are all the unit eigenvectors of our data set. This makes the return trip to our data easier, because the equation becomes
$$ \text{RowAdjustData} = \text{RowFeatureVector}^{T} \times \text{FinalData} $$
But, to get the actual original data back, we need to add on the mean of that original data (remember we subtracted it right at the start). So, for completeness:
$$ \text{RowOriginalData} = (\text{RowFeatureVector}^{T} \times \text{FinalData}) + \text{OriginalMean} $$
This formula also applies when you do not have all the eigenvectors in the feature vector. So even when you leave out some eigenvectors, the above equation still makes the correct transform, although the data it returns is then only an approximation of the original.
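The round trip can be sketched as follows, again on a hypothetical data set invented for illustration. The helper `transform_and_recover` is a name made up for this example; it applies the forward transform and then the transpose-based reconstruction, adding the mean back at the end.

```python
import numpy as np

# Hypothetical 2-D data set (invented for illustration), one row per point.
data = np.array([
    [1.0, 1.2],
    [2.0, 1.9],
    [3.0, 3.2],
    [4.0, 3.8],
    [5.0, 5.1],
    [6.0, 5.8],
])

mean = data.mean(axis=0)
adjusted = data - mean
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]


def transform_and_recover(num_components):
    """Project onto the first num_components eigenvectors, then map back."""
    row_feature_vector = eigenvectors[:, :num_components].T
    final_data = row_feature_vector @ adjusted.T
    # Transpose instead of inverse, then add the original mean back.
    row_original = row_feature_vector.T @ final_data + mean[:, np.newaxis]
    return row_original.T  # back to one row per data point


recovered_all = transform_and_recover(2)  # all eigenvectors kept
recovered_one = transform_and_recover(1)  # smaller component dropped

max_error = np.abs(recovered_one - data).max()

print("Exact recovery with all eigenvectors:", np.allclose(recovered_all, data))
print("Largest error with one eigenvector:", max_error)
```

With every eigenvector kept the data comes back unchanged (up to floating-point rounding); with one component dropped, each recovered point lies on the line of the principal eigenvector and differs slightly from the original.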