A simple method for choosing the number of components to use in principal component analysis
A common problem in data science is the curse of dimensionality. Essentially, the more variables a dataset encompasses, the more mathematically intractable it is to make measurements based on them all. The usual method for dealing with this problem is Principal Component Analysis, which seeks to reduce the data to a lower number of dimensions while retaining as much information as possible. The most common method of doing this is as follows.
- Obtain either the covariance matrix of the variables or a similarity matrix of the observations, using a metric such as cosine similarity
- Calculate the eigenvalues and eigenvectors of this matrix
- Use the eigenvectors corresponding to the N largest eigenvalues to form an orthonormal basis
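The three steps above can be sketched with numpy for a fixed N. This is only an illustration: the data is randomly generated, and the value N = 2 is chosen arbitrarily rather than by any selection rule.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data for illustration: 200 observations of 5 variables
X = rng.normal(size=(200, 5))

# Step 1: covariance matrix of the variables (the columns of X)
cov = np.cov(X, rowvar=False)

# Step 2: eigenvalues and eigenvectors; eigh is intended for symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 3: eigh returns eigenvalues in ascending order, so the eigenvectors
# of the N largest eigenvalues are the last N columns
N = 2  # arbitrary choice for this sketch
basis = eigenvectors[:, -N:]

# The columns of the basis are orthonormal
assert np.allclose(basis.T @ basis, np.eye(N))

# Project the observations onto the reduced basis
reduced = X @ basis
```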
This, however, raises the question of how to select an appropriate value of N. Since our aim is to explain the maximum amount of variance with the minimum number of components, a simple approach is to find the cutoff that maximises the sum of two quantities: the proportion of components discarded and the proportion of variance retained, where variance is measured by the eigenvalues.
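As a small worked example of this criterion, the sketch below scores every possible cutoff for a hypothetical set of five eigenvalues. The eigenvalues are invented for illustration and are assumed to be sorted in ascending order.

```python
import numpy as np

# Hypothetical eigenvalues in ascending order (total variance = 10)
eigenvalues = np.array([0.1, 0.2, 0.7, 3.0, 6.0])
n = eigenvalues.shape[0]

# Proportion of components discarded if the k smallest are dropped
discarded = np.arange(n) / n

# Proportion of variance retained after dropping the k smallest:
# one minus the sum of the first k eigenvalues over the total
dropped_variance = eigenvalues.cumsum() - eigenvalues
retained = 1.0 - dropped_variance / eigenvalues.sum()

# Score each cutoff and pick the best one
score = discarded + retained
cutoff = int(score.argmax())
kept = n - cutoff

print(cutoff, kept, retained[cutoff])
```

Here the best cutoff drops the three smallest eigenvalues and keeps the two largest, which together carry 90% of the variance. In this example the retained eigenvalues are exactly those above the mean eigenvalue of 2.0.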
numpy helps us with this: numpy.linalg.eigh returns the eigenvalues in increasing order, with the corresponding eigenvectors as the columns of a matrix in the same order. The following code illustrates the method.
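```python
import numpy
import numpy.linalg


def reduce_data(X, metric=None):
    """Reduces X to the number of dimensions that retains the maximum
    amount of information for the minimum number of components

    Parameters
    ----------
    X : numpy.ndarray
        (n * m) array containing n rows of m-dimensional observations
    metric : function (optional, default = None)
        Similarity metric. Takes an (n * m) array and returns an
        (n * n) array of similarities

    Returns
    -------
    numpy.ndarray
        The data reduced to the optimum number of dimensions
    """
    # Covariance of the variables by default, otherwise the supplied similarity
    similarity = numpy.cov(X, rowvar=False) if metric is None else metric(X)
    # eigh returns eigenvalues in ascending order, eigenvectors as columns
    (eigenvalues, eigenvectors) = numpy.linalg.eigh(similarity)
    # shape is a tuple, so take its first entry to get the component count
    n = eigenvalues.shape[0]
    # Proportion of components discarded if the first i are dropped
    excluded = numpy.arange(n) / n
    # Proportion of variance retained by components i onwards; eigenvalue i
    # is taken back out of the running sum because component i is kept
    explained = 1.0 - ((eigenvalues.cumsum() - eigenvalues) / eigenvalues.sum())
    cutoff = (excluded + explained).argmax()
    # Keep the eigenvectors from the cutoff upwards (the largest eigenvalues)
    basis = eigenvectors[:, cutoff:]
    return X.dot(basis) if metric is None else basis
```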
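The metric argument expects a function that maps an (n * m) array of observations to an (n * n) array of similarities. A minimal sketch of one such function, using cosine similarity, might look like the following. The name cosine_similarity and the random data are assumptions made for illustration, and the function assumes no observation is an all-zero row.

```python
import numpy as np


def cosine_similarity(X):
    """Hypothetical metric: (n * m) observations -> (n * n) cosine similarities."""
    # Scale each row to unit length (assumes no all-zero rows)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms
    # Dot products of unit vectors are the cosines of the angles between them
    return unit @ unit.T


rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))  # made-up data: 8 observations of 3 variables

S = cosine_similarity(X)

# The result is symmetric with ones on the diagonal, so it is a valid
# input for numpy.linalg.eigh
eigenvalues, eigenvectors = np.linalg.eigh(S)
```

A function like this could be passed as reduce_data(X, metric=cosine_similarity). In that case the returned array holds the selected eigenvectors of the similarity matrix, one row per observation, rather than a projection of X.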