
I have a set of matrices which should fall into 3 distinct sets/groups/clusters. They are unlabelled. I wish to do unsupervised clustering with PCA, and I am using MATLAB. At the end I would also like to examine the eigenvectors.

MATLAB has a function called "princomp" which I believe can do this task; is this correct?

When I give "princomp" a matrix, how should the output be interpreted?

For example:

dataTmp=[1 1; 2 2; 1 2; 2 3; 4 6; -1 1; -2 2; -4 3; -5 8]
dataTmp =
 1     1
 2     2
 1     2
 2     3
 4     6
-1     1
-2     2
-4     3
-5     8

princomp(dataTmp)

ans =

    0.9207    0.3902
   -0.3902    0.9207

Or should I be using the function "zscore" to standardise the values first?

princomp(zscore(dataTmp))

ans =

    0.7071    0.7071
   -0.7071    0.7071

How do I interpret the answer? The data I made are simple points in either the first or second quadrant.

asked Sep 06 '11 at 08:02 by VassMan

One Answer:

First of all, PCA does not do clustering. It does dimensionality reduction/feature extraction. It returns the principal directions of your data: orthogonal vectors pointing in the directions of greatest variation. Since your data are two-dimensional, you'll get two vectors, the first corresponding to the main direction and the second to the orthogonal direction (there is only one in 2D). So PCA tells you that (0.9207, -0.3902)^T is the main direction of your data.
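
In MATLAB terms (a minimal sketch, assuming the Statistics Toolbox is available), the three outputs of princomp give you those directions, the data expressed along them, and the variance along each one:

dataTmp = [1 1; 2 2; 1 2; 2 3; 4 6; -1 1; -2 2; -4 3; -5 8];
[coeff, score, latent] = princomp(dataTmp);   % coeff: directions, score: projected data, latent: variances

coeff(:,1)              % the first principal direction, i.e. (0.9207, -0.3902)' for your example
latent ./ sum(latent)   % fraction of the total variance explained by each direction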

As you said you want to find groups in your data, you should try a clustering algorithm such as kmeans, which is included in the MATLAB Statistics Toolbox.
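
For example, a minimal sketch on your toy data, assuming k = 3 clusters and that each row is one observation:

k = 3;
[idx, centers] = kmeans(dataTmp, k, 'Replicates', 5);   % idx(i) is the cluster assigned to row i
gscatter(dataTmp(:,1), dataTmp(:,2), idx)               % plot the 2D points coloured by cluster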

answered Sep 06 '11 at 09:45 by Andreas Mueller

@Andreas Mueller: 1) I have heard that the "eigenfaces" method categorizes/recognizes faces, so I assume this is a form of clustering in a way (right?). 2) Is having orthogonal vectors not a restriction? 3) Do I not need to apply "zscore", which subtracts the mean and divides by the standard deviation, to standardise? 4) In the example data I gave, the first column was the x dimension and the second the y, so in (0.9207, -0.3902)^T the 0.9207 is the x dimension? Thanks

(Sep 07 '11 at 07:40) VassMan

Eigenfaces are features extracted from the faces. A classifier is built on top of these features to classify the faces. So no, PCA is not a clustering method.

For your problem I REALLY recommend using KMeans (as a starting point).

For the other questions: 2) Yes, having orthogonal vectors is somewhat of a restriction. Other dimensionality reduction/feature extraction techniques don't have this restriction.

3) Whether you need to zscore the data depends on what the data means. Do the relative scales matter? For example, if the first dimension is between 0 and 1 and the second is between 0 and 100, does that mean that in the first dimension everything is "close"? Or are 0 and 1 in the first dimension as far apart as 0 and 100 in the second? If the relative scales don't matter, you should do zscoring; otherwise don't. (The same goes for all other methods, for example KMeans.) See the sketch after this list.

4) Yes.
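
As a quick sketch of what zscore does, using made-up data with very different column scales:

X  = [(0:0.1:1)', (0:10:100)'];   % hypothetical data: column 1 spans [0,1], column 2 spans [0,100]
Xz = zscore(X);                   % subtract each column's mean and divide by its standard deviation
mean(Xz)                          % approximately [0 0]
std(Xz)                           % [1 1], so both dimensions now carry comparable weight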

To summarize: if you want clusters, use a clustering algorithm such as KMeans (other popular choices being mean shift and spectral clustering).

(Sep 07 '11 at 09:26) Andreas Mueller

@Andreas Mueller: this link http://www.vision.jhu.edu/gpca/CVPR07-Tutorial-GPCA-Algebra.pdf speaks about PCA and clustering.

(Sep 07 '11 at 09:30) VassMan

It talks about Generalized PCA which apparently is a SUBSPACE clustering technique. I would suggest you look at the wikipedia pages on PCA, KMeans, clustering and dimensionality reduction that I linked to.

You should start with the basics before you dive into more involved techniques.

(Sep 07 '11 at 09:38) Andreas Mueller

Do you want to do subspace clustering or just clustering?

(Sep 07 '11 at 09:39) Andreas Mueller

@Andreas Mueller: thank you for your answers. I want to do clustering. I tried using the principal components to do some clustering, and it worked. I used only the first eigenvector, which accounted for 90% of the variance. All I did was generate 20 random numbers, add 10 to half of them and subtract 10 from the other half, then project the data onto this first eigenvector; the projection partitions the data by whether the resulting scalar value is positive or negative. That seems similar to kmeans in a way: where kmeans has the cluster centers, PCA gives the vector direction between these clusters... right?
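
Roughly something like this (a sketch; I'm assuming the numbers are 2D points):

data = randn(20, 2);                      % 20 random points
data(1:10,  :) = data(1:10,  :) + 10;     % shift half of them one way
data(11:20, :) = data(11:20, :) - 10;     % shift the other half the opposite way
[coeff, score] = princomp(data);          % score(:,1) is the projection onto the first eigenvector
labels = score(:,1) > 0;                  % the sign of the projection splits the data in two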

(Sep 07 '11 at 10:38) VassMan

Well, PCA gives you the direction of greatest variance. If the clusters separate well along that direction, this works. But that is a strong assumption.

Imagine you have three clusters in 2D. If they don't lie on a line, you cannot separate them with PCA.

PCA is, as you said, a preprocessing step. You reduced the number of dimensions to one and then clustered with a simple threshold.

Usually there is no such nice projection that separates the clusters - and even if there is, it does not need to be the principal direction of the data.
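
To illustrate the three-cluster case, a rough sketch with made-up data, where the cluster means do not lie on a line and the first principal direction collapses two clusters onto each other:

n  = 50;
c1 = randn(n,2) + repmat([ 0 -4], n, 1);   % cluster below the x-axis
c2 = randn(n,2) + repmat([ 0  4], n, 1);   % cluster above the x-axis
c3 = randn(n,2) + repmat([12  0], n, 1);   % cluster far off to the right
X  = [c1; c2; c3];
[coeff, score] = princomp(X);
hist(score(:,1), 30)   % c1 and c2 land on top of each other along the first direction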

(Sep 07 '11 at 11:13) Andreas Mueller

PCA can even fail for two clusters in 2D: imagine two "parallel" elongated ellipses as clusters. If you don't zscore, the eigenvector with the SMALLEST eigenvalue will separate them, not the one with the biggest. If you zscore them, the eigenvalues will be arbitrary.
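
A rough sketch of that parallel-ellipses case with made-up data:

n  = 100;
c1 = [10*randn(n,1), 0.5*randn(n,1) + 2];   % long, thin cluster above the x-axis
c2 = [10*randn(n,1), 0.5*randn(n,1) - 2];   % long, thin cluster below the x-axis
X  = [c1; c2];
[coeff, score, latent] = princomp(X);
hist(score(:,1), 30)   % direction with the LARGEST eigenvalue: the clusters overlap completely
figure
hist(score(:,2), 30)   % direction with the SMALLEST eigenvalue: two clear modes appear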

And if your clusters look like this: http://www.ml.uni-saarland.de/code/pSpectralClustering/images/eigenvector11b.png

you're completely screwed ;)

(Sep 07 '11 at 11:18) Andreas Mueller

@Andreas Mueller: thanks so much for this! I did the clustering with kmeans and got nice results :) My data is very high dimensional: 112^2. What can I do with the principal components (eigenvectors)? They are very long, and I believe they show the directions of variance, which may be good insight for such high-dimensional data, right? Since you said initially that they are for dimensionality reduction, how do I proceed from here? Thanks

(Sep 07 '11 at 12:33) VassMan

You're welcome :)

Yes, PCA definitely gives you insight into your data ;) If you have some way of visualizing your data points (maybe they are images or something?), then you can visualize the eigenvectors, since they have the same shape as your input. For dimensionality reduction you just take the first N eigenvectors and project onto them. Sometimes it's helpful to project down to 2 or 3 dimensions so you can look at your data as points in the plane (or in space), though that often only works if the structure is very simple. Take a look at this page, it explains the details: http://www.cs.ait.ac.th/~mdailey/matlab/
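
A minimal sketch of both ideas, assuming a (hypothetical) matrix X whose rows are your flattened 112x112 observations:

[coeff, score, latent] = princomp(X);

N = 2;                                  % keep the first N principal directions
Xlow = score(:, 1:N);                   % each observation becomes an N-dimensional point
scatter(Xlow(:,1), Xlow(:,2))           % look at the data as points in the plane

imagesc(reshape(coeff(:,1), 112, 112))  % an eigenvector reshaped to the shape of one input matrix
colorbar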

(Sep 08 '11 at 05:55) Andreas Mueller

@Andreas Mueller: for some reason that link does not seem to work :/ These matrices are network connectivity matrices, not images. 1) Do especially large or small values have any significance if I plot them? 2) If I were to select the top largest values, do they say anything about the data? I keep the first 10 eigenvectors since they hold 70% of the variance. 3) Projecting means multiplying one of the data matrices by an eigenvector, which just gives a scalar value, but what use is that? What does it tell me? Again, thanks a ton!

(Sep 08 '11 at 07:30) VassMan