|
Probabilistic canonical correlation analysis (CCA) assumes a shared latent variable Z for two datasets X and Y, and the dimensionality of Z (say K) is assumed to be no larger than the minimum of the dimensionalities of X and Y; see Theorem 2 in Bach and Jordan's tech report on CCA. CCA is also often used for supervised dimensionality reduction with data X (N x D) and labels Y (N x M), where N is the number of examples and M is the number of labels per example (most such settings assume multilabel data, so Y is a matrix rather than a vector). Now, if CCA assumes K <= min{D,M} and we further assume that M < D, then CCA can never find a Z with K > M. This seems very restrictive: we can't apply CCA reliably for supervised dimensionality reduction when M is small, say M = 1 or 2, because X can never be projected to a subspace of dimensionality greater than M, which means throwing away a lot of the information in X. However, is there anything in the probabilistic setting that stops us from assuming a K larger than min{D,M}? What would be the consequence of choosing such a K? Would the model still be well-defined? Note that non-probabilistic CCA can't have K > min{D,M}, because the eigenvalue problem it solves can never yield more than min{D,M} non-zero eigenvalues. So for non-probabilistic CCA we are indeed constrained by K <= min{D,M}. Why is this condition needed for the probabilistic interpretation? Is it required at all if we have suitable priors in our model and do appropriate model selection as part of inference (assuming a Bayesian setting)? |
|
This paper and this similar paper show how you can automate model selection for Bayesian CCA by using an Indian buffet process prior (a probability distribution over feature matrices with possibly infinitely many columns, only finitely many of which are non-zero; the connection with CCA is easy to see). Choosing a K larger than min(D,M) never makes much sense, since CCA (even probabilistic CCA) is about linear combinations of the variables, and a linear model never needs more dimensions than the data already has.

My question was whether we really need to make the assumption K <= min(D,M) in the probabilistic model (the nonparametric Bayesian approach - for example the papers you mentioned - doesn't fix K a priori, so it isn't an issue there). But if this assumption on K is made, then isn't non-probabilistic CCA (or even parametric probabilistic CCA) a bad model for supervised dimensionality reduction, because for small M we are forced to project the data X to at most M dimensions?
(Jul 27 '10 at 22:19)
spinxl39
My point is that, since CCA is linear, you don't gain anything from projecting into more than M dimensions. That is, by the properties of linear projections, any correlation you observe in a projection to more than M dimensions can also be observed in a projection to at most M dimensions. So, yes, if M is small CCA is a bad method for dimensionality reduction, but so is any other linear method (my other point is that there is no such thing as linear dimensionality augmentation, and projecting into more than M dimensions amounts to that).
(Jul 27 '10 at 22:24)
Alexandre Passos ♦
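The rank argument in the comments above can be checked numerically. The following is only a minimal sketch on synthetic data, not code from the papers mentioned: the helper name `canonical_correlations`, the small ridge term `reg`, and the chosen sizes are all assumptions made for illustration. Data is generated from a linear latent model with K = 5 > M = 2, yet the cross-covariance between X and Y is a D x M matrix, so at most min(D,M) canonical correlations can exist.

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-8):
    """Canonical correlations of X (N x D) and Y (N x M).

    Illustrative helper: whitens both views with Cholesky factors and
    takes the singular values of the whitened cross-covariance.
    `reg` is a small ridge term added only for numerical stability.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)                # D x M cross-covariance
    Lx = np.linalg.cholesky(Sxx)             # Sxx = Lx Lx^T
    Ly = np.linalg.cholesky(Syy)             # Syy = Ly Ly^T
    T = np.linalg.solve(Lx, Sxy)             # Lx^{-1} Sxy
    T = np.linalg.solve(Ly, T.T).T           # Lx^{-1} Sxy Ly^{-T}
    # A D x M matrix has exactly min(D, M) singular values.
    return np.linalg.svd(T, compute_uv=False)

# Synthetic data from a linear latent model with K > M (sizes are arbitrary).
rng = np.random.default_rng(0)
N, D, M, K = 500, 10, 2, 5
Z = rng.standard_normal((N, K))              # shared latent variable
W1 = rng.standard_normal((D, K))             # loadings for X
W2 = rng.standard_normal((M, K))             # loadings for Y
X = Z @ W1.T + 0.5 * rng.standard_normal((N, D))
Y = Z @ W2.T + 0.5 * rng.standard_normal((N, M))

rho = canonical_correlations(X, Y)
# Model cross-covariance W1 W2^T is D x M, so its rank is capped by min(D, M)
# no matter how large K is.
cross_rank = np.linalg.matrix_rank(W1 @ W2.T)

print("number of canonical correlations:", len(rho))
print("rank of W1 @ W2.T:", cross_rank)
```

Increasing K in this sketch changes neither count: the extra latent dimensions cannot add cross-view correlation beyond what min(D,M) directions already capture, and can only affect the within-view covariance.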
|