
I am trying to work out if I can parallelise the training aspect of a machine learning algorithm. The computationally expensive part of the training involves Cholesky decomposing a positive-definite matrix (covariance matrix). I'll try and frame the question purely in terms of the matrix algebra. Let me know if you need any more info.

Let's say we have a block matrix

M = [ A   B ]
    [ B*  C ]

where A and C relate to training data from two different sets. Both A and C are positive definite.

There is a formula for carrying out a block Cholesky decomposition; see http://en.wikipedia.org/wiki/Block_LU_decomposition. Summarising, we have the following result:

M = LU

where (* indicates transpose, and for the Cholesky case U = L*):

L = [ A^{1/2}        0       ]
    [ B* A^{-*/2}    Q^{1/2} ]

where

Q = C - B* A^{-1} B
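
To make the block factorisation concrete, here is a minimal numpy/scipy sketch (my own toy construction, not part of the question) that builds a random positive-definite M, forms L block-wise as above, and checks that M = LL*:

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((2 * n, 2 * n))
M = X @ X.T + 2 * n * np.eye(2 * n)           # positive definite by construction
A, B, C = M[:n, :n], M[:n, n:], M[n:, n:]

A_half = cholesky(A, lower=True)              # A^{1/2}
V = solve_triangular(A_half, B, lower=True)   # V = A^{-1/2} B, by forward substitution
Q = C - V.T @ V                               # Schur complement C - B* A^{-1} B
Q_half = cholesky(Q, lower=True)              # Q^{1/2}

L = np.block([[A_half, np.zeros((n, n))],
              [V.T,    Q_half          ]])    # V.T = B* A^{-*/2}
assert np.allclose(L @ L.T, M)                # M = L L*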

Now let's say the training related to matrices A and C has already been carried out, so we have already computed the Cholesky decompositions of A and C, giving A^{1/2} and C^{1/2}. (It is therefore straightforward to calculate the inverses A^{-1/2} and C^{-1/2} using forward substitution.)

Rewriting Q in terms of these quantities, we now have:

Q = Q^{1/2} Q^{*/2} = C^{1/2} C^{*/2} - B* A^{-*/2} A^{-1/2} B
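
For concreteness, a small sketch (function and variable names are mine) of forming Q exactly this way: the factors A^{1/2} and C^{1/2} can be produced beforehand by two independent workers, and only triangular solves and matrix products are needed here.

import numpy as np
from scipy.linalg import solve_triangular

def schur_complement_from_factors(A_half, C_half, B):
    """Q = C^{1/2} C^{*/2} - B* A^{-*/2} A^{-1/2} B, using only the
    cached lower-triangular Cholesky factors of A and C."""
    V = solve_triangular(A_half, B, lower=True)   # A^{-1/2} B by forward substitution
    return C_half @ C_half.T - V.T @ V

The only remaining serial step would be the Cholesky decomposition of Q itself, which is exactly what the question below asks about avoiding.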

My question is this: given this setup, is it possible to calculate Q^{1/2} algebraically, without having to apply a Cholesky decomposition to Q? In other words, can I use C^{1/2} to help me in the calculation of Q^{1/2}? If this were possible, it would then be easy to parallelise the training.

Thanks in advance for any replies. Sorry about the matrix typesetting; this is my first post! Is there any sensible way to typeset maths, or matrices in particular?

Matt.

asked Jul 01 '10 at 08:20

Matthew Gretton

edited Jul 02 '10 at 11:01

Matthew, thanks for joining. There is not yet support for Math, but I hope the devs add it to OSQA core (http://jira.osqa.net/browse/OSQA-199)

(Jul 01 '10 at 10:21) Joseph Turian ♦♦

3 Answers:

Thanks to all above who replied to my question. I think I've come to an answer although it is not exactly as I'd hoped.

Removing the machine learning context, my question boiled down to whether knowing C^{1/2} would help in the calculation of Q^{-1/2}. I'll go into more detail below, but to cut to the chase: the answer is yes, but only with respect to stability, not computation (I can't currently prove this to be the case, but I am fairly certain).

To see why the answer is yes with respect to stability, the definition of Q from the original question can be rearranged as follows:

Q = C - B* A^{-1} B = (C^{1/2} + B* A^{-*/2})(C^{1/2} - B* A^{-*/2})*

In both formulations of Q we do not need to invert A directly, but the second may be more stable if Q is of low rank.

Sadly, although I have done a fair amount of research on the subject, it does not appear that C^{1/2} helps computationally in the exact calculation of Q^{-1/2}. The best approach appears to be to calculate Q using C^{1/2} as above, apply Cholesky decomposition to Q to obtain Q^{1/2}, and then use forward substitution to calculate Q^{-1/2}.
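
As a hedged sketch of that pipeline (names are mine; Q^{-1/2} here means the inverse of the lower Cholesky factor of Q):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def q_inv_sqrt(A_half, C_half, B):
    V = solve_triangular(A_half, B, lower=True)   # A^{-1/2} B
    Q = C_half @ C_half.T - V.T @ V               # Q via C^{1/2}, as above
    Q_half = cholesky(Q, lower=True)              # the remaining serial Cholesky step
    n = Q_half.shape[0]
    return solve_triangular(Q_half, np.eye(n), lower=True)   # Q^{-1/2} by forward substitution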

Further Research

One area I did not look into in much detail was whether it is possible to use C^{1/2} to approximate Q^{-1/2}, perhaps via an iterative method that uses C^{1/2} as a starting point. I do not know of any such iterative approximation process, but I'll keep searching. I may even start a new question with that as the focus.
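
For what it's worth, one known iterative scheme for the matrix inverse square root is the coupled Newton-Schulz iteration. It is not mentioned in this thread, is not warm-started from C^{1/2}, and returns the symmetric inverse square root rather than the inverse Cholesky factor, so this is offered only as a pointer to the kind of method meant:

import numpy as np

def newton_schulz_inv_sqrt(A, iters=30):
    """Approximate the symmetric inverse square root of an SPD matrix A.
    The iteration converges when the eigenvalues of A/c lie in (0, 2),
    so we rescale by the Frobenius norm first."""
    n = A.shape[0]
    c = np.linalg.norm(A)          # Frobenius norm; puts eigenvalues of A/c in (0, 1]
    Y = A / c                      # Y_k -> (A/c)^{1/2}
    Z = np.eye(n)                  # Z_k -> (A/c)^{-1/2}
    I = np.eye(n)
    for _ in range(iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    return Z / np.sqrt(c)          # undo the rescaling: A^{-1/2} = (A/c)^{-1/2} / sqrt(c)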

Again, thanks to all who helped me getting to this answer. I'll update you all if I have any major breakthroughs.


answered Jul 07 '10 at 14:14

Matthew Gretton

edited Jul 08 '10 at 21:02

It's been pointed out to me that the instability associated with inverting A directly is avoided in both formulations of Q. Having said that, there may be some stability gains if Q is low rank. So there is no real harm in the reformulation of Q.

(Jul 07 '10 at 20:38) Matthew Gretton

Most decompositions can be implemented very efficiently in parallel using random projections. Moreover, most iterative Krylov-subspace methods can be reduced to multiple random projections using block-wise iterations.

LU decompositions are generally quite difficult to parallelize because of their inherently sequential nature.

See http://arxiv.org/abs/0909.4061 for more details
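
For a flavour of the paper's approach (Halko, Martinsson & Tropp), here is a minimal sketch of its randomised range finder; names are mine, and the dominant cost, one big matrix product, is embarrassingly parallel:

import numpy as np

def randomized_range_finder(M, k, oversample=10, seed=None):
    """Return an orthonormal basis Q with M ~= Q @ (Q.T @ M), following
    the randomised scheme in arXiv:0909.4061."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((M.shape[1], k + oversample))  # random projection
    Y = M @ Omega                  # the parallel-friendly step: one big product
    Q, _ = np.linalg.qr(Y)         # orthonormalise the sampled range
    return Q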

answered Sep 24 '10 at 01:10

Ted Dunning

Thanks! The linked article looks really interesting. Exactly the kind of thing I was looking for. I'll have a proper read over the weekend. Thanks again.

Matt.

(Sep 24 '10 at 06:25) Matthew Gretton

Hi Matthew,

I don't know whether this fits your requirements, but you may have a look at

http://www.cse.illinois.edu/courses/cs554/notes/07_cholesky.pdf

and

http://www.springerlink.com/content/k1w62523x277h6x8/

Best regards, Z.

answered Jul 01 '10 at 10:58

zeno

edited Jul 01 '10 at 11:18

Thanks for the links. The first one is a really nice explanation of how the standard parallel implementation of Cholesky decomposition works. Unfortunately it doesn't go about the decomposition quite in the way I require. Once I've exhausted seeing whether I can do it the way I want, it would definitely be the next method I'd try, so thanks! I'm not able to view the other one as I don't have a university account. I know plenty of other people who will be able to, however. I'll report back when I've had a look.

(Jul 01 '10 at 11:09) Matthew Gretton

I've had a chance to look at both now, and while they both offer general methods for parallelising Cholesky decomposition, they are still not quite what I need. To be honest, that may be because the answer to my initial question is not the one I want... I'll keep searching... Thanks again.

(Jul 05 '10 at 18:00) Matthew Gretton
