Implementing a machine learning algorithm, I need to calculate a particular matrix product M = D Zt Z D (Zt being Z transpose; Z is sparse, D is a diagonal matrix). Multiplying everything out gives M_ij = D_ii * D_jj * SUM_k(Z_ki * Z_kj). I don't actually care about D or Z themselves; I just need M.
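
For concreteness, here's that identity checked numerically (NumPy is just for illustration; I'm not committed to any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.random((100, 10))        # rows = samples, columns match D
d = rng.random(10)               # the diagonal of D

# Full product: M = D Zt Z D
M_full = np.diag(d) @ Z.T @ Z @ np.diag(d)

# The multiplied-out form: M_ij = d_i * d_j * SUM_k Z_ki * Z_kj
M_elem = d[:, None] * (Z.T @ Z) * d[None, :]

assert np.allclose(M_full, M_elem)
```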

Now, I know the conventional wisdom is that you should never roll your own matrix multiplication. But I'm wondering about cases like this, where I can avoid creating the intermediate matrices altogether. I calculate the columns of Z one at a time: rather than building that matrix, I can sum the nonzero results directly into M. It seems like that should be faster than building all of Z, transposing it, and then doing a full multiplication. But I also realize my memory access pattern in M will be basically random.

I realize the best answer is to do it both ways and profile, but I was wondering whether other people have run into this same question, and which approach you think, or have measured, to be faster. Are there properties of the matrices that make this a better or worse idea? (In this case, Z being sparse makes it seem cheaper to just use the nonzero rows directly, but that's more intuition than anything.)

asked Jul 27 '11 at 09:35

Paul Barba

Out of curiosity, efficiency aside: how do you construct M by processing Z "one column at a time"?

(Jul 30 '11 at 20:45) Radim

What I meant was that in this particular Z, the entry Z(i,j) is defined as the normalized distance between point i and cluster j if j is one of the n closest clusters to i, and zero otherwise. I have a separate algorithm that (relatively) quickly returns the n closest clusters and their distances.

So I can either iterate over i, filling in each column of Z, then do the transposes and multiplications. Or, if a and b are two of the clusters close to point i, I can do M(a,b) += Z(i,a) * Z(i,b) and M(b,a) += Z(i,a) * Z(i,b). Then for every entry M(a,b) I multiply by D(a,a) and D(b,b).
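
Roughly, as a sketch in Python (the names, and the closest_clusters helper standing in for my separate nearest-clusters routine, are just for illustration):

```python
import numpy as np

def build_M(num_points, num_clusters, d, closest_clusters):
    """Accumulate M = D Zt Z D without ever materializing Z.

    d                   -- the diagonal of D, length num_clusters
    closest_clusters(i) -- yields (j, Z(i, j)) for the n clusters
                           nearest to point i (everything else is zero)
    """
    M = np.zeros((num_clusters, num_clusters))
    for i in range(num_points):
        near = list(closest_clusters(i))   # the sparse row for point i
        for a, za in near:
            for b, zb in near:             # covers (a,b), (b,a), and a == b
                M[a, b] += za * zb
    return d[:, None] * M * d[None, :]     # final scaling by D on both sides
```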

By the way, I measured and the matrix library was indeed faster.

(Aug 03 '11 at 09:54) Paul Barba

Matrix multiplication is memory-bound (not CPU-bound), so if your measurements show faster execution with a generic Z.T * Z that doesn't take the special matrix structure into account, you're probably measuring the wrong thing :) I'd suggest the SYRK routine of BLAS, which computes Z.T * Z directly (as a symmetric product, without explicitly transposing). If you can access Z by rows (not columns!) as you write you can, you can also operate point by point by calling SYR (rank-1 update) repeatedly, simply summing the intermediate results. This makes sense if your Z is too large to fit in memory; otherwise SYRK is just simpler. By the way, sparse storage implies trickier memory access patterns, so definitely profile to see whether it's worth it (usually it isn't, and dense is faster).
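
As a rough sketch of both routes through SciPy's BLAS wrappers (assuming a dense float64 Z; adapt to your actual storage):

```python
import numpy as np
from scipy.linalg.blas import dsyrk, dsyr

rng = np.random.default_rng(0)
Z = rng.random((1000, 50))              # rows = points, columns = clusters

# All at once: C = Z.T * Z via SYRK (writes only one triangle)
C = dsyrk(alpha=1.0, a=Z, trans=1)

# Point by point: repeated rank-1 updates via SYR, summing as we go
A = np.zeros((Z.shape[1], Z.shape[1]))
for row in Z:
    A = dsyr(1.0, row, a=A)

# Both leave the strict lower triangle untouched; symmetrize to compare
sym = lambda T: np.triu(T) + np.triu(T, 1).T
assert np.allclose(sym(C), Z.T @ Z)
assert np.allclose(sym(A), Z.T @ Z)
```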

(Aug 04 '11 at 10:27) Radim

@Radim, can you submit that as an answer? I think it is better than the other ones in addressing the actual problem.

(Aug 04 '11 at 10:29) Alexandre Passos ♦

3 Answers:

Another option is to use a library/system/language that can optimize away the creation of intermediate data structures, i.e., you write a natural-looking matrix expression and the compiler performs the transformation you did by hand, eliminating anything unnecessary. There are a number of matrix libraries in C++ that use template metaprogramming to do this: Blitz++ and POOMA, to name just two, but there are many now. Todd Veldhuizen invented the expression templates technique for this purpose. The catch is that if your expressions are very large, your compiler might choke. Some matrix-oriented systems like MATLAB and NumPy, or computer algebra systems like Mathematica, Maple, and Axiom, may also do some of this, but I am not an expert on their implementations. Numerical libraries for functional languages could also be doing some of this.
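
NumPy won't fuse a whole expression the way expression templates do, but as a rough sketch of the same idea (one natural expression, no dense D ever built), np.einsum can be pointed at the multiplied-out form directly:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.random((1000, 50))
d = rng.random(50)                       # the diagonal of D

# One expression for M = D Zt Z D; broadcasting avoids constructing D
# as a dense matrix, and einsum chooses the contraction order itself.
M = np.einsum('ki,kj,i,j->ij', Z, Z, d, d, optimize=True)

assert np.allclose(M, np.diag(d) @ Z.T @ Z @ np.diag(d))
```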

answered Jul 27 '11 at 19:10

Daniel Mahler

edited Jul 27 '11 at 19:16

Memory is cheap. If the time it takes to actually compute D and Z is low, I think you'd be better off constructing them and then checking your BLAS library for an optimized routine that handles this specific kind of product.
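
For instance, a sketch assuming SciPy/NumPy (the answer doesn't name a library): construct Z once as a sparse matrix, let the library form Z.T * Z, and apply D by broadcasting rather than building it:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
Z = sp.random(1000, 50, density=0.05, format='csr', random_state=0)
d = rng.random(50)                       # the diagonal of D

G = (Z.T @ Z).toarray()                  # sparse-aware multiply, then densify
M = d[:, None] * G * d[None, :]          # apply D on both sides by broadcasting
```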

answered Jul 27 '11 at 18:51

Brian Vandenberg

edited Jul 27 '11 at 18:51

Definitely profile, and preferably with matrices as big as the ones you're going to use in real life, as there might be caching and memory-size issues that significantly affect the performance of your algorithm. You can usually trade memory for speed, or memory/instructions for cache-friendliness, and these things end up mattering a lot.
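
A minimal sketch of that kind of measurement with timeit (the shapes here are placeholders; substitute your real ones):

```python
import timeit

setup = """
import numpy as np
rng = np.random.default_rng(0)
Z = rng.random((20000, 200))    # substitute the shapes you'll really use
d = rng.random(200)
"""

candidates = {
    'full product': 'np.diag(d) @ Z.T @ Z @ np.diag(d)',
    'broadcast D ': 'd[:, None] * (Z.T @ Z) * d[None, :]',
}
for label, stmt in candidates.items():
    best = min(timeit.repeat(stmt, setup, number=5, repeat=3)) / 5
    print(f'{label}: {best:.4f} s per call')
```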

answered Jul 27 '11 at 17:17

Alexandre Passos ♦
