
GraphLab (Low et al., 2010) is an abstraction for parallel programming. It is higher-level than MPI and Pthreads, but lower-level and more expressive than MapReduce.

In what circumstances is MapReduce preferable? In what circumstances is GraphLab preferable?

asked Jul 02 '10 at 02:42

Joseph Turian ♦♦


3 Answers:

MapReduce is good for single-iteration and embarrassingly parallel distributed tasks like feature processing, while GraphLab is good for iterative algorithms with computational dependencies or complex asynchronous schedules. For instance, we find that GraphLab is highly suitable for Gibbs sampling, EM-style algorithms, and some classes of optimization algorithms. Programs which fit well on a systolic abstraction (such as PageRank on Pregel) will also work well with GraphLab. There are probably many more algorithms that fit well in the GraphLab abstraction, and we are still exploring its capabilities and implications (and whether further extensions will be needed).
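
To make the contrast concrete, here is a minimal sketch of the vertex-update pattern described above, in plain C++ rather than the actual GraphLab API (`Vertex`, `pagerank_update`, and the queue-based scheduler are illustrative names, not GraphLab types): each update gathers neighbor state and reschedules only the neighbors whose inputs changed, a pattern that is awkward to express as a chain of MapReduce jobs.

```cpp
#include <cmath>
#include <cstddef>
#include <queue>
#include <vector>

struct Vertex {
  double rank = 1.0;
  std::vector<std::size_t> in_nbrs, out_nbrs;
};

// One GraphLab-style update: gather in-neighbor state, recompute the
// local rank, and reschedule out-neighbors only if the value changed.
void pagerank_update(std::size_t v, std::vector<Vertex>& g,
                     std::queue<std::size_t>& scheduler) {
  double sum = 0.0;
  for (std::size_t u : g[v].in_nbrs)
    if (!g[u].out_nbrs.empty())
      sum += g[u].rank / g[u].out_nbrs.size();
  double new_rank = 0.15 + 0.85 * sum;  // toy unnormalized PageRank step
  if (std::fabs(new_rank - g[v].rank) > 1e-5)
    for (std::size_t u : g[v].out_nbrs)
      scheduler.push(u);  // dynamic scheduling: touch only affected vertices
  g[v].rank = new_rank;
}

int main() {
  // Tiny 3-vertex cycle: 0 -> 1 -> 2 -> 0.
  std::vector<Vertex> g(3);
  for (std::size_t v = 0; v < 3; ++v) {
    g[v].out_nbrs = {(v + 1) % 3};
    g[(v + 1) % 3].in_nbrs = {v};
  }
  std::queue<std::size_t> scheduler;
  for (std::size_t v = 0; v < 3; ++v) scheduler.push(v);
  // Sequential stand-in for a parallel scheduler: run until quiescent.
  while (!scheduler.empty()) {
    std::size_t v = scheduler.front();
    scheduler.pop();
    pagerank_update(v, g, scheduler);
  }
}
```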

We are in the process of releasing our (shared-memory) multi-core implementation of GraphLab and are designing a (distributed-memory) cluster implementation. Currently we do not provide fault tolerance, so if fault tolerance is a critical requirement, we recommend the use of Hadoop/MapReduce. The GraphLab abstraction does not preclude fault tolerance, though due to the computational and state dependencies, distributed fault tolerance is somewhat more difficult and is ongoing research.

The current implementation is a C++ API which can easily work with other tools like Hadoop and SQL (we use some of these for logging already) and does not require any additional language support. It is important to keep in mind that while we provide a reference implementation, we are describing an abstraction that could be integrated into other software tools like Hadoop.

We have posted the initial public beta release of the GraphLab API at http://www.graphlab.ml.cmu.edu and welcome any comments or suggestions. The GraphLab webpage provides tutorials and detailed descriptions of the GraphLab API features. We are in the process of migrating to a Google Code project to give contributors access to the API and the code as it evolves.

answered Jul 03 '10 at 16:22

Joseph Gonzalez

edited Aug 02 '10 at 12:13


I don't see fault tolerance as an issue with small-scale clusters. Where it really becomes important is with >100 nodes participating in a computation.

Is GraphLab amenable to an abstraction layer like FlumeJava? That would substantially ease the introduction of a new computational paradigm.

(Jul 03 '10 at 16:45) Ted Dunning

What are some algorithms that GraphLab cannot parallelize? In particular, can it parallelize matrix multiplication?

(Jul 03 '10 at 19:34) Joseph Turian ♦♦

@joseph Matrix multiplication can easily be parallelized with MapReduce (Hadoop); in fact, Mahout and Hama both have code for matrix multiplication.

(Jul 03 '10 at 19:44) DirectedGraph

Agreed that fault tolerance is not an issue with small clusters. However, we do hope GraphLab will eventually scale to web-scale problems.

We have not thought much about higher-level abstractions that might fit on top of GraphLab. A higher-level pipeline-based abstraction might be worth thinking about, though an interesting consideration is that GraphLab does not fit a data-flow pipeline naturally. Unlike MapReduce/Dryad, which are data-flow abstractions (data moves to computation), GraphLab prefers to think the other way round (computation moves to data). One could of course design a data-flow pipeline that simply uses GraphLab as one of its operators, but it would be interesting to see whether it can be integrated at a more fundamental level.

Interestingly, matrix multiplication was one of the first applications of systolic arrays (http://en.wikipedia.org/wiki/Systolic_array), an old architectural idea that appears to have fallen out of popularity. It is "probably" possible to parallelize matrix multiplication the same way using GraphLab, though you may want to increase the size of the vertex operations, for instance by feeding in submatrices instead of single values. (Disclaimer: I have not yet thought this through carefully.)
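
As a rough illustration of the submatrix idea, here is a blocked multiply in plain sequential C++ (the mapping of blocks onto GraphLab vertices is left as the speculation above): each tile pair becomes one coarse-grained task instead of a scalar operation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // row-major n x n matrix

// Each (ii, kk, jj) iteration is one coarse "block task" that multiplies
// a B x B submatrix of A against a B x B submatrix of X.
// C must be zero-initialized by the caller.
void blocked_matmul(const Mat& A, const Mat& X, Mat& C,
                    std::size_t n, std::size_t B) {
  for (std::size_t ii = 0; ii < n; ii += B)
    for (std::size_t kk = 0; kk < n; kk += B)
      for (std::size_t jj = 0; jj < n; jj += B)
        for (std::size_t i = ii; i < std::min(ii + B, n); ++i)
          for (std::size_t k = kk; k < std::min(kk + B, n); ++k)
            for (std::size_t j = jj; j < std::min(jj + B, n); ++j)
              C[i * n + j] += A[i * n + k] * X[k * n + j];
}
```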

Some algorithms which may not work well with GraphLab are algorithms with what we call "dense dependencies", for instance the naive dual SVM (computation of the dense kernel matrix). They can be represented and evaluated in GraphLab, but they may not operate efficiently. These methods don't generally scale well anyway.

Algorithms with "dynamic-dependencies" are also an issue (for instance Delaunay triangulation, etc.). These may require dynamic graphs and cannot be represented properly in GraphLab (yet).

Basically, GraphLab works well when you have "static sparse dependencies": you know beforehand exactly what the data dependencies of every operation in your algorithm are, and the data dependencies of each operation are relatively small. There might be ways to "coerce" other types of problems into GraphLab, but that would be problem-specific.

(Jul 03 '10 at 20:15) Yucheng Low

If you are interested in matrix-vector multiplication, the easiest way to represent it in GraphLab is to treat the matrix as an adjacency matrix and construct a graph from it. A matrix-vector product can then be computed simply by executing an update function on each vertex once. Iterated matrix-vector products can be done the same way very efficiently.
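
A minimal sketch of that encoding (hypothetical `Edge`/`VertexData` types for illustration, not the real GraphLab API): each nonzero A[i][j] becomes an in-edge j -> i with weight A[i][j], x[j] lives on vertex j, and one sweep of the update function over all vertices produces y = A*x.

```cpp
#include <cstddef>
#include <vector>

// In-edge j -> i carrying the nonzero entry A[i][j].
struct Edge { std::size_t src; double weight; };

struct VertexData {
  double x = 0.0;                // input vector component on this vertex
  double y = 0.0;                // output vector component
  std::vector<Edge> in_edges;    // sparse row i of A
};

// Running this once on every vertex computes y = A x:
// y[i] = sum over nonzero A[i][j] of A[i][j] * x[j].
void matvec_update(VertexData& v, const std::vector<VertexData>& g) {
  v.y = 0.0;
  for (const Edge& e : v.in_edges)
    v.y += e.weight * g[e.src].x;
}
```

Iterating the product then just means copying each y back into x and running the sweep again.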

Hadoop is not the best platform for matrix-vector multiplication: since you perform no more than one multiplication and one addition per matrix element, a significant amount of time is wasted on IO.

(Jul 03 '10 at 20:37) Yucheng Low

Actually, the first thing we implemented in GraphLab was an iterative eigenvector solver, which repeatedly computes A*x to find the first eigenvector. It is surprisingly easy to implement the algorithm in GraphLab, and we spent some time playing with the role of dynamic scheduling and asynchronous computation. While dynamic scheduling could in theory improve performance, we did not obtain conclusive experimental results on the toy matrices (A) we considered.
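
For reference, the underlying algorithm is ordinary power iteration; a self-contained sequential sketch (not our GraphLab implementation) looks like this:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Repeatedly computes x <- A x / ||A x|| until the iterate stops moving;
// for a matrix with a dominant eigenvalue this converges to the first
// eigenvector (sign oscillation for negative eigenvalues is ignored here).
std::vector<double> power_iteration(const std::vector<std::vector<double>>& A,
                                    int max_iters = 1000, double tol = 1e-9) {
  const std::size_t n = A.size();
  std::vector<double> x(n, 1.0), y(n);
  for (int it = 0; it < max_iters; ++it) {
    for (std::size_t i = 0; i < n; ++i) {  // y = A x
      y[i] = 0.0;
      for (std::size_t j = 0; j < n; ++j) y[i] += A[i][j] * x[j];
    }
    double norm = 0.0;
    for (double v : y) norm += v * v;
    norm = std::sqrt(norm);
    double diff = 0.0;
    for (std::size_t i = 0; i < n; ++i) {  // normalize and measure movement
      y[i] /= norm;
      diff += std::fabs(y[i] - x[i]);
    }
    x = y;
    if (diff < tol) break;
  }
  return x;
}
```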

One of the challenges with implementing basic linear algebra functions is that they quickly become memory-bandwidth bound, limiting the degree of parallelism. I am a little skeptical of general-purpose parallel frameworks that claim to achieve high-performance linear algebra. The highly tuned parallel linear algebra libraries developed by the scientific computing community heavily optimize memory access patterns for individual architectures and cache models and are surprisingly difficult to beat.

(Jul 03 '10 at 22:14) Joseph Gonzalez

At a pretty obvious level, GraphLab is very new and appears not to have been used at very large scale yet. Both of those factors are going to go against it on large problems. A corollary of its newness will be poor integration with other tools such as databases and Hadoop, which is where most of the data that feeds ML problems in practice normally comes from.

On the other hand, machine learning is littered with problems that don't scale well beyond a single machine anyway, and if GraphLab is able to move toward providing an effective substrate for GPU-enabled machines, the scaling question may be moot.

answered Jul 03 '10 at 13:49

Ted Dunning

GraphLab could be significantly useful on the cloud-on-a-chip machines Intel is planning. My 2c.

(Jul 03 '10 at 19:45) DirectedGraph

We are actively looking at implementing the GraphLab abstraction on GPUs as well as the new Intel single-chip cloud computers. While the GraphLab abstraction was intended to be hardware-agnostic, the reference implementation described in the Low et al., 2010 UAI paper focuses on the smaller-scale multi-core setting. That said, there are a lot of problems that fit in a relatively large multi-core system and can benefit from parallel performance gains. As far as the cluster setting goes, all we can say is "stay tuned" :-).

answered Jul 03 '10 at 21:43

Joseph Gonzalez
