Has there been any research on combining a graph database such as Pregel with deep learning? For example, take a graph of encoders

where

encoder + decoder = autoencoder (contractive/denoising/sparse), or encoder = ICA, or encoder = sparse PCA, or encoder = non-negative matrix factorization

The encoder graph is directed and acyclic

The combined output of two or more encoders must match the input of the one they plug into. In the simplest case, input = 2*hidden, so 2 encoders plug into another encoder.
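To make the wiring concrete, here is a toy sketch of what I mean in plain numpy (random linear maps standing in for whatever trained encoders would actually sit at each node):

    import numpy as np

    rng = np.random.RandomState(0)

    def make_encoder(n_in, n_hidden):
        # stand-in for any trained encoder (autoencoder/ICA/sparse PCA/NMF):
        # here just a random linear map followed by a nonlinearity
        W = rng.randn(n_hidden, n_in) / np.sqrt(n_in)
        return lambda x: np.tanh(W.dot(x))

    n_in, n_hidden = 100, 50          # input = 2 * hidden

    # two leaf encoders, each seeing a different part/view of the data
    enc_a = make_encoder(n_in, n_hidden)
    enc_b = make_encoder(n_in, n_hidden)

    # their outputs are concatenated to match the parent encoder's input size
    enc_parent = make_encoder(2 * n_hidden, n_hidden)

    x_a = rng.randn(n_in)             # e.g. one image patch
    x_b = rng.randn(n_in)             # a neighbouring patch
    code = enc_parent(np.concatenate([enc_a(x_a), enc_b(x_b)]))
    print(code.shape)                 # (50,)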

Pregel is a graph database (there are open-source versions too) which can handle trillion-plus connections. As far as I can tell, it is almost perfectly suited to the encoder-graph formulation of unsupervised deep learning, and implementing it should probably take only a few lines of code.

Are there any obvious flaws in the above, and has it already been tried?

asked Aug 18 '12 at 03:56

marshallp

edited Aug 20 '12 at 02:09

Joseph Turian ♦♦

So this doesn't really answer your question, but I googled "graphlab neural networks" to see if there was any discussion of whether neural networks can be efficiently trained using GraphLab (another graph-level parallel abstraction).

Before I got a chance to review the GraphLab papers, I saw that slides from Quoc Le's talk "Tera-Scale Deep Learning" were hosted on a GraphLab server. The slides don't mention GraphLab, but the talk was presented at the GraphLab workshop (http://graphlab.org/workshop2012/agenda/), as well as the BigLearn workshop (http://biglearn.org/index.php/Papers).

(Aug 20 '12 at 02:20) Joseph Turian ♦♦

They have what they're calling asynchronous parallel SGD. My point is to remove that SGD completely.

Basically, "stitch" together a huge bunch (billions/trillions) of dimensionality reducers, each trained on a small subset of the data. Maybe use the technique from this paper on superfast deep learning (http://arxiv.org/abs/1105.0972) so that each one is trained quickly.
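Roughly like this (a toy sketch; sklearn's PCA is just a stand-in for whatever fast per-node trainer would actually be used, not the method from that paper):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    X = rng.randn(100000, 64)            # pretend dataset

    # train lots of small dimensionality reducers, each on a tiny random subset
    reducers = []
    for _ in range(1000):                # in the real thing: billions, one per graph node
        subset = X[rng.choice(len(X), size=500, replace=False)]
        reducers.append(PCA(n_components=8).fit(subset))

    # "stitching": a downstream node consumes the concatenated codes of a few parents
    codes = np.hstack([r.transform(X[:10]) for r in reducers[:4]])
    print(codes.shape)                   # (10, 32)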

Use Pregel to automatically get all the work it does in running a distributed graph. I brought up Pregel instead of GraphLab because of the synchronicity and because it already runs at Google (though maybe GraphLab has all that too).

The main hunch is that increasing the model size is what matters (as demonstrated by the Google Brain paper), so just find a way to get bigger by ditching optimization, and also remove the "stacking" of deep learning and go all the way to a general directed graph.

If anyone has an in with Quoc Le or someone else at Google, it would be nice if they could run this on ImageNet and see if it works.

(Aug 20 '12 at 12:54) marshallp

The problem is that you get bad dimensionality reducers if you train them on small subsets of data. Getting an effective dimensionality reducer like the cat detector needs a large interconnected network and lots of data. Anyway, GraphLab is public, so you could try out your idea on one machine.

(Aug 20 '12 at 22:43) Yaroslav Bulatov

If you have massive data, a single dimensionality reducer will only pick up a few of the largest patterns in the data; that's why it's probably better to have lots of them. Also, the method I mentioned is interconnected too. I don't think trying this on one computer will give good results, and so it will be ignored.

(Aug 22 '12 at 01:03) marshallp

2 Answers:

A distributed graph-based system is used to train deep networks at Google. Here's a talk on it: http://graphlab.org/wp-content/uploads/2012/07/graphlab_talk.pdf

Theoretically you could use Pregel, but I think in practice, if you put a large neural network into it, the updates will take too long. You need neural-network-specific optimizations that are hard to fit into Pregel. For instance, Pregel's bulk synchronous parallel model is overkill for gradient descent; see Hogwild by Niu and Recht. By forcing synchronous updates you are going to spend most of your time waiting for a few stragglers to finish, or have to spend a lot of resources duplicating effort with backup workers.

There's also the issue of the network -- you want to organize your task updates in a smart way so that you are not network-bandwidth bound. For vision tasks, this means putting computations corresponding to nearby patches on the same machine or the same rack. If you don't control the bandwidth requirements, it's easy to end up with a distributed gradient descent that works slower than a single GPU, and gets slower as you distribute it over more machines.
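To make the contrast concrete, Hogwild-style training follows roughly the pattern below (a simplified toy sketch, not the authors' code): worker threads update shared parameters with no locks and no barrier, whereas a BSP model would insert a synchronization barrier after every step.

    import numpy as np
    from threading import Thread

    rng = np.random.RandomState(0)
    X, y = rng.randn(10000, 20), rng.randn(10000)
    w = np.zeros(20)                      # shared parameters, no lock around them

    def worker(seed, steps=5000, lr=1e-3):
        r = np.random.RandomState(seed)
        for _ in range(steps):
            i = r.randint(len(X))         # pick one example (least-squares toy problem)
            grad = (X[i].dot(w) - y[i]) * X[i]
            w[:] = w - lr * grad          # racy in-place update; no barrier, no lock

    threads = [Thread(target=worker, args=(s,)) for s in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(w[:3])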

answered Aug 18 '12 at 19:09

Yaroslav Bulatov

edited Aug 18 '12 at 19:16

The model I've proposed doesn't have global stochastic gradient descent - training is local to each (auto)encoder.

(Aug 18 '12 at 19:36) marshallp

You will have the same issues training auto-encoders.

(Aug 18 '12 at 20:36) Yaroslav Bulatov

I'm maybe misunderstanding Pregel (so excuse me if that is the case), but it keeps things in memory, right? And it tries to keep "closely" connected nodes on the same computer to minimize going over the network (like the Erlang VM does).

So you send a minibatch of data in, and it reaches a node (an autoencoder) after passing through the graph (thus being transformed along the way). At that autoencoder, the minibatch is used for training (possibly by being uploaded to a GPU machine on the network) - so this is done in isolation from the rest of the graph, i.e. other minibatches are being sent in while this happens, and autoencoders are being added to other parts of the graph.

At run time, data is sent in, goes through all parts of the graph, and comes out through the "leaves" - this will be as fast as Pregel can handle, which I gather is as fast as a distributed graph can be.

The model is similar to the asynchronous one I read about in Jeff Dean's talk at the Snowbird conference, except that everything is local to an autoencoder (a node in the graph), and by piggy-backing on Pregel you get to use its optimized network code and can go to trillion-plus connections (or, more accurately, a trillion multiplied by the autoencoder connection size, so many, many trillions).

I don't have experience with Pregel, but it seems to match the BSP model perfectly, and so very little code would probably be required.
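I don't know the real Pregel API, but I imagine the vertex program would be roughly this (hypothetical compute()/send_message() signatures, just to show how little code it looks like it needs):

    # Hypothetical Pregel-style vertex program -- not the real Google API,
    # just how I picture an autoencoder living at each vertex of the graph.
    class AutoencoderVertex:
        def __init__(self, encoder, out_edges):
            self.encoder = encoder        # any local dimensionality reducer
            self.out_edges = out_edges    # ids of downstream encoder vertices

        def compute(self, messages, send_message):
            # messages: minibatches forwarded by parent vertices this superstep
            for minibatch in messages:
                self.encoder.partial_fit(minibatch)         # train locally, in isolation
                code = self.encoder.transform(minibatch)    # reduced representation
                for child in self.out_edges:                # forward downstream
                    send_message(child, code)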

(Aug 18 '12 at 21:24) marshallp

Figuring out how to keep "closely" connected nodes of a graph on one computer is actually a hard problem for general graphs. It's similar to the problem of tree decomposition, which is NP-hard. However, you can do it for very specific graphs, like the ones that occur in neural networks with local receptive fields.

(Aug 22 '12 at 13:43) Yaroslav Bulatov

That's basically the graph partitioning problem, and it's been studied quite well, with lots of practical solutions available.
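For example, off-the-shelf heuristics like Kernighan-Lin are readily available (toy networkx example; I'm not claiming this particular routine scales to trillions of edges):

    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    # toy encoder graph: dense clusters, sparse links between them
    G = nx.planted_partition_graph(4, 25, p_in=0.5, p_out=0.01, seed=0)

    # the classic Kernighan-Lin heuristic splits it into two balanced halves
    # while cutting few edges -- the kind of partition you'd map to machines
    part_a, part_b = kernighan_lin_bisection(G, seed=0)
    print(len(part_a), len(part_b))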

(Aug 22 '12 at 19:09) marshallp

I assume it would be much more efficient to have all the model parameters live contiguously in memory (possibly sharded on a cluster of machines) rather than use a database that would force a disk write for each weight update. It's also probably better to isolate parameter updates as much as possible, using tricks such as local receptive fields, to avoid synchronization (blocking message passing between cluster nodes).

You can always snapshot the state of the network weights asynchronously, in the background, to persistent storage such as a distributed FS or an Amazon S3-like blob store from time to time for backup purposes.
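Something along these lines, for instance (a minimal sketch; the shard layout and file name are made up):

    import numpy as np, threading, time

    # one shard of the model parameters, living in RAM on this machine
    weights = np.zeros((10000, 1000))

    def snapshot_loop(path="weights_shard0.npy", every_s=600):
        # background thread: periodically dump the shard to persistent storage
        # (a distributed FS mount or an S3-like blob store) without blocking training
        while True:
            time.sleep(every_s)
            np.save(path, weights)    # copy of the current state, purely for backup

    threading.Thread(target=snapshot_loop, daemon=True).start()
    # ... training loop keeps updating `weights` in memory ...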

answered Aug 20 '12 at 12:20

ogrisel

edited Aug 20 '12 at 12:21

Well, basically, what I want is a Pregel-like system that does all the work of bulk synchronicity (iterative updates) for you, so that the algorithm is simple to write and automatically as efficient as possible (in the same way MapReduce has done this for parallel jobs).

(Aug 20 '12 at 12:59) marshallp

Well, I would not call Pregel a graph database, but rather a data processing framework that uses a graph structure to model task dependencies and run them concurrently on a cluster of machines.

A graph database is something that stores vertices and edges with properties and possibly performs graph-oriented data retrieval such as graph traversal queries, as in Neo4j or OrientDB.

Furthermore, Pregel will not give you the concurrency for free: sharding the likely high-dimensional feature space seems critical to be able to optimize the autoencoder objective function on a distributed infrastructure while avoiding too-frequent, performance-killing synchronization barriers. Approaches such as "Selecting Receptive Fields in Deep Networks" by Coates and Ng, or "hardcoded" local receptive fields, seem mandatory to me.
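By "hardcoded" local receptive fields I mean a connectivity pattern like the following (toy numpy sketch with random weights, just to show that each hidden unit only touches one patch and is therefore easy to shard):

    import numpy as np

    rng = np.random.RandomState(0)

    # each hidden unit only sees a small patch of the input, so the weight
    # matrix is block-sparse and patches can live on different machines
    img_side, patch, n_hidden_per_patch = 32, 8, 16
    x = rng.randn(img_side, img_side)

    codes = []
    for r in range(0, img_side, patch):
        for c in range(0, img_side, patch):
            W = rng.randn(n_hidden_per_patch, patch * patch)   # local weights only
            codes.append(W.dot(x[r:r+patch, c:c+patch].ravel()))
    hidden = np.concatenate(codes)
    print(hidden.shape)    # (256,) = 16 patches * 16 units, no cross-patch weights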

(Aug 21 '12 at 10:39) ogrisel

The method I've mentioned does not do optimization - there is no objective function.

Also, the graph can be specified so that it's densely connected in some areas (clusters that fit on one computer) and sparsely connected between those clusters (to minimize going over the network).

The bulk synchronous method is ideal because it's deterministic (you want the same inputs to give the same outputs every time). Doing it with other code means you would just end up reimplementing BSP.

(Aug 22 '12 at 01:08) marshallp