
Hi,

I need to implement some topic-model algorithms in a project that will go into production. Since we are using Bayesian inference, I have to choose between variational Bayesian methods and Gibbs sampling. Which one would you choose, based on:

  • Accuracy
  • Ease of coding
  • Ease of parallelization
  • Lower probability of bugs, or at least bugs that are easier to track down (I know this is coder-dependent, but you may have a sense of the trickiness or TDD-friendliness)
  • Whatever else you think is relevant when coding such algorithms from scratch

Regards

asked Sep 20 '11 at 12:06


Toni Cebrián

edited Sep 21 '11 at 11:08

I've found a blog post that summarizes this discussion: http://www.phontron.com/blog/?p=24

(Sep 26 '11 at 05:44) Toni Cebrián

2 Answers:

How important is accuracy in this production setting? How are you willing to trade between computational time and accuracy? Will you want to retrain?

For coding, a slow Gibbs sampler is far easier to write than slow variational inference. Making sure the code is correct is the other way around: variational is far easier to check, since the variational objective must increase monotonically at every iteration, which gives you an automatic correctness test.
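To make the first point concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA (a toy, not production code; all function and variable names are mine, not from any library). The sampling loop itself is only a few lines, which is why Gibbs is so quick to get running:

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=50, seed=0):
    """Collapsed Gibbs sampler for LDA.

    docs: list of lists of word ids in [0, vocab_size).
    Returns (doc_topic_counts, topic_word_counts).
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # doc-topic counts
    n_kw = np.zeros((n_topics, vocab_size))  # topic-word counts
    n_k = np.zeros(n_topics)                 # tokens per topic
    # Random initial topic assignment for every token.
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token, resample its topic from the
                # collapsed conditional, then add it back.
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) \
                    / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw
```

The flip side is that there is no objective function to monitor, so a subtly wrong count update can go unnoticed for a long time.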

Variational inference is embarrassingly parallel in two steps (running inference on all documents independently, then aggregating these results to compute the global parameters). Parallelizing Gibbs sampling is trickier, but I think it can end up scaling more easily to more machines (see Yahoo!'s implementation of LDA).
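That two-step structure can be sketched with a toy EM-style loop (plain multinomial-mixture EM rather than full variational LDA, and all names here are hypothetical). The per-document step is the part that parallelizes trivially:

```python
import numpy as np

def e_step(doc_counts, log_topic_word):
    """Per-document inference: soft-assign the document to topics."""
    log_resp = doc_counts @ log_topic_word.T
    log_resp -= log_resp.max()               # for numerical stability
    resp = np.exp(log_resp)
    resp /= resp.sum()
    # Expected topic-word counts contributed by this document.
    return resp[:, None] * doc_counts[None, :]

def m_step(stats, smoothing=0.01):
    """Aggregate per-document statistics into global parameters."""
    topic_word = stats + smoothing
    return np.log(topic_word / topic_word.sum(axis=1, keepdims=True))

def fit(doc_term, n_topics, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    log_topic_word = np.log(
        rng.dirichlet(np.ones(doc_term.shape[1]), n_topics))
    for _ in range(n_iters):
        # This loop over documents is the embarrassingly parallel
        # step: each document is independent, so the map can be
        # farmed out to multiprocessing.Pool.map or across machines,
        # with only the summed statistics sent back.
        stats = sum(e_step(d, log_topic_word) for d in doc_term)
        log_topic_word = m_step(stats)
    return log_topic_word
```

Only the small aggregated statistics matrix crosses worker boundaries, which is what makes the two-phase scheme so easy to distribute.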

I don't think there's an easy answer here, but I suggest you look more closely at your requirements, or else just go with whatever you feel most comfortable with, if your requirements don't rule out either option.

answered Sep 20 '11 at 13:11


Alexandre Passos ♦

Yes, retraining will happen on a daily basis as new data is received. It will be a lot of data, but the computations could happen overnight.

(Sep 21 '11 at 03:34) Toni Cebrián

I believe GraphLab has a parallelized gibbs sampler, which will probably be super fast. Check out the talk from NIPS 2010 on videolectures.net: http://videolectures.net/nipsworkshops2010_guestrin_kml

The motivating example for the first part of the talk is Gibbs sampling.

(Sep 23 '11 at 00:28) Steve Lianoglou

I suggest you try Gensim. I don't know whether it satisfies all your requirements, but the core of Gensim's LDA model is Hoffman's online LDA, which can be faster than vanilla batch LDA (BTW, you can check Hoffman's paper, published at NIPS 2010). On top of that, Gensim supports parallel computing, so you can use several machines at the same time, which should presumably speed things up. And Gensim is written in Python, so you can easily find extra help in the Python community.

answered Sep 22 '11 at 04:17


Zhibo Xiao



User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.