
What is the current state of the art in topic modeling? I've been using LDA on a project, but I read something about Hinton using deep learning for topic modeling. Does anyone know if there is example code for something like this? And is anyone using anything like this in production?

Thanks

asked Feb 23 '12 at 18:44

Ryan Stout


2 Answers:

LDA and related methods are still very popular and under active development, from what I've seen (although I'm not following the area closely). Deep learning and autoencoders also seem like natural ways to do topic modeling, if you think of topic modeling as just dimensionality reduction, but I'm not familiar with how their performance compares to LDA.
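For a quick experiment, here is a minimal sketch of fitting LDA in Python with gensim (the tiny `texts` corpus and the topic count are placeholder assumptions; swap in your own data):

```python
# Minimal LDA sketch using gensim; `texts` stands in for a real
# tokenized corpus (a list of token lists).
from gensim import corpora, models

texts = [["topic", "modeling", "with", "lda"],
         ["deep", "learning", "for", "documents"]]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]  # bag-of-words vectors

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
for topic in lda.print_topics(num_words=5):
    print(topic)
```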

answered Feb 24 '12 at 10:51

Kevin Canini

If you are willing to trust AIS (annealed importance sampling) estimates, the replicated softmax RBM gives much better log probabilities than LDA does.

(Feb 24 '12 at 15:41) gdahl ♦

Read the replicated softmax paper and the deep topic modeling paper. The replicated softmax, even as a single-layer model, is a much better probabilistic model of documents than LDA. Take a look at Ruslan Salakhutdinov's other papers as well. I think the hierarchical Dirichlet process deep Boltzmann machine (HDP-DBM) paper might be relevant if you want easily interpretable "topics" from distributed representations of word-count vectors.

If you want a good probabilistic model of documents, good features from word-count vectors, or good codes for documents for information retrieval, LDA isn't going to cut it. If you want mutually exclusive, qualitatively pleasing "topics", then LDA is a good option. On the Reuters dataset (if you are OK with AIS being used to estimate RBM log probabilities), an LDA model with 50 topics has a perplexity halfway between a unigram model and a replicated softmax RBM with 50 hidden units, and using a deeper model only makes the difference more extreme. Of course, LDA isn't necessarily designed to produce efficient low-dimensional representations of documents in the same way, so make of that what you will. But even if you give LDA more topics, a topic model that can learn distributed representations of documents will have an advantage.
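For reference, here is a rough numpy sketch of one contrastive divergence (CD-1) update for a replicated softmax RBM, following the formulation in the paper (the visible layer is a multinomial over the vocabulary and the hidden biases are scaled by the document length D); minibatching, momentum, and weight decay are omitted, and the variable names are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cd1_update(v, W, a, b, lr=0.01):
    """One CD-1 step. v: word-count vector for one document, shape (K,);
    W: weights (K, F); a: visible biases (K,); b: hidden biases (F,)."""
    D = v.sum()  # document length
    # Positive phase: hidden probabilities given the observed counts.
    # The hidden biases are scaled by D ("replicated" softmax units).
    h_pos = sigmoid(v @ W + D * b)
    h_samp = (rng.random(h_pos.shape) < h_pos).astype(float)
    # Negative phase: reconstruct the document by drawing D words from
    # the softmax over the vocabulary, then re-infer the hidden units.
    v_neg = rng.multinomial(int(D), softmax(W @ h_samp + a)).astype(float)
    h_neg = sigmoid(v_neg @ W + D * b)
    # Parameter updates from the CD-1 gradient estimate.
    W += lr * (np.outer(v, h_pos) - np.outer(v_neg, h_neg))
    a += lr * (v - v_neg)
    b += lr * D * (h_pos - h_neg)
```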

answered Feb 24 '12 at 15:38

gdahl ♦

edited Feb 25 '12 at 02:10

I haven't read the replicated softmax paper, but LDA finds very sparse representations. Does this approach also find very sparse representations? If not, I don't think it's fair to compare a 50-dimensional sparse representation to a 50-dimensional non-sparse representation.

(Feb 24 '12 at 16:13) Kevin Canini

You're right, it isn't fair. The paper also compares to higher dimensional LDA representations.

(Feb 24 '12 at 16:33) gdahl ♦

That being said, you can get sparse topics in the HDP-DBM if you want them, and also improve the generative model. In most applications, I don't think the sparsity is an inherent advantage.

(Feb 24 '12 at 16:39) gdahl ♦

Are there open, easy-to-use implementations of these RBM approaches? The nice thing about LDA is that there's a robust and easy-to-use implementation in at least the Mallet toolkit (http://mallet.cs.umass.edu/), and I'm sure in other libraries.

(Feb 24 '12 at 17:01) Keith Stevens

Thanks, I'll take a look at those. My main goal is to take documents and automatically list the topics associated with each document. LDA does a pretty good job at this; I was mainly just wondering if there had been any advances. Also, LDA's run time is pretty heavy, but my guess is that all of these methods would take quite a while.

(Feb 24 '12 at 19:31) Ryan Stout

If you want a list of comprehensible topics for each document, then LDA should perform very well. There are recent implementations that run very fast. I like to use FastLDA: http://www.ics.uci.edu/~newman/code/fastlda/
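Once you have a fitted model, listing a document's topics is only a couple of lines; for example with gensim (assuming `lda` and `dictionary` come from whatever fitting step you used):

```python
# Assuming `lda` is a fitted gensim LdaModel and `dictionary` its Dictionary.
bow = dictionary.doc2bow("text of a new document".lower().split())
for topic_id, weight in lda.get_document_topics(bow, minimum_probability=0.05):
    print(round(weight, 2), lda.print_topic(topic_id, topn=5))
```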

(Feb 24 '12 at 19:38) Kevin Canini

The HDP-DBM is the best option I know of, but it is a complicated method compared to LDA. There is lots of fast, convenient LDA code available online.

(Feb 25 '12 at 02:13) gdahl ♦