
I am wondering whether there is any work on extensions of LDA that automatically select an appropriate number of topics (in standard LDA the user has to specify the number of topics in advance).

asked Mar 18 '11 at 06:04

Mark Alen

2 Answers:

Yes. The most common one is HDP-LDA. The idea behind HDP-LDA is to use hierarchical Dirichlet processes to model the Dirichlet admixture in LDA nonparametrically.

A Dirichlet process is a common way of doing Bayesian nonparametric clustering without specifying the number of clusters in advance. A DP has two parameters, a concentration parameter alpha and a base measure. Each sample from the DP is either exactly equal to a previous sample (with probability proportional to how many times that sample has appeared so far) or a fresh sample from the base measure (with probability proportional to alpha). A hierarchical Dirichlet process is what you get when the base measure of a Dirichlet process is itself drawn from a Dirichlet process.
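The sampling rule described above can be sketched in a few lines of Python. This is only an illustration of the urn-style description, not code from any HDP-LDA implementation; the names `sample_from_dp` and `base_sampler` are made up here, and the base measure is assumed to be uniform on [0, 1) purely for the example.

```python
import random

def sample_from_dp(n, alpha, base_sampler, rng):
    """Draw n values using the urn description of a Dirichlet process."""
    draws = []
    for i in range(n):
        # With probability alpha / (alpha + i), take a fresh sample
        # from the base measure (always the case for the first draw).
        if rng.random() < alpha / (alpha + i):
            draws.append(base_sampler(rng))
        else:
            # Otherwise repeat a previous draw. Picking uniformly among
            # past draws makes each value's probability proportional to
            # how many times it has already appeared.
            draws.append(rng.choice(draws))
    return draws

rng = random.Random(0)
# Assumed base measure for illustration: uniform on [0, 1).
draws = sample_from_dp(1000, alpha=2.0, base_sampler=lambda r: r.random(), rng=rng)
n_clusters = len(set(draws))
print(n_clusters, "distinct values among", len(draws), "draws")
```

The number of distinct values plays the role of the number of clusters: it is not fixed in advance and grows slowly with the number of draws, at a rate controlled by alpha.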

The standard LDA model is as follows:

phi_d ~ Dirichlet(alpha)
theta_t ~ Dirichlet(gamma)
z_{d,i} ~ Discrete(phi_d)
w_{d,i} ~ Discrete(theta_{z_{d,i}})
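As a rough sketch, the generative process above can be simulated with the standard library alone. The function names and the corpus sizes below are illustrative assumptions, and symmetric Dirichlet priors are assumed for both alpha and gamma.

```python
import random

def dirichlet(concentration, dim, rng):
    """Sample from a symmetric Dirichlet by normalizing Gamma draws."""
    g = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(g)
    return [x / total for x in g]

def discrete(probs, rng):
    """Sample an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_lda_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, gamma, rng):
    # theta_t ~ Dirichlet(gamma): one word distribution per topic
    theta = [dirichlet(gamma, vocab_size, rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # phi_d ~ Dirichlet(alpha): topic proportions for this document
        phi_d = dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = discrete(phi_d, rng)               # z_{d,i} ~ Discrete(phi_d)
            words.append(discrete(theta[z], rng))  # w_{d,i} ~ Discrete(theta_z)
        docs.append(words)
    return docs

lda_docs = generate_lda_corpus(n_docs=5, doc_len=20, n_topics=3,
                               vocab_size=30, alpha=0.5, gamma=0.5,
                               rng=random.Random(0))
```

Note that `n_topics` has to be passed in explicitly, which is exactly the limitation the question is about.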

So the HDP-LDA is

G ~ DP(alpha, all_words)
G1_d ~ DP(gamma, G)
z_{d,i} ~ Discrete(G1_d)
w_{d,i} ~ Discrete(z_{d,i})

which means that each word token now has its own topic z_{d,i}, itself a distribution over words, sampled from a hierarchical Dirichlet process model in which G1_d encourages the words in the same document to reuse the same topics and G encourages different documents to share topics.
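One way to see how the two levels interact is to simulate the model with nested urns, where the document-level urn falls back on a shared corpus-level urn whenever it needs a "new" topic. This is a sketch under stated assumptions, not the linked implementation: the class `PolyaUrn`, the function names, and the parameter `eta` are hypothetical, and the base measure written as `all_words` above is assumed here to be a symmetric Dirichlet(eta) over the vocabulary.

```python
import random

class PolyaUrn:
    """Urn-style sampler for a DP with a given concentration and base sampler."""
    def __init__(self, concentration, base_sampler, rng):
        self.concentration = concentration
        self.base_sampler = base_sampler
        self.rng = rng
        self.draws = []

    def sample(self):
        n = len(self.draws)
        if self.rng.random() < self.concentration / (self.concentration + n):
            value = self.base_sampler()           # fresh draw from the base
        else:
            value = self.rng.choice(self.draws)   # reuse, proportional to counts
        self.draws.append(value)
        return value

def dirichlet(concentration, dim, rng):
    g = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(g)
    return [x / total for x in g]

def generate_hdp_corpus(n_docs, doc_len, vocab_size, alpha, gamma, eta, rng):
    topics = []  # word distributions created so far, indexed by topic id

    def new_topic():
        # Assumed base measure: a symmetric Dirichlet(eta) over the vocabulary.
        topics.append(dirichlet(eta, vocab_size, rng))
        return len(topics) - 1

    G = PolyaUrn(alpha, new_topic, rng)            # G ~ DP(alpha, base)
    docs = []
    for _ in range(n_docs):
        G1_d = PolyaUrn(gamma, G.sample, rng)      # G1_d ~ DP(gamma, G)
        words = []
        for _ in range(doc_len):
            z = G1_d.sample()                      # topic id for this token
            w = rng.choices(range(vocab_size), weights=topics[z], k=1)[0]
            words.append(w)
        docs.append(words)
    return docs, topics

hdp_docs, hdp_topics = generate_hdp_corpus(n_docs=5, doc_len=30, vocab_size=50,
                                           alpha=1.0, gamma=1.0, eta=0.5,
                                           rng=random.Random(0))
print(len(hdp_topics), "topics were created")
```

The number of topics is never specified: it is whatever `len(hdp_topics)` turns out to be. Because every document's urn draws its new topics from the same `G`, topics created for one document can be reused by later ones.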

You can find a well-commented implementation of HDP-LDA here.

answered Mar 18 '11 at 08:07

Alexandre Passos ♦
I'm also interested in this.

In "Finding scientific topics", model selection is used to decide the appropriate number of topics. More specifically, the harmonic mean of a set of values P(w|z,T) is used to approximate P(w|T). For a model with K topics, is it 1/P(w|T=K) approx (1/M) * sum_{d in M} 1/P_d(w|z,T=K), with M a set of documents d?

I'm not sure whether the formula is right. Any advice is appreciated. Thanks.
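For reference, the harmonic mean estimator is usually stated with the average running over M samples z^(m) drawn by the Gibbs sampler from P(z|w,T), rather than over documents: 1/P(w|T) approx (1/M) * sum_m 1/P(w|z^(m),T). Under that assumption, a minimal sketch of the computation in log space looks like this. The function name and the input values are hypothetical; the log-likelihoods log P(w|z^(m),T) would have to come from an actual sampler.

```python
import math

def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of exp(log_likelihoods), computed stably."""
    m = len(log_likelihoods)
    neg = [-x for x in log_likelihoods]
    top = max(neg)
    # log sum_m 1/p_m, using the log-sum-exp trick to avoid overflow
    log_sum_inv = top + math.log(sum(math.exp(v - top) for v in neg))
    # HM = M / sum_m (1/p_m)  =>  log HM = log M - log sum_m (1/p_m)
    return math.log(m) - log_sum_inv

# Hypothetical values of log P(w | z^(m), T=K) for three Gibbs samples.
sample_log_liks = [-1000.0, -1002.0, -1001.5]
estimate = log_harmonic_mean(sample_log_liks)  # approximates log P(w | T=K)
```

Repeating this for several values of K and comparing the estimates is the model selection step; working in log space matters because the likelihoods themselves underflow for any realistic corpus.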

answered Mar 29 '11 at 06:02

lily
edited Mar 29 '11 at 09:19


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.