Hi, I'm working on a search utility, and I'm trying to use LDA to improve my ability to search for documents. I use Mallet's LDA to train a model on my collection of documents. Then I use the trained model to infer a topic distribution for my query.

The problem I'm experiencing is that my topic distribution is relatively flat, since the queries only contain 1-3 terms, making them very small documents. However, when I multiply the number of times each query term occurs in the query, I get a good distribution, which gives me decent results when I apply KL divergence between the query and the documents in my corpus. I'm assuming that document length plays a part during inference. Is there a way around having to repeat the occurrences of each term in the query? My other idea was to use the term-to-topic weights that Mallet outputs, and somehow calculate the topic distribution for a query by multiplying, for each topic, the probabilities of the query terms in that topic. This, however, seems wrong because I'm essentially not doing inference, which is one of the points of LDA, right?
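
For readers following along, the KL-divergence ranking described above might look roughly like the sketch below. It is only an illustration: the topic distributions, the document ids, and the helper names `kl_divergence` and `rank_documents` are made up and do not come from Mallet. It also shows why a flat query distribution is a problem: it matches best whichever document is itself close to uniform.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    eps is a small smoothing constant (an assumption) to avoid log(0).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def rank_documents(query_dist, doc_dists):
    """Return document ids sorted by increasing KL(query || document)."""
    scores = {doc_id: kl_divergence(query_dist, theta)
              for doc_id, theta in doc_dists.items()}
    return sorted(scores, key=scores.get)

# Hypothetical topic distributions over 3 topics.
docs = {
    "doc_a": [0.80, 0.10, 0.10],
    "doc_b": [0.10, 0.80, 0.10],
    "doc_c": [0.34, 0.33, 0.33],
}

peaked_query = [0.7, 0.2, 0.1]    # what a "repeated terms" query might give
flat_query = [1 / 3, 1 / 3, 1 / 3]  # what a 1-3 term query tends to give

ranking_peaked = rank_documents(peaked_query, docs)
ranking_flat = rank_documents(flat_query, docs)
print(ranking_peaked)
print(ranking_flat)
```

With the peaked query the topically focused document comes first; with the flat query the near-uniform document wins regardless of what the query terms were.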

Any advice/suggestions would be greatly appreciated. Thanks.

asked Mar 27 '11 at 18:50

Karthik

One Answer:

The length of the document definitely influences your topic distribution.

When you estimate the topic distribution for your query, you are actually sampling one topic for each word of your query with Mallet's Gibbs sampler, conditioned on all the topic assignments in the training corpus. With a 3-word query you will therefore get at most 3 different topics, each carrying an equal share of the query.
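
To make the effect concrete, here is a deliberately simplified sketch of this kind of sampler. It is not Mallet's implementation: `phi` (per-topic word probabilities), the `alpha` value, and the function name `infer_topics` are all assumptions chosen for illustration, and it reports a single final sample rather than an average over samples.

```python
import random

def infer_topics(tokens, phi, alpha=0.1, iterations=200, seed=0):
    """Gibbs-sample topic assignments for one new document, keeping phi fixed.

    phi[k] is a dict mapping word -> hypothetical p(word | topic k).
    Returns (smoothed topic distribution, raw per-topic token counts).
    """
    rng = random.Random(seed)
    num_topics = len(phi)
    # Start from random assignments, one topic per token.
    z = [rng.randrange(num_topics) for _ in tokens]
    counts = [0] * num_topics
    for k in z:
        counts[k] += 1
    for _ in range(iterations):
        for i, word in enumerate(tokens):
            counts[z[i]] -= 1  # remove the token's current assignment
            weights = [(counts[k] + alpha) * phi[k].get(word, 1e-9)
                       for k in range(num_topics)]
            z[i] = rng.choices(range(num_topics), weights=weights)[0]
            counts[z[i]] += 1
    total = len(tokens) + num_topics * alpha
    return [(counts[k] + alpha) / total for k in range(num_topics)], counts

# Hypothetical model with 4 topics over a tiny vocabulary.
phi = [
    {"bank": 0.3, "money": 0.5, "loan": 0.2},
    {"bank": 0.1, "river": 0.6, "water": 0.3},
    {"goal": 0.5, "match": 0.5},
    {"vote": 0.6, "party": 0.4},
]

short_query = ["bank", "loan", "river"]
short_dist, short_counts = infer_topics(short_query, phi)
long_dist, long_counts = infer_topics(short_query * 10, phi)
print(short_counts, long_counts)
```

However many topics the model has, `short_counts` can only hold three tokens, so the estimate is built from three draws. Repeating the query ten times gives the sampler thirty draws to spread over the topics.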

By repeating the words and consequently increasing the size of the query, you allow a single word to be assigned to different topics (you actually build an empirical conditional distribution of topics given this word, by sampling topic assignments repeatedly for the same word). You could of course try to obtain these distributions directly from Mallet, but I wouldn't know how right now; I'd have to look at the source. If this is what you meant by the second idea, go ahead.
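
As a rough sketch of what "obtaining these distributions directly" could mean, the code below computes p(topic | word) with Bayes' rule from per-topic word counts and averages it over the query words. Everything here is an assumption for illustration: the counts are invented, `beta` is an arbitrary smoothing value, and the function names are hypothetical. How to export such counts from Mallet is not covered.

```python
def topic_given_word(word, topic_word_counts, beta=0.01):
    """p(topic | word) proportional to p(word | topic) * p(topic).

    topic_word_counts[k] is a dict mapping word -> number of tokens of that
    word assigned to topic k in the training corpus (hypothetical input).
    """
    vocab = {w for counts in topic_word_counts for w in counts}
    topic_totals = [sum(counts.values()) for counts in topic_word_counts]
    grand_total = sum(topic_totals)
    scores = []
    for counts, total in zip(topic_word_counts, topic_totals):
        p_word_given_topic = (counts.get(word, 0) + beta) / (total + beta * len(vocab))
        p_topic = total / grand_total
        scores.append(p_word_given_topic * p_topic)
    norm = sum(scores)
    return [s / norm for s in scores]

def query_topic_distribution(words, topic_word_counts, beta=0.01):
    """Average p(topic | word) over the query words."""
    per_word = [topic_given_word(w, topic_word_counts, beta) for w in words]
    num_topics = len(topic_word_counts)
    return [sum(d[k] for d in per_word) / len(per_word) for k in range(num_topics)]

# Invented counts for a 2-topic model.
topic_word_counts = [
    {"bank": 30, "money": 50, "loan": 20},
    {"bank": 10, "river": 60, "water": 30},
]

bank_dist = topic_given_word("bank", topic_word_counts)
query_dist = query_topic_distribution(["bank", "loan"], topic_word_counts)
print(bank_dist, query_dist)
```

Averaging treats the query as a mixture in which each word picks its own topic, which is close to what repeated sampling of the same word approximates. Multiplying the per-word probabilities instead would assume the whole query comes from a single topic, which is a different model.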

You are right, though, that if you take the distributions directly you stop updating the model, i.e. after training on the training corpus, your model would not change any more.

But then again, it is questionable whether LDA should be re-learned over time on a changing corpus, as a dynamic corpus might violate the exchangeability assumption on the documents, which underlies LDA. If you want your topics to reflect the current topical landscape, you should check whether LDA is an acceptable solution, or whether you should use time-aware models, such as Topics over Time (http://portal.acm.org/citation.cfm?id=1150450) or Dynamic Topic Models (http://portal.acm.org/citation.cfm?id=1143859).

answered Mar 28 '11 at 11:06

Breno
User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.