Good day folks,

I'm looking into LDA for a personal project, but I do not know whether it is appropriate, so please help me out.

I have a collection of timestamped text documents, and new documents arrive at the end of every month, like a data stream. At the end of each month, I need to find the topics of that month's documents and compare them with the topics of previous months, to see which topics get hotter or colder and whether new topics emerge or old ones die.

I am curious if LDA can satisfy these requirements.

Also, I need clarification on some LDA concepts.

I have read about implementations of online LDA, where documents are added incrementally over time. Could this solve the problem described above?

There is also topic inference for new documents from an existing model. What happens if the new documents are very different from the documents used to build the existing model?

Thanks for taking the time to answer my question.

asked Sep 05 '11 at 00:12

flynn

edited Sep 05 '11 at 02:59

Can you please edit the title to make the actual question clearer? Something like: "Can online LDA determine trends in topic modelling?". It helps with searches, and you will get a nice shiny "Revisionist" badge as well :)

(Sep 05 '11 at 01:20) Robert Layton

@Robert, alright as you advised.

(Sep 05 '11 at 03:00) flynn

2 Answers:

Hello Flynn,

First, you need a clear idea of what LDA is. LDA models two kinds of distributions:

  1. For each topic, a distribution over words. That is, each topic assigns different probabilities to the words in your dictionary. E.g. a topic "computer science" would give high probabilities to words like "computer" or "algorithms", and low probabilities to "emperor" or "river".

  2. For each document, a distribution over topics. So each document is represented as a mixture of topics. For example, a document from Nature will likely assign high probabilities to topics like "biology", "genetics", "molecular biology", but low probabilities to topics like "computer science", "history", "literature".
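To make the two distributions concrete, here is a minimal sketch with made-up numbers. The topic names, the four-word vocabulary, and every probability are illustrative assumptions, not the output of a fitted model. It only shows how a document's word probabilities come out as a mixture of the topics' word distributions, weighted by the document's topic proportions.

```python
# Toy illustration of the two distributions in LDA; all numbers are invented.

# One distribution over words per topic (each row sums to 1).
topic_word = {
    "computer science": {"computer": 0.50, "algorithms": 0.40, "emperor": 0.05, "river": 0.05},
    "history": {"computer": 0.05, "algorithms": 0.05, "emperor": 0.60, "river": 0.30},
}

# One distribution over topics for a single document (sums to 1).
doc_topic = {"computer science": 0.8, "history": 0.2}


def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topic_word[topic][word] for topic, weight in doc_topic.items())


vocabulary = ["computer", "algorithms", "emperor", "river"]
doc_word = {w: word_probability(w, doc_topic, topic_word) for w in vocabulary}

for word in vocabulary:
    print(f"{word}: {doc_word[word]:.3f}")
```

Because the document leans towards "computer science", words such as "computer" end up far more probable than "emperor" or "river", even though the "history" topic still contributes a little.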

With that in mind, once you have trained a model on your first data set (the documents collected before your first month), and if those documents cover your domain well enough, new documents should be allocated to the right topics, and the same model can keep serving you in future months.

If you use a fixed set of topics, the probability of new documents being allocated to different topics will decrease.
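Assuming a model trained once and then used to infer topic proportions for each new month's documents, one rough way to see which topics get hotter or colder is to average the proportions per month and compare consecutive months. The sketch below is only that: the function names, the month labels, and the numbers are hypothetical, and the proportions would in practice come from whatever LDA implementation you use.

```python
def monthly_topic_share(doc_topics):
    """Average the per-document topic proportions within each month.

    doc_topics: iterable of (month, proportions) pairs, where proportions
    is a list with one inferred weight per topic.
    """
    sums = {}
    counts = {}
    for month, proportions in doc_topics:
        if month not in sums:
            sums[month] = [0.0] * len(proportions)
            counts[month] = 0
        for k, p in enumerate(proportions):
            sums[month][k] += p
        counts[month] += 1
    return {m: [s / counts[m] for s in sums[m]] for m in sums}


def topic_trend(shares, previous_month, month):
    """Change in average share per topic; positive means hotter, negative colder."""
    return [cur - prev for prev, cur in zip(shares[previous_month], shares[month])]


# Invented proportions for two topics over two months.
docs = [
    ("2011-07", [0.7, 0.3]),
    ("2011-07", [0.5, 0.5]),
    ("2011-08", [0.2, 0.8]),
    ("2011-08", [0.4, 0.6]),
]
shares = monthly_topic_share(docs)
change = topic_trend(shares, "2011-07", "2011-08")
print(change)
```

Note that this can only track topics the model already knows; a genuinely new topic would be spread over the existing ones rather than show up as its own entry.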

Edwin Chen has a nice introduction to the topic here; hope it helps.

answered Sep 05 '11 at 00:49

Leon Palafox

Thank you very much @Leon, I think I understand your explanation: train a model on a large enough data set and use it for inference in the following months. My question is this: the documents I receive are blog articles from fellows in my school, so they may cover a large domain. If a new topic emerges (for instance, the new influenza virus in 2009), how will I be able to detect it if I am only doing inference?

(Sep 05 '11 at 01:07) flynn

OK, what you are looking for is called topic evolution in LDA. It is fair to say that LDA isn't the only solution for such a problem. You can look into these papers to get a better idea of the topic: http://users.informatik.uni-halle.de/~hinnebur/PS_Files/sdm09_APLSA.pdf, http://cs.gmu.edu/~carlotta/publications/AlsumaitL_onlineLDA.pdf. I wish I could help you more, but I haven't looked that deeply into LDA.

(Sep 05 '11 at 01:33) Leon Palafox

For simplicity's sake, I wonder if I can generate a separate topic model (with similar parameters) for the docs in each time slice, then stitch the topics from t-1 and t together by comparing their KL divergence? Sort of like a semi-supervised approach.

(Sep 05 '11 at 04:47) flynn
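The stitching idea in the comment above could look roughly like the sketch below. It assumes both time slices share the same vocabulary (so the topic-word vectors line up), uses a symmetrised KL divergence because plain KL is not symmetric, and flags a topic as new when nothing from t-1 is close enough. The function names, the toy vectors, and the threshold of 0.5 are all hypothetical choices that would need tuning on real data.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p), so the order of arguments does not matter."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def match_topics(prev_topics, curr_topics, threshold=0.5):
    """For each topic at time t, find the closest topic at time t-1.

    Returns a list of (index_in_prev or None, distance); None marks a topic
    with no sufficiently close predecessor, i.e. a candidate new topic.
    """
    matches = []
    for topic in curr_topics:
        distances = [symmetric_kl(topic, prev) for prev in prev_topics]
        best = min(range(len(distances)), key=lambda i: distances[i])
        if distances[best] <= threshold:
            matches.append((best, distances[best]))
        else:
            matches.append((None, distances[best]))
    return matches


# Invented topic-word distributions over a shared four-word vocabulary.
prev_topics = [[0.50, 0.40, 0.05, 0.05], [0.05, 0.05, 0.60, 0.30]]
curr_topics = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.10, 0.80]]

matches = match_topics(prev_topics, curr_topics)
print(matches)
```

A previous topic that is never chosen as a match would be a candidate for a topic that has died. Greedy nearest-neighbour matching is the simplest option here; two current topics can map to the same previous one, which may or may not be what you want.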

In that case you should be really careful with your alpha and beta parameters (the Dirichlet prior parameters for the two distributions), since if you relax them too much, the distributions over words may differ enough to make really similar topics come out as different (e.g. Molecular Biology and Cellular Biology). I guess it is a matter of tuning after that.

(Sep 05 '11 at 05:41) Leon Palafox

I have no experience with online LDA for topic tracking, so I cannot directly answer your question, but I recently came across (thanks, Vlad) a paper on Online Orthogonal NMF that seems to focus on exactly what you describe:

Detect and Track Latent Factors with Online Nonnegative Matrix Factorization by Bin Cao, Dou Shen, Jian-Tao Sun, Xuanhui Wang, Qiang Yang and Zheng Chen

answered Sep 06 '11 at 05:52

ogrisel

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.