|
Good day folks, I'm looking into LDA for a personal project, but I don't know whether it is appropriate, so please help me out. I have a collection of timestamped text documents, and new documents arrive at the end of every month, like a data stream. At the end of each month, I need to find the topics for that month's documents and compare them with the topics from previous months, to see which topics are getting hotter or colder and whether new topics emerge or old ones die. I am curious whether LDA can satisfy these requirements. I also need clarification on some LDA concepts. I have read about implementations of online LDA, which add documents incrementally over time. Could that solve the problem described above? There is also topic inference for new documents from an existing model. What happens if the new documents are very different from the documents used to build the existing model? Thanks for taking the time to answer my question. |
|
Hello Flynn, First, you need a clear idea of what LDA basically is. LDA models two distributions: a distribution over topics for each document, and a distribution over words for each topic.
With that in mind, once you have trained on your first data set (before your first month), if the documents are complete enough, new documents should be allocated correctly and will further help in future months. If you use a fixed set of topics, the probability of new documents being allocated to different topics will decrease. Edwin Chen has a nice introduction on the topic here, hope it helps.
Thank you very much @Leon, I think I understand your explanation: train a model on a large enough data set and use it for inference over the remaining months. My question is this: the documents I receive are blog articles from fellows in my school, so they may cover a broad domain. If a new topic emerges (for instance, in 2009, the new influenza virus), how will I be able to detect it if I am only doing inference?
(Sep 05 '11 at 01:07)
flynn
OK, what you are looking for is called topic evolution. It is fair to say that LDA isn't the only solution to this problem. You can try these papers to get a better idea of the subject. I wish I could help you more, but I haven't looked that deeply into LDA. http://users.informatik.uni-halle.de/~hinnebur/PS_Files/sdm09_APLSA.pdf, http://cs.gmu.edu/~carlotta/publications/AlsumaitL_onlineLDA.pdf
(Sep 05 '11 at 01:33)
Leon Palafox
For simplicity's sake, I wonder if I could fit a separate topic model (with similar parameters) to the documents in each time slice, then stitch the topics from t-1 and t together by comparing their KL divergence? Sort of like a semi-supervised approach.
(Sep 05 '11 at 04:47)
flynn
In that case you should be really careful with your alpha and beta parameters (the parameters over the two distributions), since if you relax them too much, the distributions over words may differ enough to make really similar topics look different (e.g. Molecular Biology and Cellular Biology). I guess it is a matter of tuning after that.
(Sep 05 '11 at 05:41)
Leon Palafox
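To make the stitching idea above concrete, here is a minimal sketch in plain Python. It assumes each topic is already available as a probability vector over a shared, fixed vocabulary (one list per topic, from whatever LDA implementation you use). The function names, the symmetrised form of the KL divergence, the smoothing constant and the 0.5 threshold are all illustrative assumptions, not something taken from the papers above, and the threshold in particular would need tuning on real data.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.
    eps is an assumed smoothing constant to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p), so the order of months does not matter."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

def match_topics(prev_topics, curr_topics, threshold=0.5):
    """Link each current topic to its closest previous topic.

    Returns (links, new_topics, dead_topics):
      links       maps current topic index -> previous topic index
      new_topics  current topics with no previous topic within the threshold
      dead_topics previous topics that no current topic was linked to
    The threshold is a hypothetical value chosen for this toy example.
    """
    links, new_topics = {}, []
    for j, curr in enumerate(curr_topics):
        dists = [symmetric_kl(curr, prev) for prev in prev_topics]
        best = min(range(len(dists)), key=dists.__getitem__)
        if dists[best] <= threshold:
            links[j] = best
        else:
            new_topics.append(j)
    matched_prev = set(links.values())
    dead_topics = [i for i in range(len(prev_topics)) if i not in matched_prev]
    return links, new_topics, dead_topics

# Toy topic-word distributions over a 4-word vocabulary (made-up numbers).
prev_topics = [[0.70, 0.20, 0.05, 0.05],
               [0.05, 0.05, 0.20, 0.70]]
curr_topics = [[0.65, 0.25, 0.05, 0.05],   # close to previous topic 0
               [0.05, 0.45, 0.45, 0.05]]   # unlike either previous topic

links, new_topics, dead_topics = match_topics(prev_topics, curr_topics)
print(links, new_topics, dead_topics)
```

Note that this greedy matching lets several current topics link to the same previous topic, and a threshold that is too tight would split near-identical topics (the Molecular Biology vs. Cellular Biology case) into "new" ones, so treat it as a starting point only.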
|
|
I have no experience with online LDA for topic tracking, so I cannot answer your question directly, but I recently came across (thanks, Vlad) a paper on online orthogonal NMF that seems to focus on exactly what you describe: "Detect and Track Latent Factors with Online Nonnegative Matrix Factorization" by Bin Cao, Dou Shen, Jian-Tao Sun, Xuanhui Wang, Qiang Yang and Zheng Chen |
Can you please edit the title to be a bit clearer about the actual question? Something like: "Can online LDA determine trends in topic modelling?". It helps with searches, and you will get a nice shiny "Revisionist" badge too :)
@Robert, all right, done as you advised.