Usually, NLP tasks need text pre-processing, such as stop word removal... but LDA doesn't seem to need this. Why?

asked Mar 06 '11 at 21:11


Fischer Yu


2 Answers:

In my experience LDA does indeed benefit[*] from preprocessing. For English, it is useful to include some multiword expressions and named entities as single tokens. For languages with richer morphology it is helpful to work with lemmas instead of the inflected forms.

[*] that is, to the extent LDA models can actually be evaluated.
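As a rough illustration, here is a minimal preprocessing-plus-LDA sketch in Python, assuming the NLTK and gensim libraries are available; the stopword list, lemmatizer, and toy documents are placeholder choices, not anything prescribed in the answer above:

    # Sketch: lemmatize and strip stopwords before fitting LDA.
    # Assumes the NLTK data has been fetched, e.g. nltk.download('stopwords'),
    # nltk.download('punkt'), nltk.download('wordnet').
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    raw_docs = [
        "The cats are chasing the mice tonight.",
        "A mouse was chased by two cats.",
    ]

    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Lowercase, tokenize, drop stopwords/punctuation, and map
        # inflected forms to a single lemma ("cats" -> "cat", "mice" -> "mouse").
        tokens = word_tokenize(text.lower())
        return [lemmatizer.lemmatize(t) for t in tokens
                if t.isalpha() and t not in stop]

    docs = [preprocess(d) for d in raw_docs]

    dictionary = Dictionary(docs)                   # token -> integer id
    corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors
    lda = LdaModel(corpus, id2word=dictionary, num_topics=2)
    print(lda.print_topics())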

answered Mar 06 '11 at 21:15


yoavg

Also, simple things like lemmatizing, normalizing capitalization, treating named entities as single words, and removing stopwords (including words that are not usually seen as stopwords but are just too frequent in a given corpus) make for models that are a lot more interpretable.

(Mar 06 '11 at 21:44) Alexandre Passos ♦
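A concrete way to drop words that are "just too frequent in a given corpus" is plain document-frequency filtering; a minimal sketch, assuming gensim's Dictionary (the thresholds and toy corpus are arbitrary):

    # Sketch: prune tokens that are too frequent (or too rare) in this
    # particular corpus, regardless of any standard stopword list.
    from gensim.corpora import Dictionary

    docs = [["topic", "model", "corpus", "word"],
            ["topic", "word", "inference"],
            ["corpus", "word", "sampling"]]  # already-tokenized documents

    dictionary = Dictionary(docs)
    # Keep tokens that appear in at least 2 documents but in no more than
    # 70% of all documents; both thresholds are purely illustrative.
    dictionary.filter_extremes(no_below=2, no_above=0.7)
    corpus = [dictionary.doc2bow(d) for d in docs]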

I think the most direct answer to this is the answer Blei gives in his own lecture about it:

Link here

He heavily emphasizes that LDA can be used in a broad spectrum of problems. So he mentions at the very beginning that he treats the documents as a bag of words, and thus he does not really care about order or grammar.

He does mention, though, that he gets rid of very common words such as "the", "a", etc. (sorry, I'm not familiar with NLP terminology).

answered Mar 07 '11 at 00:15


Leon Palafox ♦

Treating documents/topics as bags of words is very typical -- but it doesn't entail not doing any preprocessing. The thing is that "what is a word" is not clear, and the pre-processing helps with that. For example, I am sure even Blei does tokenization and does not just split everything on whitespace (so that e.g. "tonight," is split into "tonight" ","). Similarly, you could argue that "took" "apart" is actually one word "took_apart" (the same holds for named entities, e.g. "George_Bush"), and that "walk"/"walks"/"walked" are actually the same word/concept "walk". These things help. The reason many people (in the academic ML community) are not doing them is probably because (a) they are lazy, (b) it's more hackish and less "clean", and (c) no one can evaluate topic models anyhow, so no one "competes" against other methods based on model quality, and hence there is just no incentive for doing it.

(Mar 07 '11 at 15:40) yoavg
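For the collocation point, here is a minimal sketch, assuming gensim's Phrases model for merging frequent bigrams into single tokens; the toy sentences and thresholds are arbitrary, and real named-entity handling would normally use an NER tool rather than raw co-occurrence counts:

    # Sketch: merge frequent bigrams into single tokens such as
    # "george_bush", so downstream LDA sees them as one "word".
    from gensim.models.phrases import Phrases

    sentences = [["george", "bush", "gave", "a", "speech"],
                 ["george", "bush", "took", "apart", "the", "argument"],
                 ["the", "senator", "met", "george", "bush"]]

    # min_count / threshold control how eagerly bigrams are merged;
    # the values here are purely illustrative.
    bigram = Phrases(sentences, min_count=2, threshold=1.0)
    print(bigram[sentences[0]])
    # e.g. ['george_bush', 'gave', 'a', 'speech']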