Usually, NLP tasks need text pre-processing, such as stop word removal... but LDA doesn't need this. Why?
In my experience LDA does indeed benefit[*] from preprocessing. For English, it is useful to include some multiword expressions and named entities as single tokens. For languages with richer morphology it is helpful to work with lemmas instead of the inflected forms. [*] That is, to the extent LDA models can actually be evaluated. Simple things like lemmatizing, normalizing capitalization, treating named entities as single words, removing stopwords (including words that are not usually seen as stopwords but are just too frequent in a given corpus), etc., also make for models that are a lot more interpretable.
(Mar 06 '11 at 21:44)
Alexandre Passos ♦
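A rough sketch of the kind of preprocessing described in the comment above, assuming NLTK is available; the toy documents and the extra "too frequent in this corpus" stopword are invented for illustration:

    # Sketch of simple preprocessing for LDA: lowercasing, tokenization,
    # stopword removal, lemmatization. Toy documents and the extra corpus-
    # specific stopword ("tonight") are placeholders, not real data.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")

    docs = [
        "George Bush walked to the podium tonight.",
        "The senators walk out of the chamber tonight.",
    ]

    lemmatizer = WordNetLemmatizer()
    # standard stopword list, plus words that are merely too frequent in this corpus
    stop = set(stopwords.words("english")) | {"tonight"}

    def preprocess(text):
        # normalize capitalization and tokenize (so "tonight," becomes "tonight", ",")
        tokens = word_tokenize(text.lower())
        # drop punctuation and stopwords
        tokens = [t for t in tokens if t.isalpha() and t not in stop]
        # lemmatize; WordNet's lemmatizer defaults to the noun reading, so a
        # fuller pipeline would POS-tag first so that "walked" also maps to "walk"
        return [lemmatizer.lemmatize(t) for t in tokens]

    print([preprocess(d) for d in docs])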
I think the most direct answer to this is the answer Blei gives in his own lecture about it: he heavily emphasizes that LDA can be used in a broader spectrum of problems. So he mentions at the very beginning that he treats the documents as a bag of words, and thus does not really care about order or grammar. He does mention, though, that he gets rid of repetitive words, such as "the", "a", etc. (Sorry, I'm not familiar with NLP terminology.)

Treating documents/topics as bags of words is very typical -- but it doesn't entail not doing any preprocessing. The thing is that "what is a word" is not clear, and the pre-processing helps with that. For example, I am sure even Blei does tokenization and does not just split everything on whitespace (so that, e.g., "tonight," is split into "tonight" and ","). Similarly, you could argue that "took" "apart" is actually one word, "took_apart" (the same holds for named entities, e.g. "George_Bush"), and that "walk"/"walks"/"walked" are actually the same word/concept, "walk". These things help.

The reason many people (in the academic ML community) are not doing them is probably that (a) they are lazy, (b) it's more hackish and less "clean", and (c) no one can evaluate topic models anyhow, so no one "competes" against other methods based on model quality, and hence there is just no incentive for doing it.
(Mar 07 '11 at 15:40)
yoavg
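To make the "took_apart" / "George_Bush" point concrete, here is a minimal sketch that merges frequent collocations into single tokens with gensim's Phrases; the toy corpus and the min_count/threshold values are made up for illustration, and which pairs actually get merged depends on how often they co-occur relative to the threshold:

    # Sketch: merging frequent word pairs (multiword expressions, named entities)
    # into single tokens with gensim. Toy corpus and parameters are illustrative.
    from gensim.models import Phrases

    tokenized_docs = [
        ["george", "bush", "took", "apart", "the", "argument"],
        ["george", "bush", "walked", "away"],
        ["the", "senator", "took", "apart", "the", "bill"],
    ]

    # pairs that co-occur often enough get joined with "_" (e.g. "george_bush")
    bigram = Phrases(tokenized_docs, min_count=1, threshold=1.0)
    print([bigram[doc] for doc in tokenized_docs])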