Just out of curiosity, has anyone wondered when people started using stop word lists / function word lists? (For the difference between the two, please see the question asked before.)

Recently, while reading Ani Nenkova's automatic summarization survey (btw, full of awesomeness!), I saw it mention that Luhn used a stop word list in his seminal 1958 work on document summarization.

Basically everywhere I look, I see stop word lists. For example, in LDA the results are full of meaningless function words like "the" and "a", so a stop word list is applied beforehand to remove them. I sort of take it for granted (although I don't like the idea of it).

As far as I know, Luhn's work is the earliest to adopt a stop word list.
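For anyone unfamiliar with the preprocessing step mentioned above, here is a minimal sketch of stop word filtering before topic modeling. The tiny `STOP_WORDS` set, the helper name `remove_stop_words`, and the example documents are all made up for illustration; real lists are much longer and this is not Luhn's procedure or any particular toolkit's API.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens with stop words dropped (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Toy documents, already split into tokens on whitespace.
docs = [
    "The cat sat in the garden".split(),
    "A summary of the document is short".split(),
]

# Filter each document, then count the remaining word types.
filtered = [remove_stop_words(d) for d in docs]
counts = Counter(t.lower() for d in filtered for t in d)
```

The filtered token lists are what would then be handed to LDA or any other bag-of-words model, so function words never get the chance to dominate the topics.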

asked Dec 07 '11 at 08:15

Zhibo Xiao


I've actually found that stopword removal isn't always necessary for LDA. Whenever I've neglected to do it, I've ended up with one or two topics containing most of the common stopwords, effectively segregating them out from the interesting topics.

(Dec 07 '11 at 14:45) Kevin Canini

@Kevin: this is especially true if you re-estimate the base measure of the document-topic Dirichlet priors and/or allow one topic to have a larger prior pseudocount than the others.

(Dec 07 '11 at 17:27) Alexandre Passos ♦
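
To make the pseudocount idea in the comment above concrete, here is a rough sketch of an asymmetric document-topic Dirichlet prior. The function names, the base value 0.1, and the boost factor 5 are assumptions chosen for illustration, not settings from the thread. The only fact used is that the mean of a Dirichlet with parameters alpha is alpha_k / sum(alpha).

```python
def asymmetric_alpha(num_topics, base=0.1, boosted_topic=0, boost=5.0):
    """Build a prior vector where one topic gets a larger pseudocount.

    base, boosted_topic and boost are illustrative assumptions.
    """
    alpha = [base] * num_topics
    alpha[boosted_topic] = base * boost
    return alpha


def prior_mean(alpha):
    """Expected topic proportions under Dirichlet(alpha): alpha_k / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


alpha = asymmetric_alpha(4)
mean = prior_mean(alpha)
```

A priori, the boosted topic is expected to take a larger share of every document, which is one way a single topic can end up absorbing the very common words.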

@Kevin are you talking about vanilla LDA or some improvement you made?

(Dec 07 '11 at 19:15) Zhibo Xiao

@Alexandre noted, I will try your way later, thank you

(Dec 07 '11 at 19:16) Zhibo Xiao

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.