Just out of curiosity, has anyone wondered when people started using stop word lists / function word lists? (For the difference between the two, please see the question asked before.) Recently, while reading Ani Nenkova's automatic summarization survey (btw, full of awesomeness!), I noticed the paper mentions that Luhn used a stop word list in his seminal document summarization work in 1958. Basically everywhere I look, I see stop word lists. For example, in LDA the results are full of meaningless function words like "the" and "a", so a stop word list is applied beforehand to remove them. I sort of take it for granted (although I don't like the idea of it). As far as I know, Luhn's work is the earliest to adopt a stop word list.
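For anyone unfamiliar with the preprocessing step mentioned above, here is a minimal sketch of stop word filtering before topic modeling. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are made up for illustration; real lists are much longer and vary by toolkit.

```python
# Tiny illustrative stop word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens with stop words dropped (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Whitespace tokenization is enough for this toy example.
doc = "The summary of a document is built in the usual way".split()
filtered = remove_stop_words(doc)
print(filtered)
```

The filtered token list is what would then be handed to LDA or a summarizer in place of the raw tokens.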
I've actually found that stopword removal isn't always necessary for LDA. Whenever I've neglected to do it, I've ended up with one or two topics containing most of the common stopwords, effectively segregating them out from the interesting topics.
@Kevin: this is especially true if you re-estimate the base measure of the document-topic Dirichlet priors and/or allow one topic to have a different prior pseudocount than the others.
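A rough sketch of what an asymmetric document-topic prior does, assuming (my reading, not stated above) that one topic is given a larger pseudocount so it can soak up the common words. The alpha values and the helper names `prior_mean` and `sample_dirichlet` are hypothetical, and this only illustrates the prior itself, not LDA inference.

```python
import random


def prior_mean(alpha):
    """Expected topic proportions under Dirichlet(alpha): alpha_k / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def sample_dirichlet(alpha, rng):
    """Draw one topic-proportion vector by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


symmetric = [0.5, 0.5, 0.5, 0.5]   # every topic has the same pseudocount
asymmetric = [2.0, 0.5, 0.5, 0.5]  # topic 0 gets a larger pseudocount (assumed values)

print(prior_mean(symmetric))
print(prior_mean(asymmetric))

rng = random.Random(0)  # seeded so the draw is repeatable
theta = sample_dirichlet(asymmetric, rng)
print(theta)
```

Under the asymmetric prior, topic 0 is expected a priori to take a larger share of every document, which is the kind of topic that tends to collect the stopwords.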
@Kevin are you talking about vanilla LDA or some improvement you made?
@Alexandre: noted. I will try your approach later, thank you.