I am interested in the following question, which was not really addressed in the previous discussion; if it has been discussed elsewhere, I would appreciate some pointers. The previous question focused on sentiment analysis, where extracting quantifiers like "very good", "sort of good", "not so good", etc. is very important. My application is to extract semantic tags from my documents, and these tags can also be compound nouns, like "machine learning", "software engineer", "data visualization", "feature selection", etc. I am thinking of expanding/refining the dictionary to contain these multi-term features and then running LDA to extract the tags from the corpus. To expand the dictionary, I could add pairs of terms that have high pointwise mutual information in my corpus.
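To make the PMI idea concrete, here is a minimal sketch of scoring adjacent word pairs, assuming the corpus is already tokenized into lists of words. The function name `pmi_bigrams` and the `min_count` cutoff are illustrative assumptions, not part of any library; a count cutoff is included because PMI tends to overrate very rare pairs.

```python
import math
from collections import Counter


def pmi_bigrams(docs, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    docs: list of token lists (hypothetical input format).
    Returns {(w1, w2): pmi} for pairs seen at least min_count times.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimate is unreliable
        p_pair = count / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return scores


docs = [
    "machine learning is fun".split(),
    "machine learning needs data".split(),
    "data is fun".split(),
]
# Pairs sorted from highest to lowest PMI; the top ones are dictionary candidates.
ranked = sorted(pmi_bigrams(docs).items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

The highest-scoring pairs could then be joined into single dictionary entries (for example "machine_learning") before running LDA.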

Any ideas, more elegant approaches, or other suggestions would be appreciated.

asked Jun 10 '11 at 07:12

Oliver Mitevski


It might make sense to not include shorter sequences for which longer terms are available: that is, not add "bag" when it occurs inside "bag of words".

The simplest thing I can think of is a greedy approach: do one pass over the corpus computing all n-grams, threshold them by frequency, remove the surviving n-grams from the text, and then repeat the process for (n-1)-grams.
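A rough sketch of that greedy pass, assuming tokenized documents and a plain count threshold. The names `greedy_ngrams`, `max_n`, and `min_count` are made up for illustration. Accepted n-grams are blanked out with `None` so that their words no longer contribute to shorter n-grams.

```python
from collections import Counter


def greedy_ngrams(docs, max_n=3, min_count=2):
    """Collect frequent n-grams, longest first, removing each accepted
    n-gram from the text before counting shorter ones.

    docs: list of token lists (hypothetical input format).
    Returns {ngram_tuple: count}.
    """
    docs = [list(tokens) for tokens in docs]  # work on copies
    accepted = {}
    for n in range(max_n, 0, -1):
        # Count all n-grams that do not overlap an already removed span.
        counts = Counter()
        for tokens in docs:
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if None not in gram:
                    counts[gram] += 1

        # Threshold on frequency.
        frequent = {g for g, c in counts.items() if c >= min_count}
        accepted.update((g, counts[g]) for g in frequent)

        # Remove accepted n-grams from the text (left to right).
        for tokens in docs:
            i = 0
            while i <= len(tokens) - n:
                if tuple(tokens[i:i + n]) in frequent:
                    tokens[i:i + n] = [None] * n
                    i += n
                else:
                    i += 1
    return accepted


docs = [
    "bag of words model".split(),
    "a bag of words".split(),
    "my bag".split(),
    "your bag".split(),
]
print(greedy_ngrams(docs))
```

In this toy corpus "bag" is only counted where it occurs outside "bag of words", which is the behaviour described above. Frequency is the simplest threshold; an association score could be swapped in instead.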

(Jun 10 '11 at 09:15) Alexandre Passos ♦

I did this, and I am satisfied with the results.

(Jun 28 '11 at 11:08) Oliver Mitevski

2 Answers:


Topic model, but with selected informative n-grams being used. Doesn't look easy to implement though.

More generally, I think there are other methods for finding relevant n-grams that use old-school text processing (e.g. longest common subsequence) combined with basic statistics. I just don't know what they are.

answered Jun 14 '11 at 23:23

Jacob Jensen

Various ngram-strength metrics are described and implemented in Pedersen's NSP package.

(Jul 04 '11 at 06:34) yoavg

This software will allow you to explicitly (and fairly efficiently) generate all n-grams up to given length from your corpus.

answered Jun 14 '11 at 11:20

Georgiana Ifrim

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.