I am very interested in a question like the following, which wasn't really discussed, or maybe was discussed elsewhere, in which case I would appreciate some pointers. The previous question focuses on sentiment analysis, where extracting qualifiers like "very good", "sort of good", and "not so good" is important. My application is to extract semantic tags from my documents, which could be compound nouns as well, like "machine learning", "software engineer", "data visualization", "feature selection", etc. I am thinking of expanding/refining the dictionary to contain these multi-term features, and then running LDA to extract the tags from the corpus. To expand the dictionary, I could add pairs of terms with high pointwise mutual information in my corpus.
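To make the PMI idea concrete, here is a minimal pure-Python sketch of scoring adjacent word pairs by pointwise mutual information; the toy corpus, the function name, and the `min_count` threshold are my own illustrative choices, not anything from the question:

```python
import math
from collections import Counter

def pmi_bigrams(docs, min_count=2):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from
    unigram and adjacent-bigram counts over the tokenized corpus.
    Pairs seen fewer than min_count times are dropped, since PMI
    is notoriously unreliable for rare events."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

corpus = [
    "machine learning and data visualization",
    "feature selection for machine learning",
    "a software engineer doing machine learning",
    "data visualization and feature selection",
]
scores = pmi_bigrams(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs that survive the threshold and score highly ("machine learning", "feature selection", "data visualization") are exactly the candidates you would add to the dictionary as multi-term features.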
Any ideas, more elegant approaches, or suggestions would be appreciated.
asked Jun 10 '11 at 07:12
A topic model, but with selected informative n-grams used as terms. It doesn't look easy to implement, though.
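One lightweight way to approximate this without touching the topic model itself is to merge the selected n-grams into single tokens as a preprocessing step, so a standard unigram LDA sees "machine learning" as one term. A sketch, where the greedy bigram merging and the underscore-joining convention are my own assumptions:

```python
def merge_ngrams(tokens, phrases):
    """Greedily replace known bigrams with a single joined token,
    so a standard unigram topic model treats them as one term.
    `phrases` is a set of (word, word) pairs, e.g. the high-PMI
    pairs selected earlier."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

phrases = {("machine", "learning"), ("feature", "selection")}
print(merge_ngrams("we apply machine learning to feature selection".split(), phrases))
# → ['we', 'apply', 'machine_learning', 'to', 'feature_selection']
```

The merged token stream can then be fed to any off-the-shelf LDA implementation unchanged.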
More generally, I think there are other methods for finding relevant n-grams that combine old-school text processing (e.g. longest common subsequence) with basic statistics; I just don't know what they are.
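The answer doesn't say how longest common subsequence would be applied here; one plausible reading is to run the classic dynamic-programming LCS over token lists rather than characters, so that multi-word spans shared between documents surface as phrase candidates. A sketch under that assumption, with made-up example sentences:

```python
def lcs(a, b):
    """Classic dynamic-programming longest common subsequence,
    over token lists rather than characters. Returns one LCS as
    a list of tokens; shared multi-word runs between documents
    are candidate compound terms."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack through the table to recover one LCS.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

a = "we study machine learning methods".split()
b = "they apply machine learning methods daily".split()
print(lcs(a, b))  # → ['machine', 'learning', 'methods']
```

Frequent shared subsequences could then be filtered by the "basic statistics" the answer alludes to, e.g. a PMI or frequency threshold.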
answered Jun 14 '11 at 23:23