|
I am very interested in the following question, which wasn't really discussed here, or was perhaps discussed elsewhere, in which case I would appreciate some pointers. The previous question focuses on sentiment analysis, where extracting quantifiers like "very good", "sort of good", "not so good", etc. is very important. My application is to extract semantic tags from my documents, and these tags can be compound nouns as well, like "machine learning", "software engineer", "data visualization", "feature selection", etc. I am thinking of expanding/refining the dictionary to contain these multi-term features, and then running LDA to extract the tags from the corpus. To expand the dictionary, I could add pairs of terms with high pointwise mutual information in my corpus. Any ideas, more elegant approaches, or suggestions would be appreciated. |
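To make the PMI idea concrete, here is a minimal sketch, assuming the corpus is already tokenized into lists of words. The function names (`high_pmi_bigrams`, `merge_bigrams`) and the thresholds `min_count` and `min_pmi` are illustrative choices, not part of any library. PMI is computed as log2(p(w1, w2) / (p(w1) p(w2))) over adjacent word pairs, with a frequency floor because PMI tends to overrate rare pairs.

```python
import math
from collections import Counter


def high_pmi_bigrams(docs, min_count=2, min_pmi=1.0):
    """Return {(w1, w2): pmi} for adjacent word pairs above both thresholds.

    docs is a list of token lists; min_count and min_pmi are illustrative
    thresholds that would need tuning on a real corpus.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for rare pairs
        p_pair = count / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        pmi = math.log2(p_pair / (p1 * p2))
        if pmi >= min_pmi:
            scored[(w1, w2)] = pmi
    return scored


def merge_bigrams(tokens, pairs):
    """Greedily join selected adjacent pairs into single tokens, left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

The merged token lists (with entries such as `machine_learning`) could then be passed to whatever LDA implementation is in use, so that the compound terms appear as single dictionary entries.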
|
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4470313 A topic model that uses selected informative n-grams. It doesn't look easy to implement, though. More generally, I think there are other methods for finding relevant n-grams that use old-school text processing (e.g. longest common subsequence) combined with basic statistics. I just don't know what they are. Various n-gram strength metrics are described and implemented in Pedersen's NSP package.
(Jul 04 '11 at 06:34)
yoavg
|
|
This software will allow you to explicitly (and fairly efficiently) generate all n-grams up to a given length from your corpus. |
It might make sense to not include shorter sequences for which longer terms are available: that is, not add "bag" when it occurs inside "bag of words".
The simplest thing I can think of is to greedily do one pass over the corpus computing all n-grams, threshold them by frequency, remove the surviving n-grams from the text, and then repeat the process for (n-1)-grams.
I did this, and I am satisfied with the results.
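A rough sketch of that greedy pass, assuming the documents are already tokenized into lists of words. The function name `greedy_ngrams` and the parameters `max_n` and `min_count` are illustrative, and this is one possible reading of the procedure rather than the exact code used. Accepted n-grams are cut out of the text, so shorter sequences are only counted where no accepted longer term covers them. The loop stops at bigrams and leaves single words to the ordinary dictionary.

```python
from collections import Counter


def greedy_ngrams(docs, max_n=3, min_count=2):
    """Collect frequent n-grams from longest to shortest.

    docs is a list of token lists. Returns a list of n-gram tuples,
    longest first. min_count is an illustrative frequency threshold.
    """
    # Each document starts as one segment. Removing an accepted n-gram
    # splits a segment, so later passes cannot count words across the gap.
    segments = [list(d) for d in docs]
    kept = []
    for n in range(max_n, 1, -1):
        counts = Counter()
        for seg in segments:
            for i in range(len(seg) - n + 1):
                counts[tuple(seg[i:i + n])] += 1
        frequent = {g for g, c in counts.items() if c >= min_count}
        kept.extend(sorted(frequent))

        # Remove accepted n-grams from the text before the next pass.
        next_segments = []
        for seg in segments:
            start = 0
            i = 0
            while i <= len(seg) - n:
                if tuple(seg[i:i + n]) in frequent:
                    if i > start:
                        next_segments.append(seg[start:i])
                    i += n
                    start = i
                else:
                    i += 1
            if start < len(seg):
                next_segments.append(seg[start:])
        segments = next_segments
    return kept
```

With this scheme, "bag of" is not added when it only occurs inside "bag of words", while a pair such as "paper bag" can still be picked up from the remaining text.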