|
Given two corpora, A and B (say, content from grade 3 science and grade 4 science), what would be the best way to identify n-grams (keywords) that uniquely or clearly distinguish one corpus from the other? I have been experimenting with different methods, including a language-model-based approach that uses a "foreground" and a "background" corpus (http://acl.ldc.upenn.edu/W/W03/W03-1805.pdf). Another method I was considering is to determine the most discriminative "features", which could be built from n-gram tf-idf scores etc., and use these as the set of keywords. Are there any ideas or prior work that could be useful for this task? |
|
You can treat this as a classification/feature selection problem. First, extract n-grams, then (optionally) use tf-idf or a similar weighting, then apply any feature selection algorithm. For example, run a chi² test for term/corpus dependence and keep the top k n-grams by test statistic (those most dependent on the corpus), or fit a linear classifier such as a linear SVM or logistic regression and keep the n-grams with the largest-magnitude weights (coefficients); the sign of a weight indicates which corpus the n-gram points to. |
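
A minimal sketch of the chi² route, using only the standard library. It assumes each corpus is a non-empty list of document strings, tokenizes on whitespace, and scores each n-gram by document presence/absence in a 2x2 contingency table (no tf-idf weighting). The names `ngrams` and `chi2_keywords` are made up for this illustration; the linear-classifier route is not shown.

```python
from collections import Counter


def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) found in one document."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def chi2_keywords(docs_a, docs_b, k=5, n_max=2):
    """Rank n-grams by a chi-squared test of n-gram presence vs. corpus.

    Assumes both corpora are non-empty. Returns up to k tuples of
    (ngram, corpus_label, score), highest score first.
    """
    # Document frequency: each document counts an n-gram at most once.
    df_a = Counter(g for doc in docs_a for g in ngrams(doc, n_max))
    df_b = Counter(g for doc in docs_b for g in ngrams(doc, n_max))
    n_a, n_b = len(docs_a), len(docs_b)
    total = n_a + n_b

    scored = []
    for gram in set(df_a) | set(df_b):
        a, b = df_a[gram], df_b[gram]   # docs containing the n-gram
        c, d = n_a - a, n_b - b         # docs not containing it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            # The n-gram occurs in every document, so it cannot discriminate.
            continue
        score = total * (a * d - b * c) ** 2 / denom
        # Label with the corpus where the n-gram is relatively more frequent.
        label = "A" if a / n_a > b / n_b else "B"
        scored.append((score, gram, label))

    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(gram, label, score) for score, gram, label in scored[:k]]


if __name__ == "__main__":
    grade3 = ["plants need sunlight and water", "water helps plants grow"]
    grade4 = ["electric circuits need a battery", "circuits can be open or closed"]
    for gram, label, score in chi2_keywords(grade3, grade4, k=4):
        print(f"{gram!r} -> corpus {label} (chi2 = {score:.2f})")
```

With many short documents per corpus this tends to surface topic words, while n-grams shared evenly by both corpora score near zero. If scikit-learn is available, `CountVectorizer(ngram_range=(1, 2))` together with `sklearn.feature_selection.chi2` offers a comparable scorer over a document-term matrix.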