Given two corpora, A and B (say, content from grade 3 science and grade 4 science), what would be the best way to identify n-grams (keywords) that uniquely or clearly distinguish one corpus from the other? I have been experimenting with different methods, including a language-model-based approach that uses a "foreground" and a "background" corpus (http://acl.ldc.upenn.edu/W/W03/W03-1805.pdf). Another method I was considering is to determine the most discriminative "features", which could be built from n-gram tf-idf scores and the like; these would then be the set of keywords. Are there any ideas or prior work that could be useful for this task?

asked May 06 '13 at 02:10

dancoder


One Answer:

You can treat this as a classification/feature selection problem.

First, extract n-grams, then (optionally) weight them with tf-idf or a similar scheme, then apply any feature selection algorithm. For example, run a chi² test for term/corpus dependence and keep the top k n-grams by test statistic (those that are most dependent on the corpus), or fit a linear classifier such as a linear SVM or logistic regression and keep the n-grams with the largest absolute weights (coefficients) in the classifier; with two corpora, the sign of a weight tells you which corpus the n-gram points to.
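To make the chi² route concrete, here is a minimal sketch in plain Python. It is only one way to do it, under some assumptions that are mine and not part of the answer above: each corpus is a list of document strings, tokenization is lowercase whitespace splitting, n-grams are counted by presence per document (not tf-idf), and the names `ngrams`, `doc_freq` and `chi2_keywords` are made up for illustration. A library feature selector could be swapped in for the hand-written statistic.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """Yield all word n-grams of length 1..n_max as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def doc_freq(docs, n_max=2):
    """Count in how many documents each n-gram occurs (presence, not frequency)."""
    df = Counter()
    for doc in docs:
        # Assumption: naive lowercase + whitespace tokenization.
        df.update(set(ngrams(doc.lower().split(), n_max)))
    return df


def chi2_keywords(docs_a, docs_b, k=10, n_max=2):
    """Return the top-k (chi2, ngram, side) tuples, where side is 'A' or 'B'.

    For each n-gram a 2x2 table is built (n-gram present/absent vs. corpus A/B)
    and scored with the Pearson chi-squared statistic
        N * (ad - bc)^2 / ((a + b)(c + d)(a + c)(b + d)).
    """
    df_a, df_b = doc_freq(docs_a, n_max), doc_freq(docs_b, n_max)
    n_a, n_b = len(docs_a), len(docs_b)
    n = n_a + n_b
    scored = []
    for term in set(df_a) | set(df_b):
        a, b = df_a[term], df_b[term]   # documents containing the n-gram
        c, d = n_a - a, n_b - b         # documents not containing it
        denom = (a + b) * (c + d) * n_a * n_b
        if denom == 0:
            # The n-gram is in every document (or a corpus is empty):
            # it carries no information about corpus membership.
            continue
        chi2 = n * (a * d - b * c) ** 2 / denom
        side = "A" if a / n_a > b / n_b else "B"
        scored.append((chi2, term, side))
    # Highest statistic first; ties broken alphabetically for stable output.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    corpus_a = ["plants need sunlight", "plants need water", "animals need food"]
    corpus_b = ["electric circuits need batteries", "magnets attract iron",
                "electric current flows"]
    for score, term, side in chi2_keywords(corpus_a, corpus_b, k=5):
        print(f"{score:.2f}  {side}  {term}")
```

With real corpora you would probably want a proper tokenizer, a minimum document-frequency cutoff (the statistic is unreliable for very rare n-grams), and more than a handful of documents per corpus.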

answered May 06 '13 at 05:03

larsmans


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.