I'm trying to implement information gain ratio to measure how much each variable contributes to class membership in a naive Bayes classifier.

I hope to use this for both weighting and to find which variables I can (safely) ignore.

However, I have found two different definitions of information gain: one on the information gain ratio Wikipedia page, and another on the Kullback-Leibler divergence (a.k.a. information gain) page.

Assuming I have my distributions for each class (in my case, needle and haystack), what's the correct way to implement IGR? Is there better material than Wikipedia available? Google could not find it for me.

asked Feb 05 '13 at 13:56


Steven

edited Feb 05 '13 at 14:09


Oscar Täckström


One Answer:

Information gain (IG) and information gain ratio (GR) are two different, but related, functions. Here is the paper that introduces GR; if you read through it you'll understand the motivation and be able to decide for your setting which of IG or GR suits your needs:

Quinlan. Induction of Decision Trees. Machine Learning 1(1):81--106, 1986.

IG and GR are also described in standard machine learning text books that cover decision tree learning (e.g., Mitchell's "Machine Learning").
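To make the relationship concrete, here is a minimal sketch of IG and GR for a single categorical feature, using plug-in (count-based) probability estimates; the function and variable names are illustrative, not taken from either paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of labels, estimated from counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(class) - H(class | feature), from parallel sequences of
    feature values and class labels."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def gain_ratio(feature_values, labels):
    """GR = IG / split-info, where split-info is the entropy of the
    feature's own value distribution (penalizes many-valued features)."""
    split_info = entropy(feature_values)
    return information_gain(feature_values, labels) / split_info if split_info > 0 else 0.0
```

Note that with this estimate, IG is exactly the mutual information between the feature and the class, which in turn is the expected KL divergence between P(class | feature = v) and P(class); so the two Wikipedia definitions you found coincide. GR differs only in dividing by the feature's own entropy.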

There is an alternative split criterion that I suggest you also look at; although it is slightly harder to compute, it should be better behaved in situations where classes are extremely imbalanced:

Martin. An Exact Probability Metric for Decision Tree Splitting and Stopping. Machine Learning 28(2):257--291, 1997.

answered Feb 08 '13 at 12:33


Art Munson

Art, thanks for the answer and the papers. Indeed, my data is very imbalanced: my target class is in the single-digit percentages, with quite noisy features. (It is, however, very amenable to enrichment via external data.)

(Feb 08 '13 at 19:36) Steven