The measure to calculate: entropy / information gain

Data: Categorical / Ordinal data with 10 - 40 different values per Attribute, a binary label (two classes)

Problem: Either a total small amount of data or a heavy class skew (e.g. class1 = 0.05).

When I calculate the entropy, I know there is a certain error included, simply because the formula uses the maximum likelihood. My Question is now: Is there a way (or even a software package :) ) to approach the calculation the bayesian way and create a "entropy posterior distribution" ? If not, how would you attack this problem ?

Edit: Sorry, I forgot to state my goal

Measure the expressiveness of attributes according to different labels under the given data conditions to get a rough forecast whether the label can be learned or not.

asked Jul 15 '10 at 04:35

steffen's gravatar image

steffen
6113

edited Jul 16 '10 at 02:47

Why do you want such a regularization? What are you using this information gain value for, exactly? If you're trying to build a decision tree (which I think you are, by the shape of the question), this shouldn't be a problem.

(Jul 15 '10 at 10:11) Alexandre Passos ♦

Yes, I use this measure for measuring the importance of attributes to build a classification model.

100 examples => 95 negative and 5 positive, categorical attributes with 40 values. Of course I can calculate the entropy. But who tells me that the best entropy I got is not a fluke and the attribute has actually no expressiveness at all. When dealing with binomials, I sometimes use the 20%-percentile of the beta-distribution instead of maximum likelihood to incorporate the size of the "base". I wondered whether this is possible with entropy, too.

(Jul 16 '10 at 02:39) steffen

3 Answers:

If I understand correctly, you're looking for a metric that quantifies the ability of an attribute to effectively separate two classes in the presence of high class imbalance / class skew.

As Alexandre mentions, decision tree induction algorithms have to solve this problem... they often choose, at each level, the split that "best" separates the two classes according to some metric of effectiveness (such as information gain or difference in entropy). There are a number of practical considerations that you may or may not already know.

Information Gain is one of the most popular metrics for quantifying the "effectiveness" of an attribute. However, it is A) affected by class skew, as you have noted, and B) tends to be weaker in the presence of highly-branching nominal attributes, which you have.

In order to address the issue of highly-branching nominal attributes, the C4.5 algorithm computes gain ratio rather than information gain. The gain ratio is simply the information gain divided by the entropy of the split itself. I can go into details if you need, but Wikipedia has a page on gain ratio. This normalization controls for the fact that very-highly-branching attributes are likely to have high information gain just by chance.

To deal with the difficulties posed by class skew, there have been skew-insensitive decision-tree splitting criteria developed. This paper describes two of them in detail: the so-called DKM criterion (developed elsewhere) and the Helliger distance metric. Hellinger distance is completely skew-insensitive, which means that its assessment of feature effectiveness is unaffected by class skew. However, it has a similar weakness to information gain in the presence of highly-branching nominal attributes.

To get around this limitation, you could evaluate the Hellinger distance for all possible binary splits of a given nominal attribute and choose the maximum of these values as your answer (if this makes any sense).

Hope this helps.

Reference: D.A. Cieslak, N.V. Chawla Learning Decision Trees for Unbalanced Data ECML 2008

answered Jul 19 '10 at 00:23

Troy%20Raeder's gravatar image

Troy Raeder
72071721

Keeping as close to your original question and approach you can perform the same calculation but smooth the zero counts using, say, Good-Turing to help manage the sparseness.

answered Jul 17 '10 at 07:53

gordon007's gravatar image

gordon007
163

I'm going to agree with Alexandre that I'm not exactly sure why you want to do this, but I'll take a stab at it.

It seems to me that your question focuses on the estimation of p(x), not the calculation of entropy.

H(X) = - sum p(x) log p(x) where x in X

the traditional (frequentist) idea is to calculate p(x) as N(x) / N(X).

If you want to be bayesian about it, use a posterior estimate of p(x) = p(x|alpha)p(alpha) using a Dirichlet prior rather than a point estimate.

I've never done this and I don't know what you'll get. But I don't see why it wouldn't work...

answered Jul 15 '10 at 11:23

Andrew%20Rosenberg's gravatar image

Andrew Rosenberg
156252135

edited Jul 15 '10 at 11:24

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.