I'm wondering why people use negative log scores, instead of just log scores? (I mean log is used in the first place to avoid underflow, but the minus sign doesn't really do anything.)

So people deal with "costs" instead of "weights". Is there any reason actually? Is it just because traditionally we talk about the "lightest derivation" and "shortest path", which is based on "smaller is better"?
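To make the duality concrete, here is a minimal sketch (the path probabilities are made up) of the three equivalent ways of scoring a derivation/path:

    import math

    # Made-up probabilities along one hypothetical derivation/path.
    probs = [0.9, 0.2, 0.5, 0.7]

    product = math.prod(probs)                    # raw probability; underflows for long paths
    log_score = sum(math.log(p) for p in probs)   # log score: negative, "bigger is better"
    cost = sum(-math.log(p) for p in probs)       # negative log: positive, "smaller is better"

    # All three rank derivations identically: max product == max log score == min cost.
    assert math.isclose(math.exp(log_score), product)
    assert math.isclose(cost, -log_score)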

asked Mar 17 '11 at 18:01

Frank

I think it's convenience, yes. In entropy, for example, it might feel weird to talk about -2 bits of information, so the definition is negated. Intuitively, the more information something carries, the more negative $\sum_i p_i \log p_i$ becomes, which doesn't read sensibly, so the sign is flipped.

(Mar 17 '11 at 18:10) Alexandre Passos ♦
I don't think that for entropy the reason for the negative sign is convenience. The Shannon information content of an outcome is defined as $h(x) = \log_2(1/P(x))$. The more improbable an outcome is, the higher its information content. The minus sign comes just from the properties of the log. See this: http://tinyurl.com/4u9rhsp

(Mar 17 '11 at 22:31) Alejandro

@Alejandro: he's asking why people negate the result instead of just using the negative result. The Shannon entropy formula for the information content in a message is $-\sum_{i} p(x_i) \ln(p(x_i))$, over all possible messages $x_i$ in the sample space of messages under observation. See http://en.wikipedia.org/wiki/Entropy_(information_theory) for a better description.

(Mar 18 '11 at 03:20) Brian Vandenberg
Please rethink your answer, Brian. E.g. consider $E[h(x)]$, with $h(x) = \log_2(1/p(x))$ (as defined by Alejandro) and $E[\cdot]$ the expectation with respect to $p(x)$.

(Mar 18 '11 at 07:40) osdf
@Brian: All I am saying is that, for the particular case of entropy, the minus sign has nothing to do with convenience.

(Mar 18 '11 at 09:24) Alejandro

@Alejandro: Good point. I'll readily admit I didn't read your comment carefully enough.

(Mar 18 '11 at 12:38) Brian Vandenberg

@Alejandro: I think that the object people are taking the log of is a probability, so it really is a case of $-\log(p)$.

(Aug 24 '11 at 14:11) Neil

4 Answers:

I think it also has to do with a historical difference between the neural network community and the statistics community. Somehow, the former often left out the probabilistic interpretation of the objective function and talked about minimizing an error function rather than maximizing a likelihood function. This is more analogous to, say, driving the classification error down to 0. In statistics it seems more common to think about maximizing the likelihood function and to omit the minus sign. If you want to show that your model assigns a high likelihood to a dataset, it makes sense to talk about maximization; if you want to say that your model is precise at predicting new data, it might make more sense to talk about error.

These conventions seem as arbitrary as talking about the percentage of correct classifications vs the percentage of incorrect classifications. Perhaps it also plays a role that many off-the-shelf optimizers expect a minimization problem, with a history in which the cost might be, for example, the amount of material used to construct something. This is just speculation though...

answered Mar 18 '11 at 08:24

Philemon Brakel

"Negative log probability" is the surprisal. (An equivalent way to write entropy is the expected surprisal.)

answered Aug 24 '11 at 14:09

Neil

edited Aug 25 '11 at 03:15

It also falls out of the MDL approach. As Alexandre Passos pointed out, the negative log likelihood is exactly the number of bits you need to encode an observation x if you use the code corresponding to p(x).

There is actually a nice correspondence between the length of the code you need to encode x and its probability distribution. Say your data is distributed according to p. Then an optimal compressor would compress x to a length of $-\log p(x)$ bits. For further information, check out Grünwald's tutorial on MDL and search for "probability mass functions are code length functions".

Thus, it makes much more sense to minimize the number of bits needed to encode x than to maximize the negative number of bits needed. :)
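A small illustration of that correspondence (the alphabet and probabilities are illustrative; real coders only approximate these lengths):

    import math

    # Under a code matched to p, symbol x gets a codeword of about -log2 p(x) bits.
    p = {"a": 0.5, "b": 0.25, "c": 0.25}
    bits = {x: -math.log2(q) for x, q in p.items()}    # {'a': 1.0, 'b': 2.0, 'c': 2.0}

    # The negative log likelihood of a message equals its total encoded length.
    message = "aabac"
    nll_bits = sum(-math.log2(p[x]) for x in message)  # 1 + 1 + 2 + 1 + 2 = 7.0 bits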

answered Mar 18 '11 at 07:49

Justin Bayer

It's probably just a psychological thing in most cases, as Alex pointed out.

The net effect is one of semantics. If the log loss is strictly a sum of log probabilities, the result will be a negative quantity, and whether or not you negate it merely determines whether gradient learning adds or subtracts the parameter gradient.
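A minimal sketch of that last point, assuming a one-parameter Bernoulli model (the data, learning rate, and names are all illustrative):

    def grad_log_lik(theta, data):
        # Gradient of sum_i log Bernoulli(x_i; theta) with respect to theta.
        return sum(1.0 / theta if x == 1 else -1.0 / (1.0 - theta) for x in data)

    data, theta, lr = [1, 1, 0, 1], 0.5, 0.01

    # Ascent on the log likelihood...
    theta_up = theta + lr * grad_log_lik(theta, data)
    # ...is the identical update to descent on the negative log likelihood.
    theta_down = theta - lr * -grad_log_lik(theta, data)
    assert theta_up == theta_down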

This answer is marked "community wiki".

answered Mar 17 '11 at 19:35

Brian Vandenberg

edited Mar 18 '11 at 17:57

Hmm. So maximize or minimize p(x)? I don't understand. You probably are talking about the likelihood of an observed dataset? (Which you definitely want to maximize.) I also don't understand what your last two sentences are trying to say.

(Mar 18 '11 at 07:53) osdf

What's not to understand? You have a sample (or set of samples) x. Do you want to increase the probability of encountering them, or decrease it? For example, in Boltzmann machines you sample from two different probability distributions: one closely tied to the data (the 'data distribution'), and one representative of what the model is currently capable of recognizing/understanding/encoding (or whatever term applies). The weight update simultaneously increases the probability of samples from the data distribution and decreases the probability of samples from the distribution the model currently encodes.

(Mar 18 '11 at 12:17) Brian Vandenberg
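Schematically, that two-phase update has the familiar contrastive form; a numpy sketch, where the sampled states are random stand-ins rather than a real Gibbs chain:

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-ins for sampled states; a real run would get these from Gibbs sampling.
    v_data, h_data = rng.random(4), rng.random(3)    # clamped to the training data
    v_model, h_model = rng.random(4), rng.random(3)  # drawn from the model's own distribution

    lr = 0.1
    # Positive phase pulls probability toward the data; negative phase pushes it
    # away from what the model currently generates.
    dW = lr * (np.outer(v_data, h_data) - np.outer(v_model, h_model))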

(edited) It was poorly worded the first time. In my rush to finish, it came across as inflammatory/rude. Sorry, osdf. The last two sentences are pretty straightforward in the context of gradient learning with regard to a log loss. I kept it brief on the assumption that if the author of the question was familiar with log losses, my [admittedly terse] explanation would suffice.

(Mar 18 '11 at 12:21) Brian Vandenberg

re: negative votes, do you think my answer didn't apply, needs more detail, or what? Any constructive criticism welcome.

(Mar 18 '11 at 15:50) Brian Vandenberg

I did not downvote you, but there are some issues. I dislike being downvoted without knowing why as well, so here are my thoughts. For one, the gradient thing is not really necessary (there are methods where no gradient is used anywhere). Then, you differentiate with respect to the space of your observations, which is weird. Furthermore, you (to my knowledge) never minimize a probability. There perhaps is a method where it's done, but... And in RBMs you sample from the data and from the model, but you follow a gradient that maximizes the likelihood of the data.

(Mar 18 '11 at 16:39) Justin Bayer

On the gradient, you're absolutely right. I didn't mean to give the impression that gradient learning is the only way. Re: the parameter mismatch, you're right. I'll edit/fix it; that was a dumb mistake. Other than the Boltzmann machine case, I've not seen a case where you'd choose to minimize; it was an irrelevant point I'll remove from the answer. However, I don't see why it wouldn't be a good idea to try to minimize over some set of samples (B) related to but not part of your training set (A), in an effort to teach the model to ignore Bs and be sensitive to As. There may be a far better way to go about it than what I hinted at... but as I said, it's an irrelevant point and I should remove it. I hate not being able to use newlines in comments.

(Mar 18 '11 at 17:43) Brian Vandenberg

Actually, the minimization thing you talk about is one of the ideas of contrastive learning: you pick items from your input space by walking down the energy landscape around your training samples and increase the energy of those. However, that is only done to maximize the probability of your real training set.

(Mar 18 '11 at 18:12) Justin Bayer

Yes, exactly. While it's used with the intent of improving the model's ability to [recognize, generate samples from, 'believe in'] the data distribution, it's simultaneously increasing the likelihood of generating data-dist samples and decreasing the likelihood of generating samples that don't fit the data distribution. Though, I think other cases could make sense. For example, I worked on an engine whose primary use was to determine whether employees were looking at porn or not. I didn't care about classifying cooking websites, but if I didn't train on that category then websites about cupcakes were often rated as porn.

(Mar 18 '11 at 18:33) Brian Vandenberg

Brian, I apologize for downvoting without giving any constructive feedback. I just had severe troubles following your notation, which you now have removed any way, so I have removed the downvote as well.

(Mar 18 '11 at 19:20) Oscar Täckström

No problem. As soon as we can use latex here, that problem should go away.

(Mar 18 '11 at 19:22) Brian Vandenberg