I'm wondering why people use negative log scores instead of just log scores. (I mean, people then deal with "costs" instead of "weights".) Is there actually a reason? Is it just because traditionally we talk about the "lightest derivation" and the "shortest path", which are based on "smaller is better"?
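To make the duality I have in mind concrete (writing $d$ for a derivation or path and $e$ for its edges/rules, my own notation): negating the log turns a product of probabilities into a sum of nonnegative costs, which has exactly the shortest-path shape,

$$
\arg\max_{d} \prod_{e \in d} p(e) \;=\; \arg\max_{d} \sum_{e \in d} \log p(e) \;=\; \arg\min_{d} \sum_{e \in d} \bigl(-\log p(e)\bigr),
$$

where each cost $-\log p(e) \ge 0$ because $0 < p(e) \le 1$.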
I think it also has to do with a historical difference between the neural network community and the statistics community. Somehow, the former often left out the probabilistic interpretation of the objective function and talked about minimizing an error function as opposed to maximizing a likelihood function, which is more analogous to driving, say, the classification error down to 0. In statistics it seems more common to think about maximizing the likelihood function and omitting the minus sign. If you want to show that your model assigns a high likelihood to a dataset, it makes sense to talk about maximization; if you want to say that your model is precise at predicting new data, it might make more sense to talk about error. These conventions seem as arbitrary as quoting the percentage of correct classifications versus the percentage of incorrect classifications. Perhaps it also plays a role that many off-the-shelf optimizers expect a minimization problem, with a history in which the cost might be, for example, the amount of material used to construct something. This is just speculation, though...
"Negative log probability" is the surprisal. (An equivalent way to write entropy is the expected surprisal.) |
It also falls out of the MDL approach. As Alexandre Passos pointed out, the negative log likelihood is nothing but the number of bits you need to encode an observation x if you use the code corresponding to p(x). There is a nice correspondence between the length of the code you need to encode x and its probability distribution: say your data is distributed according to p; then an "optimal compressor for x" would compress x to a length of -log p(x) bits. For further information, check out Grünwald's tutorial on MDL and search for "probability mass functions are code length functions". Thus, it makes much more sense to minimize the number of bits needed to encode x than to minimize the negative of that number. :)
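A toy numerical sketch of the "code lengths" reading (the distribution below is made up; the point is just that $-\log_2 p(x)$ behaves like a code length and its expectation is the entropy):

```python
import math

# Hypothetical toy distribution over four symbols.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Ideal code length under the code matched to p: L(x) = -log2 p(x) bits.
# (0.5 -> 1 bit, 0.25 -> 2 bits, 0.125 -> 3 bits; a Huffman code achieves
# exactly these lengths for this particular distribution.)
for x, px in p.items():
    print(f"{x}: {-math.log2(px):.0f} bits")

# The expected code length equals the entropy of p (1.75 bits here).
entropy = -sum(px * math.log2(px) for px in p.values())
print(f"expected length / entropy: {entropy} bits")
```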
It's probably just a psychological thing in most cases, as Alex pointed out. The net effect is one of semantics. If the log loss is strictly a sum of log probabilities, the result will be a negative quantity. If gradient learning is used, the only difference negating the log loss (or not) makes is whether you add or subtract the parameter gradient.
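A minimal sketch of that sign bookkeeping (the Bernoulli setup and numbers are made up; only the add-versus-subtract point matters): ascending the log-likelihood and descending the negative log-likelihood produce the identical parameter update.

```python
import numpy as np

# Made-up Bernoulli example: data x in {0, 1}, model parameter theta.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100)

def log_likelihood(theta):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

def neg_log_likelihood(theta):
    return -log_likelihood(theta)

theta, lr, h = 0.3, 1e-3, 1e-6  # current parameter, step size, finite-difference width

# Gradient *ascent* on the log-likelihood ...
g_ll = (log_likelihood(theta + h) - log_likelihood(theta - h)) / (2 * h)
theta_ascent = theta + lr * g_ll

# ... gives the same update as gradient *descent* on the negative log-likelihood.
g_nll = (neg_log_likelihood(theta + h) - neg_log_likelihood(theta - h)) / (2 * h)
theta_descent = theta - lr * g_nll

assert np.isclose(theta_ascent, theta_descent)
```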
Hmm. So maximize or minimize p(x)? I don't understand. You are probably talking about the likelihood of an observed dataset (which you definitely want to maximize)? I also don't understand what your last two sentences are trying to say.
(Mar 18 '11 at 07:53)
osdf
What's not to understand? You have a sample (or set of samples) x. Do you want to increase the probability of encountering them, or decrease it? As an example, in Boltzmann machines you sample from two different probability distributions: one that is closely tied to the data (the 'data distribution'), and one that is representative of what the model is capable of recognizing/understanding/encoding (or whatever term applies). The weight update that is generated simultaneously increases the probability of encountering samples from the data distribution and decreases the probability of encountering samples from the distribution the model currently understands.
(Mar 18 '11 at 12:17)
Brian Vandenberg
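(For reference, this is the familiar restricted Boltzmann machine form of that update, with visible units $v_i$, hidden units $h_j$, and weights $w_{ij}$; the first, "data" term raises the probability of the training samples while the second, "model" term lowers the probability of what the model currently generates:)

$$
\frac{\partial \log p(v)}{\partial w_{ij}} \;=\; \langle v_i h_j \rangle_{\text{data}} \;-\; \langle v_i h_j \rangle_{\text{model}}.
$$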
It was poorly worded the first time. In my rush to finish, it came across as inflammatory/rude. Sorry, osdf. The last two sentences are pretty straightforward in the context of gradient learning with regard to a log loss. I kept it brief under the assumption that if the author of the question was familiar with log losses, my [admittedly terse] explanation would suffice.
(Mar 18 '11 at 12:21)
Brian Vandenberg
re: negative votes, do you think my answer didn't apply, needs more detail, or what? Any constructive criticism welcome.
(Mar 18 '11 at 15:50)
Brian Vandenberg
I did not downvote you, but there are some issues. I dislike being downvoted without knowing why as well, so here are my thoughts. For one, the gradient thing is not really necessary (there are methods where no gradient is used anywhere). Then, you differentiate with respect to the space of your observations, which is weird. Furthermore, you (to my knowledge) never minimize a probability. There perhaps is a method where it's done, but... And in RBMs, you sample from the data and from the model, but you follow a gradient which maximizes the posterior.
(Mar 18 '11 at 16:39)
Justin Bayer
On the gradient, you're absolutely right. I didn't mean to give the impression that gradient learning is the only way. Re: the parameter mismatch, you're right; I'll edit/fix it, that was a dumb mistake. Other than the Boltzmann machine case, I've not seen a case where you'd choose to minimize; it was an irrelevant point and I'll remove it from the answer. However, I don't see why it wouldn't be a good idea to try to minimize over some set of samples (B) related to but not part of your training set (A), in an effort to teach the model to ignore Bs and be sensitive to As. There may be a genius way to go about it far better than what I hinted at... but as I said, it's an irrelevant point and I should remove it. I hate not being able to use newlines in comments.
(Mar 18 '11 at 17:43)
Brian Vandenberg
Actually, the minimization you describe is one of the ideas behind contrastive learning: you pick items from your input space by walking down the energy landscape around your training samples and increase the energy of those points. However, that is only done to maximize the probability of your real training set.
(Mar 18 '11 at 18:12)
Justin Bayer
Yes, exactly. While it's used with the intent of improving the model's ability to [recognize, generate samples from, 'believe in'] the data distribution, it's simultaneously increasing the likelihood of generating data-dist samples and decreasing the likelihood of generating samples that don't fit the data distribution. Though, I think other cases could make sense. For example, I worked on an engine whose primary use was to determine whether employees were looking at porn or not. I didn't care about classifying cooking websites, but if I didn't train on that category then websites about cupcakes were often rated as porn.
(Mar 18 '11 at 18:33)
Brian Vandenberg
Brian, I apologize for downvoting without giving any constructive feedback. I just had severe trouble following your notation, which you have now removed anyway, so I have removed the downvote as well.
(Mar 18 '11 at 19:20)
Oscar Täckström
No problem. As soon as we can use latex here, that problem should go away.
(Mar 18 '11 at 19:22)
Brian Vandenberg
I think it's convenience, yes. With entropy, for example, it might feel weird to talk about -2 bits of information, so the definition is negated. Intuitively, the more information something contains, the more negative $\sum_i p_i \log p_i$ becomes, and increasingly negative "information" doesn't make that much sense, hence the sign flip.
I don't think that, for entropy, the reason for the negative sign is convenience. The Shannon information content of an outcome is defined as $h(x) = \log_2(1/P(x))$. The more improbable an outcome is, the higher its information content. The minus sign comes just from the properties of the log. See this: http://tinyurl.com/4u9rhsp
@Alejandro: he's asking why people negate the result instead of just using the negative result. The Shannon entropy formula for the information content in a message is $(\mathbf{-1})\sum_{i} p(x_i)\ln(p(x_i))$, over all possible messages $x_i$ in the sample space of messages under observation. See http://en.wikipedia.org/wiki/Entropy_(information_theory) for a better description.
Please rethink your answer, Brian. E.g. consider $E[h(x)]$, with $h(x) = \log_2(1/p(x))$ (defined by Alejandro) and $E[\cdot]$ the expectation with respect to $p(x)$.
@Brian: All I am saying is that, for the particular case of entropy, the minus sign has nothing to do with convenience.
@Alejandro: Good point. I'll readily admit I didn't read your comment carefully enough.
@Alejandro: I think that the object that people are taking the log of is a probability, so it really is a case of $-\log(p)$.
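(Both views agree once written out: since $0 \le p(x) \le 1$, each $\log p(x)$ is non-positive, so the minus sign is exactly what makes the information content, and hence its expectation, non-negative:)

$$
H(X) \;=\; \mathbb{E}\Bigl[\log_2 \tfrac{1}{p(X)}\Bigr] \;=\; -\sum_x p(x)\,\log_2 p(x) \;\ge\; 0.
$$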