I'm not very familiar with classifiers' cost functions, so I would like to know how much their performance differs. If I use least squares (or minimum error entropy, correntropy, and similar criteria that work quite well for regression) to train a classifier, should I expect much worse performance than with cross-entropy for softmax or the hinge loss for SVMs? Please point me to some benchmarked comparisons if possible.
When you fool around empirically, you'll be shocked that least squares, cross-entropy, and hinge loss all do roughly the same (and I think liblinear implements all of them, so it's easy to try); a quick sketch along these lines is included after the comment below. After reflecting a bit, you'll end up with thoughts like this blog post from Hal Daumé: maybe the SVM does better, if you tweak its parameters. If you want to see the theory, check out "Are Loss Functions All the Same?", Lorenzo Rosasco et al., Neural Computation, May 2004, Vol. 16, No. 5, pp. 1063-1076.

Great blog post, the one you posted. It captured the exact feeling I was starting to have with all that esoteric hyperparameter selection.
(Jun 21 '13 at 15:26)
edersantana
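To make the "fool around empirically" suggestion above concrete, here is a minimal sketch of such a comparison. It is only an illustration of the idea, not a benchmark: it assumes scikit-learn is available and uses its SGDClassifier (which exposes hinge, logistic/cross-entropy, and squared losses behind one interface) on a synthetic make_classification dataset rather than liblinear itself.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in data; swap in your own dataset.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Same linear model, same regularization, only the loss changes.
    # (Older scikit-learn releases name these losses 'log' and 'squared_loss'.)
    for loss in ["hinge", "log_loss", "squared_error"]:
        clf = SGDClassifier(loss=loss, alpha=1e-4, max_iter=1000, random_state=0)
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{loss:>15}: {scores.mean():.3f} +/- {scores.std():.3f}")

On most easy datasets the three accuracies come out very close, which is the point of the answer above; differences tend to show up only after tuning regularization and the like.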
Recent empirical work suggests that, at least for the classical deep learning tasks, the squared hinge loss does a better job than the traditionally used cross-entropy loss on the output layer of a deep neural network.
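For concreteness, here is a rough numpy sketch (my own, not code from any paper) of what swapping the output-layer loss means: a one-vs-rest squared hinge (L2-SVM style) penalty on the raw scores, next to the usual softmax cross-entropy on those same scores.

    import numpy as np

    def squared_hinge_loss(scores, labels):
        """One-vs-rest squared hinge loss on raw output-layer scores.

        scores: (batch, n_classes) raw activations; labels: (batch,) class indices.
        """
        targets = -np.ones_like(scores)                     # -1 everywhere...
        targets[np.arange(len(labels)), labels] = 1.0       # ...except the true class
        margins = np.maximum(0.0, 1.0 - targets * scores)   # hinge per class
        return np.mean(np.sum(margins ** 2, axis=1))        # squared and summed

    def softmax_cross_entropy(scores, labels):
        """Standard softmax + cross-entropy on the same raw scores."""
        shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -np.mean(log_probs[np.arange(len(labels)), labels])

    scores = np.array([[2.0, -1.0, 0.5],
                       [0.1,  0.3, -0.2]])
    labels = np.array([0, 1])
    print(squared_hinge_loss(scores, labels), softmax_cross_entropy(scores, labels))

Either quantity can be minimized by backpropagation; only the gradient flowing into the top layer changes.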
I just found Charlie Tang's results showing better performance for hinge loss + SVM than for cross-entropy + softmax: http://arxiv.org/pdf/1306.0239v1.pdf. I just don't have a clue how those loss functions slice the feature space differently.
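One way to build intuition about how they slice the space: for a linear classifier all of these losses are just functions of the margin m = y*f(x) with y in {-1, +1}, so you can tabulate them side by side. A quick numpy sketch (my own, for illustration):

    import numpy as np

    m = np.linspace(-2.0, 3.0, 11)                    # margin m = y * f(x)

    losses = {
        "hinge":         np.maximum(0.0, 1.0 - m),
        "squared hinge": np.maximum(0.0, 1.0 - m) ** 2,
        "logistic (CE)": np.log1p(np.exp(-m)),        # cross-entropy in margin form
        "squared error": (1.0 - m) ** 2,              # least squares on +/-1 targets
    }

    print("        margin: " + " ".join(f"{v:6.2f}" for v in m))
    for name, vals in losses.items():
        print(f"{name:>14}: " + " ".join(f"{v:6.2f}" for v in vals))

Hinge and squared hinge are exactly zero once m >= 1, the logistic (cross-entropy) loss keeps shrinking but never reaches zero, and plain squared error starts growing again for m > 1, i.e. it penalizes points that are already correctly classified with a large margin.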
Sum of squared errors is definitely the worst because it is very sensitive to outliers.
Also, unless you clip the least squares loss, it can penalize you for being on the correct side of the margin, which is obviously silly for classification.
Actually, sum of squares can be better than cross-entropy if you have wrongly labelled data. It is sometimes referred to as a "soft zero-one loss", since it approaches the zero-one loss if you anneal a hyperparameter in it. Cross-entropy punishes a wrongly labelled example without bound as the model grows confident, while sum of squares penalizes it only quadratically and at most by 1 (since the outputs are still squashed into (0, 1)).
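A small sketch of that bounded-versus-unbounded point, assuming a single sigmoid output: as the model becomes (rightly) confident in class 1 while the noisy label says class 0, the cross-entropy penalty keeps growing, but the squared error on the squashed output saturates at 1.

    import numpy as np

    z = np.linspace(0.0, 10.0, 6)        # logit: model increasingly sure of class 1
    p = 1.0 / (1.0 + np.exp(-z))         # sigmoid output
    # ...but the (noisy) label says class 0:
    cross_entropy = -np.log(1.0 - p)     # unbounded as p -> 1
    squared_error = (0.0 - p) ** 2       # bounded above by 1

    for zi, ce, se in zip(z, cross_entropy, squared_error):
        print(f"logit={zi:5.1f}   cross-entropy={ce:8.3f}   squared error={se:5.3f}")

So a single mislabelled point can dominate the cross-entropy gradient, while under the squared loss its influence is capped.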