|
I'm currently using an SVR to solve a regression problem. I applied the algorithm to the same dataset twice: once for the "untouched" case, i.e. using the raw data as I got it and putting Xtrain, Ytrain, Xtest and Ytest into the algorithm, and a second time after normalizing the data, i.e. subtracting the minimum of each dimension from the data and then dividing each dimension by its (shifted) maximum value. The result is that the data has 0 as the minimum and 1 as the maximum in each dimension. In MATLAB the operation would look something like this:
I often hear that you should do this before applying an algorithm. But why? What are the benefits of this approach? Apart from some numerical issues, such as subtracting two numbers where number1 << number2 (number1 considerably smaller than number2), it is not clear to me why this approach would be advantageous. |
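The MATLAB snippet is missing from this copy of the post. As a stand-in, a minimal Python sketch of the scaling described above might look like the following. The function names `fit_min_max` and `apply_min_max` and the toy values are hypothetical, and computing the minima and ranges on the training split only (then reusing them for the test split) is an assumption, not something the post states.

```python
def fit_min_max(rows):
    """Return per-column minima and ranges computed from `rows` (a list of lists)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    # A constant column has range 0; fall back to 1.0 to avoid dividing by zero.
    ranges = [(max(col) - lo) or 1.0 for col, lo in zip(columns, mins)]
    return mins, ranges


def apply_min_max(rows, mins, ranges):
    """Subtract each column's minimum, then divide by that column's range."""
    return [[(x - lo) / rng for x, lo, rng in zip(row, mins, ranges)]
            for row in rows]


# Toy data (made up): two features on very different scales.
x_train = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
x_test = [[2.5, 1500.0]]

# Assumption: statistics come from the training split and are reused for the test split.
mins, ranges = fit_min_max(x_train)
x_train_scaled = apply_min_max(x_train, mins, ranges)
x_test_scaled = apply_min_max(x_test, mins, ranges)
```

With this choice every training column spans exactly [0, 1], while test values can fall slightly outside that interval if they exceed the training extremes.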
|
There are many reasons why normalization is a good idea. One has to do with the condition number of the learning problem, which tends to be better when the different dimensions are roughly the same size and independent. Another has to do with regularization: if the values in one dimension always have twice the magnitude of the values in another dimension, the effective penalty for the larger dimension will be half (that is, on average the larger dimension will have a higher effect on the objective value because it can get the same effect using a parameter of only half the size, which squared regularization prefers). This is generally a bad thing.
Thanks for this answer. Now I conclude from your statement that, without normalization, there will always be a negative effect if two dimensions are somehow correlated, right?
(Nov 21 '11 at 17:57)
Tom
Usually, yes, but often it's not measurable, or the cost of normalizing (in terms of sparsity and such) makes it not worth it.
(Nov 21 '11 at 18:11)
Alexandre Passos ♦
Great! Sorry for this additional request: What do you mean by "condition number of the learning problem"?
(Nov 21 '11 at 18:54)
Tom
I mean http://en.wikipedia.org/wiki/Condition_number, i.e. the ratio between the largest and smallest eigenvalues of the Hessian.
(Nov 21 '11 at 18:55)
Alexandre Passos ♦
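To make the condition-number remark concrete, here is a rough sketch under a simplifying assumption: for a plain least-squares objective (not the SVR from the question) the Hessian is proportional to X^T X, so its eigenvalue ratio can be computed directly. The helper names and the toy data are hypothetical, and the 2x2 eigenvalues are computed in closed form to keep the example dependency-free.

```python
import math


def condition_number(rows):
    """Eigenvalue ratio of the 2x2 matrix X^T X for two-feature rows."""
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    c = sum(r[1] * r[1] for r in rows)
    # Largest eigenvalue of the symmetric matrix [[a, b], [b, c]].
    lam_max = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    # Smallest eigenvalue via the determinant, which avoids cancellation.
    lam_min = (a * c - b * b) / lam_max
    return lam_max / lam_min


def min_max_scale(rows):
    """Rescale each column to [0, 1] (assumes no column is constant)."""
    cols = list(zip(*rows))
    lows = [min(col) for col in cols]
    spans = [max(col) - lo for col, lo in zip(cols, lows)]
    return [[(x - lo) / s for x, lo, s in zip(row, lows, spans)] for row in rows]


# Toy data (made up): the second feature is about 100 times larger than the first.
raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]

cond_raw = condition_number(raw)
cond_scaled = condition_number(min_max_scale(raw))
print(cond_raw, cond_scaled)
```

On this toy data the scaled problem is far better conditioned than the raw one. The two features stay correlated after scaling, which is why the scaled condition number does not drop all the way to 1.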
Ah okay, I already suspected this :)
(Nov 21 '11 at 19:02)
Tom
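The regularization point from the answer can be illustrated in the same spirit. This is only a sketch, assuming one-dimensional ridge regression (squared penalty) rather than the SVR from the question; `ridge_weight`, the data and the penalty strength are all made up for illustration.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form 1-D ridge solution: argmin_w sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # y = 2 * x exactly, so the unregularized slope is 2
lam = 14.0             # hypothetical penalty strength

# Same information, but the feature is recorded at twice the magnitude.
xs_doubled = [2.0 * x for x in xs]

w_small = ridge_weight(xs, ys, lam)
w_big = ridge_weight(xs_doubled, ys, lam)

# Express both fits as a slope per unit of the ORIGINAL feature.
effect_small = w_small
effect_big = 2.0 * w_big
print(effect_small, effect_big)
```

The same penalty shrinks the fitted slope less when the feature is stored at twice the magnitude, so the amount of regularization each dimension receives depends on its units unless the data is normalized first.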
|
|
Here's an anecdote for why you should usually normalise your data. I worked on a project involving classification of images based on feature histograms, and I forgot to normalise my histograms before feeding them into an SVM algorithm for training. My training/testing images were fairly large (2MB each), so their histograms also had fairly large counts, e.g. [423423,1231,2343435,2342342,3545,...]. After cross-validation, I was getting an accuracy of around 90%. However, when I went to test new images that had labels the classifier had seen, but that were of a much lower resolution (e.g. 500KB) and therefore had histograms with much smaller counts, I was getting a classification accuracy of 5-10%. Pretty horrible.
What I eventually realised was that my classifier was essentially "cheating" by using the magnitude of the histogram to help it label a feature set. If I gave it a 2MB image similar to what it had been trained on, it would likely classify it correctly, but if I took that same image and scaled it down to half the original size, it would likely classify it incorrectly because the magnitude of the histogram would be completely different. Normalising my histograms and retraining produced a much more reliable classification on new data, albeit with slightly lower accuracy (75-80%), since the algorithm couldn't use the magnitude to partition the data.
So to summarise, you should probably always normalise your data, unless the magnitude of your numbers conveys important information that also makes sense when compared to other arbitrary samples.
Thanks, that's obviously a very nice example!
(Nov 21 '11 at 18:54)
Tom
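The effect in the anecdote can be sketched in a few lines. The answer does not say which normalisation was used, so dividing each histogram by its total count (L1 normalisation) is an assumption here, and the bin counts are made up.

```python
import math


def l1_normalize(hist):
    """Divide every bin by the total count so the bins sum to 1."""
    total = sum(hist)
    if total == 0:
        return [0.0 for _ in hist]
    return [count / total for count in hist]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical histograms of the same image at two resolutions:
# identical bin proportions, but a quarter of the pixels in the second.
full_res = [400, 100, 300, 200]
low_res = [100, 25, 75, 50]

raw_distance = euclidean(full_res, low_res)
normalized_distance = euclidean(l1_normalize(full_res), l1_normalize(low_res))
print(raw_distance, normalized_distance)
```

The raw histograms are far apart even though they describe the same content, whereas the normalised ones coincide, which is why a classifier trained on raw counts can end up keying on image size.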
|