I just recently started working with data mining. Even after reading several articles it is still not always clear to me when scaling makes sense, which kind of scaling to use when, and when it has no effect at all. As I understand it, standardizing the data to mean 0 and variance 1 should have a significant influence on kNN: attributes with different distributions are brought to the same scale, so the distances may change dramatically. However, I conducted some experiments in weka, for instance with the diabetes example data set. I tried the IBk classifier with k=3 on the original data and then on the standardized data, yet the results are exactly the same. So my question is: why does standardization seem to have no effect here?
Edit: I did this little calculation:
Mean and variance for x, y and z:
After scaling:
So after scaling the second result is closer to (0, 0, 0) than before, which seems to support my assumption. But then it is still unclear why scaling seemed to have no effect on the classification in weka.
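Since the numbers from the calculation above are not reproduced here, the effect can be sketched with a small synthetic example (the points and values below are my own, not the poster's): one attribute with a much larger spread dominates the raw Euclidean distance, and z-scoring each attribute can change which point is nearest.

```python
import numpy as np

# Hypothetical 3-D points (my own toy numbers, not the post's data):
# attribute z has a much larger spread than x and y, so it dominates
# raw Euclidean distances.
query = np.array([0.0, 0.0, 0.0])
a = np.array([1.0, 1.0, 10.0])   # close to query in x/y, far in z
b = np.array([3.0, 3.0, 1.0])    # farther in x/y, close in z
points = np.vstack([query, a, b])

# z-score standardization: subtract the column mean, divide by the
# column standard deviation (mean 0, variance 1 per attribute)
scaled = (points - points.mean(axis=0)) / points.std(axis=0)

def nearest(q, cands):
    """Index of the candidate with the smallest Euclidean distance to q."""
    return int(np.argmin(np.linalg.norm(cands - q, axis=1)))

print(nearest(points[0], points[1:]))  # raw data:    1 -> b is nearest
print(nearest(scaled[0], scaled[1:]))  # standardized: 0 -> a is nearest
```

On the raw data, b wins because it is close in the dominant z attribute; after standardization all three attributes contribute comparably, and a becomes the nearest point.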
It certainly has to have an effect, as you point out. That it doesn't change the accuracy in this specific example could be because, while the relative distances do change, the nearest neighbors still belong to the same classes as before. If you try this on other data sets you are likely to see a bigger change.
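The point above can be made concrete with a minimal sketch (again with made-up numbers, not the weka diabetes data): standardization changes which point is nearest, but because both candidate neighbors carry the same label, a 1-NN prediction is identical either way.

```python
import numpy as np

# Two candidate neighbors with the SAME class label (made-up data):
# scaling changes which one is nearest, but not the predicted class.
X = np.array([[1.0, 1.0, 10.0],   # labeled "pos"
              [3.0, 3.0, 1.0]])   # also labeled "pos"
y = np.array(["pos", "pos"])
query = np.array([0.0, 0.0, 0.0])

# z-score all points (query included) per attribute
stacked = np.vstack([query, X])
scaled = (stacked - stacked.mean(axis=0)) / stacked.std(axis=0)
qs, Xs = scaled[0], scaled[1:]

raw_nn = int(np.argmin(np.linalg.norm(X - query, axis=1)))
scaled_nn = int(np.argmin(np.linalg.norm(Xs - qs, axis=1)))

print(raw_nn, scaled_nn)        # different nearest neighbors...
print(y[raw_nn], y[scaled_nn])  # ...but the same predicted class
```

If the diabetes set happens to behave like this for most query points (neighborhoods change, class composition doesn't), identical accuracy before and after standardization is exactly what you would observe.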