|
While searching for numerical datasets, I noticed that libsvm's version of the adult dataset (a mixture of categorical and integer attributes) converts the attributes into quantiles (which I believe are buckets, each with a binary value indicating whether the attribute falls into that bucket's range). So the 14 attributes in the adult dataset become a 123-dimensional binary feature vector. What is the gain in doing so? I tried using the 123 binary features with SOM clustering and it does extremely poorly. This is likely because the SOM relies on Euclidean distance as its similarity measure, so 1 0 0 0 is the same distance from 0 1 0 0 as it is from 0 0 1 0, which is not true at all of the original attribute values. |
|
Say, for simplicity, that you have a linear model for regression. That is, the value returned by your model is a weighted sum of the feature values, with fixed weights. Now assume the dependence between one of these features and the final value is non-linear (say, quadratic): a small value for the feature is bad, it gets better as the value increases, and then it gets worse again. This happens a lot in questions like "how much bass should I add when equalizing this song in this environment". If you don't quantize, a linear model has no hope of capturing these nonlinearities, but with quantization you can assign independent "goodness" weights to different ranges of the feature's values. I'm not experienced with SOMs at all, but I think there are better alternatives in almost any setting (for example, t-SNE works a lot better for visualizing high-dimensional data), so this could be a fault of the SOM or of your optimizer.
Thanks, that was a great explanation. I'm using the SOM implementation in MATLAB's neural net toolbox. The SOM seems to generate good clusters for the MNIST dataset. The quantiles are probably not benefiting the SOM the same way they benefit a linear SVM. I don't believe SOMs have any issue with nonlinearity, since a SOM is itself a nonlinear transformation (via competitive matching). It is actually very similar to k-means clustering. But t-SNE looks very interesting.
(Apr 28 '11 at 17:34)
crdrn
Based on Alexandre's answer, another realistic example: suppose you are interested in estimating the weight of a baby. In the first 4 months, its weight increases by 750 g/month; in the next 4 months by 500 g/month, and so on. By partitioning the time interval, different weights (pun intended) can be used for the estimate in each period.
(May 01 '11 at 15:59)
Lucian Sasu
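To make the point about linear models concrete, here is a minimal sketch. The data is synthetic, and the helper names (`one_hot_quantile_bins`, `linear_fit_mse`) are made up for illustration; this is not how libsvm produced its features. It fits an ordinary least-squares model to a quadratic target, once on the raw feature and once on one-hot quantile bins.

```python
import numpy as np

def one_hot_quantile_bins(x, n_bins):
    """Map each value to a one-hot vector over n_bins quantile buckets."""
    # n_bins - 1 interior thresholds, placed at evenly spaced quantiles.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    idx = np.searchsorted(edges, x, side="right")  # bucket index in 0..n_bins-1
    return np.eye(n_bins)[idx]

def linear_fit_mse(features, y):
    """Least-squares fit of y on the features; returns the mean squared error."""
    w, *_ = np.linalg.lstsq(features, y, rcond=None)
    return float(np.mean((features @ w - y) ** 2))

# Synthetic example: too little is bad, too much is bad, the middle is best.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
y = -(x ** 2)

raw = np.column_stack([x, np.ones_like(x)])   # raw feature plus a bias term
binned = one_hot_quantile_bins(x, n_bins=8)   # one weight per bucket

mse_raw = linear_fit_mse(raw, y)
mse_binned = linear_fit_mse(binned, y)
print(f"raw feature MSE:    {mse_raw:.4f}")
print(f"binned feature MSE: {mse_binned:.4f}")
```

With the raw feature, the best a linear model can do here is roughly a constant. With bins, each bucket gets its own weight, so the fit can follow the curve as a step function.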
|
|
The conversion from real values to four binary bins was an application-specific transformation, as stated by Alexandre. This representation works better than the "thermometer" encoding when learning linear models. The "thermometer" representation does have interesting distance properties and might be useful with RBF kernels. Since you are clustering based on Euclidean distance, the real-valued features should produce better results. You should normalize these values first. If you want to get a little fancier, look into using the Mahalanobis distance: http://en.wikipedia.org/wiki/Mahalanobis_distance
I agree, normalization seems very important when using Euclidean distance. I think I may be having issues where some features are 'overpowering' others in the distance calculation simply because their absolute values are much larger. Also, the variance along the categorical features is much higher than along the real-valued features (which I rescaled to the range 0 to 1), because the categorical features can only take the binary values 0 and 1.
(May 03 '11 at 10:44)
crdrn
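A minimal sketch of the two suggestions above, normalizing before using Euclidean distance and computing a Mahalanobis distance. The two columns are synthetic stand-ins for features on very different scales (not the actual adult data), and the function names are illustrative only.

```python
import numpy as np

def standardize(X):
    """Rescale each column to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unchanged
    return (X - mean) / std

def mahalanobis(u, v, cov_inv):
    """Mahalanobis distance between u and v for a given inverse covariance."""
    d = u - v
    return float(np.sqrt(d @ cov_inv @ d))

# Synthetic data: the second column has far larger absolute values.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(40.0, 12.0, size=200),        # e.g. something age-like
    rng.normal(50000.0, 20000.0, size=200),  # e.g. something income-like
])

Z = standardize(X)
cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))

# On the raw data the large-valued column dominates the Euclidean distance.
print("raw Euclidean:         ", np.linalg.norm(X[0] - X[1]))
print("standardized Euclidean:", np.linalg.norm(Z[0] - Z[1]))
print("Mahalanobis:           ", mahalanobis(X[0], X[1], cov_inv))
```

Unlike plain Euclidean distance, the Mahalanobis distance does not change when a column is rescaled, because the covariance matrix absorbs the scale.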
|
|
Rather than the binary one-hot encoding you are using (zeros everywhere, with a single 1 for the corresponding bucket), you could try a "thermometer" encoding, where you put a 1 in every position whose lower threshold the value exceeds. For instance, instead of representing values in the first, second, and third ranges as (respectively) "1 0 0 0", "0 1 0 0", and "0 0 1 0", you can represent them as "0 0 0", "1 0 0", "1 1 0", and so on. That way, there is a bigger distance between quantiles that are further apart: neighboring ranges have a Euclidean distance of 1, but the distance between the min and max values is sqrt(n), where n is the number of thresholds.
But would I benefit more from using the real continuous values of the features, since I'm not constrained to linearity? That said, I think the "thermometer" encoding you proposed might give the same benefit to linear SVMs as the quantile encoding done in libsvm.
(May 02 '11 at 09:31)
crdrn
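The distance argument is easy to check numerically. Below is a small sketch with four buckets; the helper names `one_hot`, `thermometer`, and `dist` are made up for this example.

```python
import numpy as np

def one_hot(bucket, n_buckets):
    """All zeros except a single 1 at the bucket's position."""
    v = np.zeros(n_buckets)
    v[bucket] = 1.0
    return v

def thermometer(bucket, n_buckets):
    """n_buckets - 1 bits; bit i is 1 if the value lies above threshold i."""
    return (np.arange(n_buckets - 1) < bucket).astype(float)

def dist(a, b):
    """Euclidean distance between two encoded vectors."""
    return float(np.linalg.norm(a - b))

n = 4
for other in range(1, n):
    d_hot = dist(one_hot(0, n), one_hot(other, n))
    d_thermo = dist(thermometer(0, n), thermometer(other, n))
    print(f"bucket 0 vs bucket {other}: one-hot {d_hot:.3f}, thermometer {d_thermo:.3f}")
```

Every pair of distinct one-hot vectors is sqrt(2) apart, so the ordering of the buckets is invisible to a Euclidean method. With the thermometer encoding, the distance between buckets i and j is sqrt(|i - j|), so buckets that are further apart are also further apart in the encoded space.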
|