
While searching for numerical datasets, I noticed that the libsvm version of the adult dataset (a mixture of categorical and integer attributes) has been converted into quantiles (which I believe are buckets, each with a binary value indicating whether the original value falls into that bucket's range). So the 14 attributes in the adult dataset were converted into a 123-dimensional binary feature vector.

What is the gain in doing so? I tried the 123-dimensional binary feature vector with SOM clustering and it does extremely poorly. This is likely because the SOM relies on Euclidean distance as its similarity measure, so 1 0 0 0 is the same distance from 0 1 0 0 as it is from 0 0 1 0, which is not true at all of the original attribute values.
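To make the distance argument concrete, here is a minimal sketch in plain Python. The helper names `one_hot` and `euclidean` and the choice of 4 buckets are illustrative assumptions, not anything taken from libsvm or the SOM toolbox.

```python
import math

def one_hot(bucket, n_buckets):
    """Return a list with a single 1 at position `bucket` (illustrative helper)."""
    return [1 if i == bucket else 0 for i in range(n_buckets)]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

n_buckets = 4  # assumed number of quantile buckets for one attribute
codes = [one_hot(b, n_buckets) for b in range(n_buckets)]

# Distance from the first bucket to each of the other buckets.
dists_from_first = [euclidean(codes[0], c) for c in codes[1:]]
print(dists_from_first)
```

Every pair of distinct buckets ends up sqrt(2) apart, so the ordering of the original values is invisible to a distance-based method.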

asked Apr 28 '11 at 11:14


crdrn


3 Answers:

Say, for simplicity, that you have a linear model for regression. That is, the value returned by your model is a weighted sum of the feature values, with fixed weights. Now assume the dependence between one of these features and the final value is non-linear (say, quadratic): a small value for the feature is bad, as the value increases it gets better, then it gets worse again. This happens a lot in things like "how much bass should I put in when equalizing this song in this environment", etc. Without quantization, a linear model has no hope of capturing these nonlinearities, but with quantization it can assign independent "goodness" weights to different ranges of the feature's values. I'm not experienced with SOM at all, but I think there are better alternatives to it in almost any setting (for example, t-SNE works a lot better for visualizing high-dimensional data), so this could be a fault of SOM or of your optimizer.
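As a rough illustration of the point about linear models, the sketch below fits a straight line to a synthetic quadratic target, then fits one weight per bucket instead. The data, the 4 equal-width buckets, and the helper names (`fit_line`, `mse`, `bin_of`) are all assumptions made up for this example. With one-hot bucket features and no intercept, the least-squares weight of each bucket is simply the mean target inside it.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Synthetic data: the target is best in the middle of the range, worse at both ends.
xs = [i / 99 for i in range(100)]
ys = [-(x - 0.5) ** 2 for x in xs]

# Linear model on the raw feature.
slope, intercept = fit_line(xs, ys)
raw_mse = mse([slope * x + intercept for x in xs], ys)

# Linear model on one-hot bucket indicators: one independent weight per bucket.
n_bins = 4

def bin_of(x):
    """Index of the equal-width bucket that x falls into (x assumed in [0, 1])."""
    return min(int(x * n_bins), n_bins - 1)

sums = [0.0] * n_bins
counts = [0] * n_bins
for x, y in zip(xs, ys):
    sums[bin_of(x)] += y
    counts[bin_of(x)] += 1
weights = [s / c for s, c in zip(sums, counts)]
binned_mse = mse([weights[bin_of(x)] for x in xs], ys)

print(raw_mse, binned_mse, weights)
```

The raw-feature fit can only produce a straight line, while the bucketed fit gives the middle buckets a higher weight than the outer ones, which is the "independent goodness weights" effect described above.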

answered Apr 28 '11 at 13:30


Alexandre Passos ♦

Thanks, that was a great explanation.

I'm using the SOM implementation in MATLAB's neural net toolbox. The SOM seems to generate good clusters for the MNIST dataset. The quantiles are probably not benefiting the SOM the same way they benefit a linear SVM. I don't believe SOMs have any issue with nonlinearity, since a SOM is itself a nonlinear transformation (via competitive matching). It is actually very similar to k-means clustering.

But t-SNE looks very interesting.

(Apr 28 '11 at 17:34) crdrn

Based on Alexandre's answer, another realistic example: suppose you are interested in estimating the weight of a baby. In the first 4 months, its weight increases by 750 g/month; in the next 4 months by 500 g/month, and so on. By partitioning the time interval, different weights (pun intended) can be used for the estimate.

(May 01 '11 at 15:59) Lucian Sasu

The conversion from real values to four binary bins was an application-specific transformation, as stated by Alexandre. This representation works better than the "thermometer" encoding when learning linear models. The "thermometer" representation does have interesting distance properties and might be useful with RBF kernels. Since you are clustering based on Euclidean distance, the real-valued features should produce better results. You should normalize these values first. If you want to get a little fancier, look into using the Mahalanobis distance: http://en.wikipedia.org/wiki/Mahalanobis_distance
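A minimal sketch of both suggestions, in plain Python, assuming just two real-valued features. The five rows are made-up numbers (think of an age-like column and a capital-gain-like column), and the helper names are illustrative only. A real pipeline would more likely use a linear algebra library.

```python
import math

def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mahalanobis_2d(a, b, rows):
    """Mahalanobis distance between a and b using the 2x2 covariance of `rows`."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / n
    syy = sum((r[1] - my) ** 2 for r in rows) / n
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / n
    det = sxx * syy - sxy ** 2  # must be non-zero (features not perfectly correlated)
    dx, dy = a[0] - b[0], a[1] - b[1]
    # d^T S^{-1} d with the closed-form inverse of the 2x2 covariance matrix S
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

# Made-up data: the second column has a far larger scale than the first.
rows = [[25, 0.0], [38, 5000.0], [52, 200.0], [41, 12000.0], [30, 700.0]]

raw = euclidean(rows[0], rows[1])             # dominated by the second column
scaled_rows = standardize(rows)
scaled = euclidean(scaled_rows[0], scaled_rows[1])
maha = mahalanobis_2d(rows[0], rows[1], rows)
print(raw, scaled, maha)
```

On the raw rows the large-scale column decides the distance almost alone. After standardizing, both columns contribute comparably. The Mahalanobis distance additionally accounts for correlation between the columns and does not change if one column is rescaled.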

answered May 03 '11 at 01:04


Brent Payne

edited May 03 '11 at 01:05

I agree, normalization seems very important when using Euclidean distance. I think I may be having issues where some features are 'overpowering' others in the distance calculation simply because their absolute values are much larger.

And the variance in the direction of the categorical features is much higher than in the direction of the real-valued features (which I rescaled to the range 0 to 1), because the categorical features can only take the binary values 0 and 1.

(May 03 '11 at 10:44) crdrn

Rather than the binary one-hot encoding you are using (zeros everywhere, and one 1 for the corresponding bucket), you should maybe use a "thermometer" encoding, where you put a 1 if the value exceeds the lower threshold of the corresponding range.

For instance, instead of representing values in the first, second, and third range as (respectively) "1 0 0 0", "0 1 0 0", and "0 0 1 0", you can represent them as "0 0 0", "1 0 0", "1 1 0", and so on. That way, there is a bigger distance between quantiles that are further apart (neighboring ranges would have a Euclidean distance of 1, but the distance between the min and max values would be sqrt(n), where n is the number of thermometer bits).
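A small sketch of this encoding in plain Python, assuming 4 ranges and therefore 3 thermometer bits. The helper names `thermometer` and `euclidean` are illustrative, not from any library.

```python
import math

def thermometer(bucket, n_buckets):
    """Encode a bucket index with n_buckets - 1 bits.

    Bit i is 1 when the value lies above the i-th threshold, i.e. when
    the bucket index is greater than i.
    """
    return [1 if bucket > i else 0 for i in range(n_buckets - 1)]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

n_buckets = 4  # assumed number of ranges for one attribute
thermo_codes = [thermometer(b, n_buckets) for b in range(n_buckets)]

neighbor_dist = euclidean(thermo_codes[0], thermo_codes[1])
min_max_dist = euclidean(thermo_codes[0], thermo_codes[-1])
print(thermo_codes, neighbor_dist, min_max_dist)
```

With this encoding the distance between buckets i and j is sqrt(|i - j|), so it grows with how far apart the ranges are, unlike the one-hot case where every pair is equally far apart.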

answered Apr 30 '11 at 14:47


Pascal Lamblin

But would I benefit more from using the real continuous values of the features, since I'm not constrained to linearity? That said, I think the "thermometer" encoding you proposed might give linear SVMs the same benefit as the quantile encoding done in libsvm.

(May 02 '11 at 09:31) crdrn

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.