
I am trying to cluster some data points using k-means. The problem is that some of the features are nominal or ordinal and some are real-valued. What kind of distance measure can we use to take all of the features into account?

asked Nov 01 '11 at 16:35

Saurabh Saxena


3 Answers:

Although you could project the real-valued features onto some interval, or transform them with some function that brings them to the same scale as your discrete values, I would discretize the real-valued features so that you have only ordinal features. This can be done with simple uniform-width buckets or with a more complex discretization scheme.
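As a rough illustration of the uniform bucket width idea, here is a minimal sketch in plain Python. The function name `uniform_bins`, the `heights` data and the choice of three bins are made up for the example.

```python
def uniform_bins(values, n_bins):
    """Map real values to bin indices 0..n_bins-1 using equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical real-valued feature.
heights = [1.50, 1.62, 1.75, 1.81, 1.98]
bins = uniform_bins(heights, n_bins=3)
print(bins)
```

The resulting bin indices are ordinal, so they can be compared and averaged like the other ordinal features.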

answered Nov 01 '11 at 17:28

Oscar Täckström

What kind of distance measure should I use for the ordinal features after transforming the real-valued features?

(Nov 01 '11 at 17:33) Saurabh Saxena

I would recommend starting with Euclidean distance; if you run into problems, you could start thinking about other measures. There is a large body of research on more complicated clustering methods and distance measures, but if you don't have any experience with these methods and you don't have any domain-specific reasons for particular model choices, the standard k-means approach seems most appropriate.

(Nov 01 '11 at 18:00) Oscar Täckström

Euclidean distance will not work for nominal features, e.g. a feature that can only take the values Black, White or Brown.

(Nov 01 '11 at 18:09) Saurabh Saxena

Sorry, I thought you had binarized your nominal features. Instead of having one feature take the values Black, White or Brown, you add three binary features of the form Color=Black, Color=White, Color=Brown.
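The binarization described here can be sketched in a few lines of plain Python. The helper names `one_hot` and `squared_euclidean` and the sample `colors` list are assumptions for illustration only.

```python
def one_hot(values, categories):
    """Turn each nominal value into a list of 0/1 indicators, one per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical nominal feature with the three values from the question.
colors = ["Black", "White", "Brown", "Black"]
encoded = one_hot(colors, categories=["Black", "White", "Brown"])
# encoded[0] is [1, 0, 0]: Color=Black is set, the other two indicators are not.

print(squared_euclidean(encoded[0], encoded[1]))  # Black vs White
print(squared_euclidean(encoded[0], encoded[3]))  # Black vs Black
```

With this encoding any two different colors are equally far apart and identical colors are at distance zero, so Euclidean distance no longer imposes an arbitrary order on the nominal values.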

(Nov 01 '11 at 18:31) Oscar Täckström

Although that would make it impossible to generalize across bin borders. E.g., say you have the data (0.999, 1.0001, 3) and your bin border is 1.0. Then the first and second items fall into different bins although they are closely related. If you run into this problem, you can bin the real values by running k-means and using the clusters as bins. Or let each item go into several bins.
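To make the k-means binning idea concrete, here is a minimal sketch of Lloyd's algorithm on scalars, applied to the three values from the example. The function `kmeans_1d`, its deterministic initialization and the choice of k=2 are assumptions for illustration, not a tuned implementation.

```python
def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd's algorithm on scalars; returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: k values spread evenly over the sorted data.
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest center for every value.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, l in zip(values, labels) if l == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

values = [0.999, 1.0001, 3.0]
centers, bins = kmeans_1d(values, k=2)
print(bins)
```

Here 0.999 and 1.0001 end up in the same bin, whereas a fixed border at 1.0 would have separated them.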

(Nov 02 '11 at 04:40) Justin Bayer

I'm not an expert on either, but I know that the Value Difference Metric (VDM) and the closely related Heterogeneous Value Difference Metric (HVDM) were designed for this purpose. If I recall correctly, VDM is just for nominal features and HVDM tries to combine nominal and continuous features, so that is closer to what you want.

As @Oscar points out below, HVDM makes use of class values and so is inappropriate for clustering applications. The HVDM paper linked above discusses at least one other method that doesn't require supervision (see Section 2.3).
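For a sense of what an unsupervised "combined" distance can look like, here is a minimal sketch that uses a match/mismatch term for nominal features and a range-normalized difference for numeric ones. This is an illustrative assumption and not necessarily the method the paper's Section 2.3 describes; the function name `mixed_distance`, its parameters and the sample points are made up.

```python
import math

def mixed_distance(a, b, nominal, ranges):
    """Distance over mixed features.

    nominal: set of indices holding nominal features (assumed known).
    ranges:  dict index -> (max - min) of each numeric feature in the data.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in nominal:
            d = 0.0 if x == y else 1.0  # overlap: match or mismatch
        else:
            d = abs(x - y) / ranges[i] if ranges[i] else 0.0  # scaled to [0, 1]
        total += d * d
    return math.sqrt(total)

# Hypothetical points: (color, weight), where weight spans a range of 40.
p = ("Black", 10.0)
q = ("Brown", 20.0)
r = ("Black", 20.0)
ranges = {1: 40.0}

print(mixed_distance(p, q, {0}, ranges))
print(mixed_distance(p, r, {0}, ranges))
```

Note that k-means needs to average points, which is undefined for raw nominal values, so a distance like this pairs more naturally with methods that only need pairwise distances, such as k-medoids or hierarchical clustering.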

answered Nov 01 '11 at 17:41

Troy Raeder

edited Nov 01 '11 at 20:04

Troy, it seems that the VDM requires supervision, so I don't think it's applicable in this scenario. Since HVDM is based on VDM, it seems that it wouldn't work either.

(Nov 01 '11 at 18:36) Oscar Täckström

Oops, sorry, I didn't think that one through clearly enough. At any rate, the paper that introduces HVDM has a summary of methods for "combined" distance (using continuous and nominal features), so it's still a good place to look. I guess I'll edit my answer.

(Nov 01 '11 at 19:11) Troy Raeder

I would transform the nominal features using a one-hot encoding, so that you have only numerical features. If you standardize every feature to zero mean and unit variance, you won't have to worry about scaling. If you don't, I would start by scaling them to the min and max of the range of the other features.
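A minimal sketch of the zero-mean, unit-variance step applied after one-hot encoding, using only the standard library. The `standardize` helper and the `colors` and `weights` data are hypothetical.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a numeric column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - mu) / sigma for v in column]

# Hypothetical data: one nominal and one real-valued feature.
colors = ["Black", "White", "Brown", "Black"]
weights = [10.0, 20.0, 30.0, 40.0]
categories = ["Black", "White", "Brown"]

# One indicator column per category, then every column is standardized.
columns = [[1.0 if c == cat else 0.0 for c in colors] for cat in categories]
columns.append(weights)
scaled = [standardize(col) for col in columns]

# Back to one row per data point, ready to pass to a k-means implementation.
rows = [list(r) for r in zip(*scaled)]
print(rows[0])
```

After this step every column has the same spread, so no single feature dominates the Euclidean distance purely because of its units.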

answered Nov 02 '11 at 07:33

Andreas Mueller


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.