Hi,

I'm busy working on a project involving k-nearest neighbour clustering (specifically KNN regression). I have mixed numerical and categorical fields. How do I go about incorporating categorical values into the analysis?

Btw, the categorical values are ordinal (e.g., bank name, account type). Numerical values would be, e.g., age and salary. There are also some binary types (e.g., male, female).

Help would be much appreciated.

asked Nov 29 '12 at 12:06

Peytonator

edited Nov 29 '12 at 16:03


3 Answers:

Decision trees are very good at handling categorical data, but they are not unsupervised. There does, however, seem to be some material on decision-tree clustering (paper).

answered Aug 21 '13 at 09:29

Arun Kumar

Hi, I have a similar problem. I want to cluster my data, which has both categorical and numerical features. One of my features is 'type' (type of movie), and it has 21 values, so I can't make a dummy feature for each of them! Please help me, what can I do?

answered Aug 21 '13 at 02:24

samira rezai

For binary features, you can map (TRUE/FALSE) to (0,1). For categorical features, a common approach is to create a binary dummy feature for each value of the attribute. Suppose your bank_name feature can take the values "Chase", "CitiBank", or "BBT". Then you would create dummy features like bank_chase, bank_citibank, and bank_bbt. Whenever bank_name is "CitiBank", you would set bank_chase=0, bank_citibank=1, and bank_bbt=0 (and similarly for the other values of bank_name).
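As a concrete sketch of that dummy-variable scheme (in Python, since the thread contains no code; the bank names are just the hypothetical ones above):

```python
def one_hot(value, categories):
    """Return a list of 0/1 dummy features, one per category."""
    return [1 if value == c else 0 for c in categories]

banks = ["Chase", "CitiBank", "BBT"]

# bank_name == "CitiBank" becomes bank_chase=0, bank_citibank=1, bank_bbt=0
print(one_hot("CitiBank", banks))  # [0, 1, 0]
```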

Note: You stated that "bank name" and "account type" are ordinal but I am guessing they are actually just categorical, since there is likely no meaningful ordering of the values. For a truly ordinal value such as quality, which might take the values "low", "medium", or "high", you can map those values to 0, 0.5, and 1, respectively.
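A minimal sketch of that ordinal mapping (the "quality" feature and its values are only the illustrative ones above):

```python
# Map an ordinal feature ("low" < "medium" < "high") onto [0, 1] so that
# the encoded distance reflects the ordering of the values.
quality_map = {"low": 0.0, "medium": 0.5, "high": 1.0}

def encode_quality(q):
    return quality_map[q]

print(encode_quality("medium"))  # 0.5
```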

answered Nov 29 '12 at 16:43

bogatron

Thanks bogatron. How do I use this in practice? At the moment, I'm implementing KNN search with a simple MATLAB function that calculates Euclidean distances to each neighbour in the cluster. So the algorithm would see, e.g., male = 0, female = 1. However, I can't pass in three dummy variables for the other categorical data. What should I do?

(Nov 30 '12 at 03:05) Peytonator

I was thinking about it a bit more - are you saying that you create 3 NEW dummy variables which are purely binary? Also, surely the binary codes carry different weights (e.g., male = 0, female = 1), so won't this skew Euclidean distances? Is Hamming distance a better option, and will it also work with numerical values?

(Nov 30 '12 at 05:20) Peytonator

If your numerical attributes are normalized, then you can map a binary attribute to (0,1). So you could have a gender attribute that takes the value of 0 or 1 when gender is male or female, respectively.

Yes, I was suggesting that you could replace "bank name" with three features (one for each value). The benefit of that approach is that it fits easily into a standard clustering algorithm. The downside is that the dimensionality of your data can become very large if you have many bank names. You could also use Hamming distance, as you suggested. That would keep the dimensionality of the data lower, but you would have to explicitly code the kNN algorithm to combine the Hamming distance with the "normal" numerical feature distances.
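That combined-distance idea could be sketched roughly as follows (in Python rather than MATLAB; the feature layout, weight parameter, and record values are illustrative assumptions, not from the thread):

```python
import math

def mixed_distance(x, y, num_idx, cat_idx, w_cat=1.0):
    """Euclidean distance over (normalized) numeric features plus a
    weighted Hamming distance over categorical features. Numeric
    features are assumed to be scaled to [0, 1]."""
    euclid = math.sqrt(sum((x[i] - y[i]) ** 2 for i in num_idx))
    hamming = sum(1 for i in cat_idx if x[i] != y[i])
    return euclid + w_cat * hamming

# Records: [age_norm, salary_norm, bank_name, account_type] (hypothetical)
a = [0.2, 0.5, "Chase", "savings"]
b = [0.2, 0.5, "BBT", "savings"]
print(mixed_distance(a, b, num_idx=[0, 1], cat_idx=[2, 3]))  # 1.0
```

How to weight the categorical term (w_cat) against the numeric term is a modelling choice; it controls how much a single category mismatch counts relative to a full-range numeric difference.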

(Nov 30 '12 at 09:05) bogatron

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.