Hi, I'm busy working on a project involving k-nearest neighbour clustering (specifically KNN regression). I have mixed numerical and categorical fields. How do I go about incorporating the categorical values into the analysis? Btw, the categorical values are ordinal (e.g., bank name, account type). Numerical values would be, e.g., age and salary. There are also some binary types (e.g., male, female). Help would be much appreciated.
Decision trees are very good at handling categorical data, but they are not unsupervised. There does, however, seem to be some material on decision tree clustering (paper).
Hi, I have a similar problem to yours. I want to cluster my data, which has both categorical and numerical features. One of my features is 'type' (the type of movie) and it has 21 values, so I can't make a dummy feature for each of them! Please help me and tell me what I can do.
For binary features, you can map (TRUE/FALSE) to (0, 1). For categorical features, a common approach is to create a binary feature for each value of the attribute. Suppose your "bank name" attribute has three possible values: you would replace it with three binary features, one per bank name, with exactly one of them set to 1 for each record.

Note: You stated that "bank name" and "account type" are ordinal, but I am guessing they are actually just categorical, since there is likely no meaningful ordering of the values. For a truly ordinal attribute, you could instead map the values to integers that preserve the ordering.
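A minimal sketch of that dummy-feature encoding in Python/NumPy, assuming a made-up "bank name" column with three values; the bank names and the min-max scaling of the numerical columns are illustrative choices, not part of the answer above:

    import numpy as np

    # Toy records: (age, salary, bank name); the bank names are made-up values.
    banks = ["ABC Bank", "XYZ Bank", "Acme Bank"]
    records = [(25, 30000, "ABC Bank"),
               (47, 82000, "Acme Bank"),
               (33, 56000, "XYZ Bank")]

    def encode(record):
        # Replace the categorical bank name with one 0/1 dummy feature per bank.
        age, salary, bank = record
        dummies = [1.0 if bank == b else 0.0 for b in banks]
        return np.array([age, salary] + dummies)

    X = np.vstack([encode(r) for r in records])

    # Min-max normalise the numerical columns so they sit on the same 0-1 scale
    # as the dummy features before Euclidean distances are computed.
    num = X[:, :2]
    X[:, :2] = (num - num.min(axis=0)) / (num.max(axis=0) - num.min(axis=0))
    print(X)

Once the categorical column is expanded like this, an ordinary Euclidean-distance KNN routine can be run on the resulting matrix without modification.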
Thanks bogatron. How do I practically use this? At the moment, I'm implementing KNN searching with a simple MATLAB function. The function calculates Euclidean distances to each neighbour in the cluster. So the algorithm would see, e.g., male = 0, female = 1. However, I can't pass in three dummy variables for other categorical data. What should I do?
(Nov 30 '12 at 03:05)
Peytonator
I was thinking about it a bit more - are you saying that I should create 3 NEW dummy variables, which are purely binary now? Also, surely the binary values carry different weights (e.g., male = 0, female = 1), so this won't work with Euclidean distances? Is Hamming distance a better option, and will it also work with numerical values?
(Nov 30 '12 at 05:20)
Peytonator
If your numerical attributes are normalized, then you can map a binary attribute to (0, 1). So you could have a gender feature that is simply 0 or 1, on the same scale as the normalized numerical features. Yes, I was suggesting that you could replace "bank name" with three features (one for each value). The benefit of that approach is that it will easily fit into a standard clustering algorithm. The downside is that the dimensionality of your data can become very large if you have many bank names. You could also use Hamming distance, as you suggested. That would keep the dimensionality of the data lower, but you would have to explicitly code the distance calculation yourself.
(Nov 30 '12 at 09:05)
bogatron
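On the "explicitly code the distance" option: here is a rough sketch, in Python rather than MATLAB, of a KNN regression that combines Euclidean distance over normalised numerical features with Hamming distance over raw categorical labels. The cat_weight factor, the feature names, and the toy values are all illustrative assumptions; how heavily a categorical mismatch should count relative to the numerical distance is a modelling choice you would have to make yourself.

    import numpy as np

    def mixed_distance(a_num, a_cat, b_num, b_cat, cat_weight=1.0):
        # Euclidean distance on the (normalised) numerical part plus a weighted
        # Hamming distance (count of mismatches) on the categorical part.
        euclidean = np.sqrt(np.sum((a_num - b_num) ** 2))
        hamming = sum(x != y for x, y in zip(a_cat, b_cat))
        return euclidean + cat_weight * hamming

    def knn_regress(query_num, query_cat, train_num, train_cat, train_y, k=3):
        # Predict the target for the query as the mean target of its k nearest
        # training records under mixed_distance.
        dists = [mixed_distance(query_num, query_cat, n, c)
                 for n, c in zip(train_num, train_cat)]
        nearest = np.argsort(dists)[:k]
        return float(np.mean(np.array(train_y)[nearest]))

    # Toy data: numerical features already scaled to 0-1 (e.g. age, salary);
    # categorical features left as raw labels (bank name, account type).
    train_num = np.array([[0.2, 0.1], [0.8, 0.9], [0.3, 0.2]])
    train_cat = [("ABC Bank", "savings"), ("XYZ Bank", "cheque"), ("ABC Bank", "cheque")]
    train_y = [30000.0, 90000.0, 40000.0]

    print(knn_regress(np.array([0.25, 0.15]), ("ABC Bank", "savings"),
                      train_num, train_cat, train_y, k=2))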