(suggested post from cstheory)

I am looking to cluster users in a database, where each user is represented by a number of features, some discrete and some continuous. "Similar" users should be clustered together in a way that makes the underlying strongly correlated features easy to discover. The number of clusters is unknown a priori.

To make this more concrete, let's say the features per user are:

  • gender (m/f)
  • location (one of 10 cities)
  • favorite color (red/green/blue).

Let's say that we have N users and that favorite color is a random variable dependent on gender and city. How do we discover possible strong correlations between gender and/or location and favorite color? A number of techniques are available, ranging from k-NN and k-means to matrix factorization and PCA. PCA can indicate which dimensions can be reduced, but it is not really clustering. Clustering algorithms, on the other hand, seem to hide the underlying correlations that tie the users together. Any advice?
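To make the setup concrete and runnable, here is a minimal sketch (mine, not the poster's) that generates synthetic data of this shape, with an invented dependence of color on gender and city, and runs a chi-squared test of independence between gender and color; the later sketches in this thread reuse this df:

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)
    N = 10_000
    gender = rng.choice(["m", "f"], size=N)
    city = rng.integers(0, 10, size=N)              # 10 cities, coded 0..9

    # Invented dependence: women in even-numbered cities favor red.
    p_red = np.where((gender == "f") & (city % 2 == 0), 0.7, 0.2)
    other = rng.choice(["green", "blue"], size=N)
    color = np.where(rng.random(N) < p_red, "red", other)
    df = pd.DataFrame({"gender": gender, "city": city, "color": color})

    # Chi-squared test of independence between gender and color.
    table = pd.crosstab(df["gender"], df["color"])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"chi2={chi2:.1f}, p={p:.3g}")            # tiny p => dependent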

asked Dec 08 '10 at 17:14

Dan Benyamin

Why don't you estimate the correlations directly without clustering?

Have you tried factor analysis? If you have some sort of labels, CCA might be useful as well.

(Dec 08 '10 at 18:49) Alexandre Passos ♦
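A minimal sketch of the "estimate the correlations directly" suggestion, using Cramér's V (a chi-squared-based association measure for pairs of categorical variables) on the df built in the sketch under the question; the pairwise loop is illustrative:

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    def cramers_v(x, y):
        # Cramér's V association between two categorical series, in [0, 1].
        table = pd.crosstab(x, y)
        chi2 = chi2_contingency(table)[0]
        n = table.to_numpy().sum()
        r, k = table.shape
        return np.sqrt(chi2 / (n * (min(r, k) - 1)))

    cols = ["gender", "city", "color"]
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            print(a, b, round(cramers_v(df[a], df[b]), 3))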

4 Answers:

This is an ill-defined problem (not a criticism of your question), which means that you probably have a fair amount of latitude in your choice of method.

PCA would identify strongly correlated combinations of many original variables. One problem with PCA is that it can be very expensive on very large datasets (see SVD by randomized projections for a way to compute it at scale). Another is that it may not give you what you want: you may want combinations of a finite, possibly small, set of the original variables. Sparse PCA would get you some of that, but in my experience it too can be very expensive.
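As a rough illustration of the randomized-projection suggestion (my sketch, not the answerer's code; the one-hot encoding and n_components=5 are assumptions), using scikit-learn's randomized_svd on the df from the question's sketch:

    import pandas as pd
    from sklearn.utils.extmath import randomized_svd

    # One-hot encode and center; on this encoding SVD acts like PCA.
    X = pd.get_dummies(df.astype(str)).to_numpy(dtype=float)
    X -= X.mean(axis=0)

    U, S, Vt = randomized_svd(X, n_components=5, random_state=0)
    # Rows of Vt are principal directions over the binary features;
    # large loadings flag strongly correlated feature combinations.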

This is why I would have a look at hierarchical clustering methods if you are interested in finding combinations of only a small number of variables. Ward clustering would be the best option from a theoretical point of view, but if your data is really huge, single linkage can be implemented in a really efficient online way. If you are only looking to pick up the most correlated features, you don't have to go very high up in the tree.
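A minimal sketch of that idea, assuming the df from above: cluster the features rather than the users, using 1 - |correlation| between one-hot columns as the distance (my choice of metric, not the answerer's) with single linkage, and cut low in the tree:

    import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    dummies = pd.get_dummies(df.astype(str))
    corr = np.corrcoef(dummies.to_numpy(dtype=float).T)
    dist = 1.0 - np.abs(corr)          # correlated columns -> small distance
    np.fill_diagonal(dist, 0.0)

    Z = linkage(squareform(dist, checks=False), method="single")
    labels = fcluster(Z, t=0.5, criterion="distance")   # cut low in the tree
    for c in np.unique(labels):
        print(c, list(dummies.columns[labels == c]))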

Of course, my answer does not solve the problem of the number of clusters, but that is a very hard problem that I would try to work around. You can test whether the clustering criterion goes through a 'jump' and stop the clustering when that jump happens, but in practice this works only with very well-separated clusters. If your users can interact with the data, you could make an initial guess at the height at which to cut the hierarchical clustering tree and let them move up and down in it.
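A sketch of the 'jump' heuristic, reusing Z and fcluster from the previous snippet: cut the tree inside the largest gap between successive merge distances.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster

    heights = Z[:, 2]                  # merge distances, non-decreasing
    gaps = np.diff(heights)
    cut = heights[np.argmax(gaps)] + gaps.max() / 2.0   # inside the biggest gap
    labels = fcluster(Z, t=cut, criterion="distance")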

answered Dec 11 '10 at 09:08

Gael Varoquaux

Very helpful, thank you!

(Dec 14 '10 at 00:07) Dan Benyamin

Maybe this is helpful: Mixture of Factor Analysers with mixed types.

answered Dec 09 '10 at 04:05

osdf

edited Dec 09 '10 at 04:06

If gender and color were uncorrelated, then distinct(gender, color) ≈ distinct(gender) · distinct(color). So if distinct(gender, color) is much smaller than distinct(gender) · distinct(color), there is a correlation.
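With pandas this heuristic is a two-liner (a sketch, using the df from the question's example):

    joint = len(df[["gender", "color"]].drop_duplicates())
    product = df["gender"].nunique() * df["color"].nunique()
    print(joint, product)   # joint << product hints at dependence
    # Caveat: this only catches value pairs that never co-occur at all;
    # a chi-squared test also picks up softer dependence.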

One other thing you could do is try to train a predictor that predicts gender from color, or color from gender, and see what kind of accuracy you achieve.
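A hedged sketch of the predictor idea, with logistic regression standing in for an arbitrary classifier (my choice) and accuracy compared against the majority-class baseline:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X = pd.get_dummies(df[["gender", "city"]].astype(str))
    y = df["color"]

    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    baseline = y.value_counts(normalize=True).max()   # guess-the-majority accuracy
    print(f"accuracy={acc:.3f} vs baseline={baseline:.3f}")  # acc >> baseline => dependence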

answered Sep 07 '11 at 13:44

maxime caron

It sounds like what you really want is to transpose your data matrix and treat the features as items and the items as features; then some form of clustering or other unsupervised learning will give you what you want. However, this assumes that your features are reasonably homogeneous, and it does not make much sense if you have discrete features over different domains. If you turn all the discrete features into sets of binary features (one for each possible value), though, I think the approach could still work. Correspondence Analysis may also be of interest to you: it treats items and features symmetrically and embeds them into the same space.
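A rough sketch of correspondence analysis written from its standard definition (not from a library), on the city-by-color contingency table of the df above:

    import numpy as np
    import pandas as pd

    table = pd.crosstab(df["city"], df["color"]).to_numpy(dtype=float)
    P = table / table.sum()                       # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)           # row / column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals

    U, D, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * D) / np.sqrt(r)[:, None]    # cities in the shared space
    col_coords = (Vt.T * D) / np.sqrt(c)[:, None] # colors in the same space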

answered Sep 07 '11 at 14:26

Daniel Mahler
