
Let's say I want to cluster 1M instances. To make the problem concrete, assume the instances are user profiles.

I have written an exhaustive feature generation technique to capture most of the interesting information. Let's assume there are 10K-1M features.

The downside of this approach is that many features will be redundant. Worse yet, features representing one type of information that is less useful (messaging behavior) might be highly redundant and overgenerated, whereas features representing more important information (individual profile fields, like Gender and Income Level) are not redundant and are expressed only once in the feature vector.

I do not want to cluster in the original space, because the highly overgenerated features will have too strong an impact on the distance measure in clustering, and overwhelm the less frequent but highly important features.
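To illustrate the concern, here is a toy sketch. The profile layout, the `make_profile` helper, and the numbers are hypothetical and not taken from my real data. It compares the Euclidean distance between users who differ only in one profile field with the distance between users who differ only in a messaging feature that has been copied many times.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_profile(gender, messaging, copies):
    """Hypothetical layout: one gender feature, then `copies`
    redundant copies of a single messaging feature."""
    return [gender] + [messaging] * copies

for copies in (1, 100):
    a = make_profile(0.0, 0.0, copies)
    b = make_profile(1.0, 0.0, copies)  # differs from a only in gender
    c = make_profile(0.0, 1.0, copies)  # differs from a only in messaging
    print(copies, euclidean(a, b), euclidean(a, c))
```

With one copy the two differences weigh the same. With 100 copies the messaging difference is ten times larger than the gender difference, even though it carries no additional information.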

What is a simple and effective technique for doing feature selection? Keep in mind I have a large feature matrix (1M users x 100K features). Also I prefer not to use an n^2 algorithm, where n is the number of features.

Most of all, I want a technique that will be very simple to code, or has good existing implementations. So if that means it must be n^2, so be it. I can work around that.

asked May 17 '13 at 14:28

Joseph Turian ♦♦

edited May 17 '13 at 14:29


One Answer:

I assume that the simplest answer, if I don't mind an n^2 algorithm, is to use one of the standard feature selection techniques.

Yang and Pedersen (1997) recommend information gain and chi-squared, with document frequency thresholding as a cheaper alternative.
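Document frequency thresholding needs no labels and takes a single pass over the nonzero entries, so it avoids the n^2 cost. A minimal sketch, assuming each user is stored as a list of active feature ids (a hypothetical sparse layout). The function names and the optional `max_df_ratio` cap are my own additions for illustration. The technique as usually described only drops features below a minimum count.

```python
from collections import Counter

def document_frequency(rows):
    """Count, for each feature id, how many rows contain it."""
    df = Counter()
    for row in rows:
        df.update(set(row))  # set() so a repeated id counts once per row
    return df

def select_by_df(rows, min_df=2, max_df_ratio=1.0):
    """Return the set of feature ids present in at least `min_df` rows
    and in at most `max_df_ratio` of all rows (the cap is an assumption)."""
    n = len(rows)
    df = document_frequency(rows)
    return {f for f, c in df.items() if c >= min_df and c / n <= max_df_ratio}

def project(rows, keep):
    """Drop every feature id that was not selected."""
    return [[f for f in row if f in keep] for row in rows]

# Toy data: four users, feature ids 0..4.
rows = [[0, 1, 2], [0, 1], [0, 3], [0, 1, 4]]
keep = select_by_df(rows, min_df=2)
print(sorted(keep))
print(project(rows, keep))
```

Note that this removes rare features only. It does not by itself reduce the weight of a group of frequent, mutually redundant features.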

Forman (2003) recommends Bi-Normal Separation (BNS).
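As I understand it, BNS scores a binary feature by the absolute difference between the inverse standard normal CDF of its true positive rate and of its false positive rate. It is a supervised score, so the sketch below assumes some binary label is available for each user, which a pure clustering setup does not provide. The function name, argument names, and the clipping value `eps` are illustrative assumptions.

```python
from statistics import NormalDist

# Inverse CDF of the standard normal distribution.
_inv_cdf = NormalDist().inv_cdf

def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation score for one binary feature.

    tp:  positive examples that have the feature
    fp:  negative examples that have the feature
    pos: total positive examples, neg: total negative examples
    Rates are clipped to [eps, 1 - eps] because the inverse CDF
    is undefined at 0 and 1.
    """
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(_inv_cdf(tpr) - _inv_cdf(fpr))

# A feature equally common in both classes scores 0;
# a feature concentrated in one class scores higher.
print(bns(50, 50, 100, 100))
print(bns(90, 10, 100, 100))
```

Features would then be ranked by this score and the top k kept, with k chosen by the user.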

answered May 17 '13 at 14:34

Joseph Turian ♦♦

edited May 17 '13 at 14:34


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.