I have a large dataset (50,000 sparse feature vectors, each 2,000-dimensional) that I want to cluster into k clusters, where k is unknown. Hierarchical clustering tends to give better results, but its time complexity makes it very expensive at this scale, so I have designed my clustering framework as follows:

  1. Run K-means to partition the data into several bins (k is unknown, so I make it reasonably large, e.g. k=500).
  2. Take the centroids of all 500 partitions.
  3. Run hierarchical clustering on those 500 centroids (merging them based on some threshold value t).
  4. Assign each data point to its nearest centroid (using the centroids that emerge from the hierarchical clustering).
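
The four steps above can be sketched roughly as follows. This is only an illustration under several assumptions that are not part of the original description: it uses scikit-learn, it swaps plain K-means for `MiniBatchKMeans` to keep the first stage cheap, it uses average linkage, and it represents each merged group by the unweighted mean of its member centroids. The function name `two_stage_cluster` and its parameters `n_bins` and `threshold` (standing in for k and t) are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin


def two_stage_cluster(X, n_bins=500, threshold=1.0, random_state=0):
    """Over-partition with K-means, merge the bin centroids, reassign points.

    X may be a dense array or a SciPy sparse matrix of shape (n_samples, n_features).
    """
    # Steps 1-2: partition the data into n_bins bins and keep their centroids.
    # MiniBatchKMeans is an assumption here; KMeans could be used instead.
    km = MiniBatchKMeans(n_clusters=n_bins, n_init=3, random_state=random_state)
    km.fit(X)
    centroids = km.cluster_centers_  # dense array, shape (n_bins, n_features)

    # Step 3: hierarchical clustering on the centroids only. With
    # n_clusters=None, merging stops once the linkage distance reaches threshold.
    agg = AgglomerativeClustering(
        n_clusters=None, distance_threshold=threshold, linkage="average"
    )
    merged = agg.fit_predict(centroids)

    # One representative per merged group: the unweighted mean of its
    # member centroids (an assumption; a size-weighted mean is another option).
    merged_centroids = np.vstack(
        [centroids[merged == g].mean(axis=0) for g in range(agg.n_clusters_)]
    )

    # Step 4: assign every data point to its nearest merged centroid.
    labels = pairwise_distances_argmin(X, merged_centroids)
    return labels, merged_centroids
```

Only the `n_bins` centroids go through the hierarchical step, so its cost depends on `n_bins` rather than on the 50,000 points. The final number of clusters is determined by `threshold`, which would need tuning for the data at hand.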

I would like to know whether this approach is efficient and, if possible, whether there is a better solution to this problem.

Thank you.

asked Jul 26 '12 at 14:07


Mahin

edited Jul 26 '12 at 14:10


One Answer:

answered Jul 26 '12 at 17:22


Gael Varoquaux

edited Jul 26 '12 at 17:23


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.