Let's say I am doing clustering on a dataset of 1M or 10M points. The clustering is currently exploratory. Is it preferable to downsample to 10K points and use more expensive clustering algorithms, or should I only use cheap clustering (DBSCAN, k-means) on the full dataset? What are the pluses and minuses of each approach?

This is very open-ended and depends entirely on what you are trying to do, so I suggest working backwards from your end goal. What would finding 100 small clusters vs. 3 big ones tell you? I think downsampling and experimenting with different methods is very reasonable, but again, if different methods produce different clusters, how are you going to evaluate which set of clusters is "better"? As an aside, PCA is entirely feasible on 10M samples (how many dimensions?) using things like partial SVD, and it is also a form of (soft) clustering: you can visually inspect the principal component plots to detect clusters.
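
As a rough illustration of that aside, here is a minimal sketch using scikit-learn's randomized-SVD PCA and a scatter of the first two components; `X` and all the sizes here are made-up placeholders for your data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 50)).astype(np.float32)  # stand-in for your data

# Randomized (partial) SVD keeps this tractable even for millions of rows.
pca = PCA(n_components=2, svd_solver="randomized", random_state=0)
Z = pca.fit_transform(X)

# Plot a random subsample of the projection; you don't need every point
# to spot cluster structure visually.
idx = rng.choice(len(Z), size=20_000, replace=False)
plt.scatter(Z[idx, 0], Z[idx, 1], s=2, alpha=0.3)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```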

Why not do both if it's exploratory? You might gain different insights. That said, if you are doing exploratory clustering you probably want a smallish number of clusters (e.g. fewer than 100); otherwise you will have a hard time getting any insight by manually inspecting thousands of clusters. So it's very likely that 100K samples are enough to find approximately stable k-means centroids for manual inspection. You could also partition your data along a meaningful dimension (e.g. time or geolocation), run clustering in each partition, and qualitatively compare the outcomes or their evolution over time (e.g. cluster overlaps, new cluster centers appearing, or clusters disappearing over time).
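
If you want to sanity-check the "100K samples is enough" idea, here is a minimal sketch assuming your data sits in a matrix `X`: run MiniBatchKMeans on the full data, plain k-means on a 100K subsample, and compare how closely the two sets of centroids match. `X`, `k`, and all sizes are arbitrary placeholders.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans, MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 20)).astype(np.float32)  # stand-in for your data
k = 20                                                    # arbitrary cluster count

# Cheap clustering on the full dataset.
full = MiniBatchKMeans(n_clusters=k, batch_size=10_000, random_state=0).fit(X)

# Ordinary k-means on a 100K subsample.
sub_idx = rng.choice(len(X), size=100_000, replace=False)
sub = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[sub_idx])

# Match each subsample centroid to its nearest full-data centroid; small
# distances suggest the subsample already captures the same structure.
d = cdist(sub.cluster_centers_, full.cluster_centers_)
print("mean nearest-centroid distance:", d.min(axis=1).mean())
```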