|
There are a number of ways to determine whether the clustering of some data itself is valid, but assuming that the clusters are valid, are there any established ways to calculate the "confidence" for a particular data-point within a cluster? For example, being able to break down each point and say it's 80% "cluster 1", 10% "cluster 2", and 10% "cluster 3", or even just "the algorithm is 90% that the label 'cluster 1' is valid." |
|
It depends on your clustering algorithm. Bayesian clustering (even with EM, but also with sampling for nonparametric models and more complicated varieties) do give you this sort of information out of the box (and you have to threshold it to get actual cluster memberships). Some spectral clustering algorithms (like the original Shi and Malik normalized cuts) also have a parameter that, as you tweak, changes which points are in each cluster (and from discretizing it you can easily compute a probability). For single-linkage algorithms this is harder, and I know of no standard way. What's EM?
(Jul 18 '10 at 19:07)
sbirch
The Expectation Maximization algorithm -- basically used for estimating parameters in latent variable models. See a brief description here in the context of mixture models for clustering: http://en.wikipedia.org/wiki/Mixture_model
(Jul 18 '10 at 19:11)
spinxl39
It's an inference algorithm for probabilistic models, expectation maximization. It is not a clustering algorithm, but mixture of gaussians clustering (where you assume the points come from independent gaussians and you try to estimate the mean and variance of these gaussians) is usually referred to as EM. See http://en.wikipedia.org/wiki/Mixture_of_gaussians#Expectation_maximization_.28EM.29 for a description.
(Jul 18 '10 at 19:11)
Alexandre Passos ♦
I can't find anything on Bayesian clustering either, is this an algorithm or a theoretical idea? Is there a classic text on clustering which covers everything like this? I feel like I just had a knowledge bomb dropped on me, and I'm still trying to recoup :).
(Jul 18 '10 at 19:35)
sbirch
By bayesian clustering I mean any sort of mixture model or more complex thing. For example, this paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.4467&rep=rep1&type=pdf presents an almost-general algorithm for inference in this sort of model. You can also look for chapters 20, 21, 22, 27 of Mackay's book (it's also very good reading, btw). You can find it free online at http://www.inference.phy.cam.ac.uk/mackay/itila/
(Jul 18 '10 at 19:47)
Alexandre Passos ♦
|
|
BTW, in case you don't want to go via the probabilistic way (EM or Bayesian approaches), you can use the "soft" variant of the K-means algorithm, which uses a set of K "responsibility" variables for each point as a proxy for the probabilities of it belonging to the K clusters. See Mackay's book (Alexandre has given the link above); in particular, chapter-20 on clustering for a description of soft K-means. |
|
Probably not the answer you're looking for, but fuzzy clustering algorithms, such as fuzzy k-means, return soft clusters, where each individual has a degree membership to each cluster. |
|
As others have said, it definitely depends on the approach. But, in most clustering problems I've encountered, determining a point's contribution to, or membership in, each cluster is MUCH easier than determining if the overall clustering is "valid". And, you don't actually need to use a fuzzy clustering method (where a point can belong to many clusters) to do so. If you are clustering a set of points in a vector space, than any measure of distance (of which there are many) can help you. In such cases, clusters have centers or means. The distance between a point and a cluster center is often suitable for creating a "confidence" or "membership" metric: the closer a point is to a cluster center, the more similar it is to that cluster, the more confidence you can have that it belongs to said cluster. This doesn't work very well when the clusters don't approximate d-spheres (in whatever metric space you choose).
(Jul 25 '10 at 15:10)
sbirch
True. Or, more correctly, it works well to the extent that your data is approximated by n-dimensional spheres. In other words, it is not an either/or case and sometimes there are acceptable losses in accuracy though gains in simplicity. In text clustering, for example, this method works pretty well, in that assigning documents to clusters with this method often maintains an acceptable level of semantic similarity between the documents assigned to a cluster. And, even when your data is not well approximated by n-dimensional spheres, this method can point that out: some data points will have high association with multiple clusters, while other may not have high association with any cluster. So, perhaps it is not the distance that gives you confidence a particular point belongs to a particular cluster, perhaps it is the difference between the distances between the point and the two closest clusters. Finally, you can also modify this methods if you suspect your results are suffering from a lack of spherical data. Rather than taking distance from the center of the cluster, you can try many other types of distance measures: distance from the furthest point in the cluster, distance from the closest point in the cluster, a weighted average, and many others.
(Jul 26 '10 at 10:43)
TR FitzGibbon
|
|
I'm not sure why no one has suggested bootstrapping yet, but it's one of those magical things which works with all algorithms. Basically, run the clustering algorithm on n-k datapoints, where k is a value of your choosing. Repeat till desired convergence is reached. |