Suppose we have a very simple online k-means where each new data point is assigned to its nearest center (whose mean is updated incrementally). Each center (cluster) is labelled with the most common label among the data points assigned to it. In this configuration, is it possible to compute a sort of "posterior probability"? That is, can the posterior probability of a class label $y$ given a data point $x$, $P(y|x)$, just be $1/\text{distance}(x, m_y)$, where $m_y$ is the center labelled $y$ that is nearest to $x$?

I want some measure of confidence about how probable it is that the instance $x$ belongs to class $y$. I want to use this probability so that the learning algorithm can actively query the data points about whose labels it is least certain (i.e., ask for their labels). This is often straightforward for probabilistic learning models: for example, a probabilistic binary classifier would simply query the instances whose posterior probability of being positive is nearest to 0.5. How can $P(y|x)$ be defined when the model is not probabilistic (e.g., with the online k-means above)?
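To make the setup concrete, here is a minimal sketch of what I mean in Python. The renormalized inverse distances and the margin-based query rule are just one arbitrary choice for illustration; I am not claiming they give a calibrated posterior. The scores are per center; per class, one would take the score of the nearest center carrying each label.

```python
import numpy as np

class OnlineKMeans:
    """Very simple online k-means: each new point moves its nearest center."""

    def __init__(self, centers, labels):
        self.centers = np.asarray(centers, dtype=float)   # (k, d) initial centers
        self.counts = np.ones(len(self.centers))          # points seen per center
        self.labels = list(labels)                        # majority label per center

    def update(self, x):
        x = np.asarray(x, dtype=float)
        d = np.linalg.norm(self.centers - x, axis=1)
        j = int(np.argmin(d))
        self.counts[j] += 1
        # Incremental mean update: m <- m + (x - m) / n
        self.centers[j] += (x - self.centers[j]) / self.counts[j]
        return j

    def inverse_distance_scores(self, x, eps=1e-12):
        """Ad-hoc 'P(y|x)': inverse distances to the centers, renormalized to sum to 1.
        This is the quantity the question asks about; it is NOT a real posterior."""
        d = np.linalg.norm(self.centers - np.asarray(x, dtype=float), axis=1)
        s = 1.0 / (d + eps)
        return s / s.sum()

def should_query(model, x, margin_threshold=0.1):
    """Uncertainty sampling: query when the margin between the two best centers is small."""
    p = model.inverse_distance_scores(x)
    top2 = np.sort(p)[-2:]
    return (top2[1] - top2[0]) < margin_threshold
```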
There is a probabilistic interpretation of the k-means algorithm: see, for example, http://mlg.eng.cam.ac.uk/teaching/3f3/1011/lect4.pdf. Assignment-probability estimation based on this interpretation seems pretty straightforward.

EDIT: My understanding is that one has to fit a Gaussian to each of the clusters (the means are already there, so you only need the covariance matrix). Once that is done, one can estimate the probability of any given data point being generated by each of the Gaussians, which can be used as a posterior probability estimate.

It is not clear; can you please explain that probabilistic interpretation? What is $P(y|x)$ in this case? – shn (Nov 08 '12 at 04:15)
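A minimal sketch of one way to read that interpretation, assuming you keep (or can reconstruct) the points assigned to each cluster; the size-proportional priors and the small ridge on the covariance are my own assumptions, not part of the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def cluster_posteriors(x, cluster_points, priors=None):
    """Fit one Gaussian per cluster (mean + covariance from its assigned points)
    and return P(cluster | x) via Bayes' rule.

    cluster_points: list of (n_i, d) arrays, the points assigned to each cluster.
    priors: optional cluster priors; defaults to being proportional to cluster size.
    """
    k = len(cluster_points)
    sizes = np.array([len(pts) for pts in cluster_points], dtype=float)
    priors = sizes / sizes.sum() if priors is None else np.asarray(priors, dtype=float)

    likelihoods = np.empty(k)
    for i, pts in enumerate(cluster_points):
        mean = pts.mean(axis=0)
        # Small ridge on the covariance so it stays invertible for tiny clusters.
        cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(pts.shape[1])
        likelihoods[i] = multivariate_normal.pdf(x, mean=mean, cov=cov)

    joint = priors * likelihoods
    return joint / joint.sum()   # P(cluster | x), one entry per cluster
```

If several centers share the same label $y$, $P(y|x)$ would then be the sum of the posteriors of the clusters labelled $y$.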
You could also look at http://mathoverflow.net/questions/19184/what-is-the-probabilistic-counterpart-of-weighted-k-means, where a very similar question is answered with references. I think the other poster's reference is quite good as well; look at the last couple of slides.
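One concrete form such a probabilistic counterpart can take is soft k-means, where hard assignments are replaced by responsibilities proportional to $\exp(-\beta\,\lVert x - m_k\rVert^2)$. A minimal sketch; the stiffness `beta` is a free parameter you would have to choose, and plays the role of an inverse variance if every cluster is given the same isotropic Gaussian:

```python
import numpy as np

def soft_assignments(x, centers, beta=1.0):
    """Soft k-means responsibilities: r_k proportional to exp(-beta * ||x - m_k||^2).
    With equal isotropic Gaussians per cluster this equals P(cluster | x)."""
    d2 = np.sum((np.asarray(centers, dtype=float) - np.asarray(x, dtype=float)) ** 2, axis=1)
    logits = -beta * d2
    logits -= logits.max()            # subtract max for numerical stability
    r = np.exp(logits)
    return r / r.sum()
```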