I want to see whether a datapoint $x$ should (or should not) be assigned to its nearest component $y$ using the following condition: if $d > T$, then do not assign $x$ to $y$, with $d = \mathrm{dist}(x, y)$ and $T = \bar{d} + \sigma$, where $\bar{d}$ is the mean distance from $y$ to the datapoints already assigned to it in the past, and $\sigma$ the associated standard deviation. I want to replace this condition (i.e. $d > T$) by an equivalent expression expressed as a probability $p$. How can I do that? I have tried, for example, $p = \exp\left(\frac{-d^2}{2\sigma^2}\right)$. Note: it does not really matter if it is not a true probability (one that sums to one).
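For concreteness, here is a minimal sketch of the two rules side by side: the hard threshold from the question and the squared-exponential score the question proposes as a soft replacement. The function names are mine, chosen for illustration.

```python
import math

def hard_rule(d, d_bar, sigma):
    """Original deterministic rule: assign x to y only if d <= T = d_bar + sigma."""
    return d <= d_bar + sigma

def soft_assign_score(d, sigma):
    """The proposed soft score p = exp(-d^2 / (2 sigma^2)), in (0, 1]:
    close to 1 when d is small, decaying toward 0 as d grows."""
    return math.exp(-(d ** 2) / (2 * sigma ** 2))

# A point at the mean distance scores higher than one past the threshold.
print(soft_assign_score(1.0, 0.5) > soft_assign_score(2.0, 0.5))  # True
print(hard_rule(1.0, 1.0, 0.5), hard_rule(2.0, 1.0, 0.5))         # True False
```

Note that the soft score depends only on $d$ and $\sigma$, not on $\bar{d}$, which is part of what the answers below take issue with.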
Let me rephrase, to see if I understood you correctly. You seem to have an online clustering problem: as each new point arrives, you decide whether to add it to an existing cluster or to start a new cluster for it. The non-probabilistic rule you're using right now looks a lot like the rule in the Dirichlet process k-means paper, so a natural way of going from these hard rules to probabilities would be to look at either the variational or the sampling algorithms for the Dirichlet process. The problem is that, as you noted, there seems to be no clear way of setting $\sigma$ (which is in a way equivalent to setting the threshold $T$) to get clusters at the right granularity. Indeed, the only way to do so, I'd say, is to define some objective function over the clusterings (say, how many clusters there are, how big the average cluster is, etc.) and choose the $\sigma$ that maximizes this objective function. If you can't afford to run the $\sigma$-selection routine on a bunch of held-out data and want to do it online, I advise you to run many instances of your algorithm in parallel, for many different values of $\sigma$ (such as $1.1^n$ for $n$ from $-10$ to $10$), and discard some of them if they seem to lead to degenerate solutions (many will lead to one big cluster, others to too many small clusters, etc.).
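The parallel-instances idea above can be sketched as follows. `OnlineClusterer` is a hypothetical stand-in for whatever online clustering algorithm the asker is using; only the geometric grid of $\sigma$ values and the degeneracy filter come from the answer.

```python
class OnlineClusterer:
    """Hypothetical stand-in for the asker's online clustering algorithm."""
    def __init__(self, sigma):
        self.sigma = sigma
        self.clusters = []  # one entry per cluster created so far

    def add_point(self, x):
        pass  # assign x to the nearest cluster or open a new one

# One instance per sigma on the geometric grid 1.1**n, n in [-10, 10].
sigmas = [1.1 ** n for n in range(-10, 11)]
instances = [OnlineClusterer(sigma=s) for s in sigmas]

# After streaming data through every instance, discard degenerate ones,
# e.g. those that collapsed into one cluster or fragmented into too many.
keep = [c for c in instances if 1 < len(c.clusters) < 50]
```

The cutoff `50` is an arbitrary illustration; in practice the degeneracy test would encode whatever clustering objective the asker cares about.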
|
I think you are confusing the unobservable residual, which I will call $\epsilon$, with its standard deviation, which I will call $\sigma$. The original condition can be re-expressed as $\Pr[\epsilon < \mathrm{dist}(x,y) - \bar{d}]$. You can choose any family you want for the distribution of $\epsilon$; that distributional assumption (e.g. Gaussian) will generate an equation like the one you wrote for $p$. You can evaluate the likelihood (the product of probabilities over your data) for a range of values of $\sigma$, then choose the $\sigma$ that gives the highest likelihood. Finding the optimal $\sigma$ should be done using a maximization routine on the likelihood equation.

Thanks for this answer; if it can be improved, it will be perfect. Firstly, I'm actually using the standard deviation ($\sigma$) corresponding to the computed mean ($\bar{d}$). Is the residual more convenient here? How can I compute it incrementally? Secondly, when you say $\Pr[\epsilon < \mathrm{dist}(x,y) - \bar{d}]$, do you mean that I should draw a random number $\epsilon$ (from a uniform distribution, for example) and just check whether $\epsilon < d - \bar{d}$? That cannot be right, because $d - \bar{d}$ is not a probability. Did you rather mean to check whether $\epsilon < \exp\left(\frac{-(\mathrm{dist}(x,y) - \bar{d})^2}{2\sigma^2}\right)$? Thirdly, do you mean that for the component $y$, each time there is a datapoint $x$ nearest to it, I compute $p$?
(Jun 05 '12 at 15:20)
shn
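To make the answer's reformulation concrete: if one assumes $\epsilon \sim \mathcal{N}(0, \sigma^2)$ (the Gaussian choice the answer mentions), then $\Pr[\epsilon < \mathrm{dist}(x,y) - \bar{d}]$ is just the normal CDF evaluated at the standardized difference, which gives a smooth version of the hard rule. A minimal sketch under that assumption:

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_not_assigned(d, d_bar, sigma):
    """Pr[epsilon < d - d_bar] with epsilon ~ N(0, sigma^2):
    a smooth counterpart of the hard rule d > d_bar + sigma."""
    return normal_cdf((d - d_bar) / sigma)

# At d = d_bar the probability is exactly 0.5; at the old threshold
# d = d_bar + sigma it is about 0.84 (one standard deviation out).
print(prob_not_assigned(1.0, 1.0, 0.5))  # 0.5
```

No random draw of $\epsilon$ is needed, which is one reading of the follow-up confusion below: the probability is computed analytically from the CDF, not by sampling.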
I may be confused about what you are trying to do. To use maximum likelihood estimation, it is standard to set up a model where the classification of each point is a function of both a deterministic part and a random part. The random part is the residual (which can be calculated on training data, and which can be integrated out to get probabilities for out-of-sample predictions). The random part guarantees that the probabilities aren't strictly 0 or 1. It sounds like you are setting up a model where everything is deterministic: since you are using a specific known value ($\sigma$) where I thought you meant to use a random component, every point is either "definitely assigned to component $y$" or "definitely not assigned to component $y$". In most datasets, you won't be able to run maximum likelihood estimation with such a deterministic model, because there will inevitably be at least one point with a probability of 0, and then the likelihood equation will be 0 no matter what happens with the other points. Unfortunately, I don't have a clear understanding of what you are doing in the bigger picture here, so it's possible we are talking past each other because we are thinking of different contexts.
(Jun 06 '12 at 00:23)
dbecker
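The likelihood-maximization step the answer describes (evaluate the product of per-point probabilities for a range of $\sigma$ values, keep the best) can be sketched as follows. The residuals here are made-up illustrative numbers, and a grid search stands in for the "maximization routine" the answer mentions:

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian density, used as the per-point likelihood of a residual."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(residuals, sigma):
    """Log of the product of per-point probabilities (sum of log densities)."""
    return sum(math.log(normal_pdf(r, 0.0, sigma)) for r in residuals)

def best_sigma(residuals, candidates):
    """Pick the candidate sigma that maximizes the likelihood of the residuals."""
    return max(candidates, key=lambda s: log_likelihood(residuals, s))

# Illustrative residuals d - d_bar observed for one component.
residuals = [-0.2, 0.1, 0.3, -0.1, 0.2]
grid = [0.05 * k for k in range(1, 21)]  # candidate sigmas 0.05 .. 1.0
print(best_sigma(residuals, grid))
```

For zero-mean Gaussian residuals the maximizer is close to $\sqrt{\frac{1}{n}\sum_i r_i^2}$, so a proper optimizer could replace the grid; the grid just mirrors the "range of values" phrasing in the answer.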
Shna: you're confusing a few things. If epsilon is a random variable, the expression P(epsilon < n) is well-defined for any real number n, which doesn't have to be a probability. This is the probability that by drawing a random epsilon it will be bigger than that difference.
(Jun 06 '12 at 01:28)
Alexandre Passos ♦
@dbecker did you mean that $\Pr[\epsilon < \mathrm{dist}(x,y) - \bar{d}]$ is the probability that $x$ will not be assigned to component $y$?
(Jun 07 '12 at 07:36)
shn
@shna Yes. I haven't thought too much about whether that's a great equation to assign to components... but that is what I meant, because it's consistent with what you wrote (just using algebra to solve for $\epsilon$ in your equation for when $x$ will not be assigned to $y$).
(Jun 07 '12 at 21:38)
dbecker
@dbecker I think the condition I wrote ($\mathrm{dist}(x,y) > T$) is consistent with $\Pr[\epsilon < \mathrm{dist}(x,y) - T]$ (i.e. $T$ instead of $\bar{d}$), isn't it? Moreover, since the distance can never be negative, should we instead use $\Pr[0 < \epsilon < \mathrm{dist}(x,y) - T]$? And also, I don't understand why it is not just $\Pr[\epsilon < \mathrm{dist}(x,y)]$, with $\epsilon$ distributed according to $\mathrm{Gaussian}(\bar{d}, \sigma)$ or $\mathrm{Gaussian}(T, \sigma)$.
(Jun 08 '12 at 08:43)
shn
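On the last point in the comment above: under the Gaussian assumption, the two formulations are in fact the same probability, because the normal family is location-shift invariant. A small check (the numbers are arbitrary illustrations):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x, via the standard normal CDF."""
    return normal_cdf((x - mu) / sigma)

d, T, sigma = 2.3, 1.8, 0.5

p1 = normal_cdf((d - T) / sigma)  # Pr[eps < d - T], eps ~ N(0, sigma^2)
p2 = gaussian_cdf(d, T, sigma)    # Pr[eps < d],     eps ~ N(T, sigma^2)
print(abs(p1 - p2) < 1e-12)       # True: the two formulations coincide
```

So writing $\Pr[\epsilon < \mathrm{dist}(x,y)]$ with $\epsilon \sim \mathrm{Gaussian}(T, \sigma)$ is just a re-centred version of $\Pr[\epsilon < \mathrm{dist}(x,y) - T]$ with $\epsilon \sim \mathrm{Gaussian}(0, \sigma)$.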
@Shna: I assumed you meant for the $\sigma$ to be an $\epsilon$. Then I solved for $\epsilon$, which produced the expression I listed. I'm not entirely sure where $\Pr[\epsilon < \mathrm{dist}(x,y) - T]$ comes from, because I just don't know where you are adding the $\epsilon$ into your original equation. If you say how you are modifying it to incorporate $\epsilon$, then I'm sure we'll end up in the same place (since it's just algebra to solve for $\epsilon$).
(Jun 08 '12 at 08:59)
dbecker