I want to decide whether a datapoint x should (or should not) be assigned to its nearest component y using the following condition:

if $d > T$, then do not assign x to y, with $d = \mathrm{dist}(x, y)$ and $T = \bar{d} + \sigma$, where $\bar{d}$ is the mean distance from y to the datapoints that were already assigned to it in the past, and $\sigma$ is the associated standard deviation.

I want to replace this condition (i.e. $d > T$) by an equivalent expression in the form of a probability $p$. How can I do that? I've tried, for example, $p = \exp\left(\frac{-d^2}{2\sigma^2}\right)$, but it does not behave like the first condition unless we manually find a good value for $\sigma$, and I don't know if there is any way to "learn" the good value for $\sigma$.

Note: it does not really matter if it is not a true probability (one that sums to one).
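For concreteness, here is a minimal sketch of the hard rule as stated above, with $\bar{d}$ and $\sigma$ maintained incrementally via Welford's algorithm (the `Component` class and its method names are purely illustrative, not part of the question):

```python
import math

class Component:
    """Running distance statistics for one component y (Welford's algorithm)."""

    def __init__(self):
        self.n = 0        # number of datapoints assigned so far
        self.mean = 0.0   # running mean of assignment distances: d_bar
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def assign(self, d):
        """Record a new assignment distance d, updating d_bar and sigma."""
        self.n += 1
        delta = d - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (d - self.mean)

    def threshold(self):
        """T = d_bar + sigma (population stdev; zero until two points are seen)."""
        sigma = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        return self.mean + sigma

def should_reject(component, d):
    """The hard rule: do not assign x to y when d = dist(x, y) exceeds T."""
    return d > component.threshold()
```

Note that $d > T$ is equivalent to $(d - \bar{d})/\sigma > 1$, which is the quantity any probabilistic replacement has to reproduce.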

asked Jun 05 '12 at 10:11

shn

edited Jun 05 '12 at 10:38


2 Answers:

Let me rephrase, to see if I understood you correctly. You seem to have an online clustering problem, and you want to decide, as each new point arrives, whether to add it to an existing cluster or to start a new cluster for that point.

The non-probabilistic rule you're using right now looks a lot like the rule in the Dirichlet process k-means paper, so a natural way of going from these hard rules to probabilities would be to look at either the variational or the sampling algorithms for the Dirichlet process.

The problem is that, as you noted, there seems to be no clear way of setting sigma (which is, in a way, equivalent to setting the threshold $T$) to get clusters at the right granularity. Indeed, the only way to do so, I'd say, is to define some objective function over the clusterings (how many clusters there are, how big the average cluster is, etc.) and choose the sigma that maximizes this objective function. If you can't afford to run the sigma-selection routine on a bunch of held-out data and want to do it online, I advise you to run many instances of your algorithm in parallel, for many different values of sigma (as in $1.1^n$, for $n$ from $-10$ to $10$, for example), and discard some of these if they seem to lead to degenerate solutions (many will lead to one big cluster, others to too many small clusters, etc.), as in the sketch below.
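A toy, self-contained illustration of this parallel-instances idea; the one-dimensional clusterer and the objective below are deliberately simplistic stand-ins for the real algorithm (all names are illustrative, and this stand-in uses sigma itself as the assignment threshold rather than the adaptive $T$ from the question):

```python
import random

class OnlineClusterer:
    """Simplistic stand-in: one pass, hard distance threshold, 1-D data."""
    def __init__(self, sigma):
        self.sigma = sigma
        self.centers = []                     # one representative per cluster

    def observe(self, x):
        if self.centers:
            nearest = min(self.centers, key=lambda c: abs(x - c))
            if abs(x - nearest) <= self.sigma:
                return                        # assign x to the nearest cluster
        self.centers.append(x)                # otherwise start a new cluster

def objective(clusterer, target=5):
    """Toy objective: prefer clusterings with roughly `target` clusters."""
    return -abs(len(clusterer.centers) - target)

random.seed(0)
data = [random.gauss(mu, 0.3) for mu in (0, 3, 6, 9, 12) for _ in range(50)]
random.shuffle(data)

# One instance per candidate sigma, all fed the same stream in parallel.
instances = {1.1 ** n: OnlineClusterer(1.1 ** n) for n in range(-10, 11)}
for x in data:
    for clusterer in instances.values():
        clusterer.observe(x)

# Score every run and keep the sigma that does best under the objective;
# degenerate runs (one big cluster, or all singletons) score poorly.
best_sigma = max(instances, key=lambda s: objective(instances[s]))
print(best_sigma, len(instances[best_sigma].centers))
```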

answered Jun 06 '12 at 01:33

Alexandre Passos ♦

I think you are confusing the unobservable residual, which I will call epsilon, with its standard deviation, which I will call sigma.

The original condition can be re-expressed as $\Pr[\epsilon < \mathrm{dist}(x,y) - \bar{d}]$.

You can choose any family you want for the distribution of epsilon. That distributional assumption (e.g. Gaussian) will generate an equation like the one you've shown for $p$. You can evaluate the likelihood (the product of probabilities over your data) for a range of values of sigma, then choose the sigma that gives the highest likelihood. Finding the optimal sigma should be done using a maximization routine on the likelihood equation, as in the sketch below.
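A minimal sketch of that fitting procedure, assuming a zero-mean Gaussian for epsilon; the residual values below are made up purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Residuals r_i = dist(x_i, y) - d_bar for points already assigned to y
# (toy numbers, for illustration only).
residuals = np.array([-0.4, 0.1, 0.3, -0.2, 0.5, 0.0, -0.1])

def neg_log_likelihood(sigma):
    """Negative Gaussian log-likelihood of the residuals as a function of sigma."""
    return -np.sum(norm.logpdf(residuals, loc=0.0, scale=sigma))

# Maximize the likelihood over sigma (i.e. minimize its negative).
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
sigma_hat = result.x

# For a zero-mean Gaussian this agrees with the closed-form MLE:
assert np.isclose(sigma_hat, np.sqrt(np.mean(residuals ** 2)), atol=1e-3)

# The probabilistic version of the rejection rule is then
# p = Pr[epsilon < d - d_bar] = Phi((d - d_bar) / sigma_hat):
p_reject = norm.cdf(0.9 / sigma_hat)   # e.g. a new point with d - d_bar = 0.9
print(sigma_hat, p_reject)
```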

answered Jun 05 '12 at 12:14

dbecker

Thanks for this answer; if it can be improved, it will be perfect.

Firstly, I'm actually using the standard deviation ($\sigma$) corresponding to the computed mean ($\bar{d}$). Is the residual more convenient here? How can I compute it incrementally?

Secondly, when you say $\Pr[\epsilon < \mathrm{dist}(x,y) - \bar{d}]$, do you mean that I should generate a random number epsilon (according to a uniform distribution, for example) and just check whether $\epsilon < d - \bar{d}$? That can't be right, because $d - \bar{d}$ is not a probability. Did you rather mean to check whether epsilon is less than $\exp\left(-\frac{(\mathrm{dist}(x,y) - \bar{d})^2}{2\sigma^2}\right)$?

Thirdly, do you mean that for the component y, each time there is a datapoint x nearest to it, I compute $p = \exp\left(-\frac{(\mathrm{dist}(x,y) - \bar{d})^2}{2\sigma^2}\right)$ and multiply it by the previous $p$ computed for this component (accumulating the product)? And if yes, how do you then choose the sigma for this component that maximizes the likelihood $p$? It's not very clear.

(Jun 05 '12 at 15:20) shn

I may be confused about what you are trying to do. To use maximum likelihood estimation, it is standard to set up a model where the classification of each point is a function of both a deterministic part and a random part. The random part is the residual (which can be calculated on training data, and which can be integrated out to get probabilities for out-of-sample predictions).

The random part guarantees that the probabilities aren't strictly 0 or 1.

It sounds like you are setting up a model where everything is deterministic. In other words, since you are using a specific known value (sigma) where I thought you meant to use a random component, every point is either "definitely assigned to component Y" or "definitely not assigned to component Y."

In most datasets, you won't be able to run maximum likelihood estimation with such a deterministic model, because there will inevitably be at least one point with a probability of 0. As a result, the likelihood equation will be 0 no matter what happens with the other points.

Unfortunately, I don't have a clear understanding of what you are doing in the bigger picture here. So it's possible we are talking past each other because we are thinking of different contexts.

(Jun 06 '12 at 00:23) dbecker

shn: you're confusing a few things. If epsilon is a random variable, the expression $P(\epsilon < n)$ is well-defined for any real number $n$, which doesn't have to be a probability. It is the probability that a randomly drawn epsilon will be smaller than that number.

(Jun 06 '12 at 01:28) Alexandre Passos ♦

@dbecker Did you mean that $\Pr[\epsilon < \mathrm{dist}(x,y) - \bar{d}]$ is the probability that x will not be assigned to component y?

(Jun 07 '12 at 07:36) shn

@shn Yes. I haven't thought too much about whether that's a great quantity to attach to components, but that is what I meant, because it's consistent with what you wrote (just using algebra to solve for epsilon in your condition for when x will not be assigned to y).

(Jun 07 '12 at 21:38) dbecker

@dbecker I think the condition I wrote ($\mathrm{dist}(x,y) > T$) is consistent with $\Pr[\epsilon < \mathrm{dist}(x,y) - T]$ (i.e. with $T$ instead of $\bar{d}$), isn't it? Moreover, since the distance can never be negative, should we instead use $\Pr[0 < \epsilon < \mathrm{dist}(x,y) - T]$? And I also don't understand why it is not just $\Pr[\epsilon < \mathrm{dist}(x,y)]$, with epsilon distributed according to $\mathcal{N}(\bar{d}, \sigma)$ or $\mathcal{N}(T, \sigma)$.

(Jun 08 '12 at 08:43) shn

@shn: I assumed you meant for the sigma to be an epsilon. Then I solved for epsilon, which produced the expression I listed. I'm not entirely sure where $\Pr[\epsilon < \mathrm{dist}(x,y) - T]$ comes from, because I just don't know where you are adding the epsilon into your original equation. If you say how you are modifying it to incorporate epsilon, then I'm sure we'll end up in the same place (since it's just algebra to solve for epsilon).

As notation goes, I prefer the distribution of epsilon to be centered at 0, so if you have a $T$ that belongs in what you are trying to do, leave it as part of the visible expression rather than incorporating it into the mean of the epsilon distribution. In terms of implementation, you'll end up calling the standard normal CDF function anyway, as in the sketch below. I'd emphasize, though: it will be more productive if you write out the condition that involves epsilon and then solve for epsilon. Right now I think we're just speculating about where epsilon might be while looking at an equation without it.
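For what it's worth, a one-function sketch of that implementation detail, with epsilon centered at 0 as suggested (the function name is illustrative):

```python
from math import erf, sqrt

def rejection_probability(d, d_bar, sigma):
    """Pr[epsilon < d - d_bar] for epsilon ~ N(0, sigma^2),
    via the standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    z = (d - d_bar) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```

At $d = T = \bar{d} + \sigma$ this gives $z = 1$ and $p = \Phi(1) \approx 0.84$, so the original hard rule $d > T$ corresponds to rejecting exactly when $p > \Phi(1)$.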

(Jun 08 '12 at 08:59) dbecker