1. Incremental clustering algorithms
2. Online clustering algorithms
3. Data stream clustering algorithms
4. Single pass clustering algorithms

Question (1) : Are all these expressions related ? does some of them include others ? what is the difference between them ? what are the constraints that each one should face unlike others ?

Question (2) : I'm searching for clustering algorithms where all the following constraints are considered (I don't know if such an algorithm exists actually, and I don't know to which of the above mentioned categories it belongs):

  1. We don't have initial representatives beforehand, and we can not extract them from the dataset using a seeding procedure since we don't have this dataset beforehand, because actually it's a continuously arriving and evolving data stream.
  2. The data are processed (clustered) one by one (one at a time) in the order of arriving, i.e. we can not for example wait until we get a sufficient number of data and then cluster them ...
  3. and of course processing each new data should be done incrementally (we do not redo all clustering since the beginning on all previously seen data).

asked Feb 08 at 10:03

Shna's gravatar image

Shna
284162029

edited Feb 08 at 10:04


One Answer:

Terminology is a funny thing, but from my readings I've found the following:

(1) and (2) are about the same. With either incremental or online algorithms, the full dataset is typically known and can be gone through several times, however the algorithm typically does some incremental update observing each data point. As an example, an online k-means implementation can reassign a data point to a new cluster and update the k-means centroid immediately, rather than determine all data point assignments and update the cluster after evaluating all assignments.

Similarly, (3) and (4) are about the same. Both labelings assume that the data can be evaluated once and only once. But there are subtle differences. Single pass algorithms assume that the size of the dataset is known in advance and can use this knowledge while clustering, however Data stream algorithms may or may not make this same assumption.

What you want sounds a lot like Data stream clustering or single pass clustering. Depending on the size of your dataset I would recommend looking into Fast and Accurate k-means for Large Datasets. Even if it doesn't fit your exact requirements, it should be a good place to start as it's new and fast. Their model assumes that you, roughly, know the size of the dataset and can then proceed to make a single pass through the data. It can automatically determine the number of clusters and updates each cluster incrementally. I would note that the clustering assignments can dramatically change overtime as it merges clusters when the system finds too many clusters. It also isn't too hard to implement.

Lastly, I would consider looking into particle filters. I don't have too much experience with them, but my understanding is that they largely try address datasets like the one you describe.

answered Feb 08 at 12:06

Keith%20Stevens's gravatar image

Keith Stevens
4363820

Great explanation, thanks. I will read the paper you cited, it seems interesting; can I ask you if I don't understand some points about that algorithm since you already know it ? Also can you point me to some papers about particle filters for this context ?

(Feb 08 at 14:55) Shna
1

Feel free to ask about the K-Means paper. I'll see if i can drum up some good particle filter papers for this idea. It may take a day or so.

(Feb 08 at 18:46) Keith Stevens

@KeithStevens I've some questions about the kmeans paper, I asked here: http://metaoptimize.com/qa/questions/9043/streaming-k-means-for-large-datasets

(Feb 10 at 07:44) Shna

And i've given an answer :)

(Feb 10 at 12:37) Keith Stevens

@KeithStevens thank you. By the way, do you have some good orientations about how to use particle filter for this idea ?

(Feb 13 at 07:34) Shna
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.