Generally, in the context of clustering data streams (where not all the data is available at the same time), the data points are processed one by one in an incremental way, i.e. apart from the cluster representatives, only the newly arriving data point is allowed to fit in memory and be processed. Now, suppose a configuration where we have a data stream and only a fixed number N > 1 of data points is allowed to fit in memory (i.e. N = the newly arriving data point + the N-1 data points previously processed). Do you have any idea how we can exploit these data points to improve the clustering? It is essentially a sliding window over the dataset.
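To make the setting concrete, here is a minimal sketch of what I mean (the `clusterer.partial_update` method is purely hypothetical, standing in for whatever incremental clustering step is used):

```python
from collections import deque

import numpy as np

# Hypothetical sketch of the sliding-window setting: memory holds the
# cluster representatives plus only the last N data points.
N = 100
window = deque(maxlen=N)  # appending when full evicts the oldest point

def process_stream(stream, clusterer):
    for x in stream:
        window.append(x)
        # `partial_update` is a placeholder for any incremental clustering
        # step that is allowed to look at the whole buffered window.
        clusterer.partial_update(np.asarray(window))
```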
A lot of streaming algorithms work by building coresets, which are essentially weighted, constant-sized sets of points that are representative of the overall stream of data in a precise way. For an example of this approach to clustering, see StreamKM++, by Ackermann et al. I find that the coreset perspective turns out to be very useful when reasoning about streams. So in the scenario you propose, given a memory budget, there is clearly a choice of whether to spend a given portion of it on the last N data points or on N points added to the coreset. As adding points to the coreset is guaranteed to lower the expected error, you are better off carefully choosing which points to save than keeping a list of the N most recent ones. If keeping this list were completely free, however, you could probably enrich your approximations by looking at things like minibatches instead of single points. Unfortunately, as far as I know, it turns out to be hard to prove rigorous streaming bounds on methods that use minibatches. For an example of a minibatch approach to k-means clustering, see Web-Scale K-Means Clustering, by D. Sculley (a minimal sketch of that update appears after the comment thread below).

When I read how the coreset is constructed in StreamKM++, I see that they use the k-means++ seeding procedure on the original dataset to get a set S = {q1, q2, ..., qm} of m points, where each point qi is weighted by the number of points from the original dataset that are closest to it. I haven't read the whole paper yet, but the problem is that we don't have the original dataset: the data points arrive one by one (a data stream), so they are not all available at the same time for extracting the coreset! Any explanation? Maybe I'll find the answer in the rest of the paper.
(Jan 20 '12 at 09:58)
shn
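For reference, the batch construction described in the comment above could look roughly like the following sketch (an illustration only, not the paper's code; the names `kmeanspp_seed` and `weighted_coreset` are made up, and StreamKM++'s point is precisely to avoid needing the full dataset at once):

```python
import numpy as np

def kmeanspp_seed(X, m, rng):
    """Standard k-means++ seeding: each new seed is drawn with probability
    proportional to its squared distance to the nearest seed chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(m - 1):
        d2 = np.min(((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)

def weighted_coreset(X, m, seed=0):
    """The construction described above: seeds S = {q1, ..., qm}, each
    weighted by the number of original points in its Voronoi cell."""
    rng = np.random.default_rng(seed)
    S = kmeanspp_seed(X, m, rng)
    # assign every original point to its nearest seed and count cell sizes
    assign = np.argmin(((X[:, None, :] - S[None, :, :]) ** 2).sum(-1), axis=1)
    return S, np.bincount(assign, minlength=m)
```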
You will find the answer in the rest of the paper. The k-means++ procedure indeed needs to see a significant number of points at once to work effectively. This paper provides a streaming approximation to that.
(Jan 20 '12 at 10:06)
Alexandre Passos ♦
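As for the minibatch idea mentioned in the answer above, here is a minimal sketch in the spirit of Sculley's update rule, with initialization, convergence checks, and the paper's sparsity tricks omitted (the function name `minibatch_kmeans` is made up):

```python
import numpy as np

def minibatch_kmeans(batches, k, d, seed=0):
    """Sketch of a Sculley-style mini-batch k-means update: each batch is
    assigned using the current centers, then each center is nudged toward
    its points with a per-center learning rate of 1 / (points seen so far)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(k, d))  # naive init; k-means++ would be better
    counts = np.zeros(k)
    for B in batches:  # each B is an array of shape (batch_size, d)
        # cache the nearest center for every point in the batch
        assign = np.argmin(((B[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for x, c in zip(B, assign):
            counts[c] += 1
            eta = 1.0 / counts[c]  # per-center learning rate
            centers[c] = (1.0 - eta) * centers[c] + eta * x
    return centers
```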
The algorithm you propose is, in fact, the algorithm proposed in the following work: Efficient Clustering of Web-Derived Data Sets (Sarmento et al., 2009). I outline the algorithm in the answer to Large-scale clustering. The author told me that they have a slightly updated two-pass algorithm which works even better than the results given in the paper. If you want more details, I can dig up his email.