I would like to implement mean-shift clustering on Hadoop. I read a presentation which states that in the map phase the algorithm "select a point as the center and compute distances from the center to all the data points, and assign a label to the points within the bandwidth" and in reducer phase "collect data points of the same label and compute the center of mass of them". However with Hadoop data is divided between nodes, so then how can a node just pick a point (to be the center), then form a window around that window, and all other nodes somehow perform the same function for that same point, and for the same window? One way this approach could work, at first iteration, is by picking well-known locations as labels at each mapper node. But then we reduce, calculate new centers, and ... ? At the next iteration can I feed the new center info as new input data? The problem this data gets divided as well, so every mapper node will not have access to the new (center) information. I hesitate using -files, or some othersimilar approach to pass this info to all nodes because it will be inefficient, it is not suggested for big files anyway (plus if I used every data point as potential seed, output would be big). Any ideas? |
Mean shift is implemented in Mahout, on top of MapReduce. Mahout apparently uses canopies as its input; the raw input data is associated with the canopies, and the canopies also represent the mean-shift window. My question was "where is the data, especially in the beginning?", and their doc says "the algorithm is initialized with a canopy containing each input point." It is unclear how this giant canopy gets divided up later among the other (newly formed) canopies. The doc also says "[each mapper] compares each canopy with each one it has already seen"; I am not sure what they mean by this.
(Mar 21 '13 at 05:32)
Stat Q
Ah, I think I understand. What the doc meant is that at first there is one shift window (canopy) per point, so 100 points means 100 canopies. The algorithm then treats shifting and merging as one process: as a canopy moves it runs into more points (data), gets shifted, and also builds up its cluster members (see the sketch below). I still don't understand the "[each mapper] compares each canopy with each one it has already seen" business, though.
(Mar 21 '13 at 06:20)
Stat Q
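If that reading is right, the process can be pictured with a small in-memory sketch (plain Java, not Hadoop or Mahout): every point seeds its own canopy, and shifting and merging happen in the same loop. The points, the bandwidth, and the merge distance below are made-up values for illustration.

```java
// "One canopy per point" illustration: each point starts as its own canopy,
// each canopy is repeatedly shifted to the mean of the points inside its
// bandwidth, and canopies that drift close together are merged, pooling
// their members. 1-D points for brevity.
import java.util.ArrayList;
import java.util.List;

public class CanopyShiftSketch {

    static class Canopy {
        double center;
        int members;   // how many original points this canopy has absorbed
        Canopy(double center) { this.center = center; this.members = 1; }
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.9, 5.0, 5.3, 4.8};
        double bandwidth = 1.0;   // window radius for the mean shift
        double mergeDist = 0.1;   // canopies closer than this are merged

        // One canopy per input point.
        List<Canopy> canopies = new ArrayList<>();
        for (double p : points) canopies.add(new Canopy(p));

        for (int iter = 0; iter < 20; iter++) {
            // Shift: move each canopy to the mean of the points in its window.
            for (Canopy c : canopies) {
                double sum = 0; int n = 0;
                for (double p : points) {
                    if (Math.abs(p - c.center) <= bandwidth) { sum += p; n++; }
                }
                if (n > 0) c.center = sum / n;
            }
            // Merge: canopies that ended up (nearly) on top of each other
            // become one canopy, combining their member counts.
            List<Canopy> merged = new ArrayList<>();
            for (Canopy c : canopies) {
                Canopy near = null;
                for (Canopy m : merged) {
                    if (Math.abs(m.center - c.center) <= mergeDist) { near = m; break; }
                }
                if (near == null) merged.add(c);
                else near.members += c.members;
            }
            canopies = merged;
        }

        for (Canopy c : canopies) {
            System.out.println("cluster center " + c.center + " with " + c.members + " points");
        }
    }
}
```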