|
Hi, I would like to ask a question about grouping data into clusters. We are trying to group the sales of our mobile phone company. When a subscriber buys airtime for their phone, we record the subscriber's latitude/longitude at the precise moment of the transaction. After mapping all the sales for a day, the result is a lot of dots, but what we want to find is the point of sale where the reseller is located, which must be wherever the dots are most concentrated. Each reseller is given a 'reseller_id' identifier, but they might use the same id in many stores, which may be close to or far away from each other.
Details on the problem:
Input: We have millions of latitude/longitude pairs (points) showing where the sales happened (over a certain period of time) across a geographical area of, say, 5,000 x 5,000 km. Each pair of coordinates is labeled with a 'reseller_id' field.
Output: We need to produce the latitude/longitude pair of each point of sale of each reseller.
Issues:
I have read the Wikipedia page about 'Cluster analysis' and it lists many methods, but I am not sure which one I should use. I think it has to be some mix of 'hierarchical clustering' and 'density-based clustering', but it seems that all of them return polygons, and we need to find just the dot of the point of sale (preferably with an added probability of how likely this dot is to be correct). I would very much appreciate any advice on how to solve our problem. Thanks |
This sounds like a constrained clustering problem (points with different reseller_id values cannot belong to the same cluster). It also looks like you would have to either use nonparametric clustering or estimate how many clusters there actually are, since that number is not known.
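One way to make this concrete is sketched below: a minimal DP-means-style pass (a k-means variant that opens a new cluster whenever a point is too far from every existing center, so the number of clusters is not fixed up front), run separately for each reseller_id so that the constraint holds by construction. This is only an illustration, not a tuned solution. The function names, the `max_radius_km` threshold, and the "share" value (the fraction of a reseller's sales in each cluster, a rough weight rather than a true probability) are assumptions made for the example. Averaging latitude/longitude directly also assumes each cluster is small and away from the poles and the antimeridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def dp_means(points, max_radius_km, n_iter=20):
    """DP-means-style clustering of a non-empty list of (lat, lon) points.

    A point farther than max_radius_km from every current center opens a
    new cluster. Returns a list of ((lat, lon), member_count) tuples.
    """
    centers = [points[0]]
    members = []
    for _ in range(n_iter):
        members = [[] for _ in centers]
        for p in points:
            dists = [haversine_km(p, c) for c in centers]
            j = min(range(len(dists)), key=dists.__getitem__)
            if dists[j] > max_radius_km:
                # Too far from everything seen so far: start a new cluster here.
                centers.append(p)
                members.append([])
                j = len(centers) - 1
            members[j].append(p)
        # Drop clusters that lost all their points, then recompute centers
        # as the plain mean of member coordinates (small-cluster assumption).
        members = [m for m in members if m]
        new_centers = [
            (sum(p[0] for p in m) / len(m), sum(p[1] for p in m) / len(m))
            for m in members
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return [(c, len(m)) for c, m in zip(centers, members)]


def estimate_points_of_sale(sales, max_radius_km=1.0):
    """sales: iterable of (reseller_id, lat, lon).

    Returns {reseller_id: [(lat, lon, share), ...]}, where share is the
    fraction of that reseller's sales assigned to the cluster.
    """
    by_reseller = defaultdict(list)
    for reseller_id, lat, lon in sales:
        by_reseller[reseller_id].append((lat, lon))
    result = {}
    for reseller_id, pts in by_reseller.items():
        clusters = dp_means(pts, max_radius_km)
        result[reseller_id] = [
            (lat, lon, n / len(pts)) for (lat, lon), n in clusters
        ]
    return result


# Made-up sample: reseller "A" sells from two distant stores, "B" from one.
sales = [
    ("A", 10.000, 20.000), ("A", 10.001, 20.001), ("A", 9.999, 19.999),
    ("A", 12.000, 22.000), ("A", 12.001, 22.001),
    ("B", 10.0005, 20.0005), ("B", 10.0000, 20.0010),
]
for reseller_id, stores in estimate_points_of_sale(sales).items():
    print(reseller_id, stores)
```

Each returned center is a single dot rather than a polygon, which is what the question asks for. With millions of points, a library implementation backed by a spatial index would likely be a better fit than this pure-Python loop, and the threshold would need tuning against stores whose locations are already known.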
Non-parametric clustering with Dirichlet processes looks very powerful; I will have to try it and see. Thank you!