Hi,

I would like to ask a question about clustering data. We are trying to group the sales of our mobile phone company. When a subscriber buys airtime for his or her phone, we record the subscriber's latitude/longitude at the moment of the transaction. After mapping all the sales for a day, the result is a lot of dots, but what we want to find is the point of sale where the reseller is located, which must be somewhere the dots are most concentrated. Each reseller is given a 'reseller_id' identifier, but he or she might use the same id in many stores, which may be close to or far away from each other.

Details on the problem:

Input: We have millions of latitude/longitude pairs (points) showing where the sales happened (over a certain period of time) across a geographical area of, say, 5,000 x 5,000 km. Each pair of coordinates is labeled with a 'reseller_id' field.

Output: We need to produce the latitude/longitude pair of each point of sale of each reseller.

Issues:

  1. Sometimes a subscriber asks another person to buy airtime for him, so when we locate the subscriber at the time of purchase, we might find him very far away from the point where the reseller is actually standing and making sales.
  2. Some resellers use the same reseller_id for all the stores they open, so if the stores are close enough to each other we might get incorrect results.
  3. The precision with which we locate the subscriber is not very good: it may have an error of up to 100 m (it is not GPS). However, it works equally for all subscribers.
  4. Sometimes our systems slow down (during peak hours) and the query for the subscriber's location is issued late. By then the subscriber might have moved away from the point of sale, say up to 50 meters, or more if he/she uses a car.
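
Because the error figures above are in meters (roughly 100 m of positioning error plus 50 m or more of drift) while the input is in degrees, any distance threshold needs a conversion first. The following is a minimal sketch, not part of the original question: the function names are hypothetical, the Earth is treated as a sphere, and the flat projection is only reasonable over a city-sized area, not over the whole 5,000 x 5,000 km region at once.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error for near-antipodal points.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def to_local_xy(lat, lon, lat0, lon0):
    """Equirectangular projection to (x, y) meters around a reference point.

    Assumption: the points lie close to (lat0, lon0), e.g. within one city.
    """
    x = math.radians(lon - lon0) * EARTH_RADIUS_M * math.cos(math.radians(lat0))
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y
```

With points projected this way, a radius such as 150 m can be compared directly against Euclidean distances between sales.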

I have read the Wikipedia page about 'Cluster analysis' and it lists many methods, but I am not sure which one I should use. I think it has to be some mix of 'hierarchical clustering' and 'density-based clustering', but it seems that all of them return polygons, and we need to find just the dot of the point of sale (preferably with an added probability of how likely this dot is to be correct).
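
One way to get a single dot out of a density-based method is to reduce each cluster to its centroid. The sketch below is only an illustration under stated assumptions: it is a plain O(n^2) DBSCAN over (x, y) points already converted to meters, the names `dbscan` and `cluster_centers` are hypothetical, and the `share` value is a crude support score (fraction of the sales that fall in the cluster), not a calibrated probability. For millions of points a spatial index or a library implementation would be needed.

```python
import math


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n  # None = not visited yet

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:  # j is a core point, keep growing
                queue.extend(nb)
    return labels


def cluster_centers(points, labels):
    """Centroid, size and share of all points for each non-noise cluster."""
    groups = {}
    for p, lab in zip(points, labels):
        if lab >= 0:
            groups.setdefault(lab, []).append(p)
    total = len(points)
    result = []
    for lab, pts in sorted(groups.items()):
        cx = sum(x for x, _ in pts) / len(pts)
        cy = sum(y for _, y in pts) / len(pts)
        result.append({"center": (cx, cy), "n": len(pts),
                       "share": len(pts) / total})
    return result
```

Sales made on behalf of a distant subscriber would typically end up labeled as noise and so would not pull the centroid away from the store. The values of `eps` and `min_pts` are tuning choices; an `eps` somewhat above the 100 m positioning error is one plausible starting point.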

I would very much appreciate any advice on how to solve our problem.

Thanks

asked Oct 06 '12 at 15:33

Nick2

This sounds like a constrained clustering problem (points with different reseller_ids cannot belong to the same cluster). It also looks like you'd have to use nonparametric clustering or try to figure out how many clusters there actually are, as that is not known.

(Oct 08 '12 at 03:02) Mikhail
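
The constraint that points with different reseller_ids never share a cluster can be enforced simply by partitioning the data before clustering. This is a minimal sketch, assuming each record is a hypothetical `(reseller_id, lat, lon)` tuple; the function name is also an assumption.

```python
from collections import defaultdict


def split_by_reseller(records):
    """Group (reseller_id, lat, lon) records into one point list per reseller."""
    groups = defaultdict(list)
    for reseller_id, lat, lon in records:
        groups[reseller_id].append((lat, lon))
    return dict(groups)
```

Each list can then be clustered independently, which also keeps every individual clustering problem much smaller than the full data set.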

Non-parametric clustering with Dirichlet processes looks very powerful. I will have to try it and see. Thank you!

(Oct 08 '12 at 12:37) Nick2
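
As a rough illustration of clustering without fixing the number of clusters, here is a sketch of DP-means, a hard-assignment simplification related to Dirichlet process mixtures. It is not a full Bayesian Dirichlet process model and is not something proposed in the thread. It assumes (x, y) points in meters for a single reseller; the penalty `lam` (a squared distance in square meters) is a hypothetical tuning parameter, and the result can depend on the order of the points.

```python
def dp_means(points, lam, max_iter=100):
    """DP-means: k-means-like loop that opens a new cluster whenever a point
    lies farther than sqrt(lam) from every existing center.

    Returns (centers, assignments).
    """
    n = len(points)
    # Start with a single cluster at the global mean.
    centers = [(sum(x for x, _ in points) / n, sum(y for _, y in points) / n)]
    assign = [0] * n
    for _ in range(max_iter):
        changed = False
        for i, (x, y) in enumerate(points):
            d2 = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            k = min(range(len(centers)), key=d2.__getitem__)
            if d2[k] > lam:
                centers.append((x, y))  # open a new cluster at this point
                k = len(centers) - 1
            if assign[i] != k:
                assign[i] = k
                changed = True
        # Recompute centers as member means, dropping clusters left empty.
        members = {}
        for i, k in enumerate(assign):
            members.setdefault(k, []).append(points[i])
        keep = sorted(members)
        remap = {k: new for new, k in enumerate(keep)}
        centers = [(sum(x for x, _ in members[k]) / len(members[k]),
                    sum(y for _, y in members[k]) / len(members[k]))
                   for k in keep]
        assign = [remap[k] for k in assign]
        if not changed:
            break
    return centers, assign
```

The returned centers are single points, which matches the requested output of one location per point of sale; the number of points assigned to each center gives a rough indication of how well supported it is.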

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.