I am trying to implement a product recommender using an item-based collaborative filtering approach. I have successfully accomplished this before, but now I have the issue of a much larger dataset: I currently have ~1 million items rated by ~200,000 users. Item similarities are computed using Pearson similarity (for now) of the item ratings given by users. I am in the process of creating an item similarity matrix; however, as it is implemented now, the program will take months to complete. My question is: with a dataset of this size, should I try to optimize the algorithm, or just port it over to a MapReduce architecture?

Edit: I guess I am looking more for a rule of thumb. At what point does a dataset become too large to feasibly compute on a single machine, or is this highly dependent on the problem being solved? I would like to compute the item-item similarity matrix on one machine if it can be done in a couple of hours or less. Otherwise, I will migrate to Amazon EC2.
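For reference, the item-item Pearson similarity being discussed here is taken over the set U_ij of users who rated both items i and j (conventions differ on whether the means are over all of an item's ratings or only the co-rated ones):

```latex
\mathrm{sim}(i,j) =
  \frac{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_i)\,(r_{u,j} - \bar{r}_j)}
       {\sqrt{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_i)^2}\;
        \sqrt{\sum_{u \in U_{ij}} (r_{u,j} - \bar{r}_j)^2}}
```

The sums run only over common raters, which is why item pairs with no users in common contribute nothing; that observation is what the answer below exploits.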
1 million ratings is definitely doable on a single machine in a couple of hours. I've done something on a similar scale without trying particularly hard to optimize. How are you implementing the algorithm? I don't have a good rule of thumb for when to move to a MapReduce architecture, but just as a small comparison: in the Netflix Prize, the training dataset consisted of 100 million ratings (about 500k users on 18k movies), and I believe most of the approaches were non-distributed (though a lot of optimizations probably went into the code).

In case you want to port to MapReduce anyway, there's a nice tutorial (with code) on computing item-item similarities on Amazon's Elastic MapReduce infrastructure here: http://aws.amazon.com/articles/2294

Answer to comment: It's likely that you don't need to perform all n(n-1)/2 calculations, since not every pair of items has users in common. So it should help a lot if you do something like the following instead:
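(The original pseudocode is not preserved here; what follows is a minimal sketch of that idea in Python, the asker's language. It assumes a user -> {item: rating} dict, i.e. the transpose of the item -> user-ratings index mentioned in the comments, plus an item_index dict fixing matrix positions; both names are illustrative.)

```python
from collections import defaultdict
from itertools import combinations
import numpy as np
from scipy.sparse import dok_matrix

def item_similarities(user_ratings, item_index):
    """user_ratings: {user: {item: rating}} -- the transpose of an
    item -> user-ratings index.  item_index: {item: row/column position}.
    Item pairs with no users in common are never touched."""
    # Pass 1: loop over users and collect co-ratings per item pair, so only
    # pairs that actually co-occur in someone's ratings ever show up.
    pair_ratings = defaultdict(list)
    for ratings in user_ratings.values():
        items = sorted(ratings, key=item_index.get)
        for a, b in combinations(items, 2):
            pair_ratings[(a, b)].append((ratings[a], ratings[b]))

    # Pass 2: Pearson correlation per co-rated pair, stored sparsely
    # (upper triangle only, since the matrix is symmetric).
    n = len(item_index)
    sims = dok_matrix((n, n))
    for (a, b), co in pair_ratings.items():
        if len(co) < 2:
            continue  # correlation needs at least two common raters
        x, y = np.array(co, dtype=float).T
        if x.std() == 0 or y.std() == 0:
            continue  # constant ratings: Pearson is undefined
        sims[item_index[a], item_index[b]] = float(np.corrcoef(x, y)[0, 1])
    return sims
```

In practice you would accumulate running sums (count, sum x, sum y, sum xy, sum x^2, sum y^2) per pair rather than whole rating lists to keep memory down; the point is that the outer loop is driven by users, so item pairs with no raters in common never appear at all.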
Also, if you're curious about optimizations in general (though not necessarily applicable to Python), there's probably a lot of good information to be found if you Google for something like "netflix prize correlations". For example, I found a nice description of computing 316 million movie correlations in 2 minutes (!) in C here.

I am currently using Python for my implementation. I have an index (which is a dictionary) of items -> user ratings. I am looping over this index and creating an item-item similarity matrix, which requires n(n-1)/2 calculations when taking advantage of the symmetry of the matrix. For the most part this matrix is extremely sparse; I am using scipy's sparse matrices to help with this. Is there a better way to calculate this matrix? Let me know if you need more clarification.
(Jun 28 '11 at 15:28)
arasraj
@Raj: You probably don't actually need to perform all n(n-1)/2 calculations, since a lot of pairs of items don't have users in common. See the update to my answer for some pseudocode that should help.
(Jun 28 '11 at 16:10)
grautur
Since there will be a lot of independent calculations, your problem should be easily parallelized with MapReduce. Just Google "collaborative filtering map-reduce" and you'll get a lot of examples. However, don't expect such an implementation to give you results online.
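As a rough, framework-agnostic sketch of how that parallelism falls out (function names and the per-user input format are illustrative, not taken from any particular tutorial): the map step emits one record per item pair a user co-rated, the shuffle groups all co-ratings of a pair together, and the reduce step turns them into a Pearson similarity.

```python
from itertools import combinations
import numpy as np

# Map step: input is one user's ratings as {item: rating}; emit a record per
# co-rated item pair so the shuffle groups all of a pair's co-ratings together.
def map_user_ratings(ratings):
    for a, b in combinations(sorted(ratings), 2):
        yield (a, b), (ratings[a], ratings[b])

# Reduce step: all co-ratings for one item pair arrive together; the pair's
# similarity is the Pearson correlation of the two rating columns.
def reduce_item_pair(pair, co_ratings):
    x, y = np.array(list(co_ratings), dtype=float).T
    if len(x) < 2 or x.std() == 0 or y.std() == 0:
        return pair, None  # too few common raters or constant ratings
    return pair, float(np.corrcoef(x, y)[0, 1])
```

With Hadoop Streaming or a library like mrjob, these two functions correspond to the mapper and reducer bodies; each reduce call is independent of the others, which is why the computation distributes so cleanly.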