I am trying to implement a product recommender using an item-based collaborative filtering approach. I have done this successfully before, but now I am dealing with a much larger dataset: roughly 1 million items rated by roughly 200,000 users. Item similarities are computed (for now) as the Pearson correlation of the ratings users have given each pair of items. I am in the process of building the item similarity matrix, but as currently implemented, the program will take months to complete. My question is: with a dataset of this size, should I try to optimize the algorithm, or just port it over to a MapReduce architecture?
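
Roughly, each pair-wise computation looks something like this (a simplified sketch, not my actual code; item_ratings here stands for a dict mapping each item to a {user: rating} dict):

    from math import sqrt

    def pearson(item_ratings, i1, i2):
        # Pearson correlation of items i1 and i2 over the users who rated both.
        r1, r2 = item_ratings[i1], item_ratings[i2]
        common = set(r1) & set(r2)   # users who rated both items
        n = len(common)
        if n < 2:
            return 0.0
        x = [r1[u] for u in common]
        y = [r2[u] for u in common]
        mx, my = sum(x) / float(n), sum(y) / float(n)
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
        return num / den if den else 0.0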

Edit: I guess I am looking more for a rule of thumb. At what point is a dataset too large to feasibly compute on a single machine, or is this highly dependent on the problem being solved? I would like to compute the item-item similarity matrix on one machine if it can be done within a couple of hours. Otherwise, I will migrate to Amazon EC2.

asked Jun 27 '11 at 13:06

arasraj

edited Jun 28 '11 at 11:09


2 Answers:

1 million ratings is definitely doable on a single machine in a couple hours. I've done something on a similar scale without trying particularly hard to optimize. How are you implementing the algorithm?

I don't have a good rule of thumb for when to move to a MapReduce architecture, but just as a small comparison: in the Netflix prize, the training dataset consisted of 100 million ratings (about 500k users on 18k movies), and I believe most of the approaches were non-distributed (though a lot of optimizations probably went into the code).

In case you want to port to MapReduce anyways, though, there's a nice tutorial (with code) on computing item-item similarities on Amazon's Elastic MapReduce infrastructure here: http://aws.amazon.com/articles/2294

Answer to comment:

It's likely that you don't need to perform all n(n-1)/2 calculations, since not every pair of items has users in common. So it should help a lot if you do something like the following instead:

for each item i1:
    for each customer c who rated i1:
        for each other item i2 > i1 that c also rated:
            record that customer c rated both i1 and i2
    for each item that we just found (i.e., each i2 > i1 that was rated together with i1 by some user):
        compute the similarity between i1 and i2
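
In Python, a rough sketch of that idea might look like the following (untested; item_ratings is assumed to be a dict of item -> {user: rating}, and similarity is whatever pair-wise measure you use, e.g. Pearson over the common users):

    from collections import defaultdict

    def item_similarities(item_ratings, similarity):
        # Build the inverse index: user -> set of items that user rated.
        user_items = defaultdict(set)
        for item, ratings in item_ratings.items():
            for user in ratings:
                user_items[user].add(item)

        sims = {}
        for i1, ratings in item_ratings.items():
            # Collect every i2 > i1 that shares at least one user with i1.
            co_rated = set()
            for user in ratings:
                for i2 in user_items[user]:
                    if i2 > i1:   # symmetry: handle each pair only once
                        co_rated.add(i2)
            # Only these pairs can have any common raters at all.
            for i2 in co_rated:
                sims[(i1, i2)] = similarity(item_ratings, i1, i2)
        return sims

Pairs of items with no raters in common are never touched, which is where the savings over the full n(n-1)/2 loop come from.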

Also, if you're curious about optimizations in general (though not necessarily applicable to Python), there's probably a lot of good information to be found if you Google for something like "netflix prize correlations". For example, I found a nice description of computing 316 million movie correlations in 2 minutes (!) in C here.

answered Jun 28 '11 at 14:46

grautur

edited Jun 28 '11 at 16:09

I am currently using Python for my implementation. I have an index (a dictionary) of items -> user ratings. I am looping over this index and creating an item-item similarity matrix, which requires n(n-1)/2 calculations when taking advantage of the symmetry of the matrix. For the most part this matrix is extremely sparse, so I am using scipy's sparse matrices to help with this. Is there a better way to calculate this matrix? Let me know if you need more clarification.

(Jun 28 '11 at 15:28) arasraj

@Raj: You probably don't actually need to perform all n(n-1)/2 calculations, since a lot of pairs of items don't have users in common. See the update to my answer for some pseudocode that should help.

(Jun 28 '11 at 16:10) grautur

Since there will be a lot of independent calculations, your problem should be easy to parallelize with MapReduce. Just Google "collaborative filtering map-reduce" and you'll find plenty of examples. However, don't expect such an implementation to give you results online (i.e., in real time).
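
As a rough illustration of how the work splits up (no particular framework assumed; the function names and the my_correlation helper are just placeholders for this sketch): one pass groups ratings by user, a second pass emits every item pair that user co-rated together with the two ratings, and the reducer for each item pair then computes its similarity independently of every other pair.

    from itertools import combinations

    def map_by_user(records):
        # records: iterable of (user, item, rating); shuffle/group on the user key.
        for user, item, rating in records:
            yield user, (item, rating)

    def map_item_pairs(user, rated):
        # rated: list of (item, rating) for one user; emit each co-rated pair once.
        for (i1, r1), (i2, r2) in combinations(sorted(rated), 2):
            yield (i1, i2), (r1, r2)

    def reduce_pair(pair, rating_pairs):
        # All rating pairs for one (i1, i2) arrive here; compute the similarity
        # with any pair-wise measure (Pearson, cosine, ...), independently of other pairs.
        xs = [x for x, _ in rating_pairs]
        ys = [y for _, y in rating_pairs]
        return pair, my_correlation(xs, ys)   # my_correlation is a placeholder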

answered Jun 27 '11 at 19:03

Sergey Bartunov
