Hi there, I'm an MSc student and I'm working on implementing the stochastic gradient descent algorithm for recommender systems (the well-known Funk version) using sparse matrices with Scipy.

This is what a first basic implementation looks like:

```
N = self.model.shape[0]  # number of users
M = self.model.shape[1]  # number of items
self.p = np.random.rand(N, K)
self.q = np.random.rand(M, K)
rows, cols = self.model.nonzero()
for step in xrange(steps):
    for u, i in zip(rows, cols):
        e = self.model - np.dot(self.p, self.q.T)  # calculate error for gradient
        p_temp = learning_rate * (e[u, i] * self.q[i, :] - regularization * self.p[u, :])
        self.q[i, :] += learning_rate * (e[u, i] * self.p[u, :] - regularization * self.q[i, :])
        self.p[u, :] += p_temp

```

I have two questions, and I'm hoping some seasoned recommender systems experts here can provide me with some insight:

1) How often should the error e be recomputed? It is not clear from Funk's and Koren's papers whether it should be once per rating or once per factor. That is, should I take it out of the inner loop, leave it where it is, or push it even deeper (inside an iteration over the k factors)?

2) Unfortunately, my code is pretty slow, taking about 20 minutes for ONE iteration over a 100k-ratings dataset. This means that computing the factorization over the whole dataset takes about 10 hours. I suspect this is due to the loop over the sparse matrix. I've tried expressing the q and p updates using fancy indexing, but since I'm still pretty new to scipy and numpy, I couldn't figure out a better way to do it. Do you have any pointers on how I could avoid iterating over the rows and columns of the sparse matrix explicitly?

asked Nov 19 '13 at 17:00

Anamaria Todor

edited Nov 19 '13 at 17:01


2 Answers:

References to the papers you followed are always helpful - I knew this method, but it took me some time to recognize it ;)

First of all, let's rename self.model to self.data, since it actually represents the original data you've got, not the model you are trying to build. If you really did mean the model itself, you probably have an error in your code.

Both of your questions boil down to understanding gradient descent and stochastic gradient descent. (Ordinary) gradient descent calculates the error between the model and all of the real data, and based on this error updates the parameters in the direction that minimizes it. That is, on each iteration it uses the entire dataset.

Stochastic (or "on-line") gradient descent works in a similar way, but updates the parameters based on small batches of data. That is, it splits the entire dataset into pieces and treats each piece as a separate dataset.

What you have implemented is stochastic gradient descent with a batch size of 1. A batch size of 1 makes your code slow, because you recalculate the error for every datum you process (you can profile your script, but I bet that e = self.data - np.dot(self.p, self.q.T) will be the slowest part, since it calculates the error over the whole dataset). You can convert your code into plain gradient descent by rearranging a few lines:

```
for step in xrange(steps):
    e = self.data - np.dot(self.p, self.q.T)
    for u, i in zip(rows, cols):
        # rest of nested loop
```

This way you calculate the error only once per step. However, I believe you want stochastic GD and still need reasonable performance. I have good news for you: it's pretty simple. All you need to do is increase the batch size. Something like this:

```
for step in xrange(steps):
    for batch in self.data.split(100):    # this won't compile, but you've got the idea
        e = self.data - np.dot(self.p, self.q.T)
        for u, i in batch:
            # rest of nested loop
```

You would probably want to restructure your code somewhat, but I hope the idea itself is clear now.
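A runnable version of that mini-batch idea, as a hedged sketch: it uses plain NumPy arrays and np.array_split for the batching, with a hypothetical tiny ratings matrix instead of your class (all names here are illustrative, not your actual attributes):

```
import numpy as np

np.random.seed(0)
K = 5                                  # number of latent factors
data = np.zeros((4, 6))                # hypothetical tiny ratings matrix; 0 = missing
data[0, 1], data[1, 4], data[2, 3], data[3, 0] = 4.0, 3.0, 5.0, 2.0
p = np.random.rand(data.shape[0], K)   # user factors
q = np.random.rand(data.shape[1], K)   # item factors
rows, cols = data.nonzero()
pairs = np.column_stack((rows, cols))  # (user, item) index pairs of observed ratings
learning_rate, regularization = 0.01, 0.1
steps, batch_size = 10, 2

err0 = np.square((data - p.dot(q.T))[rows, cols]).sum()  # training error before
for step in range(steps):
    for batch in np.array_split(pairs, max(1, len(pairs) // batch_size)):
        e = data - p.dot(q.T)          # recompute the error once per batch, not per rating
        for u, i in batch:
            p_temp = learning_rate * (e[u, i] * q[i] - regularization * p[u])
            q[i] += learning_rate * (e[u, i] * p[u] - regularization * q[i])
            p[u] += p_temp
err1 = np.square((data - p.dot(q.T))[rows, cols]).sum()  # training error after
```

The batch size trades off the two regimes: one batch per rating is the original (slow) SGD, a single batch covering everything is plain GD.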

You may be wondering: if stochastic gradient descent is so slow, why would anyone prefer it over standard gradient descent? One major reason is that it lets you learn from an endless stream of user ratings. For example, if you were building a recommendation engine for a real website, you wouldn't want to recalculate everything each time you get a new rating from a user. Stochastic gradient descent (also called "on-line" GD) allows you to add new data incrementally.
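As a side note on the second question: for true per-rating SGD you never need the full error matrix at all, since the residual of a single rating can be computed from the two factor vectors in O(K). A minimal sketch under the same hedged assumptions (plain NumPy, hypothetical toy ratings, not the original class):

```
import numpy as np

np.random.seed(0)
K = 5
ratings = {(0, 1): 4.0, (2, 3): 5.0, (1, 4): 3.0}  # hypothetical {(user, item): rating}
p = np.random.rand(3, K)               # user factors
q = np.random.rand(5, K)               # item factors
learning_rate, regularization = 0.01, 0.1

for (u, i), r in ratings.items():
    e_ui = r - p[u].dot(q[i])          # residual for this one rating: O(K), no full matrix
    p_temp = learning_rate * (e_ui * q[i] - regularization * p[u])
    q[i] += learning_rate * (e_ui * p[u] - regularization * q[i])
    p[u] += p_temp
```

This replaces the O(N·M·K) full-matrix product per update with an O(K) dot product, which is where most of the speedup comes from.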

answered Nov 20 '13 at 11:18

ffriend

edited Dec 12 '13 at 02:37

Thank you so much! You've clarified this a lot to me.

Sorry for not mentioning the papers. It's Koren's "Factorization meets the neighborhood" and Funk (http://sifter.org/~simon/journal/20061211.html).

I've also followed your suggestion and optimized the error computation: I've taken out the whole-matrix calculation and now compute the error only from the relevant row and column factor vectors. This takes me down to a whopping 10 seconds per iteration.

It would be great if I could use GD directly, as I can express it completely using matrix operations, but I'm sticking with SGD because I want to compare this method with some other methods once I get it running well. Thanks a lot once again!

(Nov 20 '13 at 12:26) Anamaria Todor

Sounds like you're taking the Coursera recommender systems course. I'm taking the Coursera Machine Learning course, and what I've learned so far about gradient descent is that you need to make sure your learning rate is not too high. If it is, it may never converge to a global optimum (i.e., the lowest value across the whole function), or if it does, it's terribly slow. Did you try experimenting with alpha?

I bet there's a way to express your factorization without loops (I know you can do this in Octave), but I can't help you there :(
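To illustrate the learning-rate point, here is a hedged sketch on a hypothetical 1-D quadratic (not the factorization itself): gradient descent on f(x) = x² converges when the step size is small and diverges when it is too large.

```
import numpy as np

def descend(lr, steps=50, x0=5.0):
    """Minimize f(x) = x**2 by gradient descent; the gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

small = descend(0.1)   # each step multiplies x by 0.8, so it shrinks toward 0
large = descend(1.1)   # each step multiplies x by -1.2, so it oscillates and blows up
```

The same overshoot-and-diverge behavior shows up in the matrix factorization if learning_rate is set too high.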

answered Nov 19 '13 at 22:41

Mike

edited Nov 19 '13 at 22:41

Yes, I'm taking the course. However, I need this cleared up for a personal project. From what I know, the final assignment in the Coursera RS course is about classic SVD, not SGD.

(Nov 20 '13 at 11:50) Anamaria Todor

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.