Hi there, I'm an MSc student and I'm working on implementing the stochastic gradient descent algorithm for recommender systems (the well-known Funk version) using sparse matrices with Scipy. This is what a first basic implementation looks like:
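(The original snippet isn't shown; the following is a minimal sketch of the kind of implementation described, with placeholder names `R`, `P`, `Q`, `k`, `alpha` and `beta`.)

```
import numpy as np
from scipy.sparse import coo_matrix

def funk_sgd(R, k=20, alpha=0.01, beta=0.02, n_iter=10):
    """Plain Funk-style SGD: one update per observed rating."""
    R = coo_matrix(R)                              # easy access to (row, col, value) triples
    P = 0.1 * np.random.rand(R.shape[0], k)        # user factors
    Q = 0.1 * np.random.rand(R.shape[1], k)        # item factors
    for _ in range(n_iter):
        for u, i, r in zip(R.row, R.col, R.data):  # explicit loop over every stored rating
            e = r - P[u].dot(Q[i])                 # prediction error for this single rating
            for f in range(k):                     # inner loop over the factors
                puf = P[u, f]
                P[u, f] += alpha * (e * Q[i, f] - beta * puf)
                Q[i, f] += alpha * (e * puf - beta * Q[i, f])
    return P, Q
```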
I have two questions, and I am hoping some seasoned recommender systems experts here can give me some insight:

1) How often should the error e be adjusted? It is not clear from Funk's and Koren's papers whether it should be once per rating or once per factor. That is, should I take it out of the for loop, leave it where it is, or push it even deeper (inside the loop over the k factors)?

2) Unfortunately, my code is pretty slow, taking about 20 minutes for ONE iteration over a 100k-ratings dataset. This means that computing the factorization over the whole dataset takes about 10 hours. I suspect this is due to the for loop over the sparse matrix. I've tried expressing the q and p updates using fancy indexing, but since I'm still pretty new to scipy and numpy, I couldn't figure out a better way to do it. Do you have any pointers on how I could avoid iterating over the rows and columns of the sparse matrix explicitly?
Some references to the papers you followed are often helpful - I knew this method, but it took me some time to recognize it ;) Both of your questions boil down to understanding the difference between gradient descent and stochastic gradient descent. (Ordinary) gradient descent calculates the error between the model and all of the real data. Based on this error, it updates the parameters in the direction that minimizes the error; that is, on each iteration it uses the entire dataset. Stochastic (or "on-line") gradient descent works in a similar way, but updates the parameters based on small batches of data; that is, it splits the entire dataset into pieces and treats each piece as a separate dataset. What you have implemented is stochastic gradient descent with a batch size of 1. A batch size of 1 makes your code slow, because you recalculate the error for every single datum you get (you can profile your script, but I bet most of the time goes into that error computation).
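(The code that went with this explanation isn't shown; below is a rough sketch of one full-batch gradient descent step, assuming a dense rating matrix `R`, a 0/1 `mask` of observed entries, and factor matrices `P` and `Q` as above.)

```
import numpy as np

def gd_step(R, mask, P, Q, alpha=0.001, beta=0.02):
    """One full-batch step: the error is computed once over ALL observed
    ratings, then every user and item factor is updated from it."""
    E = mask * (R - P.dot(Q.T))              # error matrix, computed once per step
    P_new = P + alpha * (E.dot(Q) - beta * P)
    Q_new = Q + alpha * (E.T.dot(P) - beta * Q)
    return P_new, Q_new
```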
This way you calculate the error only once per step. However, I believe you want stochastic GD and still need reasonable performance. I have good news for you: it's pretty simple. All you need to do is increase the batch size. Something like this:
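(Again the original snippet isn't shown; this is a sketch of what mini-batch updates might look like, with the batch size, the shuffling, and the regularization term all being assumptions.)

```
import numpy as np
from scipy.sparse import coo_matrix

def minibatch_sgd(R, P, Q, alpha=0.01, beta=0.02, batch_size=1000, n_epochs=10):
    """Mini-batch SGD: the errors for a whole batch of ratings are computed in
    one vectorized step, then the affected rows of P and Q are updated."""
    R = coo_matrix(R)
    rows, cols, vals = R.row, R.col, R.data
    n = len(vals)
    for _ in range(n_epochs):
        order = np.random.permutation(n)                 # shuffle the ratings each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            u, i, r = rows[idx], cols[idx], vals[idx]
            e = r - np.einsum('ij,ij->i', P[u], Q[i])    # errors for the whole batch at once
            dP = alpha * (e[:, None] * Q[i] - beta * P[u])
            dQ = alpha * (e[:, None] * P[u] - beta * Q[i])
            np.add.at(P, u, dP)                          # accumulate; handles repeated users in a batch
            np.add.at(Q, i, dQ)
    return P, Q
```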
Probably you will want to restructure your code somewhat, but I hope the idea itself is clear now. You may be wondering: if stochastic gradient descent is so much slower, why would anyone ever prefer it over standard gradient descent? One major reason is that it lets you keep learning from an endless stream of user ratings. For example, if you were building a reputation engine for a real web site, you wouldn't want to recalculate everything each time you got a new rating from a user. Stochastic gradient descent (also called "on-line" GD) allows you to add new data incrementally.

Thank you so much! You've clarified this a lot for me. Sorry for not mentioning the papers: they are Koren's "Factorization meets the neighborhood" and Funk's blog post (http://sifter.org/~simon/journal/20061211.html). I've also followed your suggestion to optimize the error function computation: I've taken the whole-matrix calculation out and now compute the error only from the relevant row and column factor vectors (sketched below). This takes me down to a whopping 10 seconds per iteration. It would be great if I could use GD directly, since I can express it completely with matrix operations, but I'm sticking with SGD because I want to compare this method against some other methods once I have it running well. Thanks a lot once again!
(Nov 20 '13 at 12:26)
Anamaria Todor
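(For other readers: a sketch of what the error-computation change described in the comment above might look like; the function names are just for illustration.)

```
import numpy as np

# Slow: builds the full m x n prediction matrix just to read one entry.
def error_full(R, P, Q, u, i):
    pred = P.dot(Q.T)                     # O(m*n*k) work and memory on every call
    return R[u, i] - pred[u, i]

# Fast: only the dot product of the two relevant factor vectors is needed.
def error_single(R, P, Q, u, i):
    return R[u, i] - P[u].dot(Q[i])       # O(k) work
```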
Sounds like you're taking the recommender systems Coursera course. I'm taking the Machine Learning Coursera course, and what I've learned so far about gradient descent is that you need to make sure your learning rate is not too high. If it is, it may never converge to a global optimum (i.e. the lowest point across the whole function), or if it does, it converges terribly slowly; a tiny illustration is sketched below. Did you try experimenting with alpha? I bet there's a way to express your factorization without loops (I know you can do this in Octave), but I can't help there :(

Yes, I'm taking the course. However, I need this cleared up for a personal project. From what I know, the final assignment in the Coursera RS course is about classic SVD, not SGD.
(Nov 20 '13 at 11:50)
Anamaria Todor
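(A tiny illustration of the learning-rate point above, on a one-dimensional toy problem rather than the factorization itself.)

```
# Gradient descent on f(x) = x**2: a small alpha converges, a large one diverges.
def descend(alpha, steps=20):
    x = 1.0
    for _ in range(steps):
        x -= alpha * 2 * x          # the gradient of x**2 is 2x
    return x

print(descend(0.1))   # -> close to 0, converges
print(descend(1.1))   # -> blows up, each step overshoots further
```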