|
Hello, I'm trying to build a collaborative filtering recommendation system for books implemented as a multilayer perceptron with back-propagation. I've got most of the pieces in place, even tested it for a few values and it seems fine. I've got one hidden layer, with books on the input layer and books on the output layer. I have a bunch of data about users' preference for books (lists of user rated books), but I can't figure out the best way to use this data to train the network. The only thing I could come up was this:
But that doesn't produce uniform results since it is influenced by the order the books are iterated through which is arbitrary. Thank you. |
|
There's a lot of research on recommendation algorithms. Have you read any papers from the winners of the netflix prize contest? Basically, a conclusion you can use to help yourself is that a better baseline predictor can perform better than a neural network in many circumstances. There seems to be some methodological issues with your setup, howver. Are you using a separate validation set? How are you representing books? It seems you're using binary vectors, which will keep your system from generalizing well. And last, why don't you use this loop you showed to generate the training set and then train a neural network until convergence (using stochastic gradient descent, as you're doing, or any other algorithm, really)? This should improve you performance perceptibly. |
|
Even better than simple SVD approaches, try using this latent factor approach: Menon + Elkan (2010). It is also susceptible to stochastic gradient descent training and is pretty simple to implement. Btw... you mention that you have a problem with algorithms that produce different results depending on order of presentation of training data. The essence of stochastic gradient descent is the idea that you are not so much passing over a finite training set as you are iterating through an infinite random sequence of training examples. Thus, the metric of convergence should be in terms of training examples, not passes over your data. When you have a finite sample of this infinite random sequence, then you must approximate that infinite sequence by random sampling (with replacement, preferably) from your finite sample. The idea that random sampling from a finite set approximates the random sampling from the original distribution the key idea underneath bootstrapping. In that framework, it isn't surprising if your algorithms produce somewhat different results depending on the order and you control for that by presenting data in random order. It isn't usually necessary to completely permute your data even if that is a nice ideal. In practice, keeping a finite window of samples and using that to rearrange training examples suffices. There should be an implementation of this in Mahout before long. |
|
Hello, I know almost nothing about neural nets, but I have some experience with collaborative filtering. Sometimes it is a good idea to use recently rated items in the end of learning process. This will ensure that the most recent user's interests are reflected better than earlier interests. So, the system would recommend items “to read next” better. But this holds only if you can sort items in such a way that recently rated items are in the tail of the list. |
|
Let's say you have training examples of the form (user, book) => score. Then, you have a k-dimensional vector that you associate with each user, and one for each book. (k=50 maybe) You sample a random (user, book) tuple. You put both k-dim vectors into the NN as input. You train it to predict the score. You update the weights in the net as well as the k-dim representations for the user and the book. Note that you want the learning rate to be far smaller (e.g. 100-1000x) for the NN weights than for the representations, because the weights are being updated on each step, but a particular representation is updated only every time it is sampled. Except for the NN part, this is very close to what is in the Menon and Elkan paper that I mentioned a few comments ago.
(Sep 24 '10 at 12:53)
Ted Dunning
Thank you for the reference, I was not familiar with that work.
(Sep 24 '10 at 15:37)
Joseph Turian ♦♦
1
You don't even need to have scores in your training set. You can train this purely from observed (user,book) pairs. Think of it as learning the joint distribution of (user,book) pairs, or think of your score as an energy function that corresponds to an unnormalized log-probability for that pair. You could use score matching to train the parameters (both the representation and the parameters that go from the representations to score). You could also train something like an RBM or a denoising auto-encoder in the same spirit, i.e., taking (user,book) pairs as the input vector. The other option that you can find in the literature is to force the score to have a particular form such that it is enough to just maximize it over the observed positive examples of (user,book) pairs. For example if the score is related to the cosine angle between the two representations, then I believe it would work.
(Sep 30 '10 at 21:38)
Yoshua Bengio
|
|
What is your output vector? Is it n nodes, one for each book, with only the book of interest being non-zero, or do all the book nodes have their rating? The problem with either approach might be that you need "don't care" values for the books that the user didn't rate, but off the shelf NN packages won't give you that. If you're training with zeros for the unrated books (or worse, only the current book's rating), you're asking the network to learn irrelevant and probably impossible facts in addition to the rating you're interested in. In the Netflix challenge, people got good results from Restricted Boltzmann Machines as one of many approaches. RBMs are neural networks that first learn to reconstruct their input in training. Then they can be allowed to generate "stuff that looks like their input", which in your case would be the rating on a particular unknown book. There's an RBM tutorial in the Python package Theano's documentation here: http://deeplearning.net/tutorial/ You could also find some material in Yehuda Koren's final Netflix paper. Your best bet is probably to give up on neural nets and use Singular Value Decomposition with stochastic gradient descent. Many people posted their code from the Netflix prize, and the contest forums would be a good place to look. It's pretty trivial, gives 90% of the benefit of complicated approaches, and doesn't require the dataset to fit in memory. Neural nets don't require the dataset to fit in memory either, since they are trained in a stochastic way. (I think you know that, just the writing was unclear.)
(Sep 24 '10 at 12:04)
Joseph Turian ♦♦
Can you elaborate a bit on how you'd solve this problem with SVD? How can SVD be used in place of a neural net?
(Sep 24 '10 at 12:55)
Miles Egan
@Miles Egan: represent your data as a sparse UxB matrix M, where U is the number of users and B the number of books, where M_ij is the score user i assigned to book j. You then have some M_ij values from the training data and want to predict the others. Now compute the truncated SVD approximation of this matrix that minimizes the regularized square (for example, you can use others) error of the observed entries. This will give you a k-dimensional vector for each user and a k-dimensional vector for each book. To predict the score a user would assign to a book you just compute their dot products.
(Sep 24 '10 at 14:47)
Alexandre Passos ♦
|