I have a learning problem that is somewhat like the Netflix problem (but not exactly). tl;dr: It's a classification problem, but with infinitely many classes. Can I use Vowpal Wabbit for this, or some other toolkit that supports large-scale learning on Hadoop? (I have hundreds of thousands of examples and hundreds of millions of features.)

Here is the model, which I think is pretty standard. I have N observations of (person, movie) pairs, each recording that a particular user u watched a particular movie m.

I define a logistic (i.e., maximum entropy) model p(m|u) over movies conditioned on users, based on features of the user/movie pair:

p(m|u) = exp(w · f(u,m)) / Σ_{m'} exp(w · f(u,m'))

Here f(u,m) is a feature vector whose features can fire on properties of the user, the movie, or both (e.g., indicators for a movie's genre, a user's demographic attributes, or conjunctions of the two). Each feature has a weight, and I'd like to learn the weight vector w that maximizes the log-likelihood of the training data, Σ_i log p(m_i | u_i).

Then, given the learned weights, I can make predictions for new users; for example, I can rank candidate movies m for a new user u by p(m|u).

Summary. Again, the question is: can I use Vowpal Wabbit for this, or some other toolkit that supports large-scale learning on Hadoop? (I have hundreds of thousands of examples and hundreds of millions of features.) All the Vowpal Wabbit examples I've seen are multi-class and assume there are only a few classes, but in my problem I have an infinite number of classes, so I'm not sure how I'd format the input data for learning. An alternative model would be fine too (simple linear instead of log-linear); I just want to do large-scale learning and get a meaningful prediction score for new users and movies based on their features.
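One common way to make an unbounded label set tractable with a binary learner is negative sampling: each observed (user, movie) pair becomes a positive example, a few randomly drawn movies become negatives, and the user/movie features are written out explicitly so the model generalizes to unseen movies. Here is a minimal sketch that emits Vowpal Wabbit-format text lines; all feature names and movies are invented for illustration, and this reduction approximates rather than exactly optimizes the full maxent objective:

```python
import random

def vw_examples(user_feats, movie_feats, watched, catalog, k=3, rng=None):
    """Reduce the 'infinite classes' problem to binary classification:
    one positive example for the observed (user, movie) pair, plus k
    randomly sampled negative movies from the catalog."""
    rng = rng or random.Random(0)

    def line(label, m):
        # VW text format: label, then namespaced features (|u user, |m movie).
        return f"{label} |u {' '.join(user_feats)} |m {' '.join(movie_feats[m])}"

    out = [line(1, watched)]
    negatives = [m for m in catalog if m != watched]
    for m in rng.sample(negatives, k):
        out.append(line(-1, m))
    return out

# Toy catalog: movie -> feature names (hypothetical).
movie_feats = {
    "shrek":   ["genre_animation", "year_2001"],
    "titanic": ["genre_romance", "year_1997"],
    "alien":   ["genre_scifi", "year_1979"],
    "up":      ["genre_animation", "year_2009"],
}

examples = vw_examples(["age_25_34", "likes_animation"], movie_feats,
                       watched="shrek", catalog=list(movie_feats), k=2)
for ex in examples:
    print(ex)
```

A file of such lines can then be trained with something like `vw --loss_function logistic data.txt`. VW also has cost-sensitive multiclass reductions with label-dependent features (e.g. `--csoaa_ldf`), which match this setting more directly; I'd check the VW documentation for the exact input format there.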
I think you should do some more research: this is a standard problem, and there will be plenty of papers out there.
I am no expert, but a standard approach is to first build a recommender system (I have matrix factorisation schemes in mind), then build a separate system to predict the factors.
That is, the recommender system learns a latent "film tastes" representation for each user, and you then train a (possibly nonlinear) regression to predict that representation from your user features (e.g., "studied at Harvard").
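The two-stage recipe above can be sketched with plain numpy on toy random data; truncated SVD stands in for the matrix factorisation, and linear least squares stands in for the "(nonlinear) regression":

```python
import numpy as np

# Hypothetical toy data: a dense ratings matrix and user side features.
rng = np.random.default_rng(0)
ratings = rng.random((6, 4))        # 6 users x 4 movies
user_feats = rng.random((6, 3))     # 6 users x 3 observable features

# Step 1: matrix factorisation via truncated SVD -> latent "film tastes".
k = 2
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
user_factors = U[:, :k] * s[:k]     # 6 x k latent representation per user
movie_factors = Vt[:k].T            # 4 x k latent representation per movie

# Step 2: regress the latent user factors from the observable features,
# so factors can be predicted for users with no rating history.
W, *_ = np.linalg.lstsq(user_feats, user_factors, rcond=None)  # 3 x k

# For a brand-new user we only have features: predict their factors,
# then score every movie by the inner product with the movie factors.
new_user = rng.random(3)
pred_factors = new_user @ W
scores = movie_factors @ pred_factors   # one score per movie
print(scores)
```

In practice step 1 would use a factorisation that handles missing ratings (as in the Netflix-prize literature) rather than a dense SVD, but the shape of the pipeline is the same.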
How can you train a model to predict an infinite number of classes without seeing those classes at training time, unless you have a strong prior on their probabilities? Otherwise, you will in effect only have a model for the classes actually seen, which may cover very little of what you need to predict (I don't know).
@digdug The classes I haven't seen in training are modeled using the features that fire on them. For example, I might never have observed the movie Shrek, but I know a whole lot of features that fire on it, e.g., it's an animated film, etc.
Mahout (Hadoop's first machine learning framework) has collaborative filtering and a recommendation system. There's also GraphLab, which is quite optimized for this sort of problem (though it doesn't scale as well as Mahout, AFAIK). See my references for those frameworks in my answer to the following question: http://metaoptimize.com/qa/questions/6339/mldata-miningbig-data-popular-language-for-programming-and-community-support#13058