|
I have a data set with a few discrete variables having a lot of classes(>100,000). This is too large to encode using a one-hot method. What is the best way to deal with such variables? Should I be using some other model? I have yes/no responses to a set of questions and am trying to predict features based on these. But each user answers only a very small random subset of questions, so in order to make a prediction I need to determine the probability of answering yes/no for the rest of the questions. Grouping questions subjectively is not helpful since it does not help determine likely response. I can't cluster people by similar responses because the data is too sparse. There was a similar question posted - http://metaoptimize.com/qa/questions/8028/discrete-valued-features-logistic-regression, however it does not specifically address my problem. There was something mentioned in the answer that I did not understand - "If the inputs are from a larger discrete set, you may want to do the same thing conceptually, but replace the multiplication by the weights with a table lookup. If you need substantial parameter sharing for statistical reasons/data sparsity reasons, you will need to change the model or go down the Bayesian path." Could someone explain this in greater detail? |
|
The other post is saying that if you have a feature which you encode as a 1-of-k binary vector, for some large k, it will be faster to use a lookup table (dictionary or such) for storing weights rather than doing the full multiplication with a large vector which is mostly zeros. For example, if your classifier was of the form P(y|x) = softmax(Wx), you could simply pick of the columns of W which correspond to the non-zero dimensions of x and add row-wise. At least that is what I took away. In regards to your actual problem, it sounds like what you want is collaborative filtering. Collaborative filtering seems to be a name for this type of problem. Any idea about how people deal with these inputs which have many classes? I will try some form of clustering approach to reduce the dimentionality and keep as much information as possible.
(Jan 09 at 18:06)
Vishal
I guess I'm still not completely clear on what you're trying to do. Are you only interested in the likely value the unanswered questions would take for a particular user, or are your trying to predict some other value based on the answers? I'm also confused about your encoding problem. If all you have is yes/no answers to questions, it seems your input should just be a n dimensional binary (1 = yes, 0 = no) vector, where n is the number of questions.
(Jan 09 at 20:32)
alto
For now I am only interested in the probability of the user answering with a particular value(yes/no). The data is sparse so its encoded as user_identifier, question_identifier, answer + extra features that I am not using right now. A user will typically answer about 1% of the question set. The typical size of the samples is ~ 200k+ users and about 1000 questions in the question set. If I do it on a per user basis, then I loose information about how similar users would answer the question.
(Jan 10 at 00:05)
Vishal
If it was me, I would probably try something like the approach described in www.cs.utoronto.ca/~hinton/absps/netflixICML.pdf . I imagine it would be even easier in your case since you have binary variables(yes/no instead of 1-5 stars). You might want to also look at what other people have done with the netflix data since there are similarities between the datasets, i.e. inferring values for a large set of unanswered/unrated questions/movies from a small subset of answers/ratings for a particular user.
(Jan 10 at 10:55)
alto
|