I am dealing with a data set containing roughly n=4000 binary observations Y1,…,Yn and p=1000 binary explanatory variables. I suspect that many of these explanatory variables are not relevant for predicting the observations. Moreover, it is clear that certain groups of explanatory variables are very highly correlated. What are the usual approaches for dealing with this kind of situation? Indeed, quick simulations show that classical logistic regression and SVMs do not work too well. Is there a method that can identify these groups of highly correlated binary variables while the model is being fitted to the data? I am looking for something in the spirit of the LASSO that can do variable selection and model fitting at the same time. (Question also asked on stats.stackexchange without much success.)
If you know the group structure (likely you don't...), then the group lasso may be an option, and I believe there is existing work on modifying the group lasso for classification with logistic regression. Otherwise, the trace lasso may be better, although that work is relatively new. From what the authors write in the paper, it should work well with or without group structure because it automatically adapts to the correlation of the variables: when the variables are highly uncorrelated it is equivalent to the l1 penalty, while if the variables form highly correlated groups it is equivalent to the l2 penalty. But I don't think there is any work yet on modifying it to do classification (though this is what we are currently doing for a course project...). Anyway, I hope you can get the 'spirit' of the lasso.
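For concreteness, here is a minimal numpy sketch of group-lasso-penalized logistic regression fitted by proximal gradient descent. This is an illustration I am adding, not code from the trace lasso paper, and the step size and penalty strength are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def group_lasso_logistic(X, y, groups, lam=0.1, lr=0.5, n_iter=1000):
    """Proximal gradient descent for mean logistic loss plus the
    group-lasso penalty lam * sum_g sqrt(|g|) * ||beta_g||_2.

    groups: integer group id for each column of X.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        # gradient step on the smooth (logistic loss) part
        grad = X.T @ (sigmoid(X @ beta) - y) / n
        beta -= lr * grad
        # proximal step: block soft-thresholding, one block per group
        for g in np.unique(groups):
            idx = groups == g
            thresh = lr * lam * np.sqrt(idx.sum())
            norm = np.linalg.norm(beta[idx])
            if norm <= thresh:
                beta[idx] = 0.0  # the whole group is dropped at once
            else:
                beta[idx] *= 1.0 - thresh / norm
    return beta
```

The key point is the block soft-thresholding step: either a whole group of coefficients is zeroed out together, or the whole group shrinks jointly, which is what gives groupwise variable selection.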
l1 penalties cannot find groups of highly correlated variables; it is a well-known problem. You can have a look at this other discussion on MetaOptimize. Also, logistic regression is indeed not tailored for binary explanatory variables. Here is a hack that I can suggest: cluster your correlated variables together, and use the average of each cluster as a new feature. This mitigates both of your problems. In practice, for this kind of strategy, I have found that you want many clusters, so a bottom-up approach like agglomerative clustering works well for the clustering step.
Thanks Gael: these are valuable comments, and I will give this approach a try.
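As an illustration of this cluster-then-average hack, here is a sketch using scikit-learn's FeatureAgglomeration, which does exactly this (bottom-up clustering of features, then replacing each cluster by its mean). The synthetic data and all parameter values below are made up for the example:

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data: 10 latent binary variables, each spawning 10 noisy
# binary copies -> 100 highly correlated explanatory variables.
Z = rng.integers(0, 2, size=(1000, 10))
X = np.repeat(Z, 10, axis=1).astype(float)
flips = rng.random(X.shape) < 0.1
X[flips] = 1.0 - X[flips]
y = Z[:, 0]  # the response depends on the first latent group only

# Bottom-up (agglomerative) clustering of the features; each cluster
# is then replaced by the average of its member columns.
model = make_pipeline(
    FeatureAgglomeration(n_clusters=10),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
```

Averaging within a cluster also denoises the copies, so the downstream classifier sees 10 cleaner features instead of 100 redundant ones.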
(Apr 13 '12 at 07:27)
alekk
|
When you say that logistic doesn't work too well, have you tried l1-penalized logistic?
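For reference, l1-penalized logistic regression in scikit-learn looks roughly like this; the data here is synthetic and the penalty strength C is arbitrary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 20)).astype(float)
y = (X[:, 0] + X[:, 1] >= 1).astype(int)  # only 2 of 20 features matter

# the liblinear solver supports the l1 penalty; smaller C means a
# stronger penalty and hence more coefficients driven exactly to zero
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
n_nonzero = np.count_nonzero(clf.coef_)
```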
Thanks Gael: yes, it seems that L1-penalized logistic regression cannot really find groups of highly correlated variables. Also, I am not sure that the usual logistic regression is tailored to deal with binary explanatory variables, is it?
alekk: Binary explanatory variables are the one case logistic regression deals with very well. If your variables are real-valued you usually need lots of normalization and binning tricks to get the most out of logistic regression.
@Alexandre Passos: I do not understand your comment -- what I meant in my question is that I am trying to predict a binary response $Y$ based on the knowledge of a $1000$-dimensional binary vector $X=(X_1, \ldots, X_{1000})$. Each $X_i$ is a binary variable, and that does not seem to be the usual setup where logistic regression works well.
Logistic regression deals very well with binary features; it assigns to each such feature a "weight" as to whether its presence, all other evidence considered, is predictive of the positive or negative class, and does so for all features jointly (the difference between naive Bayes and logistic regression is that naive Bayes estimates how, averaging over everything else, each feature predicts the class, while logistic regression treats everything else as given). This works very well for binary features, but falls over if your features are real-valued or ordinal; in practice, those kinds of features are usually mapped to binary features. I'm assuming you're using L2 regularization, as this usually makes a huge difference in logistic regression models.
@Alexandre Passos: this is really new to me. In my field everybody uses logistic for real-valued features. Do you have a reference to recommend?
Gael: I can't find a reference in a paper, but it's something that's well-accepted in the NLP community. See for example the following question on this website: http://metaoptimize.com/qa/questions/1927/real-valued-features-in-crfs . For me the big issue is that the linearity assumption in CRFs implies that, for example, a feature value twice as big is twice as strong evidence, and that values with opposite signs necessarily mean opposite things. After binning I've almost never found that an actual linear relationship was the best. I think it can be optimal for Gaussian features or something like that, just not for features with a weird distribution.
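The binning trick mentioned above can be sketched with scikit-learn's KBinsDiscretizer (my example; the bin count is arbitrary):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))  # one real-valued feature

# quantile binning into 5 bins with one-hot encoding: each real value
# is mapped to 5 binary indicator features, exactly one of which is 1,
# so the model can learn a separate weight per bin instead of assuming
# a linear effect of the raw value
binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense",
                          strategy="quantile")
Xb = binner.fit_transform(x)
```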
Fair enough. I do believe that it's optimal for Gaussian-distributed features. Choosing empirically the optimal loss (without cross-validation) for a given task is an interesting problem for which I have no answer in practice.