I am dealing with a data set containing roughly $n = 4000$ binary observations $Y_1, \ldots, Y_n$ with $p = 1000$ binary explanatory variables. I suspect that a lot of these explanatory variables are not relevant to the prediction of the observations. Moreover, it is clear that there are certain groups of explanatory variables that are very highly correlated.

What are the usual approaches to deal with this kind of situation? Indeed, quick simulations show that classical logistic regression and SVM do not work too well. Is there any method that can tackle the identification of these groups of highly correlated binary variables at the same time that the model is fitted to the data? I am looking for something in the spirit of the LASSO that can do variable selection and model fitting at the same time.

Question also asked on stats.stackexchange without much success.

asked Apr 13 '12 at 05:27 by alekk

When you say that logistic doesn't work too well, have you tried l1-penalized logistic?

(Apr 13 '12 at 05:40) Gael Varoquaux

thanks Gael: yes, it seems that the L1-penalized logistic cannot really find groups of highly correlated variables. Also, I am not sure that the usual logistic regression is tailored to deal with binary explanatory variables, is it?

(Apr 13 '12 at 05:43) alekk

alekk: Logistic regression actually deals very well with binary explanatory variables; it is with real-valued variables that you usually need lots of normalization and binning tricks to get the most out of logistic regression.

(Apr 13 '12 at 06:45) Alexandre Passos ♦

@Alexandre Passos: I do not understand your comment -- what I meant in my question is that I am trying to predict a binary response $Y$ based on the knowledge of a $1000$-dimensional binary vector $X=(X_1, \ldots, X_{1000})$. Each $X_i$ is a binary variable, and that does not seem to be the usual setup where logistic regression works well.

(Apr 13 '12 at 07:30) alekk

Logistic regression deals very well with binary features; it assigns to each such feature a "weight" indicating whether its presence, all other evidence considered, is predictive of the positive or negative class, and does so for all features jointly (the difference between naive Bayes and logistic regression is that naive Bayes estimates how, averaging over everything else, each feature predicts the class, while logistic regression treats everything else as given). This works very well for binary features, but falls over if your features are real-valued or ordinal; in practice, those kinds of features are usually mapped to binary features. I'm assuming you're using L2 regularization, as this usually makes a huge difference in logistic regression models.
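As an illustration of what this comment describes (not from the original thread), here is a minimal sketch of L2-regularized logistic regression fitted directly on 0/1 features with scikit-learn; the synthetic `X` and `y` are placeholders for the real data, and `C=1.0` is an untuned default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.binomial(1, 0.3, size=(4000, 1000))   # 0/1 design matrix, as in the question
y = (X[:, 0] | X[:, 1]).astype(int)           # toy binary response driven by two features

# C is the inverse regularization strength: smaller C means a stronger L2 penalty.
# After fitting, each binary feature gets one weight in clf.coef_.
clf = LogisticRegression(penalty="l2", C=1.0)
print(cross_val_score(clf, X, y, cv=5).mean())
```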

(Apr 13 '12 at 07:33) Alexandre Passos ♦

@Alexandre Passos: this is really new to me. In my field everybody uses logistic for real-valued features. Do you have a reference to recommend?

(Apr 14 '12 at 05:19) Gael Varoquaux

Gael: I can't find a reference in a paper, but it's something that's well-accepted in the NLP community. See for example the following question on this website: http://metaoptimize.com/qa/questions/1927/real-valued-features-in-crfs . For me the big issue is that the linearity assumption in CRFs implies that, for example, a feature value twice as big is twice as strong evidence, and that values with opposite signs necessarily mean opposite things. After binning I've almost never found that an actual linear relationship was the best. I think it can be optimal for Gaussian features or something like that, just not for features with a weird distribution.
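A sketch of the binning trick mentioned above (my own illustration, not part of the original comment): a real-valued feature is replaced by one indicator column per quantile bin, so a downstream linear model is no longer forced into a linear relationship. The function name and the choice of 10 bins are hypothetical.

```python
import numpy as np

def bin_feature(x, n_bins=10):
    """Map a 1-d real-valued feature to n_bins one-hot binary columns."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])  # interior bin edges
    bin_idx = np.digitize(x, edges)                              # bin index for each sample
    return np.eye(n_bins)[bin_idx]                               # one-hot (0/1) encoding

x = np.random.randn(1000)      # a real-valued feature with an arbitrary distribution
X_binary = bin_feature(x)      # shape (1000, 10), entries in {0, 1}
```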

(Apr 14 '12 at 06:32) Alexandre Passos ♦

Fair enough. I do believe that it's optimal for Gaussian-distributed features. Choosing empirically the optimal loss (without cross-validation) for a given task is an interesting problem for which I have no answer in practice.

(Apr 14 '12 at 08:33) Gael Varoquaux

2 Answers:

If you know the group structure (likely you don't...), then the group lasso may be an option, and I believe there is existing work on modifying the group lasso for classification with logistic regression. Otherwise, the trace lasso may be better, although this work is relatively new. From what the authors write in the paper, it should work well with or without group structure because it adapts automatically to the correlation of the variables: when the variables are nearly uncorrelated it is equivalent to the l1 penalty, while when the variables form highly correlated groups it behaves like the l2 penalty. I don't think there is any work yet on adapting it to classification (though this is what we are currently doing for a course project...). But anyway, I hope this captures the 'spirit' of the lasso you are after.
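Since the answer mentions the group lasso, here is a rough numpy sketch (mine, not from the original answer) of proximal gradient descent for group-lasso-penalized logistic regression. It assumes the group structure `groups` is known in advance, and the step size and penalty `lam` are hypothetical untuned values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def group_lasso_logistic(X, y, groups, lam=0.1, step=0.01, n_iter=500):
    """Proximal gradient (ISTA) for logistic loss + sum_g lam*sqrt(|g|)*||w_g||_2."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / n        # gradient of the mean logistic loss
        w = w - step * grad
        for g in groups:                             # group-wise soft thresholding
            norm = np.linalg.norm(w[g])
            if norm > 0:
                w[g] *= max(0.0, 1.0 - step * lam * np.sqrt(len(g)) / norm)
    return w                                         # whole groups are shrunk exactly to zero

# groups = [np.arange(0, 10), np.arange(10, 25)]    # example of a known group structure
# w = group_lasso_logistic(X, y, groups)
```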

answered Apr 14 '12 at 14:11 by Dawen Liang

It is a well-known problem that l1 penalties cannot find groups of highly correlated variables. You can have a look at this other discussion on MetaOptimize.

Also, logistic is indeed not tailored for binary explanatory variables.

Here is a hack that I can suggest: cluster your correlated variables together, and use the averages of each cluster as new features. This will mitigate both of your problems. In practice, for this kind of strategy, I have found that you want many clusters, and thus a bottom-up approach like agglomerative clustering works well for the clustering step.
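A minimal sketch of this hack with scikit-learn (not part of the original answer): FeatureAgglomeration performs bottom-up agglomerative clustering of the columns of X and pools each cluster, by default with the mean, before the logistic regression. The choice of 200 clusters is a hypothetical starting point to be tuned by cross-validation.

```python
from sklearn.cluster import FeatureAgglomeration
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    FeatureAgglomeration(n_clusters=200),      # many clusters, as suggested above
    LogisticRegression(penalty="l2", C=1.0),   # fitted on the cluster averages
)
# model.fit(X, y)   # X: (n_samples, 1000) binary matrix, y: binary response
```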

answered Apr 13 '12 at 07:16 by Gael Varoquaux

Thanks Gael - these are precious comments; I will have a try with this approach.

(Apr 13 '12 at 07:27) alekk
