So, I've got two linear models trained with L2-regularized logistic regression, and I'd like to describe how "similar" they are. A simple approach would be to take the cosine similarity between the weight vectors w1 and w2. My questions are these: 1) This must have been done before; does anyone have a citation? 2) If it hasn't been done, are there any ideas about how you would compare two logistic regression models, preferably without evaluating on data?

(Another extension to this question: L2-regularized regression is equivalent to assuming a zero-mean Gaussian prior on the weight vector. This should allow a Bayesian interpretation of the weights, i.e. variances for w1 and w2 in addition to the point estimates. With variances, it should be easy enough to use a t-test to compare the weight vectors. Are there any good Bayesians out there who have insight or experience with this?)
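For concreteness, here is a minimal sketch of the cosine-similarity baseline from the question, assuming the two models are scikit-learn LogisticRegression objects; the synthetic data and the split are placeholders standing in for "same task, different training samples":

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def weight_cosine_similarity(model_a, model_b):
    """Cosine similarity between the weight vectors of two fitted linear models."""
    w_a, w_b = model_a.coef_.ravel(), model_b.coef_.ravel()
    return float(w_a @ w_b / (np.linalg.norm(w_a) * np.linalg.norm(w_b)))

# Synthetic stand-in: one task, two disjoint training samples.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model1 = LogisticRegression(penalty="l2", C=1.0).fit(X[:1000], y[:1000])
model2 = LogisticRegression(penalty="l2", C=1.0).fit(X[1000:], y[1000:])
print(weight_cosine_similarity(model1, model2))
```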
I've seen this used informally in multi-task learning papers to talk about similarity between tasks, but I've never seen it formally defined or seen anything interesting proved about it. Statisticians often run statistical tests to see whether a given coefficient in a logistic regression model is significantly different from zero. You could probably tweak their formulas to get some kind of hypothesis test if your models are related at all (one has strictly more features than the other, or one is trained on strictly more data than the other, etc.), though I'm not sure how one would approach that without getting more information from you.

About the prior you mentioned: some kind of KL divergence between classifiers has been used in the past. Confidence-weighted classification learns the variances together with the means of linear classifiers, and it does so by minimizing the KL divergence between the distributions over weights before and after updating. A similar trick of assigning some variance to a linear classifier is used in the PAC-Bayes bound for linear classifiers, which looks at the KL divergence between the distribution defined by a stochastic extension of a classifier and some prior distribution over classifiers. Note, however, that if you just put Gaussians with the same variance around your classifiers, the KL divergence is just proportional to the distance between the classifiers anyway (while presumably you don't really care about their norms).

I indirectly posed this question to Tony Jebara (through a student who is taking his class). His answer was to either 1) look at the cosine similarity between w1 and w2, or 2) use the p-norm of the difference vector w1-w2. Both very sensible, but neither touches on the variance estimation -- probably because the question the student posed didn't include that aspect :). The two citations look interesting; I'll need to do some reading before I have a clear sense of how to use them for this task.

I don't think you'd want to just put equal-variance Gaussians over the classifiers, for the reasons you describe. But if you interpret L2 regularization as placing a zero-mean, constant-variance Gaussian prior over the weight vector, the learned weight vector is the MAP estimate, i.e. the mode of the posterior. You should also have access to the variance, though. Then you can do a series of t-tests to see which weights differ significantly, no? It seems like this is straightforward, but I don't have a citation of anyone doing it, so I was hoping to sanity check it here. (Though it might be a simplified version or special case of what's included in the papers you added above.)

I'm looking at two classifiers trained to predict the same label, using the same features, drawn from different training data. If you have test data, it's pretty easy to see whether the two classifiers are behaving the same way: compare accuracy, use McNemar's test, etc. But that is essentially just a Monte Carlo exploration of the feature space, with the test data as the random samples. Generally the test data is small, so you're only exploring a limited part of the feature space, and you don't get any explanatory power about which features/dimensions are responsible for the differences in performance. So I was thinking it'd be better to look at the models themselves.
(Nov 28 '12 at 12:06)
Andrew Rosenberg
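One way to make the "variances on the weights" idea from the thread concrete is the Laplace approximation: with the Gaussian-prior reading of the L2 penalty, the posterior around the MAP weights is approximated as a Gaussian whose covariance is the inverse Hessian of the negative log-posterior, and the per-coefficient variances can then feed an approximate two-sample z-test. This is a sketch rather than a standard recipe; `laplace_weight_variances` and `per_weight_z_tests` are hypothetical helpers, `alpha` is the L2 penalty strength (1/C in scikit-learn terms), and the intercept is deliberately left out to keep the Hessian over the features only.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def laplace_weight_variances(model, X, alpha):
    """Per-weight variances from the Laplace approximation: the diagonal of the
    inverse Hessian of the negative log-posterior at the MAP weights.
    Assumes a dense feature matrix X and a model fit with fit_intercept=False."""
    p = model.predict_proba(X)[:, 1]             # P(y=1 | x) at the MAP estimate
    s = p * (1.0 - p)                            # logistic curvature terms p(1-p)
    H = X.T @ (X * s[:, None]) + alpha * np.eye(X.shape[1])
    return np.diag(np.linalg.inv(H))

def per_weight_z_tests(w1, var1, w2, var2):
    """Approximate two-sample z-test per coefficient, treating the two
    (approximate) posteriors as independent Gaussians."""
    z = (w1 - w2) / np.sqrt(var1 + var2)
    p_values = 2.0 * norm.sf(np.abs(z))          # two-sided
    return z, p_values

# Hypothetical usage, with X1/y1 and X2/y2 the two training sets:
# m1 = LogisticRegression(penalty="l2", C=1.0, fit_intercept=False).fit(X1, y1)
# m2 = LogisticRegression(penalty="l2", C=1.0, fit_intercept=False).fit(X2, y2)
# v1 = laplace_weight_variances(m1, X1, alpha=1.0)   # alpha = 1 / C
# v2 = laplace_weight_variances(m2, X2, alpha=1.0)
# z, p = per_weight_z_tests(m1.coef_.ravel(), v1, m2.coef_.ravel(), v2)
```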
I guess the data-vs-model comparison depends on the dimensionality? I.e., does your data lie on a subspace of your input space? If so, then cosine similarity would give you lower similarity than you might actually observe, i.e. you would have to project the weight vectors first ...
(Nov 28 '12 at 13:23)
SeanV
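A rough sketch of the projection idea under one possible reading: take the principal subspace of the (centered) training data, project both weight vectors onto it, and compute the cosine there. The 99% variance cutoff and the helper names are arbitrary illustrative choices, not anything from the thread.

```python
import numpy as np

def project_onto_data_subspace(w, X, var_threshold=0.99):
    """Project a weight vector onto the principal subspace of the centered data,
    keeping enough right-singular vectors to explain var_threshold of the variance."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(explained, var_threshold)) + 1
    B = Vt[:k]                    # orthonormal basis (rows) of the retained subspace
    return B.T @ (B @ w)          # component of w lying in that subspace

def subspace_cosine(w1, w2, X):
    p1 = project_onto_data_subspace(w1, X)
    p2 = project_onto_data_subspace(w2, X)
    return float(p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2)))
```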
Well, you can't really compare a similarity measure based on, say, McNemar's test or another error-based measure, which brings in something beyond the models (namely data), with a model-space distance between decision boundaries (models only). So "lower similarity" here is really just a "lower [but incomparable] similarity score", no?
(Dec 04 '12 at 18:07)
Andrew Rosenberg
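For reference, the test-data side of this comparison (McNemar's test on a shared test set) can be sketched as below, assuming scikit-learn-style classifiers and using statsmodels for the test itself; the variable names are placeholders.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(model_a, model_b, X_test, y_test):
    """McNemar's test on the 2x2 correct/incorrect contingency table of two classifiers."""
    correct_a = model_a.predict(X_test) == y_test
    correct_b = model_b.predict(X_test) == y_test
    table = np.array([
        [np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ])
    return mcnemar(table, exact=True)   # result exposes .statistic and .pvalue

# Hypothetical usage:
# result = mcnemar_compare(model1, model2, X_test, y_test)
# print(result.pvalue)
```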
Well, I think that's my point... what do you want to measure with your similarity score? To me, the cosine approach is equivalent to doing a Monte Carlo on randomly generated ('uniform'/spherical-Gaussian?) data, i.e. exploring the whole feature space, and measuring the classification differences; whereas if you only use the test data to compare the two models, you are implicitly saying that the observed data does not occupy the full feature space (and/or is not uniformly distributed). Personally, the test-data approach makes more sense to me, and I was suggesting you could do something similar with the cosine approach by first projecting the weight vectors onto the space spanned by your training data... What I feel you are missing out on is the distribution of your data. To take a trivial example: if your features are two-dimensional but only one feature varies in the data, then the weight on the second feature is irrelevant. So what I believe you need to incorporate is the mean and covariance matrix of your input data - but I don't know how!?
(Dec 05 '12 at 03:54)
SeanV
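This is not something suggested in the thread, but one concrete way to fold the data's covariance into the model-space comparison is a covariance-weighted cosine, which is the same thing as the Pearson correlation between the two models' linear scores over the observed inputs; a hedged sketch:

```python
import numpy as np

def covariance_weighted_cosine(w1, w2, X):
    """Cosine similarity of the weight vectors under the empirical input covariance.
    Equivalently, the Pearson correlation between the two linear scores X @ w1 and X @ w2."""
    Sigma = np.cov(X, rowvar=False)
    num = w1 @ Sigma @ w2
    den = np.sqrt((w1 @ Sigma @ w1) * (w2 @ Sigma @ w2))
    return float(num / den)
```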
OK, I see where you're coming from. Right, a measure between the decision boundaries ends up treating all regions of the feature space as equally relevant, while evaluation on test data samples from the "plausible" areas of the feature space. My reason for wanting to avoid test-data evaluation is that I don't trust that it has sufficient coverage to thoroughly capture the generalization of the classifier, and I was looking for a more comprehensive description. But you bring up a good point about not caring about all areas of the feature space equally. I wouldn't want to try to approximate the mean and covariance of the input data -- I don't think it's Gaussian. But measure theory might have something to say about this...
(Dec 07 '12 at 07:57)
Andrew Rosenberg
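As a footnote to the Monte-Carlo framing above, both views can be expressed as a disagreement rate over probe points; the only difference is where the probe points come from. A hedged sketch, with the probe generators as placeholder suggestions rather than anything from the thread:

```python
import numpy as np

def disagreement_rate(model_a, model_b, X_probe):
    """Fraction of probe points on which the two classifiers predict different labels."""
    return float(np.mean(model_a.predict(X_probe) != model_b.predict(X_probe)))

# "Whole feature space" view: spherical-Gaussian probes (n_features is a placeholder).
# X_sphere = np.random.default_rng(0).standard_normal((100_000, n_features))
# "Plausible region" view: held-out data as probes.
# print(disagreement_rate(model1, model2, X_sphere), disagreement_rate(model1, model2, X_test))
```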