I was reading up on MaxEnt Markov models (MEMMs), an extension of logistic regression to sequences, and found out that these can be trained on a combination of labeled and unlabeled data using a modification of Baum-Welch, the EM algorithm for HMMs. It then struck me that this is (at least conceptually) very similar to the EM algorithm for semi-supervised naïve Bayes (Nigam et al.). Do I conclude correctly that it should then also be possible to combine the EM algorithm of Nigam et al. with a logistic regression classifier at its core instead of a naïve Bayes one?

And if so, has this been tried before or described in the literature? Or is this optimizing the wrong objective, because logistic regression is not a generative model?
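If I understand the proposal, the EM-style loop would look something like the sketch below (a minimal binary-case illustration in plain NumPy with made-up data and helper names, not an established implementation): the E-step labels the unlabeled pool with the current model's predicted probabilities, and the M-step refits on the union.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lr=0.5, iters=1000):
    """Gradient-descent logistic regression; soft labels y in [0, 1]
    are handled naturally by the log-loss gradient (p - y)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

def em_lr(X_lab, y_lab, X_unlab, rounds=5):
    """EM-style loop: E-step labels the unlabeled pool with the current
    predicted probabilities; M-step refits on labeled + soft-labeled data."""
    w = fit_lr(X_lab, y_lab)
    for _ in range(rounds):
        q = sigmoid(X_unlab @ w)          # E-step: soft labels
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, q])
        w = fit_lr(X, y)                  # M-step
    return w

X_lab = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y_lab = np.array([1.0, 1.0, 0.0, 0.0])
X_unlab = np.array([[1.5, 1.5], [-1.5, -1.5]])
w = em_lr(X_lab, y_lab, X_unlab)
p = sigmoid(X_unlab @ w)  # should be confidently high for the first point, low for the second
```

Note that, as the answer below argues, with logistic regression the soft self-labels contribute zero gradient at the optimum, so this loop by itself adds no information beyond the labeled fit.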
While everything that Oscar said is true, doing it for logistic regression itself can add no further value. To see this, imagine you have the optimal solution on your training examples. The gradient of the log loss on each example is the difference between the predicted and observed probability. So when you add the test examples with labels set identically to their predictions, the gradient contribution from them is zero; added to your original optimization problem, which was already optimal, this means you're still optimal.

The reason the methods Oscar described work is that they do something other than defining a label equal to the prediction. There are generally three ways to do so: (1) model P(x|y) instead of P(y|x), as then observing P(x) can give you more information; (2) exaggerate the confidence of your classifier (e.g., round all probabilities above, say, 0.9 up to 1 and below, say, 0.1 down to 0, and generate training examples from those); and (3) use domain constraints on your training examples to shift the probabilities around (also known as posterior regularization or generalized expectation, or even graph-based semi-supervised learning). An example of (3) is having "weak labels" that tell you that some test points should have the same class, which can be fed into the estimated probabilities.

Good point Alexandre! The methods I discussed all modify the conditional distribution on the unlabeled data before using it to constrain the model. Viterbi self-training is the extreme variant of this, which tends not to be robust.
(Mar 01 '13 at 17:52)
Oscar Täckström
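The zero-gradient argument in the answer above is easy to check numerically. A minimal sketch (plain NumPy; the weight vector and feature matrix are made-up data standing in for an optimum on labeled data):

```python
import numpy as np

def grad_logloss(w, X, y):
    """Gradient of the logistic-regression log loss: sum_i (sigmoid(w.x_i) - y_i) x_i."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y)

rng = np.random.default_rng(0)
w = rng.normal(size=3)                   # stands in for the optimum on the labeled data
X_unlab = rng.normal(size=(5, 3))
q = 1.0 / (1.0 + np.exp(-X_unlab @ w))   # the model's own predictions as soft labels

g = grad_logloss(w, X_unlab, q)          # gradient contribution of the self-labeled examples
print(g)                                 # zero: (p - q) vanishes when q = p
```

Since the added examples contribute a zero gradient at the current parameters, the original optimum stays optimal, exactly as described.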
What you describe is basically a form of self-training. I used a similar method in a recent paper to adapt a CRF model for named-entity recognition, trained on multiple source languages, to a target language. Instead of treating the probabilities as observations, however, I used the predicted label-sequence distribution to sample self-labeled instances, but the effect should be similar.

One problem with this approach (even more so with Viterbi self-training) is that you are basically training your model to mimic its own predictions, which can lead to problems with robustness. In other words, if the model's own predictions are wrong, you force it to reinforce its own errors.

We have an upcoming paper at NAACL this year ("Target Language Adaptation of Discriminative Transfer Parsers") that tries to circumvent this with something we call "relaxed self-training". The idea is to use the original model to produce ambiguous predictions on unlabeled data and then give the model freedom to distribute probability mass within the set of ambiguous predictions in order to improve likelihood. Technically, we optimize the likelihood resulting from marginalizing over the ambiguous predictions. See this invited talk by Ryan McDonald, which briefly describes the approach. One thing to note is that it may be more effective for structured prediction, where there are structural constraints that can guide the learning over the latent structure.

Another similar approach is the semi-supervised CRF by Jiao et al. (2006), where they apply an entropy regularizer when including unlabeled data. These papers discuss learning with ambiguous supervision: Jin and Ghahramani (2002), Riezler et al. (2002), Dredze et al. (2009), Cour et al. (2011).
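The marginalized objective behind the "ambiguous predictions" idea can be sketched as follows. This is only an illustration with a plain multiclass logistic model and made-up label sets, not the parser from the paper: each instance comes with a set of allowed labels, and the loss marginalizes the model's probability over that set.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ambiguous_nll(W, X, label_sets):
    """Negative log-likelihood marginalized over each instance's allowed labels:
    -sum_i log sum_{y in S_i} p(y | x_i)."""
    probs = softmax(X @ W)                       # (n_instances, n_classes)
    total = 0.0
    for i, allowed in enumerate(label_sets):
        total -= np.log(probs[i, sorted(allowed)].sum())
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 5))                      # 5 classes
label_sets = [{0, 2}, {1}, {3, 4}, {0, 1, 2}]    # ambiguous supervision
loss = ambiguous_nll(W, X, label_sets)
```

Minimizing this loss lets the model shift probability mass freely among the allowed labels of each instance, rather than being pinned to a single (possibly wrong) self-prediction.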
I've been reading up, and I found that there's an EM variant that should work with an arbitrary probabilistic classifier: semi-supervised discriminant CEM. The algorithm is as follows:
The loss is the log loss on

What you wrote is what Oscar referred to, in his comment on my post, as "Viterbi self-training", and it is notoriously unstable. You'll have better luck with something similar to generalized expectation http://jmlr.csail.mit.edu/papers/volume11/mann10a/mann10a.pdf or entropy regularization http://www.iro.umontreal.ca/~lisa/pointeurs/entropy_regularization_2006.pdf , both of which are just as easy to implement, and at least GE is more stable.
(Mar 06 '13 at 07:13)
Alexandre Passos ♦
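The entropy-regularization alternative suggested above can be sketched for binary logistic regression like this (a minimal illustration with a made-up weight `lam` and random data; not the exact formulation of the linked paper): the objective is the supervised log loss plus a penalty on the entropy of the predictions on unlabeled points, which pushes those predictions toward confident values without fixing them to any particular label.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def entropy_reg_loss(w, X_lab, y_lab, X_unlab, lam=0.1):
    """Supervised log loss plus lam * mean prediction entropy on unlabeled data.
    Minimizing the entropy term pushes unlabeled predictions away from 0.5."""
    eps = 1e-12
    p_lab = sigmoid(X_lab @ w)
    sup = -np.mean(y_lab * np.log(p_lab + eps)
                   + (1.0 - y_lab) * np.log(1.0 - p_lab + eps))
    p_un = sigmoid(X_unlab @ w)
    ent = -np.mean(p_un * np.log(p_un + eps)
                   + (1.0 - p_un) * np.log(1.0 - p_un + eps))
    return sup + lam * ent

rng = np.random.default_rng(0)
w = rng.normal(size=3)
X_lab, y_lab = rng.normal(size=(10, 3)), (rng.random(10) < 0.5).astype(float)
X_unlab = rng.normal(size=(20, 3))
loss = entropy_reg_loss(w, X_lab, y_lab, X_unlab)
```

Unlike hard self-labeling, the unlabeled term here never commits to a label, so the model is not forced to reproduce its own possibly wrong predictions.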