I was reading up on MaxEnt Markov models (MEMMs), an extension of logistic regression to sequences, and found out that these can be trained on a combination of labeled and unlabeled data using a modification of Baum-Welch, the EM algorithm for HMMs.

It then struck me that this is (at least conceptually) very similar to the EM algorithm for semisupervised naïve Bayes (Nigam et al.).

Do I conclude correctly that it should then also be possible to combine the EM algorithm of Nigam et al. with a logistic regression classifier at its core instead of a naive Bayes one? I.e.

lr = train_lr(X[labeled], y[labeled])
while not converged:
    prob = lr.predict_proba(X[all])    # class probabilities used as soft labels
    lr = train_lr(X[all], prob)

And if so, has this been tried before/described in the literature? Or is this optimizing the wrong objective because LR is not a generative model?
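To make the loop concrete, here is a rough sketch of how I imagine implementing it with scikit-learn (just one possible encoding: since LogisticRegression doesn't accept probabilistic targets directly, the soft labels are emulated by giving each unlabeled point one weighted copy per class):

import numpy as np
from sklearn.linear_model import LogisticRegression

def soft_self_train(X_lab, y_lab, X_unlab, n_iter=10):
    # Fit on the labeled data, then repeatedly refit on labeled plus
    # soft-labeled unlabeled data. Soft labels are encoded by giving each
    # unlabeled point one copy per class, weighted by the predicted
    # class probability.
    lr = LogisticRegression().fit(X_lab, y_lab)
    classes = lr.classes_
    for _ in range(n_iter):
        prob = lr.predict_proba(X_unlab)              # E-step: soft labels
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate(
            [y_lab] + [np.full(len(X_unlab), c) for c in classes])
        w_all = np.concatenate(
            [np.ones(len(X_lab))] + [prob[:, k] for k in range(len(classes))])
        lr = LogisticRegression().fit(X_all, y_all, sample_weight=w_all)  # M-step
    return lr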

asked Mar 01 '13 at 09:11 by larsmans
edited Mar 01 '13 at 10:21

3 Answers:

While everything that Oscar said is true, doing this for logistic regression itself adds no further value. To see why, imagine you already have the optimal solution on your labeled examples. For each example, the gradient of the log loss is the difference between the predicted and the observed probability. So when you add the unlabeled examples with labels set identically to their predictions, their gradient contribution is zero; adding them to an optimization problem you had already solved leaves you at the same optimum.
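To see this numerically, here is a quick sanity check with a single example and a hypothetical weight vector:

import numpy as np

# The per-example gradient of the logistic log loss w.r.t. the weights is
# (p - y) * x. Setting the pseudo-label y equal to the model's own
# prediction p makes that contribution exactly zero.
rng = np.random.default_rng(0)
w = rng.normal(size=5)   # hypothetical weight vector
x = rng.normal(size=5)   # hypothetical feature vector

p = 1.0 / (1.0 + np.exp(-(w @ x)))   # predicted probability
y = p                                # pseudo-label set to the prediction
print((p - y) * x)                   # -> array of zeros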

The reason the methods Oscar described work is that they do something other than defining a label equal to the prediction. There are generally three ways to do so: (1) model P(x|y) instead of P(y|x), since then observing P(x) can give you more information; (2) exaggerate the confidence of your classifier (e.g. round all probabilities above, say, 0.9 to 1 and all below, say, 0.1 to 0, and generate training examples from those); and (3) use domain constraints on your training examples to shift the probabilities around (also known as posterior regularization or generalized expectation, or even graph-based semi-supervised learning). An example of (3) is having "weak labels" that tell you that some test points should have the same class, which can be used to adjust the estimated probabilities.
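As an illustration of approach (2), a rough sketch for a binary classifier with 0/1 labels (the 0.9 / 0.1 thresholds are just the ones used as an example above):

import numpy as np
from sklearn.linear_model import LogisticRegression

def confident_pseudo_labels(lr, X_unlab, hi=0.9, lo=0.1):
    # Keep only the unlabeled points the classifier is confident about
    # and round their probabilities to hard 0/1 labels.
    p = lr.predict_proba(X_unlab)[:, 1]
    keep = (p >= hi) | (p <= lo)
    return X_unlab[keep], (p[keep] >= hi).astype(int)

# Then retrain on the labeled data plus the confident pseudo-labels:
# lr = LogisticRegression().fit(X_lab, y_lab)
# X_new, y_new = confident_pseudo_labels(lr, X_unlab)
# lr = LogisticRegression().fit(np.vstack([X_lab, X_new]),
#                               np.concatenate([y_lab, y_new]))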

answered Mar 01 '13 at 17:38 by Alexandre Passos ♦

Good point, Alexandre! The methods I discussed all modify the conditional distribution on the unlabeled data before using it to constrain the model. Viterbi self-training is the extreme variant of this, which tends not to be robust.

(Mar 01 '13 at 17:52) Oscar Täckström

What you describe is basically a form of self-training. I used a similar method in a recent paper to adapt a CRF-based named-entity recognition model, trained on multiple source languages, to a target language. Instead of treating the probabilities as observations, however, I used the predicted label sequence distribution to sample self-labeled instances, but the effect should be similar.
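In a plain (non-structured) classifier setting, the sampling step would look roughly like this (a simplified per-instance analogue of sampling label sequences from the CRF):

import numpy as np

def sample_pseudo_labels(model, X_unlab, seed=0):
    # Instead of taking the argmax or the probabilities themselves as
    # targets, draw one label per unlabeled point from the model's
    # predicted class distribution.
    rng = np.random.default_rng(seed)
    prob = model.predict_proba(X_unlab)
    return np.array([rng.choice(model.classes_, p=p) for p in prob])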

One problem with this approach (even more so with Viterbi self-training) is that you are basically training your model to mimic its own predictions, which can lead to problems with robustness. In other words, if the model's own predictions are wrong, you force it to reinforce its own errors. We have an upcoming paper at NAACL this year ("Target Language Adaptation of Discriminative Transfer Parsers") that tries to circumvent this with something we call "relaxed self-training". The idea is to use the original model to produce ambiguous predictions on unlabeled data and then give the model freedom to distribute probability mass within the set of ambiguous predictions in order to improve likelihood. Technically, we optimize the likelihood resulting from marginalizing over the ambiguous predictions. See this invited talk by Ryan McDonald, which briefly describes the approach. One thing to note is that this may be more effective for structured prediction, where there are structural constraints that can guide the learning over the latent structure.

Another similar approach is the semi-supervised CRF by Jiao et al. (2006), where they apply an entropy regularizer when including unlabeled data.
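For a plain logistic regression rather than a CRF, the same idea boils down to adding an entropy penalty on the unlabeled predictions to the supervised log loss, something like this (binary case, no weight regularization, just to show the objective):

import numpy as np
from scipy.optimize import minimize

def entropy_regularized_lr(X_lab, y_lab, X_unlab, gamma=0.1):
    # Minimize: labeled log loss + gamma * entropy of the predictions
    # on the unlabeled points (pushing them away from 0.5).
    def objective(w):
        p_lab = 1.0 / (1.0 + np.exp(-(X_lab @ w)))
        sup = -np.mean(y_lab * np.log(p_lab + 1e-12)
                       + (1 - y_lab) * np.log(1 - p_lab + 1e-12))
        p_un = 1.0 / (1.0 + np.exp(-(X_unlab @ w)))
        ent = -np.mean(p_un * np.log(p_un + 1e-12)
                       + (1 - p_un) * np.log(1 - p_un + 1e-12))
        return sup + gamma * ent
    w0 = np.zeros(X_lab.shape[1])
    return minimize(objective, w0, method="L-BFGS-B").x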

These papers discuss learning with ambiguous supervision: Jin and Ghahramani (2002), Riezler et al. (2002), Dredze et al. (2009), and Cour et al. (2011).

answered Mar 01 '13 at 16:21 by Oscar Täckström
edited Mar 01 '13 at 16:35

I've been reading up and I found out that there's an EM variant that should work with an arbitrary probabilistic classifier: semisupervised discriminant CEM. The algorithm is as follows:

lr = train_lr(X[labeled], y[labeled])
prob = lr.predict_proba(X[all])
y[unlabeled] = argmax over prob for X[unlabeled]

while loss(X[all], y, prob) improves:
    prob = lr.predict_proba(X[all])
    y[unlabeled] = argmax over prob for X[unlabeled]
    lr = train_lr(X[all], y)

The loss is the log loss on X[labeled], y[labeled] as they appear in the training set, plus the log loss on X[unlabeled] and its predicted y[unlabeled]. This is obviously quite close to self-training, but without the need to heuristically set all kinds of thresholds. I'm going to try this algorithm.
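A rough scikit-learn version of the loop (my own sketch; the stopping check uses the combined log loss described above):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def cem_lr(X_lab, y_lab, X_unlab, max_iter=50, tol=1e-4):
    # E-step: predict class probabilities on all points.
    # C-step: hard-assign each unlabeled point to its argmax class.
    # M-step: refit on labeled + self-labeled data.
    X_all = np.vstack([X_lab, X_unlab])
    lr = LogisticRegression().fit(X_lab, y_lab)
    prev = np.inf
    for _ in range(max_iter):
        prob = lr.predict_proba(X_all)
        y_unlab = lr.classes_[prob[len(X_lab):].argmax(axis=1)]
        y_all = np.concatenate([y_lab, y_unlab])
        loss = log_loss(y_all, prob, labels=lr.classes_)
        if prev - loss < tol:      # stop when the loss no longer improves
            break
        prev = loss
        lr = LogisticRegression().fit(X_all, y_all)
    return lr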

answered Mar 06 '13 at 04:51 by larsmans

What you wrote is what Oscar referred to in his comment on my post as "Viterbi self-training", and it is notoriously unstable.

You'll have better luck with something similar to generalized expectation http://jmlr.csail.mit.edu/papers/volume11/mann10a/mann10a.pdf or entropy regularization http://www.iro.umontreal.ca/~lisa/pointeurs/entropy_regularization_2006.pdf , both of which are as easy to implement, and at least GE is more stable.

(Mar 06 '13 at 07:13) Alexandre Passos ♦