|
In the seminal paper by Michael Collins which introduced the structured perceptron, the title says "Discriminative Training Methods for Hidden Markov Models ...". My question is: how can you train a generative model like an HMM with a discriminative method? I found another note by Tom Minka where he describes what is meant by discriminative training, but I believe I have confused myself further with it. |
|
While the Tom Minka note you linked to is the proper explanation of this, I prefer to think of it in a slightly different way. First, do not think of an HMM in terms of a generative model. An HMM, when all is said and done, is an algorithm you run on training data which gives you a set of parameters, which you call log P(x|z) and log P(z|z'). At prediction time, you feed these numbers into the Viterbi algorithm and get a prediction out of it. So my suggestion is to think of an HMM not as a generative model, but simply as a way of obtaining the numbers you put into the Viterbi algorithm.

If you do that, you notice a few things. First, the Viterbi algorithm doesn't care whether the log-probabilities you feed into it are normalized. You can put literally any real numbers there and it will still compute something: the sequence of hidden states which maximizes the sum of the Phi(x,z) and Phi(z,z') you put into it (I'm calling them Phi instead of P to suggest they are no longer probabilities). You can then view the original algorithm for training HMMs, by collecting counts, as one way of estimating these Phi quantities: it sets them to the probabilities which maximize the probability of the observed sequence given the latent sequence.

But when you are making predictions at test time, what you really want to know is the best label sequence you can get, so you'd like to train a model to predict the label sequence given the observed sequence. Here is one way to do it: assign an unnormalized probability to each label sequence, conditioned on the observed sequence, proportional to the exponentiated sum of all the active Phis. Now running forward-backward (which still works) gives you the normalization constant of this probability distribution, and running Viterbi gives you the MAP estimate. 
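To make the first observation concrete, here is a minimal Viterbi decoder over arbitrary real-valued scores (a sketch; the names phi_emit and phi_trans are mine, not from the post). It maximizes the sum of the scores without ever checking that they are normalized log-probabilities:

```python
# phi_emit[t][s]: score Phi(x_t, s) for observation t under state s
# phi_trans[sp][s]: score Phi(sp, s) for the transition sp -> s
# Any real numbers work; nothing requires normalized log-probabilities.

def viterbi(phi_emit, phi_trans, n_states):
    T = len(phi_emit)
    # best[t][s] = best total score of any state sequence ending in s at time t
    best = [[0.0] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]
    for s in range(n_states):
        best[0][s] = phi_emit[0][s]
    for t in range(1, T):
        for s in range(n_states):
            scores = [best[t - 1][sp] + phi_trans[sp][s] for sp in range(n_states)]
            back[t][s] = max(range(n_states), key=lambda sp: scores[sp])
            best[t][s] = scores[back[t][s]] + phi_emit[t][s]
    # trace back the argmax state sequence
    z = [max(range(n_states), key=lambda s: best[T - 1][s])]
    for t in range(T - 1, 0, -1):
        z.append(back[t][z[-1]])
    return list(reversed(z))
```

Feeding it unnormalized (even negative) numbers still yields the score-maximizing sequence, which is the point: the decoder is agnostic about where the numbers came from.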
So now you can, given an observed sequence and a label sequence, compute the probability of the label sequence given the observed sequence: it is the ratio between the exponentiated score of this pair of sequences and the number you get from running forward-backward. So you can finally write down this likelihood of the label sequence as a function of the Phi parameters in your model. This is a conditional random field. Note that when training it you really maximize P(z|x), not P(x|z). If you do stochastic gradient descent on this conditional random field likelihood, then for each training example you update the parameters as

theta <- theta + alpha * ( f(x, z*) - E[f(x, z)] )
where f(x, z*) is the feature vector of the sequence and its true label, alpha is a step size, and E[f(x, z)] is the expected feature vector under P(z|x), which you get from running forward-backward and computing expected counts. Finally, if you add a temperature parameter to this likelihood and let it go to zero, the expectation converges to the MAP answer, so you get the following update

theta <- theta + f(x, z*) - f(x, z_hat),   where z_hat is the Viterbi prediction,
which is exactly the perceptron update described in the Collins paper. So what this does is produce a set of parameters which, when fed to the Viterbi algorithm, tend to produce answers that look like the answers in your labeled training data. It is not really training an HMM discriminatively, because HMMs have constraints (the parameters have to be locally normalized, for example) which our new model doesn't have. I hope this helps. |
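The pieces above can be sketched in a few lines. This is an illustrative toy implementation (all names are mine, and the indicator features are a simplifying assumption): pair_score sums the active Phis for a labelled pair, log_partition is the forward pass that yields the normalizer, cond_log_likelihood is the CRF objective log P(z|x), and perceptron_update is the temperature-zero limit matching the Collins update:

```python
import math
from collections import Counter

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def pair_score(phi_emit, phi_trans, z):
    # sum of all active Phis for the pair (x, z); phi_emit[t][s] already
    # encodes the observation at time t, so x does not appear explicitly
    s = sum(phi_emit[t][z[t]] for t in range(len(z)))
    return s + sum(phi_trans[z[t - 1]][z[t]] for t in range(1, len(z)))

def log_partition(phi_emit, phi_trans, n_states):
    # forward recursion in log space: log of the summed exponentiated
    # scores over every label sequence, i.e. the normalizer log Z(x)
    alpha = [phi_emit[0][s] for s in range(n_states)]
    for t in range(1, len(phi_emit)):
        alpha = [logsumexp([alpha[sp] + phi_trans[sp][s] for sp in range(n_states)])
                 + phi_emit[t][s] for s in range(n_states)]
    return logsumexp(alpha)

def cond_log_likelihood(phi_emit, phi_trans, n_states, z):
    # log P(z | x) = score(x, z) - log Z(x)
    return pair_score(phi_emit, phi_trans, z) - log_partition(phi_emit, phi_trans, n_states)

def features(x, z):
    # indicator feature counts f(x, z): one count per (observation, state)
    # pair and per (state, state') transition
    f = Counter()
    for t in range(len(x)):
        f[("emit", x[t], z[t])] += 1
    for t in range(1, len(x)):
        f[("trans", z[t - 1], z[t])] += 1
    return f

def perceptron_update(weights, x, z_gold, z_pred):
    # theta <- theta + f(x, z*) - f(x, z_hat); a no-op when the Viterbi
    # prediction z_pred already equals the gold sequence z_gold
    for k, v in features(x, z_gold).items():
        weights[k] += v
    for k, v in features(x, z_pred).items():
        weights[k] -= v
    return weights
```

The only difference between the two training rules is what gets subtracted: the CRF subtracts an expectation over all sequences (computed by forward-backward), while the perceptron subtracts the features of the single Viterbi winner.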
|
I think this paper gives a nice comparison between naive Bayes and logistic regression to make this point clearer. From there, it also explains how you can build a linear-chain conditional random field from an HMM (HMM and linear-chain CRF form a generative-discriminative pair). To put it simply, when we train a model generatively we are interested in the joint probability, say p(x, y | theta), where x is the observed variable, y is the latent variable (e.g. the class label), and theta denotes the parameters. During discriminative training, on the other hand, we are interested in finding the latent (or class) variable given the data, i.e. the conditional distribution p(y | x, zeta). Note that the two are related: let's say we write p(x, y | theta) = p(y | x, zeta) * p(x | epsilon). Now, to train a model discriminatively, we can drop the last term on the right-hand side (assuming that the prior on x has a separate set of parameters) and model only the conditional p(y | x, zeta). In the generative model, however, you have the additional constraint that zeta and epsilon are tied to the same parameter set theta. So in both cases one is trying to model all the variables using the joint probability, but with different kinds of constraints on the model. |
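A tiny numerical illustration of this factorization, with made-up counts: the discriminative target is only the conditional table p(y | x), while the generative target is the joint p(x, y), and the two are linked by p(x, y) = p(y | x) * p(x):

```python
from collections import Counter

# made-up toy data: pairs of (observed x, label y)
data = [("rain", 1), ("rain", 1), ("rain", 0), ("sun", 0), ("sun", 0), ("sun", 1)]
n = len(data)

joint = Counter(data)                                   # generative target: p(x, y)
p_joint = {k: c / n for k, c in joint.items()}

marg_x = Counter(x for x, _ in data)                    # marginal p(x)
p_cond = {k: joint[k] / marg_x[k[0]] for k in joint}    # discriminative target: p(y | x)

# the joint always factors as p(x, y) = p(y | x) * p(x)
for (x, y), p in p_joint.items():
    assert abs(p - p_cond[(x, y)] * marg_x[x] / n) < 1e-12
```

Discriminative training fits only p_cond and spends no parameters on p(x); generative training fits p_joint, which implicitly ties the conditional and the marginal to the same parameter set.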