Hello everybody,

I would like to pick up a discussion from a previous question on the Difference between MRF and CRF and ask a few related questions that I'm still unsure about:

1 - In what way does the discriminative approach p(labels | data) allow conditioning on the entire observed data, while the joint p(labels, data) doesn't? Is it even correct that we can use the entire data BECAUSE of the discriminative approach?

Allow me to propose the following explanation and please correct me if I'm wrong:

Say I want to label the words in the sentence "Take this white sheep" with the labels {singular, plural, -} depending on the grammatical number (or "-" if the word doesn't have a number). The correct result would thus be:

[Take (-)] [this (singular)] [white (-)] [sheep (singular)].

If all information were available at each site, the labels on "this" and "sheep", for example, would not be independent given "white" (as the Markov property requires), because "sheep" could know that "this" is singular (not "these"), and thus the probability of "sheep" being a singular word is higher.

On the other hand, if we were modelling p(labels | all data), then the entire sentence would be given anyway, and (the label on) "this" is independent of (the label on) "sheep" given "white" AND the entire sentence. (Bottom line: just giving the neighboring labels doesn't make "sheep" and "this" independent, but if we know the entire sentence, then we don't really need to know which label "this" got in the end to conclude that "sheep" is probably singular.)

Is this the point?

2 - A common claim is that modelling the joint p(labels, data) is computationally more expensive because p(data) must be "enumerated". What is this referring to?

3 - I remember reading papers on MRFs that use not just the data values but also first derivatives of the data at every site. Why is this still an MRF? If arbitrarily many derivatives are allowed without violating the MRF property, I could just take the derivatives and use a Taylor expansion to get a look at the entire image. Would this not rather be a case for a CRF?

Any help would be most appreciated.

Thanks,

Bert

asked Nov 16 '11 at 15:50

Bert Draw


One Answer:

In order,

  1. Using discriminative training allows you to use features that condition on the observed sequence at no penalty in terms of model complexity. For example, if you have features between a word and the labels of the adjacent words in a sequence tagger, your graphical model looks very connected, and hence inference will be expensive. In a CRF, however, these features are, as far as the model is concerned, separate factors for each individual label, so you still have a sequence, for which inference is very cheap (the first sketch after this list illustrates such features). I don't think your explanation is exactly right, as it's not clear which features you mean when you talk about the dependence/independence of "this" and "sheep". Are you running some kind of preprocessing and connecting determiners to the corresponding nouns, so that you can have features such as "this word was preceded by a singular determiner"? Modelling P(labels | all data) allows you to include these features at no extra cost, while modelling P(labels, all data) would mean more complex inference.

  2. The log-likelihood of an MRF is a sum of factors over the variables minus a normalizing constant, which is the log of a sum, over all possible assignments to all variables, of the score of that assignment. This exists to make sure that the probabilities sum to one. To learn the parameters of such a model you need to compute the gradient, and the gradient of this log-sum-exp part is very expensive to compute, as it requires running inference and summing over all possible model configurations (the second sketch after this list makes the cost concrete). If your model is a tree this is very cheap, and CRFs allow you to use features that, if you were maximizing the joint likelihood, would make your model not a tree.

  3. Derivative of the data? Do you have a reference? Probably what they're doing is computing the gradient with respect to the parameters for each data point, which you need in order to optimize the parameters. I don't understand the rest of this question.
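
To illustrate point 1, here is a minimal sketch of linear-chain CRF scoring in Python (my own toy code, not any particular library's API; all feature names are made up). The feature extractor is free to inspect the whole observed sentence x, yet every feature involves at most two adjacent labels, so the label graph stays a chain and inference stays cheap:

    def features(x, t, y_prev, y_curr):
        """Hypothetical feature templates: free to look at all of x."""
        next_word = x[t + 1] if t + 1 < len(x) else "<END>"
        return [
            f"word={x[t]}|label={y_curr}",
            f"prev_label={y_prev}|label={y_curr}",
            # Conditions on a neighboring *word* but only on y_curr,
            # so it adds no extra edges between labels:
            f"next_word={next_word}|label={y_curr}",
        ]

    def score(x, y, weights):
        """Unnormalized log-score of a label sequence y for sentence x."""
        total = 0.0
        for t in range(len(x)):
            y_prev = y[t - 1] if t > 0 else "<START>"
            for f in features(x, t, y_prev, y[t]):
                total += weights.get(f, 0.0)
        return total

    x = "Take this white sheep".split()
    y = ["-", "singular", "-", "singular"]
    print(score(x, y, {"word=sheep|label=singular": 1.5}))  # -> 1.5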
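
And for point 2, a brute-force version of the normalizer whose gradient is expensive (reusing score from the sketch above). The log-partition function log Z is a log-sum-exp over every possible label assignment, exponential in the sequence length; for a chain the forward algorithm computes the same number in O(n * |L|^2), but joint-likelihood features over labels and data can destroy that tree structure:

    import itertools
    import math

    def log_partition_bruteforce(x, labels, weights):
        """log Z(x): log-sum-exp over all len(labels)**len(x) sequences."""
        scores = [score(x, y, weights)
                  for y in itertools.product(labels, repeat=len(x))]
        m = max(scores)
        return m + math.log(sum(math.exp(s - m) for s in scores))

    # Even two labels and a 20-word sentence already mean 2**20 terms.
    print(log_partition_bruteforce(x, ["singular", "plural", "-"], {}))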

answered Nov 16 '11 at 16:39

Alexandre Passos ♦

Dear Alexandre,

Thank you very much for your detailed reply.

1 - Yes, the basic idea was that I could (if I found this reasonable) run some preprocessing like this. I admit that if the problem were finding singular/plural words, it would be pointless to make this distinction already in the preprocessing. My main problem is this: you said "Using discriminative training allows you to use features that condition on the observed sequence at no penalty in terms of model complexity", and this is the point I don't really understand. Let's forget about the sheep example; it was probably quite stupid. But the reason that I can include the entire data "for free" without violating the Markov property is the following:

  • an MRF says (among other things) "nodes A and B are independent given all neighbors of A". If I input all the data to A and B, then the same feature can influence both A and B. So unless I know all the data, A and B could still be connected through that common feature.

  • a CRF says "nodes A and B are independent given all neighbors of A and all the data", and this means that there can be no hidden connection between A and B (through the data) because the data is already openly known (see the toy check below). Correct?
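
If it helps, here is a toy numerical check of those two bullets in Python (made-up potentials, not my actual problem): two label nodes A and B that interact only through a shared data node D are dependent when D is summed out, but factorize exactly once D is observed:

    import itertools

    def phi(u, d):                    # arbitrary pairwise potential
        return 2.0 if u == d else 1.0

    def p(a, b, d):                   # unnormalized P(A, B, D)
        return phi(a, d) * phi(b, d)

    vals = [0, 1]

    # Given D = d, the conditional over (A, B) factorizes:
    for d in vals:
        Zd = sum(p(a, b, d) for a, b in itertools.product(vals, repeat=2))
        for a, b in itertools.product(vals, repeat=2):
            joint = p(a, b, d) / Zd
            pa = sum(p(a, bb, d) for bb in vals) / Zd
            pb = sum(p(aa, b, d) for aa in vals) / Zd
            assert abs(joint - pa * pb) < 1e-12   # independent given D

    # With D summed out, A and B stay coupled "through" D:
    Z = sum(p(a, b, d) for a, b, d in itertools.product(vals, repeat=3))
    p00 = sum(p(0, 0, d) for d in vals) / Z                              # 5/18
    pa0 = sum(p(0, b, d) for b, d in itertools.product(vals, repeat=2)) / Z  # 1/2
    assert abs(p00 - pa0 * pa0) > 1e-3            # dependent marginally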

2 - I see, thanks.

3 - I'm afraid I don't have my documents here. From what I remember, it was a gradient of the image data that was used to detect discontinuities in the image. I may be wrong, and I think the question is in fact not so important.

Again, thank you very much for your helpful reply.

Regards,

Bert

(Nov 16 '11 at 17:09) Bert Draw

Regarding your comments on 1, remember that the only difference between CRFs and MRFs is in parameter estimation: in one case I model P(labels | data) and in the other case P(labels, data). So, as far as the CRF is concerned, features over labels and data are just features over labels, because all inference and learning is done with respect to the labels. If you use MRFs you also need to model the data, and this will often make for a more complex model. The difference is not about what is known or not, but about what is being learned, and here CRFs do not learn anything about the data (MRFs do), so dependencies involving the data are free.
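
Spelled out in formulas (standard log-linear notation; s_theta below is my shorthand for the sum of the factor scores):

    \log p_\theta(y \mid x) = s_\theta(x, y) - \log \sum_{y'} \exp s_\theta(x, y')          % CRF
    \log p_\theta(y, x)     = s_\theta(x, y) - \log \sum_{x', y'} \exp s_\theta(x', y')     % MRF

The first normalizer sums over label sequences for the observed x only; the second sums over every possible (data, labels) configuration, which is what "enumerating p(data)" in your question 2 referred to.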

  3. Yes, image people do compute gradients over the data to detect edges. This has nothing to do with MRFs or CRFs.
(Nov 16 '11 at 17:16) Alexandre Passos ♦

Thank you very much.

(Nov 16 '11 at 17:23) Bert Draw