I would like to understand what "finding the most violated constraint" means. The expression is used in connection with the loss function; I have seen it at http://www.yisongyue.com/talks/structural_svm_intro.pptx. Can you give me an example of it, in the sense of how it enables finding the most violated constraints?

Another question: can we add a Hamming loss function to a structured perceptron, for example? How can it provide a supplementary advantage over the existing mechanism?
The simplest way of framing structured prediction is as learning a parameter vector theta such that theta dot f(x, true-label) is bigger than theta dot f(x, false-label) for every false label distinct from the true label, where f(x, y) is a feature map.

So if you use the perceptron, you iterate over these constraints until the true label has a better score than any false label for all training examples. Any false label with a score bigger than the true label's is a violated constraint, and you keep iterating over the examples until no such violated constraint is found. To decide which update to perform, you find the most violated constraint, that is, the false label that maximizes theta dot f(x, false-label). Then you perform a perceptron update, setting theta = theta + f(x, true-label) - f(x, false-label).

Likewise for structured SVMs, the constraints are that theta dot f(x, true-label) > theta dot f(x, false-label) + loss(true-label, false-label), where the loss can be any loss function you want; but if it can be written as w dot (f(x, true-label) - f(x, false-label)), you can do the search in the same way, using the same kind of inference as in the perceptron.
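To make the update concrete, here is a minimal, hypothetical sketch of the whole loop in Python. The toy feature map, tag set, and data are made up for illustration, the search over labels is done by brute-force enumeration (a real system would use Viterbi-style dynamic programming), and Hamming loss plays the role of the loss term:

    import itertools
    import numpy as np

    TAGS = (0, 1)
    OBS_DIM = 2

    def f(x, y):
        # Joint feature map f(x, y): each position's observation vector
        # is accumulated into the block belonging to that position's tag.
        phi = np.zeros(len(TAGS) * OBS_DIM)
        for obs, tag in zip(x, y):
            phi[tag * OBS_DIM:(tag + 1) * OBS_DIM] += obs
        return phi

    def hamming(y_true, y_pred):
        return sum(t != p for t, p in zip(y_true, y_pred))

    def most_violated_constraint(theta, x, y_true):
        # Loss-augmented search: the false label maximizing score + loss.
        # Brute-force enumeration, fine for a toy problem only.
        candidates = (y for y in itertools.product(TAGS, repeat=len(x))
                      if y != y_true)
        return max(candidates,
                   key=lambda y: theta @ f(x, y) + hamming(y_true, y))

    def train(data, epochs=10):
        theta = np.zeros(len(TAGS) * OBS_DIM)
        for _ in range(epochs):
            for x, y_true in data:
                y_hat = most_violated_constraint(theta, x, y_true)
                # If theta.f(x, true) <= theta.f(x, y_hat) + loss, the
                # constraint is violated: do the perceptron update.
                if theta @ f(x, y_true) <= theta @ f(x, y_hat) + hamming(y_true, y_hat):
                    theta += f(x, y_true) - f(x, y_hat)
        return theta

    # Toy data: tag 1 should go with "large" observations.
    data = [
        (np.array([[0.1, 0.2], [0.9, 0.8]]), (0, 1)),
        (np.array([[0.8, 0.9], [0.2, 0.1]]), (1, 0)),
    ]
    theta = train(data)

Note that the loss only appears inside the training-time search and the violation check; at prediction time you just take argmax over y of theta dot f(x, y).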
Do you understand how inference works with structured prediction? E.g., given some learned weight vector and an input sequence, do you understand how to find the most highly scoring output sequence?
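(For concreteness, a minimal Viterbi-style sketch of that argmax for a chain model; the emission and transition score matrices are assumed to have been computed beforehand from the learned weight vector and the input sequence:)

    import numpy as np

    def viterbi(emission, transition):
        # emission[i, t]: score of tag t at position i;
        # transition[s, t]: score of tag s followed by tag t.
        # Returns the highest-scoring tag sequence.
        n, k = emission.shape
        best = emission[0].copy()            # best score of a path ending in each tag
        back = np.zeros((n, k), dtype=int)   # backpointers
        for i in range(1, n):
            scores = best[:, None] + transition + emission[i][None, :]
            back[i] = scores.argmax(axis=0)
            best = scores.max(axis=0)
        y = [int(best.argmax())]             # best final tag
        for i in range(n - 1, 0, -1):        # follow backpointers
            y.append(int(back[i, y[-1]]))
        return y[::-1]

With a Hamming loss, the same routine also performs the loss-augmented search: just add 1 to the emission score of every tag that disagrees with the true tag at that position.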
Indeed, I know how it works. The part that I do not get is the following: if, for instance, I have a structured perceptron (the easiest example) which learns from some training set, and I add a loss function (such as Hamming) to it, the maximisation function (most highly scoring output) will surely do better on training examples in the presence of the loss function.
The problem is that the weight vector won't be updated, because during training the loss function provides enough clues for the maximisation function to be optimised. Because some necessary weight updates aren't done (updates which would have been done without the loss function), the structured perceptron will perform badly on test examples. This is the first point that I don't get: why maximise with an added loss function, since that loss function already provides clues about the answer to the maximisation function?
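(To pin down what I am asking about, a tiny numeric sketch with made-up scores, using the constraint from your answer:)

    # Made-up scores, only to make the loss-augmented check concrete.
    score_true = 2.0    # theta dot f(x, true-label)
    score_false = 1.5   # theta dot f(x, some false label)
    loss = 1.0          # hamming(true-label, false-label)

    # Test-time prediction never sees the loss; here the true label wins.
    print(score_true > score_false)         # True

    # Training-time constraint from the answer:
    # theta dot f(x, true) > theta dot f(x, false) + loss.
    # Here it is violated, so an update would still be performed even
    # though plain prediction already gets this example right.
    print(score_true > score_false + loss)  # False -> violated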
The second point that I don't get is the expression "finding the most violated constraint". If, for instance, I have hierarchical constraints which serve a hierarchical classification, does it mean that the constraints closer to the root are the ones that are "most violated"?