|
Hi! I'm a newbie, just starting in the HCRF-field. Although I have studied the various papers and tutorials available (Maatens,Täckström,McCallum,Quattoni,Juraksky,...), I'm still stumped by, what is for me, the most basic question when creating my model, viz: How can one set up the feature functions, which actually have one or even two states as params, when these states are actually hidden. I'm implying here that "hidden" means "unknown". Does one have to create an FSM with the set of possible states beforehand ? It would be most helpful if one of you experts/gurus could give me answer, and even much better, some fully detailed examples, running under one of the available programs from the Web (CRFsuite,CRF++,BioCRF,HCRF). Thanks in advance. |
|
Stenio, I think I understand your point of confusion. It sounds like you're on the right track: The particular values of the hidden states are unknown, but the set of hidden states and their possible values are known. So, yes, in order to train the model you can imagine first building a FSM for each example (xi,yi) and storing it memory. As an example, suppose we're classifying pairs of strings as "match" or "mismatch" (e.g., for the purpose of detecting duplicates in a database). In this paper, McCallum et al. use an HCRF to learn a latent edit distance model, since features extracted from an aligned pair of strings lead to better classification performance. So, for each training example, which is a pair of strings, you would build a FSM that encodes all the possible sequences of edit operations (e.g., insert, delete, substitute) that could transform the first string into the second string. Each arc/edge is associated with a feature vector whose features may depend on, e.g., the current and previous edit operations (states). What happens in training is that you start with some initial weight vector w (say, the vector of zeros). Then, during each iteration of say L-BFGS (or whenever your w changes), you have to "re-score" the FSM for each example (xi,yi). This is done by computing the inner product of w with the feature vector for each arc aj, then setting the score/value of aj to the result. You can then compute the marginal values of the features, which are needed for the gradient calculation, by running your inference algorithm of choice (e.g., forward-backward for the edit distance FSM) on the re-scored FSM. The key point is that the structure of the FSM (including the feature vectors on the arcs) is fixed for each example (xi,yi); only the values/scores on the arcs change whenever the weights are updated. In practice, you can speed up training substantially by storing the FSMs in memory, but this may not be possible if there are a large number of training examples. In that case, you will have to reconstruct the FSM and score the arcs using the current w every time you process a given example during training. One final point is that in a k-class classification problem, the FSM for each example will typically contain k "copies" of the same FSM, such that the features in the kth copy are all paired with the class label k. These copies are each connected to the same start and final states to form a single FSM. Hopefully that's clear. See this post for more information about some of the different HCRF-like models that have been proposed in the literature. |
|
Most common CRF applications don't rely on hidden states. More specifically, during training, the states are fully observed. If your training set doesn't have labeled states, a standard thing to do is to apply EM for learning, which is kind of like creating "FSM with all possible states beforehand". |
|
Hello: I have studied CRF before, and there are somethings you should go over before using them:
|