|
I am trying to study hidden conditional random fields but I still have some fundamental questions about those methods. I would be immensely grateful if someone could provide some clarification over the notation used on most papers about the topic. This question has also been asked on a similar site, however I noted that MetaOptimize had more similar questions about this topic. In several papers the most common form of the the HCRF model is given as:
In which theta is the parameter vector, w is the class label, o is the observation sequence, s is the hidden state sequence and psi is the potential function. However, I still could't figure out what s means. Is it just a sequence of integer numbers, or is it actually a sequence of nodes in a graph? How one actually computes this summation? Most papers I have read mention only that each si in S captures certain underlying structure of each class (S being the set of hidden states in the model). But I still couldn't figure out what this actually means. |
|
Well, answering my own question: in the case of a linear chain HCRF, the hidden state sequences are calculated in exactly the same way as in hidden Markov models. The HCRF formulation using maximal cliques generalizes much of the structure of a general hidden Markov classifier. Hidden Markov classifiers are generally constructed by considering priors over each possible model and estimating the class label by computing its posterior probabilities. If we represent each model by a clique potential function, and restrict each potential function to a single class label, we can reproduce this exact structure in a HCRF. The only difference will be that parameters in a HCRF will not be constrained to probabilities, so we can also see that all possible solutions given by Markov classifiers are just a subset of the possible solutions given by HCRFs. By the way, the summation I was referring to in the original question is intractable to compute in its given form. Since it represents the result of the potential function over all possible paths, in the case of a linear chain, instead of trying to compute this summation directly, we can proceed by computing the probability of each state/transition occurring in the model and multiplying this probability by the results of the potential function along those states/transitions in a single pass using the sum-product algorithm. The model also does not need to be computed using EM. Since its gradient is readily available, one can just use any off-the-shelf function optimizer to do the job. Conjugate gradient or stochastic gradient updates seems to operate better since they can deal better with violations of convexity. Please someone correct me if I got anything wrong. The best resource I have found so far to help understand CRFs and HCRFs (which are just CRFs with latent variables) has been this tutorial by C. Sutton. I hope it could be of some help for others also having the same questions. |
|
Using a hidden-state sequence to model underlying structure in your labels is just one way of making CRFs with latent variables. I recently did some work on an alternative way of doing it: essentially, the model has binary, stochastic hidden units that are conditionally independent given the data and the labels. Like in Darrell's HCRF model, the idea here is that the hidden units model latent structure in the data that is useful for discrimination. 1
This seems very, very interesting. Would this work as well as HCRFs in the recognition of continuous sequences (such as in speech recognition, or in the recognition of temporal signals in general)? Could you please summarize the advantages of your model over the later approach? I am still reading your paper. Thanks.
(Mar 16 '11 at 22:26)
César Souza
2
If I understood HCRFs correctly, a key difference between the two models is that HCRFs are assigning a single label to a time series/structured object, whereas hidden-unit CRFs assign a label to each node (like you would do in speech recognition, part-of-speech tagging etc.). If you would have a HCRF with a hidden state sequence and a label sequence, inference would become intractable. Inference in hidden-unit CRFs is tractable because all hidden units are conditionally independent. Another difference is in the type of hidden units. In HCRFs, you have a sequence of multinomial hidden units, so you'd have K^T possible hidden-unit states (with K the number of possible states, and T the time series length). The hidden-unit CRF has H binary hidden units for each frame, so a total number of (2^H)^T hidden states. In hidden-unit CRFs, however, the hidden states at time t do not influence the hidden states at time t+1. If the transition weights in the hidden-unit CRF all have the same value, the model reduces to a sequence of discriminative RBMs. Discriminative RBMs have the theoretically interesting property that they are universal approximators of p(y|x) for discrete data x.
(Mar 17 '11 at 06:15)
Laurens van der Maaten
1
@Laurens: Nice work! Traditionally CRFs are trained to maximize joint likelihood over the whole structure (marginalized likelihood in the case of hidden units), but there are formulations for optimizing per position loss as well. In my view any kind of conditional undirected graphical model with latent variables could be called a HCRF, but I am not sure how the term is used in general. The primary strength of your model seems to be the increased modeling power due to the non-linearity introduced by the hidden units. Another potential way to introduce non-linearity would be to use Mercer kernels together with hidden units. Do you have any intuition as how this would fare against using RBMs?
(Mar 17 '11 at 09:33)
Oscar Täckström
I think that model is investigated here. I guess my main concern with that model would be that it scales quadratically in the number of training sequences(?), making it unsuitable for typical NLP and speech applications. One could also use non-stochastic hidden units to get the non-linearity, as was done in [2][one of the first CRF models] (which was published years before Lafferty et al. published their seminal CRF paper).
(Mar 18 '11 at 04:46)
Laurens van der Maaten
@Laurens: I read your paper with great interest as I am also working with log-linear and max-margin latent variable models. Your paper is really good, and unlike some papers in the same category you do the reader a favour by explaining everything very clearly. One related model you may have missed is from another paper by Morency et al. They propose a latent dynamic conditional random field (LDCRF -- not sure why they named it "Latent" instead of "Hidden" in this case, but go figure), which is a variant of the DCRF and generalizes the HCRF. The LDCRF assumes that the labels are conditionally independent given the hidden states, while making a first-order Markov assumption on the hidden state sequence. So, their graphical model looks a lot like that of your hidden-unit CRF, but with a "chain" on the hidden states instead of on the labels. Now, you said above: "If you would have a HCRF with a hidden state sequence and a label sequence, inference would become intractable." Can you please explain exactly why this is the case? Morency et al. don't really explain the reason for their independence assumption, just that it's to "keep training and inference tractable." On the other hand, in Section 3.4 of the DCRF paper, Sutton et al. describe what appears to be an LDCRF without the independence assumption. If I understand correctly (note: my confidence level is not high in this case), they are able to perform exact inference during training. (For reasons that are unclear to me, however, they need to approximate the argmax_y when making test predictions). So, to summarize my questions: 1) Why is inference with hidden state sequences and label sequences intractable?, and 2) How is it that Sutton et al. perform exact inference in their "Marginal DCRF" model, which does have both sequences?
(Apr 24 '11 at 21:49)
Ken Dwyer
1
Thanks, I am aware of the LDCRF paper, but I never quite understood the motivation of their approach. The multinomial hidden units (one per frame) have a very limited capacity to describe latent structure in the data, and to obtain a model that is tractable, the LDCRF needs to sacrifice the connections that model temporal relations between class label (for instance, that an article is often followed by a noun). In general, inference with hidden state and label sequences is intractable because the resulting graph contains loops. As a result, the sum over all possible hidden/label sequences that needs to be evaluated to compute the partition does not factorize anymore, prompting the use of approximate inference algorithms such as loopy belief propagation. There may be some cases in which exact inference can still be performed. For instance, consider a factorial CRF with two label chains; one with K possible labels and one with L possible labels. You could also view this model as a linear-chain CRF with K x L possible labels, so exact summing over (K x L) ^ 2 may still be feasible. I assume something like this happens in the Sutton paper. In general, this quickly gets out of hand though: in hidden-unit CRFs with H hidden units and K labels, we would need to sum over (2^H x K) ^ 2 possible values. In section 3.2 of the DCRF paper, this is described as well; the section discusses an approximate inference method based on loopy belief propagation.
(Apr 26 '11 at 04:05)
Laurens van der Maaten
showing 5 of 6
show all
|
|
If s is a state sequence then it is a sequence of states, each represented however you represent states in your model (which can be integers). The summation is more easily viewed as an integral; that is, the summation variable s must take each of the possible values for the whole sequence. So if you have two states and three hidden nodes, the sum would be P(w, 000, o; theta) + P(w, 001, o; theta) + P(w, 010, o; theta) + P(w, 011, o; theta) + P(w, 100, o; theta) + P(w, 101, o; theta) + P(w, 110, o; theta) + P(w, 111, o; theta), where by 100 I mean that the first node gets state 1 and the other two nodes get the state 0. As you can see this is easily intractable even for small numbers of hidden states or nodes. Thanks for the answer. I noticed that you summed P(w, 000, o; theta) twice. Was it a typo, or was it on purpose? Anyway, I initially thought of that, but assuming HCRFs have one hidden node for each observation this indeed seemed quite intractable to me. Am I still missing something?
(Mar 09 '11 at 18:07)
César Souza
2
Thanks for catching the typo. Yes, it is intractable, but then again inference on many other useful models is also intractable. The most common solutions seem to be training with EM (i.e., instead of maximizing P(w|o;theta) you maximize E_s[log P(w|o,s,theta)] using a point approximation of s, which is tractable in some scenarios), variational mean field (same thing as EM except you use a richer distribution over s instead of a point approximation), gibbs sampling (it's at first non obvious how you can update the parameters with gibbs sampling, but see Michael Wick's SampleRank http://www.cs.umass.edu/~mwick/MikeWeb/Publications_files/wick09samplerank.pdf ), loopy belief propagation (to do this well you need to take care of the nonconvergence problem or use some estimation method that can deal with noisy gradients such as SGD), and probably other techniques.
(Mar 09 '11 at 18:11)
Alexandre Passos ♦
3
Note that if you make a Markov assumption on the hidden state sequence, then you can use standard Viterbi to find the optimal hidden state assignment for a fixed observation sequence. The difficulty is that you need to sum over the joint hidden and observed sequences for the partition function and when performing joint inference. In the Sung and Jurafsky paper they use an N-best approximation, where they replace sum_Y sum_S[...] with sum_L sum_S [...] in the partition function, where L is the N most probable sequences of observation labels. Unfortunately I am not familiar with the N-best search algorithms used in that paper, but they seem to be using some kind of beam search.
(Mar 09 '11 at 18:52)
Oscar Täckström
|