|
I have two questions regarding CRFs for joint sequence models (sometimes referred to as factorial CRFs). First, are there any examples where this has been successful for both tasks individually, and not only for the joint task? In the paper "Dynamic Conditional Random Fields for Jointly Labeling Multiple Sequences" by McCallum et al. (2003) (http://www.cs.umass.edu/~mccallum/papers/dcrf-nips03.pdf), they improve the joint task and the POS tagging, but not the NER task. They blame this on the estimation, which penalizes the task with more classes at the expense of the other task; that is my intuition on this problem as well. Second, what kind of inference is preferred in this scenario? In the cited paper they use approximate inference such as loopy BP. Would sampling provide similar results? The reason I want to do sampling is that I want to learn a latent segmentation of the other task, so that one chain is only semi-Markov while the other is Markov. Does anyone have any experience with how to set up the sampling in a similar setting? |
|
The simplest way to get parameters from sampling is to do Gibbs sampling (a form of MCMC, also known as Glauber dynamics). Fix the labels to some random setting, then for each label position, compute its local probability distribution conditioned on the labels around it under the current settings of the probability model, and sample the label from that distribution. Repeat this process for some time and use the averaged counts as estimates of your marginals, which you can plug into your gradient equation. Gibbs sampler pseudocode: http://pgm.stanford.edu/Algs/page-506.pdf However, I don't expect this approach to be a lot better than loopy BP, because information here still flows locally from label to label, just like in BP. I expect that in cases where loopy BP fails to converge, the corresponding Gibbs sampler will fail to mix in reasonable time. There are samplers which consider more global structure, like sampling with collapsed particles. Similarly, loopy belief propagation can be generalized to "cluster belief propagation", where messages are passed between larger regions. Koller notes in her PGM book that one potential problem in using loopy BP for training comes from the fact that it can fail to converge. If you are doing gradient descent and enter a "divergent BP" region, your gradient steps become random. Since convex BP like Tree-Reweighted BP is convergent, it can be preferable in such a scenario. However, regular loopy BP is faster and more accurate (when it converges), so if your training set does not favor a "highly loopy" model, loopy BP may never enter the divergent region. |
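
To make the Gibbs procedure described above concrete, here is a minimal sketch in plain Python for a two-chain factorial model. Everything about the parameterization is an assumption for illustration, not something taken from the cited paper: the names `gibbs_marginals`, `node`, `trans` and `cross` are hypothetical, the potentials are given directly as log-scores rather than computed from features, and both chains are plain Markov (no semi-Markov segmentation). It only estimates per-position marginals; for the gradient you would also accumulate counts for the edge (pairwise) configurations in the same way.

```python
import math
import random


def sample_from_log_weights(log_w, rng):
    """Draw an index with probability proportional to exp(log_w[i])."""
    m = max(log_w)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in log_w]
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1


def gibbs_marginals(node, trans, cross, n_sweeps=2000, burn_in=200, seed=0):
    """Estimate per-position marginals of a two-chain factorial model.

    node[c][t][k]  : log-potential of label k at position t in chain c (c in {0, 1})
    trans[c][j][k] : log-potential of the transition j -> k within chain c
    cross[j][k]    : log-potential linking chain-0 label j and chain-1 label k
                     at the same position
    (All of these are hypothetical inputs for this sketch.)
    """
    rng = random.Random(seed)
    T = len(node[0])
    n_labels = [len(node[0][0]), len(node[1][0])]
    # Start from a random labeling of both chains.
    labels = [[rng.randrange(n_labels[c]) for _ in range(T)] for c in (0, 1)]
    counts = [[[0] * n_labels[c] for _ in range(T)] for c in (0, 1)]

    for sweep in range(n_sweeps):
        for c in (0, 1):
            other = 1 - c
            for t in range(T):
                # Local conditional: only the neighbors of (c, t) matter.
                log_w = []
                for k in range(n_labels[c]):
                    s = node[c][t][k]
                    if t > 0:
                        s += trans[c][labels[c][t - 1]][k]
                    if t + 1 < T:
                        s += trans[c][k][labels[c][t + 1]]
                    o = labels[other][t]
                    s += cross[k][o] if c == 0 else cross[o][k]
                    log_w.append(s)
                labels[c][t] = sample_from_log_weights(log_w, rng)
        if sweep >= burn_in:
            # After burn-in, count the current label at every position.
            for c in (0, 1):
                for t in range(T):
                    counts[c][t][labels[c][t]] += 1

    kept = n_sweeps - burn_in
    return [[[n / kept for n in counts[c][t]] for t in range(T)] for c in (0, 1)]
```

The number of sweeps and the burn-in length here are arbitrary defaults; in a highly loopy or strongly coupled model the chain may need far longer to mix, which is the same concern raised above about loopy BP failing to converge.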