How do you use embeddings (HLBL, Collobert & Weston) as features in a CRF?

Most (all?) CRF packages take single-valued features; do you need to perform some transformation from a multi-dimensional embedding to a single value?

Also, as CRFs don't generally support numerical features, how would you go about converting the numbers to discrete values?

asked Sep 10 '10 at 03:23


Fredrik Jørgensen

2 Answers:

If your CRF package does support numerical features, you create one feature per word per embedding dimension. If you had a word "foo" you'd have a feature that says "word 10 is: foo"; with the embeddings you'd have features that say "word 10 embedding 0: -0.1224", "word 10 embedding 1: 0.0531", etc. If it doesn't support numerical features then you must discretize them in some way. Over here I've seen suggestions for:

  • binning (i.e., make a histogram of the feature values and create one discrete feature for each bar of the histogram)
  • using k-means to create a lot of clusters and using a discrete feature representing which cluster each word falls in
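The binning idea above can be sketched in a few lines of pure Python; the feature-name format ("emb0_bin=…") is just illustrative, not any package's convention:

```python
# Discretize one embedding dimension by equal-width binning: compute bin
# edges from the observed values, then replace each continuous value by a
# discrete feature string naming its bin.

def make_bins(values, n_bins=10):
    """Equal-width bin edges over the observed range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant dimension
    return [lo + i * width for i in range(1, n_bins)]

def bin_feature(value, edges, dim=0):
    """Map a continuous value to a discrete CRF feature string."""
    bin_idx = sum(1 for e in edges if value >= e)
    return "emb%d_bin=%d" % (dim, bin_idx)

# Values of embedding dimension 0 across a toy vocabulary:
values = [-0.12, 0.03, 0.25, -0.4, 0.11, 0.02]
edges = make_bins(values, n_bins=4)
print(bin_feature(-0.12, edges))   # prints "emb0_bin=1"
```

The same edges must of course be reused at test time, so they should be computed once on the training data and stored.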

Both of these, however, I've found experimentally to be worse than just including the numerical values. The only care that must be taken when including the numerical values is that, due to regularization, the CRF algorithm isn't scale invariant, so you get a second hyperparameter to tune, which controls the scale of the features. More details can be found in Turian et al., "Word representations: A simple and general method for semi-supervised learning", or on the MetaOptimize page for that project, which has links to code that uses crfsuite for some classification tasks with and without the embeddings.
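Concretely, the per-dimension numerical features with a scale hyperparameter might be built like this (pure Python; the dict-of-weighted-features shape matches what python-crfsuite accepts, but the embedding values and feature names here are made up):

```python
# Build per-token CRF feature dicts: discrete identity features get weight
# 1.0, and each embedding dimension becomes a real-valued feature multiplied
# by a scale hyperparameter. Because of L2 regularization the CRF objective
# is not scale invariant, so this scale has to be tuned on held-out data.

# Hypothetical embedding table; real HLBL or C&W embeddings would be loaded
# from the files distributed with Turian et al.'s project.
embeddings = {
    "foo": [-0.1224, 0.0531, 0.2210],
    "bar": [0.0412, -0.3301, 0.1105],
}

def word_features(word, scale=0.5):
    feats = {"w=" + word: 1.0}           # discrete identity feature
    vec = embeddings.get(word)
    if vec is not None:
        for i, v in enumerate(vec):      # one numeric feature per dimension
            feats["emb%d" % i] = scale * v
    return feats

sentence = ["foo", "bar"]
X = [word_features(w) for w in sentence]
```

Sweeping `scale` over a few orders of magnitude (e.g. 0.01 to 10) alongside the regularization strength is the second hyperparameter search the answer refers to.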

answered Sep 10 '10 at 06:22


Alexandre Passos ♦

I think one easy way to do this is to use a parameter matrix instead of a parameter vector. For example, given the set of input vectors {V_k}, each of size N, the parameter that maps V_k to the l-th label would be a matrix A_k of size N×L, so that the data-association potential is phi(l) = exp{sum_k (V_k)' A_kl}, where A_kl is the l-th column of matrix A_k.
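As a sketch of that potential with numpy (sizes K, N, L are made up here; the stacked array A holds one N×L matrix A_k per input vector):

```python
import numpy as np

# phi(l) = exp( sum_k V_k' A_k[:, l] ): one N x L parameter matrix A_k per
# input vector V_k, scoring the inputs against each of the L labels.

rng = np.random.default_rng(0)
K, N, L = 3, 4, 5                  # number of inputs, input size, labels
V = rng.normal(size=(K, N))        # input vectors V_k
A = rng.normal(size=(K, N, L))     # parameter matrices A_k

scores = np.einsum("kn,knl->l", V, A)   # sum_k (V_k)' A_k, one score per label
phi = np.exp(scores)                    # unnormalized potentials phi(l)
probs = phi / phi.sum()                 # normalizing gives label probabilities
```

In a real CRF these potentials would be combined with transition potentials and normalized over whole label sequences rather than per position.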

I guess it is normal for a CRF to support numerical features, especially after you do some transformation (such as normalisation). Converting to discrete values makes sense when you know roughly which bin corresponds to what.

If you insist on working with discrete features given continuous input, then I think quantization techniques may help (e.g. using k-means for clustering, as people often do when producing the codewords in computer vision).
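A minimal k-means quantizer along those lines, in pure Python (the "cluster=…" feature names are illustrative; a real setup would use a proper library and many more clusters):

```python
import random

# Quantize embedding vectors with k-means: each word's vector is replaced by
# the discrete feature "cluster=<id>" of its nearest centroid -- the codebook
# idea borrowed from computer vision.

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20, seed=0):
    """Lloyd's algorithm with random initial centroids."""
    random.seed(seed)
    centroids = random.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist2(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:   # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(xs) / len(xs) for xs in zip(*members)]
    return centroids

def cluster_feature(vec, centroids):
    """Discrete CRF feature naming the nearest centroid."""
    cid = min(range(len(centroids)), key=lambda c: dist2(vec, centroids[c]))
    return "cluster=%d" % cid

# Toy 2-d "embeddings" with two obvious groups:
vectors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
centroids = kmeans(vectors, k=2)
```

Words in the same cluster then share a feature, which is exactly the generalization across similar words that the embeddings were meant to provide.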

answered Sep 10 '10 at 06:24


Truyen Tran


powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.