Although CRF feature functions may be real-valued, in practice most applications I am aware of use binary or integer count features. Anecdotal evidence suggests that is better to "bin" real-valued features and then create a binary feature for each bin. How do you handle real-valued features in CRFs? What methods are available for automatically setting the size of each bin (i.e. other than setting them with domain knowledge or using uniform-width bins over the range of values)? Are there alternatives to binning (i.e. normalizing real-valued features) that work? Is there any published work that discusses these issues? |
|
Joseph Turian et al.'s work on word embeddings uses a lot of real-valued features in CRFs, and they get improved performance from them. What they do to make them more amenable to this sort of model is scale the real-valued features by some constant, choosing that constant by performance on a validation set.
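For concreteness, here is a minimal sketch of that tuning loop on synthetic data; an L2-regularized logistic regression stands in for the CRF, since it is the regularizer that makes the feature scale matter at all (the scale grid here is hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in data: one real-valued feature on an arbitrary scale.
    rng = np.random.default_rng(0)
    X = rng.normal(scale=50.0, size=(1000, 1))
    y = (X[:, 0] + rng.normal(scale=20.0, size=1000) > 0).astype(int)
    X_tr, X_va, y_tr, y_va = X[:700], X[700:], y[:700], y[700:]

    # Try a grid of scale constants and keep the one that does best
    # on the held-out validation set.
    best_scale, best_acc = None, -1.0
    for scale in (0.001, 0.01, 0.1, 1.0, 10.0):
        clf = LogisticRegression(C=1.0).fit(X_tr * scale, y_tr)
        acc = clf.score(X_va * scale, y_va)
        if acc > best_acc:
            best_scale, best_acc = scale, acc
    print(f"chosen scale: {best_scale} (validation accuracy {best_acc:.3f})")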
Yes, that's essentially what I do. I can't say that this technique is superior to binning, because I have not done any controlled comparison. I would be interested to hear about any work on this topic.
(Aug 18 '10 at 15:55)
Joseph Turian
Thanks for the suggestion. This method addresses scaling issues, but I am also interested in binning because I do not necessarily want a linear relationship between the feature value and the log score.
(Aug 24 '10 at 13:59)
Gregory Druck
|
I've gotten decent results on a vision problem by using feature functions of the form w1*f(x) + w0, where I pre-process the features to make them monotonic. Consider that in a binary CRF, with no local evidence save for one feature at node i,

    log [P(y_i = 1 | x) / P(y_i = 0 | x)] = psi_i(x_i).

Then, if you pre-process your feature so that a higher feature score implies a higher log-likelihood, and this dependence is expected to be roughly linear, it's sufficient to learn a local potential function of the form

    psi_i(x) = w1*f(x) + w0.

A less knowledge-intensive approach is to use a non-parametric method to model psi_i(x). For instance, fit a regression tree to the conditional log-odds of the i'th node; that will be your local feature function. Since the local evidence now interacts, you won't quite hit the mark, so recalculate how far off you are and add another regression tree to model the difference. Repeat until convergence. See this paper for details of the latter approach. Even though we used discrete features, regression trees work naturally with real-valued ones. (By the way, is there a way to do LaTeX on this site?)
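A minimal sketch of that boosting loop, assuming scikit-learn's DecisionTreeRegressor and, for illustration, treating the nodes as independent binary labels (in the CRF itself the residuals would come from inference over the graph, as in the paper):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_boosted_potential(X, y, n_rounds=50, lr=0.1):
        """Boost regression trees toward the log-odds log p/(1-p).
        X: (n, d) feature matrix; y: (n,) labels in {0, 1}."""
        F = np.zeros(len(y))                    # current log-odds estimate
        trees = []
        for _ in range(n_rounds):
            p = 1.0 / (1.0 + np.exp(-F))        # current P(y_i = 1 | x)
            residual = y - p                    # how far off we are
            tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
            F += lr * tree.predict(X)           # add the correction
            trees.append(tree)
        return trees

    def psi(trees, X, lr=0.1):
        """The learned local potential: additive-tree estimate of the log-odds."""
        return lr * sum(tree.predict(X) for tree in trees)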
(Aug 12 '10 at 15:25)
Yaroslav Bulatov
I'd prefer a solution that does not require prior knowledge about the relationship between the feature value and the labels. Thanks for the pointer, I'll take a look.
(Aug 24 '10 at 14:04)
Gregory Druck
|
My intuition is that real-valued features contain more information than binary ones (or one-hot codes for discretized continuous values), and that binning is useful mostly because it helps get around the linear limitation of CRFs (you can then separate linearly among more than two consecutive intervals of values). The least you should do, if you are going to stick with a linear and shallow model such as a CRF, is to bin but not discretize, i.e., use continuous values in the (non-zero) bins in such a way that the representation is completely invertible.
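A minimal sketch of the kind of representation I mean, with uniform bin edges purely for illustration: each value activates exactly one bin, but the bin carries the continuous value rather than a 0/1 indicator, so the original feature is recoverable by summing across bins.

    import numpy as np

    def bin_without_discretizing(x, edges):
        """Map (n,) real values to (n, k) features: one active bin per
        value, holding the continuous value instead of a 0/1 indicator."""
        k = len(edges) - 1
        idx = np.clip(np.digitize(x, edges) - 1, 0, k - 1)  # which bin fires
        out = np.zeros((len(x), k))
        out[np.arange(len(x)), idx] = x      # keep the value: invertible
        return out

    x = np.array([0.2, 1.7, 3.9])
    print(bin_without_discretizing(x, edges=np.linspace(0.0, 4.0, 5)))
    # [[0.2 0.  0.  0. ]
    #  [0.  1.7 0.  0. ]
    #  [0.  0.  0.  3.9]]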
why do you call CRFs shallow models?
(Aug 22 '10 at 20:25)
Aman
@Aman: the term is used in opposition to deep models. A CRF, for classification, can roughly be seen as choosing the Y that maximizes a linear function of X and Y, where X is the input data. Deep models, on the other hand, maximize nonlinear functions of X and Y, usually built by function composition (which creates the idea of depth). There is evidence that shallow models have significant limitations; you can see an exposition in this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.4580&rep=rep1&type=pdf
(Aug 22 '10 at 20:29)
Alexandre Passos
I agree that the linear limitation of CRFs is the problem. I like the suggestion of using continuous values within each bin. Ideally, bins could be defined such that the true relationship between the feature value and the log score is (approximately) linear within each bin.
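As a sketch, one automatic way to place such boundaries would be to let a small regression tree pick the split points, so that bins end where the feature-target relationship bends (synthetic 1-d data; a piecewise-constant fit is a crude stand-in for "roughly linear within each bin", and I have not tested this in a CRF):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 10.0, 2000)
    y = np.sin(x) + rng.normal(scale=0.1, size=x.size)  # nonlinear target

    # A depth-limited tree puts its thresholds where y changes fastest;
    # those thresholds serve as data-driven bin boundaries.
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(x.reshape(-1, 1), y)
    thresholds = tree.tree_.threshold
    edges = np.sort(thresholds[thresholds != -2.0])     # -2 marks leaf nodes
    print("learned bin boundaries:", edges)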
(Aug 24 '10 at 14:34)
Gregory Druck
|