Here's a silly idea, what if we do away with propagation based training and initialize parameters of our random field directly from counts?

For random field in canonical parameterization, we could set parameter for each feature to log(N(featureON)/Ntotal), then make predictions by running regular belief propagation.

Likelihood of model trained this way will be suboptimal, but I'm interested how badly this would affect classification accuracy, has something like this been tried?

More generally, I'm interested in graphical model training when you can't afford to do anything more expensive than computing feature counts

Update 02/19: I did a toy experiment comparing parameters from ML-estimator and "simple counting"-estimator on a 3-loop and 4-loop Ising model. Count-based estimate seem to roughly follow ML estimate except when there's strong frustration. For instance, in 3-loop with j much below 0, there's non-trivial cancellation, so true log-linear parameters are quite different from what is apparent if you consider edge marginal in isolation.

asked Feb 17 '11 at 21:21

Yaroslav%20Bulatov's gravatar image

Yaroslav Bulatov
1963193458

edited Feb 19 '11 at 23:48

Interesting! What are your features in these models? Are you considering x1, x2, ... x4 as binary features? What happens if you add a lot of potentially overlapping features?

(Feb 21 '11 at 18:02) Oscar Täckström

xi's are -1,1 valued variables, so I'm plugging them directly into the formula for density. The way to get "count based" estimator is take ML estimator with the assumption that features are independent. In models above, x1*x2 is a feature. Having lots of overlapping features would violate this assumption, so lots of overlapping features could make the estimates far off. Its not clear that classification accuracy would be so badly affected -- Naive Bayes gets good classification accuracy despite getting bad probability estimates (probabilities from NB classifier tend to cluster around 0 and around 1)

(Feb 21 '11 at 18:27) Yaroslav Bulatov

Ok, I see what you are doing. What I meant was the latter part of your answer, i.e. how using complex and overlapping feature functions f(x1,x2,...) would affect the estimate. It makes sense now.

(Feb 21 '11 at 21:12) Oscar Täckström

I was thinking, can you correct this bias by using a more complex model, with more features, which you only count instead of optimize?

(Feb 22 '11 at 10:57) Alexandre Passos ♦

Yes. Exponential families are "universal approximators", so with enough features you can model any discrete distribution exactly. Having more features increases overfitting, however, if you do generative training for conditional prediction, you are not minimizing risk directly, so you are not going to be overfitting as much.

(Feb 22 '11 at 15:34) Yaroslav Bulatov

2 Answers:

In chain models it is common to train local classifiers and then do directed greedy/beam inference for the optimal sequence, but I'm not aware of any application of belief propagation to this setting. With enough samples, simple feature counts might get you somewhere, but do you think that BP would converge in general with such a simple estimator?

Perhaps you could afford to do piecewise training? It seems to work pretty well for CRFs, but that requires you to solve a more complex optimization problem.

Edit: I think what you are proposing is a cyclic dependency network with multinomials for each local distribution. See the paper "Dependency networks for inference, collaborative filtering and data visualization" by Heckerman et al. (2001). Basically you can use any ML estimate of your local models, but joint inference is difficult. I think sampling is the common approach to doing inference in these models.

answered Feb 18 '11 at 03:59

Oscar%20T%C3%A4ckstr%C3%B6m's gravatar image

Oscar Täckström
1459102743

edited Feb 18 '11 at 07:20

I think for such simple estimator it would make more sense to run BP for the number of rounds equal to the diameter of the graph rather than "till convergence". That's less satisfying from theoretical stand-point, but it may be all you can do if you your task is something like classifying all the pages of the internet

(Feb 18 '11 at 04:06) Yaroslav Bulatov

This is a very interesting question. This is essentially doing generative training for directed, rather than undirected, models, as training those (for discrete features) is the same thging as counting features, which is not true for undirected models (which have a far hairier maximum likelihood solution). So I wonder how this would behave in cases like grid MRFs or skip-chain CRFs. This seems similar to Hebb's rule.

answered Feb 18 '11 at 05:24

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1896744214334

Yes, by the factorization of directed models, local normalization means that you get a proper global distribution.

Note that you could use more complex local distributions in a directed model as well. See for example the paper "Painless Unsupervised Learning with Features" by Berg-Kirkpatrick et al. in which they use a logistic model in place of multinomials at each factor.

In a cyclic dependency network, you can use whatever distribution for your local models. For a grid model the local distributions are p(x | Nbrhood(x)), which you could model with a multinomial so that the ML estimate is relative counts or a logistic regression model, in which the ML estimate requires numerical methods. It should work with both observed and latent variabels.

Inference will be the real problem. Sampling seems to be a standard solution, but I don't know if there are any results on the relationship between model, structure and inference.

(Feb 18 '11 at 07:14) Oscar Täckström

I think the things that make ML-estimator "hairy" are 1) cycles and 2) conditional objective. ML estimator of undirected tree model comes down to counting -- in Ising parameterization like above, the ML estimate of canonical parameter is 1/2 Log(poscount/negcount). Peter Grunwald has a paper on training Bayesian Networks to maximize conditional log loss, and you need to do gradient descent there like for CRF training. I imaging if you have a directed model with loops (in other words probability distribution is specified by some conditional distributions forming a cycle and some marginal distributions), then ML estimation will be hard as well

(Feb 20 '11 at 16:56) Yaroslav Bulatov
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.