|
Suppose we are given a graphical model and are told that inference will be done using k-steps (parallel updates) of Belief Propagation. How could you find a joint "distribution" induced by this inference method? More specifically, how would you find some representative joint distribution where node and edge marginals correspond to pseudo-marginals that come out of BP iterated for k-steps, starting with uniform messages? Considering the "Computation Tree" view of belief propagation, for k=1, induced model is just the original graphical model with all edges removed. What would the joint be for k=2? Motivation: if you could find KL-divergence between data and the distribution induced by k-step belief propagation, then this would suggest a modification of ML training tailored for BP Edit: now to think of it, it doesn't really make sense to optimize joint likelihood if you are only interested in marginals from BP during testing. One could just try to fit those pseudo-marginals directly, something Justin Domke proposes in his thesis, Ch5. I wonder if anyone else has tried this approach
showing 5 of 6
show all
|
|
OK, looks like I just need to multiply all edge marginals and divide by node marginals to get the joint. Koller calls this the "alternative representation of the joint measure" and proves correctness in Chapter 10 of "Probabilistic Graphical Models" book. Edit: Actually, that gives the right joint only for trees, and it's not obvious that this approximation would be good for general graphs However, I'm not sure any such approximation should make sense for general graphs. The inherent loopiness of it makes it hard to understand what is going on.
(Jul 24 '10 at 19:01)
Alexandre Passos ♦
|
|
I worked out an expression of joint in terms of marginals for the simplest non-tree model, and while the expression is technically "closed-form", it's quite complicated, so maybe there's no "nice" way to characterize the joint distribution induced by some marginal-finding method In particular, if X1,X2,X3 are in {-1,1}, P(X1=1)=P(X2=1)=P(X3=1)=1/2, and P(X1=X2)=m1,P(X2=X3)=m2,P(X3=X1)=m3, assuming no three-way interactions marginals determine the joint, but the simplest formula I could find for the probability of configuration x1,x2,x3 in terms of m1,m2,m3 is here ...ugh! |
I'm having a real hard time understanding your question. I am probably missing something ... what is a "node" marginal? How is that different from an "edge" marginal? Isn't the joint distribution for any graphical model found trivially by it's very definition? What do you mean by "KL Divergence" between data and distribution induced by k-step belief propagation, how is "data" a probability distribution?
@dogy: Yes, the distribution from the graphical model is trivially found by the definition, but that's not what he's talking about. His question is:
Suppose you have a graphical model and some data. Now you run belief propagation for K steps. Which graphical model (possibly different from this one), has the exact same marginal distribution as what you get after K steps of belief propagation in this model? For K=1 you get a fully disconnected Markov random field, for example.
By node marginal Yaroslav probably means the marginal probability for any given node, and by edge marginal I guess he means that edge's potential.
He also probably meant "data" in a loose sense, meaning by "kl divergence between the data and the truncated model" the KL divergence between the true posterior and the approximate at k iterations.
This seems like a hard problem, and I have no idea for how could one even characterize what is going on for k > 1. I don't think a simple truncation of the factor graph should work (otherwise belief propagation for trees would need only a constant number of iterations). Hence I'll wait for someone who knows graphical models better than me to explain it.
P(X) is node marginal, it is proportional to product of incoming messages when BP is used, P(X_i,X_j) is pairwise (edge) marginal, it is proportional to m_ij*m_ji where m_ij is ij'th message. You can view data as the empirical distribution, then ML fitting becomes the task of finding distribution with smallest KL-divergence to empirical distribution. When k=1, variables in the "induced" joint distribution are independent, so you can see that joint distribution induced by k-steps of BP not the same as the joint in the original graphical model
Well, it probably isn't even a graphical model. For instance, for k=2, BP marginal of node Yi is the same of the marginal of depth-2 tree with Yi being the root, and all neighbors of Yi as leaves. Under BP k=2 approximation, each Yi is independent of all nodes that aren't neighbors, hence the graphical structure that satisfies it consists of n disjoint depth-2 trees. However this representation has more nodes than there are variables. The "joint" that satisfies BP constraints might not even correspond to a valid probability distribution, but perhaps you could still compute KL-divergence for it
Shouldn't it be the marginal of the node corresponding to the Yi node on the truncated-depth tree? This, however, fails to explain what happens to the edges. I also think (but I'm not sure) that the joint will only fail to be a distribution in some pathological cases with cyclical graphs (since then the tree metaphor breaks, and there are convergence problems, etc).
Since pairwise distribution in BP is m_ij*m_ji, edge distribution induced by BP is somewhere between X_i,X_j joint in the computation tree rooted at X_i and X_i,X_j joint in the computation tree rooted at X_j