Suppose a bit-valued vector variable X={X1,...,Xm}', with X_i in {0,1}. If we impose the constraint E(XX')=C, we obtain a maximum entropy distribution of the form f(x)=Z(A)exp(x'Ax), in which Z(A) is a normalization factor. My questions are: how can Z(A) be computed, and how can A be computed efficiently from a given C?
I have spent several days on this problem but I was not able to solve it. My math is too poor.
I believe that you have just described the Ising model with pairwise interactions. There is a vast literature on this topic, since it was proposed in the early 20th century, and there are many good reviews. The Wikipedia page is a wonderful place to start. For the physics perspective it is hard to do better than Leo Kadanoff's book "Statistical Physics: Statics, Dynamics, and Renormalization", although this book is perhaps not that easy to find. There is also a nice chapter in MacKay's textbook. Most of the literature on this topic, while very interesting, deals with topics like phase transitions and critical phenomena; these are not yet known to have any major relevance to ML or AI.

What is the relationship between A and C? Because you are talking about a maxent (exponential-family) distribution, C = E(XX') is given by the derivatives of log Z(A) with respect to the entries of A: C_ij = d(log Z(A))/dA_ij, treating each A_ij as a free parameter.

For certain special cases the normalizing constant Z(A) has an exact solution; it depends on the conditional independence properties of the variables. It is best to think of the matrix A as specifying the connectivity of a graph, and depending on the topology of this graph there may exist analytic solutions. For example, it is well known that if A couples the variables together as a tree, then the normalizing sum can be computed with the belief propagation algorithm. For arbitrary A there is no better exact algorithm than summing over all 2^m states (a small brute-force sketch of this is below), but there are many approximation techniques. One popular example is mean field theory, which assumes each variable is coupled to the average of its neighbors; this approximation becomes exact in the limit where the number of nearest neighbors becomes large. Another very famous example where Z(A) can be computed exactly is the 2d grid with homogeneous nearest-neighbor interactions. The analytic solution was given by Onsager in the 1940s; a more elegant technique for solving it is presented in a chapter of Feynman's statistical mechanics book (and is due to C. N. Yang, I believe). Another great source for analytic methods is Baxter's book "Exactly Solved Models in Statistical Mechanics".
This answer is marked "community wiki".
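To make the A-to-C relationship concrete, here is a minimal brute-force sketch in Python/NumPy (the example coupling matrix is arbitrary and purely illustrative, not from the answer above). It enumerates all 2^m states, which is only feasible for small m, and checks the derivative identity numerically:

    import itertools
    import numpy as np

    def brute_force_moments(A):
        """Enumerate all 2^m binary vectors to get Z(A) and C = E[X X'] under
        f(x) = exp(x'Ax) / Z(A).  Only feasible for small m."""
        m = A.shape[0]
        states = np.array(list(itertools.product([0, 1], repeat=m)), dtype=float)
        energies = np.einsum('ki,ij,kj->k', states, A, states)   # x'Ax for every state
        weights = np.exp(energies)
        Z = weights.sum()
        probs = weights / Z
        C = np.einsum('k,ki,kj->ij', probs, states, states)       # E[x_i x_j]
        return Z, C

    # toy example with an arbitrary symmetric coupling matrix (illustrative only)
    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.3, size=(4, 4))
    A = (A + A.T) / 2
    Z, C = brute_force_moments(A)

    # numerical check of C_ij = d(log Z)/dA_ij: perturb the symmetric pair (0,1),(1,0)
    eps = 1e-6
    A2 = A.copy()
    A2[0, 1] += eps
    A2[1, 0] += eps
    Z2, _ = brute_force_moments(A2)
    print((np.log(Z2) - np.log(Z)) / eps, C[0, 1] + C[1, 0])   # these should match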
no analytic solution for arbitrary A? Um... That's really bad. Is it possible to efficiently compute A given C?
(Sep 26 '10 at 03:05)
x10000year
What is your specific application? As mentioned below, you choose the elements of A so that the pairwise correlations under the model match your empirical correlations: E_A(XX') = C. Computing the model correlations is the computationally expensive part, but they can be estimated with MCMC or other approximate techniques (a Gibbs-sampling sketch is below). See this paper and the references therein: http://arxiv.org/abs/0712.2437
(Sep 26 '10 at 04:03)
jbowlan
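For the MCMC route, a single-site Gibbs sampler is the simplest option: under f(x) proportional to exp(x'Ax) with binary x_i, the conditional of each bit given the rest is a logistic function of the coupled bits. A rough sketch, assuming a symmetric A (the defaults and parameter values are illustrative, not from the comment above):

    import numpy as np

    def gibbs_sample(A, n_sweeps=5000, burn_in=1000, rng=None):
        """Gibbs sampling from f(x) = exp(x'Ax) / Z(A), x in {0,1}^m.
        Assumes A is symmetric.  Returns an estimate of E[X X']."""
        rng = rng or np.random.default_rng(0)
        m = A.shape[0]
        x = rng.integers(0, 2, size=m).astype(float)
        second_moment = np.zeros((m, m))
        kept = 0
        for sweep in range(n_sweeps):
            for i in range(m):
                # terms of x'Ax involving x_i (note x_i^2 = x_i for binary x_i):
                # A_ii * x_i + 2 * sum_{j != i} A_ij * x_i * x_j
                field = A[i, i] + 2.0 * (A[i] @ x - A[i, i] * x[i])
                p_one = 1.0 / (1.0 + np.exp(-field))
                x[i] = float(rng.random() < p_one)
            if sweep >= burn_in:
                second_moment += np.outer(x, x)
                kept += 1
        return second_moment / kept

The same loop can be used both to draw new samples from a learned model and to estimate the model correlations E_A(XX') needed when fitting A.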
@jbowlan, I am trying to learn a model from a bit-valued (and randomly perturbed) dataset and then draw new samples from the learned model. I hope the model will preserve the same univariate and bivariate statistics as the training dataset, and among all such models the maximum-entropy one is preferred. I found that the Boltzmann machine looks like the same thing.
(Sep 26 '10 at 04:23)
x10000year
By convention you usually divide by Z(A), in which case Z(A) is the quantity needed to make the density normalized; it is the sum of exp(x'Ax) over all values of x. Also, since x_i x_j = x_j x_i, you should restrict C to be symmetric. Then the values of A are the solution of a system of m^2 equations of the form Sum_x x_i x_j f(x) = C_{ij}, which is guaranteed to exist and be unique. If you add constraints on all other products of the x_i's, there is a closed-form expression for the A <-> C mapping; otherwise I only know of this implicit representation in terms of a system of equations. It can be solved numerically using Newton's method.

Edit 08/25: as jbowlan points out, this is an instance of an Ising model, which is well studied in physics (over 10,000 papers). For general A you can't do better than brute force, because computing Z is #P-hard even if A is restricted to having entries only 0 and 1; see p. 104 of [Welsh, 1993]. There are exact polynomial-time algorithms for restricted special cases, like when the x_i's are {-1,1}-valued and C factorizes according to a planar graph. Exact algorithms are fickle, though. For instance, you can check whether a setting of the x's maximizes exp(x'Ax) in polynomial time when A is the adjacency matrix of a planar graph and the x's are {-1,1}-valued, but if you make them {-1+h,1+h}-valued, the same problem becomes NP-hard, go figure. Guaranteed approximation is probably also #P-hard in the general case. However, in restricted cases, like when C is not very far from a low-rank matrix, there are effective and efficient approximate methods.

Edit 08/26: You can probably find a solution by gradient descent, where the gradient is the difference between the observed correlation matrix and the predicted correlation matrix. The predicted correlation matrix can be computed using belief propagation, which is fast and easy to implement. For instance, k steps of BP in Mathematica correspond to nesting the following function k times:

    Function[{m}, Table[ArcTanh[Tanh[U[[i, j]]] Tanh[Total[m[[All, i]]] - m[[j, i]]]], {i, 1, n}, {j, 1, n}]]

It will converge when interactions are not too strong. If there is a small number of strong interactions, you can merge each set of strongly interacting variables into a separate variable and run belief propagation on that problem. If there is a large number of strong interactions, all methods fail.

Edit 08/26 II: The fact that you are learning a model from data simplifies things. First of all, various natural laws put a limit on the strength of interactions in the model; the maximum-interaction model would be a completely deterministic one. Also, sampling noise and averaging put a limit on the strength of the interactions you can infer from data: even if the true model has strong interactions, the interactions supported by your data may be weak, and vice versa. The natural thing to do for noisy data is to relax the MaxEnt constraints to allow E(XX') to lie within some range of the observed E(XX'). This makes things better both from a statistical point of view and from a computational point of view, and various approximation methods can be expected to give decent results. For instance, look up Jason Johnson's "Maximum Entropy Relaxation" for a binary setting, where he also considers modeling higher-order interactions (i.e., constraints on expected values of features like x1 x2 x3). Miroslav Dudik also considers a similar problem in his thesis.

For a practical approach to your problem, I would do gradient descent with an approximate method to compute the expected covariance, and monitor the estimated error of your approximation method. When it gets too high, terminate and return the solution. For instance, you could do belief propagation and terminate when it starts having convergence problems. This can be viewed as a poor man's approximation of maximum entropy relaxation.
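Here is a hedged sketch of that moment-matching recipe in Python/NumPy. Exact enumeration stands in for the approximate covariance computation, so it only runs for small m; for realistic m you would swap in the BP or MCMC estimate discussed above (all names and parameter values below are illustrative):

    import itertools
    import numpy as np

    def model_second_moments(A):
        """Exact E[X X'] under f(x) = exp(x'Ax)/Z(A) by enumeration (small m only).
        For larger m, substitute a BP or MCMC estimate here."""
        m = A.shape[0]
        states = np.array(list(itertools.product([0, 1], repeat=m)), dtype=float)
        w = np.exp(np.einsum('ki,ij,kj->k', states, A, states))
        p = w / w.sum()
        return np.einsum('k,ki,kj->ij', p, states, states)

    def fit_A(C_obs, lr=0.2, n_iter=20000, tol=1e-8):
        """Moment matching by gradient ascent on the log-likelihood, whose gradient
        with respect to A is C_obs - E_A[X X'].  C_obs must be a realizable
        symmetric second-moment matrix of some binary distribution."""
        m = C_obs.shape[0]
        A = np.zeros((m, m))
        for _ in range(n_iter):
            grad = C_obs - model_second_moments(A)
            A += lr * grad
            if np.abs(grad).max() < tol:
                break
        return A

    # toy demo: recover couplings from the exact moments of a random model
    rng = np.random.default_rng(0)
    A_true = rng.normal(scale=0.3, size=(4, 4))
    A_true = (A_true + A_true.T) / 2
    C_obs = model_second_moments(A_true)
    A_hat = fit_A(C_obs)
    print(np.abs(A_hat - A_true).max())   # small, up to optimization tolerance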
If you want to go really in depth on this problem, you could use Martin Wainwright's approach, described in his PhD thesis, of considering a hierarchy of approximation techniques. Depending on the particular choice of C, a low-level approximation (loopy BP) could be sufficient, or you may need a higher-order one. Check section 6.6.1 for details on how to determine the order of approximation needed.

Edit 12/08: [Cox, 94] talks a bit about the difference between the Gaussian and the Boltzmann machine. Even though they are the same algebraically, one source of difficulty in the discrete case is that marginalization doesn't usually have a closed form.

@Yaroslav Bulatov: Actually I was asking about the computation of Z(A) rather than the meaning of Z(A) :P I know A can be solved for in such a brute-force way, but solving this system of equations isn't practical, because in each equation Sum_x x_i x_j f(x) = C_{ij} you are summing over 2^m terms. I'm hoping for a more efficient solution that can compute A in poly time, and I think such an efficient solution is possible because this distribution has a form similar to the multivariate normal distribution.
(Sep 25 '10 at 13:24)
x10000year
Adding to yaroslav's answer, the trivial way to compute Z(A) is by summing over all x: Z(A) = sum_x exp(x'Ax). Since the x's are 0-1 vectors there may be a way of making this simpler, maybe something close to an exponentiated permanent or something like that; this is a bit fuzzy for me now. Once you can compute Z(A) you can write the entropy explicitly and maximize it, no?
(Sep 25 '10 at 13:53)
Alexandre Passos ♦
@Yaroslav Bulatov, Unfortunately there may be a large number of strong interactions. I'm wondering whether I can decompose the model into a number of conditional probability models, each modeling the probability p(X_i|X_1,...,X_{i-1}). Each model chooses the distribution that maximizes the conditional entropy under certain constraints; specifically, it can be learned via logistic regression from the training data, like the maximum entropy model in NLP. But I don't know whether maximizing these conditional entropies also maximizes the joint entropy.
(Sep 26 '10 at 13:49)
x10000year
This is known as pseudo-likelihood maximization (whereas maximum entropy corresponds to likelihood maximization); a rough sketch is below. If your true model can be represented exactly by your parametric form, this gives the same result as maximum entropy in the limit of infinite data. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.6624
(Sep 26 '10 at 15:25)
Yaroslav Bulatov
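A rough sketch of that pseudo-likelihood idea in Python, with one logistic regression per variable (scikit-learn is used here only as a convenient fitting routine; nothing in this sketch comes from the comment above, and it assumes every bit takes both values in the training sample):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def pseudolikelihood_fit(X):
        """Estimate a symmetric A in the f(x) = exp(x'Ax)/Z(A) parameterization,
        x_i in {0,1}, by regressing each bit on all the others.
        X: (n_samples, m) array of 0/1 values."""
        n, m = X.shape
        W = np.zeros((m, m))   # W[i, j]: coefficient of x_j when predicting x_i
        b = np.zeros(m)        # intercepts, which play the role of A_ii
        for i in range(m):
            others = np.delete(np.arange(m), i)
            clf = LogisticRegression(C=1e6, max_iter=1000)   # large C ~ unregularized
            clf.fit(X[:, others], X[:, i])
            W[i, others] = clf.coef_[0]
            b[i] = clf.intercept_[0]
        # The conditional log-odds of x_i is A_ii + sum_{j != i} (A_ij + A_ji) x_j,
        # so each coefficient estimates 2*A_ij for symmetric A.  Average the two
        # independent estimates of each pair and undo the factor of 2.
        A = (W + W.T) / 4.0
        A[np.diag_indices(m)] = b
        return A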
A stupid question: if a vector of variables X={x1,...,xm} follows the Ising model, does the marginal distribution of any subset of variables also follow an Ising model? Or equivalently, if f(X) is the maximum entropy distribution subject to certain second-order statistics E(XX')=C, is the marginal distribution of any subset of variables also the maximum entropy distribution subject to the same constraints? This is true for the multivariate normal distribution, but I don't know whether it is true in the binary discrete case.
(Sep 27 '10 at 07:58)
x10000year
Ah, I see. I'll delete my other comments.
(Sep 27 '10 at 09:02)
Alexandre Passos ♦
If you marginalize away a variable with 2 neighbors or fewer, the result will be an Ising model. With more neighbors, it will not be an Ising model in general. I put details on a separate page http://yaroslavvb.com/upload/meta-answer.html since there's no math support here. I think in the future I'd be more inclined to answer math-heavy questions if you ask them on a site with math support (like stats.stackexchange.com) and post a link here.
(Sep 27 '10 at 18:44)
Yaroslav Bulatov
By the way, an interesting fact: if you marginalize away a variable at a leaf of the graph defining the model, the marginal model is obtained by dropping the term corresponding to that edge (assuming an Ising model with no magnetic field). Hence, if the graph of your model is a tree, the k'th marginal is determined solely by the k'th coefficient, and the problem of converting between A and C is quadratic in the number of variables (a small numerical check for a chain is below). When you take a simple chain Ising model and connect the first and last vertices, it's no longer a tree, and now each entry of A depends on every entry of C, so you might expect it to be harder to compute. However, it turns out a dynamic programming approach can still do the inversion with quadratic complexity.
(Sep 29 '10 at 15:19)
Yaroslav Bulatov
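To make the tree/chain case concrete: for a zero-field open chain with {-1,+1} variables, each nearest-neighbour correlation is simply tanh of the corresponding coupling, so converting between A and C along the chain is immediate. A small brute-force check in Python/NumPy (the couplings are arbitrary illustrative values):

    import itertools
    import numpy as np

    # open-chain Ising model, spins s_k in {-1,+1}, zero field,
    # with Boltzmann weight exp(sum_k J[k] * s_k * s_{k+1})
    m = 6
    rng = np.random.default_rng(1)
    J = rng.normal(scale=0.7, size=m - 1)   # arbitrary couplings (illustrative)

    states = np.array(list(itertools.product([-1, 1], repeat=m)), dtype=float)
    energy = sum(J[k] * states[:, k] * states[:, k + 1] for k in range(m - 1))
    p = np.exp(energy)
    p /= p.sum()

    for k in range(m - 1):
        corr = np.sum(p * states[:, k] * states[:, k + 1])
        print(k, corr, np.tanh(J[k]))        # the two columns agree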
@Yaroslav Bulatov If you are really assuming an Ising model with no magnetic field, then each variable has equal probability of being assigned -1/+1, due to the symmetry of the model. Thus your conclusion is not quite meaningful :P
(Oct 01 '10 at 02:07)
x10000year
Each individual variable has equal marginal probability of either assignment, but computing the probability of a particular configuration of assignments can still be hard. A tree is a special case where it's easy.
(Oct 01 '10 at 13:40)
Yaroslav Bulatov