I'd like to use multivariate Gaussian distributions over reasonably high-dimensional data, such as word co-occurrence vectors, within a mixture model like a Finite Gaussian Mixture Model or a Dirichlet Process Mixture Model. However, I'm running into some noticeable problems.

In the multivariate case the likelihood includes the determinant of the covariance matrix. For simplicity I'm assuming the covariance is diagonal, so working with the log-likelihood turns the determinant into `sum(log(variances))`; a sketch of this computation follows the list below. However, two issues occur with my data:

  1. Some dimensions do not vary, i.e. they are always some fixed constant, so `log(0)` is computed on occasion and the determinant becomes 0.
  2. In a really high-dimensional space, the determinant of the covariance matrix either becomes vanishingly small (due to issue 1) or really huge if there's a lot of variance, falling outside floating-point range.
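
To make issue 2 concrete, here is a minimal NumPy sketch of the computation I have in mind (the function name `diag_gaussian_loglik` is mine); it also shows how a zero variance produces `-inf`, i.e. issue 1:

```python
import numpy as np

def diag_gaussian_loglik(x, mean, var):
    """Log-density of x under a Gaussian with diagonal covariance.

    The determinant itself is never formed: for a diagonal covariance,
    log|Sigma| is just sum(log(var)), which stays in floating-point
    range long after the raw determinant would over- or underflow.
    Note that any var[i] == 0 makes np.log(var[i]) equal -inf (issue 1).
    """
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    diff = x - mean
    return -0.5 * (x.size * np.log(2.0 * np.pi)
                   + np.sum(np.log(var))        # log-determinant term
                   + np.sum(diff ** 2 / var))   # Mahalanobis term
```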

I can handle issue 1 initially by simply eliminating any features in the data that do not vary, but isn't this case also possible while learning the parameters in a mixture model? If so, how does one avoid a variance of 0 in the covariance matrix? For issue 2, computing the log-likelihood of a point given a mean and a covariance should work (assuming issue 1 is not taking place), but how do you then use the log-likelihoods in a Gibbs sampling method for the Dirichlet Process Mixture Model? That is, how do you compute the likelihood of x for several means and covariances, treat these likelihoods as the probabilities of a multinomial, and sample from it to pick a component?
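
For concreteness, here is a minimal sketch of how I imagine that sampling step would work, using the log-sum-exp trick to turn log-likelihoods into a normalized multinomial (the function name is mine, and `log_weights` is assumed to already combine each component's log-likelihood with its log mixing weight):

```python
import numpy as np

def sample_component(log_weights, rng=None):
    """Sample a component index from unnormalized log-probabilities.

    Subtracting the max before exponentiating (the log-sum-exp trick)
    keeps the exponentials in range even when the log-likelihoods are
    hugely negative, as they are in high dimensions.
    """
    rng = rng or np.random.default_rng()
    log_weights = np.asarray(log_weights, dtype=float)
    shifted = log_weights - np.max(log_weights)  # best component maps to 0
    probs = np.exp(shifted)
    probs /= probs.sum()                         # now a proper multinomial
    return rng.choice(len(probs), p=probs)
```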

A side note: the scikit-learn implementation of the Gaussian Mixture Model encounters this problem, but its Variational Bayes implementation of the Dirichlet Process Mixture Model avoids it. However, I'd like to know how to fix this issue when using Gibbs sampling.

Thanks in advance!

asked May 16 '12 at 01:15

Keith Stevens


One Answer:

As long as you have a prior on your covariance matrices, the prior will effectively set a minimum value for each variance parameter (intuitively, the variance in the posterior has to be at least as big as the variance implied by the prior), so problem 1 is a non-issue with Gibbs sampling (unless you have a flat prior). For problem 2, however, the determinant of the covariance can indeed be arbitrarily big, and there's nothing that can be done about that. One thing you can do is use a model richer than a diagonal covariance (which amounts to independent inverse-gamma priors on each dimension's individual variance) but not as rich as a full covariance, like a factor analyser, which essentially states that the covariance is the sum of a diagonal matrix and a low-rank matrix; this can hopefully reduce the determinant of the covariance by including off-diagonal terms for subfactors which vary jointly. There are Gibbs sampling algorithms for factor analysers, and they are not very hard to derive.
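
For example, here is a minimal sketch of the per-dimension Gibbs variance update under an inverse-gamma prior (the function name and hyperparameters `a0`, `b0` are illustrative); the `b0` term is what keeps the posterior scale, and hence every sampled variance, strictly positive:

```python
import numpy as np

def sample_variance(x, mu, a0=1.0, b0=1.0, rng=None):
    """Gibbs update for one dimension's variance, with mean mu known.

    Prior:     sigma^2 ~ Inverse-Gamma(a0, b0)
    Posterior: sigma^2 ~ Inverse-Gamma(a0 + n/2, b0 + 0.5 * sum((x - mu)^2))

    Because b0 > 0, the posterior scale is strictly positive even when
    every observation equals mu exactly, so a sampled variance can never
    be zero and log(variance) is always finite.
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    a_post = a0 + 0.5 * x.size
    b_post = b0 + 0.5 * np.sum((x - mu) ** 2)
    # An Inverse-Gamma(a, b) draw is 1 / Gamma(a, scale=1/b).
    return 1.0 / rng.gamma(shape=a_post, scale=1.0 / b_post)
```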

answered May 19 '12 at 00:48

Alexandre Passos ♦
