I'd like to process Multivariate Gaussian Distributions with reasonably high dimensional data, such as with word co-occurrence vectors, within a mixture model, like a Finite Gaussian Mixture Model or a Dirichlet Process Mixture Model. However i'm running into some noticeable problems. In the multivariate case the likelihood includes the determinate of the covariance matrix. The alternative log-likelihood approach turns this into `sum(log(covariance)). For simplicity, I'm assuming the covariance is diagonal. However, two issues occur with my data:
I can handle issue 1 initially by simply eliminating any features in the data that do not vary, but can't the same situation arise while learning the parameters in a mixture model? In that case, how does one avoid a variance of 0 in the covariance matrix? For issue 2, computing the log-likelihood of a point given a mean and a covariance should work (assuming issue 1 is not occurring), but then how do you use the log-likelihoods in a Gibbs sampling method for the Dirichlet Process Mixture Model, i.e. compute the likelihood of x under each component's mean and covariance and then treat these likelihoods as the probabilities of a multinomial and sample a component assignment (sketched below)? A side note: the scikit-learn implementation of the Gaussian Mixture Model runs into this problem, while its Variational Bayes version of the Dirichlet Process Mixture Model avoids it. However, I'd like to know how to fix this issue when using Gibbs sampling. Thanks in advance!
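To make the assignment step I mean concrete, here is roughly what I have in mind (purely illustrative numbers; in the real sampler the log mixing weights / CRP terms would also be added to these log-likelihoods):

```python
import numpy as np

rng = np.random.default_rng(0)

# One log-likelihood of the point x per candidate component (illustrative
# values). With moderate magnitudes this works, but with high-dimensional
# co-occurrence vectors the log-likelihoods are in the (negative) thousands,
# np.exp underflows every entry to 0.0, and the normalisation breaks down.
log_liks = np.array([-10.2, -11.7, -9.8])
liks = np.exp(log_liks)
probs = liks / liks.sum()
component = rng.choice(len(probs), p=probs)
```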
As long as you have a prior on your covariance matrices, the prior will effectively set a minimum value for each variance parameter (intuitively, the variance in the posterior has to be at least as big as what the prior allows), so problem 1 is a non-issue with Gibbs sampling (unless you have a flat prior). For problem 2, however, the determinant of the covariance can indeed be arbitrarily big, and there's nothing that can be done about that. One thing you can do is use a richer model than a diagonal covariance (which amounts to independent inverse-gamma priors on each dimension's individual variance) but not as rich as a full covariance, such as a factor analyser: it essentially states that the covariance is the sum of a diagonal matrix and a low-rank matrix, which can hopefully reduce the determinant of the covariance by using off-diagonal terms to capture groups of dimensions that vary jointly. There are Gibbs sampling algorithms for factor analysers, and they are not very hard to derive.
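To make the first point concrete, here is a rough sketch (my own illustrative hyperparameters, assuming the component mean is held fixed at this step of the sweep) of the conjugate inverse-gamma Gibbs update for a single diagonal variance; the prior scale `b0` is what keeps the sampled variance strictly positive even if every point assigned to the component has the same value in that dimension:

```python
import numpy as np

def sample_variance(x_d, mean_d, a0=1.0, b0=1.0, rng=None):
    """Gibbs draw for one dimension's variance under an InverseGamma(a0, b0) prior.

    Posterior: InverseGamma(a0 + n/2, b0 + 0.5 * sum((x_d - mean_d)^2)).
    Even when the sum of squared deviations is 0, the posterior scale is
    still b0 > 0, so the draw cannot collapse to a variance of exactly 0.
    """
    rng = rng or np.random.default_rng()
    n = x_d.shape[0]
    a_n = a0 + 0.5 * n
    b_n = b0 + 0.5 * np.sum((x_d - mean_d) ** 2)
    # Draw the precision from Gamma(shape=a_n, rate=b_n) and invert it;
    # numpy's gamma is parameterised by shape and scale (= 1 / rate).
    return 1.0 / rng.gamma(shape=a_n, scale=1.0 / b_n)
```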