This is a problem I've had many times when using Dirichlet-multinomial models, and I'm still not certain I know how to fix it, since most of my solutions have been full of ad-hoccery. In mixtures of Dirichlet-multinomials (somewhere between naive Bayes and LDA; I've observed similar behaviour in samplers for both of those models), where I define a class as a probability distribution over words and each document (or each word) belongs to some class, I usually find that my first implementations converge, very quickly, to a local solution where most of the documents/words are assigned to a single class/topic.
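For concreteness, here is a minimal sketch (not my actual code) of the kind of collapsed Gibbs sampler I mean, for a plain Dirichlet-multinomial mixture over documents; the function name and the values of alpha and gamma are only illustrative:

    import numpy as np

    def collapsed_gibbs_dmm(docs, K, V, alpha=0.1, gamma=1.0, n_iters=200, seed=0):
        """Collapsed Gibbs sampler for a Dirichlet-multinomial mixture.

        docs: list of documents, each a list of word ids in [0, V).
        K: number of classes; alpha: symmetric Dirichlet prior on the
        per-class word distributions; gamma: symmetric Dirichlet prior on
        the class proportions.
        """
        rng = np.random.default_rng(seed)
        D = len(docs)
        z = rng.integers(K, size=D)      # class assignment per document
        m_k = np.zeros(K)                # number of documents in each class
        n_kw = np.zeros((K, V))          # word counts per class
        n_k = np.zeros(K)                # total tokens per class
        for d, doc in enumerate(docs):
            m_k[z[d]] += 1
            for w in doc:
                n_kw[z[d], w] += 1
                n_k[z[d]] += 1

        for _ in range(n_iters):
            for d, doc in enumerate(docs):
                # remove document d from the counts
                k_old = z[d]
                m_k[k_old] -= 1
                for w in doc:
                    n_kw[k_old, w] -= 1
                    n_k[k_old] -= 1
                # log p(z_d = k | rest), adding the document's tokens one at a time
                log_p = np.log(m_k + gamma)
                for k in range(K):
                    added, seen = 0, {}
                    for w in doc:
                        c = seen.get(w, 0)
                        log_p[k] += np.log(n_kw[k, w] + alpha + c)
                        log_p[k] -= np.log(n_k[k] + V * alpha + added)
                        seen[w] = c + 1
                        added += 1
                p = np.exp(log_p - log_p.max())
                p /= p.sum()
                z[d] = rng.choice(K, p=p)
                # add document d back under its new class
                m_k[z[d]] += 1
                for w in doc:
                    n_kw[z[d], w] += 1
                    n_k[z[d]] += 1
        return z

The pathology I'm describing is that, after only a few sweeps, m_k ends up with nearly all of its mass on a single class.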

Some things worsen this phenomenon: the higher the (initial) value of the hyperparameter alpha for the symmetric Dirichlet, the worse it gets. Also, if I sample the class prior probabilities (say, put a beta or Dirichlet prior on them as well), the problem gets a lot worse. If I seed the model with some arrangement that I expect to make sense, it sometimes avoids falling into this regime. Usually I get rid of the problem by fixing the prior probabilities for class membership, biasing them away from the larger classes, or even by "smoothing" the probabilities right before resampling the class/topic of a document/word. Also, the more "discrete" the model is (in the sense that, for example, naive Bayes is more discrete than LDA, since class membership is a rigid yes/no instead of a smooth distribution over classes), the worse I find this problem to be. In a way, the only way I've found to reliably fix it is by not sampling things I should sample, like the class prior probabilities and the Dirichlet hyperparameters. Switching from a collapsed Gibbs sampler (what I implement by default for this sort of problem) to an uncollapsed one has sometimes helped and sometimes not.
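By "smoothing" I mean something like the sketch below (illustrative only; the temperature and floor values are arbitrary and this is not a principled fix):

    import numpy as np

    def smooth_assignment_probs(p, temperature=2.0, floor=0.05):
        """Flatten the class-resampling distribution before drawing z.

        p: (possibly unnormalized) probabilities over classes from the
        collapsed conditional. temperature > 1 flattens the distribution;
        floor mixes in a little uniform mass so small classes keep some
        chance of attracting documents.
        """
        q = p ** (1.0 / temperature)
        q /= q.sum()
        q = (1.0 - floor) * q + floor / len(q)
        return q / q.sum()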

Some more information: I'm currently working on a model that assigns each word in a document to one of three topics (a background topic common to all documents, a "specific" background topic that is shared and fixed across all documents of a given class, and a "class" topic that I'm trying to learn in semi-supervised and unsupervised variations of the same model). The "class" topic is sampled from one of a few possible classes, just like in naive Bayes, and the words (given the "class" topic) are sampled as if from a topic model.

The model as described almost immediately stops assigning any words to the "class" topics if I put a Dirichlet prior on the class prior probabilities. If I don't, it still sometimes stops assigning words to them unless I remove the "generic" background topic. Even if I manage to keep a decent number of words assigned to the "class" topic (which is necessary for sampling the classes; otherwise they're just sampled from the prior) but put a Dirichlet prior on the probabilities of each class, one class usually ends up dominating over 90% of the model (and yet I can get a higher likelihood if I force the model to avoid this). Even if I fix those probabilities to be uniform, the model still concentrates almost all documents on a single class if the Dirichlet hyperparameter for the word distributions is "high" (as in higher than 10^-5 or so). Resampling the hyperparameter in the range below 10^-5 helps, but if I initialize it above that the model very quickly falls into a bad local mode with most documents in a single class.
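Roughly, the per-token conditional I'm sampling from looks something like this (a heavily simplified sketch, not my actual code; the count arrays, the switch prior beta, and the per-level alphas are just one plausible way to write it down):

    import numpy as np

    def switch_probs(counts_w, counts_total, switch_counts, alphas, beta, V):
        """Three-way switch for one word token: corpus-wide background (0),
        class-specific background (1), and the document's "class" topic (2).

        counts_w[i] / counts_total[i]: word-w and total token counts for
        level i (with the current token removed), already looked up for the
        document's current class where relevant; switch_counts[i]: how often
        this document currently uses level i; alphas[i]: symmetric Dirichlet
        hyperparameter for level i's word distribution; beta: Dirichlet
        prior on the switch.
        """
        p = np.empty(3)
        for i in range(3):
            p[i] = (switch_counts[i] + beta) * \
                   (counts_w[i] + alphas[i]) / (counts_total[i] + V * alphas[i])
        return p / p.sum()

What I observe is that the probability of the "class" topic (p[2] above) goes to essentially zero for every token very early in the run.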

I've never observed this in Dirichlet process models, but then again I haven't implemented many of those. Any ideas on what this might be or how to fix it?

asked Jun 30 '10 at 21:09


Alexandre Passos ♦

edited Jun 30 '10 at 21:31


One Answer:

So there are many reasons why this can happen. The simplest, but most often correct, explanation is that you probably have a bug. I've had this happen to me multiple times and it's almost always been a bug. I'd have to hear more details, but it shouldn't be the case that (even a single mode of) the actual model posterior puts all the topic latent variables on a single topic. So it's either a bug or some kind of "ridge search error" with Gibbs sampling. The best way to test for this, and in general a good way to test statistical models, is to test on synthetic data drawn from the model. Almost all my latent variable work is tested using synthetic data.
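For a Dirichlet-multinomial mixture that test can be as simple as the sketch below (illustrative; the sizes and hyperparameters are arbitrary, and you have to compare the recovered assignments up to a permutation of the class labels):

    import numpy as np

    def generate_dmm_data(D=500, K=5, V=1000, doc_len=100, alpha=0.05, seed=0):
        """Draw documents from a Dirichlet-multinomial mixture with known
        class assignments, so the sampler's output can be checked."""
        rng = np.random.default_rng(seed)
        pi = rng.dirichlet(np.ones(K))                   # class proportions
        phi = rng.dirichlet(alpha * np.ones(V), size=K)  # per-class word dists
        z_true = rng.choice(K, size=D, p=pi)
        docs = [list(rng.choice(V, size=doc_len, p=phi[z])) for z in z_true]
        return docs, z_true

    # The inferred assignments should agree with z_true (up to relabeling)
    # far more often than chance, and no single class should absorb nearly
    # all of the documents.
    docs, z_true = generate_dmm_data()

If the sampler collapses to one class even on data it generated itself, that points to a bug rather than a modeling issue.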

My summarization work (http://www.cs.berkeley.edu/~aria42/pubs/naacl09-topical.pdf) used a model similar to this, and there I had to make sure the hyperparameters were ordered correctly, meaning the more specific distributions need to have a lower concentration parameter.
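Concretely, something like this (illustrative values only; the point is the ordering, not the particular numbers):

    # Coarser, more widely shared distributions get larger concentration
    # parameters; the specific per-class topics get the smallest, so the
    # shared background absorbs the common words.
    alpha_background = 1.0    # corpus-wide background topic
    alpha_class_bg   = 0.1    # per-class background topic
    alpha_class      = 0.01   # per-class topic being learned
    assert alpha_background > alpha_class_bg > alpha_class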

Let me know if that helps.

answered Jul 01 '10 at 00:04


aria42
